I'm trying to scrape a set of web pages. When I scrape a single page directly, I can access the HTML. But when I iterate over a pandas DataFrame to scrape a set of pages, even a DataFrame with just one row, I get truncated HTML and can't extract the data I want.
Iterating through a dataframe of 1 row:

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re
first_names = pd.Series(['Robert'], index = [0])
last_names = pd.Series(['McCoy'], index = [0])
names = pd.DataFrame(columns = ['first_name', 'last_name'])
names['first_name'] = first_names
names['last_name'] = last_names
freq = []
for first_name, last_name in names.iterrows():
    url = "https://zbmath.org/authors/?q={}+{}".format(first_name,
                                                       last_name)
    r = requests.get(url)
    html = BeautifulSoup(r.text)
    html = str(html)
    frequency = re.findall('Joint\sPublications">(.*?)</a>', html)
    freq.append(frequency)
print(freq)
[[]]
Accessing the webpage directly (same code, but now not blocked):

url = "https://zbmath.org/authors/?q=robert+mccoy"
r = requests.get(url)
html = BeautifulSoup(r.text)
html = str(html)
frequency = re.findall('Joint\sPublications">(.*?)</a>', html)
freq.append(frequency)
print(freq)
[[], ['10', '8', '6', '5', '3', '3', '2', '2', '2', '2', '2', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']]
How can I loop through multiple web pages without being blocked?
iterrows() returns an (index, row) tuple, where row is a Series of the column values, so the fix is to unpack it slightly differently:
freq = []
for _, (first_name, last_name) in names.iterrows():
    url = "https://zbmath.org/authors/?q={}+{}".format(first_name,
                                                       last_name)
    r = requests.get(url)
    html = BeautifulSoup(r.text, "html.parser")
    html = str(html)
    frequency = re.findall(r'Joint\sPublications">(.*?)</a>', html)
    freq.append(frequency)
print(freq)
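To see why the original loop silently produced an empty result, here is a minimal sketch (with a hypothetical two-row frame, no scraping involved) showing what iterrows() actually yields: each iteration produces an (index, row) pair, where row is a Series keyed by column name, so the row itself must be unpacked into the two name columns.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the names DataFrame.
names = pd.DataFrame(
    {"first_name": ["Robert", "Ada"], "last_name": ["McCoy", "Lovelace"]}
)

pairs = []
# iterrows() yields (index, Series); unpacking the Series gives the columns.
for idx, (first_name, last_name) in names.iterrows():
    pairs.append((idx, first_name, last_name))

print(pairs)  # → [(0, 'Robert', 'McCoy'), (1, 'Ada', 'Lovelace')]
```

With the original `for first_name, last_name in names.iterrows():`, first_name would be bound to the index (0) and last_name to the whole row Series, so the URL was built from `0+first_name    Robert...` rather than the two names, and the regex matched nothing.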