Scraping multiple web pages by iterating over a pandas DataFrame

Question

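I am trying to scrape a set of web pages by iterating over a pandas DataFrame ("names") that contains the first and last names to insert into the page URLs.

I have set up empty lists ("collab", "freq") to fill with the data extracted from each page. When I scrape just one page, my code successfully extracts the data and populates these lists. However, when I iterate over multiple pages, I end up with empty lists.

I think the problem is in my for loop. Can anyone help me figure out what is going wrong?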

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import requests
import time

first_names = pd.Series(['Ernest', 'Archie', 'Donald', 'Charles', 
'Ralph'], index = [0, 1, 2, 3, 4])
last_names = pd.Series(['Anderson', 'Higdon', 'Rock', 'Thorne', 
'Tripp'], index = [0, 1, 2, 3, 4])
names = pd.DataFrame(columns = ['first_name', 'last_name'])
names['first_name'] = first_names
names['last_name'] = last_names

collab = []
freq = []

for first_name, last_name in names.iterrows():
    url = "https://zbmath.org/authors/?q={}+{}".format(first_name, 
    last_name)
    r = requests.get(url)
    html = BeautifulSoup(r.content)
    coauthors = html.find_all('div', class_='facet1st')
    coauthors=str(coauthors)
    frequency = re.findall('Joint\sPublications">(.*?)</a>', coauthors)
    freq.append(frequency)
    collaborators = re.findall('Author\sProfile">\n(.*?)</a>', 
    coauthors)
    collaborators = [x.strip(' ') for x in collaborators]
    collab.append(collaborators)
    time.sleep(1)

print(collab)

[[], [], [], [], []]

Answer 1

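Be careful with reassignment. re.findall() returns a list, and collaborators is read by the list comprehension whose result is then assigned back to collaborators. That works here because the comprehension builds a new list before the name is rebound, but modifying a data structure while it is being read can cause trouble. That is not the main problem, though.

collab is a list of empty lists because the page scrapes were unsuccessful.

The root problem is the way DataFrame.iterrows() is used. Its documentation shows that it yields the index first and then the row data, which is what a Python programmer would expect.

So check the values you actually get back when looping over names.iterrows().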
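To make that check concrete, here is a minimal sketch that reproduces the unpacking from the question without making any network requests. The two-row DataFrame is a shortened stand-in for the asker's data, and bad_url is a name introduced only for this illustration.

```python
import pandas as pd

# Shortened stand-in for the DataFrame in the question.
names = pd.DataFrame({
    "first_name": ["Ernest", "Archie"],
    "last_name": ["Anderson", "Higdon"],
})

# Take only the first item that iterrows() yields and unpack it
# the same way the question's for loop does.
first_name, last_name = next(names.iterrows())

print(repr(first_name))  # the row index, not a name
print(type(last_name))   # the whole row, as a pandas.Series

# The URL the original loop would have requested for this row.
bad_url = "https://zbmath.org/authors/?q={}+{}".format(first_name, last_name)
print(bad_url)
```

The printed URL contains the index and the text representation of the whole row rather than a name, which would explain why the search page returns nothing useful to parse.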

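The second item that DataFrame.iterrows() yields is the 'row', which is really a pandas.Series. So you need to access first_name and last_name from that Series.

Printing the unpacked values first_name and last_name gives: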

for first_name, last_name in names.iterrows():
    # first_name is actually the row index; last_name is the whole row
    print(first_name, type(last_name))

0 <class 'pandas.core.series.Series'>
1 <class 'pandas.core.series.Series'>
2 <class 'pandas.core.series.Series'>
3 <class 'pandas.core.series.Series'>
4 <class 'pandas.core.series.Series'>

last_name is the Series, and it is inside that Series that you will find both first_name and last_name.
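A corrected loop might look like the following sketch. It only builds the URLs so that it runs offline; the request and parsing steps from the question would go inside the loop unchanged. The shortened DataFrame and the urls list are assumptions made for illustration.

```python
import pandas as pd

# Shortened stand-in for the DataFrame in the question.
names = pd.DataFrame({
    "first_name": ["Ernest", "Archie"],
    "last_name": ["Anderson", "Higdon"],
})

urls = []
for index, row in names.iterrows():
    # row is a pandas.Series keyed by column name.
    url = "https://zbmath.org/authors/?q={}+{}".format(
        row["first_name"], row["last_name"]
    )
    urls.append(url)

print(urls)
```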

The documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html shows how to use iterrows(), and notes that its users may want to consider itertuples() instead.
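For comparison, here is a sketch of the same URL-building step using itertuples(). Passing index=False leaves the index out of each tuple, so the columns are available as attributes. As before, the shortened DataFrame and the urls list are illustrative assumptions, and no requests are made.

```python
import pandas as pd

# Shortened stand-in for the DataFrame in the question.
names = pd.DataFrame({
    "first_name": ["Ernest", "Archie"],
    "last_name": ["Anderson", "Higdon"],
})

urls = []
for row in names.itertuples(index=False):
    # Each row is a namedtuple with one field per column.
    url = "https://zbmath.org/authors/?q={}+{}".format(
        row.first_name, row.last_name
    )
    urls.append(url)

print(urls)
```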
