I am a beginner in Python programming. I am practicing web scraping using the bs4 module in Python.
I have extracted some fields from a web page, but it only extracts 13 items even though the page has more than 13. I cannot understand why the remaining items are not being extracted.
Another thing is that I want to extract the contact number and email address of every item on the page, but they are only available on each item's individual linked page. I am a beginner and, frankly, I am stuck on how to access and scrape the link of each item's individual page from the given page. Please tell me where I am going wrong and, if possible, suggest what to do.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
res = requests.post('https://www.nelsonalexander.com.au/real-estate-agents/?office=&agent=A')
soup = bs(res.content, 'lxml')
data = soup.find_all("div",{"class":"agent-card large large-3 medium-4 small-12 columns text-center end"})
records = []
for item in data:
    name = item.find('h2').text.strip()
    position = item.find('h3').text.strip()
    records.append({'Names': name, 'Position': position})

df = pd.DataFrame(records, columns=['Names', 'Position'])
df = df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\NelsonAlexander.xls', sheet_name='MyData2', index=False, header=True)
I have written the code above to extract just the name and position of each item, but it only scrapes 13 records, even though the page contains more than that. I have not been able to write any code for extracting the contact number and email address of each record, because those live on each item's individual page, and that is where I am stuck.
The Excel sheet looks like this:
The website loads the listings dynamically as you scroll, but you can track the ajax requests and parse the data directly:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0', 'Referer': 'https://www.nelsonalexander.com.au/real-estate-agents/?office=&agent=A'}
records = []
with requests.Session() as s:
    for i in range(1, 22):
        res = s.get(f'https://www.nelsonalexander.com.au/real-estate-agents/page/{i}/?ajax=1&agent=A', headers=headers)
        soup = bs(res.content, 'lxml')
        data = soup.find_all("div", {"class": "agent-card large large-3 medium-4 small-12 columns text-center end"})
        for item in data:
            name = item.find('h2').text.strip()
            position = item.find('h3').text.strip()
            phone = item.find("div", {"class": "small-6 columns text-left"}).find("a").get('href').replace("tel:", "")
            records.append({'Names': name, 'Position': position, 'Phone': phone})

df = pd.DataFrame(records, columns=['Names', 'Position', 'Phone'])
df = df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\NelsonAlexander.xls', sheet_name='MyData2', index=False, header=True)
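To get the contact details that only appear on each agent's individual page, one approach is to pull the profile link out of each listing card and request that page with the same session. The markup below is hypothetical (the real site's classes and structure may differ, and `profile-link` is an assumed class name), but the pattern is general: extract the `href` from the card, fetch it, then read `tel:`/`mailto:` links from the detail page.

```python
from bs4 import BeautifulSoup

# Hypothetical card and detail-page markup -- the real site's
# selectors will differ; this only demonstrates the pattern.
card_html = '''
<div class="agent-card">
  <h2>Jane Doe</h2>
  <a class="profile-link" href="https://example.com/agents/jane-doe">Profile</a>
</div>
'''
detail_html = '''
<div class="contact">
  <a href="tel:0400 000 000">Call</a>
  <a href="mailto:jane@example.com">Email</a>
</div>
'''

def profile_url(card):
    """Pull the per-agent page URL out of a listing card."""
    link = card.find('a', {'class': 'profile-link'})
    return link.get('href') if link else None

def parse_contact(detail_soup):
    """Read phone/email from tel: and mailto: links on a detail page."""
    phone = email = None
    for a in detail_soup.find_all('a', href=True):
        if a['href'].startswith('tel:'):
            phone = a['href'].replace('tel:', '')
        elif a['href'].startswith('mailto:'):
            email = a['href'].replace('mailto:', '')
    return phone, email

card = BeautifulSoup(card_html, 'html.parser').find('div', {'class': 'agent-card'})
url = profile_url(card)  # in the real script: res = s.get(url, headers=headers)
detail = BeautifulSoup(detail_html, 'html.parser')
phone, email = parse_contact(detail)
print(url, phone, email)
```

In the real loop you would call `s.get(url, headers=headers)` for each card's URL and feed `res.content` to BeautifulSoup instead of the static `detail_html` string. Note that if the email is rendered by JavaScript rather than present in the HTML, this approach will not find it.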
I am sure the emails are not present in the DOM. I made a few modifications to @drec4s's code so that it keeps going until there are no more entries (dynamically), instead of using a hard-coded page range.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import itertools
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0', 'Referer': 'https://www.nelsonalexander.com.au/real-estate-agents/?office=&agent=A'}
records = []
with requests.Session() as s:
    for i in itertools.count(1):  # pages start at 1, not 0
        res = s.get('https://www.nelsonalexander.com.au/real-estate-agents/page/{}/?ajax=1&agent=A'.format(i), headers=headers)
        soup = bs(res.content, 'lxml')
        data = soup.find_all("div", {"class": "agent-card large large-3 medium-4 small-12 columns text-center end"})
        if len(data) > 0:
            for item in data:
                name = item.find('h2').text.strip()
                position = item.find('h3').text.strip()
                phone = item.find("div", {"class": "small-6 columns text-left"}).find("a").get('href').replace("tel:", "")
                records.append({'Names': name, 'Position': position, 'Phone': phone})
                print({'Names': name, 'Position': position, 'Phone': phone})
        else:
            break
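The stop condition above — keep incrementing the page number until a page yields zero cards — can be demonstrated without touching the network. `fetch_page` below is a stand-in for the `s.get(...)` + `find_all` step:

```python
import itertools

# Stand-in for the ajax endpoint: pages 1-3 return agents, page 4 is empty.
FAKE_PAGES = {1: ['A', 'B'], 2: ['C'], 3: ['D', 'E']}

def fetch_page(i):
    """Hypothetical fetcher; the real script does s.get(...) and find_all."""
    return FAKE_PAGES.get(i, [])

records = []
for i in itertools.count(1):  # open-ended page counter, starting at page 1
    data = fetch_page(i)
    if len(data) > 0:
        records.extend(data)
    else:
        break  # the first empty page ends the crawl

print(records)  # → ['A', 'B', 'C', 'D', 'E']
```

This is why starting the counter at 1 matters: with `itertools.count()` the first request goes to page 0, and if that page returns no cards the loop breaks before collecting anything.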