Web scraper using bs4 does not give the correct result

Question (votes: 0, answers: 1)

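I am trying to scrape the live billionaire net-worth table here: https://www.bloomberg.com/billionaires/

Here is my code so far. All I get in the Python shell is [].

Something must be wrong with the "findAll" call; I don't think I'm targeting the right tag.

I also tried using just "find".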

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv

#Open page and grab html
my_url = ('https://www.bloomberg.com/billionaires/')
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

#HTML Parser.
page_soup = soup(page_html, 'html.parser')

table = []

#Find table.
ele_table = page_soup.findAll('div',{'class':'dvz-content'})

print(ele_table)

I expected the table to be printed so that I could then save it as a CSV file.

python-3.7
1 answer
0 votes

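The data is loaded dynamically, so it is not in the div you are selecting. If you send the right headers, you can extract it from a script tag in the response instead: a regex pulls out the required text, the json library parses it, and pandas writes the result to CSV.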

import json
import re

import pandas as pd
import requests

headers = {
    'user-agent': 'Mozilla/5.0',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'if-none-match': 'W/^\\^5dbb59e6-91b10^\\^',
    'if-modified-since': 'Thu, 31 Oct 2019 22:02:14 GMT'  # possibly a caching safeguard; consider setting this dynamically
}

p = re.compile(r'window\.top500 = (.*);')  # escape the dot so it matches a literal "."
r = requests.get('https://www.bloomberg.com/billionaires/', headers = headers)
data = json.loads(p.findall(r.text)[0])
df = pd.DataFrame(data)
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig',index = False)
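To check the extraction step without hitting the site, here is a minimal offline sketch of the same idea using only the standard library. The sample HTML, the field names (rank, name, worth) and the helper names extract_rows and rows_to_csv are all made up for illustration. Only the window.top500 variable name comes from the code above, and the real page may be laid out differently.

```python
import csv
import io
import json
import re

# Invented stand-in for the page source; the real page's layout may differ.
SAMPLE_HTML = """
<html><body>
<div class="dvz-content"></div>
<script>
window.top500 = [{"rank": 1, "name": "Person A", "worth": 100.5}, {"rank": 2, "name": "Person B", "worth": 90.0}];
</script>
</body></html>
"""

# Escape the dot so it matches a literal "." and use a non-greedy group so
# the match stops at the first "];" instead of running to a later ";".
PATTERN = re.compile(r"window\.top500\s*=\s*(\[.*?\]);", re.DOTALL)


def extract_rows(html):
    """Return the list assigned to window.top500, or [] if it is absent."""
    match = PATTERN.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))


def rows_to_csv(rows):
    """Serialise a list of flat dicts to CSV text using the first row's keys."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Note that the div with class dvz-content is empty in the sample, which is why selecting it yields nothing useful, while the script tag holds the data. The non-greedy pattern assumes the JSON contains no "];" inside a string value; if it might, a stricter approach would be needed.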

