I'm trying to scrape SEC report pages to get some basic information for a few tickers.
Here's an example URL for Apple: https://sec.report/CIK/0000320193
The page has a "Company Details" table that contains the basics. Essentially I only want to scrape the IRS number, the state of incorporation, and the address.
I'd also be fine with just scraping that whole table and saving it to a pandas DataFrame. I'm very new to web scraping, so I'm looking for any tips to make this work! My code is below, but after extracting the panel body I don't know where to go from there. Thanks everyone!
import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
page = session.get('https://sec.report/CIK/0000051143.html', headers=headers)

soup = BeautifulSoup(page.content, 'html.parser')
panels = soup.find_all(class_='panel-body')
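For the "where to go from here" part: the detail rows on that page are `<tr>` elements with a label cell and a value cell, so once you have the panel you can walk its rows into a dict and then a DataFrame. Here is a minimal sketch against a stand-in HTML snippet (the real page's markup and values may differ, so treat the class name and the sample rows as assumptions):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for the "Company Details" panel; in your code you would use
# page.content from requests instead of this hard-coded snippet.
html = """
<div class="panel-body">
  <table>
    <tr><td>IRS Number</td><td>942404110</td></tr>
    <tr><td>State of Incorporation</td><td>CA</td></tr>
    <tr><td>Business Address</td><td>ONE APPLE PARK WAY</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
panel = soup.find(class_='panel-body')

# Each row is a (label, value) pair of <td> cells.
details = {}
for row in panel.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if len(cells) == 2:
        details[cells[0]] = cells[1]

# One row per company; each label becomes a column.
df = pd.DataFrame([details])
```

From there you can filter `df` down to just the three columns you care about.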
Instead of BeautifulSoup, try the lxml package; for me it's easier to find elements with XPath expressions:
import requests
from lxml import html

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
page = requests.get('https://sec.report/CIK/0000051143', headers=headers)
raw_html = html.fromstring(page.text)

# Each detail row is a <tr> whose first <td> is the label and second <td> the value.
# Indexing with [0] assumes the row exists on the page; see the note below.
irs = raw_html.xpath('//tr[./td[contains(text(),"IRS Number")]]/td[2]/text()')[0]
state_incorp = raw_html.xpath('//tr[./td[contains(text(),"State of Incorporation")]]/td[2]/text()')[0]
address = raw_html.xpath('//tr[./td[contains(text(),"Business Address")]]/td[2]/text()')[0]
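One caveat: `xpath('...text()')` always returns a list, and that list is empty when a filing simply doesn't have the row, so indexing with `[0]` will raise an IndexError for those companies. A small hedge is a hypothetical helper (the name `first_or_none` is my own, not part of lxml) that indexes defensively:

```python
def first_or_none(values):
    """Return the first XPath text() result, or None if the row was missing."""
    return values[0] if values else None

# Simulated XPath results: a found row and a missing one.
assert first_or_none(['CA']) == 'CA'
assert first_or_none([]) is None
```

Wrapping each `raw_html.xpath(...)` call in this keeps the scraper from crashing on sparse filings.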