Questions tagged beautifulsoup

Beautiful Soup is a Python package for parsing HTML and XML. The latest version of the package is version 4, imported as bs4.

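Page navigation using Selenium

Out of personal interest, I want to scrape car reviews from the following web page: www.cardekho.com/user-reviews/maruti-alto-800. I successfully scraped the reviews on the first page with the code below...

selenium-webdriver web-scraping beautifulsoup

Answers: 1, Votes: 0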

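[Web scraping] Page navigation using Selenium

Out of personal interest, I want to scrape car reviews from the following web page: www.cardekho.com/user-reviews/maruti-alto-800. I successfully scraped the reviews on the first page using the following code...

selenium-webdriver web-scraping beautifulsoup

Answers: 1, Votes: 0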

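Trying to scrape S&P 500 data from Yahoo Finance, but unable to retrieve it despite correct formatting

I have been trying to scrape data from Yahoo Finance, specifically the historical data for the S&P 500 index, whose page URL is "https://finance.yahoo.com/quote/%5EGSPC/history/?period1=157407...

python web-scraping beautifulsoup

Answers: 1, Votes: 0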

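Find all "a" tags inside multiple divs with the same class using BeautifulSoup

I want to find all "a" elements inside multiple divs that share the same class. from bs4 import BeautifulSoup links = soup.find_all("div", class_="va-columns").find_all("a"...

python python-3.x beautifulsoup

Answers: 1, Votes: 0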
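
The chained call in the excerpt above fails because find_all() returns a ResultSet (a list of tags), which has no find_all() of its own. A minimal sketch of two common ways around that follows; the sample HTML is made up for illustration, and only the class name va-columns comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the "va-columns" class is from the question.
html = """
<div class="va-columns"><a href="/a">A</a><a href="/b">B</a></div>
<div class="other"><a href="/x">X</a></div>
<div class="va-columns"><a href="/c">C</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: loop over every matching div and collect its links.
links = [
    a
    for div in soup.find_all("div", class_="va-columns")
    for a in div.find_all("a")
]
hrefs = [a["href"] for a in links]

# Option 2: a CSS selector that matches <a> descendants of those divs.
css_hrefs = [a["href"] for a in soup.select("div.va-columns a")]

print(hrefs)
print(css_hrefs)
```

Both variants skip the link inside the div with a different class. The selector form is shorter; the loop form is easier to extend when each div needs separate handling.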

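Extract titles from a text box using BeautifulSoup

I am trying to write code with Beautiful Soup that prints the link text in the grey box on the left-hand side of this web page. In this case, the code should return ** Enchantment Bouldering A...

python html beautifulsoup

Answers: 1, Votes: 0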

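Screen scraping Yahoo Finance with beautifulsoup works for all stocks except one

I have been trying to solve this for days and have run out of ideas. I am using Python3 and Beautifulsoup to fetch stock prices from Yahoo Finance. It works for about a hundred different cases...

python beautifulsoup

Answers: 1, Votes: 0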

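Extract text from meta content

Let's assume we have the following website: house prices in Tbilisi. I have implemented my code snippet, with its corresponding result: div_class =content.find_all("...

python web-scraping beautifulsoup

Answers: 1, Votes: 0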

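How do I remove a single class from an element with multiple classes using BeautifulSoup?

I want to remove a single class name from an element that has multiple class names, like this:

    <li class="name1 name2 name3">
      <a href="http://www.somelink.com">link</a>
    </li>

I can remove classes with BeautifulSoup like this:

    soup.find(class_="name3")["class"] = ""

But this removes all the classes, not just the one I want to lose.

From your HTML, you can see:

    print(soup.find(class_="name3").attrs)
    {'class': ['name1', 'name2', 'name3']}

So soup.find(class_="name3")['class'] simply returns a list. You can remove an element from it just as you would from any other list, for example:

    soup.find(class_="name3")["class"].remove('name3')

This removes the class you want to lose.

You can use a generator expression to rebuild the class names you want to keep:

    >>> s = 'name1 name2 name3'
    >>> s = ' '.join(i for i in s.split() if i != 'name3')
    >>> s
    'name1 name2'

python beautifulsoup

Answers: 2, Votes: 0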
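
To make the list-based approach from the answers above concrete, here is a small self-contained sketch. The helper name remove_class is hypothetical (it is not part of Beautiful Soup); it relies only on the fact that Beautiful Soup exposes the class attribute as a Python list.

```python
from bs4 import BeautifulSoup

html = '<li class="name1 name2 name3"><a href="http://www.somelink.com">link</a></li>'
soup = BeautifulSoup(html, "html.parser")


def remove_class(tag, name):
    """Hypothetical helper: drop one class name and keep the others."""
    remaining = [c for c in tag.get("class", []) if c != name]
    if remaining:
        tag["class"] = remaining
    elif "class" in tag.attrs:
        # No classes left: remove the attribute instead of leaving class="".
        del tag.attrs["class"]


li = soup.find("li")
remove_class(li, "name3")
print(li["class"])
print(li)
```

Rebuilding the list rather than calling .remove() directly avoids a ValueError when the class is not present, which may matter when the same helper is applied to many elements.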

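Extract span values

I feel I am very close, but after several hours without progress I am asking here. I want to scrape the span values and assign them to variables or a list for further processing. import requests from bs4 import

python pandas beautifulsoup

Answers: 1, Votes: 0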

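I can't use try-except in Python

I have a page I want to scrape, but it is not always available for scraping. I want the code to run 24/7, so I did this: import requests import response from bs4 import BeautifulSoup I...

python debugging web-scraping beautifulsoup try-except

Answers: 1, Votes: 0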

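Scraping elements from a website with beautiful soup + python is difficult: why?

Website: https://www.wingsforlife.com/uk/ I am struggling to scrape the article titles and links from the website above. An example of a title is "A novel funding model to drive a cure"...

python web-scraping beautifulsoup

Answers: 1, Votes: 0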

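Web scraping from hidden tabs on a website using python

I am using bs4 and selenium to scrape a land auction website (URL https://bid.hertz.ag/ui/auctions/112571/14320874), but I am unable to scrape the auction date and the attachment links in '

python selenium-webdriver web-scraping beautifulsoup dynamic

Answers: 1, Votes: 0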

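Python requests error when handling Yahoo Finance URLs that contain a dot

I want to show you this random issue with a parsing script I wrote for Yahoo Finance stock pages. import requests from bs4 import BeautifulSoup headers ={'User-Agent':'Mozilla/5.0 (Windows N...

python beautifulsoup python-requests yahoo-finance

Answers: 1, Votes: 0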

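How to fix ValueError: cannot set a row with mismatched columns | Beautiful Soup

I am getting the error "ValueError: cannot set a row with mismatched columns" when scraping from Wikipedia. See below. How do I fix this? from bs4 import BeautifulSoup import pandas as pd import

python pandas dataframe web-scraping beautifulsoup

Answers: 2, Votes: 0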

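ValueError: cannot set a row with mismatched columns -- beautifulSoup

I get "ValueError: cannot set a row with mismatched columns" when scraping from Wikipedia. See below. How do I fix this? from bs4 import BeautifulSoup import pandas as pd import requests...

python beautifulsoup jupyter-notebook

Answers: 1, Votes: 0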

Why does the BeautifulSoup find_all() method stop after an HTML comment tag?

I am using BeautifulSoup to parse this website: https://www.baseball-reference.com/postseason/1905_WS.shtml

Inside the page there is the following element:

    <div id="all_post_pitching_NYG" class="table_wrapper">

As a wrapper, this element should contain the following elements:

    (1) <div class="section_heading assoc_post_pitching_NYG has_controls" id="post_pitching_NYG_sh">
    (2) <div class="placeholder"></div>
    (3) a very long HTML comment
    (4) <div class="topscroll_div assoc_post_pitching_NYG">
    (5) <div class="table_container is_setup" id="div_post_pitching_NYG">
    (6) <div class="footer no_hide_long" id="tfooter_post_pitching_NYG">

I have been using:

    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    pitching = soup.find_all("div", id=lambda x: x and x.startswith("all_post_pitching_"))[0]
    for div in pitching:
        print(div)

But it only prints up to the very long green HTML comment and never prints (4) or anything after it. What am I doing wrong?

Check the special strings: Tag, NavigableString, and BeautifulSoup cover almost everything you will see in an HTML or XML file, but there are a few leftover bits. The main one you will probably encounter is the comment.

A simple solution might be to replace the comment markers in the HTML string so that the commented-out content becomes visible to BeautifulSoup:

    import requests
    from bs4 import BeautifulSoup

    # Strip the comment delimiters so the commented-out markup is parsed as normal tags
    soup = BeautifulSoup(
        requests.get('https://www.baseball-reference.com/postseason/1905_WS.shtml').text.replace('<!--', '').replace('-->', ''),
        'html.parser'
    )
    pitching = soup.select('div[id^="all_post_pitching_"]')[0]
    for e, div in enumerate(pitching.select('div'), 1):
        print(e, div)

A more specific alternative is to use bs4.Comment

python web-scraping beautifulsoup python-requests

Answers: 1, Votes: 0
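
The bs4.Comment alternative mentioned in the answer above is cut off in this listing, so what follows is only a sketch of how such an approach might look, not the answerer's code. It assumes that the missing divs sit inside the HTML comment in the raw page source, as the answer implies; the sample markup and the player name are made up.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, shortened stand-in for the real page source.
html = """
<div id="all_post_pitching_NYG" class="table_wrapper">
  <div class="placeholder"></div>
  <!--
  <div class="table_container" id="div_post_pitching_NYG">
    <table><tr><td>Mathewson</td></tr></table>
  </div>
  -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
wrapper = soup.find("div", id=lambda x: x and x.startswith("all_post_pitching_"))

# Comment is a NavigableString subclass, so comments can be matched by type.
comments = wrapper.find_all(string=lambda s: isinstance(s, Comment))

# Parse each comment body as HTML of its own and collect the divs inside it.
hidden_divs = []
for comment in comments:
    inner = BeautifulSoup(comment, "html.parser")
    hidden_divs.extend(inner.find_all("div"))

hidden_ids = [div.get("id") for div in hidden_divs]
print(hidden_ids)
```

Compared with stripping every comment marker from the whole page, this touches only the comments inside the wrapper of interest, at the cost of a second parse per comment.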

BeautifulSoup find_all() method stops after an HTML comment tag

I am using BeautifulSoup to parse this website: https://www.baseball-reference.com/postseason/1905_WS.shtml

Inside the page there is a line <div id="all_post_pitching_NYG" class="table_wrapper">. Inside it there are:

    (1) <div class="section_heading assoc_post_pitching_NYG has_controls" id="post_pitching_NYG_sh">
    (2) <div class="placeholder"></div>
    (3) a very long HTML comment
    (4) <div class="topscroll_div assoc_post_pitching_NYG">
    (5) <div class="table_container is_setup" id="div_post_pitching_NYG">
    (6) <div class="footer no_hide_long" id="tfooter_post_pitching_NYG">

I have been using:

    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    pitching = soup.find_all("div", id=lambda x: x and x.startswith("all_post_pitching_"))[0]
    for div in pitching:
        print(div)

But it only prints up to the very long green HTML comment and never prints (4) or anything after it. What am I doing wrong? Thanks in advance!

Check the special strings: Tag, NavigableString, and BeautifulSoup cover almost everything you will see in an HTML or XML file, but there are a few leftover bits. The main one you will probably encounter is the comment.

A simple solution might be to replace the comment markers in the HTML string so that the commented-out content becomes visible to BeautifulSoup:

    # Strip the comment delimiters so the commented-out markup is parsed as normal tags
    soup = BeautifulSoup(
        requests.get('https://www.baseball-reference.com/postseason/1905_WS.shtml').text.replace('<!--', '').replace('-->', ''),
        'html.parser'
    )

A more specific alternative is to use bs4.Comment

python beautifulsoup

Answers: 1, Votes: 0

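How do I select a specific div or paragraph tag from HTML content using beautiful soup?

I am using beautiful soup to extract some text content from HTML data. I have a div and several paragraph tags, and the last paragraph is copyright information, with a copyright logo, the year and...

python beautifulsoup

Answers: 1, Votes: 0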

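How do I convert the HTML of a table with nested tables to docx?

I want to convert any HTML that has a table with nested tables in its cells. When I try to do this, extra rows appear after the row that contains the nested table. The number of rows...

html beautifulsoup docx python-docx

Answers: 1, Votes: 0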

Parsing SEC EDGAR XML form data with child nodes using BeautifulSoup

I am trying to scrape individual fund holdings from the SEC's N-PORT-P/A form using beautiful soup and xml. A typical submission, [linked here][1], looks like this:

    <edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <headerData>
        <submissionType>NPORT-P/A</submissionType>
        <isConfidential>false</isConfidential>
        <accessionNumber>0001145549-23-004025</accessionNumber>
        <filerInfo>
          <filer>
            <issuerCredentials>
              <cik>0001618627</cik>
              <ccc>XXXXXXXX</ccc>
            </issuerCredentials>
          </filer>
          <seriesClassInfo>
            <seriesId>S000048029</seriesId>
            <classId>C000151492</classId>
          </seriesClassInfo>
        </filerInfo>
      </headerData>
      <formData>
        <genInfo> ... </genInfo>
        <fundInfo> ... </fundInfo>
        <invstOrSecs>
          <invstOrSec>
            <name>ARROW BIDCO LLC</name>
            <lei>549300YHZN08M0H3O128</lei>
            <title>Arrow Bidco LLC</title>
            <cusip>042728AA3</cusip>
            <identifiers>
              <isin value="US042728AA35"/>
            </identifiers>
            <balance>115000.000000000000</balance>
            <units>PA</units>
            <curCd>USD</curCd>
            <valUSD>114754.170000000000</valUSD>
            <pctVal>0.3967552449</pctVal>
            <payoffProfile>Long</payoffProfile>
            <assetCat>DBT</assetCat>
            <issuerCat>CORP</issuerCat>
            <invCountry>US</invCountry>
            <isRestrictedSec>N</isRestrictedSec>
            <fairValLevel>2</fairValLevel>
            <debtSec>
              <maturityDt>2024-03-15</maturityDt>
              <couponKind>Fixed</couponKind>
              <annualizedRt>9.500000000000</annualizedRt>
              <isDefault>N</isDefault>
              <areIntrstPmntsInArrs>N</areIntrstPmntsInArrs>
              <isPaidKind>N</isPaidKind>
            </debtSec>
            <securityLending>
              <isCashCollateral>N</isCashCollateral>
              <isNonCashCollateral>N</isNonCashCollateral>
              <isLoanByFund>N</isLoanByFund>
            </securityLending>
          </invstOrSec>

Arrow Bidco LLC is a bond in the portfolio, and some of its characteristics are included in the filing (CUSIP, CIK, balance, maturity date, and so on). I am looking for the best way to iterate over each individual security (invstOrSec) and collect each security's characteristics in a dataframe. The code I am currently using is:

    import numpy as np
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", "X-Requested-With": "XMLHttpRequest"}
    n_port_file = requests.get("https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml", headers=header, verify=False)
    n_port_file_xml = n_port_file.content
    soup = BeautifulSoup(n_port_file_xml, 'xml')
    names = soup.find_all('name')
    lei = soup.find_all('lei')
    title = soup.find_all('title')
    cusip = soup.find_all('cusip')
    ....
    maturityDt = soup.find_all('maturityDt')
    couponKind = soup.find_all('couponKind')
    annualizedRt = soup.find_all('annualizedRt')

I then iterate over each list and build a dataframe from the values in each row.

    fixed_income_data = []
    for i in range(0, len(names)):
        rows = [names[i].get_text(), lei[i].get_text(),
                title[i].get_text(), cusip[i].get_text(),
                balance[i].get_text(), units[i].get_text(),
                pctVal[i].get_text(), payoffProfile[i].get_text(),
                assetCat[i].get_text(), issuerCat[i].get_text(),
                invCountry[i].get_text(), maturityDt[i].get_text(),
                couponKind[i].get_text(), annualizedRt[i].get_text()
                ]
        fixed_income_data.append(rows)

    fixed_income_df = pd.DataFrame(fixed_income_data, columns=['name',
                                   'lei',
                                   'title',
                                   'cusip',
                                   'balance',
                                   'units',
                                   'pctVal',
                                   'payoffProfile',
                                   'assetCat',
                                   'issuerCat',
                                   'invCountry',
                                   'maturityDt',
                                   'couponKind',
                                   'annualizedRt'
                                   ], dtype=float)

This approach works well when all the information is present, but there is often a variable that is not accounted for. Part of the form may be blank, or the issuer category may not be filled in correctly, which leads to an IndexError. The portfolio contains 127 securities that I am able to parse, but a single security may be missing its annualized rate, and then I lose the ability to build the dataframe cleanly.

In addition, for portfolios that hold both fixed income and equity securities, the equity securities return no information for the debtSec children. Is there a way to iterate over this data while cleaning it up in the simplest way possible? Even adding "NaN" for the debtSec children that equity securities do not reference would be a valid response. Any help would be much appreciated!

[1]: https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml

In my opinion, this is the best way to handle the problem. In general, EDGAR filings are very hard to parse, so the following may or may not work on other filings, even ones from the same filer.

To make things easier for yourself, since this is an XML file, you should use an XML parser and XPath. Given that you want to create a dataframe, the most appropriate tool is the pandas read_xml() method.

Because the XML is nested, you need to create two different dataframes and concatenate them (maybe someone else has a better idea of how to handle it). Finally, although read_xml() can read directly from a URL, in this case EDGAR requires a user agent, which means you also need the requests library.

So, all together:

    # import required libraries
    from io import StringIO
    import pandas as pd
    import requests

    url = 'https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml'
    # set headers with a user-agent
    headers = {"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
    req = requests.get(url, headers=headers)

    # define the columns you want to drop (based on the data in your question)
    to_drop = ['identifiers', 'curCd', 'valUSD', 'isRestrictedSec', 'fairValLevel', 'debtSec', 'securityLending']

    # the filing uses namespaces (too complicated to get into here), so you need to define that as well
    namespaces = {"nport": "http://www.sec.gov/edgar/nport"}

    # create the first df, for the securities which are debt instruments
    # (StringIO is used because recent pandas versions expect a file-like object rather than a literal XML string)
    invest = pd.read_xml(StringIO(req.text), xpath="//nport:invstOrSec[.//nport:debtSec]", namespaces=namespaces).drop(to_drop, axis=1)

    # create the 2nd df, for the debt details:
    debt = pd.read_xml(StringIO(req.text), xpath="//nport:debtSec", namespaces=namespaces).iloc[:, 0:3]

    # finally, concatenate the two into one df:
    pd.concat([invest, debt], axis=1)

This should output your 126 debt securities (pardon the formatting):

                        name                   lei                  title      cusip    balance units    pctVal payoffProfile assetCat issuerCat invCountry  maturityDt couponKind  annualizedRt
    0        ARROW BIDCO LLC  549300YHZN08M0H3O128        Arrow Bidco LLC  042728AA3  115000.00    PA  0.396755          Long      DBT      CORP         US  2024-03-15      Fixed       9.50000
    1  CD&R SMOKEY BUYER INC                   NaN  CD&R Smokey Buyer Inc  12510CAA9  165000.00    PA  0.505585          Long      DBT      CORP         US  2025-07-15      Fixed       6.75000

You can then work with the final df, add or drop columns, and so on.

You can do this with the MIT-licensed datamule package, which handles both downloading and parsing. Disclaimer: I am the developer.

    from datamule import Filing, Downloader
    from pathlib import Path
    import os

    downloader = Downloader()
    downloader.download(form='NPORT-P', output_dir='NPORT', date=('2001-01-01', '2024-11-01'))
    os.makedirs('NPORT_json', exist_ok=True)
    for file in Path('NPORT').iterdir():
        filing = Filing(str(file), 'NPORT-P')
        filing.parse_filing()
        filing.write_json(f'NPORT_json/{file.name}.json')

You can also access the holdings data directly, since Filing() is an iterable.

    pd.DataFrame(filing)

python xml beautifulsoup portfolio edgar

Answers: 2, Votes: 0
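
For the EDGAR question above, the IndexError comes from collecting each field into its own flat list, so one missing element shifts every later row. A minimal sketch of the per-security alternative follows. Note the substitutions: it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup or pandas.read_xml(), the XML is a shortened sample, and the second security (HYPOTHETICAL EQUITY CO) is invented to show a holding without a debtSec child.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the second security is hypothetical and has no <debtSec>.
SAMPLE = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <formData><invstOrSecs>
    <invstOrSec>
      <name>ARROW BIDCO LLC</name>
      <cusip>042728AA3</cusip>
      <debtSec>
        <maturityDt>2024-03-15</maturityDt>
        <annualizedRt>9.500000000000</annualizedRt>
      </debtSec>
    </invstOrSec>
    <invstOrSec>
      <name>HYPOTHETICAL EQUITY CO</name>
      <cusip>000000000</cusip>
    </invstOrSec>
  </invstOrSecs></formData>
</edgarSubmission>"""

NS = {"nport": "http://www.sec.gov/edgar/nport"}

# Output column -> path relative to each <invstOrSec> element.
FIELDS = {
    "name": "nport:name",
    "cusip": "nport:cusip",
    "maturityDt": "nport:debtSec/nport:maturityDt",
    "annualizedRt": "nport:debtSec/nport:annualizedRt",
}

root = ET.fromstring(SAMPLE)
rows = []
for sec in root.iterfind(".//nport:invstOrSec", NS):
    # findtext() returns the default (None) when the element is absent,
    # so every row has the same keys even if a field is missing.
    rows.append(
        {col: sec.findtext(path, default=None, namespaces=NS)
         for col, path in FIELDS.items()}
    )

for row in rows:
    print(row)
```

Because each row is built from a single invstOrSec element, a missing value stays attached to the right security. Passing rows to pandas.DataFrame would give one row per security, with the None entries treated as missing values.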
