Questions tagged beautifulsoup

Beautiful Soup is a Python package for parsing HTML/XML. The latest version of the package is version 4, imported as bs4.
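
As a minimal sketch of what that import looks like in practice, assuming the package is installed and using a made-up HTML snippet (the snippet and the variable names are illustrative only):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = "<html><body><h1>Hello</h1><p class='intro'>First paragraph</p></body></html>"

# "html.parser" is the parser bundled with the Python standard library.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag; get_text() returns its text content.
heading = soup.find("h1").get_text()
intro = soup.find("p", class_="intro").get_text()
print(heading, "|", intro)
```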

I have the following urllib and BeautifulSoup code: getSite = urllib.urlopen(pageName) # open current site getSitesoup = BeautifulSoup(getSite.read()) # read the site contents print

python beautifulsoup python-unicode

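Answers: 4, Votes: 0

Program parses the wrong table

I'm a bit confused about why my code returns "Marijuana Stocks", which sits under the table with class=cwl-performance. I'm trying to collect the stock names from the table with class=cwl-symbols......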

python beautifulsoup

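Answers: 3, Votes: 0

Why does my BeautifulSoup script fail to correctly parse the .htm data from the House of Commons Register of Financial Interests pages?

import requests from bs4 import BeautifulSoup # Base URL for MPs base_url = "https://publications.parliament.uk/pa/cm/cmregmem/240930/" # Contents page URL content_url = f"{ba...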

python web-scraping beautifulsoup

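Answers: 1, Votes: 0

How to efficiently extract data from script tags using BeautifulSoup in Python

I am using Python to extract data from tags at a URL like the following: response = session.get("example.com") soup = BeautifulSoup(response.content,features='html.parser') all_scripts = soup.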

python beautifulsoup python-re

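Answers: 1, Votes: 0

Scrape all of these links

I want to scrape this link. After clicking it, you will see a lot of basketball games. I want to scrape them all, but I don't know how to do it automatically. For example: Scrape 1, Scr...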

python python-3.x beautifulsoup

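Answers: 1, Votes: 0

How to get images from javascript in Beautifulsoup?

At my school we have interactive whiteboards, and we can export them to a website via a provided link. The only problem is that the links expire (which is silly), so I wanted to make a simple python ...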

python selenium-webdriver web-scraping beautifulsoup

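Answers: 1, Votes: 0

BeautifulSoup find() in Python handles tuples in an unexpected way

I am practicing web crawling, and yesterday I got an unexpectedly correct result that I thought should not have worked. I used soup.find('id'=i) to look up the attribute key i; I thought i had to be a string, ...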

python beautifulsoup

Answers: 1, Votes: 0

Parsing data from a local html file with bs4?

I am trying to parse a local html document with the following code - import os, sys from bs4 import BeautifulSoup path = os.path.abspath(os.path.dirname(sys.argv[0])) fnHTML = os.path.join(path, "...

python beautifulsoup

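Answers: 1, Votes: 0

Parsing Grobid .tei.xml output with Beautiful Soup

I am trying to use Beautiful Soup to extract elements from a .tei.xml file generated with Grobid. I can get the title using: title = soup.findAll('title') What is the correct syntax ...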

python beautifulsoup grobid

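Answers: 3, Votes: 0

How to scrape database platforms like Beauhurst and Pitchbook?

Using Python 3.12; pycharm. Some background: my Excel sheet has just under 800 so-called UK private equity firms, many of which are inactive, dissolved, etc. I have to grab some key...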

python web-scraping beautifulsoup python-requests

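Answers: 1, Votes: 0

Pandas and bs4 html scraping

I am extracting data from an html file. It is in table format, so I wrote this line of code to convert all the tables into dataframes with pandas. dfs = pd.read_html("synced_contacts.html"...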

python pandas dataframe web-scraping beautifulsoup

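Answers: 2, Votes: 0

Converting JSON encoded in HTML to JSON using BeautifulSoup

I know similar questions have been asked here, but I am still struggling to find a solution. I can parse the raw HTML from the bandintown website with Beautiful Soup, but my end goal is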

html json parsing beautifulsoup

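Answers: 1, Votes: 0

How to use if/else in a list comprehension together with Beautiful Soup logic for traversing an html page

I am trying to follow a youtube exercise that scrapes a specific html block from the wiki page https://en.wikipedia.org/wiki/Toy_Story_3. I am interested in the key-value pair data in ...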

python html pandas beautifulsoup list-comprehension

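Answers: 1, Votes: 0

BeautifulSoup get href of a list - need to simplify script - replace multiprocessing

I have the following soup: next ... I want to extract the href "some_url" from it. This I...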

python web-scraping beautifulsoup html-parsing

Answers: 2, Votes: 0

Stuck on scraping with BeautifulSoup while learning. Need some pointers

I started learning screen scraping with BeautifulSoup. To begin with, I took a Wikipedia article with the following format:

<table class="wikitable sortable jquery-tablesorter"> <caption></caption> <thead> <tr> <th colspan="2" style="width: 6%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Opening</th> <th style="width: 20%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Title</th> <th style="width: 10%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Director</th> <th style="width: 45%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Cast</th> <th style="width: 30%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Production company</th> <th class="unsortable" style="width: 1%;"><abbr title="Reference(s)">Ref.</abbr></th> </tr> </thead> <tbody> <tr> <td rowspan="3" style="text-align: center; background: #77bc83;"> <b> O<br /> C<br /> T </b> </td> <td rowspan="1" style="text-align: center; background: #77bc83;"><b>11</b></td> <td style="text-align: center;"> <i><a href="/wiki/Viswam_(film)" title="Viswam (film)">Viswam</a></i> </td> <td>Sreenu Vaitla</td> <td> <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" /> <div class="hlist"> <ul> <li><a href="/wiki/Gopichand_(actor)" title="Gopichand (actor)">Gopichand</a></li> <li><a href="/wiki/Kavya_Thapar" title="Kavya Thapar">Kavya Thapar</a></li> <li><a href="/wiki/Vennela_Kishore" title="Vennela Kishore">Vennela Kishore</a></li> <li><a href="/wiki/Sunil" title="Sunil">Sunil</a></li> <li><a href="/wiki/Naresh" title="Naresh">Naresh</a></li> </ul> </div> </td> <td> Chitralayam Studios<br /> People Media Factory </td> <td style="text-align: center;"> <sup id="cite_ref-180" class="reference"> <a href="#cite_note-180"><span class="cite-bracket">[</span>178<span class="cite-bracket">]</span></a> </sup> </td> </tr> <tr> <td rowspan="2" 
style="text-align: center; background: #77bc83;"><b>31</b></td> <td style="text-align: center;"> <i><a href="/wiki/Lucky_Baskhar" title="Lucky Baskhar">Lucky Baskhar</a></i> </td> <td><a href="/wiki/Venky_Atluri" title="Venky Atluri">Venky Atluri</a></td> <td> <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" /> <div class="hlist"> <ul> <li><a href="/wiki/Dulquer_Salmaan" title="Dulquer Salmaan">Dulquer Salmaan</a></li> <li><a href="/wiki/Meenakshi_Chaudhary" title="Meenakshi Chaudhary">Meenakshi Chaudhary</a></li> </ul> </div> </td> <td><a href="/wiki/S._Radha_Krishna" title="S. Radha Krishna">Sithara Entertainments</a></td> <td style="text-align: center;"> <sup id="cite_ref-181" class="reference"> <a href="#cite_note-181"><span class="cite-bracket">[</span>179<span class="cite-bracket">]</span></a> </sup> </td> </tr> <tr> <td style="text-align: center;"> <i><a href="/wiki/Mechanic_Rocky" title="Mechanic Rocky">Mechanic Rocky</a></i> </td> <td>Ravi Teja Mullapudi</td> <td> <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" /> <div class="hlist"> <ul> <li><a href="/wiki/Vishwak_Sen" title="Vishwak Sen">Vishwak Sen</a></li> <li><a href="/wiki/Meenakshi_Chaudhary" title="Meenakshi Chaudhary">Meenakshi Chaudhary</a></li> </ul> </div> </td> <td>SRT Entertainments</td> <td style="text-align: center;"> <sup id="cite_ref-182" class="reference"> <a href="#cite_note-182"><span class="cite-bracket">[</span>180<span class="cite-bracket">]</span></a> </sup> </td> </tr> <tr> <td style="text-align: center; background: #77ea83;"> <b> N<br /> O<br /> V </b> </td> <td style="text-align: center; background: #77ea83;"><b>9</b></td> <td style="text-align: center;"> <i><a href="/wiki/Telusu_Kada" title="Telusu Kada">Telusu Kada</a></i> </td> <td><a href="/wiki/Neeraja_Kona" title="Neeraja Kona">Neeraja Kona</a></td> <td> <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" /> <div 
class="hlist"> <ul> <li><a href="/wiki/Siddhu_Jonnalagadda" title="Siddhu Jonnalagadda">Siddhu Jonnalagadda</a></li> <li><a href="/wiki/Raashii_Khanna" title="Raashii Khanna">Raashii Khanna</a></li> <li><a href="/wiki/Srinidhi_Shetty" title="Srinidhi Shetty">Srinidhi Shetty</a></li> </ul> </div> </td> <td>People Media Factory</td> <td style="text-align: center;"> <sup id="cite_ref-183" class="reference"> <a href="#cite_note-183"><span class="cite-bracket">[</span>181<span class="cite-bracket">]</span></a> </sup> </td> </tr> <tr> <td rowspan="2" style="text-align: center; background: #f4ca16; textcolor: #000;"> <b> D<br /> E<br /> C </b> </td> <td rowspan="1" style="text-align: center; background: #f8de7e;"><b>6</b></td> <td style="text-align: center;"> <i><a href="/wiki/Pushpa_2:_The_Rule" title="Pushpa 2: The Rule">Pushpa 2: The Rule</a></i> </td> <td><a href="/wiki/Sukumar" title="Sukumar">Sukumar</a></td> <td> <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" /> <div class="hlist"> <ul> <li><a href="/wiki/Allu_Arjun" title="Allu Arjun">Allu Arjun</a></li> <li><a href="/wiki/Fahadh_Faasil" title="Fahadh Faasil">Fahadh Faasil</a></li> <li><a href="/wiki/Rashmika_Mandanna" title="Rashmika Mandanna">Rashmika Mandanna</a></li> </ul> </div> </td> <td><a href="/wiki/Mythri_Movie_Makers" title="Mythri Movie Makers">Mythri Movie Makers</a></td> <td style="text-align: center;"> <sup id="cite_ref-184" class="reference"> <a href="#cite_note-184"><span class="cite-bracket">[</span>182<span class="cite-bracket">]</span></a> </sup> </td> </tr> <tr> <td rowspan="1" style="text-align: center; background: #f8de7e;"><b>20</b></td> <td style="text-align: center;"><i>Robinhood</i></td> <td><a href="/wiki/Venky_Kudumula" title="Venky Kudumula">Venky Kudumula</a></td> <td> <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" /> <div class="hlist"> <ul> <li><a href="/wiki/Nithiin" title="Nithiin">Nithiin</a></li> <li><a 
href="/wiki/Sreeleela" title="Sreeleela">Sreeleela</a></li> </ul> </div> </td> <td><a href="/wiki/Mythri_Movie_Makers" title="Mythri Movie Makers">Mythri Movie Makers</a></td> <td style="text-align: center;"> <sup id="cite_ref-185" class="reference"> <a href="#cite_note-185"><span class="cite-bracket">[</span>183<span class="cite-bracket">]</span></a> </sup> </td> </tr> </tbody> <tfoot></tfoot> </table>

This is the Python script I wrote:

    soup = BeautifulSoup(html_page, "html.parser")

    tables = soup.find_all("table",{"class":"wikitable sortable"})
    headers = ['month','day','movie','director','cast','producer','reference']
    movie_tables = []
    total_movies = 0
    for table in tables:
        caption = table.find("caption")
        if not caption or not caption.get_text().strip():
            movie_tables.append(table)

    #captions = soup.find_all("caption")
    max_columns = len(headers)
    # List to store dictionaries
    data_dict_list = []
    movies= []
    for movie_table in movie_tables:
        table_rows = movie_table.find("tbody").find_all("tr")[1:]
        for table_row in table_rows:
            total_movies += 1
            columns = table_row.find_all('td')
            row_data = [col.get_text(strip=True) for col in columns]
            # If the row has fewer columns than the max, pad it with None
            if len(row_data) == 6:
                row_data.insert(0, None)
            elif len(row_data) == 5:
                row_data.insert(0, None)
                row_data.insert(1, None)

            for col in columns:
                li_tags = col.find_all('li')
                if li_tags:
                    cast=""
                    for li in li_tags:
                        li_values = li.get_text(strip=True)
                        cast = ', '.join(li_values)
                    row_data.append(cast)
                else:
                    row_data.append(col.get_text())

            # Create a dictionary mapping headers to row data
            row_dict = dict(zip(headers, row_data))

            # Append the dictionary to the list
            data_dict_list.append(row_dict)

    # Print the list of dictionaries
    for row_dict in data_dict_list:
        print(row_dict)

This is the output I get (only some items are shown here):

    {'month': 'OCT', 'day': '11', 'movie': 'Viswam', 'director': 'Sreenu Vaitla', 'cast': 'GopichandKavya ThaparVennela KishoreSunilNaresh', 'producer': 'Chitralayam StudiosPeople Media Factory', 'reference': '[178]'}
    {'month': None, 'day': '31', 'movie': 'Lucky Baskhar', 'director': 'Venky Atluri', 'cast': 'Dulquer SalmaanMeenakshi Chaudhary', 'producer': 'Sithara Entertainments', 'reference': '[179]'}
    {'month': None, 'day': None, 'movie': 'Mechanic Rocky', 'director': 'Ravi Teja Mullapudi', 'cast': 'Vishwak SenMeenakshi Chaudhary', 'producer': 'SRT Entertainments', 'reference': '[180]'}
    {'month': 'NOV', 'day': '9', 'movie': 'Telusu Kada', 'director': 'Neeraja Kona', 'cast': 'Siddhu JonnalagaddaRaashii KhannaSrinidhi Shetty', 'producer': 'People Media Factory', 'reference': '[181]'}
    {'month': 'DEC', 'day': '6', 'movie': 'Pushpa 2: The Rule', 'director': 'Sukumar', 'cast': 'Allu ArjunFahadh FaasilRashmika Mandanna', 'producer': 'Mythri Movie Makers', 'reference': '[182]'}
    {'month': None, 'day': '20', 'movie': 'Robinhood', 'director': 'Venky Kudumula', 'cast': 'NithiinSreeleela', 'producer': 'Mythri Movie Makers', 'reference': '[183]'}

This is what I want to get (only the last item is shown here):

    {'month': 'DEC', 'day': '20', 'movie': 'Robinhood', 'director': 'Venky Kudumula', 'cast': 'Nithiin|Sreeleela', 'producer': 'Mythri Movie Makers', 'reference': '[183]'}

I have been trying to debug this for the past day or so, but I can't figure out what is going wrong. What I expect: the month and day are filled in for every item when those columns span multiple rows and are not present in every row. Next, I want a separator between the different cast members so that I can easily create charts later. Also, while doing all of this, how can I extract the hyperlinks and store them under a separate key in the dictionary?

If you want every month to be shown, you have to fill in the months that are missing. The empty months all have one thing in common: they are the same as the previous month. You can simply create a variable called lastMonth, assign it the first month, and then compare it with the next one. If the next month is empty, replace it with lastMonth. If it is not empty and differs from lastMonth, replace the value of lastMonth with the current month. Repeat this for all the dictionaries and you will have all the months.
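
The lastMonth idea can be sketched roughly as follows. This is only an illustration under a few assumptions: the helper name fill_down is hypothetical, the rows are a shortened copy of the output shown in the question, and applying the same fill to the day column is an extension the original suggestion does not spell out.

```python
def fill_down(rows, keys=("month", "day")):
    """Return copies of rows in which a missing value is replaced by the last one seen."""
    last_seen = {}          # plays the role of lastMonth, one entry per key
    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input list is left untouched
        for key in keys:
            if new_row.get(key) is None:
                # Empty cell: reuse the value from the closest earlier row.
                new_row[key] = last_seen.get(key)
            else:
                # Non-empty cell: remember it for the rows that follow.
                last_seen[key] = new_row[key]
        filled.append(new_row)
    return filled


# Shortened rows taken from the output in the question (other keys omitted).
rows = [
    {"month": "OCT", "day": "11", "movie": "Viswam"},
    {"month": None, "day": "31", "movie": "Lucky Baskhar"},
    {"month": None, "day": None, "movie": "Mechanic Rocky"},
    {"month": "NOV", "day": "9", "movie": "Telusu Kada"},
    {"month": "DEC", "day": "6", "movie": "Pushpa 2: The Rule"},
    {"month": None, "day": "20", "movie": "Robinhood"},
]

filled = fill_down(rows)
for row in filled:
    print(row)
```

Keeping the last seen value per key avoids a separate variable for each column, and copying each row means the original list can still be inspected while debugging.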

python web-scraping beautifulsoup

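Answers: 1, Votes: 0

How to scrape all customer reviews?

I am trying to scrape all the reviews on this website - https://www.backmarket.com/en-us/r/l/airpods/345c3c05-8a7b-4d4d-ac21-518b12a0ec17. The website says there are 753 reviews, but when I try to scrape...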

python beautifulsoup

Answers: 1, Votes: 0

Beautiful Soup 'ResultSet' object has no attribute 'text'

from bs4 import BeautifulSoup import urllib.request import win_unicode_console win_unicode_console.enable() link = ('https://pietroalbini.io/') req = urllib.request.Request(link, headers={'User-...

python beautifulsoup

Answers: 3, Votes: 0
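
The excerpt above is cut off before the failing line, so the following is only a sketch of the usual cause of a "'ResultSet' object has no attribute 'text'" error, using a made-up HTML snippet: find_all() returns a list-like ResultSet, and the text lives on each Tag inside it rather than on the collection.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = "<ul><li><a href='/a'>First</a></li><li><a href='/b'>Second</a></li></ul>"
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet, which behaves like a list of Tag objects.
links = soup.find_all("a")

# links.text would raise AttributeError; read the text from each Tag instead.
texts = [link.get_text() for link in links]

# find() returns a single Tag (or None), so its text can be read directly.
first_text = soup.find("a").get_text()
print(texts, first_text)
```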

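Print unformatted raw html data of a web page, including tags and similar information

import requests from bs4 import BeautifulSoup url = 'https://www.somewebpage.com' response = requests.get(url) soup = BeautifulSoup(response.text, 'html.parser') print(soup.prettify()) When I run ...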

python html beautifulsoup python-requests

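Answers: 1, Votes: 0

Scraping inspect-element data with Python

I want to scrape the inspect-element data from a specific web page and parse it to find the data I need. import requests from bs4 import BeautifulSoup url = 'https://www.somewebpage.com'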

python web-scraping beautifulsoup

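Answers: 1, Votes: 0

Data scraping problem when downloading PDFs from a journal website

I am having trouble scraping PDFs from the MDPI Remote Sensing journal using BeautifulSoup and Python. The purpose of my code is to scrape each journal volume and the issues within it, to...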

python web-scraping beautifulsoup pypdf

Answers: 1, Votes: 0
