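Questions tagged beautifulsoup

Beautiful Soup is a Python package for parsing HTML/XML. The latest version of the package is version 4, imported as bs4.

I am trying to scrape some Box Office Mojo pages with Beautiful Soup to get worldwide box-office totals. The code below gets the domestic data fine, but when I enter "Worldwi...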

python web-scraping beautifulsoup

Answers: 1, Votes: 0

I want to extract data from the HTML of the following site: http://movie.walkerplus.com/list/2015/12/

Part of the HTML looks like this:

    <dl class="directorList">
      <dt>監督</dt>
      <dd>
        <a href="/person/209306/" title="">スティーヴ・マーティノ</a>
      </dd>
    </dl>
    <dl class="roleList">
      <dt>出演キャスト</dt>
      <dd>
        <a href="/person/226530/" title="">鈴木福</a>
        <a href="/person/228506/" title="">芦田愛菜</a>
        <a href="/person/266939/" title="">小林星蘭</a>
      </dd>
    </dl>

I want to get all of the directorList (監督) and roleList (出演キャスト) data on this site, for example:

    スティーヴ・マーティノ
    鈴木福 芦田愛菜 小林星蘭

My code looks like this:

    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    html = urlopen("http://movie.walkerplus.com/list/2015/12/")
    bsObj_movie = BeautifulSoup(html, "html.parser")
    print(bsObj_movie)

    movie_title = bsObj_movie.findAll({"h3"})
    movie_description = bsObj_movie.findAll("p", {"class": "clearboth"})
    movie_directors = bsObj_movie.findAll("dl", {"class": "directorList"})
    movie_roles = bsObj_movie.findAll("dl", {"class": "roleList"})

    for description in movie_description:
        print(description.get_text())
    for title in movie_title:
        print(title.get_text())
    for director in movie_directors:
        print(director.find('a').get_text())
    for role in movie_roles:
        print(role.get_text())

I successfully get movie_title and movie_description, but movie_directors and movie_roles come out like this:

    監督 セルゲイ・ボドロフ
    出演キャスト 鈴木福 芦田愛菜 小林星蘭

What I actually want is only the following data, without the 監督 and 出演キャスト labels:

    セルゲイ・ボドロフ
    鈴木福 芦田愛菜 小林星蘭

I also want to print the title, description, director, and roles for each movie, not just the director and roles. In addition, I want to store this data in a database that has one table with four columns: title, description, director, and roles. Thanks in advance!

Answer:

Not every movie has a director (for example, The Royal Opera House Cinema Season 2015/16 Royal Opera "The Marriage of Figaro" on the first page), so I filter those out by starting from the director lists:

    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    html = urlopen("http://movie.walkerplus.com/list/2015/12/")
    soup = BeautifulSoup(html, "html.parser")

    # Only movies that have a director list are selected.
    data = soup.select("div.movie dl.directorList")
    for d in data:
        # The question reads titles from <h3> tags, so take the nearest preceding one.
        title = d.find_previous("h3").get_text(strip=True)
        description = d.find_previous("p", "clearboth").text
        # Join only the link texts, which skips the <dt> label.
        cast = ",".join([a.text.strip() for a in d.find_next("dl", "roleList").select("dd a")])
        director = d.dd.a.text
        print(title, director, cast, description)
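The question also asks about storing the four fields in a database, which the answer does not cover. The following is a minimal sketch of one way to do that with the standard-library sqlite3 module. It runs against an inline sample instead of the live site; the table name `movies`, the helper names, and the sample title and description text are assumptions made for illustration, and the `div.movie`, `h3`, and `p.clearboth` wrappers are taken from the selectors used in the question and answer rather than verified against the page.

```python
import sqlite3
from bs4 import BeautifulSoup

# Sample markup modeled on the snippet in the question. The wrapper elements
# and the title/description text are assumptions, not copied from the site.
SAMPLE_HTML = """
<div class="movie">
  <h3>Sample title</h3>
  <p class="clearboth">Sample description</p>
  <dl class="directorList"><dt>監督</dt>
    <dd><a href="/person/209306/" title="">スティーヴ・マーティノ</a></dd></dl>
  <dl class="roleList"><dt>出演キャスト</dt>
    <dd><a href="/person/226530/" title="">鈴木福</a>
        <a href="/person/228506/" title="">芦田愛菜</a>
        <a href="/person/266939/" title="">小林星蘭</a></dd></dl>
</div>
"""


def names(block, css_class):
    """Return the link texts inside <dl class=css_class>, skipping the <dt> label."""
    return [a.get_text(strip=True) for a in block.select(f"dl.{css_class} dd a")]


def extract_movies(html):
    """Build one (title, description, director, roles) tuple per movie block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for movie in soup.select("div.movie"):
        title = movie.find("h3")
        description = movie.find("p", class_="clearboth")
        rows.append((
            title.get_text(strip=True) if title else "",
            description.get_text(strip=True) if description else "",
            ",".join(names(movie, "directorList")),  # empty string if no director
            ",".join(names(movie, "roleList")),
        ))
    return rows


def save_movies(conn, rows):
    """Create the four-column table (hypothetical name) and insert all rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies "
        "(title TEXT, description TEXT, director TEXT, roles TEXT)"
    )
    conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path instead to persist the data
rows = extract_movies(SAMPLE_HTML)
save_movies(conn, rows)
stored = conn.execute("SELECT title, director, roles FROM movies").fetchall()
print(stored)
```

Iterating over each movie block, rather than over the director lists, keeps movies without a director in the result with an empty director column.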

html python-3.x web-scraping beautifulsoup

Answers: 1, Votes: 0

Unable to scrape

I am trying to get the list of companies from AngelList (https://angel.co/companies). I tried this code: from bs4 import BeautifulSoup import urllib2 headers = { 'User-Agent' : 'Mozilla/5.0...

python html web-scraping beautifulsoup

Answers: 3, Votes: 0

How do I download images with BeautifulSoup?

Image: https://i.sstatic.net/S1BR2.png import requests from bs4 import BeautifulSoup r = requests.get("xxxxxxxxx") soup = BeautifulSoup(r.content) for link in links: if link.get('s...

python python-2.7 web-scraping beautifulsoup

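Answers: 2, Votes: 0

Scraping 'N' pages with BeautifulSoup and Requests (how to get the real page number)

I want to get all the titles() on this site: http://www.shyan.gov.cn/zwhd/web/webindex.action Right now my code successfully scrapes only one page. However, there are multiple pages available...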

python selenium-webdriver web-scraping beautifulsoup python-requests

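Answers: 1, Votes: 0

Web scraping on chartink.com

Please help me scrape this link: https://chartink.com/screener/time-pass-48 I am trying to web scrape it, but it does not show the table I want. Please help me with this. I have tried...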

python html web web-scraping beautifulsoup

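Answers: 5, Votes: 0

How do I load all of a site's resources in Python, including AJAX requests, etc.?

I know how to request a website with Python and read its text. In the past, I have tried using a library like BeautifulSoup to make all the requests to the links on a site, but that did not get what...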

python selenium beautifulsoup urllib2 python-requests

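Answers: 3, Votes: 0

Scraping sports odds information

I am using Python 3.5 and am currently focused on web scraping with BeautifulSoup/lxml/Selenium/PhantomJS. I just want to scrape all the data I need so I can hack on it with Python code. I can easily...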

javascript python-3.x parsing web-scraping beautifulsoup

Answers: 2, Votes: 0

Web scraping with BeautifulSoup

I want to scrape the country names and their capitals from this link: https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order From the HTML code, I am looking for all...

python web-scraping beautifulsoup

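Answers: 2, Votes: 0

Is there a way to extract images from Wikipedia based on a URL using MediaWiki?

I have a Wikipedia URL and want to extract the URL of every content image. I tried normal web scraping with BeautifulSoup, where I take the URL and get the images with the class "thumbimage" to g...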

python web-scraping beautifulsoup mediawiki wikipedia-api

Answers: 2, Votes: 0

How to scrape a dynamic web page with Python

[What I am trying to do] Scrape the web page below to get used-car data. http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1 [Problem] ...

python html web-scraping beautifulsoup

Answers: 2, Votes: 0

Saving results to a list in a for loop?

url = 'http://www.millercenter.org/president/speeches' conn = urllib2.urlopen(url) html = conn.read() miller_center_soup = BeautifulSoup(html) links = miller_center_soup.find_all('a') for tag...
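The excerpt is cut off, so the exact goal is not known. Going by the title, a common pattern is to create the list before the loop and append to it inside the loop. The sketch below shows that pattern on an inline sample; the markup, the helper name `collect_links`, and the choice to keep only `href` values are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the downloaded page; the paths are made up.
SAMPLE_HTML = """
<ul>
  <li><a href="/president/speeches/speech-1">First speech</a></li>
  <li><a href="/president/speeches/speech-2">Second speech</a></li>
  <li><a name="anchor-without-href">No link</a></li>
</ul>
"""


def collect_links(html):
    """Return the href of every <a> tag that has one, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    results = []  # created once, before the loop, so values accumulate
    for tag in soup.find_all("a"):
        href = tag.get("href")  # None when the attribute is missing
        if href:
            results.append(href)
    return results


speech_links = collect_links(SAMPLE_HTML)
print(speech_links)
```

Assigning to a plain variable inside the loop would keep only the last value, which is a frequent cause of "only one result" in this kind of code.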

python list python-2.7 web-scraping beautifulsoup

Answers: 2, Votes: 0

Replacing img tags with inline SVG using BeautifulSoup

I have an HTML file generated by pandoc with embedded SVG illustrations. The SVG content is base64-encoded and included in the src attribute of the img element. It looks like this:

    <figure>
      <img role="img" aria-label="Figure 1" src="data:image/svg+xml;base64,<base64str>" alt="Figure 1" />
      <figcaption aria-hidden="true">Figure 1</figcaption>
    </figure>

I want to use BeautifulSoup to replace the img element with the decoded SVG string. This is what I did:

    from bs4 import BeautifulSoup
    import base64

    with open("file.html") as f:
        soup = BeautifulSoup(f, "html.parser")

    # get all images
    images = soup.find_all("img")

    # try with the first one
    # decode the SVG string from the src attribute
    svg_str = base64.b64decode(images[0]["src"].split(",")[1]).decode()

    # replace the tag with the string
    images[0].replace_with(soup.new_tag(svg_str))

However, images[0] stays unchanged, and no error is raised. I have looked at examples online, but I cannot figure out what I am doing wrong.

Answer:

The problem comes from the way you try to replace the img tag with the decoded SVG string. soup.new_tag creates a new tag from a tag name, but you are passing it the whole SVG string, which is not the intended usage. Instead, replace the img tag directly with the decoded SVG content:

1. Decode the base64 string.
2. Parse the decoded SVG string into a BeautifulSoup object.
3. Replace the img tag with the parsed SVG content.

Here is the corrected code:

    from bs4 import BeautifulSoup
    import base64

    with open("file.html") as f:
        soup = BeautifulSoup(f, "html.parser")

    # get all images
    images = soup.find_all("img")

    # process each image
    for img in images:
        # decode the SVG string from the src attribute
        svg_str = base64.b64decode(img["src"].split(",")[1]).decode()
        # parse the SVG string into a BeautifulSoup object
        svg_soup = BeautifulSoup(svg_str, "html.parser")
        # replace the img tag with the parsed SVG content
        img.replace_with(svg_soup)

    # Save the modified HTML to a new file
    with open("modified_file.html", "w") as f:
        f.write(str(soup))
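The three steps in the answer can be tried without a file on disk. The sketch below builds a small document in memory and adds one guard that the answer does not have: it skips img tags whose src is not a base64 SVG data URI. The sample SVG, the prefix constant, and the function name `inline_svg_images` are assumptions for illustration.

```python
import base64
from bs4 import BeautifulSoup

# Assumed prefix for base64-encoded SVG data URIs, as shown in the question.
SVG_PREFIX = "data:image/svg+xml;base64,"

# Build a tiny stand-in for the pandoc output (the SVG itself is made up).
svg_source = '<svg width="10" height="10"><circle r="4"></circle></svg>'
encoded = base64.b64encode(svg_source.encode()).decode()
html = (
    f'<figure><img alt="Figure 1" src="{SVG_PREFIX}{encoded}"/>'
    '<figcaption>Figure 1</figcaption></figure>'
    '<img alt="photo" src="photo.png"/>'
)


def inline_svg_images(soup):
    """Replace each base64 SVG <img> with its decoded <svg> element."""
    replaced = 0
    for img in soup.find_all("img"):
        src = img.get("src", "")
        if not src.startswith(SVG_PREFIX):
            continue  # leave ordinary images untouched
        svg_str = base64.b64decode(src[len(SVG_PREFIX):]).decode()
        svg_tag = BeautifulSoup(svg_str, "html.parser").find("svg")
        if svg_tag is not None:
            img.replace_with(svg_tag)  # moves the parsed <svg> into this tree
            replaced += 1
    return replaced


soup = BeautifulSoup(html, "html.parser")
count = inline_svg_images(soup)
print(count, str(soup))
```

One caveat, as far as I understand the parser: html.parser lowercases attribute names, so a case-sensitive SVG attribute such as viewBox would come out as viewbox. If the illustrations rely on such attributes, parsing the decoded SVG with an XML parser may be the safer choice; that requires lxml and is not shown here.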

python html svg beautifulsoup

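Answers: 1, Votes: 0

How do I stop each letter from being printed on a separate line?

When I try to scrape some text with beautifulsoup: class Scrape(object): def dirae(self, word): url = 'http://dirae.es/palabras/' + word site = urllib2.urlopen(u...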

python web-scraping beautifulsoup

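Answers: 1, Votes: 0

Scraping and parsing a website for information [closed]

I am trying to collect information on all the golf courses in the United States. I created a script to scrape the data from the PGA website, which lists about 18,000 golf courses. So my script is not...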

python csv parsing web-scraping beautifulsoup

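Answers: 1, Votes: 0

How to scrape product details on an Amazon web page using beautifulsoup [closed]

Web page: http://www.amazon.com/Harry-Potter-Prisoner-Azkaban-Rowling/dp/0439136369/ref=pd_sim_b_2?ie=UTF8&refRID=1MFBRAECGPMVZC5MJCWG How can I scrape the product details and output a dictionary in p...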

python web-scraping beautifulsoup

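Answers: 2, Votes: 0

Trying to figure out the logic of this page: about 100+ results stored, parsed with Python and BS4

Trying to figure out the logic behind this page: we have some results stored in the following database: https://www.raiffeisen.ch/rch/de/ueber-uns/raiffeisen-gruppe/organization/raiffeisenbanken/de...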

python pandas web-scraping beautifulsoup

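Answers: 1, Votes: 0

findAll() returns empty output

I am trying to scrape the title, date, rating, and actual review text of each review from 闭嘴. But I cannot extract anything below the page title. The reviews are located under "More...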

python selenium-webdriver web-scraping beautifulsoup request

Answers: 1, Votes: 0

How do I select a specific child tag from a parent tag and scrape data from it?

An HTML file contains several <div> tags with class="b-card". I extracted the following HTML from it:

    <div class="b-card">
      <div class="builder-exp-wrap">
        <a class="no-ajaxy img-wrap js-rc-link" data-href="/puravankara-limited-100046">
          <img alt="Images for Logo of Puravankara" src="https://im.proptiger.com/3/100046/13/puravankara-4491843.jpeg?width=155&height=50"/>
        </a>
        <div class="builder-details-wrap">
          <a class="no-ajaxy builder-name put-ellipsis js-b-card" data-builderid="100046" href="/puravankara-limited-100046" target="_blank"> Puravankara Limited </a>
        </div>
      </div>
      <div class="b-dtls">
        <div class="count-wrap one">
          <div class="circle">
            <div class="val"> 99 </div>
          </div>
          <div class="lbl"> Total Projects </div>
        </div>
        <div class="count-wrap">
          <div class="circle">
            <div class="val"> 36 </div>
          </div>
          <div class="lbl"> Ongoing Projects </div>
        </div>
      </div>
    </div>

Inside it, I want to scrape the text of the div tags with class="val". As shown below, I can use find_all() to iterate over the whole div blocks with class="b-card". Within each block, I can also scrape the text under the div with class="builder-details-wrap", because it has an a tag as its child. But I do not know how to proceed if I want the data under the div tags with class="count-wrap". Each of these parent tags has two child div tags. I am not sure how to select the one with class="circle", from which I ultimately need to reach the div with class="val" to scrape its text.

    from bs4 import BeautifulSoup
    import requests

    main_url = "https://www.proptiger.com/bangalore/all-builders?page=1"
    main_url_html = BeautifulSoup(requests.get(main_url).text, "html.parser")

    for bcard in main_url_html.find_all('div', class_='b-card'):
        bcard_CompanyName = bcard.find('div', class_='builder-details-wrap')
        bcard_CompanyName = bcard_CompanyName.a.text
        bcard_OngoingProjs = bcard.find('div', class_='count-wrap')
        bcard_OngoingProjs = bcard_OngoingProjs.div.div.text

Any help would be appreciated.

Answer:

I prefer select over find, though that is of course a matter of personal preference. With this code:

    for bcard in main_url_html.select('div.b-card'):
        bcard_CompanyName = bcard.select_one('div.builder-details-wrap a').text
        print(bcard_CompanyName)
        for project_stat in bcard.select('div.count-wrap'):
            lbl = project_stat.select_one('.lbl').text.strip()
            val = project_stat.select_one('.val').text.strip()
            print(lbl, val)

you get, for the first company:

    Mahindra Lifespaces Developers
    Total Projects 145
    Ongoing Projects 70
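To try the selector approach without fetching the live page, the sketch below runs it against a trimmed copy of the HTML from the question and collects the values into a dictionary per card instead of printing them. The function name `parse_cards`, the dictionary layout, and the conversion of the counts to int are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the snippet from the question (logo image omitted).
SAMPLE_HTML = """
<div class="b-card">
  <div class="builder-exp-wrap">
    <div class="builder-details-wrap">
      <a class="no-ajaxy builder-name put-ellipsis js-b-card"
         href="/puravankara-limited-100046"> Puravankara Limited </a>
    </div>
  </div>
  <div class="b-dtls">
    <div class="count-wrap one">
      <div class="circle"><div class="val"> 99 </div></div>
      <div class="lbl"> Total Projects </div>
    </div>
    <div class="count-wrap">
      <div class="circle"><div class="val"> 36 </div></div>
      <div class="lbl"> Ongoing Projects </div>
    </div>
  </div>
</div>
"""


def parse_cards(html):
    """Return one dict per b-card: the builder name plus a label -> count map."""
    soup = BeautifulSoup(html, "html.parser")
    cards = []
    for bcard in soup.select("div.b-card"):
        name_tag = bcard.select_one("div.builder-details-wrap a")
        stats = {}
        for wrap in bcard.select("div.count-wrap"):
            lbl = wrap.select_one(".lbl")
            val = wrap.select_one(".val")  # descendant match, so .circle is skipped over
            if lbl and val:
                stats[lbl.get_text(strip=True)] = int(val.get_text(strip=True))
        cards.append({
            "name": name_tag.get_text(strip=True) if name_tag else None,
            "stats": stats,
        })
    return cards


cards = parse_cards(SAMPLE_HTML)
print(cards)
```

Keying the counts by their label avoids depending on the order of the count-wrap blocks, which is what the chained `.div.div` access in the question relies on.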

python html beautifulsoup

Answers: 1, Votes: 0

Python|HTML: How do I select a specific child tag from a parent tag and scrape data from it?

An HTML file contains several <div> tags with class="b-card". I extracted the following HTML from it:

    <div class="builder-exp-wrap">
      <a class="no-ajaxy img-wrap js-rc-link" data-href="/puravankara-limited-100046">
        <img alt="Images for Logo of Puravankara" src="https://im.proptiger.com/3/100046/13/puravankara-4491843.jpeg?width=155&height=50"/>
      </a>
      <div class="builder-details-wrap">
        <a class="no-ajaxy builder-name put-ellipsis js-b-card" data-builderid="100046" href="/puravankara-limited-100046" target="_blank"> Puravankara Limited </a>
      </div>
    </div>
    <div class="b-dtls">
      <div class="count-wrap one">
        <div class="circle">
          <div class="val"> 99 </div>
        </div>
        <div class="lbl"> Total Projects </div>
      </div>
      <div class="count-wrap">
        <div class="circle">
          <div class="val"> 36 </div>
        </div>
        <div class="lbl"> Ongoing Projects </div>
      </div>
    </div>

Inside it, I want to scrape the text of the div tags with class="val". As shown below, I can use the find_all() method to iterate over the whole div blocks with class="b-card". Within each block, I can also scrape the text under the div with class="builder-details-wrap", because it has an a tag as its child. But I do not know how to proceed if I want the data under the div tags with class="count-wrap". Each of these parent tags has two child div tags. I am not sure how to select the one with class="circle", from which I ultimately need to reach the div with class="val" to scrape its text.

    from bs4 import BeautifulSoup
    import requests

    main_url = "https://www.proptiger.com/bangalore/all-builders?page=1"
    main_url_html = BeautifulSoup(requests.get(main_url).text, "html.parser")

    for bcard in main_url_html.find_all('div', class_='b-card'):
        bcard_CompanyName = bcard.find('div', class_='builder-details-wrap')
        bcard_CompanyName = bcard_CompanyName.a.text
        bcard_OngoingProjs = bcard.find('div', class_='count-wrap')
        bcard_OngoingProjs = bcard_OngoingProjs.div.div.text

Any help would be appreciated.

Answer:

I prefer select over find, though that is of course a matter of personal preference. With this code:

    for bcard in main_url_html.select('div.b-card'):
        bcard_CompanyName = bcard.select_one('div.builder-details-wrap a').text
        print(bcard_CompanyName)
        for project_stat in bcard.select('div.count-wrap'):
            lbl = project_stat.select_one('.lbl').text.strip()
            val = project_stat.select_one('.val').text.strip()
            print(lbl, val)

you get, for the first company:

    Mahindra Lifespaces Developers
    Total Projects 145
    Ongoing Projects 70
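For readers who want to stay with find and find_all, as in the question, here is a hedged sketch that picks one specific counter by its label text. The helper name `count_for_label` is hypothetical, and the enclosing b-card div in the sample is an assumption based on the question's description, since the extracted snippet does not include it.

```python
from bs4 import BeautifulSoup

# Counter markup from the question, wrapped in an assumed b-card div.
SAMPLE_HTML = """
<div class="b-card">
  <div class="b-dtls">
    <div class="count-wrap one">
      <div class="circle"><div class="val"> 99 </div></div>
      <div class="lbl"> Total Projects </div>
    </div>
    <div class="count-wrap">
      <div class="circle"><div class="val"> 36 </div></div>
      <div class="lbl"> Ongoing Projects </div>
    </div>
  </div>
</div>
"""


def count_for_label(bcard, label):
    """Return the integer in the count-wrap whose label matches, else None."""
    # class_="count-wrap" also matches class="count-wrap one", because
    # Beautiful Soup compares against each class of a multi-class attribute.
    for wrap in bcard.find_all("div", class_="count-wrap"):
        lbl = wrap.find("div", class_="lbl")
        val = wrap.find("div", class_="val")  # find() searches all descendants
        if lbl and val and lbl.get_text(strip=True) == label:
            return int(val.get_text(strip=True))
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
card = soup.find("div", class_="b-card")
ongoing = count_for_label(card, "Ongoing Projects")
total = count_for_label(card, "Total Projects")
print(total, ongoing)
```

Because find() descends through all nested tags, there is no need to step through the circle div by hand before reaching the val div.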

python html beautifulsoup

Answers: 1, Votes: 0
