Questions tagged beautifulsoup

Beautiful Soup is a Python package for parsing HTML/XML. The latest version of this package is version 4, imported as bs4.
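
As a rough illustration of the description above, the following minimal sketch parses a small document, reads a tag, and modifies it. The sample markup and variable names are made up for this example; only the bs4 import and the built-in "html.parser" backend are taken as given.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, not taken from any question on this page.
markup = "<html><body><p class='greeting'>Hello</p></body></html>"

# Version 4 of the package is imported as bs4; the parser is named explicitly.
soup = BeautifulSoup(markup, "html.parser")

# Look up a tag by name and class, then read its text.
paragraph = soup.find("p", class_="greeting")
original_text = paragraph.get_text()

# The soup object can also be used to modify the document in place.
paragraph.string = "Hello, world"
print(original_text)
print(soup.p)
```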

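In every BeautifulSoup example and tutorial I have seen, an HTML/XML document is passed in and a soup object is returned, which can then be used to modify the document. However, how do I use...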

python xml beautifulsoup

2 answers, 0 votes

I am writing a Python script that extracts script locations after parsing them from a web page. Suppose there are two scenarios:

    <script type="text/javascript" src="http://example.com/something.js"></script>

and

    <script>some JS</script>

I can get the JS in the second scenario, that is, when the JS is written inside the tags.

But is there any way to get the value of src in the first scenario (that is, to extract every value of the src attribute of the script tags, such as http://example.com/something.js)?

Here is my code:

    #!/usr/bin/python
    # Python 2 syntax (print statement)
    import requests
    from bs4 import BeautifulSoup

    r = requests.get("http://rediff.com/")
    data = r.text
    soup = BeautifulSoup(data)
    for n in soup.find_all('script'):
        print n

Output: some JS

Desired output: http://example.com/something.js

Answer (accepted, 27 votes):

This gets all the src values, but only where they are present; otherwise it skips that <script> tag.

    # Python 2 syntax (urllib2, print statement)
    from bs4 import BeautifulSoup
    import urllib2

    url = "http://rediff.com/"
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    # Keep only the script tags that carry a src attribute
    sources = soup.findAll('script', {"src": True})
    for source in sources:
        print source['src']

I get the following two src values as the result:

    http://imworld.rediff.com/worldrediff/js_2_5/ws-global_hm_1.js
    http://im.rediff.com/uim/common/realmedia_banner_1_5.js

I guess this is what you want. Hope this helps.

Answer (5 votes):

Get 'src' from the script node.

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("http://rediff.com/")
    data = r.text
    soup = BeautifulSoup(data)
    for n in soup.find_all('script'):
        print "src:", n.get('src')  # <====

Answer (1 vote):

This should work: you simply filter to find all the script tags, then determine whether each one has a 'src' attribute. If it does, the URL of the javascript is contained in the src attribute; otherwise we assume the javascript is inside the tags.

    #!/usr/bin/python
    import requests
    from bs4 import BeautifulSoup

    # Test HTML which has both cases
    html = '<script type="text/javascript" src="http://example.com/something.js">'
    html += '</script> <script>some JS</script>'

    soup = BeautifulSoup(html)

    # Find all script tags
    for n in soup.find_all('script'):
        # Check if the src attribute exists, and if it does grab the source URL
        if 'src' in n.attrs:
            javascript = n['src']
        # Otherwise assume that the javascript is contained within the tags
        else:
            javascript = n.text
        print javascript

The output of this is:

    http://example.com/something.js
    some JS

Answer (0 votes):

If anyone needs this in python3.x, this should work:

    from bs4 import BeautifulSoup as BS
    import requests

    res = requests.get("http://rediff.com/", verify=False).text
    parser = 'html.parser'  # or you can use 'lxml' (preferred)
    soup = BS(res, parser)
    for item in soup.find_all('script', {'src': True}):
        print(item['src'])

Output:

    //newads.rediff.com/rediffadserver/www/delivery/asyncjs.php
    https://www.googletagservices.com/tag/js/gpt.js
    https://www.googletagservices.com/tag/js/gpt.js
    //imworld.rediff.com/worldrediff/js_2_5/sns_us_home_9.js
    //imworld.rediff.com/worldrediff/js_2_5/us_home_other_4_min.js
    https://www.googletagmanager.com/gtag/js?id=G-3FM4PW27JR
    https://fundingchoicesmessages.google.com/i/pub-2932970604686705?ers=1
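
The answers above can be combined into one offline sketch in Python 3 that separates external script URLs from inline script bodies. It is an illustration under stated assumptions rather than any answerer's code: the sample markup, the function name split_scripts, and the base URL are hypothetical, and resolving protocol-relative values (such as the //newads... entries in the last output) with urljoin is an addition not found in the answers.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample markup covering both cases from the question,
# plus one protocol-relative src like those in the last answer's output.
SAMPLE_HTML = (
    '<script type="text/javascript" src="http://example.com/something.js"></script>'
    '<script src="//cdn.example.com/lib.js"></script>'
    '<script>some JS</script>'
)


def split_scripts(html, base_url):
    """Return (external_urls, inline_bodies) for every <script> tag in html."""
    soup = BeautifulSoup(html, "html.parser")
    external, inline = [], []
    for tag in soup.find_all("script"):
        src = tag.get("src")  # None when the attribute is absent
        if src is not None:
            # urljoin leaves absolute URLs alone and completes relative ones.
            external.append(urljoin(base_url, src))
        else:
            # .string is the tag's single text child, or None for an empty tag.
            inline.append(tag.string or "")
    return external, inline


# base_url is an assumed page address used only to resolve relative src values.
external, inline = split_scripts(SAMPLE_HTML, "https://example.com/page")
print(external)
print(inline)
```

Using tag.get("src") instead of tag["src"] avoids a KeyError on inline scripts, which is the same idea as the n.get('src') answer.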

python python-2.7 beautifulsoup

0 answers, 0 votes

DataFrame from a BeautifulSoup object

I want to create a dataframe from a BeautifulSoup object - import pandas as pd from requests import get from bs4 import BeautifulSoup import re # Get the web page url = 'https://carbondale.craig...

python beautifulsoup

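1 answer, 0 votes

Selenium - XPath - Searching for an element by innerHTML

I am learning Selenium and have a good grasp of XPath. One issue I am running into is that on a web page, I want to select an element that has a dynamically generated id and class. I have three...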

html python-2.7 selenium xpath beautifulsoup

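2 answers, 0 votes

How to get data from a table with an ID in BeautifulSoup?

I am trying to fetch data from the table with id='stats_standard' using the BeautifulSoup and requests libraries, but although I have tried various methods, such as using find and select, I still don't receive...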

python web-scraping beautifulsoup

1 answer, 0 votes

Extracting the content within a tag with BeautifulSoup

I want to extract the content Hello world. Note that there are multiple <table> elements and similar <td colspan="2"> cells on the page as well:

    <table border="0" cellspacing="2" width="800">
      <tr>
        <td colspan="2"><b>Name: </b>Hello world</td>
      </tr>
      <tr>
      ...

I tried the following:

    hello = soup.find(text='Name: ')
    hello.findPreviousSiblings

But it returned nothing. In addition, I am also having a problem extracting My home address from the following:

    <td><b>Address:</b></td>
    <td>My home address</td>

I am using the same method to search for text="Address: ", but how do I navigate down to the next line and extract the content of the <td>?

Answer:

The contents operator works well for extracting text from <tag>text</tag>.

<td>My home address</td> example:

    s = '<td>My home address</td>'
    soup = BeautifulSoup(s)
    td = soup.find('td')  # <td>My home address</td>
    td.contents  # ['My home address'] (a one-item list)

<td><b>Address:</b></td> example:

    s = '<td><b>Address:</b></td>'
    soup = BeautifulSoup(s)
    td = soup.find('td').find('b')  # <b>Address:</b>
    td.contents  # ['Address:'] (a one-item list)

Answer:

Use .next instead:

    >>> s = '<table border="0" cellspacing="2" width="800"><tr><td colspan="2"><b>Name: </b>Hello world</td></tr><tr>'
    >>> soup = BeautifulSoup(s)
    >>> hello = soup.find(text='Name: ')
    >>> hello.next
    u'Hello world'

.next and .previous let you move through the document elements in the order in which the parser processed them, while the sibling methods work with the parse tree.

Answer:

Use the code below to extract text and content from html tags with python beautifulSoup:

    s = '<td>Example information</td>'  # your raw html
    soup = BeautifulSoup(s)  # parse html with BeautifulSoup
    td = soup.find('td')  # tag of interest <td>Example information</td>
    td.text  # Example information

Answer:

    # clean text from html
    from bs4 import BeautifulSoup, Tag

    def get_tag_html(tag: Tag):
        # Join the tag's children: nested tags as markup, strings as they are
        return ''.join([i.decode() if type(i) is Tag else i for i in tag.contents])
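
The navigation ideas in these answers can be put together in a short Python 3 sketch that handles both cells from the question. This is an illustration, not any answerer's code: the markup is a completed version of the question's truncated fragment, the variable names are hypothetical, and the lookups use the bs4 spellings string=, .next_sibling, find_parent and find_next_sibling rather than the older .next shown above.

```python
from bs4 import BeautifulSoup

# The question's fragment, completed with closing tags so it stands alone.
HTML = """
<table border="0" cellspacing="2" width="800">
  <tr>
    <td colspan="2"><b>Name: </b>Hello world</td>
  </tr>
  <tr>
    <td><b>Address:</b></td>
    <td>My home address</td>
  </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Case 1: the value is a text node that directly follows the <b> label.
name_label = soup.find("b", string="Name: ")
name_value = name_label.next_sibling.strip()

# Case 2: the value lives in the next <td>, so climb from the label to its
# own cell and then step sideways to the following cell.
address_label = soup.find("b", string="Address:")
address_cell = address_label.find_parent("td").find_next_sibling("td")
address_value = address_cell.get_text(strip=True)

print(name_value)
print(address_value)
```

Passing a tag name to find_next_sibling skips the whitespace text nodes that sit between the two cells, which a bare .next_sibling would return first.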

python beautifulsoup

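4 answers, 0 votes

403 Forbidden error when scraping a website, user agent already used and updated. Any ideas?

As the title above states, I am getting a 403 error. The generated URLs are valid; I can print them and then open them in a browser just fine. I have a user agent, and it matches my...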

python web-scraping beautifulsoup

2 answers, 0 votes

Getting <a> tag content with BeautifulSoup

I want to get the content of an <a> tag using BeautifulSoup (version 4.12.3) in Python. I have this code and HTML example:

    h = """
    <a id="0">
    <table>
      <thead>
        <tr>
          <th scope="col">Person</th>
          <th scope="col">Most interest in</th>
          <th scope="col">Age</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th scope="row">Chris</th>
          <td>HTML tables</td>
          <td>22</td>
        </tr>
    </table>
    </a>
    """
    test = bs4.BeautifulSoup(h)
    test.find('a')  # find_all, select => same results

But it only returns:

    <a id="0">
    </a>

I expect the content inside <table> to appear between the <a> tags. (I don't know whether it is common to wrap a table inside an <a> tag, but the HTML code I am trying to read is like this.) I need to parse the table content from the <a> tag because I need to link id="0" to the content of the table. How can I do that? How can I get the <table> tag content using the <a> tag?

Answer:

Explicitly specify the parser you want to use (use html.parser). By default it will use the "best" parser available; I presume that is lxml, which doesn't parse this document well:

    import bs4

    h = """
    <a id="0">
    <table>
      <thead>
        <tr>
          <th scope="col">Person</th>
          <th scope="col">Most interest in</th>
          <th scope="col">Age</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th scope="row">Chris</th>
          <td>HTML tables</td>
          <td>22</td>
        </tr>
    </table>
    </a>
    """
    test = bs4.BeautifulSoup(h, "html.parser")  # <-- define parser here
    out = test.find("a")
    print(out)

Prints:

    <a id="0">
    <table>
      <thead>
        <tr>
          <th scope="col">Person</th>
          <th scope="col">Most interest in</th>
          <th scope="col">Age</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th scope="row">Chris</th>
          <td>HTML tables</td>
          <td>22</td>
        </tr>
    </tbody></table>
    </a>
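
Building on the answer's advice to name the parser, the sketch below goes one step further and links each anchor id to the rows of the table nested inside it, which is what the question ultimately asks for. The function name tables_by_anchor_id and the list-of-lists row format are assumptions made for this example, not part of the original answer.

```python
from bs4 import BeautifulSoup

# The question's markup, including the unclosed <tbody>.
HTML = """
<a id="0">
<table>
  <thead>
    <tr>
      <th scope="col">Person</th>
      <th scope="col">Most interest in</th>
      <th scope="col">Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th scope="row">Chris</th>
      <td>HTML tables</td>
      <td>22</td>
    </tr>
</table>
</a>
"""


def tables_by_anchor_id(html):
    """Map each <a id=...> to the rows of the first table nested inside it."""
    # The parser is named explicitly so the result does not depend on which
    # optional parsers happen to be installed.
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for anchor in soup.find_all("a", id=True):
        table = anchor.find("table")
        if table is None:
            continue  # anchor without a nested table
        rows = []
        for tr in table.find_all("tr"):
            # Header and data cells are collected in document order.
            cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
            rows.append(cells)
        result[anchor["id"]] = rows
    return result


tables = tables_by_anchor_id(HTML)
print(tables)
```

Each row is kept as a plain list of cell strings, so the result can be passed on to whatever tabular structure the rest of a script uses.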

python html beautifulsoup

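1 answer, 0 votes

Unable to get all span tags inside a div element using beautifulsoup

I am scraping product detail page text on Amazon, but the data comes back as a bulleted list. I want the data added as columns alongside the other scraped data. Exported csv file Amazon product details...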

python web-scraping beautifulsoup

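1 answer, 0 votes

Scraping a website where multiple pages have the same URL? Page navigation is an ajax request

I have been at this for days; I am trying to scrape this website: "https://careers.ispor.org/jobseeker/search/results/" I have covered everything from extracting...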

python ajax selenium-webdriver web-scraping beautifulsoup

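1 answer, 0 votes

Web scraping a table from the UniProt database

I have a list of UniProt IDs and would like to use BeautifulSoup to scrape the table containing structure information. The URL I am using is the following: https://www.uniprot.org/uniprot/P03496, with...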

python python-3.x web-scraping beautifulsoup

2 answers, 0 votes

Web scraping in R / Python

I need to extract data from https://eservices.dha.gov.ae/DHASearch/UIPages/ProfessionalSearch.aspx?PageLang=En. I need 4 columns - "Name", "Gender", "Title", "Hospital Name", "Contact Details".

python web-scraping beautifulsoup scrapy rvest

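2 answers, 0 votes

How to extract tags with ::marker from HTML using beautiful soup

I am trying to use BeautifulSoup to find li elements that have ::marker, as shown below. I tried using cssutils but without success (maybe I am using it wrong). Pseudocode: lis = soup_obj.find_...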

python html beautifulsoup

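3 answers, 0 votes

Unable to scrape all reviews

I am trying to scrape this website and get the reviews, but I am running into a problem: the page only loads 50 reviews. To load more, you have to click "Show more reviews", but I don't...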

python python-3.x beautifulsoup request

1 answer, 0 votes

Why does my code only scrape the first page of product reviews?

I am scraping product reviews on this website: https://www.lazada.com.my/products/xiaomi-mi-a1-4gb-ram-32gb-rom-i253761547-s336359472.html?spm=a2o4k.searchlistcategory.list.64.71546883QBZiNT&

python selenium-webdriver beautifulsoup

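2 answers, 0 votes

Extracting data between two h1 tags with BeautifulSoup

Beautiful Soup: extracting everything between two tags. I am using BeautifulSoup to extract the content between two specific HTML tags. These tags don't have any specific attributes or IDs, and I...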

python html beautifulsoup html-tag-summary

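2 answers, 0 votes

Web scraping multiple classes with python

I am trying to scrape addresses from a 10K filing in HTML format: https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm It has multiple div classes, and I want to scrape...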

python web-scraping web beautifulsoup edgar

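2 answers, 0 votes

How to fix "TypeError: 'NoneType' object is not callable" in python

When I try to run this simple python web scraping program (shown below), I get the error "TypeError: 'NoneType' object is not callable". How can I fix this? from bs4 import Beautiful...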

python web-scraping beautifulsoup python-requests typeerror

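1 answer, 0 votes

HTTP error 404 when scraping the first table with BeautifulSoup, but the second table works fine

I am writing a Python script to scrape historical CDS data from Investing.com using BeautifulSoup. The goal is to extract data from specific tables on the page and compile it into a DataFrame. Ha...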

python beautifulsoup python-requests finance

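1 answer, 0 votes

Problem with web scraping automation in Python using Selenium

I am having a problem with my ETL process. Let me explain my problem; I have this code: import time from selenium import webdriver from selenium.webdriver.common.by import By import pandas as pd import...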

python selenium-webdriver beautifulsoup etl

1 answer, 0 votes
