Comments are visible on the web page, but the HTML object returned by BeautifulSoup does not contain the comment section

Question (votes: 1, answers: 1)

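I am trying to extract the text of the comments from a web page, given its URL, by scraping it with BeautifulSoup. When I open the URL in a browser, the comments are displayed on the page, but the HTML object returned by BeautifulSoup contains neither those tags nor their text.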

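I use BeautifulSoup with 'html.parser' for the scraping. I successfully extracted the number of likes/views/comments for the video on the given page, but the content of the comment section is not included in the HTML. My browser is Chrome and my system is Ubuntu 18.04.1 LTS.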

Here is the code I used (in Python):

from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup

webpage_link = "https://www.airvuz.com/video/Majestic-Beast-Nanuk?id=59b2a56141ab4823e61ea901"

try:
    page = urlopen(webpage_link)
except HTTPError as err:  # webpage cannot be found
    print("ERROR! %s" % webpage_link)
    raise  # stop here; otherwise `page` would be undefined below

# Parse the static HTML returned by the server (no JavaScript is executed)
soup = BeautifulSoup(page, 'html.parser')

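I expected the soup object to contain everything visible on the page, in particular the text of the comments (for example "Not being there I enjoyed a lot seeing the life style of white bear. Thanks to the provider for such documentary." and "WOOOW... amazing..."); however, I cannot find the corresponding nodes in the soup object. Any help would be greatly appreciated!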

python web-scraping beautifulsoup data-extraction
1 answer
0
votes

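The comments are generated by JavaScript through an AJAX request, so they are not part of the HTML that urlopen downloads. You can send the same request yourself and read the comments from the JSON response. To find the request, use the Network tab of your browser's inspection tools.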

from urllib.request import urlopen
import json

# JSON endpoint that the page itself calls to load the comments
webpage_link = "https://www.airvuz.com/api/comments/video/59b2a56141ab4823e61ea901?page=1&limit=20"
page = urlopen(webpage_link).read()
comments_json = json.loads(page)
# Each entry under 'data' holds one comment; print its text
for comment_info in comments_json['data']:
    print(comment_info['comment'].strip())

Output

Not being there I enjoyed a lot seeing the life style of white bear. Thanks to the provider for  such documentary.
WOOOW... amazing...
I've been photographing polar bears for years, but to see this footage from a drones perspective was epic! Well done and congratz on the Nominee! Well deserved.
You are da man Florian!
Absolutely outstanding!
This is incredible
jaw dropping
This is wow amazing, love it.
So cool! Did the bears react to the drone at all?
Congratulations! It's awesome! I am watching in tears....
Awesome!
perfect video awesome
It is very, very beautiful !!! Sincere congratulations
Made my day, exquisite, thank you
Wow
Super!
Marvelous!
Man this is incredible!
Material is good, but  edi is bad. This history about  beer's family...
Muy bueno!
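
The request above only asks for page=1 with limit=20, so it returns at most the first 20 comments. If you need more, a rough sketch along the following lines might work. The helper names (build_comments_url, extract_comments, fetch_all_comments) are hypothetical, and the stopping rule assumes that a page past the end comes back with an empty 'data' list, which I have not verified for this API; check the actual responses in the Network tab first.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the request above; the helpers below are illustrative only.
API_BASE = "https://www.airvuz.com/api/comments/video/"


def build_comments_url(video_id, page=1, limit=20):
    """Build the comments URL with the `page` and `limit` query parameters."""
    return API_BASE + video_id + "?" + urlencode({"page": page, "limit": limit})


def extract_comments(payload):
    """Return stripped comment texts from a decoded JSON payload.

    Assumes the shape used above: {"data": [{"comment": "..."}, ...]}.
    """
    return [item["comment"].strip() for item in payload.get("data", [])]


def fetch_all_comments(video_id, limit=20, max_pages=50):
    """Request successive pages until one returns no comments.

    Assumption: an exhausted page yields an empty "data" list.
    `max_pages` is a safety cap so the loop always terminates.
    """
    comments = []
    for page in range(1, max_pages + 1):
        with urlopen(build_comments_url(video_id, page, limit)) as response:
            batch = extract_comments(json.loads(response.read()))
        if not batch:
            break
        comments.extend(batch)
    return comments
```

Calling fetch_all_comments("59b2a56141ab4823e61ea901") would then collect the comment texts page by page. Keeping extract_comments separate from the network call makes it easy to check the parsing against a saved JSON response without hitting the site.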