We want to scrape articles (content + title) to extend our dataset for text-classification purposes.
Goal: scrape all articles from all pages of https://www.bbc.com/news/technology
Problem: the code only seems to extract articles from https://www.bbc.com/news/technology?page=1, even though we do follow all the pages. Could there be something wrong with the way we follow the pages?
from typing import Any

import scrapy
from scrapy.http import Response


class BBCSpider_2(scrapy.Spider):
    name = "bbc_tech"
    start_urls = ["https://www.bbc.com/news/technology"]

    def parse(self, response: Response, **kwargs: Any) -> Any:
        # Read the highest page number from the pagination bar.
        max_pages = response.xpath("//nav[@aria-label='Page']/div/div/div/div/ol/li[last()]/div/a//text()").get()
        max_pages = int(max_pages)
        for p in range(max_pages):
            page = f"https://www.bbc.com/news/technology?page={p+1}"
            yield response.follow(page, callback=self.parse_articles2)
Next, we follow each article on the corresponding page:
    def parse_articles2(self, response):
        # The page holds article lists in two containers (divs 4 and 8).
        container_to_scan = [4, 8]
        for box in container_to_scan:
            if box == 4:
                articles = response.xpath(f"//*[@id='main-content']/div[{box}]/div/div/ul/li")
            if box == 8:
                articles = response.xpath(f"//*[@id='main-content']/div[{box}]/div[2]/ol/li")
            for article_idx in range(len(articles)):
                if box == 4:
                    relative_url = response.xpath(f"//*[@id='main-content']/div[4]/div/div/ul/li[{article_idx+1}]/div/div/div/div[1]/div[1]/a/@href").get()
                elif box == 8:
                    relative_url = response.xpath(f"//*[@id='main-content']/div[8]/div[2]/ol/li[{article_idx+1}]/div/div/div[1]/div[1]/a/@href").get()
                else:
                    relative_url = None
                if relative_url is not None:
                    followup_url = "https://www.bbc.com" + relative_url
                    yield response.follow(followup_url, callback=self.parse_article)
Last but not least, we scrape the content and title of each article:
    def parse_article(response):
        article_text = response.xpath("//article/div[@data-component='text-block']")
        content = []
        for box in article_text:
            text = box.css("div p::text").get()
            if text is not None:
                content.append(text)
        title = response.css("h1::text").get()
        yield {
            "title": title,
            "content": content,
        }
When we run this, we get an items_scraped_count of 24. But it should be around 24 x 29 +/- ...
It looks like your follow-up calls to page 2, page 3, and so on are being filtered out by Scrapy's duplicate filter (filtered requests show up in the crawl stats under 'dupefilter/filtered'). This happens because the site keeps serving the same front page regardless of what page number you put in the URL. After the front page is rendered, the site uses a JSON API to fetch the actual article information for the requested page, and that information cannot be captured with Scrapy alone unless you call the API directly.
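You can double-check this yourself by requesting a few of those page URLs with dont_filter=True and logging what comes back; every "page" returns the same first headline. A quick probe sketch (the spider name and the h3 selector are my own choices, just for illustration):

import scrapy


class PageProbeSpider(scrapy.Spider):
    name = "bbc_page_probe"  # hypothetical diagnostic spider, not part of the scraper

    def start_requests(self):
        for p in (1, 2, 3):
            yield scrapy.Request(
                f"https://www.bbc.com/news/technology?page={p}",
                callback=self.check_page,
                dont_filter=True,  # let duplicates through so nothing is silently dropped
            )

    def check_page(self, response):
        # If the server ignores ?page=, the first headline is identical every time.
        self.logger.info("%s -> %r", response.url, response.css("h3::text").get())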
You can find the JSON API in the Network tab of your browser's dev tools, or use the one from my example below. You simply plug in the page number you want, much like you were doing with the .../news/technology?page=? URL. See the example below...
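If you would rather not carry that raw query string around, it can be rebuilt from readable parameters. An untested sketch (the values below are decoded from the dev-tools request used in the spider further down):

import json
from urllib.parse import quote, urlencode


def topic_stream_url(page: int) -> str:
    # Rebuild the topic-stream query string from its decoded parameters.
    params = {
        "adSlotType": "mpu_middle",
        "enableDotcomAds": "true",
        "isUk": "false",
        "lazyLoadImages": "true",
        "pageNumber": page,
        "pageSize": 24,
        "promoAttributionsToSuppress": json.dumps(["/news", "/news/front_page"], separators=(",", ":")),
        "showPagination": "true",
        "title": "Latest News",
        "tracking": json.dumps({
            "groupName": "Latest News",
            "groupType": "topic stream",
            "groupResourceId": "urn:bbc:vivo:curation:b2790c4d-d5c4-489a-84dc-be0dcd3f5252",
            "groupPosition": 5,
            "topicId": "cd1qez2v2j2t",
        }, separators=(",", ":")),
        "urn": "urn:bbc:vivo:curation:b2790c4d-d5c4-489a-84dc-be0dcd3f5252",
    }
    # quote_via=quote keeps spaces as %20, matching what the browser sends.
    return "https://www.bbc.com/wc-data/container/topic-stream?" + urlencode(params, quote_via=quote)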
One more thing... your parse_article method is missing self as its first parameter, which raises an error and prevents you from actually scraping any page content. I also rewrote some of your XPaths to make them more readable.
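To see why the missing self matters: Scrapy invokes the callback as a bound method, so the instance is passed implicitly along with the response. A stripped-down stand-in (Demo is purely illustrative):

class Demo:
    def parse_article(response):  # note: no self
        return response


# Equivalent to what Scrapy does when it calls the callback with a response:
Demo().parse_article("response")
# raises TypeError: takes 1 positional argument but 2 were given

The corrected spider follows: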
import scrapy


class BBCSpider_2(scrapy.Spider):
    name = "bbc_tech"
    start_urls = ["https://www.bbc.com/news/technology"]

    def parse(self, response):
        # Highest page number shown in the pagination bar.
        max_pages = response.xpath("//nav[@aria-label='Page']//ol/li[last()]//text()").get()
        # Page 1 is the HTML page we are already on; parse its articles directly.
        for article in response.xpath("//div[@type='article']"):
            if link := article.xpath(".//a[contains(@class, 'LinkPostLink')]/@href").get():
                yield response.follow(link, callback=self.parse_article)
        # Pages 2..max_pages come from the JSON API (URL copied from the browser
        # dev tools; the range end is max_pages + 1 so the last page is included).
        for i in range(2, int(max_pages) + 1):
            yield scrapy.Request(
                "https://www.bbc.com/wc-data/container/topic-stream"
                "?adSlotType=mpu_middle&enableDotcomAds=true&isUk=false"
                f"&lazyLoadImages=true&pageNumber={i}&pageSize=24"
                "&promoAttributionsToSuppress=%5B%22%2Fnews%22%2C%22%2Fnews%2Ffront_page%22%5D"
                "&showPagination=true&title=Latest%20News"
                "&tracking=%7B%22groupName%22%3A%22Latest%20News%22%2C%22groupType%22%3A%22topic%20stream%22%2C%22groupResourceId%22%3A%22urn%3Abbc%3Avivo%3Acuration%3Ab2790c4d-d5c4-489a-84dc-be0dcd3f5252%22%2C%22groupPosition%22%3A5%2C%22topicId%22%3A%22cd1qez2v2j2t%22%7D"
                "&urn=urn%3Abbc%3Avivo%3Acuration%3Ab2790c4d-d5c4-489a-84dc-be0dcd3f5252",
                callback=self.parse_json,
            )

    def parse_json(self, response):
        # The API returns JSON with a "posts" list; each post carries the article URL.
        for post in response.json()["posts"]:
            yield scrapy.Request(response.urljoin(post["url"]), callback=self.parse_article)

    def parse_article(self, response):
        # Grab all text inside the article's text blocks and join it into one string.
        article_text = response.xpath("//article/div[@data-component='text-block']//text()").getall()
        content = " ".join([i.strip() for i in article_text])
        title = response.css("h1::text").get()
        yield {
            "title": title,
            "content": content,
        }
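Assuming you save this as bbc_spider.py (the filename is arbitrary), you can run it with scrapy runspider bbc_spider.py -o articles.json, or drive it from a plain script like this (the FEEDS setting needs Scrapy 2.1+):

from scrapy.crawler import CrawlerProcess

# Minimal runner sketch: export every scraped item to articles.json.
# Assumes BBCSpider_2 is defined above or imported from your module.
process = CrawlerProcess(settings={"FEEDS": {"articles.json": {"format": "json"}}})
process.crawl(BBCSpider_2)
process.start()  # blocks until the crawl finishes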