How to write Python code to load more articles on https://techcrunch.com/ without using selenium or bs4

Question

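I want to store the HTML/PDF of 100 articles from https://techcrunch.com/, but I have to click "Load More" on the site to display more than 20 articles at a time. Currently, my Python code only fetches the first 20, because the rest are "hidden" behind the Load More button.

I can't use selenium or any third-party web crawler to simulate a user clicking "Load More"... Is there another way for me to get the hidden articles?

I've been looking through the page source, but I don't know what to look for. I think I might have to use requests to simulate the data request? But I don't know how to do that either.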

Answer 1

There is a paginated API URL. Open your browser's Web Developer Tools -> Network tab, and you should see it when you click the "Load More" button:

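import requests

# Paginated JSON endpoint that the "Load More" button calls
api_url = "https://techcrunch.com/wp-json/tc/v1/magazine?page={page}&_embed=true&es=true&cachePrevention=0"

for page in range(1, 10):  # pages 1-9; change the range to fetch more or fewer pages
    response = requests.get(api_url.format(page=page), timeout=30)
    response.raise_for_status()  # fail on an HTTP error instead of parsing an error page
    data = response.json()
    for article in data:
        # Titles come back with HTML entities (e.g. &#8217;) left as-is
        print(article["title"]["rendered"])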

Prints:

...

Apple releases spatial video recording on iPhone 15 Pro
Sila inks supply deal with Panasonic for its breakthrough battery material
Apple&#8217;s new Journal app is now available with the release of iOS 17.2
We should all be paying more attention to the PDD-Alibaba rivalry
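
Since the question asks for exactly 100 articles stored as HTML, here is a minimal sketch that builds on the same endpoint. It uses only the standard library (`urllib.request` instead of `requests`), in case third-party packages are off-limits as well. The helper names (`fetch_page`, `collect_articles`, `clean_title`, `save_articles`), the `User-Agent` header, the `max_pages` cap, and the output file naming are my own illustrative choices. It also assumes that each item carries its body under `article["content"]["rendered"]`, as standard WordPress REST post objects do; I have not confirmed that for this particular endpoint, so check the JSON in the Network tab first.

```python
import html
import json
import urllib.request
from pathlib import Path

API_URL = "https://techcrunch.com/wp-json/tc/v1/magazine?page={page}&_embed=true&es=true&cachePrevention=0"


def fetch_page(page):
    """Download one page of the JSON feed and return the parsed list."""
    request = urllib.request.Request(
        API_URL.format(page=page),
        headers={"User-Agent": "Mozilla/5.0"},  # assumption: some sites reject the default agent
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def collect_articles(limit, fetch=fetch_page, max_pages=50):
    """Request page after page until `limit` articles are collected or a page is empty."""
    articles = []
    for page in range(1, max_pages + 1):
        batch = fetch(page)
        if not batch:
            break  # no more articles available
        articles.extend(batch)
        if len(articles) >= limit:
            break
    return articles[:limit]


def clean_title(article):
    """Decode HTML entities such as &#8217; in the rendered title."""
    return html.unescape(article["title"]["rendered"])


def save_articles(articles, out_dir):
    """Write each article to out_dir/001.html, 002.html, ... and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for index, article in enumerate(articles, start=1):
        # Assumption: the body HTML lives under article["content"]["rendered"].
        body = article.get("content", {}).get("rendered", "")
        page = f"<h1>{article['title']['rendered']}</h1>\n{body}"
        target = out_path / f"{index:03d}.html"
        target.write_text(page, encoding="utf-8")
        written.append(target)
    return written
```

Calling `save_articles(collect_articles(100), "articles")` would then request pages until 100 items are collected and write one HTML file per article. `clean_title` turns a title such as `Apple&#8217;s new Journal app is now available with the release of iOS 17.2` into `Apple’s new Journal app is now available with the release of iOS 17.2`.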
