rvest - error 403 on computer A but not on computer B (same network)

Question

Until a few weeks ago, I was able to regularly run a script that scrapes https://unjobs.org from my Lubuntu 20.04 machine.

Since I installed a fresh 24.04 from scratch, rvest gives me a 403 when I run the same script.

library(rvest)
read_html('https://unjobs.org')
Error in open.connection(x, "rb") : HTTP error 403.

I also tried using polite, with a different user agent that is consistent with robots.txt:

library(polite)
scrape(bow('https://unjobs.org', user_agent='Twitterbot'))
NULL
Warning message:
Client error: (403) Forbidden https://unjobs.org

I tried running the script from another computer running Windows, and from there it ran without any problem. I have the same setup on both machines:

R 4.4.0, rvest 1.0.4, httr 1.4.7

Any guess where the problem lies?

Tnx

Answer 1

A possible answer is that the website has turned off scraping:

library(rvest)
library(polite)

url <- "https://unjobs.org"

bow(url, user_agent = "This is not scrappable")
#> <polite session> https://unjobs.org
#>     User-agent: This is not scrappable
#>     robots.txt: 12 rules are defined for 6 bots
#>    Crawl delay: 5 sec
#>   The path is not scrapable for this user-agent

Created on 2024-06-17 with reprex v2.1.0

Answer 2

I thought so too, but:

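I can run the script from a Windows computer connected to the same router (without even setting a user agent).
A few hours before the first attempt on the new installation, I had run the script from the previous Lubuntu 20.04 installation.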

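The only thing I can think of is that the website doesn't like my new Lubuntu installation, but, apart from the user_agent, what information do httr and rvest share with the website when scraping?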
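
One way to start answering that is to compare the HTTP stack on both machines. httr sends its requests through libcurl via the curl package, and a fresh 24.04 install ships a newer libcurl and OpenSSL than 20.04, so the TLS handshake and default request headers can differ even when the R package versions match. Whether unjobs.org actually filters on any of this is an assumption, not something confirmed here. The sketch below is only a diagnostic: `client_summary()` and `show_request()` are hypothetical helper names, and the output should be compared between the two computers.

```r
library(curl)
library(httr)

# Summarise parts of the HTTP stack that can differ between two machines
# even when the R, rvest and httr versions are identical.
client_summary <- function() {
  v <- curl::curl_version()
  list(
    r_version = R.version.string,
    libcurl   = v$version,      # libcurl version string
    ssl       = v$ssl_version,  # TLS backend, e.g. OpenSSL or Schannel
    http2     = v$http2,        # whether libcurl was built with HTTP/2
    platform  = v$host
  )
}

str(client_summary())

# Print the request as it goes over the wire; with httr::verbose(),
# outgoing header lines are prefixed with "->".
# This performs a real request, so it is not run automatically here.
show_request <- function(url) {
  resp <- httr::GET(url, httr::verbose())
  httr::status_code(resp)
}
# show_request("https://unjobs.org")
```

If the summaries differ (for example in the `ssl` or `http2` fields), that difference is a candidate explanation for the 403, though only a candidate.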
