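I need to download information from a website that is protected by cookies. I pass this protection manually in the browser and then insert the cookie into httr.
Here is a similar topic, but it does not solve my problem: (Copying cookie for httr)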
library(httr)
url<-"http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ"
cook<-"_SMIDA=9117a9eb136353bd6956651bd59acd37; __utmt=1; __utma=29983421.1729484844.1413489369.1413625619.1413627797.3; __utmb=29983421.7.10.1413627797; __utmc=29983421; __utmz=29983421.1413489369.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"
response <- GET(url, config(cookie= cook))
content(x = response,as = 'text', encoding = "UTF-8")
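As a result, when I call content(), it returns the not-logged-in page (the same as when no cookie is sent).
How can I solve this?
(Test credentials: login mytest2, password qwerty12)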
With httr, the idiom for this is set_cookies inside GET:
GET("http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ",
set_cookies(`_SMIDA` = "7cf9ea4bfadb60bbd0950e2f8f4c279d",
`__utma` = "29983421.138599299.1413649536.1413649536.1413649536.1",
`__utmb` = "29983421.5.10.1413649536",
`__utmc` = "29983421",
`__utmt` = "1",
`__utmz` = "29983421.1413649536.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"))
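This works for me, or at least I think it does, since I can't read the language. The returned table has the same structure, and there is no login prompt.
Unfortunately, the captcha at login prevents using RSelenium (or other similar scraping packages), so you will have to keep grabbing these cookies manually (or use a utility to extract them from a session).
Finally, you may want to seriously consider changing those credentials, now :-)
Edit: @VadymB and I both found that the code did not work until we restarted RStudio. Your mileage may vary.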
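If you copy the whole cookie string from the browser, as in the question, you do not have to retype every pair by hand. The following is only a minimal sketch: parse_cookie_string is a hypothetical helper (not part of httr) that splits the raw string into a named character vector, which can then be passed through the .cookies argument of set_cookies. Note that, as far as I know, set_cookies URL-encodes the values, so cookies containing characters such as | or ( may be sent in encoded form; verify this against your httr version.

```r
# Hypothetical helper (not part of httr): split a raw "name=value; name=value"
# cookie string into a named character vector.
parse_cookie_string <- function(x) {
  pairs <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  pairs <- pairs[nzchar(pairs)]
  # Split on the first "=" only, because values may contain "=" themselves.
  eq <- regexpr("=", pairs, fixed = TRUE)
  stats::setNames(substring(pairs, eq + 1L), substring(pairs, 1L, eq - 1L))
}

# Shortened version of the cookie string from the question.
cook <- "_SMIDA=9117a9eb136353bd6956651bd59acd37; __utmt=1; __utmc=29983421"
cookies <- parse_cookie_string(cook)

# Build the request configuration only if httr is installed; no request is sent here.
if (requireNamespace("httr", quietly = TRUE)) {
  cookie_config <- httr::set_cookies(.cookies = cookies)
  # response <- httr::GET(url, cookie_config)
}
```

The session cookie (_SMIDA here) still has to come from a fresh manual login, since it expires.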
You can try this:
url <- "http://httpbin.org/get"
httr::GET(url)
httr::GET(url, httr::add_headers(a = 1, b = 2))
httr::GET(url, httr::set_cookies(a = 1, b = 2))
httr::GET(url, httr::add_headers(a = 1, b = 2), httr::set_cookies(a = 1, b = 2))
httr::GET(url, httr::add_headers(a = 1, b = 2, cookie = 'c=3;d=4'), httr::set_cookies(a = 1, b = 2))
# Code adapted from: https://httr.r-lib.org/reference/GET.html
These are the outputs of the commands:
httr::GET(url)
#| Response [http://httpbin.org/get]
#| Date: 2024-07-31 02:14
#| Status: 200
#| Content-Type: application/json
#| Size: 378 B
#| {
#| "args": {},
#| "headers": {
#| "Accept": "application/json, text/xml, application/xml, */*",
#| "Accept-Encoding": "deflate, gzip, br, zstd",
#| "Host": "httpbin.org",
#| "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7",
#| "X-Amzn-Trace-Id": "Root=1-66a99dfc-3ee62d216a517e6844e8815f"
#| },
#| "origin": "101.200.73.219",
#| ...
httr::GET(url, httr::add_headers(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#| Date: 2024-07-31 02:14
#| Status: 200
#| Content-Type: application/json
#| Size: 408 B
#| {
#| "args": {},
#| "headers": {
#| "A": "1",
#| "Accept": "application/json, text/xml, application/xml, */*",
#| "Accept-Encoding": "deflate, gzip, br, zstd",
#| "B": "2",
#| "Host": "httpbin.org",
#| "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7",
#| "X-Amzn-Trace-Id": "Root=1-66a99dfc-2fddaa4e49a8325309990191"
#| ...
httr::GET(url, httr::set_cookies(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#| Date: 2024-07-31 02:14
#| Status: 200
#| Content-Type: application/json
#| Size: 404 B
#| {
#| "args": {},
#| "headers": {
#| "Accept": "application/json, text/xml, application/xml, */*",
#| "Accept-Encoding": "deflate, gzip, br, zstd",
#| "Cookie": "a=1;b=2",
#| "Host": "httpbin.org",
#| "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7",
#| "X-Amzn-Trace-Id": "Root=1-66a99dfc-44b9d09700c6b7f87e086e40"
#| },
#| ...
httr::GET(url, httr::add_headers(a = 1, b = 2), httr::set_cookies(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#| Date: 2024-07-31 02:14
#| Status: 200
#| Content-Type: application/json
#| Size: 434 B
#| {
#| "args": {},
#| "headers": {
#| "A": "1",
#| "Accept": "application/json, text/xml, application/xml, */*",
#| "Accept-Encoding": "deflate, gzip, br, zstd",
#| "B": "2",
#| "Cookie": "a=1;b=2",
#| "Host": "httpbin.org",
#| "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7",
#| ...
httr::GET(url, httr::add_headers(a = 1, b = 2, cookie = 'c=3;d=4'), httr::set_cookies(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#| Date: 2024-07-31 02:14
#| Status: 200
#| Content-Type: application/json
#| Size: 434 B
#| {
#| "args": {},
#| "headers": {
#| "A": "1",
#| "Accept": "application/json, text/xml, application/xml, */*",
#| "Accept-Encoding": "deflate, gzip, br, zstd",
#| "B": "2",
#| "Cookie": "c=3;d=4",
#| "Host": "httpbin.org",
#| "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7",
#| ...
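So httr::set_cookies behaves like a convenience wrapper around httr::add_headers for the Cookie header. When both set a cookie, the one passed through httr::add_headers takes precedence: in the last output the server received c=3;d=4, not a=1;b=2.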
However, httr::set_cookies(...) is easier to read than httr::add_headers(cookie = ....), so I think you can still use it.