How to properly set cookies to get URL content using httr

Question (votes: 0, answers: 2)

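I need to download information from a website that is protected with cookies. I pass this protection manually and then insert the cookies into httr.

There is a similar topic, "Copying cookie for httr", but it does not solve my problem.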

library(httr)
url<-"http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ"

cook<-"_SMIDA=9117a9eb136353bd6956651bd59acd37; __utmt=1; __utma=29983421.1729484844.1413489369.1413625619.1413627797.3; __utmb=29983421.7.10.1413627797; __utmc=29983421; __utmz=29983421.1413489369.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"

response <- GET(url, config(cookie= cook))

content(x = response,as = 'text', encoding = "UTF-8")   

So when I call content(), it returns the page for a user who is not logged in (the same as without cookies).

How can I solve this?

Test credentials: login mytest2, password qwerty12

r cookies httr
2 answers

6 votes

That would be set_cookies and GET in httr:

GET("http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ", 
    set_cookies(`_SMIDA` = "7cf9ea4bfadb60bbd0950e2f8f4c279d",
                `__utma` = "29983421.138599299.1413649536.1413649536.1413649536.1",
                `__utmb` = "29983421.5.10.1413649536",
                `__utmc` = "29983421",
                `__utmt` = "1",
                `__utmz` = "29983421.1413649536.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"))

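This worked for me, or at least I think it did, since I can't read the language. The returned table has the same structure and there is no prompt to log in.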

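Unfortunately, the captcha on login prevents the use of RSelenium (or other similar scraping packages), so you'll have to keep getting those cookies manually (or use a utility to extract them from the browser session).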
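If the cookies are copied from the browser as a single "name=value; name=value" string, as in the question, one possible approach is to split that string into a named character vector and pass it to set_cookies through its .cookies argument. The sketch below assumes that string format; parse_cookie_string and get_with_cookie_string are hypothetical helper names, not part of httr.

```r
# Hypothetical helper: turn "a=1; b=2" into c(a = "1", b = "2").
parse_cookie_string <- function(x) {
  pairs <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  # Position of the first "=" in each pair; values may contain "=" themselves.
  eq <- regexpr("=", pairs, fixed = TRUE)
  # Ignore fragments that have no "=" at all.
  pairs <- pairs[eq > 0]
  eq <- eq[eq > 0]
  stats::setNames(substring(pairs, eq + 1), substring(pairs, 1, eq - 1))
}

# Hypothetical wrapper: send a GET request with the parsed cookies.
get_with_cookie_string <- function(url, cookie_string) {
  httr::GET(url, httr::set_cookies(.cookies = parse_cookie_string(cookie_string)))
}
```

With the variables from the question, the call would be get_with_cookie_string(url, cook).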

Finally, you may want to seriously consider changing those credentials, now :-)


Edit: @VadymB and I both found that the code did not work until we restarted RStudio. Your mileage may vary.
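One possible explanation, offered here as an assumption rather than a confirmed diagnosis, is that httr reuses one handle per host within an R session, so cookies from earlier requests persist and may mix with the ones set by hand. httr::handle_reset() discards the cached handle for a URL, which may be worth trying before restarting RStudio. fresh_get is a hypothetical helper name.

```r
# Hypothetical helper: drop httr's cached handle for this host, then request.
fresh_get <- function(url, ...) {
  httr::handle_reset(url)  # forget cookies and connections kept for this host
  httr::GET(url, ...)
}
```

It takes the same arguments as GET, for example fresh_get(url, httr::set_cookies(`_SMIDA` = "...")) with a real cookie value in place of the placeholder.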


0 votes

You can try this:

url <- "http://httpbin.org/get"
httr::GET(url)
httr::GET(url, httr::add_headers(a = 1, b = 2))
httr::GET(url, httr::set_cookies(a = 1, b = 2))
httr::GET(url, httr::add_headers(a = 1, b = 2), httr::set_cookies(a = 1, b = 2))
httr::GET(url, httr::add_headers(a = 1, b = 2, cookie = 'c=3;d=4'), httr::set_cookies(a = 1, b = 2))
# code adapted from: https://httr.r-lib.org/reference/GET.html

The output of these commands is:

httr::GET(url)
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 378 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#|     "X-Amzn-Trace-Id": "Root=1-66a99dfc-3ee62d216a517e6844e8815f"
#|   }, 
#|   "origin": "101.200.73.219", 
#| ...

httr::GET(url, httr::add_headers(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 408 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "A": "1", 
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "B": "2", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#|     "X-Amzn-Trace-Id": "Root=1-66a99dfc-2fddaa4e49a8325309990191"
#| ...

httr::GET(url, httr::set_cookies(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 404 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "Cookie": "a=1;b=2", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#|     "X-Amzn-Trace-Id": "Root=1-66a99dfc-44b9d09700c6b7f87e086e40"
#|   }, 
#| ...

httr::GET(url, httr::add_headers(a = 1, b = 2), httr::set_cookies(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 434 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "A": "1", 
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "B": "2", 
#|     "Cookie": "a=1;b=2", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#| ...

httr::GET(url, httr::add_headers(a = 1, b = 2, cookie = 'c=3;d=4'), httr::set_cookies(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 434 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "A": "1", 
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "B": "2", 
#|     "Cookie": "c=3;d=4", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#| ...

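So httr::set_cookies behaves like a convenience variant of httr::add_headers for the Cookie header. Both appear to set the cookie, but httr::add_headers takes precedence: in the last request above, the server received c=3;d=4 rather than a=1;b=2.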

However, httr::set_cookies(...) is easier to read than httr::add_headers(cookie = ....), so I think you can still use it.
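To check which cookies actually reach the server without reading the printed response, the echoed Cookie header can be pulled out of the parsed JSON. This is a small sketch that assumes the httpbin.org/get response layout shown above, and that jsonlite is installed so content() can parse the body; sent_cookie_header is a hypothetical helper name.

```r
# Hypothetical helper: return the Cookie header that httpbin.org echoes back.
sent_cookie_header <- function(...) {
  resp <- httr::GET("http://httpbin.org/get", ...)
  httr::stop_for_status(resp)
  # content() parses the JSON body into a nested list.
  httr::content(resp)$headers$Cookie
}
```

For instance, sent_cookie_header(httr::set_cookies(a = 1, b = 2)) should return the same string that appears under "Cookie" in the corresponding output above.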
