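I need to download PDF documents from this website: https://tealprod.tea.state.tx.us/Audit/Public/PDFViewer.asp. The code below selects a year from the first drop-down menu, which populates the third drop-down menu with school districts, and then selects a district. Once both fields are filled in, a button labeled "View" appears that links to the PDF document, but I can't figure out how to click the "View" button: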
library(tidyverse)
library(rvest)
session <- html_session("https://tealprod.tea.state.tx.us/Audit/Public/PDFViewer.asp")
form <- html_form(session)[[1]]
years <- form$fields[[6]]$options[-1]
years <- years[grep("201[6-9]|202[0-3]", years)]
for (i in 1:length(years)) {
  form <- set_values(form, 'YearSelected' = years[i])
  session <- submit_form(session, form)
  form <- html_form(session)[[1]]
  districts <- form$fields[[8]]$options[-1]
  for (j in 1:length(districts)) {
    form <- set_values(form, "DNSelected" = districts[j])
    session <- submit_form(session, form)
    # [ I filled out the form; how do I download the file? ]
  }
}
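There are a couple of issues here. First, not every district has a record for every year, so your code needs to handle searches that return no result. Second, there are 1,293 districts and 9 years, so you will need to send more than 11,000 requests to the server. Even on a fast PC, this is likely to take several hours.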
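As a rough back-of-the-envelope check of that estimate, the arithmetic can be sketched as below. The one second per request is purely an assumption for illustration, not a measured figure; the real time depends on the server and your connection.

```r
n_districts <- 1293
n_years <- 9

# One form submission per district-year combination
n_requests <- n_districts * n_years

# Assumed average round-trip time per request (hypothetical, not measured)
secs_per_request <- 1

est_hours <- n_requests * secs_per_request / 3600
cat(n_requests, "requests, roughly", round(est_hours, 1), "hours\n")
```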
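Working out how to get the PDF means reading the JavaScript on the page to see how it constructs the PDF's URL. It is best to store the URLs first so that you can download the files at your leisure.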
library(rvest)

session <- session("https://tealprod.tea.state.tx.us/Audit/Public/PDFViewer.asp")
form <- html_form(session)[[1]]

# Non-numeric options become NA and are dropped
years <- c(na.omit(suppressWarnings(as.numeric(form$fields$YearSelected$options))))
years <- years[years >= 2016]

pdf_urls <- character()

# seq_along() avoids the 1:0 trap when a menu comes back empty
for (i in seq_along(years)) {
  form <- html_form_set(form, YearSelected = years[i])
  session <- session_submit(session, form)
  form <- html_form(session)[[1]]

  # Submitting the year populates the district code and district name menus
  districts <- form$fields$CDSelected$options[-1]
  district_nm <- form$fields$DNSelected$options[-1]

  for (j in seq_along(districts)) {
    form <- html_form_set(form, CDSelected = districts[j],
                          DNSelected = district_nm[j])
    session <- session_submit(session, form)
    pdf_link <- session$response |>
      read_html() |>
      html_element(xpath = "//input[@type = 'button']")

    # No button means there is no document for this district and year
    if (length(pdf_link) > 0) {
      pdf_urls <- c(pdf_urls, paste0(
        "https://tealprod.tea.state.tx.us/Audit/Public/",
        "showPDF.asp?pdpath=", pdf_link |> html_attr("name"),
        "&pdname=",
        sub("^.*'(.*\\.pdf)'.*$", "\\1", pdf_link |> html_attr("onclick"))))
    }
  }
}
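If you want to check the extraction step without hitting the server, something like the following can be run offline. The button markup here is made up for illustration: the `name` value `EXAMPLE-PATH` and the `openPDF()` handler are hypothetical stand-ins, and the real page's attributes may look different. `build_pdf_url()` is just a helper name I've introduced to wrap the same `paste0()` and `sub()` logic used in the loop.

```r
library(rvest)

# Hypothetical button markup; the real page may differ in detail
sample_html <- paste0(
  '<form><input type="button" name="EXAMPLE-PATH" value="View" ',
  'onclick="openPDF(\'001902a7.pdf\')"></form>')
sample_page <- minimal_html(sample_html)

# Same URL construction as in the loop, wrapped in a helper
build_pdf_url <- function(button) {
  pdf_name <- sub("^.*'(.*\\.pdf)'.*$", "\\1", html_attr(button, "onclick"))
  paste0("https://tealprod.tea.state.tx.us/Audit/Public/",
         "showPDF.asp?pdpath=", html_attr(button, "name"),
         "&pdname=", pdf_name)
}

button <- html_element(sample_page, xpath = "//input[@type = 'button']")
sample_url <- build_pdf_url(button)

# A page without a button yields a missing node of length 0,
# which is what the length() check in the loop relies on
empty_page <- minimal_html("<p>No results</p>")
no_button <- html_element(empty_page, xpath = "//input[@type = 'button']")
```

Here `sample_url` ends in `pdpath=EXAMPLE-PATH&pdname=001902a7.pdf`, and `length(no_button)` is 0, so the no-result case is skipped rather than producing a malformed URL.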
The first few URLs are:
head(pdf_urls)
# [1] "https://tealprod.tea.state.tx.us/Audit/Public/showPDF.asp?pdpath=Q412\\FDEJ\\DDEMDF\\sv0n0pvny\\DDEMDFNEFDEJ.cQS&pdname=001902a7.pdf"
# [2] "https://tealprod.tea.state.tx.us/Audit/Public/showPDF.asp?pdpath=Q412\\FDEJ\\DDEMDG\\sv0n0pvny\\DDEMDGNEFDEJ.cQS&pdname=001903a7.pdf"
# [3] "https://tealprod.tea.state.tx.us/Audit/Public/showPDF.asp?pdpath=Q412\\FDEJ\\DDEMDH\\sv0n0pvny\\DDEMDHNEFDEJ.cQS&pdname=001904a7.pdf"
# [4] "https://tealprod.tea.state.tx.us/Audit/Public/showPDF.asp?pdpath=Q412\\FDEJ\\DDEMDJ\\sv0n0pvny\\DDEMDJNEFDEJ.cQS&pdname=001906a7.pdf"
# [5] "https://tealprod.tea.state.tx.us/Audit/Public/showPDF.asp?pdpath=Q412\\FDEJ\\DDEMDK\\sv0n0pvny\\DDEMDKNEFDEJ.cQS&pdname=001907a7.pdf"
# [6] "https://tealprod.tea.state.tx.us/Audit/Public/showPDF.asp?pdpath=Q412\\FDEJ\\DDEMDL\\sv0n0pvny\\DDEMDLNEFDEJ.cQS&pdname=001908a7.pdf"
Then you can do:
download.file(pdf_urls[1], "first.pdf")
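To fetch the whole set rather than a single file, a loop along these lines might work. This is only a sketch: the helper names `pdf_file_name()` and `download_all()`, the `pdfs` folder, and the one-second pause between requests are all my own assumptions, and I have not run it against the full list of URLs.

```r
# Build a local file name from the pdname query parameter, prefixed with
# the URL's position so that repeated names cannot overwrite each other
pdf_file_name <- function(url, k) {
  sprintf("%05d_%s", k, sub("^.*pdname=", "", url))
}

# Download every URL, recording which ones succeeded
download_all <- function(urls, dest_dir = "pdfs", pause = 1) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  ok <- logical(length(urls))
  for (k in seq_along(urls)) {
    dest <- file.path(dest_dir, pdf_file_name(urls[k], k))
    ok[k] <- tryCatch({
      # mode = "wb" keeps binary PDFs intact on Windows
      download.file(urls[k], dest, mode = "wb", quiet = TRUE)
      TRUE
    }, error = function(e) FALSE)
    Sys.sleep(pause)  # assumed pause to avoid hammering the server
  }
  ok
}
```

Calling `ok <- download_all(pdf_urls)` would then return a logical vector, and `pdf_urls[!ok]` gives the URLs worth retrying.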
The first URL links to a document with this front page: