Web scrape synonyms

Question

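I am trying to retrieve synonyms from the National Cancer Institute Thesaurus database, but I am having trouble finding the correct HTML link. Below are my code and the data frame I am using. When I run the script to extract the synonyms, I get Error in open.connection(x, "rb") : HTTP error 404., and I cannot figure out what the correct link should be or how to find it.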

library(xml2)
library(rvest)
library(tidyverse)  # also attaches dplyr and readr (which provides read_csv)

synonyms <- read_csv("terms.csv")
## List of acronyms
words <- synonyms$Keyword

## Designate the HTML link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)

Data <- data.frame(Pages = htmls)


results<-sapply(Data$Pages, function(url){
  try(
    url %>%
      as.character() %>% 
      read_html() %>% 
      html_nodes('p') %>% 
      html_text()
  )
})

Answer 1

I suspect the problem is in this line:

## Designate the HTML link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)

Because paste0() just concatenates text together, this will give you URLs like:

https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Ketamine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Azacitidine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Axicabtagene+Ciloleucel

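Although I have no particular experience with rvest, the 404 error you are seeing almost certainly means these URLs would not load in a web browser either. I suggest logging or printing htmls so you can confirm that the URLs actually work in a web browser.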
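
As a minimal sketch of that check, the snippet below builds the URLs and prints them one per line so each can be pasted into a browser. The three keywords are stand-ins for synonyms$Keyword (the one containing a space is an assumed example), and the base URL is copied from the question, so it is expected to keep returning 404 until the correct address is found. Percent-encoding with utils::URLencode() only keeps the keywords URL-safe; it does not by itself fix a wrong path.

```r
# Example keywords; in the question these come from synonyms$Keyword.
# The entry containing a space is an assumed example.
words <- c("Ketamine", "Azacitidine", "Axicabtagene Ciloleucel")

# Base URL copied from the question (reported there to return HTTP 404).
base_url <- "https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/"

# Percent-encode each keyword so spaces and reserved characters are URL-safe.
encoded <- vapply(words, utils::URLencode, character(1),
                  reserved = TRUE, USE.NAMES = FALSE)
htmls <- paste0(base_url, encoded)

# Print one URL per line so each can be opened in a browser and checked.
writeLines(htmls)
```

If none of the printed URLs load in a browser, the path itself is wrong, and no change to the rvest code will help until the address is corrected.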

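I will also point out that in this case the site offers a downloadable database; you may find that downloading it and querying it offline is easier than scraping the website.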
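
A rough sketch of the offline approach, assuming the download is a tab-delimited text file in which one column holds pipe-separated synonyms. The column names (code, name, synonyms), the helper lookup_synonyms(), and the two sample rows are all hypothetical placeholders rather than real thesaurus content; check the layout of the actual file before adapting this.

```r
# Toy stand-in for a downloaded flat file. These rows are made-up
# placeholders; the real file's columns and contents may differ.
sample_lines <- c(
  "X001\tExample Drug A\tExample Drug A|EDA|Drug-A",
  "X002\tExample Drug B\tExample Drug B|EDB"
)

# For a real download, replace text = ... with file = "path/to/file.txt".
thesaurus <- read.delim(text = paste(sample_lines, collapse = "\n"),
                        header = FALSE, sep = "\t", quote = "",
                        col.names = c("code", "name", "synonyms"),
                        stringsAsFactors = FALSE)

# Hypothetical helper: return the synonyms of the first row whose name
# matches the term (case-insensitive), or character(0) if none matches.
lookup_synonyms <- function(term, table) {
  hit <- table[tolower(table$name) == tolower(term), , drop = FALSE]
  if (nrow(hit) == 0) return(character(0))
  strsplit(hit$synonyms[1], "|", fixed = TRUE)[[1]]
}

print(lookup_synonyms("example drug a", thesaurus))
```

With the real file loaded, the same helper could be applied to every keyword with lapply(words, lookup_synonyms, table = thesaurus), which avoids one HTTP request per term.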
