Iterating a Google Scholar web scrape


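I am looking to use R to web-scrape Google Scholar, e.g. for cases where someone does not have a public profile.

One challenge is that only 10 results are displayed at a time, so for people with a large number of publications, multiple lines of code are needed. I would like to iterate as much as possible.

Here is the "manual" version: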

packages <- c("rvest", "xml2", "curl", "data.table")
lapply(packages, library, character.only = TRUE)

#set up searches, with a "start" counter for the successive pages of results:
url_name0 <- 'http://scholar.google.com/scholar?start=0&q=author:"pi+campbell"&as_ylo=1998'
url_name1 <- 'http://scholar.google.com/scholar?start=10&q=author:"pi+campbell"&as_ylo=1998'
url_name2 <- 'http://scholar.google.com/scholar?start=20&q=author:"pi+campbell"&as_ylo=1998'
url_name3 <- 'http://scholar.google.com/scholar?start=30&q=author:"pi+campbell"&as_ylo=1998'

#scrape
wp0 <- xml2::read_html(url_name0)
wp1 <- xml2::read_html(url_name1)
wp2 <- xml2::read_html(url_name2)
wp3 <- xml2::read_html(url_name3)

# Extract raw data (titles)
titles0 <- rvest::html_text(rvest::html_nodes(wp0, '.gs_rt'))
titles1 <- rvest::html_text(rvest::html_nodes(wp1, '.gs_rt'))
titles2 <- rvest::html_text(rvest::html_nodes(wp2, '.gs_rt'))
titles3 <- rvest::html_text(rvest::html_nodes(wp3, '.gs_rt'))

Now, the latter part should be possible to iterate. I tried a for loop:

counter <- 0:3

titles <- vector("list", length(counter))

for (i in seq_along(counter)) {
    titles[[i]] <- rvest::html_text(rvest::html_nodes(wp[[i]], '.gs_rt'))
}

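But this generates an error:

Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "NULL"

(At an earlier stage I got a different error:)

Error in html_elements(...) : object 'wp' not found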

Subsequent operations have a similar structure, so if I can work out how to iterate this one, I should be able to write some more efficient code...
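
Both error messages are consistent with `wp` not being a populated list. The pages were stored in four separate objects (`wp0` to `wp3`), so `wp[[i]]` either fails because no object called `wp` exists or, if an empty list named `wp` was created beforehand, returns `NULL`, which `html_nodes()` cannot search. Below is a minimal base-R sketch of the problem and one possible fix. The placeholder strings stand in for the parsed pages and are assumptions for illustration only.

```r
# Placeholders standing in for the parsed pages wp0..wp3 (illustrative values only)
wp0 <- "page 0"
wp1 <- "page 1"
wp2 <- "page 2"
wp3 <- "page 3"

# A pre-allocated but unfilled list holds only NULLs; indexing it returns NULL,
# which matches the 'applied to an object of class "NULL"' error
empty_wp <- vector("list", 4)
is.null(empty_wp[[1]])

# One possible fix: gather the numbered objects into a single named list
wp <- mget(paste0("wp", 0:3), envir = environment())
length(wp)
wp[[2]]
```

With `wp` built this way from the real pages, the loop `titles[[i]] <- rvest::html_text(rvest::html_nodes(wp[[i]], '.gs_rt'))` has a list element to work on for each `i`.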

r loops web-scraping iteration rvest
1 Answer

Perhaps something a bit less manual, e.g. using rvest::session() to navigate through the pages:

library(rvest)

s <- session('http://scholar.google.com/scholar?start=0&q=author:"pi+campbell"&as_ylo=1998') 

# Extract approx. number of results (including citations)
about_n_results <- 
  html_elements(s, "#gs_ab_md .gs_ab_mdw") |> 
  html_text() |> 
  stringr::str_extract("(?<=About )\\d+") |> 
  as.integer()

# Storage list
titles <- vector(mode = "list", length = ceiling(about_n_results / 10))

# Extract titles, store them in a list, then try to follow the Next link;
# repeat while there is a Next link,
# break when the number of returned titles is < 10 (presumably only citations will follow)
while (!is.null(s)) {
  current_page_n <- 
    html_elements(s, "span.gs_ico_nav_current ~ b") |> 
    html_text(trim = TRUE) |> 
    as.integer()
  
  titles[[current_page_n]] <- html_elements(s, '.gs_rt a') |> html_text(trim = TRUE)
  if (length(titles[[current_page_n]]) < 10) break
  
  # Follow Next link, return NULL when there isn't one
  s <- 
    tryCatch(
      error = function(cnd) NULL,
      session_follow_link(s, xpath = "//b[text() = 'Next']/../../a")
    )
}
#> Navigating to
#> </scholar?start=10&q=author:%22pi+campbell%22&hl=en&oe=ASCII&as_sdt=0,5&as_ylo=1998>.
#> Navigating to
#> </scholar?start=20&q=author:%22pi+campbell%22&hl=en&oe=ASCII&as_sdt=0,5&as_ylo=1998>.
#> Navigating to
#> </scholar?start=30&q=author:%22pi+campbell%22&hl=en&oe=ASCII&as_sdt=0,5&as_ylo=1998>.

unlist(titles)
#>  [1] "'Black Lives Matter:'sport, race and ethnicity in challenging times"                                                                                                 
#>  [2] "'Pray (ing) the person marking your work isn't racist': racialised inequities in HE assessment practice"                                                             
#>  [3] "'He is like a Gazelle (when he runs)'(re) constructing race and nation in match-day commentary at the men's 2018 FIFA World Cup"                                     
#>  [4] "Sport, race and ethnicity in the wake of black lives matter: Introduction to the special issue"                                                                      
#>  [5] "A ('black') historian using sociology to write a history of 'black'sport: A critical reflection"                                                                     
#>  [6] "Education, Retirement and Career Transitions for'Black'Ex-Professional Footballers: From Being Idolised to Stacking Shelves"                                         
#>  [7] "Black British Students' Experiences of Assessment"                                                                                                                   
#>  [8] "'That black boy's different class!': a historical sociology of the black middle-classes, boundary-work and local football in the British East-Midlands c. 1970− 2010"
#>  [9] "White British Students' Experiences of Assessment"                                                                                                                   
#> [10] "'Race', politics and local football–continuity and change in the life of a British African-Caribbean local football club"                                            
#> [11] "THE EFFECTS OF RACIALLY INCLUSIVE ASSESSMENT ON THE RACE AWARD GAP AND ON STUDENTS'LIVED EXPERIENCES OF ASSESSMENT"                                                  
#> [12] "Cavaliers Made Us 'United': Local Football, Identity Politics and Second-generation African-Caribbean Youth in the East Midlands c.1970–9"                           
#> [13] "Electrolytes and pH changes in pre-eclamptic rats"                                                                                                                   
#> [14] "Racially Inclusive Assessment and Academic Teaching Staff"                                                                                                           
#> [15] "Race and Assessment in Higher Education: From Conceptualising Barriers to Making Measurable Change"                                                                  
#> [16] "Conceptualising Inter-and Intra-Race-Based Barriers in Assessment"                                                                                                   
#> [17] "Afterword: 12 Years a Black Race Inclusion Academic – Some Reflections on Working in a 'Postracism' Space"                                                           
#> [18] "'White digital footballers can't jump': (re)constructions of race in FIFA 20"                                                                                        
#> [19] "British South Asian Students' Experiences of Assessment"                                                                                                             
#> [20] "'Policy Shorts': Mapping and 'Tackling'Racial Inequities in HE Assessment–Summarising the Case Study"                                                                
#> [21] "Evaluating the Racially Inclusive Curricula Toolkit in HE': Empirically Measuring the Efficacy and Impact of Making Curriculum-content Racially Inclusive on the …"  
#> [22] "Discussion and Concluding Comments"                                                                                                                                  
#> [23] "Football professionals, qualifications and post-playing career preparations"                                                                                         
#> [24] "Author interview: Q and A with Dr Paul Ian Campbell, author of education, retirement and career transitions for 'black'ex-professional footballers"                  
#> [25] "“SHEA BUTTER” FROM B. PARKII STUDIES ON EXTRACTION METHODS"                                                                                                          
#> [26] "'Black Lives Matter:'sport, race and ethnicity in challenging times"                                                                                                 
#> [27] "Internet of Things (IoT) Model for the Detection of an Infectious Disease (COVID-19)"                                                                                
#> [28] "Working-class Soccer Schoolboys, Race and Education in England"                                                                                                      
#> [29] "Retirement, Training and Transitions into Non-sport Work"                                                                                                            
#> [30] "Ethnicity, Community and 'Local'Football"                                                                                                                            
#> [31] "ERGO: an integrated, user-friendly model for computing energy and greenhouse gas budgets of bioenergy systems"                                                       
#> [32] "'He is like a Gazelle (when he runs)'(re) constructing race and nation in match-day commentary at the men's 2018 FIFA World Cup."                                    
#> [33] "Intracellular Threshhold for Ionized Mg^ sup 2+^ in Rat Platelets"                                                                                                   
#> [34] "Sport, Race and Ethnicity at a time of multiple global crises."

Created on 2024-10-23 with reprex v2.1.1
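
If the fixed set of `start` offsets from the question is enough, the manual steps could also be collapsed by generating the URLs as a vector and keeping the pages and titles in lists. The sketch below is not part of the answer above. `build_urls()` and `extract_titles()` are hypothetical helper names, and the extraction is demonstrated on a small inline HTML snippet with made-up titles rather than on a live Google Scholar request.

```r
library(rvest)

# Hypothetical helper: one URL per results page (10 results per page)
build_urls <- function(n_pages) {
  starts <- as.integer(seq(0, by = 10, length.out = n_pages))
  sprintf(
    'http://scholar.google.com/scholar?start=%d&q=author:"pi+campbell"&as_ylo=1998',
    starts
  )
}

# Hypothetical helper: titles from one parsed page
extract_titles <- function(page) {
  html_text(html_elements(page, ".gs_rt"), trim = TRUE)
}

urls <- build_urls(4)

# Live use (not run here): read every page into a list, then extract.
# wp <- lapply(urls, read_html)
# titles <- lapply(wp, extract_titles)

# Offline demonstration with inline HTML (made-up titles)
wp_demo <- list(
  minimal_html('<h3 class="gs_rt"><a href="#">Title A</a></h3>'),
  minimal_html('<h3 class="gs_rt"><a href="#">Title B</a></h3><h3 class="gs_rt">Title C</h3>')
)
titles_demo <- lapply(wp_demo, extract_titles)
unlist(titles_demo)
```

Because every page lives in one list, later steps with the same structure can reuse `lapply()` over `wp`. When running against the live site, a pause such as `Sys.sleep()` between requests is a reasonable precaution.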
