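I am looking to scrape Google Scholar pages with R, e.g. for someone who does not have a public profile.
One challenge is that only 10 results are displayed at a time, so anyone with a large number of publications requires many lines of code. I would like to iterate as much as possible.
Here is the "manual" version: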
packages <- c("rvest", "xml2", "curl", "data.table")
lapply(packages, library, character.only = TRUE)
#set up searches, with a "start" counter for the successive pages of results:
url_name0 <- 'http://scholar.google.com/scholar?start=0&q=author:"pi+campbell"&as_ylo=1998'
url_name1 <- 'http://scholar.google.com/scholar?start=10&q=author:"pi+campbell"&as_ylo=1998'
url_name2 <- 'http://scholar.google.com/scholar?start=20&q=author:"pi+campbell"&as_ylo=1998'
url_name3 <- 'http://scholar.google.com/scholar?start=30&q=author:"pi+campbell"&as_ylo=1998'
#scrape
wp0 <- xml2::read_html(url_name0)
wp1 <- xml2::read_html(url_name1)
wp2 <- xml2::read_html(url_name2)
wp3 <- xml2::read_html(url_name3)
# Extract raw data (titles)
titles0 <- rvest::html_text(rvest::html_nodes(wp0, '.gs_rt'))
titles1 <- rvest::html_text(rvest::html_nodes(wp1, '.gs_rt'))
titles2 <- rvest::html_text(rvest::html_nodes(wp2, '.gs_rt'))
titles3 <- rvest::html_text(rvest::html_nodes(wp3, '.gs_rt'))
Now, the latter part should be possible to iterate. I tried a for loop:
counter <- 0:3
titles <- vector("list", length(counter))
for (i in seq_along(counter)) {
titles[[i]] <- rvest::html_text(rvest::html_nodes(wp[[i]], '.gs_rt'))
}
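But this produces an error:
Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "NULL"
(At an earlier stage I got a different error:)
Error in html_elements(...) : object 'wp' not found
The subsequent operations have a similar structure, so if I can work out how to iterate this one, I should be able to write more efficient code.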
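Both errors point at the same gap: the manual version creates `wp0` to `wp3` as separate objects, so there is no list called `wp` for the loop to index (and if `wp` is created as an empty list, each `wp[[i]]` is `NULL`). A minimal sketch of one way around that, assuming the only thing that changes between pages is the `start` offset: build the URLs as a vector, then read and parse each page inside `lapply()`. The helper names `build_urls()` and `scrape_titles()` and the 2-second pause are illustrative assumptions, not part of the original code.

```r
# Build one search URL per results page; `start` advances in steps of 10.
# (Hypothetical helper; defaults mirror the URLs in the question.)
build_urls <- function(n_pages,
                       base = "http://scholar.google.com/scholar",
                       query = 'author:"pi+campbell"',
                       year_from = 1998) {
  starts <- seq(from = 0, by = 10, length.out = n_pages)
  paste0(base, "?start=", starts, "&q=", query, "&as_ylo=", year_from)
}

# Read each page and extract the title nodes.
# Returns a list with one character vector of titles per page.
scrape_titles <- function(urls, pause = 2) {
  lapply(urls, function(u) {
    Sys.sleep(pause)  # assumed delay between requests, adjust as needed
    page <- xml2::read_html(u)
    rvest::html_text(rvest::html_elements(page, ".gs_rt"))
  })
}

urls <- build_urls(4)
# titles <- scrape_titles(urls)  # performs network requests
# unlist(titles)
```

Because the pages are read inside the same function that parses them, there is no intermediate `wp` object to get out of sync with the counter. The number of pages still has to be chosen up front here.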
Perhaps something a bit less manual, e.g. using rvest::session() to navigate the pages:
library(rvest)
s <- session('http://scholar.google.com/scholar?start=0&q=author:"pi+campbell"&as_ylo=1998')
# Extract approx. number of results (including citations)
about_n_results <-
  html_elements(s, "#gs_ab_md .gs_ab_mdw") |>
  html_text() |>
  stringr::str_extract("(?<=About )\\d+") |>
  as.integer()
# Storage list
titles <- vector(mode = "list", length = ceiling(about_n_results / 10))
# Extract titles, store in a list, try to follow Next link;
# repeat while there is a Next link,
# break when number of returned titles is < 10 (presumably only citations will follow)
while (!is.null(s)) {
  current_page_n <-
    html_elements(s, "span.gs_ico_nav_current ~ b") |>
    html_text(trim = TRUE) |>
    as.integer()
  titles[[current_page_n]] <- html_elements(s, '.gs_rt a') |> html_text(trim = TRUE)
  if (length(titles[[current_page_n]]) < 10) break
  # Follow Next link, return NULL when there isn't one
  s <-
    tryCatch(
      error = function(cnd) NULL,
      session_follow_link(s, xpath = "//b[text() = 'Next']/../../a")
    )
}
#> Navigating to
#> </scholar?start=10&q=author:%22pi+campbell%22&hl=en&oe=ASCII&as_sdt=0,5&as_ylo=1998>.
#> Navigating to
#> </scholar?start=20&q=author:%22pi+campbell%22&hl=en&oe=ASCII&as_sdt=0,5&as_ylo=1998>.
#> Navigating to
#> </scholar?start=30&q=author:%22pi+campbell%22&hl=en&oe=ASCII&as_sdt=0,5&as_ylo=1998>.
unlist(titles)
#> [1] "'Black Lives Matter:'sport, race and ethnicity in challenging times"
#> [2] "'Pray (ing) the person marking your work isn't racist': racialised inequities in HE assessment practice"
#> [3] "'He is like a Gazelle (when he runs)'(re) constructing race and nation in match-day commentary at the men's 2018 FIFA World Cup"
#> [4] "Sport, race and ethnicity in the wake of black lives matter: Introduction to the special issue"
#> [5] "A ('black') historian using sociology to write a history of 'black'sport: A critical reflection"
#> [6] "Education, Retirement and Career Transitions for'Black'Ex-Professional Footballers: From Being Idolised to Stacking Shelves"
#> [7] "Black British Students' Experiences of Assessment"
#> [8] "'That black boy's different class!': a historical sociology of the black middle-classes, boundary-work and local football in the British East-Midlands c. 1970− 2010"
#> [9] "White British Students' Experiences of Assessment"
#> [10] "'Race', politics and local football–continuity and change in the life of a British African-Caribbean local football club"
#> [11] "THE EFFECTS OF RACIALLY INCLUSIVE ASSESSMENT ON THE RACE AWARD GAP AND ON STUDENTS'LIVED EXPERIENCES OF ASSESSMENT"
#> [12] "Cavaliers Made Us 'United': Local Football, Identity Politics and Second-generation African-Caribbean Youth in the East Midlands c.1970–9"
#> [13] "Electrolytes and pH changes in pre-eclamptic rats"
#> [14] "Racially Inclusive Assessment and Academic Teaching Staff"
#> [15] "Race and Assessment in Higher Education: From Conceptualising Barriers to Making Measurable Change"
#> [16] "Conceptualising Inter-and Intra-Race-Based Barriers in Assessment"
#> [17] "Afterword: 12 Years a Black Race Inclusion Academic – Some Reflections on Working in a 'Postracism' Space"
#> [18] "'White digital footballers can't jump': (re)constructions of race in FIFA 20"
#> [19] "British South Asian Students' Experiences of Assessment"
#> [20] "'Policy Shorts': Mapping and 'Tackling'Racial Inequities in HE Assessment–Summarising the Case Study"
#> [21] "Evaluating the Racially Inclusive Curricula Toolkit in HE': Empirically Measuring the Efficacy and Impact of Making Curriculum-content Racially Inclusive on the …"
#> [22] "Discussion and Concluding Comments"
#> [23] "Football professionals, qualifications and post-playing career preparations"
#> [24] "Author interview: Q and A with Dr Paul Ian Campbell, author of education, retirement and career transitions for 'black'ex-professional footballers"
#> [25] "“SHEA BUTTER” FROM B. PARKII STUDIES ON EXTRACTION METHODS"
#> [26] "'Black Lives Matter:'sport, race and ethnicity in challenging times"
#> [27] "Internet of Things (IoT) Model for the Detection of an Infectious Disease (COVID-19)"
#> [28] "Working-class Soccer Schoolboys, Race and Education in England"
#> [29] "Retirement, Training and Transitions into Non-sport Work"
#> [30] "Ethnicity, Community and 'Local'Football"
#> [31] "ERGO: an integrated, user-friendly model for computing energy and greenhouse gas budgets of bioenergy systems"
#> [32] "'He is like a Gazelle (when he runs)'(re) constructing race and nation in match-day commentary at the men's 2018 FIFA World Cup."
#> [33] "Intracellular Threshhold for Ionized Mg^ sup 2+^ in Rat Platelets"
#> [34] "Sport, Race and Ethnicity at a time of multiple global crises."
Created on 2024-10-23 with reprex v2.1.1
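One caveat on the `about_n_results` step: the pattern `(?<=About )\\d+` stops at the first non-digit, so a header such as "About 1,230 results" would yield 1, and a header without the word "About" would yield `NA`, which then feeds into the length of the storage list. The sketch below is a more tolerant parser in base R. The sample strings are hypothetical stand-ins for the results header, and `parse_result_count()` is an illustrative name.

```r
# Hypothetical header strings, written in the style of the results summary line
samples <- c("About 34 results (0.05 sec)",
             "About 1,230 results (0.08 sec)",
             "7 results (0.02 sec)")

# Take the number that directly precedes "result(s)", allowing comma separators.
# Returns NA_integer_ for strings with no such number.
parse_result_count <- function(x) {
  hit <- regexpr("[0-9][0-9,]*(?= results?)", x, perl = TRUE)
  out <- rep(NA_integer_, length(x))
  out[hit > 0] <- as.integer(gsub(",", "", regmatches(x, hit)))
  out
}

counts <- parse_result_count(samples)
```

Since the loop in the answer assigns by page number, `titles` would also grow as needed if it were started as an empty `list()`, which would remove the dependency on the count altogether.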