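I am using rvest to scrape two HTML tables from https://www.genome.jp/kegg/tables/br08606.html#5. Specifically, I want the second table (the one with the categories Bacteria and Archaea). Some of its columns, in particular those under "Category", "Source", and "Organisms", link to other websites. I would like to keep all of these links when scraping the table, but I am not sure how to do that. Any guidance would be much appreciated. Here is what I have tried so far after searching the internet: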
library(rvest)
library(dplyr)
library(tidyverse)
library(janitor)
item <- read_html("https://www.genome.jp/kegg/tables/br08606.html#5")
tables <- item %>% html_table(fill = TRUE)
Bacteria_table <- tables[[2]]
Bacteria_table <- Bacteria_table %>% clean_names()
source <- item %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_nodes(xpath = "//td/a") %>%
  html_attr("href")
print(source)
Bacteria_table_links <- data.frame(Bacteria_table)
Bacteria_table_links$source_links <- source
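One probable reason the attempt above does not produce one link per table row is the XPath expression. In XPath, `//td/a` is an absolute path that searches the whole document even when it is called on a single table node, while `.//td/a` stays inside the current node. The sketch below illustrates the difference on a small made-up page; the HTML and the example.org URLs are placeholders, not the actual KEGG markup.

```r
library(rvest)

# Hypothetical two-table page standing in for the real one
page <- minimal_html('
  <table><tr><td><a href="https://example.org/a">A</a></td></tr></table>
  <table><tr><td><a href="https://example.org/b">B</a></td></tr></table>
')

second_table <- html_nodes(page, "table")[[2]]

# Absolute path: matches <a> elements in BOTH tables
all_links <- second_table %>%
  html_nodes(xpath = "//td/a") %>%
  html_attr("href")

# Relative path (leading dot): matches only inside the second table
scoped_links <- second_table %>%
  html_nodes(xpath = ".//td/a") %>%
  html_attr("href")

print(all_links)
print(scoped_links)
```

Even with the relative path, each row of the real table holds several links (Category, Organisms, Source), so a flat vector of hrefs would still be longer than the number of rows and could not be assigned as a single column.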
You can do it like this, collecting the links row by row so that they line up with the rows of the table:
library(rvest)
library(dplyr)
library(tidyverse)
library(janitor)
item <- read_html("https://www.genome.jp/kegg/tables/br08606.html#5")
# Extract the table content
tables <- item %>% html_table(fill = TRUE)
Bacteria_table <- tables[[2]] %>% clean_names()
# Extract all rows from the table and their links
table_rows <- item %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_nodes("tr") # Each row in the table
# Extract links for each row as a character vector (one string per row)
links_list <- table_rows %>%
  map_chr(~ .x %>%
    html_nodes("td a") %>% # Get <a> tags within the row
    html_attr("href") %>% # Extract href attributes
    paste(collapse = "; ") # Combine multiple links with ";"
  )
# Add extracted links to the table
# Remove the header row from the links_list to align with the data
links_list <- links_list[-1] # Assuming the first row is the header
# Check alignment
if (length(links_list) == nrow(Bacteria_table)) {
  # Keep only the last link in each row (the one in the source column)
  Bacteria_table$source_links <- trimws(sub(".*;", "", links_list))
} else {
  stop("Mismatch between table rows and extracted links!")
}
Result:
> head(Bacteria_table)
# A tibble: 6 × 9
category category_2 category_3 organisms organisms_2 organisms_3 year source source_links
<chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
1 Bacteria Enterobacteria Escherichia eco KGB Escherichia coli K-12 MG1655 1997 RefSeq https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/8…
2 Bacteria Enterobacteria Escherichia ecj KGB Escherichia coli K-12 W3110 2001 GenBank https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/010/2…
3 Bacteria Enterobacteria Escherichia ecd KGB Escherichia coli K-12 DH10B 2008 GenBank https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/019/4…
4 Bacteria Enterobacteria Escherichia ebw KGB Escherichia coli K-12 BW2952 2009 GenBank https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/022/3…
5 Bacteria Enterobacteria Escherichia ecok KGB Escherichia coli K-12 MDS42 2013 GenBank https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/350/1…
6 Bacteria Enterobacteria Escherichia ecoc KGB Escherichia coli K-12 C3026 2023 GenBank https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/559/6…
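The code above keeps only the last link of each row. If the links under the other columns are needed as well, one possible variation is to take the first `<a>` in every cell, so that each table column gets a matching link column. This is only a sketch on a made-up table: the HTML, the example.org URLs, and the `cell_links()` helper are hypothetical, and it assumes every row has one `<td>` per column. If the real table uses `rowspan` or `colspan`, the cell positions would shift and the approach would need adjusting.

```r
library(rvest)

# Hypothetical table standing in for the real one
page <- minimal_html('
  <table>
    <tr><th>Category</th><th>Organism</th><th>Source</th></tr>
    <tr>
      <td><a href="https://example.org/cat/1">Bacteria</a></td>
      <td>org one</td>
      <td><a href="https://example.org/src/1">RefSeq</a></td>
    </tr>
    <tr>
      <td>Archaea</td>
      <td><a href="https://example.org/org/2">org two</a></td>
      <td><a href="https://example.org/src/2">GenBank</a></td>
    </tr>
  </table>
')

tbl_node <- html_nodes(page, "table")[[1]]

# Only rows that contain data cells, which skips the header row
data_rows <- html_nodes(tbl_node, xpath = ".//tr[td]")

# Hypothetical helper: first href in each cell of a row, NA where a cell has no link
cell_links <- function(row) {
  row %>%
    html_nodes(xpath = "./td") %>%
    html_node("a") %>%
    html_attr("href")
}

# One row of links per table row, one column of links per table column
link_matrix <- do.call(rbind, lapply(data_rows, cell_links))
links <- as.data.frame(link_matrix, stringsAsFactors = FALSE)
names(links) <- paste0(c("category", "organism", "source"), "_link")

# Cell text plus the link columns side by side
result <- cbind(as.data.frame(html_table(tbl_node)), links)
print(result)
```

Cells without a link come back as `NA`, so the link columns stay aligned with the text columns instead of collapsing into a shorter vector.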