Scraping an HTML table with R while keeping the URLs

Question (votes: 0, answers: 1)

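I am currently using rvest to scrape two HTML tables from https://www.genome.jp/kegg/tables/br08606.html#5. Specifically, I want the second table (the one with the categories Bacteria and Archaea). Some of the cells in that table link to other sites, in particular cells in the "Category", "Source", and "Organisms" columns. I want to keep all of these links when I scrape the table, but I am not sure how to do that. Any guidance would be appreciated. Here is what I have tried so far after searching the internet.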

library(rvest)
library(dplyr)
library(tidyverse)
library(janitor)

item <- read_html("https://www.genome.jp/kegg/tables/br08606.html#5")

tables <- item %>% html_table(fill = TRUE)

Bacteria_table <- tables[[2]]
Bacteria_table <- Bacteria_table %>% clean_names()

source <- item %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_nodes(xpath = "//td/a") %>%
  html_attr("href")

print(source)

Bacteria_table_links <- data.frame(Bacteria_table)
Bacteria_table_links$source_links <- source
html r web-scraping html-table rvest
1 answer
0 votes

You can do it like this, collecting the links row by row so that they line up with the rows of the table:

library(rvest)
library(dplyr)
library(tidyverse)
library(janitor)

item <- read_html("https://www.genome.jp/kegg/tables/br08606.html#5")
# Extract the table content
# (rvest >= 1.0 always fills missing cells, so the deprecated fill = TRUE is not needed)
tables <- item %>% html_table()
Bacteria_table <- tables[[2]] %>% clean_names()

# Extract all rows from the table and their links
table_rows <- item %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_nodes("tr")  # Each row in the table

# Extract links for each row
# map_chr() returns a character vector with exactly one string per row
links_list <- table_rows %>%
  map_chr(~ .x %>%
        html_nodes("td a") %>%      # Get <a> tags within the row
        html_attr("href") %>%       # Extract href attributes
        paste(collapse = "; ")      # Combine multiple links with ";"
  )

# Add extracted links to the table
# Remove the header row from the links_list to align with the data
links_list <- links_list[-1]  # Assuming the first row is the header

# Check alignment
if (length(links_list) == nrow(Bacteria_table)) {
  Bacteria_table$source_links <- trimws(sub(".*;", "", links_list)) # get the last link from each row
  
} else {
  stop("Mismatch between table rows and extracted links!")
}

Result:

> head(Bacteria_table)
# A tibble: 6 × 9
  category category_2     category_3  organisms organisms_2 organisms_3                   year source  source_links                                           
  <chr>    <chr>          <chr>       <chr>     <chr>       <chr>                        <int> <chr>   <chr>                                                  
1 Bacteria Enterobacteria Escherichia eco       KGB         Escherichia coli K-12 MG1655  1997 RefSeq  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/8…
2 Bacteria Enterobacteria Escherichia ecj       KGB         Escherichia coli K-12 W3110   2001 GenBank https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/010/2…
3 Bacteria Enterobacteria Escherichia ecd       KGB         Escherichia coli K-12 DH10B   2008 GenBank https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/019/4…
4 Bacteria Enterobacteria Escherichia ebw       KGB         Escherichia coli K-12 BW2952  2009 GenBank https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/022/3…
5 Bacteria Enterobacteria Escherichia ecok      KGB         Escherichia coli K-12 MDS42   2013 GenBank https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/350/1…
6 Bacteria Enterobacteria Escherichia ecoc      KGB         Escherichia coli K-12 C3026   2023 GenBank https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/559/6…
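
The code above keeps only the last link of each row. If you also want the links from the other linked columns, one option is to call html_element() (singular) once per column: it returns exactly one result per row and gives NA where a cell has no link, so each link column stays aligned with the table. Below is a minimal sketch on a made-up two-table page. The example.org URLs, the cell_link() helper and the column positions are assumptions for illustration and are not taken from the KEGG page. It also assumes every row has the same number of <td> cells; rows that use rowspan or colspan would shift the positions, so check that against the real table first. The first two lines after the setup show why the XPath "//td/a" from the question picks up links from outside the selected table.

```r
library(rvest)

# Hypothetical page with two tables, standing in for the real one
page <- minimal_html('
  <table>
    <tr><th>Other</th></tr>
    <tr><td><a href="https://example.org/other">other</a></td></tr>
  </table>
  <table>
    <tr><th>Category</th><th>Organisms</th><th>Source</th></tr>
    <tr><td><a href="https://example.org/cat/1">Bacteria</a></td>
        <td><a href="https://example.org/org/eco">eco</a></td>
        <td><a href="https://example.org/src/1">RefSeq</a></td></tr>
    <tr><td>Archaea</td>
        <td><a href="https://example.org/org/mja">mja</a></td>
        <td><a href="https://example.org/src/2">GenBank</a></td></tr>
  </table>')

tbl_node <- html_elements(page, "table")[[2]]

# An XPath starting with "//" searches the whole document, even when it is
# applied to a single table node; a leading "." keeps the search inside it.
n_absolute <- length(html_elements(tbl_node, xpath = "//td/a"))
n_relative <- length(html_elements(tbl_node, xpath = ".//td/a"))

# Data rows only (drop the header row)
rows <- html_elements(tbl_node, "tr")[-1]

# Hypothetical helper: href of the first link in column `col` of every row,
# NA when that cell has no link
cell_link <- function(rows, col) {
  rows %>%
    html_element(paste0("td:nth-child(", col, ") a")) %>%
    html_attr("href")
}

result <- html_table(tbl_node)
result$category_link  <- cell_link(rows, 1)
result$organisms_link <- cell_link(rows, 2)
result$source_link    <- cell_link(rows, 3)

print(result)
```

Because each link column is built from the same set of rows as the table, no length check is needed, and a cell without a link simply ends up as NA instead of shifting the later values.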