In R, when scraping, I get an error because it recognizes an extra column and then doesn't recognize it


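I'm trying to scrape store information from the store-specific URLs of a particular chain. I'm using R.

I started testing my scrape and got this error message:

Error in mutate(., names = c("Name", "Address", "Phone")): Caused by error: ! `names` must be size 4 or 1, not 3.

So I tried to fix it by adding a "drop" column, which worked.

When I run the for loop, though, the error flips between requiring 3 names and requiring 4, always the opposite of whatever the loop supplies. Please see the code below.

Here are the steps I took and the output of each:

Initial test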

library(rvest)
library(dplyr)
library(tidyr)

trial_urls_1 <- "https://www.sprouts.com/store/co/lakewood/fairfield-commons/"


urls_1 <- trial_urls_1 %>% 
  read_html() %>% 
  html_elements(".store-details :nth-child(2)") %>% 
  html_text() %>% 
  tibble() %>% 
  rename(info = ".") %>% 
  unnest(col = info) %>% 
  mutate(names = c("Name", "Address", "Phone")) %>% #I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
  pivot_wider(names_from = names,
              values_from = info )

urls_1

Error:

Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone")`.
Caused by error:
! `names` must be size 4 or 1, not 3.
Backtrace:
  1. ... %>% pivot_wider(names_from = names, values_from = info)
 10. dplyr:::dplyr_internal_error(...)
Error in mutate(., names = c("Name", "Address", "Phone")) : 
Caused by error:
! `names` must be size 4 or 1, not 3.

Test 2 (works!):

urls_2 <- trial_urls_1 %>% 
  read_html() %>% 
  html_elements(".store-details :nth-child(2)") %>% 
  html_text() %>% 
  tibble() %>% 
  rename(info = ".") %>% 
  unnest(col = info) %>% 
  mutate(names = c("Name", "Address", "Phone", "drop")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
  pivot_wider(names_from = names,
              values_from = info ) %>% 
  select(-drop)

urls_2

For loop with 4 column names

#blank list
store_info_2 <- list()

#loop over all_store_urls
for (i in 1:length(all_store_urls)){
  store_info_2[[i]] <- all_store_urls[i] %>%
    read_html() %>% 
    html_elements(".store-details :nth-child(2)") %>% 
    html_text() %>% 
    tibble() %>% 
    rename(info = ".") %>% 
    unnest(col = info) %>% 
    mutate(names = c("Name", "Address", "Phone", "drop")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
    pivot_wider(names_from = names,
              values_from = info )
}

This produces the following error:

Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone", "drop")`.
Caused by error:
! `names` must be size 3 or 1, not 4.
Backtrace:
  1. ... %>% pivot_wider(names_from = names, values_from = info)
 10. dplyr:::dplyr_internal_error(...)

So I went back to 3 names, and the same thing happened:

store_info_1 <- list()

#loop over all_store_urls
for (i in 1:length(all_store_urls)){
  store_info_1[[i]] <- all_store_urls[i] %>%
    read_html() %>% 
    html_elements(".store-details :nth-child(2)") %>% 
    html_text() %>% #extract link
    tibble() %>%
    rename(info = ".") %>% 
    unnest(col = info) %>% 
    mutate(names = c("Name", "Address", "Phone")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
    pivot_wider(names_from = names,
              values_from = info )
}

Error:

Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone")`.
Caused by error:
! `names` must be size 4 or 1, not 3.
Backtrace:
  1. ... %>% pivot_wider(names_from = names, values_from = info)
 10. dplyr:::dplyr_internal_error(...)

What am I missing?

r for-loop web-scraping
1 Answer

For the included example page, your CSS selector (.store-details :nth-child(2)) returns 4 elements:

library(rvest)

trial_urls_1 <- "https://www.sprouts.com/store/co/lakewood/fairfield-commons/"
html  <- read_html(trial_urls_1)

html |> 
  html_elements(".store-details :nth-child(2)")
#> {xml_nodeset (4)}
#> [1] <h1>Lakewood</h1>
#> [2] <a href="https://www.google.com/maps/search/?api=1&amp;query=39.716896%2C ...
#> [3] <a href="tel:303-957-9276" role="button" tabindex="0" aria-hidden="false" ...
#> [4] <br>
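To see why the match count can change from store to store, here is a small offline sketch. The two pages are made-up markup built with rvest's minimal_html(), not the real site, and count_matches() is a hypothetical helper. It only illustrates that the selector counts every descendant that happens to be a second child, while mutate() needs a new column to hold either one value or exactly one value per row.

```r
library(rvest)

# Hypothetical markup, NOT the real store pages: two slightly different layouts.
page_a <- minimal_html('
  <div class="store-details">
    <p>Open daily</p>
    <h1>Store A</h1>
    <a class="store-address">1 Main St</a>
  </div>')

page_b <- minimal_html('
  <div class="store-details">
    <p>Open daily</p>
    <h1>Store B</h1>
    <div><span>Call</span><a class="store-phone">555-0100</a></div>
  </div>')

# Count the nodes matched by the selector from the question.
count_matches <- function(page) {
  length(html_elements(page, ".store-details :nth-child(2)"))
}

n_a <- count_matches(page_a)  # only the <h1> is a second child here
n_b <- count_matches(page_b)  # the <h1> plus the <a> nested inside the inner <div>
```

Because the two counts differ, a hard-coded vector of names cannot fit both pages, which is the same mismatch the loop runs into.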

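That selector is quite permissive, though: :nth-child(2) after a descendant combinator matches every element inside .store-details that is the second child of its own parent, so it can easily return 3 or 5 elements for some other store you are testing. I would rather use an explicit selector for each extracted piece, so the output shape is always fixed. If a page has no match for some selector, that item/column is still present, just filled with NA, as with .store-details .not-present here:
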
list(
  Name    = html_element(html, ".store-details h1"),
  Address = html_element(html, ".store-details .store-address"),
  Phone   = html_element(html, ".store-details .store-phone"),
  Missing = html_element(html, ".store-details .not-present")
) |>
  lapply(html_text2) |>
  tibble::as_tibble()
#> # A tibble: 1 × 4
#>   Name     Address                                                 Phone Missing
#>   <chr>    <chr>                                                   <chr> <chr>  
#> 1 Lakewood "Fairfield Commons\n98 Wadsworth Blvd. Lakewood, CO 80… 303-… <NA>
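
The same idea could be applied to the whole list of URLs. The following is only a sketch: scrape_store() and scrape_stores() are hypothetical helper names, the read_page argument is added purely so the code can be tried on mock pages, and the selectors are the ones used above, assumed to hold for every store page.

```r
library(rvest)

# Extract a fixed set of fields from one parsed page.
# html_element() (singular) returns exactly one result per selector,
# a missing node when nothing matches, so every store yields the same columns.
scrape_store <- function(html) {
  fields <- list(
    Name    = html_element(html, ".store-details h1"),
    Address = html_element(html, ".store-details .store-address"),
    Phone   = html_element(html, ".store-details .store-phone")
  )
  tibble::as_tibble(lapply(fields, html_text2))
}

# Hypothetical wrapper: read each URL and stack the one-row tibbles.
# read_page defaults to read_html; pass another function to test offline.
scrape_stores <- function(urls, read_page = read_html) {
  rows <- lapply(urls, function(u) scrape_store(read_page(u)))
  dplyr::bind_rows(rows)
}
```

With the question's vector this would be called as scrape_stores(all_store_urls). Pages that lack a phone number or address should then produce NA in that column rather than a size error.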
