In R, when scraping, I get an error because it recognizes an extra column and then doesn't recognize it

Question

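I am trying to scrape store information from the store-specific URLs of a particular chain. I am using R.

I started testing my scrape and got this error message: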

Error in mutate(., names = c("Name", "Address", "Phone")) : Caused by error:
! `names` must be size 4 or 1, not 3.

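So I tried to fix it by adding a "drop" column, which worked.

When I run the for loop, though, it flips between requiring 3 and 4 names: whichever count I use in the loop, it asks for the other one. See the code below.

Here are the steps I took and what the output was:

Initial test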

library(rvest)
library(dplyr)
library(tidyr)

trial_urls_1 <- "https://www.sprouts.com/store/co/lakewood/fairfield-commons/"


urls_1 <- trial_urls_1 %>% 
  read_html() %>% 
  html_elements(".store-details :nth-child(2)") %>% 
  html_text() %>% 
  tibble() %>% 
  rename(info = ".") %>% 
  unnest(col = info) %>% 
  mutate(names = c("Name", "Address", "Phone")) %>% #I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
  pivot_wider(names_from = names,
              values_from = info )

urls_1

Error:

Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone")`.
Caused by error:
! `names` must be size 4 or 1, not 3.
Backtrace:
  1. ... %>% pivot_wider(names_from = names, values_from = info)
 10. dplyr:::dplyr_internal_error(...)
Error in mutate(., names = c("Name", "Address", "Phone")) : 
Caused by error:
! `names` must be size 4 or 1, not 3.

Test 2 (works!):

urls_2 <- trial_urls_1 %>% 
  read_html() %>% 
  html_elements(".store-details :nth-child(2)") %>% 
  html_text() %>% 
  tibble() %>% 
  rename(info = ".") %>% 
  unnest(col = info) %>% 
  mutate(names = c("Name", "Address", "Phone", "drop")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
  pivot_wider(names_from = names,
              values_from = info ) %>% 
  select(-drop)

urls_2

For loop with 4 column names

#blank list
store_info_2 <- list()

#loop over all_store_urls
for (i in 1:length(all_store_urls)){
  store_info_2[[i]] <- all_store_urls[i] %>%
    read_html() %>% 
    html_elements(".store-details :nth-child(2)") %>% 
    html_text() %>% 
    tibble() %>% 
    rename(info = ".") %>% 
    unnest(col = info) %>% 
    mutate(names = c("Name", "Address", "Phone", "drop")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
    pivot_wider(names_from = names,
              values_from = info )
}

This produces the following error:

Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone", "drop")`.
Caused by error:
! `names` must be size 3 or 1, not 4.
Backtrace:
  1. ... %>% pivot_wider(names_from = names, values_from = info)
 10. dplyr:::dplyr_internal_error(...)

So I went back to 3 names, and the same thing happened:

store_info_1 <- list()

#loop over all_store_urls
for (i in 1:length(all_store_urls)){
  store_info_1[[i]] <- all_store_urls[i] %>%
    read_html() %>% 
    html_elements(".store-details :nth-child(2)") %>% 
    html_text() %>% #extract link
    tibble() %>%
    rename(info = ".") %>% 
    unnest(col = info) %>% 
    mutate(names = c("Name", "Address", "Phone")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
    pivot_wider(names_from = names,
              values_from = info )
}

Error:

Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone")`.
Caused by error:
! `names` must be size 4 or 1, not 3.
Backtrace:
  1. ... %>% pivot_wider(names_from = names, values_from = info)
 10. dplyr:::dplyr_internal_error(...)

What am I missing?

Answer 1

For the included example page, your CSS selector (.store-details :nth-child(2)) returns 4 elements:

library(rvest)

trial_urls_1 <- "https://www.sprouts.com/store/co/lakewood/fairfield-commons/"
html  <- read_html(trial_urls_1)

html |> 
  html_elements(".store-details :nth-child(2)")
#> {xml_nodeset (4)}
#> [1] <h1>Lakewood</h1>
#> [2] <a href="https://www.google.com/maps/search/?api=1&amp;query=39.716896%2C ...
#> [3] <a href="tel:303-957-9276" role="button" tabindex="0" aria-hidden="false" ...
#> [4] <br>

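But that selector is quite loose and could easily return 3 or 5 elements for the other stores you are testing. I would rather use an explicit selector for each extracted piece, so that the output shape is always fixed. If a selector has no match on some page, that item/column is still present, just filled with NA, as happens with .store-details .not-present below: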

list(
  Name    = html_element(html, ".store-details h1"),
  Address = html_element(html, ".store-details .store-address"),
  Phone   = html_element(html, ".store-details .store-phone"),
  Missing = html_element(html, ".store-details .not-present")
) |> 
  lapply(html_text2) |> 
  tibble::as_tibble()
#> # A tibble: 1 × 4
#>   Name     Address                                                 Phone Missing
#>   <chr>    <chr>                                                   <chr> <chr>  
#> 1 Lakewood "Fairfield Commons\n98 Wadsworth Blvd. Lakewood, CO 80… 303-… <NA>
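
The same idea could be extended to the loop over all stores. The sketch below is one possible way to do it, not a tested solution for the real site: scrape_store() is a hypothetical helper, and the pages are small in-memory fragments built with minimal_html() so the example runs without network access. The class names .store-address and .store-phone are taken from the snippet above and are assumed to be present on the other store pages as well.

```r
library(rvest)
library(dplyr)

# Hypothetical helper: returns exactly one row with a fixed set of columns.
# html_element() (singular) yields a "missing" node when nothing matches,
# and html_text2() turns that into NA instead of changing the output shape.
scrape_store <- function(page) {
  tibble(
    Name    = html_text2(html_element(page, ".store-details h1")),
    Address = html_text2(html_element(page, ".store-details .store-address")),
    Phone   = html_text2(html_element(page, ".store-details .store-phone"))
  )
}

# Made-up pages standing in for the real store URLs; the second has no phone.
pages <- list(
  minimal_html('<div class="store-details"><h1>Store A</h1>
                <div class="store-address">1 Main St</div>
                <a class="store-phone">555-0100</a></div>'),
  minimal_html('<div class="store-details"><h1>Store B</h1>
                <div class="store-address">2 Oak Ave</div></div>')
)

# One row per page, stacked into a single tibble.
store_info <- bind_rows(lapply(pages, scrape_store))
store_info

# With real URLs this would become something like:
# store_info <- bind_rows(lapply(all_store_urls, function(u) scrape_store(read_html(u))))
```

Every page contributes one row with the same three columns, so there is no names vector whose length has to match the number of scraped nodes.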
