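I'm trying to scrape store information from the store-specific URLs of a particular chain. I'm using R.
I started testing my scraper and got this error message:
Error in mutate(., names = c("Name", "Address", "Phone")): Caused by error:
! `names` must be size 4 or 1, not 3.
So I tried to fix it by adding a "drop" column, which worked.
But when I run the for loop, the error flips between requiring 3 or 4 names, always asking for whichever count the loop isn't using. Please see the code below.
Here are the steps I took and the output of each:
Initial test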
library(rvest)
library(dplyr)
library(tidyr)
trial_urls_1 <- "https://www.sprouts.com/store/co/lakewood/fairfield-commons/"
urls_1 <- trial_urls_1 %>%
read_html() %>%
html_elements(".store-details :nth-child(2)") %>%
html_text() %>%
tibble() %>%
rename(info = ".") %>%
unnest(col = info) %>%
mutate(names = c("Name", "Address", "Phone")) %>% #I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
pivot_wider(names_from = names,
values_from = info )
urls_1
Error:
Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone")`.
Caused by error:
! `names` must be size 4 or 1, not 3.
Backtrace:
1. ... %>% pivot_wider(names_from = names, values_from = info)
10. dplyr:::dplyr_internal_error(...)
Error in mutate(., names = c("Name", "Address", "Phone")) :
Caused by error:
! `names` must be size 4 or 1, not 3.
Test 2 (works!):
urls_2 <- trial_urls_1 %>%
read_html() %>%
html_elements(".store-details :nth-child(2)") %>%
html_text() %>%
tibble() %>%
rename(info = ".") %>%
unnest(col = info) %>%
mutate(names = c("Name", "Address", "Phone", "drop")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
pivot_wider(names_from = names,
values_from = info ) %>%
select(-drop)
urls_2
For loop with 4 column names
#blank list
store_info_2 <- list()
#loop over all_store_urls
for (i in 1:length(all_store_urls)){
store_info_2[[i]] <- all_store_urls[i] %>%
read_html() %>%
html_elements(".store-details :nth-child(2)") %>%
html_text() %>%
tibble() %>%
rename(info = ".") %>%
unnest(col = info) %>%
mutate(names = c("Name", "Address", "Phone", "drop")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
pivot_wider(names_from = names,
values_from = info )
}
This produces the following error:
Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone", "drop")`.
Caused by error:
! `names` must be size 3 or 1, not 4.
Backtrace:
1. ... %>% pivot_wider(names_from = names, values_from = info)
10. dplyr:::dplyr_internal_error(...)
So I went back to 3 names, and the same thing happened:
store_info_1 <- list()
#loop over all_store_urls
for (i in 1:length(all_store_urls)){
store_info_1[[i]] <- all_store_urls[i] %>%
read_html() %>%
html_elements(".store-details :nth-child(2)") %>%
html_text() %>% #extract link
tibble() %>%
rename(info = ".") %>%
unnest(col = info) %>%
mutate(names = c("Name", "Address", "Phone")) %>% #without drop I get an error saying the scrape is yielding 4 columns. This is what seems to give the error in the for loop
pivot_wider(names_from = names,
values_from = info )
}
Error:
Error in `mutate()`:
ℹ In argument: `names = c("Name", "Address", "Phone")`.
Caused by error:
! `names` must be size 4 or 1, not 3.
Backtrace:
1. ... %>% pivot_wider(names_from = names, values_from = info)
10. dplyr:::dplyr_internal_error(...)
What am I missing?
For the example page you included, your CSS selector (.store-details :nth-child(2)) returns 4 elements:
library(rvest)
trial_urls_1 <- "https://www.sprouts.com/store/co/lakewood/fairfield-commons/"
html <- read_html(trial_urls_1)
html |>
html_elements(".store-details :nth-child(2)")
#> {xml_nodeset (4)}
#> [1] <h1>Lakewood</h1>
#> [2] <a href="https://www.google.com/maps/search/?api=1&query=39.716896%2C ...
#> [3] <a href="tel:303-957-9276" role="button" tabindex="0" aria-hidden="false" ...
#> [4] <br>
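The size mismatch in the question follows from that count: mutate() only accepts a new column whose length is 1 or equal to the number of rows, so three labels cannot be attached to four scraped values (and four labels fail on a page that yields three). A minimal offline sketch of this, assuming a made-up four-row tibble as a stand-in for the scraped text:

```r
library(dplyr)

# Stand-in for the scraped text: four values, matching the four nodes above
# (the strings are illustrative placeholders, not real scrape output)
scraped <- tibble(info = c("Lakewood", "98 Wadsworth Blvd.", "303-957-9276", ""))

# Three labels for four rows: mutate() refuses to recycle them
attempt_3 <- tryCatch(
  mutate(scraped, names = c("Name", "Address", "Phone")),
  error = function(e) e
)
failed_with_3 <- inherits(attempt_3, "error")

# Four labels for four rows: sizes match, so this succeeds
attempt_4 <- mutate(scraped, names = c("Name", "Address", "Phone", "drop"))

failed_with_3
nrow(attempt_4)
```

Since the number of matched nodes can differ from store to store, any fixed-length label vector will fit some pages and fail on others.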
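However, that selector is quite loose and could easily return 3 or 5 elements for the other stores you are testing. I would rather use an explicit selector for each piece you extract, so the output shape is always fixed. If a selector has no match on some page, its item/column is still present and is simply filled with NA, as with .store-details .not-present below: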
list(
Name = html_element(html, ".store-details h1"),
Address = html_element(html, ".store-details .store-address"),
Phone = html_element(html, ".store-details .store-phone"),
Missing = html_element(html, ".store-details .not-present")
) |>
lapply(html_text2) |>
tibble::as_tibble()
#> # A tibble: 1 × 4
#> Name Address Phone Missing
#> <chr> <chr> <chr> <chr>
#> 1 Lakewood "Fairfield Commons\n98 Wadsworth Blvd. Lakewood, CO 80… 303-… <NA>
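To carry this over to the loop from the question, one option is to wrap the explicit selectors in a small function and bind the resulting one-row tibbles together. The following is only a sketch: scrape_store is a hypothetical helper name, and the two pages are made-up HTML built with minimal_html() so the example runs offline. With real data you would pass read_html(url) for each entry of all_store_urls instead.

```r
library(rvest)
library(dplyr)

# Hypothetical helper: returns one row per page, always with the same columns
scrape_store <- function(page) {
  list(
    Name    = html_element(page, ".store-details h1"),
    Address = html_element(page, ".store-details .store-address"),
    Phone   = html_element(page, ".store-details .store-phone")
  ) |>
    lapply(html_text2) |>
    tibble::as_tibble()
}

# Made-up pages standing in for read_html(all_store_urls[i]);
# the second one deliberately has no phone element
pages <- list(
  minimal_html('<div class="store-details"><h1>Store A</h1>
    <a class="store-address">1 Main St.</a>
    <a class="store-phone">555-0100</a></div>'),
  minimal_html('<div class="store-details"><h1>Store B</h1>
    <a class="store-address">2 Side St.</a></div>')
)

# One row per page; a missing element becomes NA instead of changing the shape
store_info <- lapply(pages, scrape_store) |>
  bind_rows()
store_info
```

Against the live site, the last step would look something like lapply(all_store_urls, function(u) scrape_store(read_html(u))) |> bind_rows(), assuming every store page uses the same class names as the example page.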