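I want to create a data frame that contains the strings matched from a long string column. In the data frame `d`, the `genres` column is a string column that lists the genres of each record. The difficulty is that the rows do not all have the same number of matches.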
library(tidyverse)
d <- structure(list(genres = c("[{'id': 35, 'name': 'Comedy'}]", "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}, {'id': 10749, 'name': 'Romance'}]",
"[{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}, {'id': 10751, 'name': 'Family'}]"
), budget = c(1.4e+07, 4e+07, 8e+06)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
d
#> # A tibble: 3 x 2
#> genres budget
#> <chr> <dbl>
#> 1 [{'id': 35, 'name': 'Comedy'}] 1.40e7
#> 2 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id… 4.00e7
#> 3 [{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'… 8.00e6
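I found a workflow that solves this, but it feels clumsy. First, use `str_extract_all` with `simplify = T` to extract all matches, which returns a character matrix (padded with empty strings where a row has fewer matches). Then build a character vector of column names, assign it to the extracted matrix, and finally use `bind_cols`: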
foo <- str_extract_all(d$genres, '(?<=\'name\':\\s\')([^\']*)', simplify = T)
colnames(foo) <- paste0("genre_", 1:ncol(foo), "_extract")
foo <- foo %>% as_tibble()
foo_final <- bind_cols(d, foo)
foo_final
#> # A tibble: 3 x 6
#> genres budget genre_1_extract genre_2_extract genre_3_extract
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 [{'id… 1.40e7 Comedy "" ""
#> 2 [{'id… 4.00e7 Comedy Drama Family
#> 3 [{'id… 8.00e6 Animation Adventure Family
#> # … with 1 more variable: genre_4_extract <chr>
Created on 2019-03-10 by the reprex package (v0.2.1)
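To make the first step of that workflow concrete, here is a minimal sketch of what `str_extract_all` returns with and without `simplify`. The vector `x` is a shortened, hypothetical stand-in for the `genres` column, and the pattern is the same lookbehind as above with the redundant escapes and capture group dropped.

```r
library(stringr)

# Hypothetical, shortened strings in the same shape as the `genres` column.
x <- c("[{'id': 35, 'name': 'Comedy'}]",
       "[{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}]")

# Lookbehind for the literal "'name': '", then take everything up to the
# next single quote.
pattern <- "(?<='name':\\s')[^']*"

# Default: a list with one character vector per input string.
as_list <- str_extract_all(x, pattern)

# simplify = TRUE: a character matrix, padded with "" for missing matches.
as_matrix <- str_extract_all(x, pattern, simplify = TRUE)
```

The padding with `""` rather than `NA` is why the empty cells in the output above print as `""`.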
I would like to know whether there is a tidier way to do this within a pipe, using `mutate` or `map_df`... I am sure there is a better way.
I'm not sure this is any more elegant than your solution, but it is a pure pipe:
# Use `str_count` to determine the maximum number of genres associated with any
# single row, and create one column for each genre. Stored here for
# convenience, since it's needed twice below.
genre.col.names = paste("genre",
1:(max(str_count(d$genres, "\\}, \\{")) + 1),
sep = "_")
# Use `separate` to split the `genres` column into one column for each genre.
# Then use `mutate` and some regular expressions to populate each column with
# just the genre name and no other material (quotes, brackets, etc.).
d %>%
separate(genres,
into = genre.col.names,
sep = "\\}, \\{") %>%
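  # `across()` (dplyr >= 1.0) applies the same `gsub()` to every genre column,
  # keeping only the text captured after 'name': and leaving NA cells as NA.
  mutate(across(all_of(genre.col.names),
                ~ gsub(".*'name': '([A-Za-z]+)'.*", "\\1", .x)))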
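As an alternative to splitting on delimiters, the same wide layout can be reached by extracting the names into a list column and reshaping. This is only a sketch: it assumes tidyr 1.0 or later (for `pivot_wider`), and the helper columns `movie` and `slot` are names introduced here for illustration.

```r
library(tidyverse)

# Same data as in the question.
d <- tibble(
  genres = c("[{'id': 35, 'name': 'Comedy'}]",
             "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}, {'id': 10749, 'name': 'Romance'}]",
             "[{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}, {'id': 10751, 'name': 'Family'}]"),
  budget = c(1.4e7, 4e7, 8e6)
)

wide <- d %>%
  # `movie` is a hypothetical row id; `genre` is a list column of matches.
  mutate(movie = row_number(),
         genre = str_extract_all(genres, "(?<='name':\\s')[^']*")) %>%
  unnest(genre) %>%                                   # one row per movie/genre
  group_by(movie) %>%
  mutate(slot = paste0("genre_", row_number())) %>%   # genre_1, genre_2, ...
  ungroup() %>%
  pivot_wider(names_from = slot, values_from = genre) # missing slots become NA
```

Unlike the `simplify = T` matrix, missing genres come back as `NA` rather than `""`. Note that `unnest()` drops rows whose list element is empty by default, so a record with no genres at all would disappear unless `keep_empty = TRUE` is set.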
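In my view, a tidier format would be long: one row per genre per movie. (I'm assuming each record is a movie...?) If that format works for you, here is one way to get it (I'm sure there are even simpler ways):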
library(fuzzyjoin)
# Get all the genres appearing in any record and put them in a dataframe.
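all.genres = data.frame(
  genre.name = unique(unlist(str_extract_all(d$genres, '(?<=\'name\':\\s\')([^\']*)'))),
  # Keep genre names as character rather than factor (the default before R 4.0).
  stringsAsFactors = FALSE
)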
# Left join the main data frame to the genre list using a regular expression.
d %>%
rownames_to_column() %>%
regex_left_join(all.genres, by = c("genres" = "genre.name")) %>%
select(rowname, genre.name)
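One possibly simpler route to the same long format, without fuzzyjoin, is to extract the names into a list column and unnest it. This is a sketch rather than a drop-in replacement: the `movie` id column is a name introduced here, standing in for `rowname` above.

```r
library(tidyverse)

# Same data as in the question.
d <- tibble(
  genres = c("[{'id': 35, 'name': 'Comedy'}]",
             "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}, {'id': 10749, 'name': 'Romance'}]",
             "[{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}, {'id': 10751, 'name': 'Family'}]"),
  budget = c(1.4e7, 4e7, 8e6)
)

long <- d %>%
  mutate(movie = row_number(),
         # One character vector of genre names per row, stored as a list column.
         genre.name = str_extract_all(genres, "(?<='name':\\s')[^']*")) %>%
  unnest(genre.name) %>%        # expand to one row per movie/genre pair
  select(movie, genre.name)
```

Because each name is extracted from its own row, this avoids matching every genre against every record, and a genre name that happens to be a substring of another cannot produce a spurious match.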
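If you really do need the data in wide format, how about one column per possible genre, with TRUE/FALSE values? This becomes unwieldy, of course, if there are many possible genres. The advantage is that, for example, the information about whether a movie is a comedy is always in the same column. (With `genre_1`, `genre_2`, and so on, you have to check every column to find out whether a movie is a comedy.) Here is one way to do it: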
# Using the list of unique genres that we created earlier, add one column for
# each genre that contains a flag for whether the record is an example of that
# genre.
d %>%
  bind_cols(map_dfc(set_names(as.character(all.genres$genre.name)),
                    ~ grepl(.x, d$genres)))
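The flag columns can also be derived by reshaping the long format, which makes the "always in the same column" advantage easy to demonstrate with a filter. This is a sketch assuming tidyr 1.0 or later; `movie` and `flag` are helper names introduced here.

```r
library(tidyverse)

# Same data as in the question.
d <- tibble(
  genres = c("[{'id': 35, 'name': 'Comedy'}]",
             "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}, {'id': 10749, 'name': 'Romance'}]",
             "[{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}, {'id': 10751, 'name': 'Family'}]"),
  budget = c(1.4e7, 4e7, 8e6)
)

flags <- d %>%
  mutate(movie = row_number(),
         genre = str_extract_all(genres, "(?<='name':\\s')[^']*")) %>%
  unnest(genre) %>%
  mutate(flag = TRUE) %>%
  # One logical column per genre; combinations that never occur become FALSE.
  pivot_wider(names_from = genre, values_from = flag,
              values_fill = list(flag = FALSE))

# Checking for comedies needs only one column.
comedies <- flags %>% filter(Comedy)
```

With `genre_1`, `genre_2`, ... the equivalent filter would have to test every genre column in turn.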