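I'm using RStudio and working with a dataset of London's blue and red historical plaques. Some of the interesting columns I'm looking at are "title", "gender" (options are male, female, object, place, etc.), "erected" (the year the plaque was erected), "subject_lead_primary_role" (their job title, generally 1-5 words), and "inscription" (the inscription on the plaque, multiple words).
I'd like to filter out a subset containing all of the writers in the original set. Using columns such as "subject_lead_primary_role" and "inscription", I want to keep every row that contains one or more of the following words: "author", "writer", "novelist", "essayist", "poet", "playwright", "journalist", "dramatist", and "diarist". I'd like this subset to contain no NAs.
I'm very new to R. I've been able to import the dataset and do basic visualizations on numerical data, but I haven't worked with strings before.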
Here is an example of how to subset a data frame similar to the one you describe, using the tidyverse package.
library(tidyverse)
# Create example dataset
dat <- data.frame(
  color = rep(c('Blue','Red'), times = 3),
  role = c('Academic','Dentist','Pilot','Author','Playwright','Journalist'),
  inscription = c('Darwin is buried in London',
                  'William Guy created the 1921 Dentists Act in the United Kingdom',
                  'Mary Ellis flew planes in WWII',
                  'Oscar Wilde was a writer of many great stories',
                  'Shakespeare was a playwright, poet, and actor',
                  'John Oliver probably has a plaque in the UK'))
# Subsetting based on one word in inscription column
dat %>%
  filter(str_detect(inscription, 'writer'))
#> color role inscription
#> 1 Red Author Oscar Wilde was a writer of many great stories
# Subsetting based on two+ words in inscription column
# Note that this filter is case-sensitive
dat %>%
  filter(str_detect(inscription, 'writer|poet|Oliver'))
#> color role inscription
#> 1 Red Author Oscar Wilde was a writer of many great stories
#> 2 Blue Playwright Shakespeare was a playwright, poet, and actor
#> 3 Red Journalist John Oliver probably has a plaque in the UK
Created on 2024-05-14 with reprex v2.1.0
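To get closer to what the question asks for (several keywords, two columns, no NAs), the same idea can be extended roughly as follows. This is only a sketch: the `plaques` data frame and its rows are made up for illustration, the column names are taken from the question, and I'm assuming "no NAs" means dropping any row that has a missing value in either column. The pattern is wrapped in `regex(..., ignore_case = TRUE)` so that "Poet" and "poet" both match.

```r
library(tidyverse)

# Hypothetical stand-in for the plaque data (rows invented for illustration)
plaques <- data.frame(
  subject_lead_primary_role = c('Academic', 'Novelist', NA, 'Pilot',
                                'Poet and essayist', 'Engineer'),
  inscription = c('Studied natural history here',
                  'Lived here 1890-1900',
                  'A noted DIARIST stayed in this house',
                  'Flew planes in WWII',
                  NA,
                  'Home of a celebrated playwright'))

# Keywords from the question, collapsed into one alternation: 'author|writer|...'
keywords <- c('author', 'writer', 'novelist', 'essayist', 'poet',
              'playwright', 'journalist', 'dramatist', 'diarist')
pattern <- regex(str_c(keywords, collapse = '|'), ignore_case = TRUE)

# Keep rows where either column mentions a keyword.
# str_detect() returns NA for missing values; NA | TRUE is TRUE, so a row
# is still kept when the other column matches.
writers <- plaques %>%
  filter(str_detect(subject_lead_primary_role, pattern) |
           str_detect(inscription, pattern))

# Assumed reading of "no NAs": drop rows with a missing value in any column
writers_complete <- writers %>%
  drop_na()
```

Note that these are substring matches, so 'poet' would also match 'poetry'. If that turns out to be a problem in the real data, wrapping each keyword in `\\b` word boundaries is one option.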