I have a data frame:
df <- data.frame(
Otherspp = c("suck SD", "BT", "SD RS", "RSS"),
Dominantspp = c("OM", "OM", "RSS", "CH"),
Commonspp = c(" ", " ", " ", "OM"),
Rarespp = c(" ", " ", "SD", "NP"),
NP = rep("northern pikeminnow|NORTHERN PIKEMINNOW|np|NP|npm|NPM", 4),
OM = rep("steelhead|STEELHEAD|rainbow trout|RAINBOW TROUT|st|ST|rb|RB|om|OM", 4),
RSS = rep("redside shiner|REDSIDE SHINER|rs|RS|rss|RSS", 4),
suck = rep("suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS", 4)
)
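I need to use the columns that hold the common fish codes/names (NP, OM, RSS, suck) as expressions to evaluate against the first four columns, and output 1/0 for each of those columns depending on whether the expression matches. My code below does not match on whole words (it matches partial strings) and so gives incorrect data (see the result below).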
df %>%
rowwise() %>%
transmute_at(vars(NP, OM, RSS, suck),
funs(case_when(
grepl(., Dominantspp) ~ "1",
grepl(., Commonspp) ~ "1",
grepl(., Rarespp) ~ "1",
grepl(., Otherspp) ~ "1",
TRUE ~ "0"))) %>%
ungroup()
Result: note that in the third row both "suck" and "RSS" receive a "1".
# A tibble: 4 x 4
NP OM RSS suck
<chr> <chr> <chr> <chr>
1 0 1 0 1
2 0 1 0 0
3 0 0 1 1
4 1 1 1 1
Desired output:
NP OM RSS suck
1 0 1 0 1
2 0 1 0 0
3 0 0 1 0
4 1 1 1 0
The quickest way to fix this while keeping the same approach is to add \\b word boundaries at the start and end of each regular expression:
df <- data.frame(
Otherspp = c("suck SD", "BT", "SD RS", "RSS"),
Dominantspp = c("OM", "OM", "RSS", "CH"),
Commonspp = c(" ", " ", " ", "OM"),
Rarespp = c(" ", " ", "SD", "NP"),
NP = rep("\\b(northern pikeminnow|NORTHERN PIKEMINNOW|np|NP|npm|NPM)\\b", 4),
OM = rep("\\b(steelhead|STEELHEAD|rainbow trout|RAINBOW TROUT|st|ST|rb|RB|om|OM)\\b", 4),
RSS = rep("\\b(redside shiner|REDSIDE SHINER|rs|RS|rss|RSS)\\b", 4),
suck = rep("\\b(suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS)\\b", 4),
stringsAsFactors = FALSE
)
This makes the regular expressions match only whole words, which lets the rest of your code work as intended.
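To see why the boundaries matter, here is a minimal check in base R. It reuses the `suck` pattern from the question; the variable names `suck_loose` and `suck_strict` are only illustrative.

```r
# Unanchored pattern from the question: the "SS" alternative matches inside "RSS"
suck_loose <- "suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS"

# Same alternatives, grouped and wrapped in \\b on both sides.
# The closing parenthesis must come before the trailing \\b, otherwise the
# boundary only applies to the last alternative.
suck_strict <- paste0("\\b(", suck_loose, ")\\b")

loose_hit  <- grepl(suck_loose, "RSS")       # partial match on "SS"
strict_hit <- grepl(suck_strict, "RSS")      # no whole-word match
strict_ok  <- grepl(suck_strict, "suck SD")  # "suck" is a whole word here

print(c(loose = loose_hit, strict = strict_hit, strict_whole_word = strict_ok))
```

The first call is what produces the unwanted "1" for `suck` in rows 3 and 4 of the question's result.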
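That said, I don't think this is necessarily the best way to approach the problem (rowwise() is rarely recommended these days, and this approach does not scale well to many fish codes). I think you could work with this data more easily if you normalized it into a tidy format, with one row per combination of row and code: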
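library(dplyr)     # select(), mutate(), row_number(), %>%
library(tidyr)     # gather()
library(tidytext)  # unnest_tokens()
row_codes <- df %>%
select(Otherspp:Rarespp) %>%
mutate(row = row_number()) %>%
# one row per (original row, column type)
gather(type, codes, -row) %>%
# split on spaces into one code per row (tokens are lowercased by default)
unnest_tokens(code, codes, token = "regex", pattern = " ")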
This results in:
row type code
1 1 Dominantspp om
2 1 Otherspp suck
3 1 Otherspp sd
4 2 Dominantspp om
5 2 Otherspp bt
6 3 Dominantspp rss
7 3 Otherspp sd
8 3 Otherspp rs
9 3 Rarespp sd
10 4 Commonspp om
11 4 Dominantspp ch
12 4 Otherspp rss
13 4 Rarespp np
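At this point, the codes are much easier to work with (you no longer need regular expressions). For example, you could inner_join them to a table of fish codes.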
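As a rough sketch of that join, assuming a lookup table with one row per code: `fish_codes`, `species`, `present`, and `presence` are hypothetical names, and the lookup holds only the lowercase single-word codes from the question's patterns. Multi-word names such as "rainbow trout" would need a different tokenizer, because the codes above are split on spaces. `row_codes` is rebuilt by hand from the table shown above so the example runs on its own.

```r
library(dplyr)
library(tidyr)

# Tidy codes, copied from the table above (the type column is omitted for brevity)
row_codes <- data.frame(
  row  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
  code = c("om", "suck", "sd", "om", "bt", "rss", "sd", "rs", "sd",
           "om", "ch", "rss", "np"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per accepted code, mapped to its species column
fish_codes <- data.frame(
  species = rep(c("NP", "OM", "RSS", "suck"), times = c(2, 4, 2, 5)),
  code    = c("np", "npm",
              "steelhead", "st", "rb", "om",
              "rs", "rss",
              "suckers", "sucker", "suck", "su", "ss"),
  stringsAsFactors = FALSE
)

presence <- row_codes %>%
  inner_join(fish_codes, by = "code") %>%   # exact matches only, no regex
  distinct(row, species) %>%                # one flag per row/species pair
  mutate(present = 1L) %>%
  pivot_wider(names_from = species, values_from = present,
              values_fill = list(present = 0L)) %>%
  arrange(row) %>%
  select(row, NP, OM, RSS, suck)

print(presence)
```

With this data the flags agree with the desired output in the question. Note that a row with no recognized codes would drop out of the `inner_join` entirely, so you may need to add such rows back (for example by joining against the full set of row numbers) if every row must appear.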