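I am cleaning a database of email addresses and the information attached to each one. Some emails appear more than once, and the information in those rows is complementary, so I want to combine the rows using the email as the key. If a row merely duplicates information that is already present, it should be dropped.
My database is a CSV file that I load into a data frame with read.csv.
Input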
  EMAIL             Country Gender Language
1 [email protected] US             S
2 [email protected] AR      female S
3 [email protected]         female
4 [email protected] US      female E
4 [email protected] US      female E
5 [email protected] US      male
Expected output
  EMAIL             Country Gender Language
1 [email protected] US      male   S
2 [email protected] AR      female S
3 [email protected] US      female E
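We can use dplyr. After grouping by "EMAIL", summarise_all applies unique to every other column, keeping only the elements that are not blank. This relies on each email having at most one distinct non-blank value per column, which holds for the sample data.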
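library(dplyr)
# For each EMAIL, keep the unique non-blank value of every other column.
# The formula (~) form is used because funs() is deprecated in recent dplyr releases.
df %>%
  group_by(EMAIL) %>%
  summarise_all(~ unique(.[. != '']))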
# A tibble: 3 x 4
# Groups: EMAIL [3]
# EMAIL Country Gender Language
# <chr> <chr> <chr> <chr>
#1 [email protected] US male S
#2 [email protected] AR female S
#3 [email protected] US female E
Data:
df <- structure(list(EMAIL = c("[email protected]", "[email protected]", "[email protected]", "[email protected]",
"[email protected]", "[email protected]"), Country = c("US", "AR", "", "US", "US",
"US"), Gender = c("", "female", "female", "female", "female",
"male"), Language = c("S", "S", "", "E", "E", "")), .Names = c("EMAIL",
"Country", "Gender", "Language"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
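The unique() approach assumes that rows sharing an email never disagree with each other. Below is a minimal base R sketch for checking that assumption before merging. The helper n_values, the objects counts and conflicts, and the example.com addresses are illustrative placeholders, not part of the original data.

```r
# Sample data shaped like the question's; the addresses are placeholders.
df <- data.frame(
  EMAIL    = c("a@example.com", "b@example.com", "c@example.com",
               "c@example.com", "c@example.com", "a@example.com"),
  Country  = c("US", "AR", "", "US", "US", "US"),
  Gender   = c("", "female", "female", "female", "female", "male"),
  Language = c("S", "S", "", "E", "E", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count the distinct values that are neither NA nor "".
n_values <- function(x) length(unique(x[!is.na(x) & x != ""]))

# One row per email, holding the number of distinct values in each column.
counts <- aggregate(df[-1], by = list(EMAIL = df$EMAIL), FUN = n_values)

# Emails where some column has more than one distinct value cannot be
# merged unambiguously and need a manual decision.
conflicts <- counts[rowSums(counts[-1] > 1) > 0, ]
nrow(conflicts)
```

If conflicts has any rows, those emails carry contradictory values and the merge rule has to decide which one wins.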
We can also use aggregate as a base R option:
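# Take the per-email maximum of every column, skipping NA values.
df_out <- aggregate(x = df, by = list(df$EMAIL), FUN = function(x) max(x, na.rm = TRUE))
# Sort by email and drop the extra grouping column that aggregate adds in front.
df_out[order(df_out$EMAIL), -1]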
EMAIL Country Gender Language
1 [email protected] US female E
2 [email protected] US male S
3 [email protected] AR female S
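The basic idea is to take, somewhat arbitrarily, the maximum value of each column for each email key while ignoring NA values. For character columns the maximum is the last value in sort order, so this only gives the intended result when each email has at most one distinct non-NA value per column. That appears to hold for your dataset.
Data: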
df <- data.frame(EMAIL=c('[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'),
Country=c('US', 'AR', NA, 'US', 'US', 'US'),
Gender=c(NA, 'female', 'female', 'female', 'female', 'male'),
Language=c('S', 'S', NA, 'E', 'E', NA), stringsAsFactors=FALSE)
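
Since max() picks the value that sorts last rather than a "best" value, a variant that takes the first non-missing value may make the intent clearer. This is a minimal sketch, not the method from the answer above. The helper first_present, the object merged, and the example.com addresses are hypothetical names introduced for illustration.

```r
# Sample data shaped like the answer's; the addresses are placeholders.
df <- data.frame(
  EMAIL    = c("a@example.com", "b@example.com", "c@example.com",
               "c@example.com", "c@example.com", "a@example.com"),
  Country  = c("US", "AR", NA, "US", "US", "US"),
  Gender   = c(NA, "female", "female", "female", "female", "male"),
  Language = c("S", "S", NA, "E", "E", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first value that is neither NA nor "", else NA.
first_present <- function(x) {
  keep <- x[!is.na(x) & x != ""]
  if (length(keep) > 0) keep[1] else NA_character_
}

# Collapse to one row per email; naming the grouping column EMAIL avoids
# the extra Group.1 column, so nothing has to be dropped afterwards.
merged <- aggregate(df[-1], by = list(EMAIL = df$EMAIL), FUN = first_present)
merged
```

On data where rows for the same email never disagree, this should give the same rows as the max() version. Where they do disagree, the earliest row wins instead of the alphabetically last value.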