Although the real data has more columns and observations, my data frame looks like this:
dt <- data.frame(hid = c(1, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4),
syear = c(2000, 2001, 2003, 2003, 2003, 2000, 2000, 2001, 2001, 2002, 2002),
employlvl = c("Full-time", "Part-time", "Part-time", "Unemployed", "Unemployed",
"Full-time", "Full-time", "Full-time", "Unemployed", "Part-time",
"Full-time"),
relhead = c("Head", "Head", "Head", "Partner", "other", "Head",
"Partner", "Head", "Partner", "Head", "Partner"))
| hid | syear | employlvl | relhead |
|-----|-------|-------------|-----------------------|
| 1 | 2000 | Full-time | Head |
| 2 | 2001 | Part-time | Head |
| 2 | 2003 | Part-time | Head |
| 2 | 2003 | Unemployed | Partner |
| 2 | 2003 | Unemployed | other |
| 4 | 2000 | Full-time | Head |
| 4 | 2000 | Full-time | Partner |
| 4 | 2001 | Full-time | Head |
| 4 | 2001 | Unemployed | Partner |
| 4 | 2002 | Part-time | Head |
| 4 | 2002 | Full-time | Partner |
I want to create one more column that shows the partner's employment level, and I expect the following output:
| hid | syear | employlvl | relhead | Partner |
|-----|-------|-------------|-----------------------|-------------------|
| 1 | 2000 | Full-time | Head | NA |
| 2 | 2001 | Part-time | Head | NA |
| 2 | 2003 | Part-time | Head | Unemployed |
| 2 | 2003 | Unemployed | Partner | NA |
| 2 | 2003 | Unemployed | other | NA |
| 4 | 2000 | Full-time | Head | Full-time |
| 4 | 2000 | Full-time | Partner | NA |
| 4 | 2001 | Full-time | Head | Unemployed |
| 4 | 2001 | Unemployed | Partner | NA |
| 4 | 2002 | Part-time | Head | Full-time |
| 4 | 2002 | Full-time | Partner | NA |
Currently, I use the following code:
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(hid, syear) %>%
filter(n() > 1) %>%
filter(`relhead` != "Child") %>%
spread(relhead, employlvl) %>%
mutate(Relation = "Head") %>%
rename(`Employment Partner` = Partner) %>%
select(-Head)
dt3 <- dt %>%
left_join(dt2, by = c("hid", "syear", "relhead" = "Relation"))
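For this small dataset, the code works perfectly fine. But when I run it on my full data, I get the following:
Error: Data source must be a dictionary
After checking my dataset carefully, I found that two columns had the same name. Once I renamed one of them, it worked fine.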
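As stated in the other answers, this is caused by non-unique names. I was able to reproduce the error by modifying your example (the third element of `relhead`):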
dt <- data.frame(
hid = c(1, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4),
syear = c(2000, 2001, 2003, 2003, 2003, 2000, 2000, 2001, 2001, 2002, 2002),
employlvl = c("Full-time", "Part-time", "Part-time", "Unemployed", "Unemployed",
"Full-time", "Full-time", "Full-time", "Unemployed", "Part-time",
"Full-time"),
relhead = c("Head", "Head", "Employment Partner", "Partner", "other", "Head",
"Partner", "Head", "Partner", "Head", "Partner")
)
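In this case, `spread` creates the first `Employment Partner` column and `rename` creates the second. You should check whether any of "Employment Partner", "Relation" (and perhaps `hid` and `syear`) appears in `dt$relhead` (the first gives you the error, the second is overwritten by `mutate(Relation=...)`).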
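The check described above can be sketched in base R. This is only an illustration: the vector `reserved` is a name introduced here, and its contents are the column names that the pipeline in the question creates or relies on.

```r
# Values of relhead from the modified example above.
relhead <- c("Head", "Head", "Employment Partner", "Partner", "other", "Head",
             "Partner", "Head", "Partner", "Head", "Partner")

# Column names the pipeline already uses or creates (assumed list, for illustration).
reserved <- c("hid", "syear", "Employment Partner", "Relation")

# Any value listed here would become a column name that clashes after spread().
clashes <- intersect(reserved, unique(relhead))
clashes
```

If `clashes` is not empty, those values would need to be recoded, or the new columns named differently, before spreading.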
Minimal reproducible example:
data_frame(g = c("a1","a2","a3"), i=1) %>%
spread(g, i) %>%
rename(a1 = a3) %>%
select(-a1)
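I got the same error message when I accidentally used the same new name twice in a `rename()` call from the `dplyr` package. Compare `names(df2)` with `unique(names(df2))`, since you may already have had identical variable names before that step.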
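As a rough sketch of that comparison in base R: `df2` below is a made-up data frame built only to show the check, and `check.names = FALSE` is needed because `data.frame()` would otherwise make the names unique on its own.

```r
# Hypothetical data frame with a duplicated column name.
df2 <- data.frame(a = 1, b = 2, a = 3, check.names = FALSE)

# TRUE when at least one column name occurs more than once.
has_duplicates <- length(names(df2)) != length(unique(names(df2)))

# The names that are duplicated.
dup_names <- unique(names(df2)[duplicated(names(df2))])
dup_names
```

Running this on the real data frame just before the failing step should show which names need to be changed.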
If the error occurs only after running `select(-Head)`, you may be able to work around it by using a base R command to achieve the same thing.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(hid, syear) %>%
filter(n() > 1) %>%
filter(`relhead` != "Child") %>%
spread(relhead, employlvl) %>%
mutate(Relation = "Head") %>%
rename(`Employment Partner` = Partner)
The part above is the same as the original code. After that, run the following command.
dt2$Head <- NULL
This is a base R command that removes the `Head` column, which is the same thing `select(-Head)` is trying to do.
Then you can run the rest of the code to join the data frames.
dt3 <- dt %>%
left_join(dt2, by = c("hid", "syear", "relhead" = "Relation"))
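Since you did not provide a reproducible example, we cannot work out what this error message really means, but perhaps this workaround will help you get the task done for now.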
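This is caused by a `rename` followed by a `select(-variable)` call. I had the same error, and when I removed the `rename` call and ran the same `select(-variable)`, it worked.
I don't know why this happens, but that is what triggers the error.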
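The problem (I think) is a difference in behavior between functions of the same name in `plyr` and `dplyr`. So when you have both loaded, you can get unexpected results. I have seen this with `group_by` and `summarize` as well.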
In general, I find the best way to deal with this is to use `dplyr::select`, `dplyr::rename`, and so on.
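A small sketch of what the namespace-qualified calls look like. The data frame `df` and its columns are invented for illustration and have nothing to do with the data in the question.

```r
library(dplyr)

# Hypothetical data frame, used only to show the qualified calls.
df <- data.frame(a = 1:3, b = 4:6)

# Spelling out dplyr:: means these calls always use dplyr's versions,
# regardless of which other packages are attached or in what order.
out <- df %>%
  dplyr::rename(total = b) %>%
  dplyr::select(-a)

names(out)
```

This is more typing, but the code no longer depends on the order in which the packages were loaded.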
Better still is not to use `plyr` at all, since `dplyr` already covers it, but I have some legacy code that uses `plyr`, so I am reluctant to mess with it.