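I recently switched from Stata to R.
In Stata there is something called value labels. For example, the command encode converts a string variable to numeric and attaches a string label to each number. Because string variables contain names (which are mostly repeated), value labels save a lot of space when working with large datasets.
Unfortunately, I have not managed to find a similar command in R. The only package I found that can attach labels to my vector of values is sjlabelled. It does attach them, but when I merge the labelled numeric vector into another data frame, the labels seem to "fall off".
Example: start with a string variable.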
paragraph <- "Melanija Knavs was born in Novo Mesto, and grew up in Sevnica, in the Yugoslav republic of Slovenia. She worked as a fashion model through agencies in Milan and Paris, later moving to New York City in 1996. Her modeling career was associated with Irene Marie Models and Trump Model Management"
install.packages("sjlabelled") # run once; skip if the package is already installed
library(sjlabelled)
sentences <- strsplit(paragraph, " ")
sentences <- unlist(sentences, use.names = FALSE)
# Now we have a vector of string values.
sentences_df <- as.data.frame(sentences)
sentences <- unique(sentences_df$sentences)
group_sentences <- seq_along(sentences)
sentences <- as.data.frame(sentences)
group_sentences <- as.data.frame(group_sentences)
z <- cbind(sentences, group_sentences)
z$group_sentences <- set_labels(z$group_sentences, labels = z$sentences)
sentences_df <- merge(sentences_df, z, by = "sentences")
get_labels(z$group_sentences) # the labels I attached with set_labels
get_labels(sentences_df$group_sentences) # the output is just NULL
Thanks!
P.S. Sorry that the code is not very elegant; as I said, I am still new to R.
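Source: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/
... Around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code thanks to Seth Falcon. What this meant was that effectively, character strings were hashed to an integer representation and stored in a global table in R. Anytime a given string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factor. Of course, you still needed to use 'factors' for the modeling functions. ...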
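To make the quoted point a little more concrete, here is a small sketch comparing a character vector with many repeated values against the same data stored as a factor. The city names and the vector length are made up for illustration, and the reported sizes will vary by platform and R version, so no specific numbers are claimed here.

```r
# Hypothetical data: 100,000 values drawn from only three distinct names.
x_chr <- rep(c("Milan", "Paris", "New York"), length.out = 1e5)
x_fct <- factor(x_chr)

# A factor stores integer codes plus one copy of each level.
nlevels(x_fct)
typeof(x_fct)

# Compare memory footprints; exact numbers depend on platform and R version.
print(object.size(x_chr), units = "Kb")
print(object.size(x_fct), units = "Kb")

# Converting back recovers the original strings.
identical(as.character(x_fct), x_chr)
```

Either representation avoids storing a full copy of each repeated string, which is roughly what Stata's value labels buy you.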
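I tweaked your initial test data a little. I was confused by all the strings and was not sure they were necessary for the question. If I missed something, please let me know. Here are my adjustments and my answer: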
#####################################
# initial problem rephrased
#####################################
library(sjlabelled)
# create test data
id <- 1:20
variable1 <- sample(30:35, 20, replace = TRUE)
variable2 <- sample(36:40, 20, replace = TRUE)
df1 <- data.frame(id, variable1)
df2 <- data.frame(id, variable2)
# set arbitrary labels
df1$variable1 <- set_labels(df1$variable1, labels = c("few" = 1, "lots" = 5))
# show labels in this frame
get_labels(df1)
# include associated values
get_labels(df1, values = "as.prefix")
# merge df1 and df2
df_merge <- merge(df1, df2, by = c('id'))
# labels lost after merge
get_labels(df_merge, values = "as.prefix")
#####################################
# solution with dplyr
#####################################
library(dplyr)
df_merge2 <- left_join(x = df1, y = df2, by = "id")
get_labels(df_merge2, values = "as.prefix")
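If you need to stay with base merge(), one possible workaround is to copy the label attribute back after merging. The sketch below assumes the value labels live in a "labels" attribute on the column (the convention that, as far as I know, sjlabelled and haven follow) and sets it by hand with attr() so that it runs without extra packages. The data values are made up.

```r
# Made-up test data; the "labels" attribute is set by hand here (assumption:
# this mirrors where sjlabelled stores value labels).
df1 <- data.frame(id = 1:5, variable1 = c(1, 5, 1, 5, 1))
df2 <- data.frame(id = 1:5, variable2 = c(36, 37, 38, 39, 40))
attr(df1$variable1, "labels") <- c(few = 1, lots = 5)

# Base merge() subsets the columns, which drops custom attributes.
df_merge <- merge(df1, df2, by = "id")
attr(df_merge$variable1, "labels")

# Copy the attribute back from the original column.
attr(df_merge$variable1, "labels") <- attr(df1$variable1, "labels")
attr(df_merge$variable1, "labels")
```

This only restores the attribute for one column; with many labelled columns you would loop over the column names.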
The solution is credited to:
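I am not sure I fully understand the question, but it seems to me that your code is a very roundabout way of achieving something quite simple: string values that also have numbers associated with them. In R, factors are strings arranged into "levels", and each level inherently has a number, so you can take advantage of that. For example, reusing part of your code: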
paragraph <- "Melanija Knavs was born in Novo Mesto, and grew up in Sevnica, in the Yugoslav republic of Slovenia. She worked as a fashion model through agencies in Milan and Paris, later moving to New York City in 1996. Her modeling career was associated with Irene Marie Models and Trump Model Management"
sentences <- strsplit(paragraph, " ")
sentences <- unique(unlist(sentences, use.names = FALSE))
sentences <- factor(sentences, levels=sentences)
data.frame(sentences, num.value = as.numeric(sentences))
    sentences num.value
1    Melanija         1
2       Knavs         2
3         was         3
4        born         4
5          in         5
6        Novo         6
7      Mesto,         7
8         and         8
9        grew         9
10         up        10
...
So, to access the numeric values, you just need to convert (or coerce) the factor to numeric.
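Unlike a hand-attached label attribute, a factor keeps its levels through merge(). A minimal sketch, with made-up data and hypothetical column names (city, score):

```r
# Made-up data: a repeated string variable encoded as a factor.
words <- c("Milan", "Paris", "Milan", "New York", "Paris")
df <- data.frame(id = seq_along(words),
                 city = factor(words, levels = unique(words)))
other <- data.frame(id = 1:5, score = c(10, 20, 30, 40, 50))

# Merge on id; the factor column keeps both its codes and its labels.
merged <- merge(df, other, by = "id")
levels(merged$city)        # the string labels
as.integer(merged$city)    # the underlying integer codes
as.character(merged$city)  # back to plain strings
```

Here the levels play the role of Stata's value labels and the integer codes play the role of the encoded values.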
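P.S. I do not understand why you call these "sentences": you split on spaces, so what you get are "words" of a sort, but they still include punctuation, which is probably not what you want. You could try something like
tidytext::unnest_tokens(data.frame(paragraph), word, paragraph)
instead.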
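If tidytext is not installed, a rough base R approximation is to strip punctuation before splitting on whitespace. This is only a sketch on a shortened, made-up input string, and it is not equivalent to unnest_tokens(), which among other things lowercases the tokens by default.

```r
# Shortened example text (made up from the paragraph above).
text <- "Novo Mesto, and grew up in Sevnica."

# Remove punctuation, then split on runs of whitespace.
clean <- gsub("[[:punct:]]+", "", text)
words <- strsplit(clean, "[[:space:]]+")[[1]]
words
```

Note that stripping all punctuation also removes apostrophes and hyphens inside words, which may or may not be what you want.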