I have a data frame containing words, and I want to extract the letters and bigrams (two-letter combinations) of each word.
Data:
df$text
[1] "table"
[2] "run"
[3] "mug"
In the end, I would like to get output like this:
1 a b c d e..z aa ab bb...zz
table 1 1 0 0 0..0 0 1 0 0
First, I tried to extract all the letters using quanteda:
library(quanteda)

text <- c("table", "run", "mug")
dict <- dictionary(list(a= "a",
b = "b",
c = "c",
d = "d",
e = "e",
f = "f",
g = "g",
h = "h",
i = "i",
j = "j",
k = "k",
l ="l",
m = "m",
n = "n",
o = "o",
p = "p",
q = "q",
r = "r",
s = "s",
t = "t",
u = "u",
v = "v",
w = "w",
x = "x",
y = "y",
z = "z"))
corp<- corpus(text)
tokens(corp) |>
tokens_lookup(dictionary = dict) |>
dfm()
But it did not work:
Document-feature matrix of: 3 documents, 26 features (100.00% sparse) and 0 docvars.
features
docs a b c d e f g h i j
text1 0 0 0 0 0 0 0 0 0 0
text2 0 0 0 0 0 0 0 0 0 0
text3 0 0 0 0 0 0 0 0 0 0
[ reached max_nfeat ... 16 more features ]
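I am completely new to this, so any hints on how to do it would be appreciated. Thanks!
The lookup returned only zeros because tokens() splits text into words by default, so each document is a single token such as "table", which matches none of the single-letter dictionary entries. For single letters, tokenize by character instead: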
toks <- tokens(corp, "character")
print(dfm(tokens_lookup(toks, dictionary=dict)), max_nfeat=26)
Document-feature matrix of: 3 documents, 26 features (85.90% sparse) and 0 docvars.
features
docs a b c d e f g h i j k l m n o p q r s t u v w x y z
text1 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
text2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0
text3 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
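For single letters, the dictionary may not be needed at all. The following is a minimal sketch, assuming the same three example words: it builds the dfm directly from the character tokens and then uses dfm_match() to pad the result to all 26 letters, so that letters absent from the data still get a zero column. The name letter_dfm is only illustrative.

```r
library(quanteda)

text <- c("table", "run", "mug")

# One token per character instead of one token per word
toks <- tokens(text, what = "character")

# Count the characters, then align the columns to a..z;
# letters that never occur are added as all-zero columns
letter_dfm <- dfm_match(dfm(toks), features = letters)

print(letter_dfm, max_nfeat = 26)
```

The counts should agree with the dictionary-based result above; the difference is only that no 26-entry dictionary has to be written out by hand.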
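As long as you set up the dictionary, you should be able to get two-letter combinations in a similar way. Two caveats apply to the code below: combn(letters, 2) yields only the 325 unordered pairs of distinct letters (so "ab" but not "ba" or "aa"), and because the tokens are still single characters, the two-letter entries have nothing to match. In the output, the ab column is 0 even for "table".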
twos <- apply(t(combn(letters, 2)), 1, \(x) paste0(x, collapse=""))
dict2 <- dictionary(setNames(as.list(c(letters, twos)), c(letters, twos)))
print(dfm(tokens_lookup(toks, dictionary=dict2)), max_nfeat=30)
Document-feature matrix of: 3 documents, 351 features (98.96% sparse) and 0 docvars.
features
docs a b c d e f g h i j k l m n o p q r s t u v w x y z ab ac ad ae
text1 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
text2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
text3 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
[ reached max_nfeat ... 321 more features ]
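To actually count adjacent letter pairs, one option is to form character bigrams with tokens_ngrams() rather than looking them up in a dictionary. This is a sketch under the assumption that "bigram" means two consecutive letters within a word; the names all_bigrams, toks12 and full_dfm are illustrative and not part of the original answer.

```r
library(quanteda)

text <- c("table", "run", "mug")
toks <- tokens(text, what = "character")

# n = 1:2 keeps the single letters and adds adjacent pairs;
# concatenator = "" joins a pair as "ta" rather than "t_a"
toks12 <- tokens_ngrams(toks, n = 1:2, concatenator = "")

# All 26 * 26 ordered pairs: aa, ab, ..., az, ba, ..., zz
all_bigrams <- paste0(rep(letters, each = 26), letters)

# Pad the dfm to the full feature set so unseen letters and pairs are 0
full_dfm <- dfm_match(dfm(toks12), features = c(letters, all_bigrams))

# Optional: a regular data frame, one row per word
res <- convert(full_dfm, to = "data.frame")
res$doc_id <- text
```

With this approach "table" contributes the pairs ta, ab, bl and le, and the padded matrix has 26 + 676 = 702 feature columns, matching the a..z, aa..zz layout asked for in the question.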