Letter and bigram composition of each word in a data frame


I have a data frame containing words, and I want to extract the letter and bigram composition of each word.

Data:

df$text

[1] "table"
[2] "run"
[3] "mug"

In the end I would like to get output like this:


 1      a b c d e..z aa ab bb...zz 
table   1 1 0 0 0..0  0 1  0    0

First, I tried to extract all the letters using quanteda:

library(quanteda)

text <- c("table", "run", "mug")

dict <- dictionary(list(a= "a",
                        b = "b",
                        c = "c",
                        d = "d", 
                        e = "e",
                        f = "f",
                        g = "g", 
                        h = "h",
                        i = "i",
                        j = "j",
                        k = "k",
                        l ="l",
                        m = "m",
                        n = "n",
                        o = "o",
                        p = "p",
                        q = "q",
                        r = "r", 
                        s = "s",
                        t = "t",
                        u = "u",
                        v = "v", 
                        w = "w",
                        x = "x",
                        y = "y",
                        z = "z"))


corp <- corpus(text)

tokens(corp) |>
  tokens_lookup(dictionary = dict) |>
  dfm()

But it did not work:

Document-feature matrix of: 3 documents, 26 features (100.00% sparse) and 0 docvars.
       features
docs    a b c d e f g h i j
  text1 0 0 0 0 0 0 0 0 0 0
  text2 0 0 0 0 0 0 0 0 0 0
  text3 0 0 0 0 0 0 0 0 0 0
[ reached max_nfeat ... 16 more features ]

I am completely new to this, so any hints on how to do it would be much appreciated. Thanks!

r nlp n-gram
1 answer

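For single letters, tokenize by character rather than by word. With the default word tokenizer each word stays a single token, so the one-letter dictionary entries never match. You can use the following code: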

toks <- tokens(corp, what = "character")
print(dfm(tokens_lookup(toks, dictionary = dict)), max_nfeat = 26)

Document-feature matrix of: 3 documents, 26 features (85.90% sparse) and 0 docvars.
       features
docs    a b c d e f g h i j k l m n o p q r s t u v w x y z
  text1 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
  text2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0
  text3 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
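
As a cross-check that does not depend on quanteda, the counts in this matrix can be reproduced in base R by splitting each word into its characters. This is only a sketch: `letter_counts` is an illustrative name, and the code assumes lowercase ASCII words like the ones in the example.

```r
# Base R cross-check of the letter counts (no quanteda needed).
text <- c("table", "run", "mug")

# Split each word into single characters and count how often each of a-z occurs.
letter_counts <- t(sapply(text, function(w) {
  chars <- strsplit(w, "", fixed = TRUE)[[1]]
  tabulate(match(chars, letters), nbins = 26)
}))
colnames(letter_counts) <- letters

# Rows are named after the words, columns after the letters.
letter_counts["table", c("a", "b", "e", "l", "t")]
```

Each row should agree with the corresponding `text1` to `text3` row of the document-feature matrix.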

As long as you set up the dictionary, you should be able to get the two-letter combinations in a similar way.

twos <- apply(t(combn(letters, 2)), 1, \(x) paste0(x, collapse=""))
dict2 <- dictionary(setNames(as.list(c(letters, twos)), c(letters, twos)))

print(dfm(tokens_lookup(toks, dictionary=dict2)), max_nfeat=30)

Document-feature matrix of: 3 documents, 351 features (98.96% sparse) and 0 docvars.
       features
docs    a b c d e f g h i j k l m n o p q r s t u v w x y z ab ac ad ae
  text1 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0  0  0  0  0
  text2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0  0  0  0  0
  text3 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0  0  0  0  0
[ reached max_nfeat ... 321 more features ]
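
In the output above the pair columns are all zero, including `ab` for "table". Two things appear to contribute. First, `toks` holds single-character tokens, so a two-letter dictionary value has no token to match. Second, `combn(letters, 2)` yields only the 325 unordered pairs of distinct letters, so doubled letters (`aa`) and reversed pairs (`ba`, `le`) are missing. One possible alternative, sketched here and not taken from the answer, builds the bigrams from adjacent characters with `tokens_ngrams()` and then pads the matrix to the full layout with `dfm_match()`. The names `toks_12`, `pairs_all`, `feats` and `dfmat` are illustrative.

```r
library(quanteda)

text <- c("table", "run", "mug")

# Tokenize into single characters, then form 1- and 2-grams from adjacent
# characters; concatenator = "" joins "a" and "b" as "ab" rather than "a_b".
toks <- tokens(text, what = "character")
toks_12 <- tokens_ngrams(toks, n = 1:2, concatenator = "")

# All 26 letters plus all 26 * 26 = 676 ordered pairs (aa, ab, ..., zz).
pairs_all <- paste0(rep(letters, each = 26), rep(letters, times = 26))
feats <- c(letters, pairs_all)

# dfm() keeps only the features that occur; dfm_match() pads the matrix out
# to the full a..z, aa..zz layout, filling absent features with zeros.
dfmat <- dfm_match(dfm(toks_12), features = feats)

# The bigrams in "table" are ta, ab, bl and le; "aa" is included for contrast.
as.matrix(dfmat)["text1", c("ta", "ab", "bl", "le", "aa")]
```

If the result is needed as a regular data frame, `convert(dfmat, to = "data.frame")` is one way to get it.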