Avoiding overlap in frequency and document frequency counts in Quanteda

Question

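Below is a dummy corpus of 4 documents.

The dictionary was developed to identify how often words or phrases occur in the corpus, and in how many documents each word or phrase occurs.

The word "Australians" appears in two dictionary keys (peep, indig). The keys are intended to be mutually exclusive.

Similarly, "australia" (oz and auspost), "foreign" (foreign and multinat) and "farm"/"farmer" (dairy and farmers) each appear in two dictionary keys, but each occurrence is meant to be counted only once.

The expected overall frequency counts, taken from the "pattern" column of the kwic table, are reported as x2 below. Note that one occurrence of "industry" is not assigned to the industry key, because it is part of the phrase "dairy industry", which is defined in the dairy key.

dairy is the most frequent key and appears in three documents. The document frequency of each key can be computed by counting the unique values in the "docname" column of the kwic table.

I have three questions:

Are there any issues that could affect the accuracy of the output from this approach?
Is there a better or more parsimonious way to achieve what I am trying to do?
What is the best way to extract, from the kwic table, the equivalent of textstat frequency count data?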

        library (quanteda)
        library(quanteda.textstats)

        txt <- c(doc1 = "A significant percent of all farms in Australia, are dairy. 
         Although there are a lot of dairy farms in this country, 
         it is not the biggest farm industry. The life of a farmer is not easy, a dairy 
        farmer has to be an early riser. ",
         doc2 = "Australian people like milk so a healthy dairy industry is important in 
         our country",
         doc3 = "Dairy and sheep farms developed at the expense of Indigenous 
         Australians. Further many companies  are now foreign-owned",
         doc4 = "Some farmers are lucky to receive a service from Australia Post. Mail is 
         sent to many foreign countries and received more quickly than 
         delivered in some locations in Australia.")



         # Tokenise first. This step is missing from the post, so tokens(txt) is an assumption.
         x <- tokens(txt)

         # Join multi-word expressions into single tokens (native pipe, R >= 4.1)
         x <- x |>
         tokens_compound(phrase("dairy farmers"), concatenator = " ") |>
         tokens_compound(phrase("dairy farms"), concatenator = " ") |>
         tokens_compound(phrase("dairy farm"), concatenator = " ") |>
         tokens_compound(phrase("dairy farming"), concatenator = " ") |>
         tokens_compound(phrase("dairy industry"), concatenator = " ") |>
         tokens_compound(phrase("indigenous australians"), concatenator = " ") |>
         tokens_compound(phrase("australia post"), concatenator = " ") |>
         tokens_compound(phrase("dairy farmer"), concatenator = " ")
         x

         # Each dictionary value is kept on one line so that no string contains a line break
         dict <- dictionary(list(
           multinat = c("offshore petroleum companies", "foreign-owned", "foreign owned",
                        "foreign companies", "multinational", "multinational oil companies",
                        "multinationals", "transnational"),
           dairy = c("dairy farmers", "dairy farms", "dairy farm", "dairy farming",
                     "dairy industry", "dairy farmer", "dairy", "milk"),
           auspost = "australia post",
           oz = c("australia", "this country", "our country"),
           farmers = c("farmers", "farmer", "farm", "farms"),
           foreign = c("foreign", "foreigner", "foreigners"),
           business = c("small business", "business", "businesses", "company", "companies"),
           indig = c("aboriginal", "aboriginals", "indigenous australians", "torres strait"),
           peep = c("australians", "people of australia", "australian people",
                    "people of this nation", "people of this country"),
           industry = c("industry", "industries")))

        kwicdict <- kwic(x, pattern = dict, window = 4)
        write.csv(kwicdict, "D:/Output/TEST.csv")

        DF <- read.csv("D:/Output/TEST.csv", header = TRUE)

        ## frequency count of the 'pattern' values in the kwic table
        x2 <- DF[, 8]
        table(x2)
        #> x2
        #>  auspost business    dairy  farmers  foreign    indig industry multinat       oz     peep
        #>        1        1        6        5        1        1        1        1        5        2
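
On the third question, a kwic object is itself a data frame, so the counts can be tabulated without writing a CSV file and reading it back. The sketch below uses a small made-up corpus and dictionary (toy_txt, toy_dict and the other object names are hypothetical and not from the post). It counts whatever kwic() returns, so matches that overlap across keys are still counted more than once.

```r
library(quanteda)

# Hypothetical toy data, for illustration only
toy_txt  <- c(d1 = "dairy farms need milk", d2 = "milk is dairy")
toy_dict <- dictionary(list(dairy = c("dairy", "milk"), farmers = "farms"))

# One row per match; the 'pattern' column holds the dictionary key
kw <- as.data.frame(kwic(tokens(toy_txt), pattern = toy_dict, window = 4))

# Frequency: number of matches per dictionary key
key_freq <- table(kw$pattern)

# Document frequency: number of distinct documents per dictionary key
key_docfreq <- tapply(kw$docname, kw$pattern, function(d) length(unique(d)))

key_freq
key_docfreq
```

Working on the object directly also avoids depending on the column position (DF[, 8]), which shifts if the CSV is written without row names.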

Answer 1

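I don't think kwic() is what you want. tokens_lookup() lets you specify that nested matches should be mutually exclusive across keys, not just within a key. Observe the difference below. (Also note the use of a wildcard in the dairy key.)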

library(quanteda)
#> Package version: 4.1.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)

txt <- c(doc1 = "A significant percent of all farms in Australia, are dairy. 
         Although there are a lot of dairy farms in this country, 
         it is not the biggest farm industry. The life of a farmer is not easy, a dairy 
        farmer has to be an early riser. ",
         doc2 = "Australian people like milk so a healthy dairy industry is important in 
         our country",
         doc3 = "Dairy and sheep farms developed at the expense of Indigenous 
         Australians. Further many companies  are now foreign-owned",
         doc4 = "Some farmers are lucky to receive a service from Australia Post. Mail is 
         sent to many foreign countries and received more quickly than 
         delivered in some locations in Australia.")

dict <- dictionary(list(multinat = c("offshore petroleum companies", "foreign-owned", 
                                     "foreign owned", "foreign companies", "multinational", 
                                     "multinational oil companies", "multinationals", "transnational"),
                        dairy = c("dairy farm*", "dairy industry", "dairy", "milk"),
                        auspost = "australia post",
                        oz = c("australia", "this country", "our country"),
                        farmers = c("farmers", "farmer", "farm", "farms"),
                        foreign = c("foreign", "foreigner", "foreigners"), 
                        business =c("small business", "business", "businesses", "company", "companies"),
                        indig = c("aboriginal", "aboriginals", "indigenous australians", "torres strait"),
                        peep = c("australians", "people of australia", "australian people", 
                                 "people of this nation", "people of this country"),
                        industry = c("industry", "industries")))

x <- tokens(txt)

# with overlap
tokens_lookup(x, dict) |>
    dfm()
#> Document-feature matrix of: 4 documents, 10 features (55.00% sparse) and 0 docvars.
#>       features
#> docs   multinat dairy auspost oz farmers foreign business indig peep industry
#>   doc1        0     3       0  2       5       0        0     0    0        1
#>   doc2        0     2       0  1       0       0        0     0    1        1
#>   doc3        1     1       0  0       1       0        1     1    1        0
#>   doc4        0     0       1  2       1       1        0     0    0        0

# without overlap
tokens_lookup(x, dict, nested_scope = "dictionary") |>
    dfm()
#> Document-feature matrix of: 4 documents, 10 features (60.00% sparse) and 0 docvars.
#>       features
#> docs   multinat dairy auspost oz farmers foreign business indig peep industry
#>   doc1        0     3       0  2       3       0        0     0    0        1
#>   doc2        0     2       0  1       0       0        0     0    1        0
#>   doc3        1     1       0  0       1       0        1     1    0        0
#>   doc4        0     0       1  1       1       1        0     0    0        0
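
If the goal is a single table of frequency and document frequency per key, the dfm built from the non-overlapping lookup can be passed to textstat_frequency() from quanteda.textstats, whose result has frequency and docfreq columns. Below is a minimal sketch on a made-up corpus (toy_txt and toy_dict are hypothetical names, not from the question). With the objects above, the same call would be applied to dfm(tokens_lookup(x, dict, nested_scope = "dictionary")).

```r
library(quanteda)
library(quanteda.textstats)

# Hypothetical toy data, for illustration only
toy_txt <- c(d1 = "Dairy farms and a farm industry",
             d2 = "The dairy industry in Australia",
             d3 = "Australia Post")
toy_dict <- dictionary(list(dairy    = c("dairy farm*", "dairy industry", "dairy"),
                            farmers  = c("farm", "farms"),
                            industry = "industry",
                            oz       = "australia",
                            auspost  = "australia post"))

# Shorter matches nested inside a longer match are dropped across all keys
lookup_dfm <- tokens(toy_txt) |>
    tokens_lookup(toy_dict, nested_scope = "dictionary") |>
    dfm()

# One row per key, with total count (frequency) and number of documents (docfreq)
freq <- textstat_frequency(lookup_dfm)
freq[, c("feature", "frequency", "docfreq")]
```

Here "farms" in "Dairy farms" and "industry" in "dairy industry" are absorbed by the dairy key, and "Australia" in "Australia Post" by auspost, so each occurrence is counted under one key only.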

Created on 2024-10-06 with reprex v2.1.1
