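Below is a dummy corpus of 4 documents.
The dictionary was developed to identify the frequency of words or phrases in the corpus, as well as the number of documents in which each word or phrase occurs.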
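The word "Australians" occurs in two dictionary keys (peep, indig). The key contents are intended to be mutually exclusive.
Similarly, "Australia" (oz and auspost), "foreign" (foreign and multinat) and farm/farmers (dairy and farmers) each occur in two dictionary keys, but each match is intended to be counted only once.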
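The expected overall frequency counts, extracted from the "pattern" column of the kwic table, are reported as x2 below. Note that one occurrence of the word "industry" is not assigned to the industry key, because it is matched as part of "dairy industry", which is defined in the dairy key.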
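dairy is the most frequent key and occurs in three documents. The document count can be computed for each key from the unique values in the "docname" column of the kwic table.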
I have three questions:
library(quanteda)
library(quanteda.textstats)
txt <- c(doc1 = "A significant percent of all farms in Australia, are dairy.
Although there are a lot of dairy farms in this country,
it is not the biggest farm industry. The life of a farmer is not easy, a dairy
farmer has to be an early riser. ",
doc2 = "Australian people like milk so a healthy dairy industry is important in
our country",
doc3 = "Dairy and sheep farms developed at the expense of Indigenous
Australians. Further many companies are now foreign-owned",
doc4 = "Some farmers are lucky to receive a service from Australia Post. Mail is
sent to many foreign countries and received more quickly than
delivered in some locations in Australia.")
# Tokenise the corpus (assumed step: default tokens() settings)
x <- tokens(txt)
# Join multi-word expressions into single tokens before the lookup
x <- x |>
  tokens_compound(phrase("dairy farmers"), concatenator = " ") |>
  tokens_compound(phrase("dairy farms"), concatenator = " ") |>
  tokens_compound(phrase("dairy farm"), concatenator = " ") |>
  tokens_compound(phrase("dairy farming"), concatenator = " ") |>
  tokens_compound(phrase("dairy industry"), concatenator = " ") |>
  tokens_compound(phrase("indigenous australians"), concatenator = " ") |>
  tokens_compound(phrase("australia post"), concatenator = " ") |>
  tokens_compound(phrase("dairy farmer"), concatenator = " ")
x
dict <- dictionary(list(
  multinat = c("offshore petroleum companies", "foreign-owned", "foreign owned",
               "foreign companies", "multinational", "multinational oil companies",
               "multinationals", "transnational"),
  dairy = c("dairy farmers", "dairy farms", "dairy farm", "dairy farming",
            "dairy industry", "dairy farmer", "dairy", "milk"),
  auspost = "australia post",
  oz = c("australia", "this country", "our country"),
  farmers = c("farmers", "farmer", "farm", "farms"),
  foreign = c("foreign", "foreigner", "foreigners"),
  business = c("small business", "business", "businesses", "company", "companies"),
  indig = c("aboriginal", "aboriginals", "indigenous australians", "torres strait"),
  peep = c("australians", "people of australia", "australian people",
           "people of this nation", "people of this country"),
  industry = c("industry", "industries")))
kwicdict <- kwic(x, pattern = dict, window = 4)
write.csv(kwicdict, "D:/Output/TEST.csv")
DF <- read.csv("D:/Output/TEST.csv", header = TRUE)
## obtaining frequency count of KWIC table 'pattern' values
> x2 <- DF[,8]
> table(x2)
x2
 auspost business    dairy  farmers  foreign    indig industry multinat       oz     peep
       1        1        6        5        1        1        1        1        5        2
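The two figures described above (matches per key, and the number of documents in which each key occurs) could be derived from the kwic table without the CSV round trip. The following is only a sketch on a toy two-document corpus that is not the data above; `toy`, `toy_dict`, `key_freq` and `key_docs` are illustrative names, and it assumes that `as.data.frame()` on a kwic object exposes the `docname` and `pattern` columns.

```r
library(quanteda)

# Toy corpus and dictionary, for illustration only (not the question's data).
toy <- c(d1 = "Dairy farms sell milk. Milk is popular.",
         d2 = "Farmers deliver milk daily.")
toy_dict <- dictionary(list(dairy = c("dairy", "milk"),
                            farmers = c("farmers", "farms")))

# One row per match; 'pattern' holds the dictionary key, 'docname' the document.
kw <- as.data.frame(kwic(tokens(toy), pattern = toy_dict, window = 4))

# Total number of matches per key (the same idea as table(x2)).
key_freq <- table(kw$pattern)

# Number of distinct documents in which each key was matched.
key_docs <- tapply(kw$docname, kw$pattern, function(d) length(unique(d)))

data.frame(key = names(key_freq),
           frequency = as.vector(key_freq),
           documents = as.vector(key_docs[names(key_freq)]))
```

Note that this only summarises whatever kwic() returns, so any match that is reported under two keys is still counted under both.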
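I don't think kwic() is what you want. tokens_lookup() lets you specify that nested matches should be mutually exclusive across keys (nested_scope = "dictionary"), not just within a key. Observe the difference below. (Also note the use of a wildcard for the dairy key.)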
library(quanteda)
#> Package version: 4.1.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)
txt <- c(doc1 = "A significant percent of all farms in Australia, are dairy.
Although there are a lot of dairy farms in this country,
it is not the biggest farm industry. The life of a farmer is not easy, a dairy
farmer has to be an early riser. ",
doc2 = "Australian people like milk so a healthy dairy industry is important in
our country",
doc3 = "Dairy and sheep farms developed at the expense of Indigenous
Australians. Further many companies are now foreign-owned",
doc4 = "Some farmers are lucky to receive a service from Australia Post. Mail is
sent to many foreign countries and received more quickly than
delivered in some locations in Australia.")
dict <- dictionary(list(multinat = c("offshore petroleum companies", "foreign-owned",
"foreign owned", "foreign companies", "multinational",
"multinational oil companies", "multinationals", "transnational"),
dairy = c("dairy farm*", "dairy industry", "dairy", "milk"),
auspost = "australia post",
oz = c("australia", "this country", "our country"),
farmers = c("farmers", "farmer", "farm", "farms"),
foreign = c("foreign", "foreigner", "foreigners"),
business =c("small business", "business", "businesses", "company", "companies"),
indig = c("aboriginal", "aboriginals", "indigenous australians", "torres strait"),
peep = c("australians", "people of australia", "australian people",
"people of this nation", "people of this country"),
industry = c("industry", "industries")))
x <- tokens(txt)
# with overlap
tokens_lookup(x, dict) |>
  dfm()
#> Document-feature matrix of: 4 documents, 10 features (55.00% sparse) and 0 docvars.
#>       features
#> docs   multinat dairy auspost oz farmers foreign business indig peep industry
#>   doc1        0     3       0  2       5       0        0     0    0        1
#>   doc2        0     2       0  1       0       0        0     0    1        1
#>   doc3        1     1       0  0       1       0        1     1    1        0
#>   doc4        0     0       1  2       1       1        0     0    0        0
# without overlap
tokens_lookup(x, dict, nested_scope = "dictionary") |>
  dfm()
#> Document-feature matrix of: 4 documents, 10 features (60.00% sparse) and 0 docvars.
#>       features
#> docs   multinat dairy auspost oz farmers foreign business indig peep industry
#>   doc1        0     3       0  2       3       0        0     0    0        1
#>   doc2        0     2       0  1       0       0        0     0    1        0
#>   doc3        1     1       0  0       1       0        1     1    0        0
#>   doc4        0     0       1  1       1       1        0     0    0        0
Created on 2024-10-06 with reprex v2.1.1
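The question also asks for the number of documents in which each key occurs. One possible way to get both figures from the non-overlapping lookup is textstat_frequency() from quanteda.textstats, which reports a total frequency and a document frequency per feature. This is a sketch rather than part of the reprex: it repeats the corpus and dictionary so that it runs on its own, and `dfm_keys` and `key_stats` are names chosen here for illustration.

```r
library(quanteda)
library(quanteda.textstats)

# Same corpus and dictionary as in the reprex, repeated so the block is self-contained.
txt <- c(doc1 = "A significant percent of all farms in Australia, are dairy. Although there are a lot of dairy farms in this country, it is not the biggest farm industry. The life of a farmer is not easy, a dairy farmer has to be an early riser. ",
         doc2 = "Australian people like milk so a healthy dairy industry is important in our country",
         doc3 = "Dairy and sheep farms developed at the expense of Indigenous Australians. Further many companies are now foreign-owned",
         doc4 = "Some farmers are lucky to receive a service from Australia Post. Mail is sent to many foreign countries and received more quickly than delivered in some locations in Australia.")

dict <- dictionary(list(
  multinat = c("offshore petroleum companies", "foreign-owned", "foreign owned",
               "foreign companies", "multinational", "multinational oil companies",
               "multinationals", "transnational"),
  dairy = c("dairy farm*", "dairy industry", "dairy", "milk"),
  auspost = "australia post",
  oz = c("australia", "this country", "our country"),
  farmers = c("farmers", "farmer", "farm", "farms"),
  foreign = c("foreign", "foreigner", "foreigners"),
  business = c("small business", "business", "businesses", "company", "companies"),
  indig = c("aboriginal", "aboriginals", "indigenous australians", "torres strait"),
  peep = c("australians", "people of australia", "australian people",
           "people of this nation", "people of this country"),
  industry = c("industry", "industries")))

# Key counts with nested matches resolved across the whole dictionary.
dfm_keys <- tokens(txt) |>
  tokens_lookup(dict, nested_scope = "dictionary") |>
  dfm()

# 'frequency' is the total count per key; 'docfreq' is the number of documents with a match.
key_stats <- textstat_frequency(dfm_keys)
key_stats[, c("feature", "frequency", "docfreq")]
```

Going by the non-overlapping matrix printed above, the dairy column sums to 6 and is non-zero in three documents, which agrees with the counts the question expects for that key.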