I am trying to identify and aggregate synonyms in a given dataset. See the sample data below.
library(tm)
library(SnowballC)
dataset <- c("dad glad accept large admit large accept dad big large big accept big accept dad dad Happy dad accept glad papa dad Happy dad glad dad dad papa admit Happy big accept accept big accept dad Happy admit Happy Happy glad Happy dad accept accept large daddy large accept large large large big daddy accept admit dad admit daddy dad admit dad admit Happy accept accept Happy daddy accept admit")
docs <- Corpus(VectorSource(dataset))
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
sort(rowSums(m),decreasing=TRUE)
Result:
accept    dad  happy  admit  large    big  daddy   glad   papa
    15     14      9      8      8      6      4      4      2
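I would like to use the wordnet package, which I have downloaded and installed, to find synonyms for each of the words above. For example, to get the synonyms of "accept", I can do this: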
library(wordnet)
setDict("C:/Program Files (x86)/WordNet/2.1/dict")
filter <- getTermFilter("ExactMatchFilter", "accept", TRUE)
terms <- getIndexTerms("VERB", 1, filter)
getSynonyms(terms[[1]])
Result:
[1] "accept" "admit" "assume" "bear" "consent" "go for" "have" "live with"
[9] "swallow" "take" "take on" "take over"
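Now I would like to combine these two result sets so that the synonyms are grouped as shown below. The most frequent word in each group is flagged (rank 1), and the data is later grouped by those words, something like: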
id  word    word_count  syn_group  rank
1   accept  15          1          1
5   admit   8           1          2
2   dad     14          2          1
8   daddy   4           2          2
9   papa    2           2          3
3   happy   9           3          1
7   glad    4           3          2
4   large   8           4          1
6   big     6           4          2
This could then be aggregated like so:
id  word    word_count
1   accept  15+8
2   dad     14+4+2
3   happy   9+4
4   large   8+6
The final result would then be:
id  word    word_count
1   accept  23
2   dad     20
3   large   14
4   happy   13
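I have run into several issues, including getting getIndexTerms to loop through the words regardless of whether they are nouns, verbs, etc. I hope this all makes sense. Any help would be greatly appreciated. Thank you.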
We can use dplyr to do the following:
library(dplyr)
df %>%
group_by(syn_group) %>%
mutate(sum_word_count = sum(word_count)) %>%
filter(rank == 1)
Data:
df <- read.table(text = "id word word_count syn_group rank
1 accept 15 1 1
5 admit 8 1 2
2 dad 14 2 1
8 daddy 4 2 2
9 papa 2 2 3
3 happy 9 3 1
7 glad 4 3 2
4 large 8 4 1
6 big 6 4 2", header = TRUE)
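As a cross-check, the same aggregation can be sketched in base R. This is only an illustrative alternative to the dplyr pipeline, not a replacement for it; it rebuilds the example data so that it runs on its own, and the name `res` is introduced here purely for illustration.

```r
# Copy of the example data, repeated so this block is self-contained
df <- read.table(text = "id word word_count syn_group rank
1 accept 15 1 1
5 admit 8 1 2
2 dad 14 2 1
8 daddy 4 2 2
9 papa 2 2 3
3 happy 9 3 1
7 glad 4 3 2
4 large 8 4 1
6 big 6 4 2", header = TRUE, stringsAsFactors = FALSE)

# Sum word_count within each synonym group and attach it to every row
df$sum_word_count <- ave(df$word_count, df$syn_group, FUN = sum)

# Keep the top-ranked word of each group, ordered by the summed count
res <- df[df$rank == 1, c("word", "sum_word_count")]
res <- res[order(-res$sum_word_count), ]
rownames(res) <- NULL
res
```

With the example data this gives accept 23, dad 20, large 14 and happy 13, which matches the final result asked for in the question.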
Next time, please post the output of dput.
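EDIT: Here is some code to get you started looping over the words and parts of speech, and storing the synonyms. What is left is to determine whether the current term is a synonym of a previous term, in which case you already have the synonyms and can assign the unique synonym group. Next, you will need to store some results. Finally, you need to compute the rank: seq_along the synonyms and grep for the rank position. The comments in the code hint at where you might want to add code for these steps.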
# m is the term-document matrix built in the question
# stringsAsFactors = FALSE keeps Term as character (matters on R < 4.0)
d <- data.frame(Term = row.names(m), word_count = m[, 1], stringsAsFactors = FALSE)
all_pos <- c("ADJECTIVE", "ADVERB", "NOUN", "VERB")
syns <- vector("list", length(all_pos))
for (w in seq(nrow(d))) {
  # if syns of d$Term[w] have already been calculated, skip over current w
  emf <- getTermFilter("ExactMatchFilter", d$Term[w], TRUE)
  for (i in seq_along(syns)) {
    terms <- getIndexTerms(all_pos[i], 1, emf)
    if (is.null(terms)) {
      # no entry for this part of speech
      syns[i] <- NA
    } else {
      syns[[i]] <- getSynonyms(terms[[1]])
    }
  }
  # store the results of syns for current w
}
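The remaining steps (assigning synonym groups and computing ranks) could be sketched roughly as follows. This is a minimal illustration, not a finished solution: `syn_list` is a hypothetical, hard-coded stand-in for the per-word results collected from getSynonyms(), not actual WordNet output, and the rank is computed with rank() inside ave() rather than with the seq_along and grep approach hinted at above.

```r
# Word counts taken from the question's term-document matrix
counts <- c(accept = 15, dad = 14, happy = 9, admit = 8, large = 8,
            big = 6, daddy = 4, glad = 4, papa = 2)

# Hypothetical stand-in for the stored getSynonyms() results
# (hard-coded for illustration only)
syn_list <- list(
  accept = c("accept", "admit"),
  dad    = c("dad", "daddy", "papa"),
  happy  = c("happy", "glad"),
  large  = c("large", "big")
)

# Walk the words from most to least frequent. A word that is still
# unassigned starts a new group and pulls in its unassigned synonyms.
words <- names(sort(counts, decreasing = TRUE))
syn_group <- setNames(rep(NA_integer_, length(words)), words)
next_group <- 0L
for (w in words) {
  if (!is.na(syn_group[[w]])) next  # already grouped via an earlier word
  next_group <- next_group + 1L
  members <- intersect(c(w, syn_list[[w]]), words)
  members <- members[is.na(syn_group[members])]
  syn_group[members] <- next_group
}

d <- data.frame(word = words,
                word_count = unname(counts[words]),
                syn_group = unname(syn_group[words]),
                stringsAsFactors = FALSE)

# Rank within each group by descending count (ties broken by position)
d$rank <- ave(-d$word_count, d$syn_group,
              FUN = function(x) rank(x, ties.method = "first"))
d <- d[order(d$syn_group, d$rank), ]
rownames(d) <- NULL
d
```

The resulting data frame has the word, word_count, syn_group and rank columns from the question, so it can be fed straight into the dplyr aggregation. One design caveat: because a word joins the first group that claims it, a word with several senses ends up in only one group.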
I have been able to automatically extract French synonyms from a website, as follows:
library(stringr)
library(pagedown)
library(pdftools)
path_Save_PDF <- "D:\\"
base_Url <- "https://dictionary.reverso.net/french-synonyms/"
words <- c("fâché")
nb_Words <- length(words)
list_Text <- list()
for(i in 1 : nb_Words)
{
print(i)
pdf_File <- paste0(path_Save_PDF, words[i], ".pdf")
chrome_print(input = paste0(base_Url, words[i]), output = pdf_File)
list_Text[[i]] <- pdftools::pdf_text(pdf_File)
list_Text[[i]] <- strsplit(x = list_Text[[i]], split = "\n")
}
save(list_Text, file = "list_Text.RData")
list_Synonymes <- list()
for(i in 1 : nb_Words)
{
print(i)
id_Lines_Synonymes <- which(str_detect(string = list_Text[[i]][[1]], pattern = "[:space:]{4,8}\\d{1,2}"))
text_Synonymes <- list_Text[[i]][[1]][id_Lines_Synonymes]
text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "Facebook®(.*)Visit Site")
text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "and post updates\\.")
text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "par extension au sens figuré")
text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "details")
text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "\\d")
text_Synonymes <- stringr::str_squish(text_Synonymes)
text_Synonymes <- paste0(text_Synonymes, collapse = ",")
text_Synonymes <- stringr::str_replace_all(string = text_Synonymes, pattern = "\\,\\,", replacement = "\\,")
text_Synonymes <- base::strsplit(text_Synonymes, ",")[[1]]
list_Synonymes[[i]] <- text_Synonymes
}
names(list_Synonymes) <- words
list_Synonymes
list_Synonymes
$fâché
[1] "dépité"
[2] " grognon"
[3] " mécontent"
[4] " morfondu"
[5] " transi"
[6] " horripilé"
[7] " irrité"
[8] " contrarié"
[9] " ennuyé"
[10] " frissonnant"
[11] "navré"
[12] " désolé"
[13] "vexé"
[14] " indisposé"
[15] " piqué"
[16] "en colère"
[17] " mécontent"
[18] "désolé"
[19] " navré"
[20] "brouillé avec quelqu'un"
[21] " en froid"
[22] " être incompétent dans un domaine particulier"
[23] " ne rien comprendre"
After that, the result can be used to group synonyms together.
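Before grouping, the scraped entries probably need some cleanup, since the output above contains leading spaces and repeated words. A minimal base R sketch, assuming a list shaped like list_Synonymes and using a few entries copied from the output above (the names `clean` and `lookup` are introduced here for illustration):

```r
# A few entries copied from the output above, including leading
# spaces and duplicates
list_Synonymes <- list(
  "fâché" = c("dépité", " grognon", " mécontent", "en colère",
              " mécontent", "désolé", " désolé")
)

# Trim surrounding whitespace, then drop repeated synonyms per word
clean <- lapply(list_Synonymes, function(x) unique(trimws(x)))

# Long-format lookup table: one row per (word, synonym) pair
lookup <- data.frame(
  word = rep(names(clean), lengths(clean)),
  synonym = unlist(clean, use.names = FALSE),
  stringsAsFactors = FALSE
)
lookup
```

A table in this shape can be joined against a word-count table to assign each counted word to the headword it is a synonym of.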