Understanding why grepl doesn't seem to identify words correctly

Question

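I'm trying to count the occurrences of a word in a document (as part of some research I'm doing into how politicians use language). I don't understand why the value I get in R differs from the value I get when I count the words independently.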

#Counting the occurrences of the word 'migrant' in a political debate
fileContent <- readLines("https://www.theyworkforyou.com/pwdata/scrapedxml/debates/debates2024-01-17c.xml")
wordToCount <- c("Migrant") 
wordCount <- sum(grepl(wordToCount, fileContent, ignore.case = TRUE))
wordCount #returns 20

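This returns 20, but if I open the document and press ctrl + f to search for "Migrant", I get 22 hits (I know the code above should match the string inside longer words as well as on its own).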

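I have also tried parsing the XML, but, even more confusingly, this returns only 18, even though there are still 22 hits when I manually check the parsed data again: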

#Same as above but parsing the xml
library(xml2) # provides read_xml(), xml_find_all() and xml_text()
fileContent <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/debates/debates2024-01-17c.xml")
fileContent <- xml_find_all(fileContent, ".//speech")
fileContent <- xml_text(fileContent)
wordToCount <- c("Migrant") 
wordCount <- sum(grepl(wordToCount, fileContent, ignore.case = TRUE))
wordCount #returns 18

#Outputting the data to double-check values
output <- file("output.txt")
writeLines(fileContent, output)
close(output)

Can anyone help me understand why neither of these two pieces of code returns 22?

Answer 1

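grepl returns TRUE if it finds migrant at least once in a string. If a string contains the word twice, it is still counted only once. See this example, which returns 1 even though the string contains two occurrences: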

sum(grepl("migrant", 
      c("Something about migrants. Something else about migrants ")))
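To make the difference concrete, here is a small sketch on made-up text (the `lines_version` and `speech_version` vectors are illustrative and not taken from the debate file). It suggests why reading the file line by line and extracting `speech` nodes can give different totals: `sum(grepl(...))` counts matching elements, so the result depends on how the text happens to be split up.

```r
# Toy text: the same three occurrences of "migrant", chunked two ways.
lines_version  <- c("Migrant workers arrived.",
                    "Many migrants stayed; other migrants left.")
speech_version <- paste(lines_version, collapse = " ")

# grepl() returns one TRUE/FALSE per element, so sum() counts elements,
# not occurrences.
sum(grepl("migrant", lines_version,  ignore.case = TRUE))  # 2 matching lines
sum(grepl("migrant", speech_version, ignore.case = TRUE))  # 1 matching speech
```

Neither number is the true count of three, because any element with more than one occurrence contributes only once.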

You can use the stringr package to do what you want:

library(xml2) # provides read_xml(), xml_find_all() and xml_text()
fileContent <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/debates/debates2024-01-17c.xml")
fileContent <- xml_find_all(fileContent, ".//speech")
fileContent <- xml_text(fileContent)
migrant_count <- stringr::str_count(tolower(fileContent), "migrant")
total_migrant_count <- sum(migrant_count)
print(total_migrant_count) # -> 22
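As a variation on the snippet above, `tolower()` can be avoided by asking stringr for a case-insensitive pattern. This is only a sketch on a made-up vector (`speeches` is an illustrative stand-in for the parsed speech texts), and it assumes the stringr package is installed.

```r
library(stringr)

# Illustrative stand-in for the vector of speech texts.
speeches <- c("Migrant workers and other migrants.",
              "No mention here.",
              "MIGRANTS were discussed.")

# regex(ignore_case = TRUE) matches regardless of case; str_count()
# returns the number of matches in each element.
per_speech <- str_count(speeches, regex("migrant", ignore_case = TRUE))
per_speech       # 2 0 1
sum(per_speech)  # 3
```

Keeping the per-element counts around can be handy if you later want to know which speeches use the word most.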

Answer 2

If the word occurs twice in a line, grepl() will return only a single TRUE:

grepl("migrant", "migrant workers and other migrants")
#> [1] TRUE

So if you want to count words, you can make sure each word is a separate element:

grepl("migrant", unlist(strsplit("migrant workers and other migrants", " ")), ignore.case = TRUE)
#> [1]  TRUE FALSE FALSE FALSE  TRUE

This gives 22 hits for your example document.
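The same idea extends to a vector of several lines. A minimal sketch, assuming whitespace is an adequate word separator (`txt` is made-up input, not the debate file):

```r
# Illustrative input: two "lines" with three occurrences in total.
txt <- c("migrant workers and other migrants", "a Migrant spoke")

# Split every element on runs of whitespace, then flatten the result
# so that each word is its own element.
words <- unlist(strsplit(txt, "\\s+"))

# Each word is now tested separately, so sum() counts matching words.
sum(grepl("migrant", words, ignore.case = TRUE))  # 3
```

One caveat: a single token that contained the pattern twice would still count once, so this counts matching words rather than raw occurrences.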

Answer 3

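gregexpr is the base R way to count multiple matches: it returns a list with one element per input string, each holding the vector of match positions in that string, or -1 if there is no match: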

gg <- gregexpr(wordToCount, fileContent, ignore.case=TRUE)
sum(unlist(gg)>-1)  ## 22

Or collapse the whole document into a single string and count the number of matches:

gg <- gregexpr(wordToCount, paste(fileContent, collapse = " "), ignore.case = TRUE)
length(gg[[1]]) ## 22
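Note that `length(gg[[1]])` is 1 when there is no match at all, because the result is then the single value -1. A hedged sketch of a small wrapper that avoids this by going through `regmatches()` (the name `count_matches` is hypothetical, not a base R function):

```r
# Hypothetical helper: count every occurrence of `pattern` across a
# character vector, returning 0 when nothing matches.
count_matches <- function(pattern, x, ignore.case = TRUE) {
  m <- gregexpr(pattern, x, ignore.case = ignore.case)
  # regmatches() yields character(0) for elements without a match,
  # so lengths() reports 0 there rather than 1.
  sum(lengths(regmatches(x, m)))
}

count_matches("migrant", c("Migrant workers and other migrants", "nothing here"))  # 2
count_matches("migrant", "nothing here")                                           # 0
```

With the variables from the question, the call would be `count_matches(wordToCount, fileContent)`.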
