I downloaded this medical dictionary into R:
url <- "https://archive.org/stream/azfamilymedicalencyclopedia/A-Z%20Family%20Medical%20Encyclopedia_djvu.txt"
destfile <- "A-Z_Family_Medical_Encyclopedia.txt" # Save it locally with this name
download.file(url, destfile)
file_content <- readLines(destfile, encoding = "UTF-8")
Is it possible to keep only the medical-related terms and remove everything else?
I know how to remove stop words, for example:
library(tm)
all_text <- paste(file_content, collapse = " ")
words <- unlist(strsplit(all_text, "\\W+"))
filtered_words <- words[!tolower(words) %in% stopwords("en")]
filtered_text <- paste(filtered_words, collapse = " ")
But is there something that keeps only the medical/health-related content, i.e. the technical/scientific vocabulary?
You can get a list of medical terms here (link). Then compare your words against that list with %in%, just as you did before for the stop words.
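The comparison could look roughly like the sketch below. It is only an illustration: `medical_terms` is a made-up four-word vector standing in for whatever term list you download, and `sample_text` stands in for the dictionary text, so both are assumptions rather than real data.

```r
# Hypothetical stand-in for a real medical term list. In practice you would
# load the downloaded list instead, e.g. with readLines() on a file that has
# one term per line.
medical_terms <- c("asthma", "insulin", "fracture", "antibiotic")

# Small sample text standing in for the dictionary content from the question.
sample_text <- "Asthma is a common condition; the doctor may prescribe an antibiotic or insulin."

# Split on non-word characters, as in the question, and drop empty strings.
words <- unlist(strsplit(sample_text, "\\W+"))
words <- words[nzchar(words)]

# Keep only the words that appear in the medical term list.
# Both sides are lower-cased so the match is case-insensitive.
medical_words <- words[tolower(words) %in% tolower(medical_terms)]
medical_text <- paste(medical_words, collapse = " ")

print(medical_text)
```

Note that this matches single words only, so multi-word terms in a list (for example "blood pressure") would not be caught by a plain `%in%` on the split words and would need separate handling.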