Error in unnest_tokens.data.frame(., entity, text, token = tokenize_scispacy_entities, : Expected output of tokenizing function to be a list of length 100
unnest_tokens() works fine on a small sample of observations, but fails on the full dataset.
Reproducible example from https://github.com/dgrtwo/cord19:
library(dplyr)
library(cord19)
library(tidyverse)
library(tidytext)
library(spacyr)
spacy_initialize("en_core_sci_sm")
tokenize_scispacy_entities <- function(text) {
  spacy_extract_entity(text) %>%
    group_by(doc_id) %>%
    nest() %>%
    pull(data) %>%
    map("text") %>%
    map(str_to_lower)
}
paragraph_entities <- cord19_paragraphs %>%
  select(paper_id, text) %>%
  sample_n(10) %>%
  unnest_tokens(entity, text, token = tokenize_scispacy_entities)
I am facing the same issue. I don't know why, but after filtering out rows where abstract == "", everything seems to work:
abstract_entities <- article_data %>%
  filter(abstract != "") %>%
  select(paper_id, abstract) %>%
  head(100) %>%
  unnest_tokens(entity, abstract, token = tokenize_scispacy_entities)
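The empty-abstract workaround points at a likely root cause: unnest_tokens() requires the tokenizing function to return a list with exactly one element per input row, but spacy_extract_entity() returns no rows at all for documents that yield no entities (such as empty abstracts), so the resulting list comes back shorter than the input. A minimal sketch of a padded tokenizer that keeps the lengths aligned, assuming spacyr's default doc_id naming of "text1", "text2", ... (an assumption worth checking against your spacyr version):

```r
tokenize_scispacy_entities <- function(text) {
  parsed <- spacy_extract_entity(text)
  # Assumption: spacy_extract_entity() labels documents "text1", "text2", ...
  # split() with explicit factor levels emits character(0) for any document
  # that produced no entity rows, instead of dropping it from the list.
  doc_levels <- paste0("text", seq_along(text))
  entities <- split(str_to_lower(parsed$text),
                    factor(parsed$doc_id, levels = doc_levels))
  unname(entities)
}
```

With the padding in place, documents with no recognized entities contribute an empty character vector rather than silently disappearing, so the list length always matches the number of rows and the filter(abstract != "") step should no longer be necessary.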