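I want to use spaCy to generate embeddings for text stored in a Polars DataFrame and store the results in the same DataFrame. Next, I want to save this DataFrame to disk and be able to load it again as a Polars DataFrame. Converting back from pandas to Polars raises an error.
This is the error message: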
ArrowInvalid: Could not convert Hello with type spacy.tokens.doc.Doc: did not recognize Python value type when inferring an Arrow data type
This is my code:
from io import StringIO
import polars as pl
import pandas as pd
import spacy
nlp = spacy.load("de_core_news_sm")
json_str = '[{"foo":"Hello","bar":6},{"foo":"What a lovely day","bar":7},{"foo":"Nice to meet you","bar":8}]'
# Initialize and store DataFrame
df = pl.read_json(StringIO(json_str))
df = df.with_columns(pl.col("foo").map_elements(lambda x: nlp(x)).alias("encoding"))
df.to_pandas().to_pickle('pickled_df.pkl')
# Load DataFrame
df_loaded_pd = pd.read_pickle('pickled_df.pkl')
df_loaded_pl = pl.from_pandas(df_loaded_pd)
These are the package versions I am using:
# Name Version Build Channel
pandas 2.2.3 py312hf9745cd_1 conda-forge
polars 1.9.0 py312hfe7c9be_0 conda-forge
spacy 3.7.2 py312h6db74b5_0
spacy-curated-transformers 0.2.2 pypi_0 pypi
spacy-legacy 3.0.12 pyhd8ed1ab_0 conda-forge
spacy-loggers 1.0.5 pyhd8ed1ab_0 conda-forge
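Thank you for your help!
spaCy objects in a Polars DataFrame can be stored using spaCy's native DocBin class. The following code generates Doc objects, serializes them to bytes, saves the DataFrame locally, and then loads it back successfully.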
from io import StringIO
from spacy.tokens import DocBin
import polars as pl
import spacy
nlp = spacy.load("de_core_news_md")
json_str = '[{"foo":"Hello","bar":6},{"foo":"What a lovely day","bar":7},{"foo":"Nice to meet you","bar":8}]'
doc = nlp("some text")
# Serialize Polars DataFrame: store each Doc as DocBin bytes
df = pl.read_json(StringIO(json_str))
df = df.with_columns(pl.col("foo").map_elements(lambda x: DocBin(docs=[nlp(x)]).to_bytes()).alias('binary_embedding'))
df.write_parquet('saved.pq')
# Deserialize Polars DataFrame: rebuild each Doc from its bytes
df_loaded = pl.read_parquet('saved.pq')
df_loaded = df_loaded.with_columns(pl.col('binary_embedding').map_elements(lambda x: list(DocBin().from_bytes(x).get_docs(nlp.vocab))[0]).alias("spacy_embedding"))
# Calculate similarity against the reference doc
df_loaded.with_columns(pl.col("spacy_embedding").map_elements(lambda x: doc.similarity(x), return_dtype=pl.Float64).alias('Score'))
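Serializing and deserializing spaCy objects with native Polars functions (e.g. df.write_parquet()) depends heavily on the model used. In the case above, the code only works with spaCy language models that include word vectors.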
nlp = spacy.load("de_core_news_sm") # Line 20 does not work
nlp = spacy.load("de_core_news_md") # Line 20 works
nlp = spacy.load("de_core_news_lg") # Line 20 works
nlp = spacy.load("de_dep_news_trf") # Line 20 does not work
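One way to check up front whether a loaded pipeline ships word vectors is to look at the width of its vector table. This is a rough sketch that assumes, as stated above, that the presence of word vectors is the deciding factor. The function name has_word_vectors is made up for this example, and a blank German pipeline is used so the snippet runs without any model installed.

```python
import spacy


def has_word_vectors(nlp) -> bool:
    """Hypothetical helper: True if the pipeline's vocab has a non-empty vector table."""
    return nlp.vocab.vectors_length > 0


# Assumption: a blank pipeline stands in for a model without word vectors.
# For a real check, replace it with e.g. spacy.load("de_core_news_md").
nlp = spacy.blank("de")
result = has_word_vectors(nlp)
print(result)
```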