How to save and load spaCy encodings in a Polars DataFrame


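I want to use spaCy to generate embeddings for text stored in a Polars DataFrame and store the results in the same DataFrame. Next, I want to save this DataFrame to disk and be able to load it again as a Polars DataFrame. Converting back from pandas to Polars raises an error.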

This is the error message:

ArrowInvalid: Could not convert Hello with type spacy.tokens.doc.Doc: did not recognize Python value type when inferring an Arrow data type

This is my code:

from io import StringIO
import polars as pl
import pandas as pd
import spacy


nlp = spacy.load("de_core_news_sm")
json_str = '[{"foo":"Hello","bar":6},{"foo":"What a lovely day","bar":7},{"foo":"Nice to meet you","bar":8}]'


# Initialize and store the DataFrame
df = pl.read_json(StringIO(json_str))
df = df.with_columns(pl.col("foo").map_elements(lambda x: nlp(x)).alias("encoding"))
df.to_pandas().to_pickle('pickled_df.pkl')

# Load the DataFrame
df_loaded_pd = pd.read_pickle('pickled_df.pkl')
df_loaded_pl = pl.from_pandas(df_loaded_pd)  # raises ArrowInvalid

These are the package versions I am using:

# Name                    Version                   Build  Channel
pandas                    2.2.3           py312hf9745cd_1    conda-forge
polars                    1.9.0           py312hfe7c9be_0    conda-forge
spacy                     3.7.2           py312h6db74b5_0  
spacy-curated-transformers 0.2.2                    pypi_0    pypi
spacy-legacy              3.0.12             pyhd8ed1ab_0    conda-forge
spacy-loggers             1.0.5              pyhd8ed1ab_0    conda-forge

Thanks for your help!

python dataframe spacy python-polars
1 Answer

Serialization and deserialization

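spaCy objects in a Polars DataFrame can be stored using spaCy's native DocBin class. The following code generates Doc objects, saves them locally as bytes in a Parquet file, and then loads them back successfully.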

from io import StringIO
from spacy.tokens import DocBin
import polars as pl
import spacy

nlp = spacy.load("de_core_news_md")
json_str = '[{"foo":"Hello","bar":6},{"foo":"What a lovely day","bar":7},{"foo":"Nice to meet you","bar":8}]'
doc = nlp("some text")

# Serialize the Polars DataFrame: store each Doc as DocBin bytes in a Binary column
df = pl.read_json(StringIO(json_str))
df = df.with_columns(pl.col("foo").map_elements(lambda x: DocBin(docs=[nlp(x)]).to_bytes(), return_dtype=pl.Binary).alias('binary_embedding'))
df.write_parquet('saved.pq')

# Deserialize the Polars DataFrame: rebuild each Doc from its bytes
df_loaded = pl.read_parquet('saved.pq')
df_loaded = df_loaded.with_columns(pl.col('binary_embedding').map_elements(lambda x: list(DocBin().from_bytes(x).get_docs(nlp.vocab))[0]).alias("spacy_embedding"))

# Calculate similarity to the reference doc
df_loaded.with_columns(pl.col("spacy_embedding").map_elements(lambda x: doc.similarity(x), return_dtype=pl.Float64).alias('Score'))

Applying functions to deserialized spaCy objects

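Whether spaCy objects that were serialized and deserialized with native Polars functions (e.g. df.write_parquet()) can be used afterwards depends heavily on the model. In the case above, the code only works when a spaCy language model that includes word vectors is used.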

nlp = spacy.load("de_core_news_sm")  # line 20 (similarity) does not work
nlp = spacy.load("de_core_news_md")  # line 20 (similarity) works
nlp = spacy.load("de_core_news_lg")  # line 20 (similarity) works
nlp = spacy.load("de_dep_news_trf")  # line 20 (similarity) does not work
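One way to check up front whether a pipeline ships static word vectors is to look at its vocab. This is a rough heuristic rather than a guarantee that similarity will behave as intended; the function name supports_similarity is made up for this sketch, and a blank pipeline is assumed here as a stand-in for a model without vectors.

```python
import spacy


def supports_similarity(nlp) -> bool:
    """Heuristic: True if the pipeline's vocab carries static word vectors."""
    return nlp.vocab.vectors_length > 0


# Assumption: a blank pipeline stands in for a model without word vectors.
nlp_blank = spacy.blank("de")
has_vectors = supports_similarity(nlp_blank)

# Docs from such a pipeline report that they have no vector either.
doc_has_vector = nlp_blank("Hallo Welt").has_vector
```

Running the same check on a loaded model before the similarity step makes it possible to fail early with a clear message instead of in the middle of map_elements.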
