I have this code, which appends two slices of the same data to a PyArrow table. The second write fails because the column gets assigned the null type. I understand why it does this. Is there a way to force it to use the type from the table schema, rather than the type inferred from the data on the second write?
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
data = {
'col1': ['A', 'A', 'A', 'B', 'B'],
'col2': [0, 1, 2, 1, 2]
}
df1 = pd.DataFrame(data)
df1['col3'] = 1
df2 = df1.copy()
df2['col3'] = pd.NA
pat1 = pa.Table.from_pandas(df1)
pat2 = pa.Table.from_pandas(df2)
writer = pq.ParquetWriter('junk.parquet', pat1.schema)
writer.write_table(pat1)
writer.write_table(pat2)
The error I get from the second write above:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/anaconda3/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 1094, in write_table
raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
col1: string
col2: int64
col3: null
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 578 vs.
file:
col1: string
col2: int64
col3: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 577
The problem is that the assignment of pd.NA results in an incorrect dtype (object):
df2 = df1.copy()
df2['col3'] = pd.NA
print(df2.dtypes)
col1 object
col2 int64
col3 object
dtype: object
Just change it to Int64 first, using Series.astype:
df2['col3'] = pd.NA
df2['col3'] = df2['col3'].astype('Int64')
Or using pd.Series:
df2['col3'] = pd.Series(pd.NA, dtype='Int64')
Both result in:
pat2.schema
col1: string
col2: int64
col3: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 577