Force a PyArrow table write to ignore the null type and use the column's original schema type

Question

I have this code, which appends two batches of the same data to a PyArrow table. The second write fails because the column is assigned the null type. I understand why it does this. Is there a way to force it to use the type from the table schema rather than the type inferred from the data on the second write?

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

data = {
    'col1': ['A', 'A', 'A', 'B', 'B'],
    'col2': [0, 1, 2, 1, 2]
}

df1 = pd.DataFrame(data)
df1['col3'] = 1

df2 = df1.copy()
df2['col3'] = pd.NA

pat1 = pa.Table.from_pandas(df1)
pat2 = pa.Table.from_pandas(df2)

writer = pq.ParquetWriter('junk.parquet', pat1.schema)
writer.write_table(pat1)
writer.write_table(pat2)

The error I get on the second write above:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 1094, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file: 
table:
col1: string
col2: int64
col3: null
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 578 vs. 
file:
col1: string
col2: int64
col3: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 577

python-3.x pandas pyarrow
1 Answer

The problem is that the assignment of pd.NA results in the wrong dtype (object):

df2 = df1.copy()
df2['col3'] = pd.NA

print(df2.dtypes)

col1    object
col2     int64
col3    object
dtype: object

Just change it to Int64 first, using Series.astype:

df2['col3'] = pd.NA
df2['col3'] = df2['col3'].astype('Int64')

Or in one statement, using pd.Series:

df2['col3'] = pd.Series(pd.NA, dtype='Int64')

Both result in:

pat2.schema

col1: string
col2: int64
col3: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 577