How to read a Parquet file with a condition using pyarrow in Python

Question

I have created a Parquet file with three columns (id, author, title) from a database, and I want to read it back with a condition (title='Learn Python'). Below is the Python code I am using for this POC.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyodbc


def write_to_parquet(df, out_path, compression='SNAPPY'):
    arrow_table = pa.Table.from_pandas(df)
    if compression == 'UNCOMPRESSED':
        compression = None
    pq.write_table(arrow_table, out_path, use_dictionary=False,
                   compression=compression)


def read_pyarrow(path, use_threads=False):
    # Recent pyarrow releases replaced `nthreads` with the boolean `use_threads`.
    return pq.read_table(path, use_threads=use_threads).to_pandas()


path = './test.parquet'
sql = "SELECT * FROM [dbo].[Book] (NOLOCK)"

conn = pyodbc.connect(r'Driver={SQL Server};Server =.;Database = APP_BBG_RECN;Trusted_Connection = yes;')

df = pd.io.sql.read_sql(sql, conn)

write_to_parquet(df, path)

df1 = read_pyarrow(path)

How can I add the condition (title='Learn Python') to the read_pyarrow method?

Answer 1

Filters are now available in read_table:

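# Pass the file path (not the DataFrame) as the first argument.
# "year" is an illustrative column; the question's file has only id, author, title.
table = pq.read_table(
    path,
    filters=[("title", "in", {"Learn Python"}),
             ("year", ">=", 1950)],
)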

Answer 2

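This is not supported yet. We intend to develop this feature in the future. For now, I recommend converting the Arrow table to pandas and filtering there.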
