AWS Wrangler - Pandas read_sql to S3 in a limited-memory environment

Question

I'm looking for a way to extract data from a database and push it to a Parquet dataset in S3 in an environment with limited memory. If I proceed like this:

with someDB.connect() as connect:
    df = pd.read_sql("SELECT * FROM table", connect)
    wr.s3.to_parquet(df, dataset=True, path="s3://flo-bucket/")

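The pandas DataFrame (df) is loaded fully into memory and only then pushed to S3 by wrangler, so if the DataFrame is too large, the operation fails. I would like to split the data into chunks and pass those chunks to a process (it doesn't have to be wrangler) that sends them to S3 incrementally in Parquet format. Is this possible? I found examples that use IO buffers for CSV files, but I don't think that is possible with Parquet.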

Answer 1

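You can read the data in small chunks, because the read_sql function supports this through its chunksize parameter.

Try something like this: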

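import pandas as pd
import awswrangler as wr

# Target prefix of the Parquet dataset in S3
path = "s3://bucket/prefix"

def write_parquet_chunk(chunk, path, index=False):
    # mode="append" adds new files to the dataset instead of replacing it
    wr.s3.to_parquet(
        df=chunk,
        path=path,
        dataset=True,
        mode="append",
        index=index
    )

# Number of rows per chunk; tune this to the memory you have available
chunksize = 100

# someDB is a placeholder for your own database engine or connection
with someDB.connect() as connect:
    query = "SELECT * FROM table"
    # With chunksize set, read_sql returns an iterator of DataFrames
    chunks = pd.read_sql(query, connect, chunksize=chunksize)

    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}")
        write_parquet_chunk(chunk, path)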

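If you want to check the chunking behaviour without a real database or bucket, here is a minimal sketch. It assumes an in-memory SQLite table as a stand-in for the database. The names copy_in_chunks, demo and written are made up for this illustration and are not part of pandas or awswrangler. The writer is passed in as a callable, so a function like write_parquet_chunk above could be passed in its place.

```python
import sqlite3
from typing import Callable

import pandas as pd


def copy_in_chunks(
    connection,
    query: str,
    write_chunk: Callable[[pd.DataFrame], None],
    chunksize: int = 50_000,
) -> int:
    """Stream the query result chunk by chunk and return the number of rows handled."""
    total_rows = 0
    # With chunksize set, read_sql yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_sql(query, connection, chunksize=chunksize):
        write_chunk(chunk)  # hypothetical writer, e.g. a wrapper around wr.s3.to_parquet
        total_rows += len(chunk)
    return total_rows


# Demo with an in-memory SQLite table (assumption: stands in for the real database).
demo_connection = sqlite3.connect(":memory:")
pd.DataFrame({"id": range(10)}).to_sql("demo", demo_connection, index=False)

written = []  # collects the chunks instead of sending them to S3
rows_copied = copy_in_chunks(
    demo_connection, "SELECT * FROM demo", written.append, chunksize=4
)
chunk_lengths = [len(chunk) for chunk in written]
demo_connection.close()
```

Only one chunk is held by the loop at a time, so peak memory on the pandas side is bounded by chunksize rather than by the table size. Depending on the database driver, chunksize alone may not stop the driver from buffering the full result set on the client, in which case a server-side cursor may be needed.
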
Reference: https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html
