Can't get Polars to read Hive-layout Parquet files from S3: 404 Not Found

Question

I'm struggling to read data from S3 through Polars. I keep getting:

Client error with status 404 Not Found

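The data is laid out in S3 in what I believe is Hive partitioning (although this is our first time using it, so we may have missed something). See the notes at the end.

The credentials come from boto3. I'm confident they are correct in boto3, because I can use boto3 to perform other operations on the same data: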

from datetime import date

import boto3.session
import polars

session = boto3.session.Session()
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "region": session.region_name,
    "session_token": credentials.token,
}

url = "s3://my-example-bucket/staging/extract/contracts/*.parquet"

frame = polars.scan_parquet(url, storage_options=storage_options)

Neither of these works:

result = frame.filter(polars.col("record_date") == date(year=2024, month=1, day=1)).collect()

result = frame.collect()

The error is:

polars.exceptions.ComputeError: 'parquet scan' failed
The reason: Object at location staging/extract/contracts not found: Client error with status 404 Not Found: No Body:

An example key in the bucket is:

staging/extract/contracts/record_date=2024-01-01/contracts_0_0_2024-02-15T16:21:51.975005+00:00.parquet

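Notes:

This is our first time using Hive partitioning, so we can't entirely rule out a problem there. As far as we know, the only thing Polars needs is for the Parquet files to be present in that layout, i.e. no other metadata files exist. If the problem is with how we have laid out the data, adding other metadata files or changing the keys is an option.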

Answer 1

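As jqurious pointed out, the answer is that the glob pattern needs the right number of stars to match the number of partition levels: n partition levels need n+1 stars, one per partition directory plus one for the file name.

So for a single partition level (record_date=...), there must be two stars: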

s3://foo/bar/*/*.parquet

rather than

s3://foo/bar/*.parquet
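
To make the n+1 rule concrete, here is a minimal sketch that builds the pattern for a given number of partition levels and checks it against the example key from the question. The helper names hive_glob and matches_per_segment are hypothetical, and the matcher assumes that a star never matches across a "/" boundary, which is consistent with the behaviour described above but is not taken from the Polars source.

```python
from fnmatch import fnmatchcase


def hive_glob(base_url: str, partition_levels: int) -> str:
    """Build a glob with one '*' per partition directory plus one for the file name."""
    if partition_levels < 0:
        raise ValueError("partition_levels must be non-negative")
    return base_url.rstrip("/") + "/*" * partition_levels + "/*.parquet"


def matches_per_segment(pattern: str, key: str) -> bool:
    """Match key against pattern, assuming '*' never crosses a '/' boundary."""
    pattern_parts = pattern.split("/")
    key_parts = key.split("/")
    if len(pattern_parts) != len(key_parts):
        return False
    return all(fnmatchcase(k, p) for k, p in zip(key_parts, pattern_parts))


# Example key and prefix taken from the question.
key = (
    "staging/extract/contracts/record_date=2024-01-01/"
    "contracts_0_0_2024-02-15T16:21:51.975005+00:00.parquet"
)
prefix = "staging/extract/contracts"

print(hive_glob(prefix, 0))  # one star: the pattern that produced the 404
print(matches_per_segment(hive_glob(prefix, 0), key))  # False
print(hive_glob(prefix, 1))  # two stars: one partition level
print(matches_per_segment(hive_glob(prefix, 1), key))  # True
```

With the question's bucket, the two-star result would be prefixed with s3://my-example-bucket/ and passed to polars.scan_parquet together with the same storage_options.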
