Warning when reading a Delta table with Polars, but everything works


I have some data stored in Delta format on a MinIO server, which I start as follows:

minio server --address :9010 --console-address :19010 ./data

When I then try to read it with Polars, I see a delay of about 1 second, then a warning, and after that everything works fine.

Warning:

[2024-04-08T07:17:19Z WARN  aws_config::imds::region] failed to load region from IMDS err=failed to load IMDS session token: dispatch failure: timeout: error trying to connect: HTTP connect timeout occurred after 1s: HTTP connect timeout occurred after 1s: timed out (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 1s }), connection: Unknown } }) }))
[2024-04-08T07:17:19Z WARN  aws_config::imds::region] failed to load region from IMDS err=failed to load IMDS session token: dispatch failure: io error: error trying to connect: tcp connect error: Host is down (os error 64): tcp connect error: Host is down (os error 64): Host is down (os error 64) (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 64, kind: Uncategorized, message: "Host is down" })), connection: Unknown } }) }))
[2024-04-08T07:17:19Z WARN  aws_config::imds::region] failed to load region from IMDS err=failed to load IMDS session token: dispatch failure: io error: error trying to connect: tcp connect error: Host is down (os error 64): tcp connect error: Host is down (os error 64): Host is down (os error 64) (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 64, kind: Uncategorized, message: "Host is down" })), connection: Unknown } }) }))

Code:

import polars as pl

minio_storage_options = {
    "AWS_ENDPOINT_URL": "http://localhost:9010",
    "AWS_ACCESS_KEY_ID": "minioadmin",
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_REGION": "<localhost>",  # Unused??
    "AWS_ALLOW_HTTP": "true",  # Required
}

df = pl.read_delta("s3://reddit-submissions/submissions-raw", storage_options=minio_storage_options)
print(df.head())

What am I doing wrong here?

❯ uv pip freeze | grep "delta\|polars"
deltalake==0.16.4
polars==0.20.18
❯ python
Python 3.11.5 (main, Aug 24 2023, 15:09:45) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> ^D
1 Answer

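There are currently two ways to "fix" this. Both remove the warning and generally improve performance, because the failing authentication method (the IMDS lookup shown in the warning) is no longer attempted.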

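  1. Set an environment variable (always works)

Code:

import os

# Set this before the first Delta/S3 call in the process.
os.environ["AWS_EC2_METADATA_DISABLED"] = "true"

  2. Set the parameters in storage_options (this did not seem to take effect on the first run, which was delayed by about 3 seconds, but for some reason it worked on later runs)

Code: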

df.write_delta(
  target=s3_path,
  overwrite_schema=True,
  mode="overwrite",
  storage_options = {
      "AWS_S3_ALLOW_UNSAFE_RENAME": "true",  # Required if we don't use a LockClient
      "AWS_REGION": "x",
      "AWS_ACCESS_KEY_ID": "x",
      "AWS_SECRET_ACCESS_KEY": "x",
      "AWS_SESSION_TOKEN": "x"
  }
)
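
For the read from the question, the first approach can be sketched as follows. This is only a minimal sketch, not a confirmed fix for every setup: the endpoint, credentials, and table path are copied from the question, the region value "us-east-1" is an assumed placeholder, and read_submissions is a hypothetical helper name introduced here for illustration.

```python
import os

# Set before any Delta/S3 call so that the EC2 instance metadata (IMDS)
# lookup seen in the warning is not attempted.
os.environ["AWS_EC2_METADATA_DISABLED"] = "true"

# Values copied from the question; adjust them for your own MinIO instance.
MINIO_STORAGE_OPTIONS = {
    "AWS_ENDPOINT_URL": "http://localhost:9010",
    "AWS_ACCESS_KEY_ID": "minioadmin",
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_REGION": "us-east-1",  # assumption: placeholder region
    "AWS_ALLOW_HTTP": "true",  # needed because the endpoint is plain HTTP
}


def read_submissions(table_uri="s3://reddit-submissions/submissions-raw"):
    """Hypothetical helper: read the Delta table from the question."""
    # Imported lazily so this module can be loaded without polars installed.
    import polars as pl

    return pl.read_delta(table_uri, storage_options=MINIO_STORAGE_OPTIONS)
```

With the MinIO server from the question running, print(read_submissions().head()) should behave like the original snippet. The environment variable is set at import time on purpose, so it is already in place when the first S3 client is created.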