Creating an Iceberg table using S3 as the source

Question

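I'm working on a small setup with Iceberg, Trino, Hive Metastore, and S3.

I can create Iceberg tables in S3 using the Trino CLI.

Now I want to take a sample Parquet file stored in S3 and create an Iceberg table from its data. I can't figure out which configuration or Trino command I need in order to use a Parquet file in S3 as my source.

Any help would be great. TIA!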

my-trino.yaml

image:
  tag: "463"
  pullPolicy: IfNotPresent
server:
  workers: 2
  config:
    properties:
      # S3 configurations
      "fs.s3.aws.credentials.provider": "org.apache.hadoop.fs.s3a.WebIdentityTokenCredentialsProvider"
      "fs.s3.endpoint": "s3.us-west-2.amazonaws.com"
      "fs.s3.region": "us-west-1"
      "fs.s3.use-instance-credentials": "false"
      "fs.s3.use-web-identity-token-credentials-provider": "true"
      "fs.s3.path-style-access": "true"
      # Iceberg configurations
      "iceberg.max-splits-per-scan": "1"
      # Discovery configurations
      "discovery.uri": "http://trino-coordinator:8080"
serviceAccount:
  create: false
  name: trino-service-account
coordinator:
  service:
    type: LoadBalancer
    port: 8080
    name: trino-coordinator
  jvm:
    maxHeapSize: "3G"
  resources:
    limits:
      memory: "4Gi"
      cpu: "2"
    requests:
      memory: "2Gi"
      cpu: "1"
worker:
  service:
    type: ClusterIP
    name: trino-worker
  jvm:
    maxHeapSize: "3G"
    -XX:+ExitOnOutOfMemoryError: ""
    -XX:+HeapDumpOnOutOfMemoryError: ""
    -XX:HeapDumpPath: "/tmp/dump.hprof"
  config:
    discovery.uri: "http://trino-coordinator:8080"
  resources:
    limits:
      memory: "4Gi"
      cpu: "2"
    requests:
      memory: "2Gi"
      cpu: "1"
catalogs:
  iceberg: |-
    connector.name=iceberg
    hive.metastore.uri=thrift://hivems-hive-metastore.arvind.svc.cluster.local:9083
    iceberg.catalog.type=hive_metastore
    s3.aws-access-key=******
    s3.aws-secret-key=******
    s3.path-style-access=true
    fs.native-s3.enabled=true
    iceberg.unique-table-location=true
  hive: |-
    connector.name=hive
    hive.metastore.uri=thrift://hivems-hive-metastore.arvind.svc.cluster.local:9083
    hive.non-managed-table-writes-enabled=true

my-values.yaml

# The base hadoop image to use for all components.
# See this repo for image build details: https://github.com/Comcast/kube-yarn/tree/master/image
postgresql:
  postgresqlUsername: hive
  postgresqlPassword: hive
  postgresqlDatabase: metastore

  initdbScriptsConfigMap: hive-metastore-postgresql-init
image:
  repository: jboothomas/hive-metastore-s3
  tag: v6
  pullPolicy: IfNotPresent

resources: {}
conf:
  hiveSite:
    hive_metastore_uris: thrift://hivems-hive-metastore:9083
    fs.s3a.access.key: *******
    fs.s3a.secret.key: *******
    hive.metastore.warehouse.dir: s3a://my-iceberg-trino-bucket/iceberg-warehouse
    fs.s3a.connection.ssl.enabled: false

Answer 1

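It looks like you've provided a configuration file for setting up Trino, with settings specific to S3, Iceberg, and other related components. Here are some observations and suggestions based on the provided configuration: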

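S3 configuration:

fs.s3.region is set to us-west-1, but fs.s3.endpoint is s3.us-west-2.amazonaws.com. Make sure the region and the endpoint agree; a mismatch can show up as request-signing or redirect errors when Trino talks to S3.

fs.s3.aws.credentials.provider is set to use WebIdentityTokenCredentialsProvider, which suits environments such as Kubernetes that use IAM roles for service accounts.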
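As a minimal sketch of the region/endpoint point above, the fragment below keeps both values in the same region. It assumes the bucket lives in us-west-2; that is a guess, so substitute whichever region actually hosts the bucket. The keys are copied unchanged from the question's my-trino.yaml.

```yaml
server:
  config:
    properties:
      # Assumption: the bucket is in us-west-2. Use the bucket's real region.
      "fs.s3.endpoint": "s3.us-west-2.amazonaws.com"
      "fs.s3.region": "us-west-2"
```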

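Iceberg configuration:

iceberg.max-splits-per-scan is set to "1", which may help control the number of splits per scan operation and so could tune performance for specific workloads.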

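Service account:

Service account creation is disabled (create: false), and a specific service account name (trino-service-account) is provided. Make sure this service account exists and has the necessary permissions.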
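For reference, a pre-created service account for this setup might look like the manifest below. It is a sketch that assumes the cluster runs on Amazon EKS with IAM roles for service accounts; the namespace and role ARN are placeholders, not values from the question.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: trino-service-account
  # Placeholder: the namespace Trino is installed into.
  namespace: <NAMESPACE>
  annotations:
    # Placeholder ARN: the IAM role must grant access to the S3 bucket.
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<TRINO_S3_ROLE>
```

With create: false in the Helm values, the chart will not create this object, so it has to be applied separately before the Trino pods start.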

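Coordinator and worker configuration:

The worker JVM settings include options for handling out-of-memory errors and writing heap dumps, which helps with debugging; the coordinator section sets only maxHeapSize. Resource limits and requests are defined for memory and CPU on both, so the pods are allocated the resources they need.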
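In the question's worker section the -XX flags appear as keys under jvm, which may not be what the chart expects. A sketch of an alternative, assuming the Trino Helm chart version in use exposes an additionalJVMConfig list for each role (check the chart's values.yaml before relying on this):

```yaml
worker:
  jvm:
    maxHeapSize: "3G"
  # Assumption: the chart supports additionalJVMConfig; each entry is one JVM flag.
  additionalJVMConfig:
    - "-XX:+ExitOnOutOfMemoryError"
    - "-XX:+HeapDumpOnOutOfMemoryError"
    - "-XX:HeapDumpPath=/tmp/dump.hprof"
```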

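Catalog configuration:

The Iceberg and Hive catalogs are configured with the necessary details, such as the metastore URI and S3 credentials. Make sure the credentials (s3.aws-access-key and s3.aws-secret-key) are managed securely and are not exposed.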
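One way to keep the keys out of the values file is Trino's environment-variable substitution in property files, written as ${ENV:NAME}. The sketch below assumes two hypothetical environment variables, TRINO_S3_ACCESS_KEY and TRINO_S3_SECRET_KEY, are injected into every Trino pod from a Kubernetes Secret; how they get injected depends on the chart and is not shown.

```yaml
catalogs:
  iceberg: |-
    connector.name=iceberg
    hive.metastore.uri=thrift://hivems-hive-metastore.arvind.svc.cluster.local:9083
    iceberg.catalog.type=hive_metastore
    fs.native-s3.enabled=true
    s3.path-style-access=true
    # Assumption: both variables are set on every Trino pod from a Secret.
    s3.aws-access-key=${ENV:TRINO_S3_ACCESS_KEY}
    s3.aws-secret-key=${ENV:TRINO_S3_SECRET_KEY}
    iceberg.unique-table-location=true
```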

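PostgreSQL configuration:

The PostgreSQL settings for the Hive metastore include the username, password, and database name. Make sure these credentials are correct and that the database is initialized properly with the provided init script.

Image and resources:

The image for the Hive metastore is specified along with its pull policy. Make sure the image is available and accessible.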

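Additional configuration:

The hiveSite configuration includes settings for the Hive metastore URI, the S3 access keys, and the warehouse directory. Make sure these settings are correct and that the S3 bucket is configured properly.

Overall, the configuration appears well structured and suitable for deploying Trino with Iceberg and Hive on a Kubernetes cluster. Be sure to validate all settings and test the deployment in a staging environment before moving to production.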
