Iceberg with Hive Metastore does not create a catalog in Spark and falls back to the default


I am running into some (unexpected?) behavior: a catalog reference in Spark is not reflected in the Hive Metastore. I have followed the Spark configuration according to the documentation, which suggests that a new catalog with the corresponding name should be created. Everything works as expected, except that the catalog is never inserted into the Hive Metastore. This has some consequences, which I will demonstrate with an example.

Here is an example script in PySpark:

import os
from pyspark.sql import SparkSession

deps = [
    "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.1",
    "org.apache.iceberg:iceberg-aws:1.2.1",
    "software.amazon.awssdk:bundle:2.17.257",
    "software.amazon.awssdk:url-connection-client:2.17.257"
]
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {','.join(deps)} pyspark-shell"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
os.environ["AWS_REGION"] = "eu-east-1"


catalog = "hive_catalog"
spark = SparkSession.\
    builder.\
    appName("Iceberg Reader").\
    config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions").\
    config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog").\
    config(f"spark.sql.catalog.{catalog}.type", "hive").\
    config(f"spark.sql.catalog.{catalog}.uri", "thrift://localhost:9083").\
    config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") .\
    config(f"spark.sql.catalog.{catalog}.s3.endpoint", "http://localhost:9000").\
    config(f"spark.sql.catalog.{catalog}.warehouse", "s3a://lakehouse").\
    config("hive.metastore.uris", "thrift://localhost:9083").\
    enableHiveSupport().\
    getOrCreate()

# Raises an error because no catalog named "wrong_catalog" is configured
spark.sql("CREATE NAMESPACE wrong_catalog.new_db;")

# Correct creation of namespace
spark.sql(f"CREATE NAMESPACE {catalog}.new_db;")

# Create table
spark.sql(f"CREATE TABLE {catalog}.new_db.new_table (col1 INT, col2 STRING);")

# Insert data
spark.sql(f"INSERT INTO {catalog}.new_db.new_table VALUES (1, 'first'), (2, 'second');")

# Read data
spark.sql(f"SELECT * FROM {catalog}.new_db.new_table;").show()
#+----+------+
#|col1|  col2|
#+----+------+
#|   1| first|
#|   2|second|
#+----+------+

# Read metadata
spark.sql(f"SELECT * FROM {catalog}.new_db.new_table.files;").show()
#+-------+--------------------+-----------+-------+------------+------------------+------------------+----------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
#|content|           file_path|file_format|spec_id|record_count|file_size_in_bytes|      column_sizes|    value_counts|null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|    readable_metrics|
#+-------+--------------------+-----------+-------+------------+------------------+------------------+----------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
#|      0|s3a://lakehouse/n...|    PARQUET|      0|           1|               652|{1 -> 47, 2 -> 51}|{1 -> 1, 2 -> 1}| {1 -> 0, 2 -> 0}|              {}|{1 -> ���, 2 -> ...|{1 -> ���, 2 -> ...|        null|          [4]|        null|            0|{{47, 1, 0, null,...|
#|      0|s3a://lakehouse/n...|    PARQUET|      0|           1|               660|{1 -> 47, 2 -> 53}|{1 -> 1, 2 -> 1}| {1 -> 0, 2 -> 0}|              {}|{1 -> ���, 2 -> ...|{1 -> ���, 2 -> ...|        null|          [4]|        null|            0|{{47, 1, 0, null,...|
#+-------+--------------------+-----------+-------+------------+------------------+------------------+----------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+

So far everything looks fine: a namespace and a table were created, and data was inserted into the table. Dumping the Hive Metastore's CTLGS table, however, reveals the problem:

|CTLG_ID|NAME|DESC                    |LOCATION_URI    |
|-------|----|------------------------|----------------|
|1      |hive|Default catalog for Hive|s3a://lakehouse/|
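
For reference, the dumps in this question come from querying the metastore's backing database directly. A minimal sketch of how to do that, assuming a PostgreSQL-backed metastore (the host, port, database name, and credentials below are placeholders):

import psycopg2  # assumes a PostgreSQL-backed Hive Metastore

# Placeholder connection details for the metastore's backing database
conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="metastore", user="hive", password="hive"
)

with conn, conn.cursor() as cur:
    # CTLGS holds one row per metastore catalog; a fresh HMS 3.x schema
    # contains only the default "hive" catalog
    cur.execute('SELECT "CTLG_ID", "NAME", "DESC", "LOCATION_URI" FROM "CTLGS"')
    for row in cur.fetchall():
        print(row)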

No new catalog with the configured name was inserted. The namespace and the table, on the other hand, did make it into the Hive Metastore (DBS and TBLS):

|DB_ID|DESC                 |DB_LOCATION_URI          |NAME   |OWNER_NAME    |OWNER_TYPE|CTLG_NAME|
|-----|---------------------|-------------------------|-------|--------------|----------|---------|
|1    |Default Hive database|s3a://lakehouse/         |default|public        |ROLE      |hive     |
|2    |                     |s3a://lakehouse/new_db.db|new_db |thijsvandepoll|USER      |hive     |


|TBL_ID|CREATE_TIME|DB_ID|LAST_ACCESS_TIME|OWNER         |OWNER_TYPE|RETENTION |SD_ID|TBL_NAME |TBL_TYPE      |VIEW_EXPANDED_TEXT|VIEW_ORIGINAL_TEXT|IS_REWRITE_ENABLED|
|------|-----------|-----|----------------|--------------|----------|----------|-----|---------|--------------|------------------|------------------|------------------|
|1     |1683707647 |2    |80467           |thijsvandepoll|USER      |2147483647|1    |new_table|EXTERNAL_TABLE|                  |                  |0                 |
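
Note that the TBLS row is only a pointer: for an Iceberg table, the Hive Metastore additionally records the location of the current metadata file as a table parameter (the metadata_location key in TABLE_PARAMS), while the data itself lives in the warehouse path. From the Spark session above, the registered properties can be inspected like this (reusing catalog and the table created earlier):

# Show what the catalog reports for the table; for Iceberg tables this
# includes entries such as format and current-snapshot-id
spark.sql(f"SHOW TBLPROPERTIES {catalog}.new_db.new_table").show(truncate=False)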

This means it used the default Hive catalog instead of the provided name. I am not sure whether this is expected or unexpected behavior. So far everything else works fine. The problem appears, however, when we want to create the same namespace again under a different catalog:

import os
from pyspark.sql import SparkSession

deps = [
    "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.1",
    "org.apache.iceberg:iceberg-aws:1.2.1",
    "software.amazon.awssdk:bundle:2.17.257",
    "software.amazon.awssdk:url-connection-client:2.17.257"
]
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {','.join(deps)} pyspark-shell"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
os.environ["AWS_REGION"] = "eu-east-1"

catalog = "other_catalog"
spark = SparkSession.\
    builder.\
    appName("Iceberg Reader").\
    config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions").\
    config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog").\
    config(f"spark.sql.catalog.{catalog}.type", "hive").\
    config(f"spark.sql.catalog.{catalog}.uri", "thrift://localhost:9083").\
    config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") .\
    config(f"spark.sql.catalog.{catalog}.s3.endpoint", "http://localhost:9000").\
    config(f"spark.sql.catalog.{catalog}.warehouse", "s3a://lakehouse").\
    config("hive.metastore.uris", "thrift://localhost:9083").\
    enableHiveSupport().\
    getOrCreate()

# Raises an error because the namespace already exists in the shared metastore
spark.sql(f"CREATE NAMESPACE {catalog}.new_db;")
# pyspark.sql.utils.AnalysisException: Namespace 'new_db' already exists

# Create another namespace
spark.sql(f"CREATE NAMESPACE {catalog}.other_db;")

# The table created under hive_catalog is readable through this catalog
spark.sql(f"SELECT * FROM {catalog}.new_db.new_table;").show()
#+----+------+
#|col1|  col2|
#+----+------+
#|   1| first|
#|   2|second|
#+----+------+

Now we can see that even though we reference another catalog, it still implicitly uses the default Hive catalog. We can confirm this by looking at DBS in the Hive Metastore:

|DB_ID|DESC                 |DB_LOCATION_URI            |NAME    |OWNER_NAME    |OWNER_TYPE|CTLG_NAME|
|-----|---------------------|---------------------------|--------|--------------|----------|---------|
|1    |Default Hive database|s3a://lakehouse/           |default |public        |ROLE      |hive     |
|2    |                     |s3a://lakehouse/new_db.db  |new_db  |thijsvandepoll|USER      |hive     |
|3    |                     |s3a://lakehouse/other_db.db|other_db|thijsvandepoll|USER      |hive     |

Essentially this means that Iceberg and the Hive Metastore have no real notion of separate catalogs here: there is just one flat list of namespaces and tables. What look like distinct catalogs in Spark are in fact a single one.
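
To make the aliasing explicit, here is a minimal sketch (the names catalog_a and catalog_b are made up for this demonstration, and the dependency and S3 environment setup from the scripts above is assumed): registering two differently named Spark catalogs against the same metastore URI shows that they see exactly the same namespaces.

from pyspark.sql import SparkSession

# Two Spark catalog names, both backed by the same Hive Metastore
spark = SparkSession.\
    builder.\
    appName("Catalog Alias Demo").\
    config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions").\
    config("spark.sql.catalog.catalog_a", "org.apache.iceberg.spark.SparkCatalog").\
    config("spark.sql.catalog.catalog_a.type", "hive").\
    config("spark.sql.catalog.catalog_a.uri", "thrift://localhost:9083").\
    config("spark.sql.catalog.catalog_b", "org.apache.iceberg.spark.SparkCatalog").\
    config("spark.sql.catalog.catalog_b.type", "hive").\
    config("spark.sql.catalog.catalog_b.uri", "thrift://localhost:9083").\
    getOrCreate()

# Both list the same namespaces (default, new_db, other_db), because each
# Iceberg HiveCatalog talks to the single default "hive" catalog in the HMS
spark.sql("SHOW NAMESPACES IN catalog_a").show()
spark.sql("SHOW NAMESPACES IN catalog_b").show()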

Can anyone help me understand what is going on? Am I missing a configuration option? Is this expected behavior or a bug? Thanks in advance!

python apache-spark hive hive-metastore apache-iceberg
1 Answer
0 votes

Did you ever find a solution to this problem?
