I have run into some (unexpected?) behavior: catalog references in Spark are not reflected in the Hive Metastore. I followed the Spark configuration from the documentation, which looks like it should create a new catalog with the corresponding name. Everything works as expected, except that the catalog is never inserted into the Hive Metastore. This has some consequences, which I will demonstrate with an example.
Here is an example script in PySpark:
import os
from pyspark.sql import SparkSession

deps = [
    "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.1",
    "org.apache.iceberg:iceberg-aws:1.2.1",
    "software.amazon.awssdk:bundle:2.17.257",
    "software.amazon.awssdk:url-connection-client:2.17.257",
]
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {','.join(deps)} pyspark-shell"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
os.environ["AWS_REGION"] = "eu-east-1"

catalog = "hive_catalog"
spark = (
    SparkSession.builder
    .appName("Iceberg Reader")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register an Iceberg catalog named `hive_catalog`, backed by the Hive Metastore
    .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{catalog}.type", "hive")
    .config(f"spark.sql.catalog.{catalog}.uri", "thrift://localhost:9083")
    .config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config(f"spark.sql.catalog.{catalog}.s3.endpoint", "http://localhost:9000")
    .config(f"spark.sql.catalog.{catalog}.warehouse", "s3a://lakehouse")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate()
)
# Raises an error (wrong_catalog is not a configured catalog)
spark.sql("CREATE NAMESPACE wrong_catalog.new_db;")
# Correct creation of namespace
spark.sql(f"CREATE NAMESPACE {catalog}.new_db;")
# Create table
spark.sql(f"CREATE TABLE {catalog}.new_db.new_table (col1 INT, col2 STRING);")
# Insert data
spark.sql(f"INSERT INTO {catalog}.new_db.new_table VALUES (1, 'first'), (2, 'second');")
# Read data
spark.sql(f"SELECT * FROM {catalog}.new_db.new_table;").show()
#+----+------+
#|col1|  col2|
#+----+------+
#|   1| first|
#|   2|second|
#+----+------+
# Read metadata
spark.sql(f"SELECT * FROM {catalog}.new_db.new_table.files;").show()
#+-------+--------------------+-----------+-------+------------+------------------+------------------+----------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
#|content| file_path|file_format|spec_id|record_count|file_size_in_bytes| column_sizes| value_counts|null_value_counts|nan_value_counts| lower_bounds| upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id| readable_metrics|
#+-------+--------------------+-----------+-------+------------+------------------+------------------+----------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
#| 0|s3a://lakehouse/n...| PARQUET| 0| 1| 652|{1 -> 47, 2 -> 51}|{1 -> 1, 2 -> 1}| {1 -> 0, 2 -> 0}| {}|{1 -> ..., 2 -> ...|{1 -> ..., 2 -> ...| null| [4]| null| 0|{{47, 1, 0, null,...|
#| 0|s3a://lakehouse/n...| PARQUET| 0| 1| 660|{1 -> 47, 2 -> 53}|{1 -> 1, 2 -> 1}| {1 -> 0, 2 -> 0}| {}|{1 -> ..., 2 -> ...|{1 -> ..., 2 -> ...| null| [4]| null| 0|{{47, 1, 0, null,...|
#+-------+--------------------+-----------+-------+------------+------------------+------------------+----------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
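As an aside, the .files suffix above is one of Iceberg's standard metadata tables; others such as snapshots and history can be queried the same way (same table as above, nothing specific to this setup):

# Standard Iceberg metadata tables, queried with the same path syntax
spark.sql(f"SELECT * FROM {catalog}.new_db.new_table.snapshots;").show()
spark.sql(f"SELECT * FROM {catalog}.new_db.new_table.history;").show()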
So far everything looks fine: it created a namespace, created a table, and inserted data into it. Now, dumping the Hive Metastore reveals the problem (CTLGS):
|CTLG_ID|NAME|DESC |LOCATION_URI |
|-------|----|------------------------|----------------|
|1 |hive|Default catalog for Hive|s3a://lakehouse/|
No new catalog with the corresponding name was inserted. We can also see that the namespace and the table were in fact inserted into the Hive Metastore (DBS and TBLS):
|DB_ID|DESC |DB_LOCATION_URI |NAME |OWNER_NAME |OWNER_TYPE|CTLG_NAME|
|-----|---------------------|-------------------------|-------|--------------|----------|---------|
|1 |Default Hive database|s3a://lakehouse/ |default|public |ROLE |hive |
|2 | |s3a://lakehouse/new_db.db|new_db |thijsvandepoll|USER |hive |
|TBL_ID|CREATE_TIME |DB_ID|LAST_ACCESS_TIME|OWNER |OWNER_TYPE|RETENTION |SD_ID|TBL_NAME |TBL_TYPE |VIEW_EXPANDED_TEXT|VIEW_ORIGINAL_TEXT|IS_REWRITE_ENABLED|
|------|-------------|-----|----------------|--------------|----------|-------------|-----|---------|--------------|------------------|------------------|------------------|
|1     |1683707647   |2    |80467           |thijsvandepoll|USER      |2147483647   |1    |new_table|EXTERNAL_TABLE|                  |                  |0                 |
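For context, the CTLGS, DBS and TBLS dumps above come straight from the metastore's backing database. A minimal sketch of how to pull them, assuming a PostgreSQL-backed metastore (the connection details here are hypothetical placeholders; adjust for your setup):

import psycopg2  # assumes the metastore uses a PostgreSQL backing store

# Placeholder credentials, not part of the setup above
conn = psycopg2.connect(host="localhost", port=5432, dbname="metastore",
                        user="hive", password="hive")
cur = conn.cursor()
for table in ("CTLGS", "DBS", "TBLS"):
    # The Hive schema creates upper-case table names, hence the quoting
    cur.execute(f'SELECT * FROM "{table}"')
    print(table, cur.fetchall())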
This means it used the Hive default catalog instead of the configured name. I am not sure whether this is expected behavior or not; everything works fine up to this point. The problem shows up, however, when we want to create the same namespace again in another catalog:
import os
from pyspark.sql import SparkSession

deps = [
    "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.1",
    "org.apache.iceberg:iceberg-aws:1.2.1",
    "software.amazon.awssdk:bundle:2.17.257",
    "software.amazon.awssdk:url-connection-client:2.17.257",
]
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {','.join(deps)} pyspark-shell"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
os.environ["AWS_REGION"] = "eu-east-1"

# Identical configuration, but with a different catalog name
catalog = "other_catalog"
spark = (
    SparkSession.builder
    .appName("Iceberg Reader")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{catalog}.type", "hive")
    .config(f"spark.sql.catalog.{catalog}.uri", "thrift://localhost:9083")
    .config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config(f"spark.sql.catalog.{catalog}.s3.endpoint", "http://localhost:9000")
    .config(f"spark.sql.catalog.{catalog}.warehouse", "s3a://lakehouse")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate()
)
# Fails: the namespace already exists, even though we reference a "different" catalog
spark.sql(f"CREATE NAMESPACE {catalog}.new_db;")
# pyspark.sql.utils.AnalysisException: Namespace 'new_db' already exists
# Create another namespace
spark.sql(f"CREATE NAMESPACE {catalog}.other_db;")
# Try to access the data from the other catalog using the current catalog
spark.sql(f"SELECT * FROM {catalog}.new_db.new_table;").show()
#+----+------+
#|col1|  col2|
#+----+------+
#|   1| first|
#|   2|second|
#+----+------+
Now we can see that even though we reference a different catalog, it still implicitly uses the Hive default catalog. We can confirm this by looking at DBS in the Hive Metastore:
|DB_ID|DESC |DB_LOCATION_URI |NAME |OWNER_NAME |OWNER_TYPE|CTLG_NAME|
|-----|---------------------|---------------------------|--------|--------------|----------|---------|
|1 |Default Hive database|s3a://lakehouse/ |default |public |ROLE |hive |
|2 | |s3a://lakehouse/new_db.db |new_db |thijsvandepoll|USER |hive |
|3 | |s3a://lakehouse/other_db.db|other_db|thijsvandepoll|USER |hive |
Essentially, this means that Iceberg and the Hive Metastore have no working notion of catalogs here: there is just a flat list of namespaces and tables that can be defined. What looks like multiple catalogs is in fact a single one.
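To make the aliasing explicit: if both Spark-side names are registered in a single session against the same metastore, they end up interchangeable. A minimal sketch (same config values as the scripts above; the S3/warehouse settings are omitted for brevity):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Catalog alias check")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Two Spark-side catalog names pointing at the same metastore
    .config("spark.sql.catalog.hive_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hive_catalog.type", "hive")
    .config("spark.sql.catalog.hive_catalog.uri", "thrift://localhost:9083")
    .config("spark.sql.catalog.other_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.other_catalog.type", "hive")
    .config("spark.sql.catalog.other_catalog.uri", "thrift://localhost:9083")
    .getOrCreate()
)

# Both names list the same namespaces, because both resolve to the
# metastore's single built-in "hive" catalog:
spark.sql("SHOW NAMESPACES IN hive_catalog").show()
spark.sql("SHOW NAMESPACES IN other_catalog").show()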
Can anyone help me understand what is going on? Am I missing some configuration? Is this expected behavior or a bug? Thanks in advance!
Did you ever find a solution to this?