I am unable to read a Hive table and its metadata using pyspark
I think I am creating the Hive table correctly.
Setup:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
data1 = [(1,2,3),(3,4,5),(5,6,7)]
df1 = spark.createDataFrame(data1, schema='a int, b int, c int')
parquet_path = './bucket_test_parquet1'
Now, check the table using DESCRIBE:
df1.write.bucketBy(5, "a").format("parquet").saveAsTable('df', path=parquet_path, mode='overwrite')
spark.sql("DESCRIBE EXTENDED df").show(100)
Output:
+--------------------+--------------------+-------+
| col_name| data_type|comment|
+--------------------+--------------------+-------+
| a| int| null|
| b| int| null|
| c| int| null|
| | | |
|# Detailed Table ...| | |
| Database| default| |
| Table| df| |
| Owner| nitin| |
| Created Time|Tue Feb 01 09:05:...| |
| Last Access| UNKNOWN| |
| Created By| Spark 3.2.0| |
| Type| EXTERNAL| |
| Provider| parquet| |
| Num Buckets| 5| |
| Bucket Columns| [`a`]| |
| Sort Columns| []| |
| Location|file:/home/nitin/...| |
| Serde Library|org.apache.hadoop...| |
| InputFormat|org.apache.hadoop...| |
| OutputFormat|org.apache.hadoop...| |
+--------------------+--------------------+-------+
read_parquet1 = spark.read.format("parquet").load(parquet_path, header=True)
read_parquet1.createOrReplaceTempView("rp1")
read_parquet1 = spark.table("rp1")
spark.sql("DESCRIBE EXTENDED rp1").show(100)
Output:
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
| a| int| null|
| b| int| null|
| c| int| null|
+--------+---------+-------+
As you can see, when I read the table back from disk, the metadata is not read in. Can you help me read the table so that I get the metadata as well?
If you want the schema of the table at the data path, you can also do it like this:
read_parquet1 = spark.read.format("parquet").load(parquet_path)
read_parquet1.printSchema()  # this will give you the result you want
The problem with your code is that in the first snippet you wrote the data out to a location and asked for the schema of the saved table, whereas in the second case you read the data back from that location, created a temporary view, and asked for the definition of that temporary view. Ideally, you should ask the data path for its schema, as in my code above.
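As an aside, if what you want back is the table-level metadata (bucketing, location, provider, and so on) rather than just the column schema, you can go through the metastore instead of the path. A minimal sketch, assuming the table df created in the question still exists in the current session's catalog:

# Resolve the name "df" against the catalog; the bucketing spec is
# stored in the metastore, so it is available when reading this way.
bucketed = spark.table("df")
bucketed.printSchema()

# DESCRIBE EXTENDED on the catalog table still reports Num Buckets
# and Bucket Columns, unlike a temp view over a path-based read.
spark.sql("DESCRIBE EXTENDED df").show(100, truncate=False)

# A path-based read only sees what the parquet files themselves store,
# i.e. the column schema; the bucketing spec lives in the metastore,
# not in the data files, so it is absent here.
from_path = spark.read.parquet(parquet_path)
from_path.printSchema()

This matches your two outputs above: the catalog table shows Num Buckets 5 and Bucket Columns [`a`], while the view over the path-based read shows only the three columns.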