In my unit tests I have to stop the local Spark session and then create another one (reusing the data saved in the metastore by the previous session).
But when the second Spark session is created, it cannot use the local metastore_db that the first Spark session created.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark_builder = SparkSession.builder.enableHiveSupport().master("local").appName("mega_app")
    spark_session = spark_builder.getOrCreate()
    spark_session.sql("SHOW DATABASES").show()
    # Here I get normal output (even when starting this script again and again):
    #
    # +------------+
    # |databaseName|
    # +------------+
    # |     default|
    # +------------+
    spark_session.stop()

    # Here I'm trying to create a new Spark session and check it:
    spark_builder = SparkSession.builder.enableHiveSupport().master("local").appName("mega_app")
    spark_session = spark_builder.getOrCreate()
    spark_session.sql("SHOW DATABASES").show()  # crashes here!
# And here I get the error:
#
# java.sql.SQLException: Unable to open a test connection to the given database.
# JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP.
# Terminating connection pool (set lazyInit to true if you expect to start your database after your app)
#
# ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@24d1c991
# Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /home/felix/Projects/baka_etl/baka_etl/metastore_db.
How can I fix this?

P.S. I tested with both
pyspark==2.4.8
and pyspark==3.3.4
and got the same result.
You can try pyspark 3.5.3; it works there.
I also added two lines of code to create a table and then show the tables.
# cat test.py
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark_builder = SparkSession.builder.enableHiveSupport().master("local").appName("mega_app")
    spark_session = spark_builder.getOrCreate()
    spark_session.sql("SHOW DATABASES").show()
    spark_session.sql("CREATE TABLE IF NOT EXISTS T1 (C1 INT)")
    spark_session.stop()

    # Here I create a new Spark session and check it:
    spark_builder = SparkSession.builder.enableHiveSupport().master("local").appName("mega_app")
    spark_session = spark_builder.getOrCreate()
    spark_session.sql("SHOW DATABASES").show()
    spark_session.sql("SHOW TABLES").show()
Here are the test results:
# pip list|grep pyspark
pyspark 3.5.3
# python test.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/28 11:30:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/11/28 11:30:03 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/11/28 11:30:03 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
24/11/28 11:30:04 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
24/11/28 11:30:04 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore [email protected]
+---------+
|namespace|
+---------+
| default|
+---------+
24/11/28 11:30:06 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
+---------+
|namespace|
+---------+
| default|
+---------+
24/11/28 11:30:06 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
| default| t1| false|
+---------+---------+-----------+
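If upgrading is not an option: the `ERROR XSDB6` message means Derby still sees a lock file left by the first session's metastore. A commonly suggested workaround (a sketch, not a guaranteed fix; it assumes the default `metastore_db` directory in the working directory, as shown in the error above) is to delete the stale Derby lock files before starting the second session:

```python
import os

# Derby keeps db.lck / dbex.lck inside the database directory while it
# considers the database booted; error XSDB6 is raised while they exist.
# Removing stale lock files lets a fresh session reopen metastore_db.
for lock in ("metastore_db/db.lck", "metastore_db/dbex.lck"):
    try:
        os.remove(lock)
    except FileNotFoundError:
        pass  # no stale lock, nothing to do
```

Only do this between sessions, when you are sure no JVM is actually using the metastore; deleting the lock while Derby is live can corrupt the database.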