I'm having trouble running a Spark job via spark-submit because of the following error:
16/11/16 11:41:12 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(org.apache.hadoop.fs.Path, java.lang.String, java.util.Map, boolean, int, boolean, boolean, boolean)
java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(org.apache.hadoop.fs.Path, java.lang.String, java.util.Map, boolean, int, boolean, boolean, boolean)
at java.lang.Class.getMethod(Class.java:1786)
at org.apache.spark.sql.hive.client.Shim.findMethod(HiveShim.scala:114)
at org.apache.spark.sql.hive.client.Shim_v0_14.loadDynamicPartitionsMethod$lzycompute(HiveShim.scala:404)
at org.apache.spark.sql.hive.client.Shim_v0_14.loadDynamicPartitionsMethod(HiveShim.scala:403)
at org.apache.spark.sql.hive.client.Shim_v0_14.loadDynamicPartitions(HiveShim.scala:455)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:562)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:562)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:562)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:281)
at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:228)
at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:227)
at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:270)
...
I'm using Spark 1.6.0 with Scala 2.10 and Hive 1.1.0, on CDH 5.7.1, which ships those same Spark and Hive versions. The hive-exec jar passed on the Spark job's classpath is hive-exec-1.1.0-cdh5.7.1.jar. That jar contains the class org.apache.hadoop.hive.ql.metadata.Hive, and I can see it has the following method:
public java.util.Map<java.util.Map<java.lang.String, java.lang.String>, org.apache.hadoop.hive.ql.metadata.Partition> loadDynamicPartitions(org.apache.hadoop.fs.Path, java.lang.String, java.util.Map<java.lang.String, java.lang.String>, boolean, int, boolean, boolean, boolean) throws org.apache.hadoop.hive.ql.metadata.HiveException;
This does not match what org.apache.spark.sql.hive.client.ClientWrapper, from the spark-hive_2.10-1.6.0.jar library I'm using, expects. That class resolves the method through org.apache.spark.sql.hive.client.HiveShim, which looks it up like this:
private lazy val loadDynamicPartitionsMethod =
  findMethod(
    classOf[Hive],
    "loadDynamicPartitions",
    classOf[Path],
    classOf[String],
    classOf[JMap[String, String]],
    JBoolean.TYPE,
    JInteger.TYPE,
    JBoolean.TYPE,
    JBoolean.TYPE)
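For what it's worth, one way to see which loadDynamicPartitions overloads are actually on the driver's classpath at runtime is to list them via reflection. The snippet below is just a diagnostic sketch of mine (the object name is made up, not part of Spark or Hive); it only uses standard Java reflection, so it works against whichever hive-exec jar ends up on the classpath:

object CheckHiveMethods {
  def main(args: Array[String]): Unit = {
    // Load the Hive class that Spark's shim reflects against
    val hiveClass = Class.forName("org.apache.hadoop.hive.ql.metadata.Hive")
    hiveClass.getMethods
      .filter(_.getName == "loadDynamicPartitions")
      .foreach { m =>
        // Print each overload's parameter types so they can be compared
        // with the signature HiveShim asks findMethod for
        println(m.getParameterTypes.map(_.getSimpleName).mkString("(", ", ", ")"))
      }
  }
}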
I also checked the history of the hive-exec jar, and it looks like the signature of that method in org.apache.hadoop.hive.ql.metadata.Hive changed after version 1.0.0. I'm new to Spark, but it seems to me the spark-hive library was built against an older Hive implementation (in the jar's META-INF/DEPENDENCIES file I can see a declared dependency on org.spark-project.hive:hive-exec:jar:1.2.1.spark). Does anyone know how to configure the Spark job so that it uses the correct Hive library?
Make sure you have the following settings enabled:
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=2048;
SET hive.exec.dynamic.partition.mode=nonstrict;
In Spark you can set them on the HiveContext like this:
hiveCtx.setConf("hive.exec.dynamic.partition","true")
hiveCtx.setConf("hive.exec.max.dynamic.partitions","2048")
hiveCtx.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
If the problem persists, my guess is that the Spark version you built against does not match the environment you are running spark-submit on... You could try running your program in spark-shell first; if it works there, then align the Spark version in your build with the one installed in the environment.
You can declare your dependencies as follows in sbt (or the equivalents in your pom):
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.3"
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.3"
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.6.3"
libraryDependencies += "org.apache.hive" % "hive-exec" % "1.1.0"
See https://mvnrepository.com/artifact/org.apache.spark
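If you build an assembly jar, it may also help to mark the Spark and Hive artifacts as "provided" so that the cluster's own jars win at runtime instead of whatever gets bundled; this is just a common pattern, so adjust the versions to whatever your cluster actually ships:

// Compile against these, but use the cluster's Spark/Hive jars at runtime
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.6.0" % "provided"
libraryDependencies += "org.apache.hive" % "hive-exec" % "1.1.0" % "provided"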
You can run SPARK_PRINT_LAUNCH_COMMAND=true spark-shell to see the exact launch command and classpath your environment uses.
An alternative is to write the data with Spark's own partitionBy instead of Hive dynamic partitions:
dataframe.write.mode("overwrite").partitionBy("col1", "col2").json("//path")
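If the output still needs to show up as partitions of an existing Hive table, one common follow-up (assuming the table's location matches the path you write to, which is an assumption on my part) is to register the new partitions afterwards:

// Write partitioned files directly, bypassing Hive's loadDynamicPartitions
dataframe.write
  .mode("overwrite")
  .partitionBy("col1", "col2")
  .parquet("/path/to/table/location")  // hypothetical path, use your table's location

// Then register the newly written partitions in the metastore
hiveCtx.sql("MSCK REPAIR TABLE my_table")  // my_table is a placeholder name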