ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy when using saveToCassandra

Problem description (1 vote, 2 answers)

I am reading data from Cassandra using the cassandraTable() function of sssContext. It creates a DataFrame. I am converting this DataFrame to an RDD and mapping it onto case class objects; cartData is the DataFrame. I have checked similar questions, but they did not help.

val dataClass = cartData.rdd.map({case Row(session_id : String, time_stamp : Date, data : String) => cartDataClass(session_id, time_stamp, data)})

The anonymous function inside the map call above seems to be causing the problem. Is that right? It looks like the function cannot be serialized.

dataClass is now an RDD[cartDataClass], and I want to save this RDD to Cassandra.

dataClass.saveToCassandra("keyspace", "table")

But it throws this exception:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 26, 192.168.1.104): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2024)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

This looks like some problem with serializing the RDD. RDDs are serializable, so what could the problem be? I am writing my script inside the main function of a Scala object. Could it be that Spark cannot serialize the Scala object? Please help, I am new to Scala and Spark.
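For completeness, the whole job is structured roughly like this (simplified; the object name is a placeholder, the DataFrame read is shown via the connector's DataFrame source, and the case class fields mirror the three columns above):

import java.util.Date

import com.datastax.spark.connector._                  // provides rdd.saveToCassandra(...)
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

case class cartDataClass(session_id: String, time_stamp: Date, data: String)

object CartDataJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cart-data"))
    val sqlContext = new SQLContext(sc)

    // read the Cassandra table as a DataFrame
    val cartData = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "keyspace", "table" -> "table"))
      .load()

    // map each Row onto the case class, then save the resulting RDD
    val dataClass = cartData.rdd.map {
      case Row(session_id: String, time_stamp: Date, data: String) =>
        cartDataClass(session_id, time_stamp, data)
    }
    dataClass.saveToCassandra("keyspace", "table")
  }
}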

scala apache-spark serialization cassandra rdd
2 Answers
1 vote

If I may make a suggestion: just save the DataFrame itself to C*. The DataFrame "write" method can be used with C*:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md#persisting-a-dataset-to-cassandra-using-the-save-command
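A minimal sketch of that approach, using the keyspace and table names from the question (the format string and option keys are the ones documented in the link above):

import org.apache.spark.sql.SaveMode

// write the DataFrame straight to Cassandra, skipping the RDD / case-class round trip
cartData.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "keyspace", "table" -> "table"))
  .mode(SaveMode.Append)
  .save()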

Without knowing how cartDataClass is defined, it is hard to tell what might be going wrong in the dependency tree. My guess is that the dependency tree of the RDD being serialized has trouble with that type.


-1 votes
new SparkConf().setAppName("test").setMaster("local[2]").set("spark.executor.memory", "4g")

Set the master to local[2] and it works.
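For context, a sketch of how that configuration would be wired up (the app name and memory value are taken from the line above; creating the SparkContext is assumed and not shown in the answer):

import org.apache.spark.{SparkConf, SparkContext}

// run locally with two worker threads instead of submitting to a remote master
val conf = new SparkConf()
  .setAppName("test")
  .setMaster("local[2]")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)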
