My Spark driver is failing with an OOM:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:141)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at java.util.AbstractCollection.toString(AbstractCollection.java:462)
at java.util.Collections$UnmodifiableCollection.toString(Collections.java:1037)
at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:469)
at org.apache.spark.util.JsonProtocol$.$anonfun$accumulableInfoToJson$7(JsonProtocol.scala:428)
at org.apache.spark.util.JsonProtocol$$$Lambda$4808/1475607995.apply(Unknown Source)
at scala.Option.map(Option.scala:230)
at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:428)
at org.apache.spark.util.JsonProtocol$.$anonfun$accumulablesToJson$4(JsonProtocol.scala:420)
at org.apache.spark.util.JsonProtocol$$$Lambda$4204/1507248734.apply(Unknown Source)
at scala.collection.immutable.List.map(List.scala:293)
at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:420)
at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:412)
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:171)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:81)
at org.apache.spark.deploy.history.rpc.app.RpcAppEvent$.encodeEvent(RpcAppEvent.scala:77)
at org.apache.spark.deploy.history.rpc.app.RpcAppEvent$.apply(RpcAppEvent.scala:62)
at org.apache.spark.deploy.history.rpc.app.RpcAppEventQueue.enqueue(RpcAppEventQueue.scala:86)
at org.apache.spark.deploy.history.rpc.app.RpcAppListener.addToEventQueue(RpcAppListener.scala:75)
at org.apache.spark.deploy.history.rpc.app.RpcAppListener.onEvent(RpcAppListener.scala:83)
at org.apache.spark.deploy.history.rpc.util.SingleHandlerListener.onTaskEnd(SingleHandlerListener.scala:48)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:120)
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:104)
at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:127)
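The top frames look telling: `JsonProtocol.accumValueToJson` ends up calling `toString` on an unmodifiable Java collection (`Collections$UnmodifiableCollection.toString`), and an unmodifiable `java.util.List` is exactly what a `CollectionAccumulator` exposes as its value. A hypothetical sketch of the pattern that would produce these frames (the object, app name, and sizes below are illustrative, not my actual job):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical repro sketch: CollectionAccumulator.value is an unmodifiable
// java.util.List, and for non-internal accumulators Spark's JsonProtocol
// falls back to stringifying the value -- which yields the
// AbstractCollection.toString / StringBuilder.append frames seen above.
object AccumOomSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("accum-oom-sketch").getOrCreate()
    val sc = spark.sparkContext

    val acc = sc.collectionAccumulator[String]("strings")
    sc.parallelize(1 to 1000, 10).foreach { _ =>
      acc.add("x" * 1000) // every element enlarges each task-end JSON event
    }
    spark.stop()
  }
}
```

Note that `accumulableInfoToJson` serializes both the per-task update and the running total value of every accumulator on every task-end event, so the JSON payload per event can be considerably larger than the per-task delta.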
Spark configuration:
spark.dynamicAllocation.maxExecutors=9
spark.executor.cores=2
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.initialExecutors=1
spark.executor.instances=8
spark.shuffle.service.client.class=org.apache.spark.network.shuffle.ExternalShuffleClient
spark.driver.maxResultSize=1g
spark.sql.files.maxPartitionBytes=1073741824
spark.sql.join.preferSortMergeJoin=true
spark.dynamicAllocation.executorIdleTimeout=300s
spark.executor.extraJavaOptions=-Detwlogger.component=sparkexecutor -DlogFilter.filename=SparkLogFilters.xml -DpatternGroup.filename=SparkPatternGroups.xml -Dlog4jspark.root.logger=INFO,console,RFA,ETW,Anonymizer -Dlog4jspark.log.dir=/var/log/sparkapp/${user.name} -Dlog4jspark.log.file=sparkexecutor.log -Dlog4j2.configurationFile=file:/usr/hdp/current/spark3-client/conf/executor-log4j2.properties -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -XX:+UseParallelGC -XX:+UseParallelOldGC
spark.driver.cores=1
spark.driver.memory=**2g**
spark.executor.memory=9g
spark.driver.memoryOverhead=**384**
spark.executor.memoryOverhead=921
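For sizing context, the driver is tiny compared to the executors; a back-of-envelope computation from the values above (assuming YARN semantics, where a unit-less `memoryOverhead` is read as MiB):

```scala
// Back-of-envelope driver container size from the config above
// (assumption: YARN, where unit-less spark.driver.memoryOverhead is MiB).
val driverHeapMiB     = 2 * 1024                           // spark.driver.memory=2g
val driverOverheadMiB = 384                                // spark.driver.memoryOverhead=384
val driverTotalMiB    = driverHeapMiB + driverOverheadMiB  // = 2432 MiB in total
val executorHeapMiB   = 9 * 1024                           // spark.executor.memory=9g each
```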
There are 9 executors in total and 919 tasks overall. The code executes inside:
df.rdd.foreachPartition
using a few int and string accumulators (the string accumulators are bounded so that they cannot grow indefinitely; a simplified sketch of the bounding follows).
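"Bounded" here means roughly the following (a simplified sketch, not the production code; the class name and cap are illustrative):

```scala
import org.apache.spark.util.AccumulatorV2

// Simplified sketch of a size-capped string accumulator: appends stop once
// the buffer reaches maxLen, so the driver-side value cannot grow without limit.
class BoundedStringAccumulator(maxLen: Int = 64 * 1024)
    extends AccumulatorV2[String, String] {
  private val sb = new StringBuilder

  override def isZero: Boolean = sb.isEmpty
  override def copy(): BoundedStringAccumulator = {
    val c = new BoundedStringAccumulator(maxLen)
    c.sb.append(sb.toString)
    c
  }
  override def reset(): Unit = sb.clear()
  override def add(v: String): Unit =
    if (sb.length < maxLen) sb.append(v.take(maxLen - sb.length))
  override def merge(other: AccumulatorV2[String, String]): Unit = add(other.value)
  override def value: String = sb.toString
}

// Registered once on the driver:
//   val acc = new BoundedStringAccumulator()
//   spark.sparkContext.register(acc, "boundedStrings")
```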
After each task finishes I can see this log line (each task is around 60,000-62,000 bytes, with no outliers):
INFO Executor [Executor task launch worker for task 258.0 in stage 59.0 (TID 4666)]: Finished task 258.0 in stage 59.0 (TID 4666). 61394 bytes result sent to driver
Spark fails with the OOM after completing roughly 270 of the 919 tasks.
My understanding is that it fails while the executors are reporting their results back to the driver and the driver is accumulating this information to do all the JSON work.
However, 270 × 60 KB ≈ 16 MB should not kill the driver with an OOM, unless it is copied internally many times. The `Arrays.copyOf` / `ensureCapacityInternal` frames at the top of the trace are a `StringBuilder` repeatedly doubling its backing array while `toString`-ing a collection, so a single large accumulator value can transiently need several times its own size in contiguous heap; on top of that, the `RpcAppListener` in the trace encodes every task-end event to JSON and enqueues it (`RpcAppEventQueue.enqueue`), so many encoded events may be alive at once.
The simplest solution here is to add memory to the driver, but I want to know how to investigate this OOM further, to prove my theory that it is caused by the executor -> driver communication.
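Two things I am considering: (1) adding `-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=...` to `spark.driver.extraJavaOptions` and inspecting the dump's dominator tree (if the theory holds, `SparkListenerTaskEnd` events, their `AccumulableInfo` values, and the JSON strings built from them should dominate the retained heap); and (2) a small diagnostic listener like the sketch below (the class name is illustrative, and note it re-stringifies the values itself, so it is only suitable for a test run):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Diagnostic sketch: log the approximate size of the accumulator payload the
// driver receives with every task-end event, to correlate driver heap growth
// with executor -> driver traffic.
class AccumPayloadListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val approxChars = taskEnd.taskInfo.accumulables.map { info =>
      info.update.map(_.toString.length).getOrElse(0) +
        info.value.map(_.toString.length).getOrElse(0)
    }.sum
    println(s"task ${taskEnd.taskInfo.taskId}: ~$approxChars chars of accumulator payload")
  }
}

// Registered on the driver before the job runs:
//   spark.sparkContext.addSparkListener(new AccumPayloadListener)
```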
Thanks!