It runs one application (reads from Kafka, applies a simple transformation, writes to a Delta table). When I start the application I can see the executor and the driver in the dashboard (1 core and 2 GB RAM for both the executor and the driver). After 5 minutes the executor is killed and only the driver stays alive. I have tried several times - it is always exactly 5 minutes.
Master logs:
2025-03-07T07:18:03.8025281Z stderr F 25/03/07 07:18:03 INFO Master: Registering worker 100.100.206.177:65000 with 10 cores, 7.0 GiB RAM
2025-03-07T07:18:08.6740120Z stderr F 25/03/07 07:18:08 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
2025-03-07T07:18:08.6744812Z stderr F 25/03/07 07:18:08 INFO Master: Launching driver driver-20250307071808-0002 on worker worker-20250307071803-100.100.206.177-65000
2025-03-07T07:18:11.6502394Z stderr F 25/03/07 07:18:11 INFO Master: Registering app AppName
2025-03-07T07:18:11.6509960Z stderr F 25/03/07 07:18:11 INFO Master: Registered app AppName with ID app-20250307071811-0002
2025-03-07T07:18:11.6516823Z stderr F 25/03/07 07:18:11 INFO Master: Start scheduling for app app-20250307071811-0002 with rpId: 0
2025-03-07T07:18:11.6520393Z stderr F 25/03/07 07:18:11 INFO Master: Launching executor app-20250307071811-0002/0 on worker worker-20250307071803-100.100.206.177-65000
2025-03-07T07:18:11.7802035Z stderr F 25/03/07 07:18:11 INFO Master: Start scheduling for app app-20250307071811-0002 with rpId: 0
2025-03-07T07:18:14.1149857Z stderr F 25/03/07 07:18:14 INFO Master: 100.100.0.179:40548 got disassociated, removing it.
2025-03-07T07:18:14.1151561Z stderr F 25/03/07 07:18:14 INFO Master: 100.100.206.177:40175 got disassociated, removing it.
2025-03-07T07:23:11.7824346Z stderr F 25/03/07 07:23:11 INFO Master: 100.100.0.36:58476 got disassociated, removing it.
2025-03-07T07:23:11.7825329Z stderr F 25/03/07 07:23:11 INFO Master: ca-app-worker-bdso2eczojhja--confinsubmit-ff59f4b75-v8lm4:38129 got disassociated, removing it.
2025-03-07T07:23:11.7825770Z stderr F 25/03/07 07:23:11 INFO Master: Removing app app-20250307071811-0002
2025-03-07T07:23:11.8980486Z stderr F 25/03/07 07:23:11 WARN Master: Got status update for unknown executor app-20250307071811-0002/0
Worker logs:
2025-03-07T07:20:56.04669 Successfully Connected to container: 'ca-app-worker-bdso2eczojhja' [Revision: 'ca-app-worker-bdso2eczojhja--confinsubmit-ff59f4b75-v8lm4', Replica: 'ca-app-worker-bdso2eczojhja--confinsubmit']
2025-03-07T07:18:08.5551767Z stderr F 25/03/07 07:18:08 INFO Utils: Successfully started service 'driverClient' on port 40175.
2025-03-07T07:18:08.6040221Z stderr F 25/03/07 07:18:08 INFO TransportClientFactory: Successfully created connection to ca-app-master-bdso2eczojhja/100.100.229.245:7077 after 22 ms (0 ms spent in bootstraps)
2025-03-07T07:18:08.6720836Z stderr F 25/03/07 07:18:08 INFO ClientEndpoint: ... waiting before polling master for driver state
2025-03-07T07:18:08.6896218Z stderr F 25/03/07 07:18:08 INFO ClientEndpoint: Driver successfully submitted as driver-20250307071808-0002
2025-03-07T07:18:08.7077359Z stderr F 25/03/07 07:18:08 INFO Worker: Asked to launch driver driver-20250307071808-0002
2025-03-07T07:18:08.7390778Z stderr F 25/03/07 07:18:08 INFO DriverRunner: Copying user jar file:/opt/spark/work-dir/sparkjobs-0.1-all.jar to /opt/spark/work/driver-20250307071808-0002/sparkjobs-0.1-all.jar
2025-03-07T07:18:08.7532736Z stderr F 25/03/07 07:18:08 INFO Utils: Copying /opt/spark/work-dir/sparkjobs-0.1-all.jar to /opt/spark/work/driver-20250307071808-0002/sparkjobs-0.1-all.jar
2025-03-07T07:18:08.9767253Z stderr F 25/03/07 07:18:08 INFO DriverRunner: Launch Command: "/opt/java/openjdk/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*:/etc/hadoop/conf" "-Xmx2048M" "-Dspark.dynamicAllocation.enabled=false" "-Dspark.master=spark://ca-app-master-bdso2eczojhja:7077" "-Dspark.driver.memory=2G" "-Dspark.network.timeout=600s" "-Dspark.submit.deployMode=cluster" "-Dspark.shuffle.compress=true" "-Dspark.executor.memory=2G" "-Dspark.app.name=dk.name.sparkjobs.EventsStream" "-Dspark.cores.max=1" "-Dspark.driver.supervise=false" "-Dspark.jars=file:/opt/spark/work-dir/sparkjobs-0.1-all.jar" "-Dspark.submit.pyFiles=" "-Dspark.executor.cores=1" "-Dspark.app.submitTime=1741331888113" "-Dspark.rpc.askTimeout=10s" "org.apache.spark.deploy.worker.DriverWrapper" "spark://[email protected]:65000" "/opt/spark/work/driver-20250307071808-0002/sparkjobs-0.1-all.jar" "dk.name.sparkjobs.EventsStream"
2025-03-07T07:18:11.7022027Z stderr F 25/03/07 07:18:11 INFO Worker: Asked to launch executor app-20250307071811-0002/0 for AppName
2025-03-07T07:18:11.7179607Z stderr F 25/03/07 07:18:11 INFO SecurityManager: Changing view acls to: spark
2025-03-07T07:18:11.7185133Z stderr F 25/03/07 07:18:11 INFO SecurityManager: Changing modify acls to: spark
2025-03-07T07:18:11.7188098Z stderr F 25/03/07 07:18:11 INFO SecurityManager: Changing view acls groups to:
2025-03-07T07:18:11.7189881Z stderr F 25/03/07 07:18:11 INFO SecurityManager: Changing modify acls groups to:
2025-03-07T07:18:11.7192879Z stderr F 25/03/07 07:18:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: spark; groups with view permissions: EMPTY; users with modify permissions: spark; groups with modify permissions: EMPTY
2025-03-07T07:18:11.7391135Z stderr F 25/03/07 07:18:11 INFO ExecutorRunner: Launch command: "/opt/java/openjdk/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*:/etc/hadoop/conf" "-Xmx2048M" "-Dspark.network.timeout=600s" "-Dspark.driver.port=38129" "-Dspark.rpc.askTimeout=10s" "-Djava.net.preferIPv6Addresses=false" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-Djdk.reflect.useDirectMethodHandle=false" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@ca-app-worker-bdso2eczojhja--confinsubmit-ff59f4b75-v8lm4:38129" "--executor-id" "0" "--hostname" "100.100.206.177" "--cores" "1" "--app-id" "app-20250307071811-0002" "--worker-url" "spark://[email protected]:65000" "--resourceProfileId" "0"
2025-03-07T07:18:13.7192983Z stderr F 25/03/07 07:18:13 INFO ClientEndpoint: State of driver-20250307071808-0002 is RUNNING
2025-03-07T07:18:13.7198964Z stderr F 25/03/07 07:18:13 INFO ClientEndpoint: Driver running on 100.100.206.177:65000 (worker-20250307071803-100.100.206.177-65000)
2025-03-07T07:18:13.7203498Z stderr F 25/03/07 07:18:13 INFO ClientEndpoint: spark-submit not configured to wait for completion, exiting spark-submit JVM.
2025-03-07T07:18:13.7576048Z stderr F 25/03/07 07:18:13 INFO ShutdownHookManager: Shutdown hook called
2025-03-07T07:18:13.7585522Z stderr F 25/03/07 07:18:13 INFO ShutdownHookManager: Deleting directory /tmp/spark-c4eef00a-1dc8-4168-bc51-3ca6e32c4525
2025-03-07T07:23:11.7877695Z stderr F 25/03/07 07:23:11 INFO Worker: Asked to kill executor app-20250307071811-0002/0
2025-03-07T07:23:11.7884040Z stderr F 25/03/07 07:23:11 INFO ExecutorRunner: Runner thread for executor app-20250307071811-0002/0 interrupted
2025-03-07T07:23:11.7893527Z stderr F 25/03/07 07:23:11 INFO ExecutorRunner: Killing process!
2025-03-07T07:23:11.8963907Z stderr F 25/03/07 07:23:11 INFO Worker: Executor app-20250307071811-0002/0 finished with state KILLED exitStatus 143
2025-03-07T07:23:11.8972549Z stderr F 25/03/07 07:23:11 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 0
2025-03-07T07:23:11.8976360Z stderr F 25/03/07 07:23:11 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20250307071811-0002, execId=0)
2025-03-07T07:23:11.8995390Z stderr F 25/03/07 07:23:11 INFO Worker: Cleaning up local directories for application app-20250307071811-0002
2025-03-07T07:23:11.8998084Z stderr F 25/03/07 07:23:11 INFO ExternalShuffleBlockResolver: Application app-20250307071811-0002 removed, cleanupLocalDirs = true
I tried
spark.network.timeout=600s
and
spark.dynamicAllocation.enabled=false
- nothing changed. During those 5 minutes the job works fine, reading and writing data. After 5 minutes I still see the driver and the worker in the dashboard.
Any idea why the executor is killed?
Your executor was killed because it received a kill request from the driver and shut down gracefully (exit code 143, i.e. SIGTERM). There are many potential causes for exit code 143, but I believe in this case it is a memory/GC issue. You can verify that by checking the Spark UI (e.g. GC time on the executors tab).
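If the UI alone isn't conclusive, you can also turn on GC logging for the executor JVM. A minimal sketch, reusing the master URL, class, and jar path from the launch commands in your logs; the `-Xlog` flag is standard JDK 9+ unified logging, not a Spark setting, and the log path is an assumption - point it at a directory that exists in your container:

```shell
# Sketch: same job, with executor GC logging enabled so kills can be
# correlated with GC pauses. /tmp/executor-gc.log is an assumed path.
spark-submit \
  --master spark://ca-app-master-bdso2eczojhja:7077 \
  --deploy-mode cluster \
  --class dk.name.sparkjobs.EventsStream \
  --conf "spark.executor.extraJavaOptions=-Xlog:gc*:file=/tmp/executor-gc.log:time,uptime" \
  /opt/spark/work-dir/sparkjobs-0.1-all.jar
```

Long or frequent GC pauses right before 07:23 in that log would support the memory/GC theory.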
I'm not sure how complex your transformations are, but if there are still free resources in your cluster, increase the executor memory. If not, reduce
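Assuming the worker's 7 GiB has headroom (your logs show it does), increasing executor memory could look like the sketch below. The 4g value is illustrative, not prescriptive; master URL, class, and jar are taken from the launch command in your logs:

```shell
# Sketch: relaunch with a larger executor heap. In standalone mode the
# worker must have enough free memory to satisfy --executor-memory.
spark-submit \
  --master spark://ca-app-master-bdso2eczojhja:7077 \
  --deploy-mode cluster \
  --class dk.name.sparkjobs.EventsStream \
  --executor-memory 4g \
  --conf spark.cores.max=1 \
  /opt/spark/work-dir/sparkjobs-0.1-all.jar
```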