I am trying to run our pipeline as a step on EMR, and this is where I am stuck.
"logPath" : "s3://ekin-logs/",
"masterInstanceType" : "m5.xlarge",
"slaveInstanceType" : "m5.xlarge",
"instanceCount" : 2,
"subnetIds" : $SUBNET_ID,
"ec2KeyName" : "ekin-analytics",
"applications" : ["Spark","Hadoop"],
"args" : [
"spark-submit",
"--master", "yarn",
"--executor-memory", "8G",
"--driver-memory", "7G",
"--deploy-mode","cluster",
"--class","com.testinium.analytics.AppCommonDataSource",
"--conf","spark.eventLog.enabled=true",
"s3://analytics-emr-test/ekin-spark-app.jar",
"--prefixOutputDir", "hdfs:///home/hadoop/data/customer",
"--maxTimeGapThreshold","180000",
"--domainId", "13",
"--submitId", "1",
"--startTime" ,"1543664538237",
"--endTime", "1551994119153"
],
"jar" : "command-runner.jar",
"name" : "AppCommonDataSource",
"actionOnFailure" : "CANCEL_AND_WAIT"
At first I was getting the error below:
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 8.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
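The message itself suggests raising spark.yarn.executor.memoryOverhead; in this step definition that would just mean one more --conf pair in the args list, for example (the 2048 MB figure is only an assumption, not a value from my setup):
"--conf", "spark.yarn.executor.memoryOverhead=2048",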
I did some research on Stack Overflow and found that someone solved this problem by adding the yarn.nodemanager.vmem-check-enabled parameter to yarn-site.xml (Stack Overflow solution).
I added this parameter as well, but nothing changed.
These are my cluster's yarn-site parameters:
yarnProperties.put("yarn.scheduler.maximum-allocation-mb", 10240);
yarnProperties.put("yarn.nodemanager.resource.memory-mb", 10240);
yarnProperties.put("yarn.nodemanager.vmem-check-enabled", "false");
yarnProperties.put("yarn.nodemanager.pmem-check-enabled", "false");
And the error below is what I end up with:
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container from a bad node: container_1591270643256_0002_01_000002 on host: ip-172-31-35-232.eu-west-1.compute.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Note: sometimes it works when there is only a master node.
Add EBS volumes to your nodes. M5 instances do not come with any instance storage.
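For illustration, attaching EBS storage to the core instance group with the AWS SDK for Java v1 could look roughly like this (the 64 GiB gp2 volume and one volume per instance are assumptions, not values from the question):

import com.amazonaws.services.elasticmapreduce.model.*;

public class EbsInstanceGroupExample {
    // Core instance group with one 64 GiB gp2 EBS volume per node (sizes are assumptions).
    static InstanceGroupConfig coreGroupWithEbs() {
        VolumeSpecification volume = new VolumeSpecification()
                .withVolumeType("gp2")
                .withSizeInGB(64);

        EbsConfiguration ebs = new EbsConfiguration()
                .withEbsBlockDeviceConfigs(new EbsBlockDeviceConfig()
                        .withVolumeSpecification(volume)
                        .withVolumesPerInstance(1));

        return new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.CORE)
                .withInstanceType("m5.xlarge")
                .withInstanceCount(1)
                .withEbsConfiguration(ebs);
    }
}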