YARN job fails due to a connection issue


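I have set up hadoop-3.3.6 in a Kubernetes cluster. All Hadoop components are exposed through ClusterIP services, and I can telnet to the exposed ports from every pod. However, when I run the example job from the datanode pod (I also tried from the resourcemanager pod), I get the following error: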

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /README.txt /MROutput

2023-12-08 21:18:57,722 INFO [main] org.apache.hadoop.security.SecurityUtil: Updating Configuration
2023-12-08 21:18:58,323 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2023-12-08 21:18:58,544 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2023-12-08 21:18:58,544 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2023-12-08 21:18:58,714 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens: [Kind: mapreduce.job, Service: job_1702069922844_0002, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@71c3b41)]
2023-12-08 21:18:58,827 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2023-12-08 21:18:59,992 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: nodemanager.hadoop.svc.cluster.local.hadoop/10.233.51.169:37127. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2023-12-08 21:19:00,994 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: nodemanager.hadoop.svc.cluster.local.hadoop/10.233.51.169:37127. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2023-12-08 21:19:01,996 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: nodemanager.hadoop.svc.cluster.local.hadoop/10.233.51.169:37127. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2023-12-08 21:19:02,003 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From nodemanager-8fc5cdf9d-q9kwx/10.233.74.107 to nodemanager.hadoop.svc.cluster.local.hadoop:37127 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:930)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:845)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1571)
    at org.apache.hadoop.ipc.Client.call(Client.java:1513)
    at org.apache.hadoop.ipc.Client.call(Client.java:1410)
    at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:251)
    at com.sun.proxy.$Proxy8.getTask(Unknown Source)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:140)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:205)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:600)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:652)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:773)
    at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:347)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1632)
    at org.apache.hadoop.ipc.Client.call(Client.java:1457)
    ... 4 more

2023-12-08 21:19:02,004 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2023-12-08 21:19:02,005 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2023-12-08 21:19:02,006 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.

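nodemanager.hadoop.svc.cluster.local.hadoop is my service hostname, and it is correct. However, port 37127 is not opened on that service, and the port is random on every job run, so I cannot expose it from my service.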

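I am running the job above from my datanode. It is able to connect to the nodemanager.hadoop.svc.cluster.local.hadoop service, but through a different exposed IP. From the nodemanager-8fc5cdf9d-q9kwx/10.233.74.107 pod I can also connect to the services, again through a different IP, the one exposed by the service.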

Also, when I use netstat inside the nodemanager to check whether the port is created while the job is running, tcp6 port 37127 does exist for as long as the job runs.
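The netstat check above shows the port listening inside the nodemanager pod. A complementary check is to probe the same port from another pod through the service hostname. The following is a minimal sketch of such a probe, assuming Python is available in the pod; the helper name port_is_open is made up for illustration, and the host and port are the ones from the log above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        # create_connection resolves the hostname and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Values taken from the error log; the port changes on every job run.
    host = "nodemanager.hadoop.svc.cluster.local.hadoop"
    port = 37127
    print(f"{host}:{port} reachable: {port_is_open(host, port)}")
```

If the port is listening inside the pod but the probe through the service hostname fails, that is consistent with the service simply not forwarding that port.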

It looks like I am missing some configuration setting. Can anyone help? (I have been struggling with this for a few days.)

The job ends in failure. (Image: job status after completion)

mapreduce hadoop-yarn yarn-workspaces hadoop3
1 Answer

I fixed the port range using yarn.app.mapreduce.am.job.client.port-range, then exposed those ports through the Kubernetes service, and now it works.
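A Kubernetes Service has no port-range syntax, so every port in the fixed range needs its own entry in the service's ports list. The sketch below generates those entries from a range string in the same comma-and-dash form that the property accepts. The range 50100-50104, the "mr-am-" name prefix, and both function names are assumptions made for illustration, not values from the original setup.

```python
import json


def parse_port_range(spec):
    """Expand a range spec such as "50100-50102,50110" into a sorted port list."""
    ports = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
        else:
            low = high = int(part)
        if not (0 < low <= high <= 65535):
            raise ValueError(f"invalid port range: {part!r}")
        ports.update(range(low, high + 1))
    return sorted(ports)


def service_ports(spec):
    """Build one Service port entry per port in the range."""
    return [
        # The "mr-am-" prefix is hypothetical; Service port names must be
        # lowercase and at most 15 characters long.
        {"name": f"mr-am-{p}", "protocol": "TCP", "port": p, "targetPort": p}
        for p in parse_port_range(spec)
    ]


if __name__ == "__main__":
    # Hypothetical range; use the same value set for
    # yarn.app.mapreduce.am.job.client.port-range in mapred-site.xml.
    print(json.dumps(service_ports("50100-50104"), indent=2))
```

The printed JSON list can be pasted under spec.ports of the service manifest. A narrow range keeps the manifest short, but it also limits how many application masters can bind a port on one node at the same time.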
