I am trying to set up a Spark application that runs on my local machine and connects to an HDFS cluster whose NameNode runs inside a Docker container.
Here are the relevant details of my setup:
version: "3.9"
networks:
hadoop-net:
driver: bridge
services:
# Distributed storage
hdfs-namenode:
container_name: "hdfs-namenode"
image: "apache/hadoop:3"
hostname: "hdfs-namenode"
command: ["hdfs", "namenode"]
ports:
- "8020:8020"
- "9870:9870"
env_file:
- ./hadoop-config/config.env
environment:
ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name"
networks:
- hadoop-net
hdfs-datanode:
depends_on:
hdfs-namenode:
condition: service_started
container_name: "hdfs-datanode"
image: "apache/hadoop:3"
hostname: "hdfs-datanode"
command: ["hdfs", "datanode"]
ports:
- "9864:9864"
env_file:
- ./hadoop-config/config.env
networks:
- hadoop-net
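As a quick sanity check (a minimal sketch, not part of the original setup), the NameNode web UI published on port 9870 can be probed from the host; the URL below assumes the port mapping from the compose file above:

from urllib.request import urlopen

# Probe the NameNode web UI published on the host (9870:9870 in the compose file).
# A 200 response suggests the container is up and the port mapping works.
with urlopen("http://localhost:9870", timeout=5) as resp:
    print("NameNode web UI status:", resp.status)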
Here is the output of docker network inspect:
[
    {
        "Name": "big-data-project-2_hadoop-net",
        "Id": "9041e00edf94dfc3bbd06faa11e38158db0c008468a4d39439d6ffce86bcbaa9",
        "Created": "2024-12-02T16:27:49.568119343Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "5d35ac2b9390085f0a3318d2276f24be01738be5d61f12e33ecf57bfb843141c": {
                "Name": "hdfs-datanode",
                "EndpointID": "306e4bd76957da95b27828b938222b8ea9790ffefe53db6e2af7ab77cc0317d1",
                "MacAddress": "02:42:ac:12:00:03",
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": ""
            },
            "90d2484e199081c06e40eebcfeca44e9fbb82eb77e24aaf5655564f3302b052a": {
                "Name": "hdfs-namenode",
                "EndpointID": "7ae97198be0b0fdf678784f4645f7fecbb31b7c559021ffbe8ac7153ebb272fc",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {
            "com.docker.compose.network": "hadoop-net",
            "com.docker.compose.project": "big-data-project-2",
            "com.docker.compose.version": "2.24.6"
        }
    }
]
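For reference, the DataNode sits at 172.18.0.3 on this bridge network. A purely illustrative sketch (assuming the default HDFS DataNode transfer port 9866) to test from the host whether that container IP is reachable at all:

import socket

# 172.18.0.3 is the hdfs-datanode container IP from the network inspect above;
# 9866 is the default HDFS DataNode data transfer port (dfs.datanode.address).
try:
    with socket.create_connection(("172.18.0.3", 9866), timeout=5):
        print("DataNode reachable from the host")
except OSError as exc:
    print("DataNode not reachable from the host:", exc)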
So I am trying to write a CSV file to my HDFS cluster, using this code as an example:
from pyspark.sql import SparkSession

# Point the default filesystem at the NameNode RPC port published on localhost
spark = SparkSession.builder \
    .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:8020") \
    .appName("Spark SQL") \
    .getOrCreate()

data = [("James", "", "Smith"), ("Anna", "Rose", ""), ("Julia", "", "Williams")]
columns = ["firstname", "middlename", "lastname"]
df = spark.createDataFrame(data, columns)
df.show()

# Write the DataFrame to HDFS as CSV
df.write.csv("/user/hadoop/data/people.csv")
The error is:
24/12/02 18:04:31 WARN DataStreamer: Exception in createBlockOutputStream blk_1073741841_1017
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/172.18.0.3:9866]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:589)
at org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:253)
at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1774)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1728)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713)
24/12/02 18:04:31 WARN DataStreamer: Abandoning BP-1037713087-172.18.0.2-1733156872096:blk_1073741841_1017
24/12/02 18:04:31 WARN DataStreamer: Exception in createBlockOutputStream blk_1073741842_1018
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/172.18.0.3:9866]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:589)
at org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:253)
at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1774)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1728)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713)
24/12/02 18:04:31 WARN DataStreamer: Abandoning BP-1037713087-172.18.0.2-1733156872096:blk_1073741842_1018
24/12/02 18:04:31 WARN DataStreamer: Exception in createBlockOutputStream blk_1073741843_1019
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/172.18.0.3:9866]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:589)
at org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:253)
at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1774)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1728)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713)
24/12/02 18:04:31 WARN DataStreamer: Abandoning BP-1037713087-172.18.0.2-1733156872096:blk_1073741843_1019
24/12/02 18:04:32 WARN DataStreamer: Excluding datanode DatanodeInfoWithStorage[172.18.0.3:9866,DS-deaa7ec2-a939-4f64-9f80-d8b5fa30f427,DISK]
24/12/02 18:04:32 WARN DataStreamer: Excluding datanode DatanodeInfoWithStorage[172.18.0.3:9866,DS-deaa7ec2-a939-4f64-9f80-d8b5fa30f427,DISK]
24/12/02 18:04:32 WARN DataStreamer: Excluding datanode DatanodeInfoWithStorage[172.18.0.3:9866,DS-deaa7ec2-a939-4f64-9f80-d8b5fa30f427,DISK]
24/12/02 18:04:32 WARN DataStreamer: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hadoop/data/people.csv/_temporary/0/_temporary/attempt_202412021803316929840877291183176_0003_m_000015_31/part-00015-964739f3-b6f7-4f07-97a4-fbf380d50050-c000.csv could only be written to 0 of the 1 minReplication nodes. There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2350)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2989)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
at org.apache.hadoop.ipc.Client.call(Client.java:1558)
at org.apache.hadoop.ipc.Client.call(Client.java:1455)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
at com.sun.proxy.$Proxy39.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:530)
at sun.reflect.GeneratedMethodAccessor203.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy40.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088)
at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1915)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1717)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713)......
You should change this line in your code
from
.config("spark.hadoop.fs.defaultFS", "hdfs://localhost:8020") \
to
.config("spark.hadoop.fs.defaultFS", "hdfs://hdfs-namenode:8020") \
because hdfs://localhost:8020 is not reachable from your Hadoop hdfs-datanode, so it should be hdfs://hdfs-namenode:8020.
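A minimal sketch of the adjusted session (assuming the hostname hdfs-namenode resolves on the machine running Spark, for example via an /etc/hosts entry pointing it at 127.0.0.1, since the RPC port is published there):

from pyspark.sql import SparkSession

# Point defaultFS at the NameNode's service hostname instead of localhost.
# Assumes "hdfs-namenode" resolves on the Spark driver host, e.g. via an
# /etc/hosts entry such as "127.0.0.1 hdfs-namenode" (the RPC port 8020 is
# published to the host in the compose file).
spark = SparkSession.builder \
    .config("spark.hadoop.fs.defaultFS", "hdfs://hdfs-namenode:8020") \
    .appName("Spark SQL") \
    .getOrCreate()

df = spark.createDataFrame(
    [("James", "", "Smith"), ("Anna", "Rose", ""), ("Julia", "", "Williams")],
    ["firstname", "middlename", "lastname"],
)
df.write.csv("/user/hadoop/data/people.csv")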