I am trying to use the Hadoop Java library to run a distcp command on my Hadoop cluster, to move content from HDFS to a Google Cloud bucket. I am getting the error NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

Below is my Java code:
import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class HadoopHelper {

    private static Logger logger = LoggerFactory.getLogger(HadoopHelper.class);

    private static final String FS_DEFAULT_FS = "fs.defaultFS";

    private final Configuration conf;

    public HadoopHelper(String hadoopUrl) {
        conf = new Configuration();
        conf.set(FS_DEFAULT_FS, "hdfs://" + hadoopUrl);
    }

    public void distCP(JsonArray files, String target) {
        try {
            List<Path> srcPaths = new ArrayList<>();
            for (JsonElement file : files) {
                String srcPath = file.getAsString();
                srcPaths.add(new Path(srcPath));
            }
            DistCpOptions options = new DistCpOptions.Builder(
                    srcPaths,
                    new Path("gs://" + target)
            ).build();
            logger.info("Using distcp to copy {} to gs://{}", files, target);
            this.conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
            this.conf.set("fs.gs.auth.service.account.email", "[email protected]");
            this.conf.set("fs.gs.auth.service.account.keyfile", "config/my-svc-account-keyfile.p12");
            this.conf.set("fs.gs.project.id", "my-gcp-project");
            DistCp distCp = new DistCp(this.conf, options);
            Job job = distCp.execute();
            job.waitForCompletion(true);
            logger.info("Distcp operation success. Exiting");
        } catch (Exception e) {
            logger.error("Error while trying to execute distcp", e);
            logger.error("Distcp operation failed. Exiting");
            throw new IllegalArgumentException("Distcp failed");
        }
    }

    public void createDirectory() throws IOException {
        FileSystem fileSystem = FileSystem.get(this.conf);
        fileSystem.mkdirs(new Path("/user/newfolder"));
        logger.info("Done");
    }
}
I have added the following dependencies to my pom.xml:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-distcp</artifactId>
    <version>3.3.1</version>
</dependency>
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoop3-2.2.4</version>
</dependency>
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>util</artifactId>
    <version>2.2.4</version>
</dependency>
If I run the distcp command on the cluster itself, like this:

hadoop distcp /user gs://my_bucket_name/

the distcp operation works and the content is copied to the cloud bucket.
Have you added the jar to Hadoop's classpath?

Adding the connector jar to Hadoop's classpath: placing the connector jar in the HADOOP_COMMON_LIB_JARS_DIR directory should be enough for Hadoop to load the jar. Alternatively, to be certain the jar is loaded, you can add HADOOP_CLASSPATH=$HADOOP_CLASSPATH:<path to the connector jar> to hadoop-env.sh in the Hadoop configuration directory.
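For example, the hadoop-env.sh entry could look like the line below (a sketch only; the jar file name and location are assumptions and should be adjusted to wherever the connector actually lives on your nodes):

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hadoop/lib/gcs-connector-hadoop3-2.2.4-shaded.jar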
This needs to be done on the DistCp conf (this.conf in your code) before this line:
this.conf.set("HADOOP_CLASSPATH","$HADOOP_CLASSPATH:/tmp/gcs-connector-latest-hadoop2.jar")
DistCp distCp = new DistCp(this.conf, options);
There is also a troubleshooting section, if that helps.
I ran into the same issue and fixed it by adding this configuration to the Spark session:
sc.hadoopConfiguration.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
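If you are not using Spark, the equivalent in your Java code would be to set the same key on the Hadoop Configuration you pass to DistCp (a sketch only, placed next to the fs.gs.impl setting you already have):

// Register implementations for both the gs:// FileSystem and AbstractFileSystem interfaces.
// Sketch: mirrors the Spark setting above on a plain Hadoop Configuration.
this.conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
this.conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");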
You can read more at this link.