How to write AWS Glue output files with a specific name

Question

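I have an AWS Glue Python job that joins two Aurora tables and writes (sinks) the output to an S3 bucket in JSON format. The job runs fine and works as expected. By default, the output files are written to the S3 bucket with a name pattern like "run-123456789-part-r-00000" [under the hood it runs PySpark code on a Hadoop cluster, so the file names are Hadoop-like].

Now, my question is how to write the file with a specific name, such as "Customer_Transaction.json", instead of "run-***-part****".

I tried converting to a DataFrame and then writing it as JSON, as shown below, but it did not work: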

customerDF.repartition(1).write.mode("overwrite").json("s3://bucket/aws-glue/Customer_Transaction.json")

Answer 1

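Under the hood, a Glue job is a Spark job, and this is simply how Spark saves files: the path passed to the writer is treated as a directory, and the data lands in part files inside it. Workaround: after saving the DataFrame, rename the resulting file.

A similar question in the context of plain Spark jobs: Specifying the filename when saving a DataFrame as a CSV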
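
The rename workaround could look roughly like the sketch below. It is not from the original answer and rests on several assumptions: the job wrote exactly one part file (for example after `repartition(1)`) under a temporary prefix, the job role is allowed to list, copy and delete objects, and boto3 is available. S3 has no real rename operation, so the sketch copies the object to the new key and then deletes the original. The function name `rename_single_part_file`, the bucket name and the prefix are hypothetical.

```python
def rename_single_part_file(s3, bucket, prefix, target_key):
    """Copy the single Spark part file under `prefix` to `target_key`,
    then delete the original. `s3` is expected to be a boto3 S3 client.
    Returns the key of the part file that was renamed."""
    # list_objects_v2 returns at most 1000 keys per call, which is
    # assumed to be enough for a single-job output prefix.
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = [obj["Key"] for obj in response.get("Contents", [])]

    # Keep only data files: both "run-...-part-r-00000" and
    # "part-00000-..." contain "part-"; markers like _SUCCESS do not.
    part_keys = [k for k in keys if "part-" in k.rsplit("/", 1)[-1]]
    if len(part_keys) != 1:
        raise ValueError(
            "expected exactly one part file under %r, found %d"
            % (prefix, len(part_keys))
        )

    source_key = part_keys[0]
    # "Rename" = copy to the desired key, then delete the source object.
    s3.copy_object(
        Bucket=bucket,
        Key=target_key,
        CopySource={"Bucket": bucket, "Key": source_key},
    )
    s3.delete_object(Bucket=bucket, Key=source_key)
    return source_key


# Hypothetical usage at the end of the Glue script, after the write
# to s3://bucket/aws-glue/tmp/ has finished:
#
#   import boto3
#   rename_single_part_file(
#       boto3.client("s3"),
#       "bucket",
#       "aws-glue/tmp/",
#       "aws-glue/Customer_Transaction.json",
#   )
```

Writing to a temporary prefix first keeps the final key from ever being a directory of part files. The function refuses to guess when it finds zero or several part files, since picking one silently would drop data.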

Answer 2

I think I found a solution. Here is a code snippet that works in my local Hadoop/Spark environment. It still needs to be tested in AWS Glue.

# Reach Hadoop's FileSystem API through the Py4J gateway of the SparkContext (sc)
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

fs = FileSystem.get(sc._jsc.hadoopConfiguration())
srcpath = Path("/user/cloudera/IMG_5252.mov")
dstpath = Path("/user/cloudera/IMG_5252_123.mov")
if not fs.exists(srcpath):
    print("Input path does not exist")
else:
    # rename() is a method of FileSystem, not of Path
    fs.rename(srcpath, dstpath)
