Save a JSON file as a JSON array instead of JSON objects on S3

Question

I want to save a DataFrame to S3 in JSON format. It is saved as a file of JSON objects, but I want a JSON array file.

I have a CSV file on S3, which I am loading into a DataFrame in AWS Glue. After performing some transformations I write the DataFrame back to S3 as JSON. But it creates a file of JSON objects like:

{obj1} {obj2}

but I want to save it as a JSON array file, like: [{obj1},{obj2}]

datasource0 = glueContext.create_dynamic_frame.from_options(connection_type = "s3", connection_options = {"paths": [s3_path], "useS3ListImplementation": True, "recurse": True}, format = "csv", format_options = {"withHeader": True, "separator": "|"})

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("cdw_zip_id", "string", "cdw_zip_id", "string"), ("zip_code", "string", "zip_code", "string"), ("cdw_terr_id", "string", "cdw_terr_id", "string")], transformation_ctx = "applymapping1")

applymapping2 = applymapping1.toDF()
applymapping2.coalesce(1).write.format("org.apache.spark.sql.json").mode("overwrite").save(args['DEST_PATH'])

Actual: {obj1} {obj2}
Expected: [{obj1},{obj2}]

json amazon-web-services amazon-s3 apache-spark-sql aws-glue
1 Answer

Spark evaluates lazily: when the df.write action is invoked, all of the transformations are applied in a single pass to the records read from every partition, and the work is executed in parallel across the configured nodes that hold those partitions.
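To make that order of evaluation concrete, here is a minimal sketch; the filter step is illustrative and not part of the question's job, with column names reused from the mapping above:

# Transformations are only recorded here; no data is read yet:
df2 = applymapping2.filter("zip_code IS NOT NULL")
# The read, the filter, and the write all execute when this action runs,
# with each partition's task writing its own part file:
df2.write.format("json").mode("overwrite").save(args['DEST_PATH'])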

Because each task writes its output independently, each record is written to the target on its own rather than as one complete JSON document.
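For illustration, each part file produced by the write above contains one JSON object per line (the JSON Lines format; the sample values here are made up), and Spark reads this line-delimited format back natively:

# Contents of a part file: one object per line, not a JSON array:
# {"cdw_zip_id": "1", "zip_code": "10001", "cdw_terr_id": "A"}
# {"cdw_zip_id": "2", "zip_code": "10002", "cdw_terr_id": "B"}
# spark.read.json consumes JSON Lines directly:
df = spark.read.json(args['DEST_PATH'])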

Calling coalesce only merges the partition data into fewer files; it does not change the behavior of Spark's write operation, so even a single output file still contains one JSON object per record.
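If the result is small enough to fit in driver memory, one common workaround (a sketch, not part of the behavior described above; the bucket and key names are placeholders) is to collect the records on the driver and assemble the array yourself:

import boto3

# Collect each record as a JSON string on the driver (only safe for small data):
rows = applymapping2.toJSON().collect()
# Wrap the records in a single JSON array:
body = "[" + ",".join(rows) + "]"
# Upload the array as one file to S3 (bucket/key are placeholders):
boto3.client("s3").put_object(Bucket="my-bucket", Key="output/data.json", Body=body.encode("utf-8"))

This trades Spark's parallel write for a single well-formed array file, so it is only appropriate for modest result sizes.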
