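I have a very simple Glue job that loads data from S3 into Redshift, with a Transform in between that renames the fields and changes their types:
The first run works (almost) without problems: the data is loaded into Redshift. Every subsequent run fails. The reason is that Glue creates the Redshift table correctly on the first load, but handles it incorrectly once it already exists.
This happens for every field converted to decimal (I have not tested other types).
CSV file: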
Text value,Average whatever,Another string,Just a number
A1,2.2,test,5
A2,5,test2,7
Transform (Change Schema):
Generated code (I did not edit it; the job is still a "visual" job):
...
# Script generated for node Amazon S3
AmazonS3_node1710618800725 = glueContext.create_dynamic_frame.from_options(format_options={"quoteChar": "\"", "withHeader": True, "separator": ","}, connection_type="s3", format="csv", connection_options={"paths": ["s3://<source-s3-bucket>/test/gonna_fail/data.csv"]}, transformation_ctx="AmazonS3_node1710618800725")
# Script generated for node Change Schema
ChangeSchema_node1710691042153 = ApplyMapping.apply(frame=AmazonS3_node1710618800725, mappings=[("Text value", "string", "text_value", "string"), ("Average whatever", "string", "average_whatever", "decimal"), ("Another string", "string", "another_string", "string"), ("Just a number", "string", "just_a_number", "decimal")], transformation_ctx="ChangeSchema_node1710691042153")
# Script generated for node Amazon Redshift
AmazonRedshift_node1710618808047 = glueContext.write_dynamic_frame.from_options(frame=ChangeSchema_node1710691042153, connection_type="redshift", connection_options={"redshiftTmpDir": "s3://aws-glue-assets-xxx-eu-central-1/temporary/", "useConnectionProperties": "true", "dbtable": "raw_data.gonna_fail", "connectionName": "serverless-redshift", "preactions": "DROP TABLE IF EXISTS raw_data.gonna_fail; CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL);"}, transformation_ctx="AmazonRedshift_node1710618808047")
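To make the intent of the Change Schema node easier to follow outside Glue, here is a rough plain-Python approximation of the same rename-and-cast step applied to the sample CSV. It is only a sketch: `MAPPINGS` and `apply_mapping` are illustrative names, and mapping Glue's `decimal` to Python's `Decimal` is an assumption made for the example, not how Glue implements it.

```python
import csv
import io
from decimal import Decimal

# Sample file from the question.
CSV_TEXT = """Text value,Average whatever,Another string,Just a number
A1,2.2,test,5
A2,5,test2,7
"""

# (source name, target name, converter), mirroring the ApplyMapping tuples.
# Decimal stands in for Glue's "decimal" type (assumption for this sketch).
MAPPINGS = [
    ("Text value", "text_value", str),
    ("Average whatever", "average_whatever", Decimal),
    ("Another string", "another_string", str),
    ("Just a number", "just_a_number", Decimal),
]


def apply_mapping(row):
    """Rename and cast one CSV row according to MAPPINGS."""
    return {target: convert(row[source]) for source, target, convert in MAPPINGS}


rows = [apply_mapping(r) for r in csv.DictReader(io.StringIO(CSV_TEXT))]
for row in rows:
    print(row)
```

The result has exactly four keys per row, which is the column set the Redshift table is expected to have.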
7:26:26 PM CREATE TABLE IF NOT EXISTS "raw_data"."gonna_fail" ("text_value" VARCHAR(MAX), "average_whatever" DECIMAL(10,2), "another_string" VARCHAR(MAX), "just_a_number" DECIMAL(10,2)) DISTSTYLE EVEN
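The table is created correctly. After that an error occurs, "Expected command status BEGIN, got CREATE TABLE" (how can I avoid it?), but the job succeeds on the retry 30 seconds later: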
7:26:56 PM CREATE TABLE IF NOT EXISTS "raw_data"."gonna_fail" ("text_value" VARCHAR(MAX), "average_whatever" DECIMAL(10,2), "another_string" VARCHAR(MAX), "just_a_number" DECIMAL(10,2)) DISTSTYLE EVEN
7:26:56 PM DROP TABLE IF EXISTS raw_data.gonna_fail
7:26:56 PM CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL)
7:26:56 PM COPY "raw_data"."gonna_fail" ("text_value","average_whatever","another_string","just_a_number") FROM 's3://aws-glue-assets-xxx-eu-central-1/temporary/63e30430-67f0-4ab2-b539-22180ae2920b/manifest.json' FORMAT AS CSV NULL AS '@NULL@' manifest CREDENTIALS ''
On the next run, a new column is added for each decimal field, named by concatenating the field name and the data type:
7:29:19 PM ALTER TABLE raw_data.gonna_fail add "average_whatever_decimal(10,2)" DECIMAL(10,2) default NULL;
7:29:19 PM ALTER TABLE raw_data.gonna_fail add "just_a_number_decimal(10,2)" DECIMAL(10,2) default NULL;
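The added names follow a recognisable shape (base name, underscore, type), so a small check over the table's column list can flag them. The sketch below only reproduces the pattern visible in the two log lines above; the regular expression is derived from those lines and not from any documented Glue behaviour, `find_stray_columns` is a hypothetical helper, and fetching the column list from Redshift is left out.

```python
import re

# Matches names such as "just_a_number_decimal(10,2)": a base name, an
# underscore, then a decimal type with precision and scale.
STRAY = re.compile(r"^(?P<base>.+)_decimal\(\d+,\d+\)$")


def find_stray_columns(columns):
    """Return {stray_name: base_name} for columns that look like name + type.

    A column is only reported when its base name is also present, which
    avoids flagging a column that is legitimately named that way.
    """
    existing = set(columns)
    stray = {}
    for name in columns:
        match = STRAY.match(name)
        if match and match.group("base") in existing:
            stray[name] = match.group("base")
    return stray


# Column list as it appears in the CREATE TABLE statement of the failing run.
columns = [
    "text_value",
    "average_whatever",
    "another_string",
    "just_a_number",
    "just_a_number_decimal(10,2)",
    "average_whatever_decimal(10,2)",
]
stray = find_stray_columns(columns)
print(stray)
```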
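This load also fails (I did not check why) and is retried after 30 seconds:
The CREATE TABLE is executed (I don't know why the create statement runs twice: once "automatically" and once through the preactions):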
7:29:54 PM CREATE TABLE IF NOT EXISTS "raw_data"."gonna_fail" ("text_value" VARCHAR(MAX), "average_whatever" DECIMAL(10,2), "another_string" VARCHAR(MAX), "just_a_number" DECIMAL(10,2), "just_a_number_decimal(10,2)" DECIMAL(10,2), "average_whatever_decimal(10,2)" DECIMAL(10,2)) DISTSTYLE EVEN
Preactions:
7:29:54 PM DROP TABLE IF EXISTS raw_data.gonna_fail
7:29:54 PM CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL)
Incorrect COPY statement:
7:29:54 PM COPY "raw_data"."gonna_fail" ("text_value","average_whatever","another_string","just_a_number","just_a_number_decimal(10,2)","average_whatever_decimal(10,2)") FROM 's3://aws-glue-assets-xxx-eu-central-1/temporary/d46ca4ae-86cc-4444-addd-6c54c376a2a1/manifest.json' FORMAT AS CSV NULL AS '@NULL@' manifest CREDENTIALS ''
This fails, Spark retries 3 times, and the load fails. The error shown in Glue:
Caused by: com.amazon.redshift.util.RedshiftException: ERROR: column "just_a_number_decimal(10,2)" of relation "gonna_fail" does not exist
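I did not find those additional/incorrect fields in the frame's .schema().fields.
I don't think much can be done in the visual editor. Clone the job and update the script as follows: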
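One way to pin down the mismatch is to compare the column list of the logged COPY statement with the columns the mapping is supposed to produce. The following is a minimal sketch under stated assumptions: `copy_columns` and `unexpected_columns` are hypothetical helpers, the COPY statement is shortened (the manifest path is replaced by a placeholder), and the expected list is typed in by hand here, although in the job it could be taken from the frame's schema.

```python
import re

# Captures the parenthesised column list between the quoted table name and FROM.
# The non-greedy group stops at the ")" that is followed by FROM, so the
# parentheses inside names like "x_decimal(10,2)" are not a problem.
COPY_RE = re.compile(r'COPY\s+"[^"]+"\."[^"]+"\s+\((.*?)\)\s+FROM', re.IGNORECASE)


def copy_columns(statement):
    """Return the quoted column names listed in a COPY statement."""
    match = COPY_RE.search(statement)
    if match is None:
        raise ValueError("no COPY column list found")
    return re.findall(r'"([^"]*)"', match.group(1))


def unexpected_columns(statement, expected):
    """Return COPY columns that are not in the expected column list."""
    expected_set = set(expected)
    return [c for c in copy_columns(statement) if c not in expected_set]


# Shortened version of the failing statement; '<manifest>' is a placeholder.
failing_copy = (
    'COPY "raw_data"."gonna_fail" ("text_value","average_whatever",'
    '"another_string","just_a_number","just_a_number_decimal(10,2)",'
    '"average_whatever_decimal(10,2)") FROM \'<manifest>\' FORMAT AS CSV'
)
# Target names from the ApplyMapping call.
expected = ["text_value", "average_whatever", "another_string", "just_a_number"]

extra = unexpected_columns(failing_copy, expected)
print(extra)
```

If the extra columns are absent from the frame's schema but present in the COPY statement, that suggests they come from the existing table's metadata rather than from the data being written.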
AmazonRedshift_node1710618808047 = glueContext.write_dynamic_frame.from_options(
    frame=ChangeSchema_node1710691042153,
    connection_type="redshift",
    connection_options={
        "redshiftTmpDir": "s3://aws-glue-assets-xxx-eu-central-1/temporary/",
        # Load into a staging table instead of the final table.
        "dbtable": "raw_data.temp_gonna_fail",
        "connectionName": "serverless-redshift",
        # Recreate the final table with the intended schema before the load.
        "preactions": "DROP TABLE IF EXISTS raw_data.gonna_fail; CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL);",
        # Copy the staged rows into the final table, then drop the staging table.
        "postactions": "BEGIN; INSERT INTO raw_data.gonna_fail SELECT * from raw_data.temp_gonna_fail; drop table if exists raw_data.temp_gonna_fail; END;",
    },
    transformation_ctx="AmazonRedshift_node1710618808047",
)
I added "postactions" to connection_options and removed "useConnectionProperties": "true" (if the script fails because of that, add it back).
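If the same staging-table pattern is needed for several tables, the options can be assembled by a small helper instead of being written out by hand each time. This is a sketch only: `build_redshift_options` and its parameters are hypothetical names, and it just builds the dictionary that would be passed as `connection_options`; it does not call Glue.

```python
def build_redshift_options(target, staging, column_ddl, tmp_dir, connection_name):
    """Build connection options that load into a staging table.

    preactions recreate the target table with the intended schema;
    postactions copy the staged rows into it and drop the staging table.
    """
    preactions = (
        f"DROP TABLE IF EXISTS {target}; "
        f"CREATE TABLE IF NOT EXISTS {target} ({column_ddl});"
    )
    postactions = (
        f"BEGIN; INSERT INTO {target} SELECT * FROM {staging}; "
        f"DROP TABLE IF EXISTS {staging}; END;"
    )
    return {
        "redshiftTmpDir": tmp_dir,
        "dbtable": staging,  # Glue writes here, not into the target table
        "connectionName": connection_name,
        "preactions": preactions,
        "postactions": postactions,
    }


options = build_redshift_options(
    target="raw_data.gonna_fail",
    staging="raw_data.temp_gonna_fail",
    column_ddl=(
        "text_value VARCHAR, average_whatever DECIMAL, "
        "another_string VARCHAR, just_a_number DECIMAL"
    ),
    tmp_dir="s3://aws-glue-assets-xxx-eu-central-1/temporary/",
    connection_name="serverless-redshift",
)
print(options["preactions"])
print(options["postactions"])
```

Because the table and column names are interpolated directly into SQL, the helper should only be fed trusted, hard-coded identifiers.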