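I have a self-authored Glue script and a JDBC Connection stored in the Glue catalog. I can't figure out how to use PySpark to run a select statement against the MySQL database in RDS that my JDBC connection points to. I have also used a Glue Crawler to infer the schema of the RDS table I want to query. How do I query the RDS database using a WHERE clause?
I have looked through the documentation for the DynamicFrameReader and GlueContext classes, but neither seems to point in the direction I am looking for.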
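It depends on what you want to do. For example, if you want to run a select * from table where <conditions>, there are two options: filter with Glue transforms, or convert to a Spark DataFrame and use Spark SQL.
Assuming you created a crawler and added the source to your AWS Glue job like this: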
# Read data from database
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "students", redshift_tmp_dir = args["TempDir"])
# Option 1: Glue transforms. Select the needed fields
selectfields1 = SelectFields.apply(frame = datasource0, paths = ["user_id", "full_name", "is_active", "org_id", "org_name", "institution_id", "department_id"], transformation_ctx = "selectfields1")
# Keep only the rows whose org_id is in org_ids (a Python list or set of ids defined earlier in the job)
filter2 = Filter.apply(frame = selectfields1, f = lambda x: x["org_id"] in org_ids, transformation_ctx="filter2")
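# Option 2: Spark SQL. DynamicFrame must be imported for the conversion back at the end
from awsglue.dynamicframe import DynamicFrame
# Change DynamicFrame to Spark DataFrame
dataframe = datasource0.toDF()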
# Create a view
dataframe.createOrReplaceTempView("students")
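# Use SparkSQL to select the fields (spark is the job's SparkSession, e.g. glueContext.spark_session)
# org_ids is assumed to be a list of numeric ids, as in the Filter call; join it into a SQL list
org_ids_sql = ", ".join(str(org_id) for org_id in org_ids)
# Query the "students" view registered above
dataframe_sql_df_dim = spark.sql("SELECT user_id, full_name, is_active, org_id, org_name, institution_id, department_id FROM students WHERE org_id IN (" + org_ids_sql + ")")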
# Change back to DynamicFrame
selectfields = DynamicFrame.fromDF(dataframe_sql_df_dim, glueContext, "selectfields2")
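The Spark SQL query above builds its IN list by string concatenation. As a minimal sketch of doing that a little more defensively, the helper below turns a Python list of ids into the WHERE fragment. The name build_in_clause is hypothetical (it is not part of Glue or Spark), and the sketch assumes that org_ids holds integer ids.

```python
def build_in_clause(column, values):
    """Return a SQL fragment such as "org_id IN (1, 2, 3)".

    Hypothetical helper, not part of AWS Glue or Spark.
    Assumes the values are integer ids.
    """
    # int() raises ValueError for strings that are not integers,
    # so arbitrary SQL text cannot be smuggled into the query.
    ids = [int(v) for v in values]
    if not ids:
        # "IN ()" is not valid SQL, so return a condition that matches no rows.
        return "1 = 0"
    return "{} IN ({})".format(column, ", ".join(str(i) for i in ids))
```

In the job this could be used as spark.sql("SELECT user_id, full_name FROM students WHERE " + build_in_clause("org_id", org_ids)), with the same "students" temporary view registered above.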