Overwriting a MySQL table with AWS Glue

Question

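I have a Lambda process that occasionally polls an API for the latest data. This data has unique keys, and I would like to use Glue to update the corresponding table in MySQL. Is there an option to overwrite the data using this key (similar to Spark's mode=overwrite)? If not, can I truncate the table in Glue before inserting all the new data?

Thanks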

Answer 1

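The workaround I came up with, which is slightly simpler than the alternatives posted, is as follows:

Create a staging table in MySQL and load the new data into it.
Run the command: REPLACE INTO myTable SELECT * FROM myStagingTable;
Truncate the staging table.

The REPLACE step can be run from the Glue job like this: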

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

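import pymysql
pymysql.install_as_MySQLdb()
import MySQLdb

# "URL", "USERNAME", "PASSWORD" and "DATABASE" are placeholders for your own values.
db = MySQLdb.connect(host="URL", user="USERNAME", password="PASSWORD", database="DATABASE")
cursor = db.cursor()
# Rows whose key already exists in myTable are replaced; new keys are inserted.
cursor.execute("REPLACE INTO myTable SELECT * FROM myStagingTable")
db.commit()  # pymysql does not autocommit by default
# Empty the staging table so the next run starts clean.
cursor.execute("TRUNCATE TABLE myStagingTable")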

db.close()
job.commit()
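
To see what the three steps do to the data, here is a minimal sketch that runs the same staging-table merge against an in-memory SQLite database. SQLite is only a stand-in for MySQL so the example runs anywhere (an assumption made for illustration): it accepts the same REPLACE INTO ... SELECT statement but has no TRUNCATE, so DELETE is used instead. The column names and the merge_staging helper are hypothetical.

```python
import sqlite3


def merge_staging(conn, target="myTable", staging="myStagingTable"):
    """Copy staged rows over the target by primary key, then empty staging."""
    cur = conn.cursor()
    # Rows with an existing key are replaced; rows with a new key are inserted.
    cur.execute(f"REPLACE INTO {target} SELECT * FROM {staging}")
    # SQLite has no TRUNCATE; on MySQL this step would be TRUNCATE TABLE.
    cur.execute(f"DELETE FROM {staging}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE myStagingTable (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO myTable VALUES (?, ?)", [(1, "old"), (2, "keep")])
conn.executemany("INSERT INTO myStagingTable VALUES (?, ?)", [(1, "new"), (3, "added")])

merge_staging(conn)

rows = conn.execute("SELECT id, value FROM myTable ORDER BY id").fetchall()
staging_count = conn.execute("SELECT COUNT(*) FROM myStagingTable").fetchone()[0]
print(rows)  # [(1, 'new'), (2, 'keep'), (3, 'added')]
```

Note that rows absent from the staging table (id 2 here) are left untouched, so this is an upsert by key rather than a full overwrite of the table.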

Answer 2

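I ran into the same problem with Redshift, and the best solution we could come up with was to create a Java class that loads the MySQL driver and issues a TRUNCATE TABLE statement: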

package com.my.glue.utils.mysql;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

@SuppressWarnings("unused")
public class MySQLTruncateClient {
    public void truncate(String tableName, String url) throws SQLException, ClassNotFoundException {
        Class.forName("com.mysql.jdbc.Driver");
        try (Connection mysqlConnection = DriverManager.getConnection(url);
            Statement statement = mysqlConnection.createStatement()) {
            statement.execute(String.format("TRUNCATE TABLE %s", tableName));
        }
    }
}

Upload the JAR to S3 together with the MySQL JAR dependency, and make your job depend on both. In your PySpark script, you can then load the truncate method with:

from py4j.java_gateway import java_import

# glue_context is the job's GlueContext instance.
java_import(glue_context._jvm, "com.my.glue.utils.mysql.MySQLTruncateClient")
truncate_client = glue_context._jvm.MySQLTruncateClient()
truncate_client.truncate('my_table', 'jdbc:mysql://...')
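
Because the Java class formats the table name straight into the SQL string, it may be worth checking the name on the Python side before calling it, especially if the name comes from a job argument. The following is a minimal sketch; validated_table_name is a hypothetical helper, and the pattern is deliberately more conservative than what MySQL actually allows in identifiers.

```python
import re

# Conservative pattern: letters, digits and underscores, not starting with a digit.
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validated_table_name(name):
    """Return the name unchanged if it looks like a plain identifier, else raise."""
    if not _TABLE_NAME.fullmatch(name):
        raise ValueError(f"refusing suspicious table name: {name!r}")
    return name
```

It could then wrap the argument in the call above, for example truncate_client.truncate(validated_table_name('my_table'), 'jdbc:mysql://...').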

Answer 3

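I found a simpler way that works with JDBC connections in Glue. When you write data to a Redshift cluster, the Glue team recommends truncating the table using the following sample code: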

datasink5 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = resolvechoice4,
    catalog_connection = "<connection-name>",
    connection_options = {
        "dbtable": "<target-table>",
        "database": "testdb",
        "preactions": "TRUNCATE TABLE <table-name>"
    },
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink5"
)

where

connection-name    name of the Glue connection to your Redshift cluster
target-table       the table you are loading the data into
testdb             name of the database
table-name         name of the table to truncate (ideally the table you are loading into)
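
Since the truncated table is ideally the same one being loaded, the options dictionary can be built from a single table name so the two cannot drift apart. This is only a sketch: truncate_then_load_options is a hypothetical helper and "my_table" is an illustrative name. The answer describes preactions for Redshift, and nothing here establishes that a MySQL connection honors it.

```python
def truncate_then_load_options(target_table, database):
    """Build connection_options that truncate the target table before loading it."""
    return {
        "dbtable": target_table,
        "database": database,
        # Runs before the write, as in the Redshift sample above.
        "preactions": f"TRUNCATE TABLE {target_table}",
    }


options = truncate_then_load_options("my_table", "testdb")
```

The resulting dictionary would be passed as connection_options in the from_jdbc_conf call shown above.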
