I'm running into slow performance when loading data from an Excel file into an existing Redshift table using Pandas. The Excel file has 10+ columns and 20,000+ rows, and the operation takes more than 7 hours to complete. Is there a way to optimize the code and improve performance? Note that I don't have access to S3, so using it is not an option.
import pandas as pd
import psycopg2

# Establish a connection to Redshift
conn = psycopg2.connect(
    host='your-redshift-host',
    port='your-redshift-port',
    user='your-username',
    password='your-password',
    dbname='your-database'
)

# Read the contents of excel_file into a dataframe
df = pd.read_excel(excel_file)

# Empty the target table, then insert the dataframe records one row at a time
cur = conn.cursor()
cur.execute('TRUNCATE TABLE table_name')
insert_query = "INSERT INTO table_name (tablecolumn1, tablecolumn2, tablecolumn3) VALUES (%s, %s, %s)"
for index, row in df.iterrows():
    cur.execute(insert_query, (row['dfcolumn1'], row['dfcolumn2'], row['dfcolumn3']))
conn.commit()
cur.close()
You are using df.iterrows() to loop over the dataframe; using this function to iterate over any dataframe is not recommended (source).
I suggest using the apply function instead; you can find more information about apply here.
In your case, the following modification should give better performance:
import pandas as pd
import psycopg2

# Establish a connection to Redshift
conn = psycopg2.connect(
    host='your-redshift-host',
    port='your-redshift-port',
    user='your-username',
    password='your-password',
    dbname='your-database'
)

# Read the contents of excel_file into a dataframe
df = pd.read_excel(excel_file)

# Empty the target table, then insert the dataframe records
cur = conn.cursor()
cur.execute('TRUNCATE TABLE table_name')
insert_query = "INSERT INTO table_name (tablecolumn1, tablecolumn2, tablecolumn3) VALUES (%s, %s, %s)"
# Use apply instead of the explicit iterrows() loop
df.apply(lambda row: cur.execute(insert_query, (row['dfcolumn1'], row['dfcolumn2'], row['dfcolumn3'])), axis=1)
# for index, row in df.iterrows():
#     cur.execute(insert_query, (row['dfcolumn1'], row['dfcolumn2'], row['dfcolumn3']))
conn.commit()
cur.close()
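Note that apply only removes the Python-level iterrows overhead; each row still issues a separate INSERT, so every row is a network round trip to Redshift, and that is usually the dominant cost at 20,000+ rows. A further option worth trying is psycopg2's execute_values helper, which packs many rows into a single multi-row INSERT. Below is a minimal sketch, reusing the same placeholder connection, table, and column names as above:

import pandas as pd
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(
    host='your-redshift-host',
    port='your-redshift-port',
    user='your-username',
    password='your-password',
    dbname='your-database'
)

df = pd.read_excel(excel_file)

cur = conn.cursor()
cur.execute('TRUNCATE TABLE table_name')

# Build a list of plain tuples once, then send them in large batches;
# execute_values expands the single VALUES %s into a multi-row INSERT.
rows = list(df[['dfcolumn1', 'dfcolumn2', 'dfcolumn3']].itertuples(index=False, name=None))
execute_values(
    cur,
    "INSERT INTO table_name (tablecolumn1, tablecolumn2, tablecolumn3) VALUES %s",
    rows,
    page_size=1000  # rows per INSERT statement; tune for your data
)
conn.commit()
cur.close()

With page_size=1000 this sends roughly 20 statements instead of 20,000, which in my experience is where the large wins come from when S3 and COPY are not available.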