I'm running into slow performance when loading data from an Excel file into an existing Redshift table using Pandas. The Excel file has 10+ columns and 20,000+ rows, and the operation takes more than 7 hours to complete. Is there a way to optimize the code and improve performance? Note that I don't have access to S3, so using it is not an option.
import pandas as pd
import psycopg2

# Establish a connection to Redshift
conn = psycopg2.connect(
    host='your-redshift-host',
    port='your-redshift-port',
    user='your-username',
    password='your-password',
    dbname='your-database'
)

# Read the contents of excel_file into a dataframe
df = pd.read_excel(excel_file)

# Empty the target table, then insert the dataframe records one row at a time
cur = conn.cursor()
cur.execute('TRUNCATE TABLE table_name')
insert_query = "INSERT INTO table_name (tablecolumn1, tablecolumn2, tablecolumn3) VALUES (%s, %s, %s)"
for index, row in df.iterrows():
    cur.execute(insert_query, (row['dfcolumn1'], row['dfcolumn2'], row['dfcolumn3']))
conn.commit()
cur.close()
You are using df.iterrows() to loop over the dataframe; using this function to iterate over any dataframe is not recommended (source).
I suggest using the apply function instead; you can find more information about apply here.
In your case, the following modification should give better performance:
import pandas as pd
import psycopg2

# Establish a connection to Redshift
conn = psycopg2.connect(
    host='your-redshift-host',
    port='your-redshift-port',
    user='your-username',
    password='your-password',
    dbname='your-database'
)

# Read the contents of excel_file into a dataframe
df = pd.read_excel(excel_file)

# Empty the target table, then insert the dataframe records
cur = conn.cursor()
cur.execute('TRUNCATE TABLE table_name')
insert_query = "INSERT INTO table_name (tablecolumn1, tablecolumn2, tablecolumn3) VALUES (%s, %s, %s)"
# Use apply instead of the explicit iterrows() loop
df.apply(lambda row: cur.execute(insert_query, (row['dfcolumn1'], row['dfcolumn2'], row['dfcolumn3'])), axis=1)
# for index, row in df.iterrows():
#     cur.execute(insert_query, (row['dfcolumn1'], row['dfcolumn2'], row['dfcolumn3']))
conn.commit()
cur.close()
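Note that apply only removes the Python-level iterrows overhead; each row still issues a separate INSERT, so every row is a network round trip to Redshift, and that is usually the dominant cost at 20,000+ rows. A further option worth trying is psycopg2's execute_values helper, which packs many rows into a single multi-row INSERT. Below is a minimal sketch, reusing the same placeholder connection, table, and column names as above:

import pandas as pd
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(
    host='your-redshift-host',
    port='your-redshift-port',
    user='your-username',
    password='your-password',
    dbname='your-database'
)

df = pd.read_excel(excel_file)

cur = conn.cursor()
cur.execute('TRUNCATE TABLE table_name')

# Build a list of plain tuples once, then send them in large batches;
# execute_values expands the single VALUES %s into a multi-row INSERT.
rows = list(df[['dfcolumn1', 'dfcolumn2', 'dfcolumn3']].itertuples(index=False, name=None))
execute_values(
    cur,
    "INSERT INTO table_name (tablecolumn1, tablecolumn2, tablecolumn3) VALUES %s",
    rows,
    page_size=1000  # rows per INSERT statement; tune for your data
)
conn.commit()
cur.close()

With page_size=1000 this sends roughly 20 statements instead of 20,000, which in my experience is where the large wins come from when S3 and COPY are not available.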