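I am trying to recreate in Python with Pandas a SQL query I wrote. I'm not very familiar with SQL, so I may be missing something obvious. I just want to sum the rows where "question" is "Dollar Amount" and then group by subdomain, but I keep getting wildly different results. I'm using MySQL.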
SQL queries (both produce the same result, shown below):
SELECT subdomain,
       SUM(CAST(REPLACE(answer, '$', '') AS DECIMAL(10, 2))) AS amounts
FROM (
    SELECT subdomain, answer, entry_date
    FROM assessments AS a
    WHERE question = 'Dollar Amount'
      AND answer IS NOT NULL
      AND answer != ''
      AND DATE(entry_date) BETWEEN '2024-07-01' AND '2024-09-30'
      AND entry_date = (
          SELECT MAX(entry_date)
          FROM assessments
          WHERE form_submission_id = a.form_submission_id
      )
) AS latest_entries
GROUP BY subdomain
ORDER BY amounts DESC;
SELECT subdomain,
       SUM(CAST(REPLACE(answer, '$', '') AS DECIMAL(10, 2))) AS amounts
FROM assessments
WHERE form_name LIKE '%-support-services-fund%'
  AND question = 'Dollar Amount'
  AND answer IS NOT NULL
  AND answer != ''
  AND DATE(entry_date) BETWEEN '2024-07-01' AND '2024-09-30'
GROUP BY subdomain
ORDER BY amounts DESC;
First 5 rows of the SQL output:
subdomain 1,82263.89
subdomain 2,74560.26
subdomain 3,28501.65
subdomain 4,8764.40
subdomain 5,8493.30
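For comparison, the "latest entry per form_submission_id" filter from the first query could be sketched in pandas roughly as follows. This is only an illustration: the sample rows are made up, and only the column names come from the queries above. As in the correlated subquery, the maximum is taken over all rows of the table, before any other filtering.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the queries.
df = pd.DataFrame({
    "form_submission_id": [1, 1, 2, 3],
    "subdomain": ["a", "a", "a", "b"],
    "answer": ["$10.00", "$12.00", "$5.00", "$7.50"],
    "entry_date": pd.to_datetime([
        "2024-07-01 09:00",
        "2024-07-02 09:00",
        "2024-08-01 09:00",
        "2024-09-01 09:00",
    ]),
})

# Latest entry_date per submission, aligned row by row with df.
latest = df.groupby("form_submission_id")["entry_date"].transform("max")

# Keep only rows whose entry_date equals that per-submission maximum.
latest_entries = df[df["entry_date"] == latest]
print(latest_entries)
```

Submission 1 has two entries here, and only the later one survives the filter.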
Python script:
from sqlalchemy import create_engine
import pandas as pd

def create_db_engine():
    # Database connection details
    user =
    password =
    host =
    port = "3306"
    database =
    # Local SSL file paths
    ssl_args = {
        "ssl": {
            "ca": ,
            "cert": ,
            "key": ,
        }
    }
    # Create engine with SSL connection
    engine = create_engine(
        f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}",
        connect_args=ssl_args
    )
    return engine

engine = create_db_engine()
table_df = pd.read_sql_table(
    "assessments",
    con=engine,
    columns=[  # a list of all column names as strings (I'm redacting this)
    ],
)
ssf = table_df[table_df['form_name'].str.contains('support-services-fund')]
date_1 = '2024-07-01'
date_2 = '2024-09-30'
date_restricted_ssf = ssf[ssf['entry_date'].between(date_1, date_2)]
# Filter strictly for rows where 'question' matches 'Dollar Amount' exactly
ssf_filtered = date_restricted_ssf[date_restricted_ssf['question'] == 'Dollar Amount']
# Convert answer values to numeric, ensuring only relevant rows are processed
ssf_filtered['amounts'] = pd.to_numeric(ssf_filtered['answer'].str.replace('$', '', regex=True), errors='coerce')
# Group by 'subdomain' and sum the amounts
grouped_df = ssf_filtered.groupby('subdomain', as_index=False)['amounts'].sum()
# Sort the results by 'amounts' in descending order
result_df = grouped_df.sort_values(by='amounts', ascending=False)
# Display the results
print(result_df)
First 5 rows of the Python output:
subdomain amounts
1 subdomain 1 66734.00
6 subdomain 2 61436.02
8 subdomain 3 28501.65
3 subdomain 5 8053.67
5 subdomain 6 5739.55
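One difference that may contribute to the gap is the date filter. The SQL applies DATE(entry_date) before BETWEEN, so every row from September 30 is included, whereas comparing full timestamps against the string '2024-09-30' in pandas treats the upper bound as midnight at the start of that day. A minimal sketch, assuming entry_date is a datetime column and using made-up sample rows:

```python
import pandas as pd

# Hypothetical sample: timestamps around the end of the date range.
df = pd.DataFrame({
    "entry_date": pd.to_datetime([
        "2024-09-29 10:00:00",
        "2024-09-30 00:00:00",
        "2024-09-30 15:30:00",
    ]),
    "amounts": [10.0, 20.0, 30.0],
})

# Timestamp comparison: '2024-09-30' means 2024-09-30 00:00:00,
# so the 15:30 row on the last day is dropped.
kept_ts = df[df["entry_date"].between("2024-07-01", "2024-09-30")]

# Closer to SQL's DATE(entry_date): strip the time part before comparing.
kept_date = df[df["entry_date"].dt.normalize().between("2024-07-01", "2024-09-30")]

print(len(kept_ts), len(kept_date))
```

If entry_date is stored as text instead, the plain string comparison would drop the same late rows, since "2024-09-30 15:30:00" sorts after "2024-09-30".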
Update: I changed the pd.to_numeric line to:
ssf_filtered['amounts'] = pd.to_numeric(ssf_filtered['answer'].str.replace('$', ''), errors='coerce')
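The reason the regex flag matters: as a regular expression, '$' is the end-of-string anchor, not a literal dollar sign, so replacing it removes nothing, and any value that still starts with '$' is then coerced to NaN and silently skipped by sum(). How str.replace treats a single-character pattern with regex=True has differed between pandas versions, so the sketch below uses Python's re module to show the anchor behavior and passes regex=False explicitly. The sample values are made up.

```python
import re

import pandas as pd

# Hypothetical answer values, including one with a thousands separator.
answers = pd.Series(["$1,200.50", "$300", "45.25"])

# As a regex, '$' matches the empty position at the end of the string,
# so the dollar sign itself is left in place.
unchanged = re.sub("$", "", "$300")

# Literal replacement removes the dollar sign.
as_literal = answers.str.replace("$", "", regex=False)
nums_literal = pd.to_numeric(as_literal, errors="coerce")

# Commas still make to_numeric return NaN, so strip them as well.
cleaned = pd.to_numeric(
    as_literal.str.replace(",", "", regex=False), errors="coerce"
)

print(unchanged)
print(nums_literal.isna().sum(), cleaned.sum())
```

Counting the NaN values after conversion, for example with ssf_filtered['amounts'].isna().sum(), is a quick way to see how many rows are being left out of the totals.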