Pandas and SQL compute dollar amounts differently


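I'm trying to recreate a SQL query I wrote using Python and Pandas. I'm not very familiar with SQL, so I may be missing something obvious. I just want to sum the rows where "question" is "Dollar Amount" and then group by subdomain, but I keep getting very different results. I'm using MySQL.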

SQL queries (both produce the same results, shown below):

SELECT subdomain,
       SUM(CAST(REPLACE(answer, '$', '') AS DECIMAL(10, 2))) AS amounts
FROM (
    SELECT subdomain, answer, entry_date
    FROM assessments AS a
    WHERE question = 'Dollar Amount'
      AND answer IS NOT NULL
      AND answer != ''
      AND DATE(entry_date) BETWEEN '2024-07-01' AND '2024-09-30'
      AND entry_date = (
          SELECT MAX(entry_date)
          FROM assessments
          WHERE form_submission_id = a.form_submission_id
      )
) AS latest_entries
GROUP BY subdomain
ORDER BY amounts DESC;

SELECT subdomain,
       SUM(CAST(REPLACE(answer, '$', '') AS DECIMAL(10, 2))) AS amounts
FROM assessments
WHERE form_name LIKE '%-support-services-fund%'
    AND question = 'Dollar Amount'
    AND answer IS NOT NULL
    AND answer != ''
    AND DATE(entry_date) BETWEEN '2024-07-01' AND '2024-09-30'
GROUP BY subdomain
ORDER BY amounts DESC;

First 5 rows of the output:

subdomain 1,82263.89
subdomain 2,74560.26
subdomain 3,28501.65
subdomain 4,8764.40
subdomain 5,8493.30

Python script:

from sqlalchemy import create_engine
import pandas as pd

def create_db_engine():
    # Database connection details
    user = 
    password = 
    host = 
    port = "3306"  
    database = 

    # Local SSL file paths
    ssl_args = {
        "ssl": {
            "ca": ,
            "cert": ,
            "key": ,
        }
    }

    # Create engine with SSL connection
    engine = create_engine(
        f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}",
        connect_args=ssl_args
    )

    return engine

engine = create_db_engine()

table_df = pd.read_sql_table(
    "assessments",
    con=engine,
    columns=[ # a list of all column names as strings (I'm redacting this)
            ],)

ssf = table_df[table_df['form_name'].str.contains('support-services-fund')]

date_1 = '2024-07-01'
date_2 = '2024-09-30'
date_restricted_ssf = ssf[ssf['entry_date'].between(date_1, date_2)]


# Filter strictly for rows where 'question' matches 'Dollar Amount' exactly
ssf_filtered = date_restricted_ssf[date_restricted_ssf['question'] == 'Dollar Amount']

# Convert answer values to numeric, ensuring only relevant rows are processed
ssf_filtered['amounts'] = pd.to_numeric(ssf_filtered['answer'].str.replace('$', '', regex=True), errors='coerce')

# Group by 'subdomain' and sum the amounts
grouped_df = ssf_filtered.groupby('subdomain', as_index=False)['amounts'].sum()

# Sort the results by 'amounts' in descending order
result_df = grouped_df.sort_values(by='amounts', ascending=False)

# Display the results
print(result_df)

First 5 rows of the Python output:

               subdomain   amounts
1            subdomain 1  66734.00
6            subdomain 2  61436.02
8            subdomain 3  28501.65
3            subdomain 5   8053.67
5            subdomain 6   5739.55
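Separately from the amount conversion, the two date filters may not select the same rows. The SQL compares DATE(entry_date), so every timestamp on 2024-09-30 is included, while a pandas between() on a datetime column treats '2024-09-30' as midnight at the start of that day. The sketch below uses made-up timestamps and assumes entry_date carries a time component; it is only an illustration of the boundary behavior, not a confirmed cause of the mismatch.

```python
import pandas as pd

# Made-up timestamps for illustration; the real column comes from the assessments table
df = pd.DataFrame({
    "entry_date": pd.to_datetime([
        "2024-07-01 00:00:00",
        "2024-09-30 00:00:00",
        "2024-09-30 15:45:00",  # afternoon of the last day
        "2024-10-01 00:00:00",
    ])
})

start, end = "2024-07-01", "2024-09-30"

# Timestamp comparison: the upper bound is 2024-09-30 00:00:00,
# so the 15:45 row falls outside the range
by_timestamp = df[df["entry_date"].between(start, end)]

# Calendar-date comparison, closer to DATE(entry_date) BETWEEN ... in MySQL:
# normalize() resets each timestamp to midnight before comparing
by_date = df[df["entry_date"].dt.normalize().between(start, end)]

print(len(by_timestamp), len(by_date))
```

If the row counts differ on the real data, comparing on the normalized dates should bring the pandas filter in line with the SQL one.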
python sql mysql pandas
1 Answer

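The problem is the regex=True argument. As a regular expression, '$' is the end-of-string anchor rather than a literal dollar sign, so in recent pandas versions the symbol is never removed. pd.to_numeric with errors='coerce' then turns every answer that still contains '$' into NaN, and sum() skips those rows. Update the pd.to_numeric line to: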

ssf_filtered['amounts'] = pd.to_numeric(ssf_filtered['answer'].str.replace('$', ''), errors='coerce')
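
A minimal sketch of why the pattern matters, using made-up answer values. It passes regex=False explicitly, which is an assumption about what is safest across pandas versions (the default for this parameter has changed over time), and also shows the escaped pattern as an alternative.

```python
import re
import pandas as pd

# As a regex, "$" matches the empty position at the end of the string,
# so substituting it leaves the dollar sign in place
print(re.sub("$", "", "$1200.50"))

# Made-up answers: two with a dollar sign, one without
answers = pd.Series(["$1200.50", "300", "$45.25"])

# Literal replacement removes the character itself
literal = answers.str.replace("$", "", regex=False)
amounts = pd.to_numeric(literal, errors="coerce")

# An escaped pattern is the regex equivalent
escaped = answers.str.replace(r"\$", "", regex=True)

# What coercion yields when the "$" is still present:
# those values become NaN and are skipped by sum()
unstripped = pd.to_numeric(answers, errors="coerce")

print(amounts.sum(), unstripped.sum(), unstripped.isna().sum())
```

Counting the NaN values in the amounts column, as in the last line, is a quick way to check how many rows were silently dropped from the totals.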