I have a MySQL table, my_table, that is about 38 GB in size and has the following structure:
DESC my_table;
+------------------+----------+-----+-----+
| Field | Type | Null| Key |
+------------------+----------+-----+-----+
| date_available | date | NO | PRI |
| transaction_date | datetime | NO | PRI |
| customer_id | char(36) | NO | PRI |
| transaction_count| bigint | YES | |
+------------------+----------+-----+-----+
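For each transaction_date, I want the number of distinct customers over the past 7 days. For example, for 2020-03-07, I want to count all unique customers between 2020-03-01 and 2020-03-07 (inclusive).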
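To pin down the expected result, here is a minimal brute-force sketch of the definition on a few made-up rows. The data and the function name `rolling_distinct` are hypothetical and only illustrate the inclusive 7-day window; this is not meant to scale to the real table.

```python
from datetime import date, timedelta

# Hypothetical toy data: (transaction day, customer_id) pairs.
rows = [
    (date(2020, 3, 1), "a"),
    (date(2020, 3, 1), "b"),
    (date(2020, 3, 4), "a"),
    (date(2020, 3, 7), "c"),
    (date(2020, 3, 8), "c"),
]

def rolling_distinct(rows, window_days=7):
    """Return {day: number of distinct customers in [day - 6 days, day]}."""
    result = {}
    for day in sorted({d for d, _ in rows}):
        start = day - timedelta(days=window_days - 1)
        # Collect every customer seen inside the inclusive window.
        result[day] = len({c for d, c in rows if start <= d <= day})
    return result

counts = rolling_distinct(rows)
print(counts[date(2020, 3, 7)])  # window 2020-03-01..2020-03-07 holds a, b, c
print(counts[date(2020, 3, 8)])  # window 2020-03-02..2020-03-08 holds a, c
```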
My first attempt was a query using group_concat:
SELECT group_concat(distinct customer_id) AS cust_id, transaction_date
FROM my_table
WHERE transaction_date BETWEEN "2019-12-26" AND "2023-09-05"
GROUP BY transaction_date;
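However, this query takes about 14 minutes to run and only gives me the unique customers for each individual day. To process the data further in Python, I load the result into a DataFrame: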
with ENGINE.connect() as conn:
    # Raise the GROUP_CONCAT length limit for this session only
    conn.execute(text("SET SESSION group_concat_max_len = 999999999999999999"))
    %time res = pd.read_sql(text(query), conn)
To expand the grouped customer IDs and get the rolling-window counts, I tried the explode method:
res['cust_id'] = res.explode('cust_id')[['cust_id', 'transaction_date']]
However, this line kills my Python kernel, probably because the large number of unique customers per date exhausts memory.
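I am also struggling to do this calculation in the query itself rather than in Python. I tried extending the query above so that it covers the last 7 days for each transaction date: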
SELECT group_concat(cust_id) AS concatenated_cust_ids,transaction_date
FROM
(
SELECT group_concat(DISTINCT customer_id) AS cust_id, transaction_date
FROM my_table
WHERE transaction_date BETWEEN '2019-12-26' AND '2023-09-05'
GROUP BY transaction_date
) AS intermediate
WHERE transaction_date BETWEEN DATE_SUB(transaction_date, INTERVAL 6 DAY) AND transaction_date
GROUP BY transaction_date;
For each transaction_date, the idea is to apply group_concat to the inner group_concat results for 7 days (the previous 6 days plus transaction_date itself).
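However, I realized that the logic of this query is flawed: using DATE_SUB(transaction_date, INTERVAL 6 DAY) and transaction_date in the same WHERE clause makes the condition self-referential. In effect, the outer query's condition evaluates to true for every date in the subquery, so nothing is filtered.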
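The always-true condition can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so `date(transaction_date, '-6 days')` replaces `DATE_SUB(transaction_date, INTERVAL 6 DAY)`; the three sample dates are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE intermediate (transaction_date TEXT)")
conn.executemany(
    "INSERT INTO intermediate VALUES (?)",
    [("2020-03-01",), ("2020-03-05",), ("2020-03-20",)],  # hypothetical dates
)

total = conn.execute("SELECT COUNT(*) FROM intermediate").fetchone()[0]

# Each row is compared only with itself, so the predicate cannot reject it.
kept = conn.execute(
    """
    SELECT COUNT(*) FROM intermediate
    WHERE transaction_date
          BETWEEN date(transaction_date, '-6 days') AND transaction_date
    """
).fetchone()[0]

print(total, kept)  # both counts are equal: the filter removes no rows
conn.close()
```

To compare a date with other dates, the two sides of the condition have to come from different row sources, which is what a self-join provides.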
I also tried a join-based solution:
SELECT t1.transaction_date, COUNT(DISTINCT t2.customer_id) as unique_customers
FROM
my_table t1 JOIN my_table t2
ON t2.transaction_date BETWEEN DATE_SUB(t1.transaction_date, INTERVAL 6 DAY)
AND t1.transaction_date
WHERE t1.transaction_date BETWEEN "2019-12-26" AND "2023-09-05"
GROUP BY t1.transaction_date
ORDER BY t1.transaction_date;
However, given that my table is about 38 GB, I am worried about the performance impact of the JOIN.
Can someone guide me on this?
You can try the following, which is simpler:
SELECT t1.transaction_date, COUNT(DISTINCT t2.customer_id) AS distinct_customers
FROM my_table AS t1 LEFT JOIN my_table AS t2 ON
t2.transaction_date BETWEEN DATE_SUB(t1.transaction_date, INTERVAL 6 DAY) AND t1.transaction_date
GROUP BY t1.transaction_date
ORDER BY t1.transaction_date;
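If the self-join turns out to be too slow on a table this size, one alternative is to stream distinct (day, customer_id) pairs in date order and keep only the current 7-day window in memory. The sketch below assumes the pairs come from something like SELECT DISTINCT DATE(transaction_date), customer_id ... ORDER BY 1, which is my assumption and not part of the query above. The function name `rolling_distinct_stream` and the sample data are hypothetical, and only days that have at least one transaction are reported.

```python
from collections import Counter, deque
from datetime import date, timedelta
from itertools import groupby

def rolling_distinct_stream(pairs, window_days=7):
    """Yield (day, distinct customers in the last `window_days` days).

    `pairs` must be (day, customer_id) tuples sorted by day.
    """
    window = deque()    # (day, set of customers seen on that day)
    active = Counter()  # customer_id -> number of window days they appear in
    for day, group in groupby(pairs, key=lambda p: p[0]):
        customers = {c for _, c in group}
        # Evict days that fell out of the inclusive window [day - 6, day].
        cutoff = day - timedelta(days=window_days - 1)
        while window and window[0][0] < cutoff:
            _, old = window.popleft()
            for c in old:
                active[c] -= 1
                if active[c] == 0:
                    del active[c]
        window.append((day, customers))
        active.update(customers)
        yield day, len(active)

# Hypothetical sample, already sorted by day.
pairs = [
    (date(2020, 3, 1), "a"),
    (date(2020, 3, 1), "b"),
    (date(2020, 3, 4), "a"),
    (date(2020, 3, 7), "c"),
    (date(2020, 3, 8), "c"),
]
result = dict(rolling_distinct_stream(pairs))
print(result[date(2020, 3, 7)], result[date(2020, 3, 8)])
```

Memory use is bounded by the customers active within any 7-day span rather than by the whole date range, which is the main reason to prefer this over exploding the full group_concat result.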