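I have a dictionary of counts like this:
{1:2, 2:1, 3:1}
I need to compute q1, the median, and q3 from it. This is straightforward when the number of values is odd, but I can't figure out the even case. I want to do this without using any libraries (such as numpy).
Example: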
counts = {
    "4": 1,
    "1": 2,
    "5": 1
}
results = {
    "q1": 1,
    "median": 2.5,
    "q3": 4,
}
So far I have something like this, but it doesn't handle all cases.
import math

def get_ratings_stats(counts):
    """This function will return min, q1, median, q3 and max value from list of ratings."""
    # Build the cumulative frequency for each value, in sorted order
    cumulative_sum = 0
    cumulative_dict = {}
    for key, value in sorted(counts.items()):
        cumulative_sum += value
        cumulative_dict[key] = cumulative_sum
    q1_index = math.floor(cumulative_sum * 0.25)
    q3_index = math.ceil(cumulative_sum * 0.75)
    median_index = cumulative_sum * 0.5
    q1, q3, median = None, None, None
    print('indexes: ', q1_index, median_index, q3_index)
    # `cum` instead of `sum`, to avoid shadowing the built-in
    for key, cum in cumulative_dict.items():
        if not q1 and cum >= q1_index:
            q1 = key
        if not q3 and cum >= q3_index:
            q3 = key
        if not median and cum >= median_index:
            median = key
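The OP's code is almost complete; only the final part has a problem. Below are several different implementations, together with their measured execution times.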
import math
import statistics as st # used for stats_with_stats & workbench

def stats_with_stats(data:dict):
    # flatten the data
    f_table = []
    for v, freq in data.items():
        f_table.extend([v]*freq)
    return st.quantiles(f_table)

def stats_by_cards(data:dict):
    n = sum(data.values()) # total frequency
    q1_i = math.floor(n * 0.25)
    q2_i = n * 0.5
    q3_i = math.ceil(n * 0.75)
    qs = iter((q1_i, q2_i, q3_i))
    out_stats = []
    q = next(qs)
    cum_f = 0
    for v, freq in sorted(data.items()):
        cum_f_new = cum_f + freq
        # the current quantile index falls inside this value's frequency range
        if cum_f <= q < cum_f_new:
            out_stats.append(v)
            q = next(qs, None)
            if q is None:
                break
        cum_f = cum_f_new
    return out_stats

def stats_by_learner(data:dict):
    tmp_data = {}
    f_cum = 0
    for v, f in sorted(data.items()):
        f_cum_new = f_cum + f
        tmp_data[v] = (f_cum, f_cum_new) # <- pairs
        f_cum = f_cum_new
    q1_i = math.floor(f_cum * 0.25)
    q2_i = f_cum * 0.5
    q3_i = math.ceil(f_cum * 0.75)
    qs = iter((q1_i, q2_i, q3_i))
    out_stats = []
    q = next(qs)
    for v, (lower_freq, upper_freq) in tmp_data.items():
        if lower_freq <= q < upper_freq:
            out_stats.append(v)
            q = next(qs, None)
            if q is None:
                break
    return out_stats
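For the even case the question asks about, one possible library-free sketch is shown here. It is not one of the timed implementations; the function name `quartiles_from_counts` and the helper `value_at` are hypothetical. It assumes the convention implied by the question's expected results: q1 is the value at 1-based position floor(n/4), q3 the value at position ceil(3n/4), and the median is the average of the two middle values when n is even. It also assumes the keys can be converted to numbers, since the question's example uses string keys.

```python
def quartiles_from_counts(counts):
    """Return q1, median and q3 from a {value: frequency} mapping."""
    # Sort numerically; keys may be strings such as "4" (assumption: all are numeric).
    items = sorted((float(k), f) for k, f in counts.items() if f > 0)
    n = sum(f for _, f in items)
    if n == 0:
        raise ValueError("no data")

    def value_at(pos):
        """Value at 1-based position `pos` of the sorted, expanded data."""
        cum = 0
        for v, f in items:
            cum += f
            if pos <= cum:
                return v
        raise IndexError(pos)

    q1 = value_at(max(1, n // 4))        # floor(n * 0.25), clamped to position 1
    q3 = value_at((3 * n + 3) // 4)      # ceil(n * 0.75) in integer arithmetic
    if n % 2:
        median = value_at((n + 1) // 2)  # single middle value
    else:
        # even n: average the two middle values
        median = (value_at(n // 2) + value_at(n // 2 + 1)) / 2
    return {"q1": q1, "median": median, "q3": q3}


print(quartiles_from_counts({"4": 1, "1": 2, "5": 1}))
# {'q1': 1.0, 'median': 2.5, 'q3': 4.0}
```

Other quartile conventions (for example, interpolating between neighbours) would give different q1 and q3 values for the same data.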
Timing with the following dataset:
from collections import Counter
import random
# test with sample dataset
random.seed(123456) # for sake of "reproducibility"
dataset = Counter([random.randint(1, 100) for _ in range(100)])
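The timing workbench itself is not shown in the answer. A minimal sketch of how such measurements could be taken with `timeit.repeat` is given here; the helper `bench`, the stand-in function `flat_quartiles`, and the `repeat`/`number` settings are assumptions for illustration, not the author's actual setup.

```python
import random
import statistics as st
import timeit
from collections import Counter


def bench(func, data, repeat=5, number=1000):
    """Time func(data): `repeat` measurements of `number` calls each."""
    times = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    # sample standard deviation over the repeated measurements
    return {"times": times, "mean": st.mean(times), "std": st.stdev(times)}


def flat_quartiles(data):
    """Stand-in function to time: quartiles of the flattened counts."""
    return st.quantiles(list(Counter(data).elements()))


random.seed(123456)
dataset = Counter(random.randint(1, 100) for _ in range(100))

report = bench(flat_quartiles, dataset, number=100)
print("times", report["times"])
print("mean", report["mean"])
print("std", report["std"])
```

Any of the functions defined earlier could be passed to `bench` in place of `flat_quartiles`. Absolute timings depend on the machine and on `number`, so only the relative ordering is meaningful; the figures reported next are the author's own.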
Output
check outputs:
stats_by_learner [23, 49, 75]
stats_by_cards [23, 49, 75]
stats_with_stats [23.0, 49.0, 75.0]
quartiles with "stats_by_learner"
times [41.32080510599917, 36.5191725270015, 36.58397209500254, 36.66133224499936, 36.83490775700193]
mean 37.5840379460009
std 2.0922524181898896
quartiles with "stats_by_cards"
times [27.217588879000687, 27.218666459000815, 29.070444919001602, 27.207161409998662, 31.960372033001477]
mean 28.53484674000065
std 2.076736386875994
quartiles with "stats_with_stats"
times [81.97632466700088, 84.27796363499874, 90.61311744499835, 85.13804757300022, 82.74273506200188]
mean 84.94963767640002
std 3.401200405652809
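A note on the definition of quartiles: different implementations (such as the OP's index-based approach) may not agree with one another: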
check outputs (with 50 terms & seed=123456)
stats_by_learner [13, 42, 66]
stats_by_cards [13, 42, 66]
stats_with_stats [12.75, 40.5, 65.25]
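Differences like these come from the quartile convention rather than from a bug. As a small illustration, using the question's sample data rather than the 50-term dataset, `statistics.quantiles` itself supports two interpolation methods that disagree on q3, and neither matches the index-based value of 4 expected in the question:

```python
import statistics as st

data = [1, 1, 4, 5]  # the question's counts {"4": 1, "1": 2, "5": 1}, flattened

exclusive = st.quantiles(data, n=4, method="exclusive")  # the default method
inclusive = st.quantiles(data, n=4, method="inclusive")

print(exclusive)  # [1.0, 2.5, 4.75]
print(inclusive)  # [1.0, 2.5, 4.25]
```

All three conventions agree on the median here, which suggests checking which definition is required before comparing q1 and q3 across implementations.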