How do I find quantiles from frequency data?

Question

Suppose I have a data table of what customers have bought, something like this:

Customer|Price|Quantity Sold  
a       | 200 |   3.3  
b       | 120 |   4.1  
c       | 040 |   12.0  
d       | 030 |   16.76

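This is meant as a rough representation of the table, which holds the customer, price, and quantity sold for a single product.

I want to work out how to compute the median purchase price from this information.

I am a little confused about the methodology. I understand that getting a quantile in pandas is as easy as data[row].quantile(x)

But since each row really represents more than one observation, I am not sure how to get the quantile.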

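Edit: On top of that, the main problem is that the quantity sold is not discrete. It is a continuous variable (think meters, kg, and so on), so creating more rows is not an option.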

Answer 1

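For a set of discrete values, the median is found by sorting and taking the middle value. However, since your Quantity values are continuous, it looks like you are really after the median of a probability distribution in which Price is distributed with relative frequencies given by Quantity. Sorting the data by Price and taking the cumulative Quantity gives a graphical representation of your problem, with cumulative Quantity on the x axis and Price on the y axis.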

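From that graph you can see that the median is 40 (the y value at the midpoint of the x axis). This is to be expected, since the quantities sold at the two lowest prices are very large. The median can be computed from your dataframe as follows: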

df = df.sort_values('Price')
cumul = df['Quantity Sold'].cumsum()
# Get the row index where the cumulative quantity reaches half the total.
total = df['Quantity Sold'].sum()
index = sum(cumul < 0.5 * total)
# Get the price at that index
result = df['Price'].iloc[index]

Any other quantile of the same data can be computed by using a different fraction of the total in place of 0.5.
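As a rough sketch of that generalisation, the same steps can be wrapped in a small helper that takes the fraction as a parameter. The function name `weighted_quantile` and its arguments are made up for illustration and are not part of pandas; the column names follow the table in the question.

```python
import pandas as pd

def weighted_quantile(df, q, value_col="Price", weight_col="Quantity Sold"):
    """Return the lowest value whose cumulative weight reaches fraction q of the total.

    Hypothetical helper, not a pandas API.
    """
    ordered = df.sort_values(value_col)
    cumul = ordered[weight_col].cumsum()
    threshold = q * ordered[weight_col].sum()
    # Count the rows whose cumulative weight is still below the threshold.
    position = int((cumul < threshold).sum())
    # Guard against floating-point rounding pushing the position past the last row.
    position = min(position, len(ordered) - 1)
    return ordered[value_col].iloc[position]

df = pd.DataFrame({
    "Customer": ["a", "b", "c", "d"],
    "Price": [200, 120, 40, 30],
    "Quantity Sold": [3.3, 4.1, 12.0, 16.76],
})

quartiles = {q: weighted_quantile(df, q) for q in (0.25, 0.5, 0.75)}
```

With the data from the question this gives 30 for the first quartile and 40 for both the median and the third quartile. No interpolation between prices is done here; the result is always one of the observed prices.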

Answer 2

You can loop over the quantities sold and add each item to one big list_of_all_sold (there are other ways to do this; this is just one example):

c = ['a', 'b', 'c']
p = [200, 120, 40]
qs = [3,4,12]

list_of_all_sold = []
for i in range(len(qs)):
    # Append the price once per unit sold at that price
    for x in range(qs[i]):
        list_of_all_sold.append(p[i])

Then, Python 3.4+ has a statistics module that can be used to find the median:

from statistics import median
median(list_of_all_sold)
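One of the "other ways" could look like the sketch below, which uses `numpy.repeat` to do the expansion in a single call. Like the loop above, it only works when the quantities are whole numbers; the variable names are illustrative.

```python
import numpy as np

prices = [200, 120, 40]
quantities = [3, 4, 12]  # must be non-negative integers for np.repeat

# Repeat each price once per unit sold: 200 three times, 120 four times, 40 twelve times.
expanded = np.repeat(prices, quantities)
result = float(np.median(expanded))
```

For these three rows the expanded array has 19 entries and the median is 40.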

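Edit, to find the median for continuous quantities:

You can build a pandas dataframe, sort it by price, take half of the total quantity as the midpoint, and then subtract the quantity sold at each price point in the sorted dataframe, row by row, until you reach the midpoint. Something like this: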

c = ['a', 'b', 'c', 'd']
p = [200, 120, 40, 30]
qs = [3.3, 4.1, 12.0, 16.76]
# Create a pandas dataframe
import pandas as pd
df = pd.DataFrame({'price' : p, 'qs' : qs}, index = c)
# Find the index of the median number
median_num_idx = sum(qs) / 2
# Go down dataframe sorted by price
for index, row in df.sort_values('price').iterrows():
    # Subtract the quantity sold at that price point from the median number index
    median_num_idx = median_num_idx - row['qs']
    # Check if you have reached the median index point
    if median_num_idx <= 0:
        print(row['price'])
        break
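If the row-by-row loop is too slow on a large table, the same search can be sketched without `iterrows` by running `numpy.searchsorted` on the cumulative quantities. This is an alternative formulation under the same assumptions as the loop above, not something the loop requires.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [200, 120, 40, 30],
                   'qs': [3.3, 4.1, 12.0, 16.76]},
                  index=['a', 'b', 'c', 'd'])

ordered = df.sort_values('price')
cumul = ordered['qs'].cumsum().to_numpy()
half = cumul[-1] / 2
# side='left' gives the first position whose cumulative quantity is >= half,
# which is the row where the loop's running value first drops to 0 or below.
pos = int(np.searchsorted(cumul, half, side='left'))
median_price = ordered['price'].iloc[pos]
```

For the data in the question, `pos` is 1 and `median_price` is 40, matching the loop.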
