Plotting the CDF of a pandas Series in Python

Question

Is there a way to do this? I cannot find an easy way to go from a pandas Series to a plot of its CDF (cumulative distribution function).

Answer 1

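I believe the functionality you are looking for is in the hist method of the Series object, which wraps matplotlib's hist() function.

Here is the relevant documentation (quoted from an older matplotlib release, in which the density parameter was still called normed):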

In [10]: import matplotlib.pyplot as plt

In [11]: plt.hist?
...
Plot a histogram.

Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.
...
cumulative : boolean, optional, default : False
    If `True`, then a histogram is computed where each bin gives the
    counts in that bin plus all bins for smaller values. The last bin
    gives the total number of datapoints.  If `normed` is also `True`
    then the histogram is normalized such that the last bin equals 1.
    If `cumulative` evaluates to less than 0 (e.g., -1), the direction
    of accumulation is reversed.  In this case, if `normed` is also
    `True`, then the histogram is normalized such that the first bin
    equals 1.

...

For example:

In [12]: import pandas as pd

In [13]: import numpy as np

In [14]: ser = pd.Series(np.random.normal(size=1000))

In [15]: ser.hist(cumulative=True, density=1, bins=100)
Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>

In [16]: plt.show()
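
To see what cumulative=True combined with density does numerically, the same cumulative, normalized histogram can be reproduced with np.histogram. This is a minimal sketch rather than part of the original answer; the seed and the variable names are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
ser = pd.Series(rng.normal(size=1000))

# Per-bin counts over 100 equal-width bins, as in ser.hist(bins=100)
counts, edges = np.histogram(ser, bins=100)

# Running total divided by the sample size: the height of each cumulative bar
cum_frac = np.cumsum(counts) / counts.sum()

print(cum_frac[-1])  # the last bin holds the whole sample
```

The number of bins still affects how coarse the resulting staircase looks, which is the main drawback of histogram-based CDF plots.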

Answer 2

In case you are also interested in the values, not just the plot.

import pandas as pd
import numpy as np  # needed for the continuous examples further down

# If you are in jupyter
%matplotlib inline

This always works (discrete and continuous distributions)

# Define your series
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
df = pd.DataFrame(s)

# Get the frequency, PDF and CDF for each value in the series

# Frequency
stats_df = df \
.groupby('value') \
['value'] \
.agg('count') \
.pipe(pd.DataFrame) \
.rename(columns = {'value': 'frequency'})

# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])

# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
stats_df

# Plot the discrete Probability Mass Function and CDF.
# Technically, the 'pdf' label in the legend and in the table should be 'pmf'
# (Probability Mass Function), since the distribution is discrete.

# If you don't have too many values / usually discrete case
stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)

Alternative example, with a sample drawn from a continuous distribution, or when you have many distinct values:

# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')

# ... all the same calculation stuff to get the frequency, PDF, CDF

# Plot
stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
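
The comment "# ... all the same calculation stuff" stands for the frequency, PDF and CDF steps shown earlier. One way to avoid repeating them is to wrap them in a helper. In the sketch below, freq_pdf_cdf is a hypothetical name, and value_counts() is used in place of groupby() + agg('count'), which should produce the same table.

```python
import pandas as pd

def freq_pdf_cdf(s):
    """Return one row per distinct value with its frequency, pdf and cdf."""
    # value_counts() counts occurrences; sort_index() orders the rows by value
    freq = s.value_counts().sort_index()
    stats_df = pd.DataFrame({
        'value': freq.index,
        'frequency': freq.to_numpy(),
    })
    stats_df['pdf'] = stats_df['frequency'] / stats_df['frequency'].sum()
    stats_df['cdf'] = stats_df['pdf'].cumsum()
    return stats_df

s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name='value')
stats_df = freq_pdf_cdf(s)
print(stats_df)
```

The returned table can be passed to the same stats_df.plot(...) or stats_df.plot.bar(...) calls as above.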

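Continuous distributions only

Please note: if it is very reasonable to assume that each value occurs only once in the sample (typically the case for continuous distributions), then the groupby() + agg('count') step is unnecessary (since the count is always 1).

In this case, a percent rank can be used to get to the cdf directly.

Use your best judgment when taking this kind of shortcut! :)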

# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
df = pd.DataFrame(s)

# Get to the CDF directly
df['cdf'] = df.rank(method = 'average', pct = True)

# Sort and plot
df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
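
When ties are possible, the rank shortcut can still be used if the rank method is chosen with care. As a sketch (not from the original answer): method='max' gives every tied value the rank of the last member of its group, so the percent rank equals the fraction of observations less than or equal to that value, which is the empirical CDF. With method='average', tied values receive a smaller number.

```python
import pandas as pd

s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name='value')

# Fraction of observations <= each value, via percent rank
cdf_rank = s.rank(method='max', pct=True)

# The same quantity computed directly from the definition
cdf_direct = s.apply(lambda x: (s <= x).mean())

# For comparison: the 'average' method used in the shortcut above
cdf_avg = s.rank(method='average', pct=True)

print((cdf_rank - cdf_direct).abs().max())
```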

Answer 3

I came here looking for a plot that has both histogram bars and a CDF line.

It can be done like this:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
series = pd.Series(np.random.normal(size=10000))
fig, ax = plt.subplots()
ax2 = ax.twinx()
n, bins, patches = ax.hist(series, bins=100, density=False)
n, bins, patches = ax2.hist(
    series, cumulative=1, histtype='step', bins=100, color='tab:orange')
plt.savefig('test.png')

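If you want to remove the vertical line at the right edge of the step curve, there is a separate explanation of how to do that. Or you can simply do: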

ax.set_xlim((ax.get_xlim()[0], series.max()))
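
Another way to avoid the vertical drop, sketched here as an assumption rather than taken from the explanation referred to above, is to compute the cumulative counts yourself and draw them as an ordinary line on the second axis instead of using histtype='step'.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

series = pd.Series(np.random.default_rng(0).normal(size=10000))

fig, ax = plt.subplots()
ax2 = ax.twinx()

# Bars: plain counts per bin on the left axis
counts, edges, patches = ax.hist(series, bins=100)

# Line: cumulative fraction, plotted at the right edge of each bin
cdf = np.cumsum(counts) / counts.sum()
ax2.plot(edges[1:], cdf, color='tab:orange')
ax2.set_ylim(0, 1.05)

# fig.savefig('hist_with_cdf.png')  # or plt.show() in an interactive session
```

Because the line is a plain plot, it simply ends at the last bin edge and never drops back to zero.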

I also saw an elegant solution elsewhere that shows how to do this with seaborn.

Answer 4

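A CDF (cumulative distribution function) plot is basically a graph with the sorted values on the X axis and the cumulative distribution on the Y axis. So I would create a new series with the sorted values as the index and the cumulative distribution as the values.

First, create an example series: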

import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))

Sort the series:

ser = ser.sort_values()

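Now, before proceeding, append the last (and largest) value again. This step is especially important for small sample sizes in order to get an unbiased CDF: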

ser[len(ser)] = ser.iloc[-1]

Create a new series with the sorted values as the index and the cumulative distribution as the values:

cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)

Finally, plot the function as steps:

ser_cdf.plot(drawstyle='steps')
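
The steps above can be collected into one function. In this sketch, cdf_series is a hypothetical name; it uses np.append instead of ser[len(ser)] = ..., on the assumption that the input may have an index in which the label len(ser) already exists and would otherwise be overwritten.

```python
import numpy as np
import pandas as pd

def cdf_series(ser):
    """Sorted values as the index, cumulative fraction as the values."""
    values = np.sort(ser.to_numpy())
    # Repeat the largest value, as in the answer, without touching ser's index
    values = np.append(values, values[-1])
    cum_dist = np.linspace(0., 1., len(values))
    return pd.Series(cum_dist, index=values)

ser = pd.Series(np.random.default_rng(0).normal(size=100))
ser_cdf = cdf_series(ser)
print(ser_cdf.head())
# ser_cdf.plot(drawstyle='steps')  # same plotting call as above
```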

Answer 5

This is the easiest way.

import pandas as pd
df = pd.Series([i for i in range(100)])
df.hist( cumulative = True )

[Image: cumulative histogram]

Answer 6

I found another solution in "pure" pandas that does not require specifying the number of bins to use in a histogram:

import pandas as pd
import numpy as np # used only to create example data

series = pd.Series(np.random.normal(size=10000))

cdf = series.value_counts().sort_index().cumsum()
cdf.plot()
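
Note that the cumulative sum of value_counts() is a running count, so the curve tops out at the number of observations rather than at 1. If a CDF on a 0 to 1 scale is wanted, one option (a small sketch, not part of the original answer) is the normalize=True argument of value_counts:

```python
import numpy as np
import pandas as pd

series = pd.Series(np.random.default_rng(0).normal(size=10000))

# Relative frequencies instead of counts, ordered by value, then accumulated
cdf = series.value_counts(normalize=True).sort_index().cumsum()

print(cdf.iloc[-1])  # approximately 1, up to floating-point rounding
# cdf.plot()
```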

Answer 7

An update to @wroscoe's answer:

df[your_column].plot(kind = 'hist', histtype = 'step', density = True, cumulative = True)

You can also pass the desired number of bins.
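
For instance, with a made-up DataFrame df and a made-up column name 'x' (both are assumptions for illustration only), the number of bins goes in through the bins argument:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.default_rng(0).normal(size=1000)})

# Cumulative, normalized step histogram with 50 bins
ax = df['x'].plot(kind='hist', histtype='step', density=True,
                  cumulative=True, bins=50)
ax.set_ylabel('cumulative probability')
```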

Answer 8

To me, this seemed like a simple way to do it:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

heights = pd.Series(np.random.normal(size=100))

# empirical CDF
def F(x,data):
    return float(len(data[data <= x]))/len(data)

vF = np.vectorize(F, excluded=['data'])

plt.plot(np.sort(heights),vF(x=np.sort(heights), data=heights))
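
np.vectorize calls F once per point, and each call scans the whole sample, so the cost grows quadratically with the sample size. A common alternative, sketched here under the same definition F(x) = (number of observations <= x) / n, is np.searchsorted on the sorted data. The name ecdf_at is hypothetical.

```python
import numpy as np
import pandas as pd

heights = pd.Series(np.random.default_rng(0).normal(size=100))

sorted_vals = np.sort(heights.to_numpy())

def ecdf_at(x):
    """Fraction of observations <= x, for a scalar or an array x."""
    # side='right' returns the number of sorted elements that are <= x
    return np.searchsorted(sorted_vals, x, side='right') / len(sorted_vals)

y = ecdf_at(sorted_vals)
print(y[:3], y[-1])
# plt.plot(sorted_vals, y) draws the same kind of curve as above
```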

Answer 9

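If you want to plot a "true" empirical CDF, one that jumps exactly at the values of your data set a, with the jump at each value proportional to the frequency of that value, NumPy has built-in functions to do the work:

import matplotlib.pyplot as plt
import numpy as np

def ecdf(a):
    # Distinct values in sorted order, and how often each one occurs
    x, counts = np.unique(a, return_counts=True)
    y = np.cumsum(counts)
    # Prepend a point at height 0 so the curve jumps at the smallest value
    x = np.insert(x, 0, x[0])
    y = np.insert(y/y[-1], 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')

The call to unique() returns the data values in sorted order along with their corresponding frequencies. The option drawstyle='steps-post' in the plot() call ensures that the jumps occur where they should. To force a jump at the smallest data value, the code inserts an additional element in front of x and y.

Example usage:

xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
ecdf(xvec)

Another usage:

import pandas as pd
df = pd.DataFrame({'x':[7,1,2,2,7,4,4,4,5.5,7]})
ecdf(df['x'])

Output: a step plot of the empirical CDF, saved as ecdf.png.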
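
If the numbers are wanted in addition to the picture, the same np.unique and np.cumsum steps can return the points instead of plotting them. In this sketch, ecdf_values is a hypothetical helper name, not a NumPy function.

```python
import numpy as np

def ecdf_values(a):
    """Return the jump locations x and the ECDF height y reached at each one."""
    x, counts = np.unique(a, return_counts=True)
    y = np.cumsum(counts) / counts.sum()
    return x, y

x, y = ecdf_values(np.array([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7]))
for xi, yi in zip(x, y):
    print(xi, yi)
```

For this sample the jumps are at 1, 2, 4, 5.5 and 7, and the heights are proportional to the counts 1, 2, 3, 1 and 3.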

Answer 10

I found Raphvanns's answer helpful, because it not only produces the plot but also helped me understand what the pdf, cdf, and ccdf are. I have two things to add to Raphvanns's solution: (1) use collections.Counter wisely to make the process easier; (2) remember to sort value (ascending) before calculating the pdf, cdf, and ccdf.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

Generate random numbers:

s = pd.Series(np.random.randint(1000, size=(1000)))

Build a dataframe as suggested by Raphvanns:

dic = dict(Counter(s))
# One row per distinct value with its count, sorted by value
df = pd.DataFrame(sorted(dic.items()), columns = ['value', 'frequency'])

Calculate the PDF, CDF, and CCDF:

df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']

Plot:

df.plot(x = 'value', y = ['cdf', 'ccdf'], grid = True)

You may wonder why value has to be sorted before calculating the PDF, CDF, and CCDF. Let's see what the result looks like without sorting (the items were sorted by value above; below the order is randomized):

dic = dict(Counter(s))
df = pd.DataFrame(sorted(dic.items()), columns = ['value', 'frequency'])

# randomize the order of `value` (frac=1 shuffles every row):
df = df.sample(frac=1)

df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']

df.plot(x = 'value', y = ['cdf'], grid = True)

In the resulting plot the CDF curve is a mess.

Why does this happen? The essence of the CDF is "the number of data points we have seen so far", quoting YY's lecture slides from his data visualization class. Therefore, if value is not sorted (either ascending or descending is fine), then when you plot with the x axis in ascending order, the y values will of course be a mess.

If you sort in descending order, you can imagine that the CDF and CCDF will simply swap places.

I will leave a question for the readers of this post: if I randomize the order of value as above, will sorting value after (rather than before) calculating the PDF, CDF, and CCDF solve the problem?

dic = dict(Counter(s))
df = pd.DataFrame(sorted(dic.items()), columns = ['value', 'frequency'])

# randomize the order of `value` (frac=1 shuffles every row):
df = df.sample(frac=1)

df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']

# Will this solve the problem?
df = df.sort_values(by='value')
df.plot(x = 'value', y = ['cdf'], grid = True)
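
A small deterministic check can help with the question above. The sketch below uses a hand-made three-row table (an assumption for illustration, not the random data from the answer) and compares sorting after the cumulative sum with sorting before it. The names late and early are made up for this example.

```python
import pandas as pd

# Rows deliberately out of order by `value`
df = pd.DataFrame({'value': [3, 1, 2], 'frequency': [2, 1, 1]})
df['pdf'] = df['frequency'] / df['frequency'].sum()

# Cumulative sum taken in the unsorted row order, sorted only afterwards
late = df.assign(cdf=df['pdf'].cumsum()).sort_values('value')

# Sorted first, cumulative sum taken afterwards
early = df.sort_values('value')
early = early.assign(cdf=early['pdf'].cumsum())

print(late[['value', 'cdf']])
print(early[['value', 'cdf']])
```

In the first table the cdf column does not increase with value, because cumsum() accumulates in row order; sorting afterwards only rearranges numbers that were already accumulated in the wrong order.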
