I have data, only as discrete data points, and I know how the data is distributed, namely:
y = w * gamma.pdf(x, alpha1, scale=scale1) + (1-w) * gamma.pdf(x, alpha2, scale=scale2)
How can I accurately infer these five parameters? My code is as follows:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import gaussian_kde, gamma

data1 = np.random.gamma(4, 1, 200)
data2 = np.random.gamma(6, 2, 200)
data = np.concatenate((data1, data2))
x = np.linspace(0, np.max(data) + 2, 800)
y = gaussian_kde(data)(x)

def two_gamma(x, w, alpha1, scale1, alpha2, scale2):
    return w * gamma.pdf(x, alpha1, scale=scale1) + (1 - w) * gamma.pdf(x, alpha2, scale=scale2)

initial_params = [0.5, 2, 1, 2, 1]
params, params_covariance = curve_fit(two_gamma, x, y, p0=initial_params, maxfev=50000)
w, alpha1, scale1, alpha2, scale2 = params
plt.figure(figsize=(10, 6))
sns.histplot(data, bins=20, kde=False, color='y', label='Data density', alpha=0.5, stat='probability')
plt.plot(x, y, marker='o', linestyle='', markersize=1, label='Data distribution')
y_fit = w * gamma.pdf(x, alpha1, scale=scale1) + (1 - w) * gamma.pdf(x, alpha2, scale=scale2)
plt.plot(x, y_fit, 'r-', linewidth=1, alpha=0.7,label='Mixture gamma distribution')
plt.legend(fontsize=8, loc='upper right')
plt.title("Expression distribution of gamma mixture")
plt.xlabel("Expression")
Fitting with curve_fit does not produce the bimodal form; I would like to approximate the parameters of the function precisely. Using neural networks or other methods is acceptable.
Let's write your model as follows:
import numpy as np
import pandas as pd
from scipy import stats, optimize
import matplotlib.pyplot as plt
def model(p, data):
    return p[0] * stats.gamma.pdf(data, p[1], scale=p[2]) + (1 - p[0]) * stats.gamma.pdf(data, p[3], scale=p[4])
This is itself a PDF, since it is a weighted sum of PDFs whose weights sum to unity.

For a given set of parameters:
p = [0.5, 4, 1, 6, 2]
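As a quick sanity check (this snippet is my addition, not part of the original workflow), we can numerically confirm that the mixture carries unit mass:

from scipy import integrate
area, _ = integrate.quad(lambda x: model(p, x), 0, np.inf)
# area ~ 1.0: the weighted sum is itself a normalized PDF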
假设我们从这个分布中随机采样(简单的实现,大量的计算):
class BiGamma(stats.rv_continuous):
def _pdf(self, x, w, a1, s1, a2, s2):
return model([w, a1, s1, a2, s2], x)
bigamma = BiGamma(shapes='w, a1, s1, a2, s2')
law = bigamma(*p)
data = law.rvs(3000)
pd.DataFrame({"x": data}).to_csv("data.csv", index=False)
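If sampling speed matters, a common shortcut (a sketch of my own, equivalent in distribution to the rv_continuous approach) is to draw a Bernoulli(w) component label for each point and sample the selected gamma directly:

rng = np.random.default_rng(0)
n = 3000
labels = rng.random(n) < p[0]                    # True -> first component
data_fast = np.where(labels,
                     rng.gamma(p[1], p[2], n),   # shape a1, scale s1
                     rng.gamma(p[3], p[4], n))   # shape a2, scale s2

This draws n values from each component and keeps one per point, which wastes some draws but stays vectorized and fast.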
Or, for the sake of convenience, use this sample:
data = pd.read_csv("https://pastebin.com/raw/1JpzfT2E")["x"].values
If the sampled data are IID, we can perform MLE to regress the parameters:

def likelihood(p, data):
    # Negative log-likelihood; the log base (base 10 here) only rescales the
    # objective and does not move the minimum.
    return - np.sum(np.log10(model(p, data)))
p0 = [1, 5, 1, 5, 1]
sol = optimize.minimize(likelihood, x0=p0, args=(data,), bounds=[(0, 1)] + [(0, np.inf)] * 4)
# message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
# success: True
# status: 0
# fun: 3783.3002631839786
# x: [ 5.316e-01 5.375e+00 2.137e+00 4.123e+00 9.111e-01]
# nit: 31
# jac: [-3.060e-02 4.502e-03 1.546e-02 -1.396e-02 -4.925e-02]
# nfev: 192
# njev: 32
# hess_inv: <5x5 LbfgsInvHessProduct with dtype=float64>
Note that we need to choose the initial values carefully and provide bounds to make the algorithm converge.
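When no single educated guess is available, one mitigation (my addition, not part of the original recipe) is a simple multi-start: restart the optimizer from several random initial points and keep the best local minimum:

rng = np.random.default_rng(42)
best = None
for _ in range(10):
    guess = [rng.uniform(0.2, 0.8), *rng.uniform(0.5, 8.0, size=4)]
    res = optimize.minimize(likelihood, x0=guess, args=(data,),
                            bounds=[(0, 1)] + [(1e-6, np.inf)] * 4)
    if res.success and (best is None or res.fun < best.fun):
        best = res  # the restart with the lowest negative log-likelihood wins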
If we only have binned data (a histogram), then we can perform NLLS directly on the binned data instead of on the sampled data:
freq, bins = np.histogram(data, density=True, bins=35)
centers = (bins[:-1] + bins[1:]) * 0.5
def model_proxy(x, w, a1, s1, a2, s2):
    return model([w, a1, s1, a2, s2], x)
popt, pcov = optimize.curve_fit(model_proxy, centers, freq, bounds=[[0] * 5, [1] + [np.inf] * 4])
# (array([0.51763272, 5.44778038, 2.14335038, 3.82036594, 1.00628071]),
# array([[ 0.00066593, -0.01337379, 0.00424389, 0.00256379, -0.00140308],
# [-0.01337379, 0.29725577, -0.09720244, -0.04583584, 0.0263838 ],
# [ 0.00424389, -0.09720244, 0.03236382, 0.0140486 , -0.00816902],
# [ 0.00256379, -0.04583584, 0.0140486 , 0.01511191, -0.00692502],
# [-0.00140308, 0.0263838 , -0.00816902, -0.00692502, 0.00344378]]))
As with MLE, NLLS requires educated guesses and bounds to make the algorithm converge.
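One way to build such an educated guess from the data itself (a heuristic sketch of mine, not from the original answer) is to split the sample at its median and match gamma moments, mean = a*s and var = a*s**2, on each half:

def gamma_moments(x):
    m, v = x.mean(), x.var()
    return m ** 2 / v, v / m          # method-of-moments shape and scale

lower, upper = data[data <= np.median(data)], data[data > np.median(data)]
(a1, s1), (a2, s2) = gamma_moments(lower), gamma_moments(upper)
p0_guess = [0.5, a1, s1, a2, s2]
popt2, _ = optimize.curve_fit(model_proxy, centers, freq, p0=p0_guess,
                              bounds=[[0] * 5, [1] + [np.inf] * 4])

The halves are truncated at the median, so these moments are biased, but they usually land close enough for the optimizer to take over.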
Finally, we plot the solutions against the data:
xlin = np.linspace(data.min(), data.max(), 200)
fig, axe = plt.subplots()
axe.hist(data, density=True, bins=35, alpha=0.75, label="Raw Data")
axe.plot(xlin, model(p, xlin), label="Model")
axe.plot(xlin, model(sol.x, xlin), label="MLE Fit")
axe.scatter(centers, freq, marker='.', label="Binned Data")
axe.plot(xlin, model_proxy(xlin, *popt), label="NLLS Fit")
axe.legend()
axe.grid()
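To quantify how well either fit matches the sample (an optional check I'm adding), a Kolmogorov-Smirnov test against the fitted mixture CDF can be used:

def mixture_cdf(x, w, a1, s1, a2, s2):
    return w * stats.gamma.cdf(x, a1, scale=s1) + (1 - w) * stats.gamma.cdf(x, a2, scale=s2)

print(stats.kstest(data, lambda x: mixture_cdf(x, *sol.x)))
# Caveat: the parameters were estimated from this same sample, so the p-value is optimistic.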