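My question involves statistics and Python, and I am a beginner in both. I am running a simulation, and for each value of the independent variable (X) I produce 1000 values of the dependent variable (Y). What I have done is calculate the average of Y for each value of X and fit these averages using scipy.optimize.curve_fit. The curve fits nicely, but I also want to draw the confidence intervals. I am not sure whether what I am doing is correct, or whether what I want can be done at all, but my question is how to get the confidence intervals from the covariance matrix produced by curve_fit. The code first reads the averages from files and then simply calls curve_fit.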

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit


def readTDvsTx(L, B, P, fileformat):
    # L should be '_Fixed_' or '_'
    TD = []
    with open(fileformat.format(L, B, P), 'r') as infile:
        infile.readline()  # skip the header line
        for line in infile:
            l = line.split()  # each line contains TxR followed by CD followed by TD
            # float() replaces eval(): it parses the number without executing file content
            if 70 <= float(l[0]) <= 190:
                TD.append(float(l[2]))
    return np.array(TD)


def rec(x, a, b):
    return a * (1 / (x**2)) + b


fileformat = 'Densities_file{}BS{}_PRNTS{}.txt'
txR = np.array(range(70, 200, 20))
parents = np.array(range(1, 6))
disc_p1 = readTDvsTx('_Fixed_', 5, 1, fileformat)

popt, pcov = curve_fit(rec, txR, disc_p1)

plt.plot(txR, rec(txR, popt[0], popt[1]), 'r-')
plt.plot(txR, disc_p1, '.')
print(popt)
plt.show()

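Here is a quick and wrong answer: you can approximate the errors on your a and b parameters from the covariance matrix as the square root of its diagonal: np.sqrt(np.diagonal(pcov)). The parameter uncertainties can then be used to draw the confidence intervals.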
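
As a rough illustration of that quick approach, the sketch below turns the square roots of the diagonal into approximate 95% intervals for each parameter. It assumes approximately normal errors and uses a Student's t critical value with n - p degrees of freedom; the t-based scaling is an assumption added here, not something curve_fit does for you. The data values are the seven averaged points used in the unweighted example later in this answer.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import t


def rec(x, a, b):
    return a / x**2 + b


x = np.linspace(70, 190, 7)
y = np.array([786.31, 487.27, 341.78, 265.49, 224.76, 208.04, 200.22])

popt, pcov = curve_fit(rec, x, y)
perr = np.sqrt(np.diagonal(pcov))  # 1-sigma standard errors of a and b

# Two-sided 95% interval: popt +/- t_crit * perr, with n - p degrees of freedom
dof = len(x) - len(popt)
tcrit = t.ppf(0.975, dof)
ci_low = popt - tcrit * perr
ci_high = popt + tcrit * perr

for name, value, low, high in zip("ab", popt, ci_low, ci_high):
    print(f"{name} = {value:.4g}  (95% CI: {low:.4g} to {high:.4g})")
```

These are intervals on the parameters themselves; they inherit every limitation of the quick approach.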
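
That answer is wrong because, before you fit your data to a model, you need an estimate of the errors on your averaged disc_p1 points. By averaging, you have lost the information about the scatter of the population, so curve_fit treats the y-points you feed it as absolute and undisputable. This can lead to an underestimation of your parameter errors.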
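
To make the size of the effect concrete, here is a small sketch on synthetic data. The true parameters (4e6, -100), the noise level of 50, and the fixed seed are all illustrative assumptions. It fits the same averaged points twice: once with the means alone, and once with the per-point scatter passed as absolute errors.

```python
import numpy as np
from scipy.optimize import curve_fit


def rec(x, a, b):
    return a / x**2 + b


rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
x = np.linspace(70, 190, 7)

# Hypothetical simulation output: 1000 noisy Y samples for each X value
samples = rec(x, 4e6, -100)[:, None] + rng.normal(0.0, 50.0, size=(x.size, 1000))
y_mean = samples.mean(axis=1)
y_std = samples.std(axis=1, ddof=1)

# Fit 1: means only; curve_fit infers the noise level from the residuals
_, pcov_means = curve_fit(rec, x, y_mean)

# Fit 2: the scatter of each point is passed as an absolute 1-sigma error
_, pcov_scatter = curve_fit(rec, x, y_mean, sigma=y_std, absolute_sigma=True)

perr_means = np.sqrt(np.diagonal(pcov_means))
perr_scatter = np.sqrt(np.diagonal(pcov_scatter))
print("errors from means only:   ", perr_means)
print("errors with point scatter:", perr_scatter)
```

In this setup the means sit very close to the model curve, so the first fit reports much smaller parameter errors than the second.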
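
To estimate the uncertainties of your averaged Y values, you need to measure their dispersion and pass it to curve_fit, while telling it that your errors are absolute. Below is an example of how to do this for a random dataset in which each point consists of 1000 samples drawn from a normal distribution.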

from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
# third-party package: an easy way to properly format parameter errors
from uncertainties import ufloat


# model function
def func(x, a, b):
    return a * (1 / (x**2)) + b


# approximating OP points
n_ypoints = 7
x_data = np.linspace(70, 190, n_ypoints)

# approximating the original scatter in Y-data
# (the noise is drawn with mean x and standard deviation point_errors)
n_nested_points = 1000
point_errors = 50
y_data = [func(x, 4e6, -100) + np.random.normal(x, point_errors,
                                                n_nested_points)
          for x in x_data]

# averages and dispersion of data
y_means = np.array(y_data).mean(axis=1)
y_spread = np.array(y_data).std(axis=1)

best_fit_ab, covar = curve_fit(func, x_data, y_means,
                               sigma=y_spread,
                               absolute_sigma=True)
sigma_ab = np.sqrt(np.diagonal(covar))

a = ufloat(best_fit_ab[0], sigma_ab[0])
b = ufloat(best_fit_ab[1], sigma_ab[1])
text_res = "Best fit parameters:\na = {}\nb = {}".format(a, b)
print(text_res)

# plotting the unaveraged data
flier_kwargs = dict(marker='o', markerfacecolor='silver',
                    markersize=3, alpha=0.7)
line_kwargs = dict(color='k', linewidth=1)
bp = plt.boxplot(y_data, positions=x_data,
                 capprops=line_kwargs,
                 boxprops=line_kwargs,
                 whiskerprops=line_kwargs,
                 medianprops=line_kwargs,
                 flierprops=flier_kwargs,
                 widths=5,
                 manage_ticks=False)

# plotting the averaged data with calculated dispersion
#plt.scatter(x_data, y_means, facecolor='silver', alpha=1)
#plt.errorbar(x_data, y_means, y_spread, fmt='none', ecolor='black')

# plotting the model
hires_x = np.linspace(50, 190, 100)
plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
bound_upper = func(hires_x, *(best_fit_ab + sigma_ab))
bound_lower = func(hires_x, *(best_fit_ab - sigma_ab))

# plotting the confidence intervals
plt.fill_between(hires_x, bound_lower, bound_upper,
                 color='black', alpha=0.15)
plt.text(140, 800, text_res)
plt.xlim(40, 200)
plt.ylim(0, 1000)
plt.show()

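
A note on the choice of `sigma`: `y_spread` above is the standard deviation of the individual samples at each X. If what you want is the uncertainty of the mean itself, a common alternative is the standard error of the mean, std / sqrt(N). Which of the two is appropriate depends on the question you are asking, so treat the following as a sketch under that assumption. The true parameters, noise level, and seed are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit


def func(x, a, b):
    return a / x**2 + b


rng = np.random.default_rng(1)  # fixed seed, for reproducibility only
n_samples = 1000
x_data = np.linspace(70, 190, 7)

# Hypothetical simulation output: n_samples noisy Y values per X value
y_samples = (func(x_data, 4e6, -100)[:, None]
             + rng.normal(0.0, 50.0, size=(x_data.size, n_samples)))

y_means = y_samples.mean(axis=1)
y_std = y_samples.std(axis=1, ddof=1)  # scatter of individual samples
y_sem = y_std / np.sqrt(n_samples)     # standard error of each mean

_, pcov_std = curve_fit(func, x_data, y_means,
                        sigma=y_std, absolute_sigma=True)
_, pcov_sem = curve_fit(func, x_data, y_means,
                        sigma=y_sem, absolute_sigma=True)

perr_std = np.sqrt(np.diagonal(pcov_std))
perr_sem = np.sqrt(np.diagonal(pcov_sem))
print("parameter errors using std:", perr_std)
print("parameter errors using SEM:", perr_sem)
```

With absolute errors, shrinking every sigma by the same factor shrinks the reported parameter errors by that factor too, so the SEM-based errors come out about sqrt(1000) times smaller here.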
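
Edit: If you are not considering the intrinsic errors on the data points, you are probably fine using the "quick and wrong" approach I mentioned before. The square roots of the diagonal entries of the covariance matrix can then be used to calculate your confidence intervals. Note, however, that the confidence intervals have shrunk now that the uncertainties have been dropped: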

from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
# third-party package: an easy way to properly format parameter errors
from uncertainties import ufloat


# model function
def func(x, a, b):
    return a * (1 / (x**2)) + b


n_ypoints = 7
x_data = np.linspace(70, 190, n_ypoints)
y_data = np.array([786.31, 487.27, 341.78, 265.49,
                   224.76, 208.04, 200.22])

best_fit_ab, covar = curve_fit(func, x_data, y_data)
sigma_ab = np.sqrt(np.diagonal(covar))

a = ufloat(best_fit_ab[0], sigma_ab[0])
b = ufloat(best_fit_ab[1], sigma_ab[1])
text_res = "Best fit parameters:\na = {}\nb = {}".format(a, b)
print(text_res)

plt.scatter(x_data, y_data, facecolor='silver',
            edgecolor='k', s=10, alpha=1)

# plotting the model
hires_x = np.linspace(50, 200, 100)
plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
bound_upper = func(hires_x, *(best_fit_ab + sigma_ab))
bound_lower = func(hires_x, *(best_fit_ab - sigma_ab))

# plotting the confidence intervals
plt.fill_between(hires_x, bound_lower, bound_upper,
                 color='black', alpha=0.15)
plt.text(140, 630, text_res)
plt.xlim(60, 200)
plt.ylim(0, 800)
plt.show()

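
The bands above shift both parameters by one sigma at the same time, which ignores the off-diagonal covariance between a and b. A possible refinement is first-order error propagation with the full covariance matrix, where the variance of the fitted curve at each x is approximated by J · pcov · Jᵀ and J holds the partial derivatives of the model with respect to the parameters. This is a minimal sketch for this particular model, using the same seven data points; it yields a 1-sigma band rather than a calibrated confidence level.

```python
import numpy as np
from scipy.optimize import curve_fit


def func(x, a, b):
    return a / x**2 + b


x_data = np.linspace(70, 190, 7)
y_data = np.array([786.31, 487.27, 341.78, 265.49,
                   224.76, 208.04, 200.22])
popt, pcov = curve_fit(func, x_data, y_data)

hires_x = np.linspace(50, 200, 100)

# Analytic Jacobian of the model with respect to (a, b):
# df/da = 1 / x**2, df/db = 1
jac = np.column_stack([1.0 / hires_x**2, np.ones_like(hires_x)])

# Variance of the fitted curve at each x: row_i(J) @ pcov @ row_i(J)
fit_var = np.einsum("ij,jk,ik->i", jac, pcov, jac)
fit_sigma = np.sqrt(fit_var)

best = func(hires_x, *popt)
band_lower = best - fit_sigma
band_upper = best + fit_sigma
print(fit_sigma[:5])
```

`band_lower` and `band_upper` can be passed to `plt.fill_between` in place of `bound_lower` and `bound_upper`. Because the propagated band accounts for the correlation between a and b, it is never wider than the one obtained by shifting both parameters together.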
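
If you are unsure whether to include the absolute errors, or how to estimate them in your case, you would be better off asking for advice at Cross Validated, since Stack Overflow is mainly for discussing implementations of regression methods rather than the underlying statistics.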
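
Here are links to some Jupyter Notebooks and Python scripts I wrote that show how to use the best-fit parameters and the covariance matrix returned by scipy.optimize.curve_fit or lmfit to calculate confidence intervals and prediction intervals using the delta method: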