I have a dataframe with the independent variable in the column headers, and each row is a separate set of dependent-variable values:
   5.032530  6.972868  8.888268  10.732009  12.879130  16.877655
0  2.512298  2.132748  1.890665   1.583538   1.582968   1.440091
1  5.628667  4.206962  4.179009   3.162677   3.132448   1.887631
2  3.177090  2.274014  2.412432   2.066641   1.845065   1.574748
3  5.060260  3.793109  3.129861   2.617136   2.703114   1.921615
4  4.153010  3.354411  2.706463   2.570981   2.020634   1.646298
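I want to fit a curve of the form Y = A * x ^ B to each row. I need to solve for A and B for about 5000 rows, with 6 data points per row. I was able to do this with DataFrame.apply, but it takes about 40 seconds. Can I speed it up with Cython, or by vectorizing it somehow? I need precision to about 4 decimal places.
Here is what I have: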
import pandas as pd
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:\File.csv')

# fit y = a * x**b to one row and return only a
def curvefita(y):
    return curve_fit(lambda x,a,b: a*np.power(x,b), df.iloc[:,3:].columns, y, p0=[8.4,-.58], bounds=([0,-10],[200,10]), maxfev=2000)[0][0]

# run the same fit again and return only b
def curvefitb(y):
    return curve_fit(lambda x,a,b: a*np.power(x,b), df.iloc[:,3:].columns, y, p0=[8.4,-.58], bounds=([0,-10],[200,10]), maxfev=2000)[0][1]

avalues = df.iloc[:,3:].apply(curvefita, axis=1)
bvalues = df.iloc[:,3:].apply(curvefitb, axis=1)
df['a'] = avalues
df['b'] = bvalues
colcount = len(df.columns)

#build power fit - make the matrix
powerfit = df.copy()
for column in range(colcount-2):
    powerfit.iloc[:,column] = powerfit.iloc[:,colcount-2] * (powerfit.columns[column]**powerfit.iloc[:,colcount-1])

#graph an example
plt.plot(powerfit.iloc[0,:colcount-2],'r')
plt.plot(df.iloc[0,:colcount-2],'ro')

#another example looked up by ticker
plt.plot(powerfit.iloc[5,:colcount-2],'b')
plt.plot(df.iloc[5,:colcount-2],'bo')
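You are actually running curve_fit twice per row, once for a and once for b. Fit both parameters in a single call instead, which should roughly halve the execution time: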
def func(x, a, b):
    return a * np.power(x, b)

def curvefit(y):
    return tuple(curve_fit(func, df.iloc[:,3:].columns, y, p0=[8.4, -.58], bounds=([0, -10], [200, 10]))[0])

df[["a", "b"]] = df.iloc[:,3:].apply(curvefit, axis=1).apply(pd.Series)
print(df)
#     5.03253  6.972868  8.888268  10.732009  12.87913  16.877655          a  \
# 0  2.512298  2.132748  1.890665   1.583538  1.582968   1.440091   2.677070
# 1  5.628667  4.206962  4.179009   3.162677  3.132448   1.887631  39.878792
# 2  3.177090  2.274014  2.412432   2.066641  1.845065   1.574748   8.589886
# 3  5.060260  3.793109  3.129861   2.617136  2.703114   1.921615  13.078827
# 4  4.153010  3.354411  2.706463   2.570981  2.020634   1.646298  27.715207
#
#           b
# 0 -0.215338
# 1 -1.044384
# 2 -0.600827
# 3 -0.656381
# 4 -1.008753
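For anyone who wants to try the single-call idea without the CSV file, here is a minimal self-contained sketch. It assumes the x values are the six column headers from the question and uses only the first two sample rows; the names power_law, fit_row and params are made up for this illustration. It fits all six columns, so the numbers need not match the printout above.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

# x values (column headers) and the first two rows from the question
x = np.array([5.032530, 6.972868, 8.888268, 10.732009, 12.879130, 16.877655])
rows = [
    [2.512298, 2.132748, 1.890665, 1.583538, 1.582968, 1.440091],
    [5.628667, 4.206962, 4.179009, 3.162677, 3.132448, 1.887631],
]
df = pd.DataFrame(rows, columns=x)

def power_law(x, a, b):
    return a * np.power(x, b)

def fit_row(y):
    # A single curve_fit call yields both parameters for this row
    popt, _ = curve_fit(power_law, x, y.to_numpy(), p0=[8.4, -0.58],
                        bounds=([0, -10], [200, 10]))
    return pd.Series(popt, index=["a", "b"])

# Returning a Series makes apply build a DataFrame with columns "a" and "b"
params = df.apply(fit_row, axis=1)

# Evaluate the fitted curves at the original x values, one row per fit
fitted = params["a"].to_numpy()[:, None] * x ** params["b"].to_numpy()[:, None]
print(params)
```

Returning a pd.Series from the row function is one way to avoid the extra .apply(pd.Series) step; whether it is faster on 5000 rows is something to measure rather than assume.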
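To make this more reusable, I would have curvefit also take the x values and the model function as arguments, and bind them up front with functools.partial: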
from functools import partial

def curvefit(func, x, y):
    return tuple(curve_fit(func, x, y, p0=[8.4, -.58], bounds=([0, -10], [200, 10]))[0])

fit = partial(curvefit, func, df.iloc[:,3:].columns)
df[["a", "b"]] = df.iloc[:,3:].apply(fit, axis=1).apply(pd.Series)
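The same pattern can be pushed one step further by turning the initial guess and the bounds into parameters as well. The sketch below is only an illustration of that idea: fit_params and power_law are hypothetical names, and the data is synthetic and noise-free (y = 3 * x^-0.5), so the recovered parameters can be checked against known values.

```python
from functools import partial

import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b):
    return a * np.power(x, b)

def fit_params(func, x, y, p0, bounds):
    # Generic wrapper: model, x values, start point and bounds are all arguments
    popt, _ = curve_fit(func, x, y, p0=p0, bounds=bounds)
    return tuple(popt)

x = np.array([5.032530, 6.972868, 8.888268, 10.732009, 12.879130, 16.877655])
y = 3.0 * x ** -0.5  # synthetic data with known a = 3.0, b = -0.5

# Bind everything except y, leaving a one-argument function suitable for apply
fit = partial(fit_params, power_law, x,
              p0=[8.4, -0.58], bounds=([0, -10], [200, 10]))

a, b = fit(y)
print(a, b)
```

partial fills the leading positional arguments (the model and x) and the keyword arguments, so the resulting fit only needs the row's y values.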
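Following @Brenlla's suggestion, I was able to get the run time down to 550 ms. Taking logs of both sides turns y = a * x^b into log(y) = log(a) + b * log(x), so each row becomes a straight-line fit that np.polyfit can solve directly. This code uses the unweighted/biased formula, similar to Excel's, which is good enough for my purposes (@kennytm discusses it here):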
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:\File.csv')

# work in log-log space, where the power law is a straight line
df2 = np.log(df)
df3 = df2.iloc[:,3:].copy()
df3.columns = np.log(df3.columns)

def curvefit(y):
    # polyfit returns (slope, intercept), i.e. (b, log(a))
    return tuple(np.polyfit(df3.columns, y, 1))

df[["b", "a"]] = df3.apply(curvefit, axis=1).apply(pd.Series)
df['a'] = np.exp(df['a'])
colcount = len(df.columns)

# b is the second-to-last column and a is the last one
powerfit = df.copy()
for column in range(colcount-2):
    powerfit.iloc[:,column] = powerfit.iloc[:,colcount-1] * (powerfit.columns[column]**powerfit.iloc[:,colcount-2])

#graph an example
plt.plot(powerfit.iloc[0,:colcount-2],'r')
plt.plot(df.iloc[0,:colcount-2],'ro')

#another example looked up by ticker
plt.plot(powerfit.iloc[5,:colcount-2],'b')
plt.plot(df.iloc[5,:colcount-2],'bo')
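Since every row shares the same x values, the per-row apply can probably be dropped altogether: np.polyfit accepts a 2-D y with one data set per column and fits them all in one call. The following is a sketch of that idea using the five sample rows from the question, not a benchmark of the real 5000-row file; the array names are made up for the example.

```python
import numpy as np

# x values (column headers) and the five sample rows from the question
x = np.array([5.032530, 6.972868, 8.888268, 10.732009, 12.879130, 16.877655])
Y = np.array([
    [2.512298, 2.132748, 1.890665, 1.583538, 1.582968, 1.440091],
    [5.628667, 4.206962, 4.179009, 3.162677, 3.132448, 1.887631],
    [3.177090, 2.274014, 2.412432, 2.066641, 1.845065, 1.574748],
    [5.060260, 3.793109, 3.129861, 2.617136, 2.703114, 1.921615],
    [4.153010, 3.354411, 2.706463, 2.570981, 2.020634, 1.646298],
])

# polyfit expects one data set per column, hence the transpose.
# The result has shape (2, n_rows): slopes first, then intercepts.
slope, intercept = np.polyfit(np.log(x), np.log(Y).T, 1)

b = slope              # exponent of the power law, one per row
a = np.exp(intercept)  # prefactor of the power law, one per row

# Fitted curves evaluated at the original x values, shape (n_rows, 6)
fitted = a[:, None] * x ** b[:, None]
print(np.column_stack([a, b]))
```

Keep in mind that a least-squares fit in log space minimizes errors in log(y), so a and b will generally differ somewhat from what curve_fit returns on the raw values. If agreement with the nonlinear fit to 4 decimal places matters, one option is to pass these values as p0 to curve_fit.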