我有一个数据集,其中包含一个长 x 值数组和一个同样长的 y 值数组。对于每个 (x,y) 对,我想找到已知函数 y(x) 上最近的点。
原则上我可以循环每一对并执行最小化,例如 scipy.optimize.cobyla,但在 python 中循环很慢。 Scipy 的 odr 包看起来很有趣,但我不知道如何让它简单地返回正交向量而不最小化整个事情(将最大迭代“maxit”设置为零并不能给我我想要的)。
有没有一种简单的方法可以利用 numpy 数组的速度来完成此任务?
答案很简单:
我冒昧地将函数 y(x) 重命名为 f(z) 以避免混淆。
import numpy as np
# x and y are your numpy arrays of point coords
x = np.array([1,2])
y = np.array([3,4])
# this is your "y(x)" function
def f(z):
return z**2
xmin = x.min()
xmax = x.max()
step = 0.01 # choose your step at the precision you want
# find distances to every point
zpoints = np.arange(xmin,xmax,step)
distances_squared = np.array([(y-f(z))**2+(x-z)**2 for z in zpoints])
# find z coords of closest points
zmin = zpoints[distances_squared.argmin(axis=0)]
fmin = np.array([f(z) for z in zmin])
for i in range(len(x)):
print("point on the curve {},{} is closest to {},{}".format(zmin[i],fmin[i],x[i],y[i]))
曲线上的点 1.6700000000000006,2.788900000000002 最接近 1,3
曲线上的点 1.9900000000000009,3.9601000000000033 最接近 2,4
有一种方法可以加快 Hennadii Madan 的方法,即要求 numpy 而不是 python 进行循环。与往常一样,这是以增加 RAM 为代价的。
下面是我现在用于 2d 的函数。一个很好的功能是它是对称的——可以交换数据集,并且计算时间是相同的。
def find_nearests_2d(x1, y1, x2, y2):
"""
Given two data sets d1 = (x1, y1) and d2 = (x2, y2), return the x,y pairs
from d2 that are closest to each pair from x1, the difference vectors, and
the d2 indices of these closest points.
Parameters
----------
x1
1D array of x-values for data set 1.
y1
1D array of y-values for data set 1 (must match size of x1).
x2
1D array of x-values for data set 2.
y2
1D array of y-values for data set 2 (must match size of x2).
Returns x2mins, y2mins, xdiffs, ydiffs, indices
-------
x2mins
1D array of minimum-distance x-values from data set 2. One value for each x1.
y2mins
1D array of minimum-distance y-values from data set 2. One value for each y1.
xdiffs
1D array of differences in x. One value for each x1.
ydiffs
1D array of differences in y. One value for each y1.
indices
Indices of each minimum-distance point in data set 2. One for each point in
data set 1.
"""
# Generate every combination of points for subtracting
x1s, x2s = _n.meshgrid(x1, x2)
y1s, y2s = _n.meshgrid(y1, y2)
# Calculate all the differences
dx = x1s - x2s
dy = y1s - y2s
d2 = dx**2 + dy**2
# Find the index of the minimum for each data point
n = _n.argmin(d2, 0)
# Index for extracting from the meshgrids
m = range(len(n))
return x2s[n,m], y2s[n,m], dx[n,m], dy[n,m], d2[n,m], n
还可以使用它来快速估计 x,y 对与函数之间的距离:
def find_nearests_function(x, y, f, *args, fpoints=1000):
"""
Takes a data set (arrays of x and y values), and a function f(x, *args),
then estimates the points on the curve f(x) that are closest to each of
the data set's x,y pairs.
Parameters
----------
x
1D array of x-values for data set 1.
y
1D array of y-values for data set 1 (must match size of x).
f
A function of the form f(x, *args) with optional additional arguments.
*args
Optional additional arguments to send to f (after argument x).
fpoints=1000
Number of evenly-spaced points to search in the x-domain (automatically
the maximum possible range).
"""
# Make sure everything is a numpy array
x = _n.array(x)
y = _n.array(y)
# First figure out the range we need for f. Since the function is single-
# valued, we can put bounds on the x-range: for each point, calculate the
# y-distance, and subtract / add this to the x-values
dys = _n.abs(f(x)-y)
xmin = min(x-dys)
xmax = max(x+dys)
# Get "dense" function arrays
xf = _n.linspace(xmin, xmax, fpoints)
yf = f(xf,*args)
# Find all the minima
xfs, yfs, dxs, dys, d2s, n = find_nearests_2d(x, y, xf, yf)
# Return this info plus the function arrays used
return xfs, yfs, dxs, dys, d2s, n, xf, yf
如果这是正交距离回归的一部分(就像在我的例子中),则可以通过误差条数据集轻松缩放差异 dx 和 dy,而无需太多开销,这样返回的距离就是学生化(无单位)残差。
最终,这种“统一搜索各处”技术只会让您接近,并且如果函数在 x 数据范围内不是特别平滑,则会失败。
快速测试代码:
x = [1,2,5]
y = [1,-1,1]
def f(x): return _n.cos(x)
fxmin, fymin, dxmin, dymin, d2min, n, xf, yf = find_nearests_function(x, y, f)
import pylab
pylab.plot(x,y, marker='o', ls='', color='m', label='input points')
pylab.plot(xf,yf, color='b', label='function')
pylab.plot(fxmin,fymin, marker='o', ls='', color='r', label='nearest points')
pylab.legend()
pylab.show()
产生