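I have a set of x and y data. When plotted, the data are linear except for one section where they deviate from the straight line (a hump, for example). My goal is to write a program that identifies the points in the hump, i.e. the points that deviate from the line. The attached plot shows exactly what I want to achieve.
I tried fitting a linear trend line to the input data, computing the residuals, and excluding the points with large residuals; however, this approach has not been very successful. The trend line does not seem to pass through the right points. Here is my code: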
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Data provided
x = np.array([134, 147, 161, 175, 190, 206, 222, 237, 251, 263, 275, 291, 300, 312, 324, 337, 349, 360, 372, 382]).reshape(-1, 1)
y = np.array([0.788875116, 0.692846919, 0.605305046, 0.738780558, 0.826074803, 0.871572936, 0.776701184,
0.646403726, 0.677606953, 0.615950052, 0.357934847, 0.267171728, 0.217483944, 0.155336037,
0.071882007, 0.029383778, -0.008773924, -0.050609993, -0.102372909, -0.148741651])
# Fit linear regression
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x)
# Calculate residuals
residuals = y - y_pred
std_dev = np.std(residuals)
# Identify hump points by adjusting alpha
alpha = 2 / 3
threshold = std_dev * alpha
hump_points = np.where(np.abs(residuals) > threshold)[0]
# Visualize
plt.figure(figsize=(12, 6))
plt.plot(x, y, 'o-', label='Data', markersize=6)
plt.plot(x, y_pred, 'r--', label='Fitted Line', linewidth=2)
plt.axhline(0, color='gray', linestyle='--', label='Zero Residual')
plt.scatter(x[hump_points], y[hump_points], color='orange', label='Hump Point', s=100)
plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Detecting Hump in Data')
plt.grid(True)
plt.show()
# Print hump points
hump_info = [(x[point][0], y[point]) for point in hump_points]
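Try this. Note that unless you know why you would expect a straight line in the first place, this may be a questionable thing to do.
For the record, I have: (1) removed the "reshape" on the x data (mainly because I wasn't sure what it was supposed to do); (2) drawn the line through the first and last points instead of fitting it by least squares; (3) lowered alpha a little, from 2/3 to 1/3.
If this method identifies data that you don't want, you can remove those points and then refit the best straight line.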
import numpy as np
import matplotlib.pyplot as plt
# Data provided
x = np.array([134, 147, 161, 175, 190, 206, 222, 237, 251, 263, 275, 291, 300, 312, 324, 337, 349, 360, 372, 382])
y = np.array([0.788875116, 0.692846919, 0.605305046, 0.738780558, 0.826074803, 0.871572936, 0.776701184,
0.646403726, 0.677606953, 0.615950052, 0.357934847, 0.267171728, 0.217483944, 0.155336037,
0.071882007, 0.029383778, -0.008773924, -0.050609993, -0.102372909, -0.148741651])
# Baseline
y_pred = y[0] + ( y[-1] - y[0] ) / ( x[-1] - x[0] ) * ( x - x[0] )
# Calculate residuals
residuals = y - y_pred
std_dev = np.std(residuals)
# Identify hump points by adjusting alpha
alpha = 1 / 3
threshold = std_dev * alpha
hump_points = np.where(np.abs(residuals) > threshold)[0]
# Visualize
plt.figure(figsize=(12, 6))
plt.plot(x, y, 'o-', label='Data', markersize=6)
plt.plot(x, y_pred, 'r--', label='Baseline', linewidth=2)
plt.scatter(x[hump_points], y[hump_points], color='orange', label='Hump Point', s=100)
plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Detecting Hump in Data')
plt.grid(True)
plt.show()
# Print hump points as plain (x, y) pairs
hump_info = [(int(x[point]), float(y[point])) for point in hump_points]
print(hump_info)
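The remove-and-refit step mentioned above could look roughly like the sketch below. It assumes that "best straight line" means an ordinary least-squares fit and uses `np.polyfit` with degree 1 for that; the answer itself does not name a fitting routine, and the names `is_hump`, `slope`, `intercept` and `y_refit` are only illustrative.

```python
import numpy as np

# Same data as above
x = np.array([134, 147, 161, 175, 190, 206, 222, 237, 251, 263, 275, 291, 300, 312, 324, 337, 349, 360, 372, 382])
y = np.array([0.788875116, 0.692846919, 0.605305046, 0.738780558, 0.826074803, 0.871572936, 0.776701184,
              0.646403726, 0.677606953, 0.615950052, 0.357934847, 0.267171728, 0.217483944, 0.155336037,
              0.071882007, 0.029383778, -0.008773924, -0.050609993, -0.102372909, -0.148741651])

# Baseline through the first and last points, as in the code above
baseline = y[0] + (y[-1] - y[0]) / (x[-1] - x[0]) * (x - x[0])
residuals = y - baseline

# Flag points whose residual exceeds alpha * std (alpha = 1/3, as above)
alpha = 1 / 3
is_hump = np.abs(residuals) > alpha * np.std(residuals)

# Assumption: refit by ordinary least squares on the points that were kept
slope, intercept = np.polyfit(x[~is_hump], y[~is_hump], deg=1)
y_refit = slope * x + intercept

print("kept points:", int((~is_hump).sum()), "of", len(x))
print("refit line: y =", slope, "* x +", intercept)
```

Because the threshold depends on the residuals about the endpoint baseline, the refitted line is only as trustworthy as the two end points; if either of them is itself an outlier, the mask will be off as well.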