For example, consider two spectra:
Spectrum 15 has a nearly symmetric, roughly Gaussian-shaped peak, so it is a signal.
Spectrum 371 has no peak at all and should be treated as noise.
I know I could fit a Gaussian to each spectrum and check whether the fit error is small, but I have ~1000 spectra to check, and that approach is very slow.
Is there a summary statistic I can compute that tells me whether a spectrum contains a peaked signal (which may appear at any velocity) or just noise? For example, a statistic k such that signal means k > 0 and noise means k < 0.
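(For context, the slow per-spectrum fit described above might look like the following sketch, assuming scipy.optimize.curve_fit; the initial guess p0 is a placeholder that would need tuning for real spectra.)

import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amp, center, std, offset):
    return amp * np.exp(-((x - center) ** 2) / (2 * std ** 2)) + offset

def gaussian_fit_error(x, y):
    # Rough initial guess; a placeholder that needs tuning for real data
    p0 = [y.max() - y.min(), x[np.argmax(y)], 5.0, np.median(y)]
    try:
        popt, _ = curve_fit(gaussian, x, y, p0=p0)
        return np.sqrt(np.mean((y - gaussian(x, *popt)) ** 2))  # RMS residual
    except RuntimeError:
        return np.inf  # fit did not converge: almost certainly noise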
You already have the right idea of checking whether the data fits a given curve, but this can be done very efficiently. Using a matched filter, a runtime of <100 microseconds on a spectrum with 512 points is attainable. The solution I present here identifies the signal with approximately 98% accuracy; it fails when the signal is very narrow (in this case, less than about 5 samples wide) or very wide (in this case, when it occupies most of the window). If such cases are common in your data, the matched filter can be adjusted to compensate.
Method:

1. Apply the matched filter using numpy.convolve.
2. Use scipy.signal.find_peaks to determine where the peak response is.

This approach has the added benefit of also telling you, with high accuracy, where the peak occurred (typically <1% error).
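Stripped of the testing scaffolding, the scoring step amounts to something like this minimal sketch (the shape of the Gaussian template and the decision threshold are assumptions you would tune to your data):

import numpy as np
import scipy.signal

def matched_filter_score(y: np.ndarray, template: np.ndarray) -> float:
    # Correlate the spectrum against the template (the matched filter),
    # then score it by the prominence of the strongest response peak.
    response = np.convolve(y, np.flip(template), mode='same')
    peaks, _ = scipy.signal.find_peaks(response)
    if len(peaks) == 0:
        return 0.0
    prominences, _, _ = scipy.signal.peak_prominences(response, peaks)
    return float(prominences.max())

A spectrum is then classified as signal when this score exceeds a threshold chosen from your data (30 in the full implementation below).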
Here is a full example implementation:
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal
def get_normal_density_fun(x_data: np.ndarray, center: float = 0.0, std: float = 1.0, amp: float = 1.0):
    # Gaussian bump scaled so that amp is (approximately) the peak height
    # https://stats.stackexchange.com/a/143633
    return amp / (0.4 / std) * (1 / (std * np.sqrt(2 * np.pi))) * np.exp(-((x_data - center) ** 2) / (2 * std ** 2))
def make_data(n: int, peak_location: float | None, rng: np.random.Generator | None = None) -> tuple[np.ndarray, np.ndarray]:
    # This function is an attempt to recreate the data shown in your uploaded graph; it is not required during typical use
    NOISE_ABS = 0.1
    MAX_CENTER = 1.0
    X_MIN = 100
    X_MAX = 190
    INIT_SLOPE_END_X = 115
    INIT_SLOPE_END_Y = -0.05
    if rng is None:
        rng = np.random.default_rng(42)
    slope_end_index = int(n / (X_MAX - X_MIN) * (INIT_SLOPE_END_X - X_MIN))
    slope_start_val = rng.random() * 0.8 + 0.2
    out_data_x = np.linspace(X_MIN, X_MAX, n)
    out_data = rng.random(size=(n,)) * NOISE_ABS
    out_data[:slope_end_index] += np.linspace(start=slope_start_val, stop=INIT_SLOPE_END_Y, num=slope_end_index)
    if peak_location is not None:
        std = (peak_location - INIT_SLOPE_END_X) * rng.random()
        amp = MAX_CENTER * (0.25 + 0.75 * rng.random())
        out_data += get_normal_density_fun(out_data_x, peak_location, std, amp)
    return out_data_x, out_data
def find_response(x: np.ndarray, y: np.ndarray, plot_all: bool = False):
    # This is the filter we are trying to match the signal to
    # Adjust std and amp to fit your data
    filter = get_normal_density_fun(np.linspace(-15, 15, 50), std=8, amp=5.0)
    y_trimmed = y.copy()
    # Due to some strange values at the beginning, zero out the beginning
    y_trimmed[:len(y) // 8] = 0.0
    # https://dsp.stackexchange.com/a/94212
    matched_filter_output = np.convolve(y_trimmed, np.flip(filter), mode='same')
    # Use scipy to find all the peaks in the filter response
    peaks, properties = scipy.signal.find_peaks(matched_filter_output)
    # This block is for testing purposes, to understand how the system is working
    if plot_all:
        def scale(vec: np.ndarray, min_val: float = 0.0, max_val: float = 1.0, ref_vec: np.ndarray = None):
            if ref_vec is None:
                ref_vec = vec
            return (vec - np.min(ref_vec)) / (np.max(ref_vec) - np.min(ref_vec)) * (max_val - min_val) + min_val
        plt.plot(x, scale(matched_filter_output), label='filter response')
        plt.plot(x, scale(y), label='orig data')
        plt.plot(x, scale(y_trimmed, ref_vec=y), label='trimmed data')
        plt.legend()
        plt.show()
    # Capture the prominence of each peak (how much the peak sticks up)
    prominences, _, _ = scipy.signal.peak_prominences(matched_filter_output, peaks)
    # Find the maximum prominence (which hopefully matches our signal)
    max_prom_index = np.argmax(prominences)
    # Find the x-value that the peak occurred at
    main_peak = peaks[max_prom_index]
    return x[main_peak], prominences[max_prom_index]
def _main():
    import sklearn.metrics
    import time
    rng = np.random.default_rng(seed=42)
    n = 512
    tests = 2 ** 10
    peak_truth = []
    peak_pred = []
    peak_error = []
    peak_error_percent = []
    tot_time = 0.0
    show_mismatches = False
    for has in [True, False]:
        first_run = True
        for i in range(tests // 2):
            if has:
                peak = rng.random() * 50 + 120
            else:
                peak = None
            x, y = make_data(n, peak, rng)
            if first_run:
                first_run = False
                find_response(x, y, True)
            tot_time -= time.perf_counter()
            xprom, prom = find_response(x, y)
            has_guess = prom > 30
            tot_time += time.perf_counter()
            # print(f'{peak if has else 0.0:8.1f} - {xprom:8.1f} : {prom:8.5f}')
            peak_truth.append(has)
            peak_pred.append(has_guess)
            if has:
                peak_error.append(xprom - peak)
                peak_error_percent.append((xprom - peak) / peak * 100)
            if show_mismatches and has_guess != has:
                plt.plot(x, y)
                plt.show()
    print(f'Run time / prediction: {tot_time / tests * 1000:.2f} ms')
    total_correct = np.sum(np.array(peak_truth) == np.array(peak_pred))
    total_wrong = np.sum(np.array(peak_truth) != np.array(peak_pred))
    print(f'Correct / Wrong (%) : {total_correct} / {total_wrong} ({total_correct / len(peak_truth) * 100:.1f}%)')
    print(f'Mean error (%) : {np.mean(np.abs(peak_error)):.2f} ({np.mean(np.abs(peak_error_percent)):.2f}%)')
    print(f'Max / min error : {np.max(np.abs(peak_error)):.2f} / {np.min(np.abs(peak_error)):.2f}')
    # https://stackoverflow.com/a/51477080
    plt.hist(peak_error, bins=100, weights=np.ones(len(peak_error)) / len(peak_error) * 100)
    plt.xlabel('Error'); plt.ylabel('% of occurrences'); plt.title('Absolute Error')
    plt.show()
    cm = sklearn.metrics.confusion_matrix(peak_truth, peak_pred).astype(float)
    for row in range(cm.shape[0]):
        cm[row] /= np.round(np.sum(cm[row]), 3)
    disp = sklearn.metrics.ConfusionMatrixDisplay(confusion_matrix=cm * 100)
    disp.plot()
    plt.show()

if __name__ == '__main__':
    _main()
This produces the following console output:
Run time / prediction: 0.05 ms
Correct / Wrong (%) : 1004 / 20 (98.0%)
Mean error (%) : 0.77 (0.50%)
Max / min error : 49.64 / 0.00
In the plots below, the first and second should look similar to the first and second spectra in the question, respectively:
In this example, with the values I used, it correctly predicts "no signal" every time, and correctly predicts "signal present" 96% of the time.
When you use this, you can adjust the values on the
filter = get_normal_density_fun( ...
line to match your data; the values above match the toy data I generated. You will likely also need to adjust the has_guess = prom > 30 line to a threshold that matches your values. The commented-out print line in the test block shows the prominence for each run, so you can run it on your actual data to see typical values and tune the threshold that way.
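For example, one quick way to pick that threshold is to score a representative sample of spectra and look at the distribution of prominences (a sketch; loadfile and sample_files are placeholders for your own data loading):

import matplotlib.pyplot as plt

# Score a sample of spectra; signal and noise should form two separated
# clusters of prominence values, and the threshold goes between them.
proms = [find_response(*loadfile(fname))[1] for fname in sample_files]
plt.hist(proms, bins=50)
plt.xlabel('Peak prominence')
plt.show()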
A typical use of this function might look something like:
for fname in filelist:
    filedatax, filedatay = loadfile(fname)
    xprom, prom = find_response(filedatax, filedatay)
    if prom > 30:
        process_signal(filedatax, filedatay, xprom)
Let me know if you have any questions!
References:
https://stats.stackexchange.com/a/143633
https://dsp.stackexchange.com/a/94212
https://stackoverflow.com/a/51477080