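My main goal is to determine the most efficient way to find a sequential pattern of numbers, with an allowed deviation of ±5%, in a large array or CSV file (the original data is stored in CSV files of roughly 1.5 GB or more).
The catch is that I'm a former PHP programmer who has just started learning Python, so I'm not familiar with most of the modules and functions that could help here. I did write a small script that searches for the pattern in an array of data, but it is very slow: when I fed it a real CSV file, finding a match took quite a while. I have many of these files, and the patterns change from time to time, so I need a faster approach.
Of course, I'm not asking anyone to write the code for me; I'm only asking to be pointed to modules and functions I could dig into that would help with this volume of data. Should I try SQLite, or would Pandas be enough? Do they perhaps have existing methods and functions that fit my case?
Here is my code with comments: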
def check_pattern(numlist, pattern, deviation=0.05):
    """Return the index in numlist where the first match starts, or None."""
    # Slide a window of len(pattern) elements over numlist
    for i in range(len(numlist) - len(pattern) + 1):
        partial_numbers = numlist[i:i + len(pattern)]
        # The window matches when every pattern value p lies within +-deviation
        # of the corresponding data value n; all() over a generator expression
        # stops at the first mismatch
        if all(n * (1 - deviation) <= p <= n * (1 + deviation)
               for n, p in zip(partial_numbers, pattern)):
            # Return the index, not numlist[i]: a matching value of 0 would be falsy
            return i
    return None


# numlist: the data in which to search for the pattern
numlist = [80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200]
# pattern: the sequence of numbers to find in numlist
pattern = [115, 125, 135, 145]

match = check_pattern(numlist, pattern)
if match is not None:
    print(f"There is a match in 'numlist' starting at index {match} (value {numlist[match]})")
else:
    print("No match")
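For comparison, the same tolerance test could probably be expressed in vectorized form with NumPy, which moves the inner loop out of Python. The following is only a rough sketch under some assumptions: it needs NumPy 1.20 or newer (for `sliding_window_view`), the names `find_matches` and `find_matches_chunked` are made up for illustration, and it returns the start index of every match rather than only the first one.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20


def find_matches(values, pattern, deviation=0.05):
    """Return the start index of every window of `values` matching `pattern`."""
    values = np.asarray(values, dtype=float)
    pattern = np.asarray(pattern, dtype=float)
    if len(pattern) == 0 or len(values) < len(pattern):
        return np.array([], dtype=int)
    # Each row is one candidate window; this is a view, not a copy
    windows = sliding_window_view(values, len(pattern))
    # Same test as check_pattern: n*(1-deviation) <= p <= n*(1+deviation)
    ok = (windows * (1 - deviation) <= pattern) & (pattern <= windows * (1 + deviation))
    return np.flatnonzero(ok.all(axis=1))


def find_matches_chunked(chunks, pattern, deviation=0.05):
    """Yield global start indices of matches across an iterable of 1-D chunks."""
    overlap = len(pattern) - 1
    tail = np.array([], dtype=float)  # last values carried over from the previous chunk
    offset = 0                        # global index of the first element of `tail`
    for chunk in chunks:
        data = np.concatenate([tail, np.asarray(chunk, dtype=float)])
        for i in find_matches(data, pattern, deviation):
            yield offset + int(i)
        # Keep the last len(pattern)-1 values so matches spanning two chunks are found
        keep = min(overlap, len(data))
        offset += len(data) - keep
        tail = data[len(data) - keep:]


numlist = [80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200]
pattern = [115, 125, 135, 145]
print(find_matches(numlist, pattern).tolist())  # [3, 4]
print(list(find_matches_chunked([numlist[:5], numlist[5:]], pattern)))  # [3, 4]
```

The comparison builds temporary arrays of roughly `len(values) * len(pattern)` elements, so for a 1.5 GB file it would likely have to run chunk by chunk, which is what `find_matches_chunked` sketches. The chunks could come, for instance, from iterating over `pandas.read_csv(..., chunksize=...)` and taking the relevant column of each piece; which column that is depends on the file layout.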
Thank you in advance!