Why does basic string formatting outperform the built-in re.compile() function?

Question

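I used the dis library (the disassembler for Python bytecode) to try to understand what causes this interesting performance difference. The difference shows up at different scales, even with 1000 strings, and not only in a single-string example. However, I am still clueless.

I took the measurements with the standard time library and timeit.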

import re
import timeit

from Bio import SeqIO  # Biopython, used below to parse the FASTA file

def find_overlapped_instances(regex, string):
    found_any = False
    for start in range(len(string)):
        if(found_any):#addition
            break
        for end in range(start, len(string)):
            cur = string[start:end+1]
            if re.match(regex, cur):
                found_any = True
                break
    return found_any

#regex = re.compile(r'F.{4}D.AX')
regex = r'^{}$'.format(r'F.{4}D.AX') 


fasta_file = "voodoo.fasta"

total_time = 0
num_hits = 0
for record in SeqIO.parse(fasta_file, "fasta"):
    sequence = str(record.seq)
    starttime = timeit.default_timer()
    found_1 = find_overlapped_instances(regex, sequence)
    total_time += timeit.default_timer()-starttime
    if(found_1):
        num_hits += 1
print(num_hits)
print(total_time)

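The formatting seems to improve search performance (searching is the main and almost the only source of computation), which results in a performance difference of roughly 25%: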

Searching for the pattern in n strings	re.compile()	r'^{}$'.format()
n = 200	64.3 s	47.1 s
n = 400	136.7 s	101.7 s

Can someone help me understand why this happens?

Answer 1

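Apart from the fact that both tests should use the same regular expression (i.e. both with or both without the ^ and $ anchors), the main issue is that when you have a compiled regex object, you should not pass it as the first argument to re.match, but call match on that object. Only then do you benefit from the compilation.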
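As a minimal sketch of that advice, the anchored pattern from the question can be compiled once and then used through its own match method. The names anchored and matches_whole are hypothetical and only serve the illustration; the sample strings are taken from the test string used further down.

```python
import re

# Same anchored pattern as in the question, compiled once up front.
anchored = re.compile(r'^{}$'.format(r'F.{4}D.AX'))

def matches_whole(candidate):
    """Return True if the whole candidate string matches the anchored pattern."""
    # Call match on the compiled object instead of re.match(anchored, candidate).
    return anchored.match(candidate) is not None

print(matches_whole("F9999D9AX"))   # True
print(matches_whole("F999D99AX"))   # False: no D after the four wildcard characters
print(matches_whole("xF9999D9AX"))  # False: match only tries the start of the string
```

With both variants anchored the same way, any remaining timing difference comes from how the pattern is invoked rather than from the pattern itself.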

Here is a simpler test that shows this difference using three scenarios:

1. Calling match on the compiled regex object (fastest)
2. Calling re.match with the pattern string as argument (middle)
3. Calling re.match with the compiled regex object as argument (worst)

Here is a script for testing:

import re
import timeit

pattern = r'F.{4}D.AX'
compiled = re.compile(pattern)
s = "qsdf F9999D9AX los F999D99AX"

def test1():
    compiled.match(s)

def test2():
    re.match(pattern, s)

def test3():
    re.match(compiled, s)

for test in (test1, test2, test3):   
    print(timeit.timeit(test, number=1000000))

This is the output I got:

0.7618284979998862
2.8180372250008077
4.692441613999108
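Applied to the function from the question, the fastest of the three scenarios could look roughly like the sketch below. It assumes the caller passes a compiled pattern (the parameter name compiled_regex is hypothetical) and it keeps the original substring loops unchanged, so only the way the pattern is invoked differs. No timings are claimed for this version.

```python
import re

# Compile the anchored pattern once, outside the loops.
compiled = re.compile(r'^{}$'.format(r'F.{4}D.AX'))

def find_overlapped_instances(compiled_regex, string):
    """Return True if any substring of `string` matches `compiled_regex`."""
    for start in range(len(string)):
        for end in range(start, len(string)):
            # Method call on the compiled object, not re.match(compiled_regex, ...).
            if compiled_regex.match(string[start:end + 1]):
                return True
    return False

print(find_overlapped_instances(compiled, "qsdf F9999D9AX los F999D99AX"))  # True
print(find_overlapped_instances(compiled, "qsdf F999D99AX"))  # False
```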
