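I have a text file of about 1 MB. I want to extract several pieces of data from it, which takes multiple regular expressions.
For example, to get the number of atoms, basis functions, and GTFs, plus the total energy and the virial ratio, I have to use several regexes: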
Atoms:\s+(\d+)
Basis functions:\s+(\d+)
GTFs:\s+(\d+)
Total energy:\s+(-?\d+\.\d+)
Virial ratio:\s+(-?\d+\.\d+)
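After the first regex matches, I want the next regex to start searching from where that match ended, and so on for each regex that follows.
I know the simple approach is to truncate the string after each match with text = text[match.end():], but that seems inefficient for a large text file.
Is there a better way?
I found an earlier question on the same problem, "Regex matching text after the text matched by another regex", but I was not satisfied with its answers.
An excerpt of the file: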
Converting basis function information to GTF information...
Back converting basis function information from Cartesian to spherical type...
Generating density matrix based on SCF orbitals...
Generating overlap matrix...
Total/Alpha/Beta electrons: 112.0000 56.0000 56.0000
Net charge: 0.00000 Expected multiplicity: 1
Atoms: 25, Basis functions: 293, GTFs: 517
Total energy: -992.117436714606 Hartree, Virial ratio: 2.00276853
This is a restricted single-determinant wavefunction
Orbitals from 1 to 56 are occupied
Title line of this file: 000001
Loaded 000001.fchk successfully!
Formula: H11 C9 O4 P1 Total atoms: 25
Molecule weight: 214.15535 Da
Point group: C1
"q": Exit program gracefully "r": Load a new file
************ Main function menu ************
0 Show molecular structure and view orbitals
1 Output all properties at a point 2 Topology analysis
3 Output and plot specific property in a line
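If every group always contains all of the following items (Atoms, Basis functions, GTFs, Total energy, and Virial ratio), you can build a single compound regular expression and use the re.VERBOSE flag, which lets you spread the pattern over several lines. Because re.VERBOSE ignores literal whitespace in the pattern, spaces that must match are written as \s.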
import re
pattern = r"""
Atoms:\s+(?P<atoms>\d+),\s+
Basis\sfunctions:\s+(?P<basisfn>\d+),\s+
GTFs:\s+(?P<gtfs>\d+),?\s+
Total\senergy:\s+(?P<totalenergy>-?\d+\.\d+)\s+\w+,\s+
Virial\sratio:\s+(?P<virialratio>-?\d+\.\d+)
"""
s = """Converting basis function information to GTF information...
Back converting basis function information from Cartesian to spherical type...
Generating density matrix based on SCF orbitals...
Generating overlap matrix...
Total/Alpha/Beta electrons: 112.0000 56.0000 56.0000
Net charge: 0.00000 Expected multiplicity: 1
Atoms: 25, Basis functions: 293, GTFs: 517
Total energy: -992.117436714606 Hartree, Virial ratio: 2.00276853
This is a restricted single-determinant wavefunction
Orbitals from 1 to 56 are occupied
Title line of this file: 000001
Loaded 000001.fchk successfully!
"""
# finditer yields one match per complete group found in the text
for match in re.finditer(pattern, s, re.VERBOSE):
    print(match.groupdict())
# prints:
# {'atoms': '25', 'basisfn': '293', 'gtfs': '517', 'totalenergy': '-992.117436714606', 'virialratio': '2.00276853'}
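If some of the items can be missing, the compound pattern above will not match at all. In that case, one possible alternative is to keep the separate regexes from the question and chain them with the optional pos argument of a compiled pattern's search() method, which starts the search at a given index without slicing the string. The following is only a minimal sketch under that assumption: FIELDS, extract_in_order, and the field names are hypothetical and chosen for illustration, and the sample is a shortened copy of the excerpt in the question.

```python
import re

# Hypothetical table of (name, compiled regex); the patterns are the ones
# listed in the question, the names are made up for this sketch.
FIELDS = [
    ("atoms", re.compile(r"Atoms:\s+(\d+)")),
    ("basis_functions", re.compile(r"Basis functions:\s+(\d+)")),
    ("gtfs", re.compile(r"GTFs:\s+(\d+)")),
    ("total_energy", re.compile(r"Total energy:\s+(-?\d+\.\d+)")),
    ("virial_ratio", re.compile(r"Virial ratio:\s+(-?\d+\.\d+)")),
]


def extract_in_order(text, start=0):
    """Run each regex in turn, starting where the previous match ended."""
    results = {}
    pos = start
    for name, regex in FIELDS:
        # pos tells search() where to begin; the string is not copied.
        match = regex.search(text, pos)
        if match is None:
            # Field not found: record None and keep the current position.
            results[name] = None
            continue
        results[name] = match.group(1)
        pos = match.end()
    return results, pos


sample = (
    "Net charge: 0.00000 Expected multiplicity: 1\n"
    "Atoms: 25, Basis functions: 293, GTFs: 517\n"
    "Total energy: -992.117436714606 Hartree, Virial ratio: 2.00276853\n"
)
values, end_pos = extract_in_order(sample)
print(values)
```

The returned end_pos can be passed back as start to continue with the next group in the same file. One difference from slicing worth keeping in mind: with pos, an anchor such as ^ still refers to the real start of the string (or of a line in multiline mode), not to the index where the search begins.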