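I have been wrestling with this for a while and keep coming up empty-handed.
I am parsing a file into a pandas DataFrame before dumping it into MySQL, and I have a set of rows with variations such as: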
523421 F-INV PROC 11/01 01:00:00
634312 MA-BAREAUTH 11/01 01:00:00
523421 MK-PERM YEAR 11/01 01:00:00
123512 G5-FSB 3.00 11/01 01:00:00
864982 JA-PAREN 4.25* 11/01 01:00:00
934821 4.00 11/01 01:00:00
620021 I-MAS DIN 5.25* 11/01 01:00:00
969722 MS-DARE .35 11/01 01:00:00
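I am trying to split each row into 4 groups, regardless of whether a group is empty. I have made some progress with the following regex, but it does not work when there is no \d.\d{2} match before the datestamp: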
(\d{6}).([^\d]*).([^\s][\.]\d+[\*]*).(\d{2}\/\d{2}\s\d{2}\:\d{2}\:\d{2})
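For reference, here is a minimal sketch that reproduces the problem on the sample rows. The names `pattern`, `lines`, and `matched` are only illustrative. As far as I can tell, group 3 is mandatory in this pattern and `[^\s]` requires a non-space character directly before the dot, so rows without an amount, or with an amount like .35, cannot match.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(
    r'(\d{6}).([^\d]*).([^\s][\.]\d+[\*]*).(\d{2}\/\d{2}\s\d{2}\:\d{2}\:\d{2})'
)

# The sample rows from the question.
lines = [
    '523421 F-INV PROC 11/01 01:00:00',
    '634312 MA-BAREAUTH 11/01 01:00:00',
    '523421 MK-PERM YEAR 11/01 01:00:00',
    '123512 G5-FSB 3.00 11/01 01:00:00',
    '864982 JA-PAREN 4.25* 11/01 01:00:00',
    '934821 4.00 11/01 01:00:00',
    '620021 I-MAS DIN 5.25* 11/01 01:00:00',
    '969722 MS-DARE .35 11/01 01:00:00',
]

# Collect the rows the pattern accepts and show the groups for each row.
matched = [line for line in lines if pattern.search(line)]
for line in lines:
    m = pattern.search(line)
    print(line, '->', m.groups() if m else 'no match')
```

On these rows only the JA-PAREN and I-MAS DIN lines match; every other row is rejected.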
The idea is to split a line such as
969722 MS-DARE 1.35 11/01 01:00:00
into groups like this:
969722
MS-DARE
1.35
11/01 01:00:00
This works for lines like 969722 MS-DARE 1.35 11/01 01:00:00
but breaks when group 2 contains a space, for example:
969722 MS-DARE PIN .35 11/01 01:00:00
I would like the groups to be:
969722
MS-DARE PIN
.35
11/01 01:00:00
Overall, the end goal is to get groups for all of these variations, for example:
523421
F-INV PROC
.95
11/01 01:00:00
634312
MA-BAREAUTH
11/01 01:00:00
523421
MK-PERM YEAR
11/01 01:00:00
123512
G5-FSB
3.00
11/01 01:00:00
864982
JA-PAREN
4.25*
11/01 01:00:00
934821
4.00
11/01 01:00:00
620021
I-MAS DIN
5.25*
11/01 01:00:00
969722
MS-DARE
.35
11/01 01:00:00
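How can I account for all of these variations so that I always get 4 groups, where group 3 holds the amount (such as 3.00 or .35) when there is one and is empty otherwise?
Update: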
https://regex101.com/r/lL8rIj/1/
Getting closer, but when there is no amount I still need an empty group 3 for each match.
It looks like you could use
^ # start of line
(?P<group1>\d+)\s # capture numbers, match whitespace
(?P<group2>(?:(?!\d*\.\d+|\d{2}/).)+)? # capture as long as the formats
# of group 3 and 4 are not met
# the group is optional
(?P<group3>\d*\.\d+\*?)?\s+ # format of group 3...
(?P<group4>\d+/\d+.+) # ... and 4 respectively
$ # end of line
In Python and pandas this would be:
import re, pandas as pd
string = """
864982 JA-PAREN 4.25* 11/01 01:00:00
934821 4.00 11/01 01:00:00
620021 I-MAS DIN 5.25* 11/01 01:00:00
969722 MS-DARE AUT .35 11/01 01:00:00
523421 F-INV PROC 11/01 01:00:00
634312 MA-BAREAUTH 11/01 01:00:00
523421 MK-PERM YEAR 11/01 01:00:00
123512 G5-FSB 3.00 11/01 01:00:00
"""
rx = re.compile(r'''
^
(?P<group1>\d+)\s
(?P<group2>(?:(?!\d*\.\d+|\d{2}/).)+)?
(?P<group3>\d*\.\d+\*?)?\s+
(?P<group4>\d+/\d+.+)
$''', re.VERBOSE | re.MULTILINE)
records = ((m.group(1), m.group(2).rstrip() if m.group(2) else None,
m.group(3), m.group(4))
for m in rx.finditer(string))
df = pd.DataFrame(records)
print(df)
0 1 2 3
0 864982 JA-PAREN 4.25* 11/01 01:00:00
1 934821 None 4.00 11/01 01:00:00
2 620021 I-MAS DIN 5.25* 11/01 01:00:00
3 969722 MS-DARE AUT .35 11/01 01:00:00
4 523421 F-INV PROC None 11/01 01:00:00
5 634312 MA-BAREAUTH None 11/01 01:00:00
6 523421 MK-PERM YEAR None 11/01 01:00:00
7 123512 G5-FSB 3.00 11/01 01:00:00
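If named fields are more convenient than positional ones (for example, to use them as column names later), a variant using `groupdict()` might look like the sketch below. It reuses the same pattern. The helper name `parse_rows` and the `sample` text are illustrative assumptions, not part of the original code.

```python
import re

# Same pattern as above: group2 and group3 are optional.
rx = re.compile(r'''
    ^
    (?P<group1>\d+)\s
    (?P<group2>(?:(?!\d*\.\d+|\d{2}/).)+)?
    (?P<group3>\d*\.\d+\*?)?\s+
    (?P<group4>\d+/\d+.+)
    $''', re.VERBOSE | re.MULTILINE)


def parse_rows(text):
    """Return one dict per matching line; missing groups are None."""
    rows = []
    for m in rx.finditer(text):
        row = m.groupdict()
        # group2 may carry the space that precedes the amount.
        if row['group2'] is not None:
            row['group2'] = row['group2'].rstrip()
        rows.append(row)
    return rows


sample = """
864982 JA-PAREN 4.25* 11/01 01:00:00
934821 4.00 11/01 01:00:00
523421 F-INV PROC 11/01 01:00:00
"""

rows = parse_rows(sample)
for row in rows:
    print(row)
```

Each dict always has the four keys, so a list of them can be passed to a DataFrame constructor without padding.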
You can try this:
import re
s = ['523421 F-INV PROC 11/01 01:00:00', '634312 MA-BAREAUTH 11/01 01:00:00', '523421 MK-PERM YEAR 11/01 01:00:00', '123512 G5-FSB 3.00 11/01 01:00:00', '864982 JA-PAREN 4.25* 11/01 01:00:00', '934821 4.00 11/01 01:00:00', '620021 I-MAS DIN 5.25* 11/01 01:00:00', '969722 MS-DARE .35 11/01 01:00:00']
final_s = [re.split(r'\s(?=[\d\W])|(?<=[\d\W])\s', i) for i in s]
Output:
[['523421', 'F-INV PROC', '11/01', '01:00:00'],
['634312', 'MA-BAREAUTH', '11/01', '01:00:00'],
['523421', 'MK-PERM YEAR', '11/01', '01:00:00'],
['123512', 'G5-FSB', '3.00', '11/01', '01:00:00'],
['864982', 'JA-PAREN', '4.25*', '11/01', '01:00:00'],
['934821', '4.00', '11/01', '01:00:00'],
['620021', 'I-MAS DIN', '5.25*', '11/01', '01:00:00'],
['969722', 'MS-DARE', '.35', '11/01', '01:00:00']]
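Note that the lists above keep the date and time as separate items and have no placeholder for a missing name or amount. One possible way to fold them into four fields is sketched below. It assumes the last two items are always the date and the time, and that an amount looks like `\d*\.\d+\*?`. The helper name `to_four_groups` is hypothetical.

```python
import re

SPLIT_RX = re.compile(r'\s(?=[\d\W])|(?<=[\d\W])\s')
AMOUNT_RX = re.compile(r'\d*\.\d+\*?')


def to_four_groups(line):
    """Split a row and pad it to (id, name, amount, timestamp)."""
    ident, *middle, date, time = SPLIT_RX.split(line)
    amount = None
    # Assumption: the amount, if present, is the last item before the date.
    if middle and AMOUNT_RX.fullmatch(middle[-1]):
        amount = middle.pop()
    # Whatever is left belongs to the name; None when nothing is left.
    name = ' '.join(middle) or None
    return (ident, name, amount, f'{date} {time}')


for row in ['523421 F-INV PROC 11/01 01:00:00',
            '934821 4.00 11/01 01:00:00',
            '620021 I-MAS DIN 5.25* 11/01 01:00:00']:
    print(to_four_groups(row))
```

Joining the leftover items with a space also restores names that the split cuts at digits.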
I would like to propose the following solution:
import re
data = """
864982 JA-PAREN 4.25* 11/01 01:00:00
934821 4.00 11/01 01:00:00
620021 I-MAS DIN 5.25* 11/01 01:00:00
969722 MS-DARE AUT .35 11/01 01:00:00
969722 MS-DARE 99/99 AUT .35 11/01 01:00:00
523421 F-INV PROC 11/01 01:00:00
634312 MA-BAREAUTH 11/01 01:00:00
523421 MK-PERM YEAR 11/01 01:00:00
523421 MK-PERM 3. YEAR 11/01 01:00:00
123512 G5-FSB 3.00 11/01 01:00:00
"""
rx = re.compile(r"""
^
(\d+)
(?:
\s([a-z].*[a-z])
\s(\d?\.\d+)\*?\s
|(?:
\s([a-z].*[a-z])(\s)
|(\s)(\d*\.\d+)\*?\s
)
)
(\d\d(?:[/\s:]\d\d){4})
$
""", re.I | re.M | re.X)
for m in rx.finditer(data):
    print(tuple(e for e in m.groups() if e))
Result:
('864982', 'JA-PAREN', '4.25', '11/01 01:00:00')
('934821', ' ', '4.00', '11/01 01:00:00')
('620021', 'I-MAS DIN', '5.25', '11/01 01:00:00')
('969722', 'MS-DARE AUT', '.35', '11/01 01:00:00')
('969722', 'MS-DARE 99/99 AUT', '.35', '11/01 01:00:00')
('523421', 'F-INV PROC', ' ', '11/01 01:00:00')
('634312', 'MA-BAREAUTH', ' ', '11/01 01:00:00')
('523421', 'MK-PERM YEAR', ' ', '11/01 01:00:00')
('523421', 'MK-PERM 3. YEAR', ' ', '11/01 01:00:00')
('123512', 'G5-FSB', '3.00', '11/01 01:00:00')
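In the result above a missing name or amount shows up as a single space, and which tuple slot it lands in depends on which alternative matched. If `None` placeholders in fixed positions are preferred, a sketch along these lines may help. It reuses the same pattern and, like it, leaves the trailing `*` out of the amount. The helper name `normalize` is hypothetical.

```python
import re

# Same pattern as above; it has eight capture groups in total.
rx = re.compile(r"""
    ^
    (\d+)
    (?:
        \s([a-z].*[a-z])
        \s(\d?\.\d+)\*?\s
      |(?:
        \s([a-z].*[a-z])(\s)
       |(\s)(\d*\.\d+)\*?\s
      )
    )
    (\d\d(?:[/\s:]\d\d){4})
    $
""", re.I | re.M | re.X)


def normalize(m):
    """Collapse the eight raw groups into (id, name, amount, timestamp)."""
    ident, name1, amt1, name2, _, _, amt2, stamp = m.groups()
    # Only one branch of the alternation can match, so at most one of each pair is set.
    return (ident, name1 or name2, amt1 or amt2, stamp)


data = """
864982 JA-PAREN 4.25* 11/01 01:00:00
934821 4.00 11/01 01:00:00
523421 F-INV PROC 11/01 01:00:00
"""

rows = [normalize(m) for m in rx.finditer(data)]
for row in rows:
    print(row)
```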