我有一个日志文件 (FIX),我正在尝试将其转换为带有标题的 csv,例如:
8=FIX.4.2|9=435|35=8|34=8766|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|
8=FIX.4.2|9=435|35=8|34=8767|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|
8=FIX.4.2|9=435|35=8|34=8768|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|
希望规范化为 csv 如下:
8,9,35,34,49,50,52,56
FIX.4.2,435,8,8766,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
FIX.4.2,435,8,8767,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
FIX.4.2,435,8,8768,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
读取文件日志:
with open(some.txt) as file:
data = file.read()
fix = pd.read_csv(data,sep='|')
print(fix)
我有一些标题和解析数据的正则表达式:
# regexforheader = re.compile("(?<=\|)(.*?)(?==)")
# regexRowData = re.compile="(?<=\=)(.*?)(?=\|)"
不知道如何把它们放在一起。
使用pandas,您可以使用
str.split
和pivot
:
df = pd.read_csv('in.csv', sep='|', header=None)
(df.stack().str.split('=', n=1, expand=True)
.droplevel(1).pivot(columns=0, values=1)
.sort_index(axis=1, key=lambda x: x.astype(int))
.to_csv('out.csv', index=False)
)
csv
模块(并假设第一行定义了所有列):
import re
import csv
with open('in.csv') as f_in, open('out.csv', 'w') as f_out:
first = True
for l in csv.reader(f_in, delimiter='|'):
d = dict(x.split('=') for x in l if x)
if first:
writer = csv.DictWriter(f_out, d.keys(), delimiter=',')
writer.writeheader()
first = False
writer.writerow(d)
输出:
8,9,34,35,49,50,52,56
FIX.4.2,435,8766,8,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
FIX.4.2,435,8767,8,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
FIX.4.2,435,8768,8,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
300k 行输入:
# pandas version
3.36 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# csv version
1.61 s ± 35.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pandas.Series.str.extract
和pandas.Series.str.split
from io import StringIO
import pandas as pd
file = """8=FIX.4.2|9=435|35=8|34=8766|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|
8=FIX.4.2|9=435|35=8|34=8767|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|
8=FIX.4.2|9=435|35=8|34=8768|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|"""
df = pd.read_csv(StringIO(file), sep="|", header=None)
# drop all nan column
df = df.dropna(axis="columns", how="all")
df.columns = df.apply(lambda ser: ser.str.extract("(\d+)", expand=False)).drop_duplicates().iloc[0]
df = df.apply(lambda ser: ser.str.split("=").str[-1])
df.to_csv("out.csv", index=False)
8,9,35,34,49,50,52,56
FIX.4.2,435,8,8766,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
FIX.4.2,435,8,8767,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
FIX.4.2,435,8,8768,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID