I have a large text document (~20,000 lines) whose body looks like this:
Invoice Account / Name:
0234523454 / XYZCORPORATIONS
Charge Group
Portfolio Fee
Date
Our / Your Ref
Security / Category
Charge Item
No of Units
Market Value
Charge Amt Invoice Amt
30-Sep-2019
Debt Instruments
PORTFOLIO FEE
CS
USD
USD 219.12 USD 219.12
14,136,666.31
Invoice Account / Name:
021346676343/ abcdefgcopr
M0919-031 / Page 3 of 35
Charge Group
Portfolio Fee
Date
Our / Your Re
Security / Category
Charge Item
No of Units
Market Value
Charge Amt Invoice Amt
30-Sep-2019
Equity Instruments
USD 788,640.00 USD 12.22
USD 12.22
PORTFOLIO FEE-
EC_CS
Invoice Account / Name:
123498761233/ somethingelsecorporation
Charge Group
Portfolio Fee
Date
Our / Your Ref
Blocks like this repeat thousands of times. The output I am trying to get:
Invoice Account / Name:
0234523454 / XYZCORPORATIONS
Market Value
Invoice Account / Name:
021346676343/ abcdefgcopr
Market Value
Invoice Account / Name:
123498761233/ somethingelsecorporation
Market Value
Since I have never tried anything like this before, I have two questions: 1. How do I identify and keep lines like these:
Invoice Account / Name:
0234523454 / XYZCORPORATIONS
which have no fixed length?
2. Is it wise to use nltk for this, or can it be handled with regular expressions and string processing?
result = []
with open('num.txt', 'r') as file:
    data = file.readlines()

for indx, row in enumerate(data):
    # Each block starts with the "Invoice Account / Name:" header line
    if 'Invoice Account' in row and indx + 1 < len(data):
        # The next line holds "<account number> / <company name>"; split on the first '/' only
        parts = data[indx + 1].split('/', 1)
        accountnumber = parts[0].strip()
        companyname = parts[1].strip() if len(parts) > 1 else ''
        # Store all results in a dictionary; you could print or store them in other ways as well.
        info = {'Account Number': accountnumber,
                'Company Name': companyname,
                'Market Value': '',
                }
        # Append the dictionary to a list called result
        result.append(info)
You can then access the data directly from each dictionary; each one contains only the values for its own company.
for entry in result:
    print(f"Account Name: {entry['Company Name']}\n"
          f"Account Number: {entry['Account Number']}\n"
          f"Market Value: {entry['Market Value']}")
Output:
Account Name: XYZCORPORATIONS
Account Number: 0234523454
Market Value:
Account Name: abcdefgcopr
Account Number: 021346676343
Market Value:
Account Name: somethingelsecorporation
Account Number: 123498761233
Market Value:
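On the second question: nltk is probably more than this task needs, since the header line is fixed and only the line after it varies. As a rough sketch of the regular-expression route, the snippet below matches the header plus the following line using Python's standard `re` module. The names `HEADER_RE` and `extract_accounts` are made up for illustration, and the pattern assumes the account number is all digits and is separated from the company name by a single `/`, as in the sample above.

```python
import re

# Hypothetical pattern (an assumption, not part of the original answer):
# the fixed header line, then "<digits> / <name>" on the very next line.
HEADER_RE = re.compile(
    r"^Invoice Account / Name:[ \t]*\r?\n"        # header line
    r"[ \t]*(\d+)[ \t]*/[ \t]*(.+?)[ \t]*\r?$",   # account number / company name
    re.MULTILINE,
)


def extract_accounts(text):
    """Return (account_number, company_name) pairs, one per invoice block."""
    return [(m.group(1), m.group(2)) for m in HEADER_RE.finditer(text)]


# Shortened sample taken from the question's data.
sample = """Invoice Account / Name:
0234523454 / XYZCORPORATIONS
Charge Group
Portfolio Fee
Invoice Account / Name:
021346676343/ abcdefgcopr
M0919-031 / Page 3 of 35
Charge Group
"""

# Print in the layout the question asks for.
for number, name in extract_accounts(sample):
    print("Invoice Account / Name:")
    print(f"{number} / {name}")
    print("Market Value")
```

Note that this normalizes the spacing around the `/`, so `021346676343/ abcdefgcopr` is printed as `021346676343 / abcdefgcopr`. For the real file, the text could be read with `open('num.txt').read()` and passed to `extract_accounts`; at around 20,000 lines, holding it all in memory should not be a problem.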