Parse a large text document, keeping only the account number and a specific keyword ("Market Value")


I have a large text document (~20,000 lines) whose body looks like this:

Invoice Account / Name: 
0234523454 / XYZCORPORATIONS
Charge Group
Portfolio Fee
Date
Our / Your Ref
Security / Category
Charge Item
No of Units
Market Value
Charge Amt Invoice Amt
30-Sep-2019
Debt Instruments
PORTFOLIO FEE
CS
USD 
USD 219.12 USD 219.12
14,136,666.31
 Invoice Account / Name: 
021346676343/ abcdefgcopr
M0919-031  / Page 3 of 35
Charge Group
Portfolio Fee
Date
Our / Your Re
Security / Category
Charge Item
No of Units
Market Value
Charge Amt Invoice Amt
30-Sep-2019
Equity Instruments
USD 788,640.00 USD 12.22
USD 12.22
PORTFOLIO FEE-
EC_CS
 Invoice Account / Name: 
123498761233/ somethingelsecorporation
Charge Group
Portfolio Fee
Date
Our / Your Ref

Blocks like these repeat thousands of times. The output I am trying to get is:

Invoice Account / Name: 
    0234523454 / XYZCORPORATIONS
Market Value
Invoice Account / Name: 
    021346676343/ abcdefgcopr
Market Value
Invoice Account / Name: 
    123498761233/ somethingelsecorporation
Market Value

I have never attempted anything like this before, so I have two questions. 1. How do I identify and keep lines like these:

Invoice Account / Name: 
0234523454 / XYZCORPORATIONS

which have no fixed length?

  2. Beyond that, how do I keep only the keyword "Market Value"?

Would it be sensible to use nltk for this? Or can it be handled with regular expressions and string processing?

python python-3.x nltk
1 Answer
You can find what you need using string processing alone.

result = []
with open('num.txt', 'r') as file:
    data = file.readlines()

for indx, row in enumerate(data):
    # The account details sit on the line right after the header
    if 'Invoice Account' in row and indx + 1 < len(data):
        parts = data[indx + 1].split('/', 1)
        if len(parts) < 2:
            continue  # next line is not "number / name"; skip it
        accountnumber = parts[0].strip()  # Get account number from next line
        companyname = parts[1].strip()    # Get company name from next line
        # Store all results in a dictionary; you could print them or store
        # them in other ways as well. 'Market Value' is left empty here.
        info = {
            'Account Number': accountnumber,
            'Company Name': companyname,
            'Market Value': '',
        }
        # Append the dictionary to a list called result
        result.append(info)

You can then access the data directly from each dictionary; each one contains only the values for a single company.

for data in result:
    print(f"""Account Name: {data['Company Name']}
Account Number: {data['Account Number']}
Market Value: {data['Market Value']}
""")

Output:

Account Name: XYZCORPORATIONS
Account Number: 0234523454
Market Value: 

Account Name: abcdefgcopr
Account Number: 021346676343
Market Value: 

Account Name: somethingelsecorporation
Account Number: 123498761233
Market Value: 
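The question also asks whether regular expressions would work, and it wants the output in a slightly different shape (the header line, the indented account line, then the literal "Market Value" keyword). Here is a minimal sketch of that variant using the standard `re` module. The names `ACCOUNT_LINE`, `extract_accounts`, `format_report` and `SAMPLE` are made up for illustration. It assumes that the account number is all digits and that the account line always comes directly after the "Invoice Account / Name:" header, as it does in the sample.

```python
import re

# Assumed shape of an account line: digits, optional spaces, "/", then the name.
ACCOUNT_LINE = re.compile(r"^\s*(\d+)\s*/\s*(.+?)\s*$")


def extract_accounts(lines):
    """Return a list of (account_number, company_name), one per invoice block."""
    accounts = []
    expect_account = False  # True when the previous line was the header
    for line in lines:
        if line.strip().startswith("Invoice Account"):
            expect_account = True
            continue
        if expect_account:
            expect_account = False
            match = ACCOUNT_LINE.match(line)
            if match:
                accounts.append((match.group(1), match.group(2)))
    return accounts


def format_report(accounts):
    """Build the three-line-per-account layout requested in the question."""
    out = []
    for number, name in accounts:
        out.append("Invoice Account / Name:")
        out.append(f"    {number} / {name}")
        out.append("Market Value")
    return "\n".join(out)


# Short excerpt of the data from the question, used only as a demo input.
SAMPLE = """Invoice Account / Name:
0234523454 / XYZCORPORATIONS
Charge Group
Market Value
 Invoice Account / Name:
021346676343/ abcdefgcopr
M0919-031  / Page 3 of 35
Market Value
"""

print(format_report(extract_accounts(SAMPLE.splitlines())))
```

For the real file you could pass the open file object instead of `SAMPLE.splitlines()`, since the function only iterates over lines. Note that this sketch normalizes the spacing around the slash rather than copying the original line verbatim, and that a line such as "M0919-031  / Page 3 of 35" is ignored because it does not directly follow a header.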
