I have a .txt report that contains account numbers, addresses, and credit limits.
It has page breaks, but it generally looks like this:
Customer Address Credit limit
A001 Wendy's 20000
123 Main Street
City, State
Zip
I want my data frame to look like this:
Customer Address Credit Limit
A001 Wendy's 123 Main Street, City, State, Zip 20000
Here is a link to the sample CSV I am working with.
http://faculty.tlu.edu/mthompson/IDEA%20files/Customer.txt
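I tried skipping rows, but that did not work.
Well, this format is not very difficult to handle, but it is not CSV. So neither the Python csv module nor pandas read_csv can be used. We will have to parse it by hand.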
    import re
    import pandas as pd

    file = 'Customer.txt'  # assumption: a local copy of the report linked above

    fieldpos = [(5, 19), (23, 49), (57, 77), (90, -1)]  # position of fields in the initial line
    inblock = False                              # we do not start inside a block
    account_pat = re.compile(r'[A-Z]+\d+\s*$')   # regex patterns are compiled once for performance
    limit_pat = re.compile(r'\s*\d+$')
    data = []                                    # a list for the accounts
    with open(file) as fd:
        for line in fd:
            if not inblock:
                # only long lines can be the first line of an account block
                if len(line) > 100:
                    row = [line[start:end].strip() for start, end in fieldpos]
                    if account_pat.match(row[0]) and limit_pat.match(row[-1]):
                        inblock = True
                        data.append(row)
            else:
                line = line.strip()
                if len(line) > 0:
                    row[2] += ', ' + line        # continuation line: extend the address
                else:
                    inblock = False              # a blank line ends the block

    # we can now build a dataframe
    df = pd.DataFrame(data, columns=['Account Number', 'Name', 'Address', 'Credit Limit'])
This finally gives:
        Account Number            Name                                            Address  Credit Limit
     0            A001     Dan Ackroyd  Audenshaw, 125 New Street, Montreal, Quebec, H...         20000
     1            A123      Mike Atsil  The Vetinary House, 123 Dog Row, Thunder Bay, ...         20000
     2            A128       Ivan Aker            The Old House, Ottawa, Ontario, P1D 8D4         10000
     3            B001    Kim Basinger    Mesh House, Fish Street, Rouyn, Quebec, J5V 2A9         12000
     4            B002  Richard Burton  Eagle Castle, Leafy Lane, Sudbury, Ontario, L3...          9000
     5            B004    Jeff Bridges  Arrow Road North, Lakeside, Kenora, Ontario, N...         20000
     6            B008     Denise Bent  The Dance Studio, Covent Garden, Montreal, Que...         20000
     7            B010     Carter Bout  Removals Close, No Fixed Abode Road, Toronto, ...         20000
     8            B022    Ronnie Biggs     Gotaway Cottage, Thunder Bay, Ontario, K3A 6F3          5000
     9            C001      Tom Cruise  The Firm, Gunnersbury, Waskaganish, Quebec, G1...         25000
    10            C003      John Candy  The Sweet Shop, High Street, Trois Rivieres, Q...         15000
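The same block-by-block idea can be tried without downloading the file. The sketch below is only an illustration under stated assumptions: `SAMPLE` is made-up text that imitates the report layout (it is not the real Customer.txt), `parse_report` and `FIRST_LINE` are hypothetical names, and the first line of a record is recognised with a regular expression that assumes columns are separated by at least two spaces, instead of the fixed character positions used above.

```python
import io
import re

# Made-up sample that imitates the report layout; not the real Customer.txt.
SAMPLE = """\
Customer  Name            Address               Credit limit

A001      Wendy's         123 Main Street              20000
                          City, State
                          Zip

--- page 2 ---
Customer  Name            Address               Credit limit

B002      Example Co      9 Side Road                   5000
                          Other City, State
"""

# Assumption: the first line of a record holds account id, name, first
# address line and credit limit, separated by runs of two or more spaces.
FIRST_LINE = re.compile(r'^([A-Z]+\d+)\s{2,}(.+?)\s{2,}(.+?)\s{2,}(\d+)\s*$')


def parse_report(lines):
    """Collect one dict per account, joining continuation lines into the address."""
    records = []
    current = None  # record being filled, or None when outside a block
    for line in lines:
        match = FIRST_LINE.match(line)
        if match:
            account, name, address, limit = match.groups()
            current = {
                'Account Number': account,
                'Name': name,
                'Address': address,
                'Credit Limit': int(limit),
            }
            records.append(current)
        elif current is not None and line.strip():
            # non-blank line inside a block: another piece of the address
            current['Address'] += ', ' + line.strip()
        else:
            # blank line, header, or page break outside a block
            current = None
    return records


records = parse_report(io.StringIO(SAMPLE))
for record in records:
    print(record)
```

If pandas is available, the resulting list of dicts can be passed to `pd.DataFrame(records)` to get a data frame with the same four columns. Whether the two-space separator assumption holds for the real report would need to be checked against the file; the fixed positions in the answer above are the safer choice when names or addresses can touch the next column.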