I have a .csv file in the following format:
Cash
Serial,Date,Balance
1,2021-03-05,34
2,2021-05-04,54
Credit
Serial,Date,Balance
18,2021-03-05,898
21,2021-04-01,654
Savings
Serial,Date,Balance
3,2021-03-18,19384
34,2021-12-04,472
I want to load it into a pandas DataFrame with the following structure:
Serial,Asset,Date,Balance
1,Cash,2021-03-05,34
2,Cash,2021-05-04,54
18,Credit,2021-03-05,898
21,Credit,2021-04-01,654
3,Savings,2021-03-18,19384
34,Savings,2021-12-04,472
I can already load the file into a DataFrame with the following code:
import numpy as np
import pandas as pd

FILE = r"/myfile.csv"
# Count the fields on every line so the widest row determines the column count
with open(FILE, 'r') as temp_f:
    col_count = [len(l.split(",")) for l in temp_f.readlines()]
column_names = [i for i in range(0, max(col_count))]
df = pd.read_csv(FILE, header=None, delimiter=",", names=column_names)
df['Asset'] = np.nan
print(df)
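But now I don't know how to delete the rows containing "Serial,Date,Balance" and fill the rows of the Asset column with the corresponding entries ("Cash", "Credit", etc.). Any suggestions are appreciated.
A CSV should have a single header, but this reads the file as it is: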
import pandas as pd
import csv

df = pd.DataFrame(columns='Serial Asset Date Balance'.split())
with open('myfile.csv', 'r', newline='') as temp_f:
    reader = csv.reader(temp_f)
    for line in reader:
        if len(line) == 1:      # Only one thing in the line?
            asset = line[0]     # remember it as the asset type
            next(reader)        # and skip the header line below it
        else:                   # add to the end of the dataframe
            df.loc[len(df.index)] = line[0], asset, line[1], line[2]
print(df)
df.to_csv('output.csv', index=False)
Output:
Serial Asset Date Balance
0 1 Cash 2021-03-05 34
1 2 Cash 2021-05-04 54
2 18 Credit 2021-03-05 898
3 21 Credit 2021-04-01 654
4 3 Savings 2021-03-18 19384
5 34 Savings 2021-12-04 472
output.csv:
Serial,Asset,Date,Balance
1,Cash,2021-03-05,34
2,Cash,2021-05-04,54
18,Credit,2021-03-05,898
21,Credit,2021-04-01,654
3,Savings,2021-03-18,19384
34,Savings,2021-12-04,472
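Assigning through df.loc one row at a time can get slow on large files, because the frame is enlarged on every iteration. A minimal sketch of the same loop that collects the rows in a list and builds the DataFrame once is shown below. The helper name read_sections and the in-memory sample are assumptions made for illustration; they are not part of the answer above.

```python
import csv
import io
import pandas as pd

def read_sections(lines):
    """Parse the sectioned file into one DataFrame (hypothetical helper)."""
    rows = []
    asset = None
    reader = csv.reader(lines)
    for line in reader:
        if not line:              # ignore blank lines
            continue
        if len(line) == 1:        # a lone field is an asset name
            asset = line[0]
            next(reader, None)    # skip the header line below it
        else:
            rows.append((line[0], asset, line[1], line[2]))
    # Build the frame once instead of growing it row by row
    return pd.DataFrame(rows, columns=['Serial', 'Asset', 'Date', 'Balance'])

# In-memory stand-in for the first two sections of myfile.csv
sample = io.StringIO(
    "Cash\nSerial,Date,Balance\n1,2021-03-05,34\n2,2021-05-04,54\n"
    "Credit\nSerial,Date,Balance\n18,2021-03-05,898\n21,2021-04-01,654\n"
)
df = read_sections(sample)
print(df)
```

With a real file, pass the object returned by open('myfile.csv', newline='') instead of the sample. All values stay strings here, so convert Serial and Balance afterwards if numbers are needed.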
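"I have a .csv file with the following format"
That is clearly not a CSV file. It is three such files.
Store them in the filesystem that way.
Read them into three separate DataFrames, then combine them in the usual way to produce a single combined DataFrame. Hint: adding a constant text column of "Cash", "Credit", or "Savings" to each small DataFrame will make your task easier.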
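The suggestion above could look roughly like the following sketch. It assumes the three sections have already been saved as separate, well-formed CSV files; the helper name load_assets is hypothetical, and in-memory buffers stand in for the files so the example runs on its own.

```python
import io
import pandas as pd

def load_assets(sources):
    """Read one CSV per asset and stack them, tagging each row with its asset."""
    frames = []
    for asset, source in sources.items():
        part = pd.read_csv(source)        # each source has its own header row
        part.insert(1, 'Asset', asset)    # constant text column, placed after Serial
        frames.append(part)
    return pd.concat(frames, ignore_index=True)

# Stand-ins for separate files; real paths such as 'cash.csv' would work the same way
sources = {
    'Cash': io.StringIO("Serial,Date,Balance\n1,2021-03-05,34\n2,2021-05-04,54\n"),
    'Credit': io.StringIO("Serial,Date,Balance\n18,2021-03-05,898\n21,2021-04-01,654\n"),
}
combined = load_assets(sources)
print(combined)
```

pd.read_csv accepts either a path or a file-like object, so the dictionary values can be replaced with file names once the sections are stored separately.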
You can use:
import io
import pandas as pd

# Separate sections
data = {}
with open('data.csv') as fp:
    for row in fp:
        if ',' not in row:
            # A line without a comma is an asset name: start a new section
            k = row.strip()
            data[k] = []
        else:
            data[k].append(row.strip())

# Build individual dataframes
dfs = []
for asset, values in data.items():
    df = pd.read_csv(io.StringIO('\n'.join(values)))
    df.insert(1, 'Asset', asset)
    dfs.append(df)

# Merge them
df = pd.concat(dfs, ignore_index=True)
Output:
>>> df
Serial Asset Date Balance
0 1 Cash 2021-03-05 34
1 2 Cash 2021-05-04 54
2 18 Credit 2021-03-05 898
3 21 Credit 2021-04-01 654
4 3 Savings 2021-03-18 19384
5 34 Savings 2021-12-04 472
Use a regular expression with re.finditer to iterate over the blocks, io.StringIO + pandas.read_csv to load each block, and concat to combine them into a single DataFrame:
import re, io
import pandas as pd

# Note: the pattern expects the file to end with a newline
with open('myfile.csv') as f:
    out = pd.concat(
        {m.group(1): pd.read_csv(io.StringIO(m.group(2)))
         for m in re.finditer(r'(\w+)\n(.*?)\n(?=\w+\n|$)',
                              f.read(), flags=re.DOTALL)
         }, names=['Asset']).reset_index('Asset')
Output:
Asset Serial Date Balance
0 Cash 1 2021-03-05 34
1 Cash 2 2021-05-04 54
0 Credit 18 2021-03-05 898
1 Credit 21 2021-04-01 654
0 Savings 3 2021-03-18 19384
1 Savings 34 2021-12-04 472
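Another option stays closer to the code in the question: read everything in one pass, forward-fill the asset names, then drop the title and header rows. This is only a sketch under the assumption that asset lines contain a single field and that every repeated header starts with "Serial"; the helper name load_sectioned_csv is hypothetical.

```python
import io
import pandas as pd

def load_sectioned_csv(source):
    """Load the sectioned file with pandas only (hypothetical helper)."""
    # Read every line as text; asset lines get NaN in Date and Balance
    raw = pd.read_csv(source, header=None,
                      names=['Serial', 'Date', 'Balance'], dtype=str)
    is_title = raw['Date'].isna()
    # Keep the asset name on title rows and carry it down to the rows below
    raw.insert(1, 'Asset', raw['Serial'].where(is_title).ffill())
    is_header = raw['Serial'].eq('Serial')
    out = raw[~is_title & ~is_header].reset_index(drop=True)
    return out.astype({'Serial': int, 'Balance': int})

# In-memory copy of the file from the question
sample = io.StringIO(
    "Cash\nSerial,Date,Balance\n1,2021-03-05,34\n2,2021-05-04,54\n"
    "Credit\nSerial,Date,Balance\n18,2021-03-05,898\n21,2021-04-01,654\n"
    "Savings\nSerial,Date,Balance\n3,2021-03-18,19384\n34,2021-12-04,472\n"
)
result = load_sectioned_csv(sample)
print(result)
```

The Date column is left as text; pd.to_datetime can be applied afterwards if real dates are needed.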