我正在尝试将带有空格分隔值的错误文本文件转换为干净的 csv 文件。请指导我。
以下是我的数据。 数据尚未与 csv 中的输出列正确匹配。
HP TRA ID CL ID IN/EId No Loop Element Name Freq STATUS Error Severity Error ID Message Report Source
13ZI 20712800032 1 Denied Error HP_DOSOlderTh Date of service is older than 12 months HP
13ZI 20712800032 1 1 Rejected Error CA16 Rejected at level. DupKeyID:0 is a Rejected of DupKeyID:0 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716. HP
13ZI 20712800032 2 1 Rejected Error CA16 Rejected at level. DupKeyID:1 is a Rejected of DupKeyID:1 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716. HP
13ZI 20712800032 3 1 Rejected Error CA16 Rejected at level. DupKeyID:2 is a Rejected of DupKeyID:2 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716. HP
1P8TY0J25 20712805263 1 Denied Error HP_DOSOlderTh Date of service is older than 12 months
我尝试了下面的代码,但没有成功。
df = pd.read_csv('file1.txt', sep='\t', index_col=False, dtype='object')
df.to_csv(r'Report.csv', index = None)
也在线下方。数据尚未与 csv 中的列正确匹配
df = pd.read_csv("file1.txt", sep=r"\s{2,}", engine="python")
df.to_csv(r'Report.csv', index = None)
我期待这样的输出
是的,这是一个错误的txt 文件。我不希望任何内置函数能够处理它。您可能能够编写一些自定义代码以将其转换为正确的 CSV 文件。例如,这里有一些适用于示例输入(非常具体)的代码,您可以进行修改以适用于真实文件。但也有可能该文件基本上是不明确的,您必须使用您对数据含义的了解来手动调整任何转换结果。
import sys, re, pprint
table = []
for (i, line) in enumerate(open('input.txt').readlines()):
if line[-1] == '\n': line = line[:-1]
if i == 0:
# header
# In the header, we can assume that no 'cell' would be empty,
# so we can just split on runs of 2 or more spaces.
row = re.split(r' {2,}', line)
table.append(row)
continue
if i == 1:
assert line == ''
# blank line between header and data
continue
# For all other lines, we have to look at runs of 2+ spaces
# and decide what they 'mean'.
def replfunc(mo):
L = len(mo.group(0))
# Some 'Message' values say:
# "Rejected at level. DupKeyID..."
# i.e., there's a run of 2 spaces *within* a cell value.
# Deal with this particular case.
if L == 2:
(start, end) = mo.span()
if (
line[:start].endswith('Rejected at')
and
line[end:].startswith('level.')
):
# Replace it with a single space.
return ' '
# Otherwise, this run of spaces is equivalent to
# one or more field-separators.
# We'll replace it with tabs and then split on tabs.
if L < 2:
assert 0
elif 2 <= L <= 12:
return '\t'
elif L == 17:
return '\t\t'
elif L == 43:
return '\t\t\t'
elif L == 61:
return '\t\t\t\t\t'
elif L == 120:
return '\t'
elif L == 263:
return '\t'
else:
return f'<{L}>'
tabbed_line = re.sub(r'\s{2,}', replfunc, line)
row = tabbed_line.split('\t')
table.append(row)
# ---------------------------
# The rest is just to display the resulting table nicely.
max_n_fields = max(
len(row)
for row in table
)
field_widths = []
for j in range(max_n_fields):
field_width = max(
len(row[j])
for row in table
if j < len(row)
)
field_widths.append(field_width)
for (i, row) in enumerate(table):
for (field, field_width) in zip(row, field_widths):
print(field.ljust(field_width), end='|')
print()
这是输出:
HP TRA ID|CL ID |IN/EId|No|Loop|Element Name|Freq|STATUS |Error Severity|Error ID |Message |Report Source|
13ZI |20712800032| | | | |1 |Denied |Error |HP_DOSOlderTh|Date of service is older than 12 months |HP ||
13ZI |20712800032| |1 | | |1 |Rejected|Error |CA16 |Rejected at level. DupKeyID:0 is a Rejected of DupKeyID:0 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.|HP ||
13ZI |20712800032| |2 | | |1 |Rejected|Error |CA16 |Rejected at level. DupKeyID:1 is a Rejected of DupKeyID:1 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.|HP ||
13ZI |20712800032| |3 | | |1 |Rejected|Error |CA16 |Rejected at level. DupKeyID:2 is a Rejected of DupKeyID:2 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.|HP ||
1P8TY0J25|20712805263| | | | |1 |Denied |Error |HP_DOSOlderTh|Date of service is older than 12 months |