My CSV is broken by extra commas; I only need one column from the dataset, but it comes after the column that contains the extra commas

Question  Votes: -1  Answers: 2

If I could parse the CSV in reverse, I would get the correct values regardless of the errors.

df1 = pd.read_csv('MyData.csv', error_bad_lines=False)

With this I can see that all the columns before the one with the extra commas are read correctly.

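import pandas as pd
import csv

# Python 3: open in text mode; newline='' lets the csv module handle line endings
with open('Myfile', 'r', newline='') as f, \
     open('Newfile', 'w', newline='') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        # split on the first two commas only; the rest of the line stays in one field
        row = line.rstrip('\r\n').split(',', 2)
        writer.writerow(row)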

I want to do this in Python with pandas.

Sample CSV:

id,name,place,address,age,type,dob,date
1,Murtaza,someplace,Street,MA,22,B,somedate,somedate,
2,Murtaza,someplace,somestreet,45,C,somedate,somedate,
3,Murtaza,someplace,somestreet,MA,44,V,somedate,somedate

Excel output:

id  name    place       address    age  type  dob     date     newcolumn9

1  Murtaza someplace  somestreet    MA   22    B      somedate  somedate

2  Murtaza someplace  somestreet    45    C  somedate somedate

3  Murtaza someplace  somestreet    MA   44    V      somedate  somedate

I want the age column. I can't post the original CSV or its output, so please bear with me.

python pandas
2 Answers
1
vote

pandas, or just re.split()

import re

i_column = 2  # index of the desired column, counted from the end of the row
with open('your_csv_file.csv', 'r') as f:
    lines = f.read().splitlines()  # splitlines() leaves no trailing empty entry
your_column = []
for line in lines:
    if line:  # skip blank lines
        # the negative index counts from the end of the row
        your_column.append(re.split(',', line)[-i_column])
print(your_column)

Run on a .csv file like this:

4rth,askj,fpou,ABC,aekert
kjgf,poiuf,pejhh,,oeiu,DEF,akdhg
iuzrit,fslgk,gth,,rhf,,rhe,GHI,ozug
pwiuto,,,,eflgjkhrlguiazg,JKL,rgj

it returns

['ABC', 'DEF', 'GHI', 'JKL']
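Applied to the sample from the question, the same count-from-the-end idea could look like the sketch below. It rests on two assumptions that are not confirmed by the question: the stray commas appear only before the age column, and a comma at the very end of a line is junk rather than an empty last field. The function name column_from_end is made up for illustration.

```python
def column_from_end(lines, offset):
    """Return the field `offset` places from the end of each non-empty line."""
    values = []
    for line in lines:
        # Assumption: trailing commas are junk, not an empty last column.
        line = line.strip().rstrip(',')
        if not line:
            continue
        # rsplit with maxsplit=offset only splits on the last `offset` commas,
        # so stray commas earlier in the row are never looked at.
        parts = line.rsplit(',', offset)
        if len(parts) < offset:
            continue  # row too short to contain the wanted field
        values.append(parts[-offset])
    return values


sample = [
    "id,name,place,address,age,type,dob,date",
    "1,Murtaza,someplace,Street,MA,22,B,somedate,somedate,",
    "2,Murtaza,someplace,somestreet,45,C,somedate,somedate,",
    "3,Murtaza,someplace,somestreet,MA,44,V,somedate,somedate",
]

# age is the 4th column from the end (age, type, dob, date)
ages = column_from_end(sample[1:], 4)
print(ages)
```

If a pandas object is needed afterwards, the list can be wrapped with pandas.Series(ages, name='age').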

0
votes

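I think the best approach is probably to write a separate script that removes the stray commas. But if you just want to skip the bad rows, you can write each line into a StringIO and drop the lines that have the wrong number of fields. So, if you expect 4 columns: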

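from io import StringIO  # Python 3 replacement for cStringIO
import pandas

s = StringIO()
correct_columns = 4
with open('MyData.csv') as file:
    for line in file:
        # keep only the lines that split into the expected number of fields
        if len(line.split(',')) == correct_columns:
            s.write(line)
s.seek(0)  # rewind the buffer so read_csv starts at the beginning
df = pandas.read_csv(s)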
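The "separate script" option mentioned above might be sketched as follows. This is only one possible repair strategy, and it assumes that every surplus comma belongs to a single known column (address, index 3, in the question's sample) and that trailing commas are junk. The names repair_rows and repair_csv are hypothetical.

```python
import csv
import io


def repair_rows(lines, n_columns, messy_index):
    """Yield rows of exactly n_columns fields, folding surplus fields into messy_index."""
    for line in lines:
        # Assumption: trailing commas are junk, not an empty last column.
        fields = line.strip().rstrip(',').split(',')
        surplus = len(fields) - n_columns
        if surplus < 0:
            continue  # blank or too-short line, nothing to repair
        # Re-join the pieces that the stray commas split apart.
        merged = ','.join(fields[messy_index:messy_index + surplus + 1])
        yield fields[:messy_index] + [merged] + fields[messy_index + surplus + 1:]


def repair_csv(lines, n_columns, messy_index):
    """Return the repaired rows as CSV text; csv.writer quotes fields containing commas."""
    out = io.StringIO()
    csv.writer(out, lineterminator='\n').writerows(
        repair_rows(lines, n_columns, messy_index))
    return out.getvalue()


sample = [
    "id,name,place,address,age,type,dob,date",
    "1,Murtaza,someplace,Street,MA,22,B,somedate,somedate,",
    "2,Murtaza,someplace,somestreet,45,C,somedate,somedate,",
    "3,Murtaza,someplace,somestreet,MA,44,V,somedate,somedate",
]

rows = list(repair_rows(sample, n_columns=8, messy_index=3))
fixed_text = repair_csv(sample, n_columns=8, messy_index=3)
print(fixed_text)
```

The repaired text can then be loaded with pandas.read_csv(io.StringIO(fixed_text)), after which the age column is available as an ordinary DataFrame column. Unlike skipping bad lines, this keeps every row, but it gives wrong results if the extra commas can occur in more than one column.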