Unable to read text data into a pandas DataFrame in Python


I am trying to read the text data from the question "Pandas : populate column with if condition not working as expected" into a DataFrame. My code is:

dftxt = """
    0             1               2
1  10/1/2016    'stringvalue'     456
2  NaN          'anothersting'    NaN
3  NaN          'and another '    NaN
4  11/1/2016    'more strings'    943
5  NaN          'stringstring'    NaN
"""

import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(dftxt), sep='\s+')
print(df)

But I get the following error:

Traceback (most recent call last):
  File "mydf.py", line 16, in <module>
    df = pd.read_csv(StringIO(dftxt), sep='\s+')
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 401, in _read
    data = parser.read()
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 939, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1508, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:10415)
  File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10691)
  File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:11437)
  File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:11308)
  File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 4 fields in line 5, saw 6

I can't figure out which 6 fields were read incorrectly: Expected 4 fields in line 5, saw 6. Where is the problem, and how can I fix it?

python pandas dataframe
1 Answer

Line 5 is this line (the numbers underneath mark the six fields the parser sees):

 3  NaN          'and another '    NaN
 1   2             3    4     5     6
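
The six fields can be reproduced with Python's re module. This is only a sketch of the tokenization, not pandas' actual parser code, and the variable names are made up for the illustration:

```python
import re

# The offending row from the question (line 5 of the text, counting the
# blank line that follows the opening triple quote).
line = " 3  NaN          'and another '    NaN"

# A separator of \s+ splits on every run of whitespace, including the
# single spaces inside the quoted string.
fields = re.split(r"\s+", line.strip())

print(len(fields))
print(fields)
```

The quoted string is broken into three pieces, which is where the extra two fields come from.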

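The problem is your separator. It treats every whitespace-separated word as a separate column, including the words inside the quoted strings. In this case you need to

  • change your sep argument to \s{2,} (two or more whitespace characters), and
  • change your engine to 'python' to suppress the warning (the C engine does not support regex separators other than \s+, so pandas falls back to the Python engine anyway)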

df = pd.read_csv(StringIO(dftxt), sep=r'\s{2,}', engine='python')
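
As a rough check of why a two-or-more-whitespace separator works for this data, the same kind of re-based sketch can be run over every line of the sample. It assumes, as in the question's text, that columns are separated by at least two spaces while the strings contain only single spaces; it illustrates the splitting only and is not how pandas is implemented:

```python
import re

dftxt = """
    0             1               2
1  10/1/2016    'stringvalue'     456
2  NaN          'anothersting'    NaN
3  NaN          'and another '    NaN
4  11/1/2016    'more strings'    943
5  NaN          'stringstring'    NaN
"""

counts = []
for line in dftxt.strip().splitlines():
    # Split only on runs of two or more whitespace characters, so the
    # single spaces inside the quoted strings are left alone.
    fields = re.split(r"\s{2,}", line.strip())
    counts.append(len(fields))
    print(len(fields), fields)
```

The header yields three fields and every data row yields four. pandas treats the extra leading field of each data row as the row index.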

Additionally, I used str.strip to get rid of the quotes (they are redundant):

df.iloc[:, 1] = df.iloc[:, 1].str.strip("'")
df

           0             1      2
1  10/1/2016   stringvalue  456.0
2        NaN  anothersting    NaN
3        NaN  and another     NaN
4  11/1/2016  more strings  943.0
5        NaN  stringstring    NaN
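
For convenience, the steps can be combined into one self-contained script. This is a minimal sketch that assumes pandas is installed and imported as pd; it uses a raw string for the separator so that Python does not warn about the \s escape:

```python
from io import StringIO

import pandas as pd

dftxt = """
    0             1               2
1  10/1/2016    'stringvalue'     456
2  NaN          'anothersting'    NaN
3  NaN          'and another '    NaN
4  11/1/2016    'more strings'    943
5  NaN          'stringstring'    NaN
"""

# Split columns on two or more whitespace characters; a regex separator
# like this one requires the Python parsing engine.
df = pd.read_csv(StringIO(dftxt), sep=r"\s{2,}", engine="python")

# Remove the surrounding single quotes from the string column.
df.iloc[:, 1] = df.iloc[:, 1].str.strip("'")

print(df)
```

Note that str.strip("'") removes only the quote characters, so the trailing space in 'and another ' is kept.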

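Finally, from one pandas user to another: there is a handy little function called pd.read_clipboard that I think you should take a look at. It reads data from the clipboard and accepts every argument that read_csv does.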
