How to parse a CSV with incorrectly split lines into a pandas DataFrame


How can I store this data correctly in a pandas DataFrame? Each record in the file is wrapped across two lines.

Link to dataset: https://lib.stat.cmu.edu/datasets/boston

Data snippet:

[ 21 lines of header text]
 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30
  396.90   4.98  24.00
 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80
  396.90   9.14  21.60
 0.02729   0.00   7.070  0  0.4690  7.1850  61.10  4.9671   2  242.0  17.80
  392.83   4.03  34.70
 0.03237   0.00   2.180  0  0.4580  6.9980  45.80  6.0622   3  222.0  18.70
  394.63   2.94  33.40
# Importing
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
# Define the column names
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# Skip the header text and split each line on runs of whitespace
boston = pd.read_csv(data_url, skiprows=21, sep=r'\s+', names=column_names, header=None, engine='python')

boston.head(2)

Result of the current approach:

>>> boston.head()
        CRIM     ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO   B  LSTAT  MEDV
0    0.00632  18.00   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3 NaN    NaN   NaN
1  396.90000   4.98  24.00   NaN    NaN    NaN   NaN     NaN  NaN    NaN      NaN NaN    NaN   NaN
...
1 Answer

Wow, that is one badly formatted CSV.

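I think the best approach is to parse it twice, taking advantage of the fact that skiprows accepts a callable (such as a lambda function) as input. From the pandas documentation: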

skiprows : int, list of int or Callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
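
To make the callable form concrete, here is a minimal sketch on a made-up in-memory file. The sample text, the column names "a" and "b", and the helper name skip_header_and_odd are illustrative assumptions, not part of the dataset. The callable receives the 0-based line index in the file and returns True for lines that should be skipped:

```python
import io

import pandas as pd

# Hypothetical sample: 2 header lines followed by 4 data lines.
sample = "\n".join([
    "header line A",
    "header line B",
    "1 2",
    "3 4",
    "5 6",
    "7 8",
])


def skip_header_and_odd(n):
    # n is the 0-based line index in the file; return True to skip that line.
    return n < 2 or n % 2 == 1


df = pd.read_csv(
    io.StringIO(sample),
    skiprows=skip_header_and_odd,
    sep=r"\s+",
    header=None,
    names=["a", "b"],
)
print(df)
```

Only the lines at indices 2 and 4 survive, so the frame should hold the rows "1 2" and "5 6".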

So, first parse every other line to get most of the columns:

boston1 = pd.read_csv(data_url, skiprows=lambda n: n < 21 or n % 2 == 1, sep=r'\s+', names=column_names[:-3], header=None, engine='python')

Then parse the remaining lines to get the last three columns:

boston2 = pd.read_csv(data_url, skiprows=lambda n: n < 21 or n % 2 == 0, sep=r'\s+', names=column_names[-3:], header=None, engine='python')

Finally, combine the two horizontally:

>>> pd.concat([boston1, boston2], axis=1)
        CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0    0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296.0     15.3  396.90   4.98  24.0
1    0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0     17.8  396.90   9.14  21.6
2    0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242.0     17.8  392.83   4.03  34.7
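
For anyone who wants to try the two-pass idea without downloading the file, here is a self-contained sketch that runs on the first two records from the question's snippet. The two stand-in header lines, the N_HEADER constant, and the read_part helper are assumptions made for this sample only; the real file has a longer header. The sketch also assumes every record spans exactly two lines.

```python
import io

import pandas as pd

column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Two stand-in header lines, then two records wrapped over two lines each.
raw = (
    "stand-in header line 1\n"
    "stand-in header line 2\n"
    " 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30\n"
    "  396.90   4.98  24.00\n"
    " 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80\n"
    "  396.90   9.14  21.60\n"
)
N_HEADER = 2  # assumption: header length of this sample, not of the real file


def read_part(skip, names):
    # Re-read the same text, keeping only the lines that `skip` lets through.
    return pd.read_csv(io.StringIO(raw), skiprows=skip, sep=r"\s+",
                       names=names, header=None)


# Parity is measured from the first data line, so the header length
# does not need to be even.
first = read_part(lambda n: n < N_HEADER or (n - N_HEADER) % 2 == 1,
                  column_names[:-3])
last = read_part(lambda n: n < N_HEADER or (n - N_HEADER) % 2 == 0,
                 column_names[-3:])

boston = pd.concat([first, last], axis=1)
print(boston)
```

Both partial frames get a default 0-based index, so pd.concat with axis=1 lines the halves of each record up by position.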