How do I store this correctly in a pandas DataFrame?
Link to dataset: https://lib.stat.cmu.edu/datasets/boston
Data snippet:
[ 21 lines of header text]
0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30
396.90 4.98 24.00
0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80
396.90 9.14 21.60
0.02729 0.00 7.070 0 0.4690 7.1850 61.10 4.9671 2 242.0 17.80
392.83 4.03 34.70
0.03237 0.00 2.180 0 0.4580 6.9980 45.80 6.0622 3 222.0 18.70
394.63 2.94 33.40
# Importing
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
# Define your column names
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# Skip the header text and split on runs of whitespace
boston = pd.read_csv(data_url, skiprows=21, sep=r'\s+', names=column_names, header=None, engine='python')
boston.head(2)
Result of the current approach:
>>> boston.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.00 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 NaN NaN NaN
1 396.90000 4.98 24.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
...
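Wow, that is one badly formatted CSV: each record is wrapped across two lines.
I think the best approach is to parse it twice, taking advantage of the fact that skiprows can take a callable, such as a lambda, as input.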
skiprows : int, list of int or Callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
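To make the callable form concrete, here is a minimal sketch on a made-up in-memory sample. The text, the two-line header, and the column names a, b, c are all hypothetical and are not taken from the Boston file.

```python
import io
import pandas as pd

# Hypothetical sample: two header lines, then each record wrapped over two lines.
sample = """header line A
header line B
1 2 3
4 5
6 7 8
9 10
"""

# The callable receives each 0-indexed line number and returns True to skip it.
# Here: skip the two header lines and every odd-numbered line, keeping lines 2 and 4.
first_part = pd.read_csv(
    io.StringIO(sample),
    skiprows=lambda n: n < 2 or n % 2 == 1,
    sep=r"\s+",
    names=["a", "b", "c"],
    header=None,
)
print(first_part)
```

Flipping the parity test (n % 2 == 0) would keep the continuation lines instead.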
So, first parse every other line to get most of the columns:
boston1 = pd.read_csv(data_url, skiprows=lambda n: n < 21 or n % 2 == 1, sep=r'\s+', names=column_names[:-3], header=None, engine='python')
Then parse the last three columns separately:
boston2 = pd.read_csv(data_url, skiprows=lambda n: n < 21 or n % 2 == 0, sep=r'\s+', names=column_names[-3:], header=None, engine='python')
Finally, combine the two horizontally:
>>> pd.concat([boston1, boston2], axis=1)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
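If you would rather read the file only once, one alternative (a sketch, not what the code above does) is to load every line into an 11-column frame and then stitch the even and odd rows together. To stay runnable offline, the example below uses the first two records from the question's snippet through io.StringIO. For the real file you would pass data_url and a suitable skiprows instead, and I am assuming the remaining lines strictly alternate between 11 and 3 values.

```python
import io
import pandas as pd

column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# First two records from the question's snippet (stand-in for the real file).
sample = """\
0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30
396.90 4.98 24.00
0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80
396.90 9.14 21.60
"""

# One pass: the short continuation lines are padded with NaN up to 11 columns.
raw = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None)

first = raw.iloc[::2].reset_index(drop=True)        # lines holding 11 values
second = raw.iloc[1::2, :3].reset_index(drop=True)  # continuation lines, 3 values

# ignore_index=True renumbers the columns 0..13 so the names can be assigned cleanly.
boston = pd.concat([first, second], axis=1, ignore_index=True)
boston.columns = column_names
print(boston)
```

One side effect: integer-looking columns such as CHAS and RAD come out as floats, because they briefly shared a column with NaN padding. Cast them back with astype(int) if that matters.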