Using a regular expression for na_values in pandas.read_csv

Question (votes: 0, answers: 2)

I want to use pandas.read_csv to read a file like this:

1891, 91920,  7,       628,249, 59,51.0, 0.026, 0.028,   NaN,   NaN,   NaN,   NaN,   NaN,  0.156, 0.071,    NaN,   NaN,    NaN,   NaN,    NaN,   NaN,    NaN,   NaN,   21,500,   21,43.8, 0.005, 0.619,  NaN,45.6, 0.048, 0.053,   NaN,   NaN,   NaN,   NaN,   NaN, -0.180, 0.088,   20, 0.012, 1.107,  NaN, NaN,   NaN,   NaN,   NaN,   NaN,   NaN,   NaN,   NaN,    NaN,   NaN,  NaN,   NaN,   NaN,  NaN,     NaN,     NaN,     NaN
1891, 91920, 16,       628,135, 22,41.2, 0.093, 0.087,   NaN,   NaN,   NaN,   NaN,   NaN,  0.416, 0.212,    NaN,   NaN,    NaN,   NaN,    NaN,   NaN,    NaN,   NaN,   21,500,   20,23.3, 0.021, 2.023,  NaN, NaN,   NaN,   NaN,   NaN,   NaN,   NaN,   NaN,   NaN,    NaN,   NaN,  NaN,   NaN,   NaN,  NaN, NaN,   NaN,   NaN,   NaN,   NaN,   NaN,   NaN,   NaN,    NaN,   NaN,  NaN,   NaN,   NaN,  NaN,     NaN,     NaN,     NaN
1891, 91920,  3,       628, 28, 39,47.0, 0.041, 0.044,   NaN,   NaN,   NaN,   NaN,   NaN, -0.006, 0.064,    NaN,   NaN,    NaN,   NaN,    NaN,   NaN,    NaN,   NaN,   21,500,   21,37.5, 0.009, 0.964,  NaN,45.3, 0.054, 0.055,   NaN,   NaN,   NaN,   NaN,   NaN, -0.838, 0.228,   20, 0.013, 1.193,  NaN,51.8, 0.025, 0.026,   NaN,   NaN,   NaN,   NaN,   NaN, -0.021, 0.054,   21, 0.005, 0.540,  NaN,     NaN,     NaN,     NaN
1891, 91920,  6,       628,276, 20,40.0, 0.118, 0.101,   NaN,   NaN,   NaN,   NaN,   NaN, -0.767, 0.558,    NaN,   NaN,    NaN,   NaN,    NaN,   NaN,    NaN,   NaN,   21,500,   20,26.7, 0.032, 2.982,  NaN,41.0, 0.088, 0.089,   NaN,   NaN,   NaN,   NaN,   NaN, -0.141, 0.233,   20, 0.024, 2.074,  NaN,46.2, 0.053, 0.049,   NaN,   NaN,   NaN,   NaN,   NaN,  0.080, 0.034,   21, 0.012, 1.187,  NaN,     NaN,     NaN,     NaN

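I run into trouble reading it because of the NaN values. If the file were a plain comma-separated CSV I would have no problem, but the fields are padded with spaces after the commas. When I read it with: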

df = pd.read_csv(file,index_col=None, header=None)

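the columns containing NaN are read as strings, apparently because of the leading spaces. If the padding always had the same width, the problem would be easy. I could use: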

df = pd.read_csv(file,index_col=None, header=None, na_values = "   NaN")

and the problem would be solved. However, the padding differs between columns: some have 4 spaces before NaN, others have 6, and so on.
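
To make the limitation concrete, here is a small sketch. The two-row `sample` string is made up for illustration and is not the asker's file. It suggests that a string passed to na_values is compared literally against each field, so only the field padded with exactly three spaces is treated as missing.

```python
import io

import pandas as pd

# Hypothetical sample: NaN is padded with 3 spaces in the first row
# and with 6 spaces in the second row.
sample = "1,   NaN\n2,      NaN\n"

df = pd.read_csv(io.StringIO(sample), header=None, na_values="   NaN")

# The literal "   NaN" matches only the first row; the second row keeps
# its whitespace-padded string.
na_mask = df[1].isna().tolist()
print(na_mask)
```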

So my question is: is there a way to specify na_values with something like a regular expression?
    

python regex pandas nan
2 Answers

0 votes

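The strings given to na_values are compared literally, so a pattern such as na_values = "\s+ NaN" is not interpreted as a regular expression. Instead, use a regular expression as the separator so that the padding after each comma is consumed along with it:

df = pd.read_csv(file, engine='python', index_col=None, sep=r',\s*', header=None)

The parsing engine is set to 'python' to avoid the warning pandas emits when a regular expression is used as the separator with the default C engine.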
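
A minimal, self-contained sketch of the regex-separator approach follows. The `sample` string is a made-up stand-in for the real file, and the column positions are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical sample with uneven padding after the commas.
sample = (
    "1891,  7,   NaN, 0.026\n"
    "1891, 16,      NaN, -0.180\n"
)

# r",\s*" treats a comma plus any whitespace that follows it as a single
# separator; the python engine is the one that supports regex separators.
df = pd.read_csv(
    io.StringIO(sample), engine="python", sep=r",\s*", header=None
)

# With the padding consumed by the separator, the bare "NaN" fields are
# recognized by pandas' default missing-value markers.
print(df.dtypes)
print(df[2].isna().all())
```

One trade-off worth keeping in mind: the python engine is generally slower than the C engine, which may matter for large files.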

    


0 votes

No regular expression is needed. The pandas documentation describes this option as "Skip spaces after delimiter":
skipinitialspace=True

Quick check:

df = pd.read_csv(file, header=None, skipinitialspace=True)
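
The same check can be run without a file on disk. The sketch below assumes an in-memory sample built from the first eleven fields of two rows in the question; the truncation is only for brevity.

```python
import io

import pandas as pd

# First eleven fields of two rows from the question (truncated for brevity).
sample = (
    "1891, 91920,  7,       628,249, 59,51.0, 0.026, 0.028,   NaN,   NaN\n"
    "1891, 91920, 16,       628,135, 22,41.2, 0.093, 0.087,   NaN,   NaN\n"
)

# skipinitialspace=True drops the spaces that follow each comma, so every
# field reaches the parser without leading padding.
df = pd.read_csv(io.StringIO(sample), header=None, skipinitialspace=True)

# The padded NaN fields are read as missing values and no column falls
# back to the object (string) dtype.
padded_nan_is_missing = bool(df[9].isna().all() and df[10].isna().all())
no_object_columns = bool((df.dtypes != object).all())
print(padded_nan_is_missing, no_object_columns)
```

Because this keeps the default C engine, it avoids the regex-separator fallback entirely.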

	