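I have the code below, which uses pd.read_csv to parse a text file of hostnames and groups them by their four-character prefix. This works well. However, there is now a new requirement: for sj12, the character immediately after the prefix must be a letter. In other words, sj12 should match sj12[a-z], e.g. sj12a001, sj12u003, and so on.
I am looking for a way to do this in pandas: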
#!/grid/common/pkgs/python/v3.6.1/bin/python3
import pandas as pd
import numpy as np
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
# Keep only the wanted prefixes, drop rows where every value is NaN, and show remaining NaN as empty strings
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)
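To try these reshaping steps without the `new_hosts` file, here is a minimal sketch that feeds a few hostnames through the same calls. The `io.StringIO` sample, the shortened `prefixes` list, and the names `hosts` and `wide` are assumptions made for illustration. `fillna('')` is used as a simpler stand-in for the `replace(np.nan, ...)` call.

```python
import io

import pandas as pd

# Hypothetical stand-in for the 'new_hosts' file: one hostname per line.
sample = io.StringIO("sj000001\nsj000002\nsj124000\nsj12at00\ncr000011\n")

prefixes = ['sj00', 'sj12', 'cr00']

hosts = pd.read_csv(sample, header=None)
hosts['prefix'] = hosts[0].str[:4]
# cumcount numbers the rows within each prefix: 0, 1, 2, ...
hosts['grp'] = hosts.groupby('prefix').cumcount()
# One column per prefix, one row per position within that prefix.
wide = hosts.pivot(index='grp', columns='prefix', values=0)
wide = wide[prefixes].dropna(axis=0, how='all').fillna('').rename_axis(None)
print(wide)
```

Because the grouping only looks at the first four characters, `sj124000` and `sj12at00` both land in the `sj12` column here.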
Current output:
sj00      sj12      cr00      cr08      eu00       eu50
sj000001  sj124000  cr000011  crn00001  euk000011  eu5000011
sj000002  sj125000  cr000012  crn00002  eu0000012  eu5000013
sj000003  sj12at00  cr000013  crn00003  eu0000013  eu5000014
sj000004  sj12bt00  cr000014  crn00004  eu0000014  eu5000015
Expected output:
sj00      sj12      cr00      cr08      eu00       eu50
sj000001  sj12at00  cr000011  crn00001  euk000011  eu5000011
sj000002  sj12bt00  cr000012  crn00002  eu0000012  eu5000013
sj000003            cr000013  crn00003  eu0000013  eu5000014
sj000004            cr000014  crn00004  eu0000014  eu5000015
In the expected output above, you can see that sj124000 and sj125000 have been removed.
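I solved this with the str.extract method. Here [a-z] requires a letter right after the four-character prefix, since \w alone would also accept digits. Values that do not match become NaN.
df['sj12'] = df['sj12'].str.extract(r'(\w\w\d\d[a-z]\w*)', expand=True)
or
df['sj12'] = df['sj12'].str.extract(r'(\w{2}\d{2}[a-z]\w*)', expand=True)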
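Applying `str.extract` to the already pivoted column leaves NaN in the rows that held the rejected hostnames, so the remaining values do not move up. One possible alternative, sketched below under assumptions, is to filter the hostnames before numbering and pivoting them. The in-memory sample, the shortened `prefixes` list, the names `hosts`, `kept`, and `wide`, and the lowercase-only pattern `sj12[a-z]` are illustrative choices, not part of the original code.

```python
import pandas as pd

# Hypothetical sample standing in for the contents of 'new_hosts'.
hosts = pd.DataFrame({0: ['sj000001', 'sj000002', 'sj124000', 'sj125000',
                          'sj12at00', 'sj12bt00']})
prefixes = ['sj00', 'sj12']

hosts['prefix'] = hosts[0].str[:4]

# Keep a row unless it has the sj12 prefix without a letter right after it.
# str.match anchors the pattern at the start of each string.
is_sj12 = hosts['prefix'] == 'sj12'
has_letter = hosts[0].str.match(r'sj12[a-z]')
kept = hosts[~is_sj12 | has_letter].copy()

# Numbering the rows after filtering keeps each column free of gaps.
kept['grp'] = kept.groupby('prefix').cumcount()
wide = kept.pivot(index='grp', columns='prefix', values=0)
wide = wide[prefixes].fillna('').rename_axis(None)
print(wide)
```

With this ordering, `sj12at00` and `sj12bt00` occupy the first two rows of the `sj12` column, which is the layout shown in the expected output.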