我有一列数字,由以下字符之一分隔:
-,/,*,~,_
。我需要检查这些值是否包含任何字符,然后将该值拆分到另一列中。
是否有与下面所示不同的解决方案?
最后,
columns subnumber1, subnumber2 ...subnumber5
将合并为一列,而number5
列将没有字符。我需要在进一步的过程中使用这两列。
if gdf['column_name'].str.contains('~').any():
gdf[['number1', 'subnumber1']] = gdf['column_name'].str.split('~', expand=True)
gdf
if gdf['column_name'].str.contains('^').any():
gdf[['number2', 'subnumber2']] = gdf['column_name'].str.split('^', expand=True)
gdf
Input column:
column_name
152/6*3
163/1-6
145/1
163/6^3
output:
number5 |subnumber1 |subnumber2
152 | 6 | 3
163 | 1 | 6
145 | 1 |
163 | 6 | 3
Series.str.split
以及可能的分隔符列表并创建新的 DataFrame:
import re
L = ['-','/','*','~','_','^', '.']
#some values like `^.` are escape
pat = '|'.join(re.escape(x) for x in L)
df = df['column_name'].str.split(pat, expand=True).add_prefix('num')
print (df)
num0 num1 num2
0 152 6 3
1 163 1 6
2 145 1 None
3 163 6 3
编辑:如果需要在使用值之前匹配值:
L = ["\-_",'\^|\*','~','/']
for val in L:
df[f'before {val}'] = df['column_name'].str.extract(rf'(\d+){[val]}')
#for last value not exist separator, so match $ for end of string
df['last'] = df['column_name'].str.extract(rf'(\d+)$')
print (df)
column_name before \-_ before \^|\* before ~ before / last
0 152/2~3_4*5 3 4 2 152 5
1 152/2~3-4^5 4 4 2 152 5
2 152/6*3 NaN 6 NaN 152 3
3 163/1-6 NaN NaN NaN 163 6
4 145/1 NaN NaN NaN 145 1
5 163/6^3 6 6 NaN 163 3
str.split
:
df['column_name'].str.split(r'[*,-/^_]', expand=True)
输出:
0 1 2
0 152 6 3
1 163 1 6
2 145 1 None
3 163 6 3
str.extract
并命名捕获组:
regex = '(?P<number5>\d+)\D*(?P<subnumber1>\d*)\D*(?P<subnumber2>\d*)'
df['column_name'].str.extract(regex)
输出:
number5 subnumber1 subnumber2
0 152 6 3
1 163 1 6
2 145 1
3 163 6 3