I have the following two columns in my DataFrame:
COL1 COL2
12 :402:agshhhjd:45:hghghgruru:12:fghg,hgh:22:hhhh
57 :42:ags,hhhjd:57:hghg,hgruru:120:fghgh,gh:12:hhhhhh
I need to create another column, COL3, as shown below:
COL1 COL2 COL3
12 :402:agshhhjd:45:hghghgruru,:12:fghg,hgh:22:hhhh fghg,hg
57 :42:agshhhjd:57:hghg,hgruru:120:fghghgh:12:hhhhhh hghg,hg
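The new column COL3 should be created as follows: search COL2 for the value of COL1 from the same row, then print the 7 characters that come after it, not counting the ":". I tried to do this with slicing, but it does not work properly. Can anyone help?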
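You can use the replace method, but first you have to convert COL1 to a string data type. We need to replace the entire contents of COL2, keeping only the word that follows the COL1 number, i.e.: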
.*12:(\w{7}).*
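So we capture only the seven letters and refer to them through a backreference, i.e. value = \1. The same is done for the second row. This is easy to do because replace is vectorized, although it will be slow.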
df['COL3'] = df.COL2.replace(regex=r'.*'+ df.COL1.astype('str') +':(\\w{7}).*',value="\\1")
df
COL1 COL2 COL3
0 12 :402:agshhhjd,:45:hghghgruru,:12:fghghgh,:22:hhhh fghghgh
1 57 :42:agshhhjd,:57:hghghgruru,:120:fghghgh,:12:h... hghghgr
You can also do it like this:
import re
[re.sub(".*"+str(i)+":(\\w{7}).*","\\1",j) for i,j in zip(df.COL1,df.COL2)]
With your update, you can do:
df.assign(COL3 = df.COL2.replace(regex=r'.*'+ df.COL1.astype('str')+':(.{7}).*',value="\\1"))
Out[102]:
COL1 COL2 COL3
0 12 :402:agshhhjd,:45:hghghgruru,:12:fghg,hgh,:22:... fghg,hg
1 57 :42:ags,hhhjd,:57:hghg,hgruru,:120:fghgh,gh,:1... hghg,hg
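One thing to watch with the patterns above: re.sub and replace leave the string unchanged when nothing matches, and a pattern like .*12: would also match a key such as 412:. Here is a minimal row-wise sketch that requires a colon on both sides of the key and returns None when nothing is found. The helper name chars_after_key and the parameter n are made up for illustration, and it assumes every key in COL2 is written as :key:.

```python
import re


def chars_after_key(key, text, n=7):
    """Return the n characters following ':<key>:' in text, or None."""
    # re.escape guards against keys that contain regex metacharacters.
    pattern = ":" + re.escape(str(key)) + ":(.{" + str(n) + "})"
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Sample rows taken from the question.
col1 = [12, 57]
col2 = [
    ":402:agshhhjd:45:hghghgruru:12:fghg,hgh:22:hhhh",
    ":42:ags,hhhjd:57:hghg,hgruru:120:fghgh,gh:12:hhhhhh",
]
col3 = [chars_after_key(k, t) for k, t in zip(col1, col2)]
```

With a DataFrame, the same comprehension over zip(df.COL1, df.COL2) can be assigned to df['COL3'].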
Using a list comprehension and re.findall:
import re
df['COL3'] = [
    re.findall(r'{}:([a-z]{{7}})'.format(i), j) for i, j in zip(df.COL1, df.COL2)
]
COL1 COL2 COL3
0 12 :402:agshhhjd,:45:hghghgruru,:12:fghghgh,:22:hhhh [fghghgh]
1 57 :42:agshhhjd,:57:hghghgruru,:120:fghghgh,:12:h... [hghghgr]
You can also use a list comprehension and split, but this will throw an error if the COL1 value is not found in COL2:
[j.split('{}:'.format(i))[1][:7] for i, j in zip(df.COL1, df.COL2)]
# ['fghghgh', 'hghghgr']
If you can guarantee that the value will be found in COL2, then using split is faster:
df = pd.concat([df]*10000)
%timeit [re.findall('{}\:([a-z]{{7}})'.format(i), j) for i, j in zip(df.COL1, df.COL2)]
28.3 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [j.split('{}:'.format(i))[1][:7] for i, j in zip(df.COL1, df.COL2)]
12 ms ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
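If the value is not guaranteed to be there, the split idea can be made safe with str.partition, which returns an empty separator instead of raising when the marker is missing. This is only a sketch: the name chars_after_key_split is hypothetical, and the leading colon in the marker is an added assumption (so that 12 does not match inside 412:), not part of the split version above.

```python
def chars_after_key_split(key, text, n=7):
    """Return the n characters after ':<key>:' or None if the key is absent."""
    marker = ":{}:".format(key)
    # partition returns (head, separator, tail); separator is '' when not found.
    _, found, rest = text.partition(marker)
    return rest[:n] if found else None


row = ":402:agshhhjd:45:hghghgruru:12:fghg,hgh:22:hhhh"
present = chars_after_key_split(12, row)
missing = chars_after_key_split(99, row)
```

Unlike indexing into the result of split, a missing key gives None here rather than an IndexError.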
Try this:
test = pd.DataFrame({'Col1': [12, 57], 'Col2': [':402:agshhhjd,:45:hghghgruru,:12:fghghgh,:22:hhhh', ':42:agshhhjd,:57:hghghgruru,:120:fghghgh,:12:hhhhhh']})
test
Col1 Col2
0 12 :402:agshhhjd,:45:hghghgruru,:12:fghghgh,:22:hhhh
1 57 :42:agshhhjd,:57:hghghgruru,:120:fghghgh,:12:h...
def my_val(col1num, col2text):
    # Split columns by ':'
    col2_ls = col2text.split(':')[1:]
    # Create an empty dict to store key-value pairs
    my_dict = {}
    # Create your key-value pairs and update dict
    for i, j in zip(range(0, len(col2_ls), 2), range(1, len(col2_ls)+1, 2)):
        my_dict[col2_ls[i]] = col2_ls[j]
    # If the key exists return the value
    if str(col1num) in my_dict.keys():
        val = my_dict[str(col1num)]
        return val
    else:
        return 'Unavailable'
test['Col3'] = test.apply(lambda x: my_val(col1num=x['Col1'], col2text=x['Col2']), axis=1)
test
Col1 Col2 Col3
0 12 :402:agshhhjd,:45:hghghgruru,:12:fghghgh,:22:hhhh fghghgh,
1 57 :42:agshhhjd,:57:hghghgruru,:120:fghghgh,:12:h... hghghgruru,
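Note that the output above contains the whole field, trailing comma included, rather than exactly 7 characters. A more compact sketch of the same key-value idea, with optional trimming, might look like this. The name value_for_key and the parameters n and default are invented for illustration, and it assumes COL2 always alternates key and value between colons.

```python
def value_for_key(key, text, n=None, default="Unavailable"):
    """Look up the field stored under key; optionally keep the first n chars."""
    parts = text.split(":")[1:]
    # Even positions are keys, odd positions are their values.
    pairs = dict(zip(parts[0::2], parts[1::2]))
    value = pairs.get(str(key))
    if value is None:
        return default
    return value if n is None else value[:n]


row0 = ":402:agshhhjd,:45:hghghgruru,:12:fghghgh,:22:hhhh"
row1 = ":42:agshhhjd,:57:hghghgruru,:120:fghghgh,:12:hhhhhh"
full = value_for_key(12, row0)
trimmed = value_for_key(57, row1, n=7)
```

It can be applied row by row in the same way as my_val, for example through test.apply with axis=1.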
Hope this helps.