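I need to extract rows when certain conditions apply. The col1 column should contain all the words in list_words, the last word should be Story, and the last word of the next row should be ac.
This is my current code: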
import pandas as pd
df = pd.DataFrame({'col1': ['Draft SW Quality Assurance Story', 'alex ac', 'anny ac', 'antoine ac','aze epic', 'bella ac', 'Complete SW Quality Assurance Plan Story', 'celine ac','wqas epic', 'karmen ac', 'kameilia ac', 'Update SW Quality Assurance Plan Story', 'joseph ac','Update SW Quality Assurance Plan ac', 'joseph ac'],
'col2': ['aa', 'bb', 'cc', 'dd','ee', 'ff', 'gg', 'hh', 'ii', 'jj', 'kk', 'll', 'mm', 'nn', 'oo']})
print(df)
list_words="SW Quality Plan Story"
set_words = set(list_words.split())
#check if list_words is in the cell
df['TrueFalse']=pd.concat([df.col1.str.contains(word,regex=False) for word in list_words.split()],axis=1).sum(1) > 1
print('\n',df)
#extract last word
df["Suffix"] = df["col1"].str.split().str[-1]
print('\n',df)
df['ok']=''
for i in range (len(df)-1):
    if ((df["Suffix"].iloc[i]=='Story') & (df["TrueFalse"].iloc[i]=='True') & (df["Suffix"].iloc[i+1]=='ac')):
        df['ok'].iloc[i+1]=df["Suffix"].iloc[i+1]
print('\n',df)
Output:
                                        col1  col2  TrueFalse  Suffix ok
 0          Draft SW Quality Assurance Story    aa       True   Story
 1                                   alex ac    bb      False      ac
 2                                   anny ac    cc      False      ac
 3                                antoine ac    dd      False      ac
 4                                  aze epic    ee      False    epic
 5                                  bella ac    ff      False      ac
 6  Complete SW Quality Assurance Plan Story    gg       True   Story
 7                                 celine ac    hh      False      ac
 8                                 wqas epic    ii      False    epic
 9                                 karmen ac    jj      False      ac
10                               kameilia ac    kk      False      ac
11    Update SW Quality Assurance Plan Story    ll       True   Story
12                                 joseph ac    mm      False      ac
13       Update SW Quality Assurance Plan ac    nn       True      ac
14                                 joseph ac    oo      False      ac
Row 13 should be set to False.
Desired output:
                                        col1  col2  TrueFalse  Suffix
 0          Draft SW Quality Assurance Story    aa       True   Story
 1                                   alex ac    bb      False      ac
 2                                   anny ac    cc      False      ac
 3                                antoine ac    dd      False      ac
 6  Complete SW Quality Assurance Plan Story    gg       True   Story
 7                                 celine ac    hh      False      ac
11    Update SW Quality Assurance Plan Story    ll       True   Story
12                                 joseph ac    mm      False      ac
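Here is one way to do it. Join the search words with a pipe delimiter to build a regular expression; note that this alternation matches a cell containing any one of the words, not necessarily all of them. Then check that the same row ends with Story and that the next row (df['col1'].shift(-1)) ends with ac.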
import pandas as pd
df = pd.DataFrame({'col1': ['Draft SW Quality Assurance Story', 'alex ac', 'anny ac', 'antoine ac','aze epic', 'bella ac', 'Complete SW Quality Assurance Plan Story', 'celine ac','wqas epic', 'karmen ac', 'kameilia ac', 'Update SW Quality Assurance Plan Story', 'joseph ac','Update SW Quality Assurance Plan ac', 'joseph ac'],
'col2': ['aa', 'bb', 'cc', 'dd','ee', 'ff', 'gg', 'hh', 'ii', 'jj', 'kk', 'll', 'mm', 'nn', 'oo']})
print(df)
list_words="SW Quality Plan Story"
set_words = set(list_words.split())
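# True where the cell contains any word from set_words, ends with 'Story',
# and the next row ends with 'ac' (na=False covers the last row, which has no next row)
df['TrueFalse'] = (df['col1'].str.contains('|'.join(set_words))
                   & df['col1'].str.endswith('Story')
                   & df['col1'].shift(-1).str.endswith('ac', na=False))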
print(df)
                                        col1  col2  TrueFalse
 0          Draft SW Quality Assurance Story    aa       True
 1                                   alex ac    bb      False
 2                                   anny ac    cc      False
 3                                antoine ac    dd      False
 4                                  aze epic    ee      False
 5                                  bella ac    ff      False
 6  Complete SW Quality Assurance Plan Story    gg       True
 7                                 celine ac    hh      False
 8                                 wqas epic    ii      False
 9                                 karmen ac    jj      False
10                               kameilia ac    kk      False
11    Update SW Quality Assurance Plan Story    ll       True
12                                 joseph ac    mm      False
13       Update SW Quality Assurance Plan ac    nn      False
14                                 joseph ac    oo      False
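Row 0 comes out True above even though it does not contain "Plan", because a pipe-joined pattern matches any one of the words. As a rough sketch of the difference, the snippet below compares that any-word check with an all-word check built from one literal substring test per word. The three sample strings are taken from the data above; the names contains_any and contains_all are only illustrative.

```python
import pandas as pd

col1 = pd.Series([
    "Draft SW Quality Assurance Story",
    "Update SW Quality Assurance Plan Story",
    "Update SW Quality Assurance Plan ac",
])
words = "SW Quality Plan Story".split()

# Any-word match: regex alternation built by joining the words with a pipe.
contains_any = col1.str.contains("|".join(words))

# All-word match: one literal substring check per word, then require all of them.
contains_all = pd.concat(
    [col1.str.contains(w, regex=False) for w in words], axis=1
).all(axis=1)

print(contains_any.tolist())  # [True, True, True]
print(contains_all.tolist())  # [False, True, False]
```

Both checks are substring tests, so a word such as "Plan" would also match inside "Planning"; comparing sets of whole words avoids that if it matters.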
Here are your conditions, each as a separate column. Note how condition_1 works now:
# Condition 1: set_words minus the words in col1 is empty, i.e. col1 contains every word
df["condition_1"] = df.col1.apply(lambda x: not bool(set_words - set(x.split())))
# Condition 2: the last word should be 'Story'
df["condition_2"] = df.col1.str.endswith("Story")
# Condition 3: the last word in the next row should be ac. See `shift(-1)`
df["condition_3"] = df.col1.str.endswith("ac").shift(-1)
print(df)
Output:
                                        col1  col2  condition_1  condition_2  condition_3
 0          Draft SW Quality Assurance Story    aa        False         True         True
 1                                   alex ac    bb        False        False         True
 2                                   anny ac    cc        False        False         True
 3                                antoine ac    dd        False        False        False
 4                                  aze epic    ee        False        False         True
 5                                  bella ac    ff        False        False        False
 6  Complete SW Quality Assurance Plan Story    gg         True         True         True
 7                                 celine ac    hh        False        False        False
 8                                 wqas epic    ii        False        False         True
 9                                 karmen ac    jj        False        False         True
10                               kameilia ac    kk        False        False        False
11    Update SW Quality Assurance Plan Story    ll         True         True         True
12                                 joseph ac    mm        False        False         True
13       Update SW Quality Assurance Plan ac    nn        False        False         True
14                                 joseph ac    oo        False        False          NaN
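The NaN in the last row of condition_3 appears because shift(-1) has no following row to pull a value from. If a purely boolean column is preferred, one option (a sketch, not part of the original answer) is to pass fill_value=False to shift. The three sample strings and the names shifted and shifted_filled are only for illustration.

```python
import pandas as pd

col1 = pd.Series([
    "Update SW Quality Assurance Plan Story",
    "joseph ac",
    "wqas epic",
])

# Plain shift(-1): the last row has no successor, so it becomes a missing value.
shifted = col1.str.endswith("ac").shift(-1)

# With fill_value=False the last row is treated as "next row does not end with ac".
shifted_filled = col1.str.endswith("ac").shift(-1, fill_value=False)

print(shifted.isna().tolist())   # [False, False, True]
print(shifted_filled.tolist())   # [True, False, False]
```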
Here is how to find all rows that satisfy all three conditions:
>>> print(df[df.condition_1 & df.condition_2 & df.condition_3])
                                        col1  col2  condition_1  condition_2  condition_3
 6  Complete SW Quality Assurance Plan Story    gg         True         True         True
11    Update SW Quality Assurance Plan Story    ll         True         True         True
Or you can store the result as a separate column, conditions:
df["conditions"] = df.condition_1 & df.condition_2 & df.condition_3
>>> print(df)
                                        col1  col2  condition_1  condition_2  condition_3  conditions
 0          Draft SW Quality Assurance Story    aa        False         True         True       False
 1                                   alex ac    bb        False        False         True       False
 2                                   anny ac    cc        False        False         True       False
 3                                antoine ac    dd        False        False        False       False
 4                                  aze epic    ee        False        False         True       False
 5                                  bella ac    ff        False        False        False       False
 6  Complete SW Quality Assurance Plan Story    gg         True         True         True        True
 7                                 celine ac    hh        False        False        False       False
 8                                 wqas epic    ii        False        False         True       False
 9                                 karmen ac    jj        False        False         True       False
10                               kameilia ac    kk        False        False        False       False
11    Update SW Quality Assurance Plan Story    ll         True         True         True        True
12                                 joseph ac    mm        False        False         True       False
13       Update SW Quality Assurance Plan ac    nn        False        False         True       False
14                                 joseph ac    oo        False        False          NaN       False
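If the aim is to pull out each matching Story row together with the ac row right after it, the combined mask can be shifted forward by one row and OR-ed with itself. This is only a sketch under that assumption: the desired output in the question also keeps some further ac rows, which this does not attempt to reproduce. The small frame and the names match and keep are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": [
    "Complete SW Quality Assurance Plan Story",
    "celine ac",
    "wqas epic",
    "Update SW Quality Assurance Plan ac",
    "joseph ac",
]})
set_words = set("SW Quality Plan Story".split())

# The three conditions from above, kept boolean with fill_value=False.
has_all = df["col1"].apply(lambda x: set_words <= set(x.split()))
ends_story = df["col1"].str.endswith("Story")
next_ac = df["col1"].str.endswith("ac").shift(-1, fill_value=False)
match = has_all & ends_story & next_ac

# Keep each matching row plus the row immediately after it.
keep = match | match.shift(1, fill_value=False)
result = df[keep]

print(result.index.tolist())  # [0, 1]
```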