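I have a DataFrame with two columns, Stg and Txt. The task is to check every word in the Stg column against each Txt row and write the matched words to a new column, keeping each word's case as it appears in Txt.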
Example Code:
import re
from pandas import DataFrame

new = {'Stg': ['way', 'Early', 'phone', 'allowed', 'type', 'brand name'],
       'Txt': ['An early term', 'two-way allowed', 'New Phone feature that allowed',
               'amazing universe', 'new day', 'the brand name is stage']}
df = DataFrame(new, columns=['Stg', 'Txt'])
my_list = df["Stg"].tolist()

def words_in_string(word_list, a_string):
    word_set = set(word_list)
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    for found_word in re.finditer(pattern, a_string):
        word = found_word.group(0)
        if word in word_set:
            word_set.discard(word)
            yield word
            if not word_set:
                # every word was found; end the generator with return
                # (raise StopIteration becomes a RuntimeError since Python 3.7)
                return

df['new'] = ''
for i, values in enumerate(df['Txt']):
    a = []
    b = []
    for word in words_in_string(my_list, values):
        a = word
        b.append(a)
    # .at avoids the chained assignment df['new'][i] = b
    df.at[i, 'new'] = b
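The code above returns the case from the Stg column. Is there a way to get the case from Txt instead? Also, I want to match whole strings rather than substrings: for the text "two-way", the current code returns the word way.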
Current Output:
          Stg                             Txt             new
0         way                   An early term              []
1       Early                 two-way allowed  [way, allowed]
2       phone  New Phone feature that allowed       [allowed]
3     allowed                amazing universe              []
4        type                         new day              []
5  brand name         the brand name is stage    [brand name]
Expected Output:
          Stg                             Txt               new
0         way                   An early term           [early]
1       Early                 two-way allowed         [allowed]
2       phone  New Phone feature that allowed  [Phone, allowed]
3     allowed                amazing universe                []
4        type                         new day                []
5  brand name         the brand name is stage      [brand name]
You should use Series.str.findall with a negative lookbehind. I also think you are copying variables more than necessary. You can simply do the following:
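import re
import pandas as pd

new = {'Stg': ['way', 'Early', 'phone', 'allowed', 'type', 'brand name'],
       'Txt': ['An early term', 'two-way allowed', 'New Phone feature that allowed',
               'amazing universe', 'new day', 'the brand name is stage']}
df = pd.DataFrame(new, columns=['Stg', 'Txt'])

# One alternative per Stg entry. The negative lookbehind rejects a match that is
# preceded by a letter or by one of - ; : , / | so "way" is not taken from "two-way".
pattern = "|".join(rf"\w*(?<![A-Za-z\-;:,/|]){i}\b" for i in new["Stg"])
# IGNORECASE matches any case; findall returns the text as it is written in Txt.
df["new"] = df["Txt"].str.findall(pattern, flags=re.IGNORECASE)
print(df)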
Output:
          Stg                             Txt               new
0         way                   An early term           [early]
1       Early                 two-way allowed         [allowed]
2       phone  New Phone feature that allowed  [Phone, allowed]
3     allowed                amazing universe                []
4        type                         new day                []
5  brand name         the brand name is stage      [brand name]
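To see what the negative lookbehind does on its own, here is a minimal sketch that uses only the standard re module on plain strings. The helper name find_terms is made up for illustration, the lookbehind is reduced to letters and the hyphen, and re.escape is an addition that is not in the code above; I am assuming it is wanted in case a Stg entry ever contains regex metacharacters.

```python
import re

def find_terms(terms, text):
    """Return the terms found in text, cased as they appear in text."""
    # (?<![A-Za-z-]) rejects a match preceded by a letter or a hyphen,
    # so "way" is not picked out of "two-way"; \b closes the word on the right.
    # re.escape is an assumption: it guards terms containing regex metacharacters.
    pattern = "|".join(rf"(?<![A-Za-z-]){re.escape(t)}\b" for t in terms)
    return re.findall(pattern, text, flags=re.IGNORECASE)

terms = ["way", "Early", "phone", "allowed", "type", "brand name"]
print(find_terms(terms, "two-way allowed"))                 # ['allowed']
print(find_terms(terms, "New Phone feature that allowed"))  # ['Phone', 'allowed']
```

Because the pattern has no capture groups, re.findall returns the full matched text, which is why the case comes from the searched string rather than from the term list.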
The following will give you the matches in the case used in Txt:
import re
from pandas import DataFrame

new = {'Stg': ['way', 'Early', 'phone', 'allowed', 'type', 'brand name'],
       'Txt': ['An early term', 'two-way allowed', 'New Phone feature that allowed',
               'amazing universe', 'new day', 'the brand name is stage']}
df = DataFrame(new, columns=['Stg', 'Txt'])
my_list = df["Stg"].tolist()

df['new'] = ''
# Builds \bway\b|\bEarly\b|... with every term wrapped in word boundaries
mystring = r"\b|\b".join(my_list)
pattern = r'\b{0}\b'.format(mystring)
print(pattern)
match_pattern = re.compile(pattern, re.IGNORECASE)
# Note: "-" is a non-word character, so \bway\b still matches inside "two-way"
for i, values in enumerate(df['Txt']):
    matches = match_pattern.findall(values)
    # .at avoids the chained assignment df['new'][i] = matches
    df.at[i, 'new'] = matches
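One thing worth checking with the \b approach: a hyphen is a non-word character, so \b matches right before way in two-way. Here is a small sketch of the difference, on a plain string with a reduced term list. The (?<!-) and (?!-) guards are my assumption about one way to exclude hyphenated words, not part of the code above.

```python
import re

terms = ["way", "allowed"]
text = "two-way allowed"

# Plain word boundaries: "-" is a non-word character, so \b matches before "way".
plain = re.compile("|".join(rf"\b{re.escape(t)}\b" for t in terms), re.IGNORECASE)

# Same pattern with an extra guard: no hyphen directly before or after the term.
guarded = re.compile(
    "|".join(rf"(?<!-)\b{re.escape(t)}\b(?!-)" for t in terms), re.IGNORECASE
)

plain_hits = plain.findall(text)      # ['way', 'allowed']
guarded_hits = guarded.findall(text)  # ['allowed']
print(plain_hits, guarded_hits)
```

If hyphenated words should count as matches in your data, the plain version is the one to keep.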