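I need to match non-adjacent keywords in a large number of texts (several thousand). If a text matches, it gets a label; otherwise it gets the label "unknown".
As an example, I want to find the keywords sales rep and dealt in the text snippet below and assign it to the category keyword pattern A:
Text: "The sales rep dealt with everything. Knowing that he put together the best options for me was very helpful."
- The keyword pattern is therefore sales rep and dealt
- Since the sales rep might also be called sales representative or customer rep, I need to match multiple keywords. The same holds for the word dealt. So you can see where things get complicated.
There are many solutions for finding and matching unigrams or adjacent words (n-grams), and I have already implemented that myself. Now I need to find different keywords that are not adjacent and assign a label. I also do not know what is written between the keywords. It could be anything.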
import pandas as pd

# create mock dictionary
Dict = pd.DataFrame({'word1': ['dealt', 'dealt', 'dealt', ''],
                     'word2': ['sales representative', 'sales rep', 'customer rep', 'options']
                     })

# create sample text
texts = ["The sales representative dealt with everything.",
         "The sales rep dealt with everything.",
         "The agent answered all questions",
         "The customer rep answered all questions.",
         "The agent dealt with everything."]

motive = []

# only checks for the keyword in the first column
for item in texts:
    item = str(item)
    if any(x in item for x in Dict['word1']):
        motive.append('keyword pattern A')
    else:
        motive.append('unknown')
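The label should only be assigned if both dealt and sales rep occur in the text. The assignment for sentences 3 and 5 is therefore wrong, so I updated the code. It runs, but no label is assigned.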
for item in texts:
    # convert into string
    item = str(item)
    # check if keyword can be found in first column
    tempM1 = {x for x in Dict['word1'] if x in item}
    # check if keyword was found
    if tempM1 != None:
        # if yes, locate all of their positions in the dictionary
        for i in tempM1:
            i = -1
            # get row index
            ind = Dict.index[Dict['word1'] == list(tempM1)[i+1]]
            # gives pandas.core.indexes.base.Index
            # check if column next to given row index is not empty
            if pd.isnull(Dict['word2'].iloc[ind]) is False:
                # match keyword in second column
                tempM2 = {x for x in Dict['word2'] if x in item}
                # if second keyword was found
                if tempM2 != None:
                    motive.append('keyword pattern A')
                else:
                    # check again first keyword column
                    tempM3 = {x for x in Dict['word1'] if x in item}
                    if tempM3 != None:
                        motive.append('keyword pattern A')
    else:
        motive.append('unknown')
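How can I adjust the code above?
I am aware of regular expressions (RegEx). In my view, given the number of keywords (roughly 700 to 1000) and the combinations among them, a regex approach would need more lines of code and be less efficient. Happy to be proven wrong, though!
I also know this can be treated as a classification problem. The project requires explainability and transparency, so deep learning and the like are not an option. For the same reason, I am not considering embeddings.
Thanks!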
Can you leverage all() and any() to check whether a text contains "any" match from "all" of the lists of matches?
phrases_to_find = [
    [
        ["dealt"],
        ["sales representative", "sales rep", "customer rep"]
    ],
    [
        ["option"]
    ]
]

texts = [
    "The sales representative dealt with everything.",
    "The sales rep dealt with everything.",
    "The agent answered all questions",
    "The customer rep answered all questions.",
    "The agent dealt with everything.",
    "Here is some option."
]

motive = []
for text in texts:
    for index, test_phrases in enumerate(phrases_to_find):
        # every synonym list must contribute at least one hit
        if all(any(p in text for p in phrase) for phrase in test_phrases):
            motive.append(f'keyword pattern {index}')
            break
    else:
        # the inner loop finished without a break: no pattern matched
        motive.append('unknown')

print(motive)
That should give you:
[
    'keyword pattern 0',
    'keyword pattern 0',
    'unknown',
    'unknown',
    'unknown',
    'keyword pattern 1'
]
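
One caveat with plain `in` checks is that they match substrings, so "option" would also hit "optional", and they are case-sensitive. If that matters for your data, a possible variation is to compile one regular expression per synonym list and keep the same all()/any() logic. This is only a sketch under my own assumptions: the names `PATTERNS`, `compile_group` and `label_text` are made up for illustration, and the "keyword pattern B" entry is a placeholder rather than something from your dictionary.

```python
import re

# Hypothetical pattern table: label -> list of synonym groups.
# A text gets a label when every group has at least one hit.
PATTERNS = {
    "keyword pattern A": [
        ["dealt"],
        ["sales representative", "sales rep", "customer rep"],
    ],
    "keyword pattern B": [
        ["option", "options"],
    ],
}


def compile_group(synonyms):
    """Build one case-insensitive, whole-word regex for a synonym group."""
    # Longest phrases first, so they are tried before their shorter prefixes.
    ordered = sorted(synonyms, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


# Compile once, reuse for every text.
COMPILED = {
    label: [compile_group(group) for group in groups]
    for label, groups in PATTERNS.items()
}


def label_text(text, compiled=COMPILED, default="unknown"):
    """Return the first label whose groups all match, else the default."""
    for label, regexes in compiled.items():
        if all(rx.search(text) for rx in regexes):
            return label
    return default


texts = [
    "The sales representative dealt with everything.",
    "THE SALES REP DEALT WITH EVERYTHING.",
    "The agent dealt with everything.",
    "Here is some option.",
    "This step is optional.",
]
motive = [label_text(t) for t in texts]
print(motive)
```

Because each synonym list becomes a single compiled alternation, the number of regexes grows with the number of lists rather than with the number of keyword combinations, and the label a text receives can still be traced back to the exact lists that matched.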