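I want to split a list of sentences into two lists, depending on whether each sentence matches the sentiment of a specific keyword. For example: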
valid_keyword = "Guest Accepted"
sentences = [
"Guest Allowed", "Max Guest Allowed 1", "Guest Not Allowed",
"No Guest Allowed", "Guest Restricted", "Max Guest Allowed 0",
"Guest Allowed on Request"
]
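I want to separate the list as follows:
valid_sentences, containing the sentences that match the sentiment of valid_keyword ("Guest Accepted").
invalid_sentences, containing the sentences that do not match the sentiment of valid_keyword.
The expected output is: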
valid_sentences = ["Guest Allowed", "Max Guest Allowed 1", "Guest Allowed on Request"]
invalid_sentences = ["Guest Not Allowed", "No Guest Allowed", "Guest Restricted", "Max Guest Allowed 0"]
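I have already tried VaderSentiment, but it did not give the results I expected. Which Python libraries are suitable for this task and would give the highest accuracy?
HuggingFace offers a neat and effective solution that fits your needs: the zero-shot classification pipeline. Once the library is installed, you can use it as follows: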
from transformers import pipeline
pipe = pipeline(model="facebook/bart-large-mnli")
pipe([
"Guest Allowed", "Max Guest Allowed 1", "Guest Not Allowed",
"No Guest Allowed", "Guest Restricted", "Max Guest Allowed 0",
"Guest Allowed on Request"],
candidate_labels=["Guest Accepted", "Guest Not Accepted"],
)
# # output
# [{'sequence': 'Guest Allowed',
# 'labels': ['Guest Accepted', 'Guest Not Accepted'],
# 'scores': [0.9912997484207153, 0.008700196631252766]},
# {'sequence': 'Max Guest Allowed 1',
# 'labels': ['Guest Accepted', 'Guest Not Accepted'],
# 'scores': [0.9896429777145386, 0.010356982238590717]},
# {'sequence': 'Guest Not Allowed',
# 'labels': ['Guest Not Accepted', 'Guest Accepted'],
# 'scores': [0.9956033229827881, 0.004396690987050533]},
# {'sequence': 'No Guest Allowed',
# 'labels': ['Guest Not Accepted', 'Guest Accepted'],
# 'scores': [0.9929953217506409, 0.007004666142165661]},
# {'sequence': 'Guest Restricted',
# 'labels': ['Guest Not Accepted', 'Guest Accepted'],
# 'scores': [0.9871166348457336, 0.012883339077234268]},
# {'sequence': 'Max Guest Allowed 0',
# 'labels': ['Guest Not Accepted', 'Guest Accepted'],
# 'scores': [0.5833010673522949, 0.41669896245002747]},
# {'sequence': 'Guest Allowed on Request',
# 'labels': ['Guest Accepted', 'Guest Not Accepted'],
# 'scores': [0.9872869849205017, 0.012713048607110977]}]
Feel free to change the model or the candidate labels to adapt the solution to your specific use case.
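To get from that output to the two lists asked for in the question, one option is to look at the top-ranked label of each result, since the pipeline returns labels sorted by score. The following is a minimal sketch rather than part of the original answer: the helper name split_by_top_label is hypothetical, and the sample results are two entries from the output above with scores rounded.

```python
def split_by_top_label(results, valid_label="Guest Accepted"):
    """Split pipeline results into (valid, invalid) sentence lists.

    Assumes each result is a dict with "sequence" and "labels" keys,
    and that "labels" is sorted from most to least likely.
    """
    valid, invalid = [], []
    for result in results:
        top_label = result["labels"][0]
        if top_label == valid_label:
            valid.append(result["sequence"])
        else:
            invalid.append(result["sequence"])
    return valid, invalid


# Two entries from the output above, scores rounded for brevity.
results = [
    {"sequence": "Guest Allowed",
     "labels": ["Guest Accepted", "Guest Not Accepted"],
     "scores": [0.99, 0.01]},
    {"sequence": "Guest Not Allowed",
     "labels": ["Guest Not Accepted", "Guest Accepted"],
     "scores": [0.99, 0.01]},
]

valid_sentences, invalid_sentences = split_by_top_label(results)
```

In practice you would pass the list returned by pipe(...) directly instead of the hand-written sample.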
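I notice that these sentences are all very short. Would it be possible to simply list all the candidate strings and use exact (but case-insensitive) matching?
Then, as a fallback, when you get a string that is not in the list, pass it to the mnli model from meti's answer. (And log it, so that you can add it to the lookup table later.)
This lets you reliably treat "Max Guest Allowed 0" as "not accepted", which I think is the only case the MNLI model has trouble with. It will also be much faster in terms of CPU.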
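A rough sketch of that lookup-table idea might look like the following. The names LOOKUP and is_valid are hypothetical, the table only contains the sentences from the question, and the fallback is passed in as a plain callable so the sketch runs without loading a model.

```python
import logging

logger = logging.getLogger(__name__)

# Hand-maintained table: normalized sentence -> True if guests are accepted.
LOOKUP = {
    "guest allowed": True,
    "max guest allowed 1": True,
    "guest allowed on request": True,
    "guest not allowed": False,
    "no guest allowed": False,
    "guest restricted": False,
    "max guest allowed 0": False,
}


def is_valid(sentence, fallback=None):
    """Return True/False from the table, else defer to `fallback`.

    `fallback` is any callable taking the sentence and returning a bool,
    for example a thin wrapper around the zero-shot pipeline.
    """
    # Collapse whitespace and ignore case before the exact match.
    key = " ".join(sentence.split()).casefold()
    if key in LOOKUP:
        return LOOKUP[key]
    # Log unknown strings so they can be added to the table later.
    logger.warning("Sentence not in lookup table: %r", sentence)
    if fallback is None:
        raise KeyError(sentence)
    return fallback(sentence)
```

A fallback built on the pipeline from the other answer could be something like lambda s: pipe(s, candidate_labels=["Guest Accepted", "Guest Not Accepted"])["labels"][0] == "Guest Accepted", assuming pipe is defined as shown there.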