Count how frequently certain words in a text occur near certain key words


I have two lists of words - the first I call search words, the second I call key words. My goal is to count how often words from the search words list appear within 10 words of a key word. For example, suppose the word acquire is in the key words list; I would then look for search words within 10 words of acquire. "Within 10 words" means 10 words before the key word and 10 words after it, i.e. looking both backwards and forwards from the key word.

Below are my search_words and key_words lists -

search_words = ['access control', 'Acronis', 'Adaware', 'AhnLab', 'AI Max Dev Labs', 'Alibaba Security',
 'anti-adware', 'anti-keylogger', 'anti-malware', 'anti-ransomware', 'anti-rootkit', 'anti-spyware',
 'anti-subversion', 'anti-tamper', 'anti-virus', 'Antiy', 'Avast', 'AVG', 'Avira', 'Baidu', 'Barracuda',
 'Bitdefender', 'BullGuard', 'Carbon Black', 'Check Point', 'Cheetah Mobile', 'Cisco', 'Clario',
 'Comodo', 'computer security', 'CrowdStrike', 'cryptography', 'Cybereason', 'cybersecurity',
 'Cylance', 'data security', 'diagnostic program', 'Elastic', 'Emsisoft', 'encryption', 'Endgame', 'end point security', 
 'Ensilo', 'eScan', 'ESET', 'FireEye', 'firewall', 'Fortinet', 'F-Secure', 'G Data',
 'Immunet', 'information security', 'Intego', 'intrusion detection system', 'K7', 'Kaspersky', 'log management software', 'Lookout', 
 'MacKeeper', 'Malwarebytes', 'McAfee', 'Microsoft', 'network security', 
 'NOD32', 'Norton', 'Palo Alto Networks', 'Panda Security', 'PC Matic', 'PocketBits',
 'Qihoo', 'Quick Heal', 'records management', 'SafeDNS', 'Saint Security', 'sandbox', 'Sangfor',
 'Securion', 'security event management', 'security information and event management', 
 'security information management', 'SentinelOne', 'Seqrite', 'Sophos',
 'SparkCognition', 'steganography', 'Symantec', 'Tencent', 'Total AV', 'Total Defense', 
 'Trend Micro', 'Trustport', 'Vipre', 'Webroot', 'ZoneAlarm']

key_words = ['acquire', 'adopt', 'advance', 'agree', 'boost', 'capital resource',
 'capitalize', 'change', 'commitment', 'complete', 'configure', 'design', 'develop', 'enhance', 'expand',
 'expenditure', 'expense', 'implement', 'improve', 'increase', 'initiate', 'install', 
 'integrate', 'invest', 'lease',
 'modernize', 'modify', 'move', 'obtain', 'plan', 'project', 'purchase', 'replace', 'spend',
  'upgrade', 'use']

A small example -

import pandas as pd

text_dict = {
    'ITEM7':["Last year, from AVG we have acquired Alibaba Security. This year we are in the process \
    of adopting Symantec. We believe these technologies will improve our access control. \
        Moreover, we also integrated data security diagnostic program.",
        "We are planning to install end-point security, which will upgrade intrusion detection system."]
}

df = pd.DataFrame(text_dict)

My expected output is -

                 ITEM7                          Frequency
Last year, from AVG we have acquired Alibaba S...   6
We are planning to install end-point security,...   2

For the first row of df, the words AVG and Alibaba Security from the search_words list occur around the word acquired, whose base form - acquire - is in the key_words list. Similarly, Symantec, access control, data security and diagnostic program from the search_words list fall within 10 words of adopting, improve and integrated, whose base forms are in the key_words list. The total number of search words found is therefore 6 (AVG + Alibaba Security + Symantec + access control + data security + diagnostic program), so the value in the Frequency column of df is 6.

Note that the words in key_words are in their base form, so their variants (e.g. acquired, acquiring for acquire) should also count as key words.
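
As a sketch of how such variants could be mapped back to their base form (my own illustration; NLTK's WordNetLemmatizer with pos='v' is an assumed choice, not something stated in the question), verb forms can be lemmatized before comparing them to key_words:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)      # lemmatizer data (one-time download)
lemmatizer = WordNetLemmatizer()

key_words = {"acquire", "adopt", "improve", "integrate"}

for token in ["acquired", "adopting", "improve", "integrated"]:
    base = lemmatizer.lemmatize(token, pos="v")   # reduce verb forms to the base form
    print(token, "->", base, base in key_words)
# acquired -> acquire True
# adopting -> adopt True
# improve -> improve True
# integrated -> integrate True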

python pandas nlp
1 Answer

You need to process each row of text by locating occurrences of the key_words and capturing a window of 10 words on either side of each occurrence. Within each window, check for the search_words, making sure multi-word entries are matched as phrases. Count each distinct search_word found in any window, so the same phrase is not counted twice for a row, and store the result as a per-row frequency that reflects the number of unique search_words near key_words.

import pandas as pd
import re

text_dict = {
    'ITEM7': [
        "Last year, from AVG we have acquired Alibaba Security. This year we are in the process "
        "of adopting Symantec. We believe these technologies will improve our access control. "
        "Moreover, we also integrated data security diagnostic program.",
        "We are planning to install end-point security, which will upgrade intrusion detection system."
    ]
}
df = pd.DataFrame(text_dict)

search_words = [
    'access control', 'Acronis', 'Adaware', 'AhnLab', 'AI Max Dev Labs', 'Alibaba Security',
    'anti-adware', 'anti-keylogger', 'anti-malware', 'anti-ransomware', 'anti-rootkit', 'anti-spyware',
    'anti-subversion', 'anti-tamper', 'anti-virus', 'Antiy', 'Avast', 'AVG', 'Avira', 'Baidu', 'Barracuda',
    'Bitdefender', 'BullGuard', 'Carbon Black', 'Check Point', 'Cheetah Mobile', 'Cisco', 'Clario',
    'Comodo', 'computer security', 'CrowdStrike', 'cryptography', 'Cybereason', 'cybersecurity',
    'Cylance', 'data security', 'diagnostic program', 'Elastic', 'Emsisoft', 'encryption', 'Endgame', 'end point security',
    'Ensilo', 'eScan', 'ESET', 'FireEye', 'firewall', 'Fortinet', 'F-Secure', 'G Data',
    'Immunet', 'information security', 'Intego', 'intrusion detection system', 'K7', 'Kaspersky', 'log management software', 'Lookout',
    'MacKeeper', 'Malwarebytes', 'McAfee', 'Microsoft', 'network security',
    'NOD32', 'Norton', 'Palo Alto Networks', 'Panda Security', 'PC Matic', 'PocketBits',
    'Qihoo', 'Quick Heal', 'records management', 'SafeDNS', 'Saint Security', 'sandbox', 'Sangfor',
    'Securion', 'security event management', 'security information and event management',
    'security information management', 'SentinelOne', 'Seqrite', 'Sophos',
    'SparkCognition', 'steganography', 'Symantec', 'Tencent', 'Total AV', 'Total Defense',
    'Trend Micro', 'Trustport', 'Vipre', 'Webroot', 'ZoneAlarm'
]

key_words = [
    'acquire', 'adopt', 'advance', 'agree', 'boost', 'capital resource',
    'capitalize', 'change', 'commitment', 'complete', 'configure', 'design', 'develop', 'enhance', 'expand',
    'expenditure', 'expense', 'implement', 'improve', 'increase', 'initiate', 'install',
    'integrate', 'invest', 'lease', 'modernize', 'modify', 'move', 'obtain', 'plan', 'project',
    'purchase', 'replace', 'spend', 'upgrade', 'use'
]

def preprocess_text_no_lemmatization(text):
    # Keep word characters only, so hyphenated terms like "end-point" become
    # separate tokens ("end", "point") and punctuation is dropped.
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

def calculate_final_frequency(row, search_phrases, key_phrases):
    tokens = preprocess_text_no_lemmatization(row)
    search_phrases = [phrase.lower() for phrase in search_phrases]
    key_phrases = [phrase.lower() for phrase in key_phrases]

    all_matches = set()          # unique search phrases found near any key word
    token_len = len(tokens)

    for idx, token in enumerate(tokens):
        # startswith is a cheap stand-in for lemmatization: "acquired",
        # "adopting" and "integrated" all start with their base forms.
        if any(token.startswith(key) for key in key_phrases):
            window_start = max(0, idx - 10)
            window_end = min(token_len, idx + 10 + 1)
            window_text = " ".join(tokens[window_start:window_end])

            # Substring check on the joined window lets multi-word
            # search phrases match as phrases.
            for phrase in search_phrases:
                if phrase in window_text:
                    all_matches.add(phrase)
    return len(all_matches)

df['Frequency'] = df['ITEM7'].apply(lambda x: calculate_final_frequency(x, search_words, key_words))

print(df)

which returns

                                               ITEM7  Frequency
0  Last year, from AVG we have acquired Alibaba S...          6
1  We are planning to install end-point security,...          2
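
Two things worth noting about this approach: the set all_matches ensures each search phrase is counted at most once per row, and the regex tokenization strips hyphens, which is why "end-point security" in the text still matches the list entry 'end point security'. The startswith check is only an approximation of base-form matching; if you need more robust handling of variants, lemmatization as sketched above would be the place to plug it in.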