Count how frequently certain words in a text occur near certain key words


I have two lists of words - the first I call search words, the second I call key words. My goal is to count how often words from the search words list appear within 10 words of a key word. For example, suppose the word acquire is in the key words list; I would then look for search words within 10 words of acquire. "Within 10 words" means 10 words before the key word and 10 words after it, i.e. looking both backwards and forwards from the key word.

Below are my search_words and key_words lists -

search_words = ['access control', 'Acronis', 'Adaware', 'AhnLab', 'AI Max Dev Labs', 'Alibaba Security',
 'anti-adware', 'anti-keylogger', 'anti-malware', 'anti-ransomware', 'anti-rootkit', 'anti-spyware',
 'anti-subversion', 'anti-tamper', 'anti-virus', 'Antiy', 'Avast', 'AVG', 'Avira', 'Baidu', 'Barracuda',
 'Bitdefender', 'BullGuard', 'Carbon Black', 'Check Point', 'Cheetah Mobile', 'Cisco', 'Clario',
 'Comodo', 'computer security', 'CrowdStrike', 'cryptography', 'Cybereason', 'cybersecurity',
 'Cylance', 'data security', 'diagnostic program', 'Elastic', 'Emsisoft', 'encryption', 'Endgame', 'end point security', 
 'Ensilo', 'eScan', 'ESET', 'FireEye', 'firewall', 'Fortinet', 'F-Secure', 'G Data',
 'Immunet', 'information security', 'Intego', 'intrusion detection system', 'K7', 'Kaspersky', 'log management software', 'Lookout', 
 'MacKeeper', 'Malwarebytes', 'McAfee', 'Microsoft', 'network security', 
 'NOD32', 'Norton', 'Palo Alto Networks', 'Panda Security', 'PC Matic', 'PocketBits',
 'Qihoo', 'Quick Heal', 'records management', 'SafeDNS', 'Saint Security', 'sandbox', 'Sangfor',
 'Securion', 'security event management', 'security information and event management', 
 'security information management', 'SentinelOne', 'Seqrite', 'Sophos',
 'SparkCognition', 'steganography', 'Symantec', 'Tencent', 'Total AV', 'Total Defense', 
 'Trend Micro', 'Trustport', 'Vipre', 'Webroot', 'ZoneAlarm']

key_words = ['acquire', 'adopt', 'advance', 'agree', 'boost', 'capital resource',
 'capitalize', 'change', 'commitment', 'complete', 'configure', 'design', 'develop', 'enhance', 'expand',
 'expenditure', 'expense', 'implement', 'improve', 'increase', 'initiate', 'install', 
 'integrate', 'invest', 'lease',
 'modernize', 'modify', 'move', 'obtain', 'plan', 'project', 'purchase', 'replace', 'spend',
  'upgrade', 'use']

A small example -

import pandas as pd

text_dict = {
    'ITEM7':["Last year, from AVG we have acquired Alibaba Security. This year we are in the process \
    of adopting Symantec. We believe these technologies will improve our access control. \
        Moreover, we also integrated data security diagnostic program.",
        "We are planning to install end-point security, which will upgrade intrusion detection system."]
}

df = pd.DataFrame(text_dict)

My expected output is -

                 ITEM7                          Frequency
Last year, from AVG we have acquired Alibaba S...   6
We are planning to install end-point security,...   2

For the first row of df, the words AVG and Alibaba Security from the search_words list occur around the word acquired, whose base form - acquire - is in the key_words list. Similarly, Symantec, access control, data security and diagnostic program from the search_words list fall within 10 words of adopting, improve and integrated, whose base forms are in the key_words list. The total number of search words found is therefore 6 (AVG + Alibaba Security + Symantec + access control + data security + diagnostic program), so the value in the Frequency column of df is 6.

Note that the words in key_words are in their base form, so their variants (e.g. acquired, acquiring for acquire) should also count as key words.
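
As a sketch of how such variants could be mapped back to their base form (my own illustration; NLTK's WordNetLemmatizer with pos='v' is an assumed choice, not something stated in the question), verb forms can be lemmatized before comparing them to key_words:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)      # lemmatizer data (one-time download)
lemmatizer = WordNetLemmatizer()

key_words = {"acquire", "adopt", "improve", "integrate"}

for token in ["acquired", "adopting", "improve", "integrated"]:
    base = lemmatizer.lemmatize(token, pos="v")   # reduce verb forms to the base form
    print(token, "->", base, base in key_words)
# acquired -> acquire True
# adopting -> adopt True
# improve -> improve True
# integrated -> integrate True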

python pandas nlp
1 Answer

You need to process each row of text by locating occurrences of the key_words and capturing a window of 10 words on either side of each occurrence. Within each window, check for the search_words, making sure multi-word entries are matched as phrases. Count each distinct search_word found in any window, so the same phrase is not counted twice for a row, and store the result as a per-row frequency that reflects the number of unique search_words near key_words.

import pandas as pd
import re

text_dict = {
    'ITEM7': [
        "Last year, from AVG we have acquired Alibaba Security. This year we are in the process "
        "of adopting Symantec. We believe these technologies will improve our access control. "
        "Moreover, we also integrated data security diagnostic program.",
        "We are planning to install end-point security, which will upgrade intrusion detection system."
    ]
}
df = pd.DataFrame(text_dict)

search_words = [
    'access control', 'Acronis', 'Adaware', 'AhnLab', 'AI Max Dev Labs', 'Alibaba Security',
    'anti-adware', 'anti-keylogger', 'anti-malware', 'anti-ransomware', 'anti-rootkit', 'anti-spyware',
    'anti-subversion', 'anti-tamper', 'anti-virus', 'Antiy', 'Avast', 'AVG', 'Avira', 'Baidu', 'Barracuda',
    'Bitdefender', 'BullGuard', 'Carbon Black', 'Check Point', 'Cheetah Mobile', 'Cisco', 'Clario',
    'Comodo', 'computer security', 'CrowdStrike', 'cryptography', 'Cybereason', 'cybersecurity',
    'Cylance', 'data security', 'diagnostic program', 'Elastic', 'Emsisoft', 'encryption', 'Endgame', 'end point security',
    'Ensilo', 'eScan', 'ESET', 'FireEye', 'firewall', 'Fortinet', 'F-Secure', 'G Data',
    'Immunet', 'information security', 'Intego', 'intrusion detection system', 'K7', 'Kaspersky', 'log management software', 'Lookout',
    'MacKeeper', 'Malwarebytes', 'McAfee', 'Microsoft', 'network security',
    'NOD32', 'Norton', 'Palo Alto Networks', 'Panda Security', 'PC Matic', 'PocketBits',
    'Qihoo', 'Quick Heal', 'records management', 'SafeDNS', 'Saint Security', 'sandbox', 'Sangfor',
    'Securion', 'security event management', 'security information and event management',
    'security information management', 'SentinelOne', 'Seqrite', 'Sophos',
    'SparkCognition', 'steganography', 'Symantec', 'Tencent', 'Total AV', 'Total Defense',
    'Trend Micro', 'Trustport', 'Vipre', 'Webroot', 'ZoneAlarm'
]

key_words = [
    'acquire', 'adopt', 'advance', 'agree', 'boost', 'capital resource',
    'capitalize', 'change', 'commitment', 'complete', 'configure', 'design', 'develop', 'enhance', 'expand',
    'expenditure', 'expense', 'implement', 'improve', 'increase', 'initiate', 'install',
    'integrate', 'invest', 'lease', 'modernize', 'modify', 'move', 'obtain', 'plan', 'project',
    'purchase', 'replace', 'spend', 'upgrade', 'use'
]

def preprocess_text_no_lemmatization(text):
    # Keep word characters only, so hyphenated terms like "end-point" become
    # separate tokens ("end", "point") and punctuation is dropped.
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

def calculate_final_frequency(row, search_phrases, key_phrases):
    tokens = preprocess_text_no_lemmatization(row)
    search_phrases = [phrase.lower() for phrase in search_phrases]
    key_phrases = [phrase.lower() for phrase in key_phrases]

    all_matches = set()          # unique search phrases found near any key word
    token_len = len(tokens)

    for idx, token in enumerate(tokens):
        # startswith is a cheap stand-in for lemmatization: "acquired",
        # "adopting" and "integrated" all start with their base forms.
        if any(token.startswith(key) for key in key_phrases):
            window_start = max(0, idx - 10)
            window_end = min(token_len, idx + 10 + 1)
            window_text = " ".join(tokens[window_start:window_end])

            # Substring check on the joined window lets multi-word
            # search phrases match as phrases.
            for phrase in search_phrases:
                if phrase in window_text:
                    all_matches.add(phrase)
    return len(all_matches)

df['Frequency'] = df['ITEM7'].apply(lambda x: calculate_final_frequency(x, search_words, key_words))

print(df)

which returns

                                               ITEM7  Frequency
0  Last year, from AVG we have acquired Alibaba S...          6
1  We are planning to install end-point security,...          2
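
Two things worth noting about this approach: the set all_matches ensures each search phrase is counted at most once per row, and the regex tokenization strips hyphens, which is why "end-point security" in the text still matches the list entry 'end point security'. The startswith check is only an approximation of base-form matching; if you need more robust handling of variants, lemmatization as sketched above would be the place to plug it in.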