Detecting hashed strings within a segment of text using Python

Question

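What I am mainly trying to do is split URLs and extract words from them. In many cases, however, a URL contains an alphanumeric hash or some other string of characters that is not intelligible.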

For example: http://www.example.com/wp-content/uploads/2017/10/15321408dd97beb7b5a94f0957b215cf/black-and-white-photography-portrait-of-brad-pitt.jpg

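Here we have a descriptive URL from which keywords can be extracted. The biggest problem with these keywords is that 15321408dd97beb7b5a94f0957b215cf ends up being one of them. Running a spell check is not necessarily the best option, because it could filter out keywords that have not yet been added to the spell-check model. Curating this manually is not feasible. Moreover, such strings vary in length and are not consistent. Although the string in question looks like an MD5, we know the positions of digits and letters can change, so we need to account for that as well as for the variable length. Thinking out loud, it almost comes down to determining whether a string is a hash or not...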

Tools I am using: Python, spaCy

My script below gives the following output:

wp
content
uploads
15321408dd97beb7b5a94f0957b215cf
black
and
white
photography
portrait
of
brad
pitt
jpg

This is where I am so far:

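import re
from urllib.parse import urlparse

import spacy
from nltk.tokenize import WordPunctTokenizer


# Check if the token is noise
def is_noise(token, noisy_tags, min_token_length):
    return (
        token.pos_ in noisy_tags
        or token.is_digit
        or token.is_stop
        # token.text excludes the trailing whitespace that token.string used to include
        or len(token.text) < min_token_length
    )


# Clean word: optionally lowercase, then trim surrounding whitespace
def clean_word(text, lower=True):
    if lower:
        text = text.lower()
    return text.strip()


# The 'en' shortcut no longer exists in spaCy 3; the small English model is assumed here
nlp = spacy.load('en_core_web_sm')

word_tokenizer = WordPunctTokenizer()

# Example URL from the question
url = 'http://www.example.com/wp-content/uploads/2017/10/15321408dd97beb7b5a94f0957b215cf/black-and-white-photography-portrait-of-brad-pitt.jpg'
parsed_uri = urlparse(url)
text = '{uri.path} {uri.query} {uri.fragment}'.format(uri=parsed_uri)
# Replace every run of non-alphanumeric, non-whitespace characters with a space
text = re.sub(r'[^a-zA-Z\d\s]+', ' ', text)
text = ' '.join(word_tokenizer.tokenize(text))

document = nlp(text)
# Note: 'PROP' is not one of spaCy's coarse POS tags (proper nouns are 'PROPN'),
# so this entry matches nothing as written
noisy_pos_tags = ['PROP']
cleaned_words = [clean_word(word.text) for word in document
                 if not is_noise(word, noisy_pos_tags, 2)]

print(cleaned_words)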

Update: here is the output I get when I add part-of-speech tagging with spaCy:

word: 15321408dd97beb7b5a94f0957b215cf
word.lemma_: 15321408dd97beb7b5a94f0957b215cf
word.dep_: amod
word.shape_: ddddxxddxxxdxdxddxddddxdddxx
word.is_alpha: False
word.is_stop: False
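
The `word.shape_` value above hints at one cheap signal: in a hash, digits and letters alternate many times, whereas ordinary words contain no digits at all. The following is a minimal sketch of that idea in plain Python. `char_shape`, `digit_letter_switches`, `looks_mixed` and the `max_switches` threshold are hypothetical names and values chosen for illustration, not part of spaCy, and unlike spaCy's `shape_` this helper does not truncate long runs of the same character class.

```python
def char_shape(token: str) -> str:
    """Map digits to 'd', lowercase letters to 'x', uppercase letters to 'X'.

    Other characters are kept unchanged. Hypothetical helper, not spaCy's shape_.
    """
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)
    return "".join(out)


def digit_letter_switches(token: str) -> int:
    """Count adjacent character pairs where a digit sits next to a letter."""
    shape = char_shape(token).lower()  # treat upper and lower case letters alike
    return sum(1 for a, b in zip(shape, shape[1:]) if {a, b} == {"d", "x"})


def looks_mixed(token: str, max_switches: int = 3) -> bool:
    """Flag tokens that switch between digits and letters more than max_switches times.

    The default threshold is an assumption and would need tuning on real URLs.
    """
    return digit_letter_switches(token) > max_switches


for candidate in ["15321408dd97beb7b5a94f0957b215cf", "photography", "mp3"]:
    print(candidate, digit_letter_switches(candidate), looks_mixed(candidate))
```

A threshold above zero leaves room for legitimate keywords such as "mp3" that contain a digit or two.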

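Update: I tried two other approaches, one of which got me close.

Approach 1: I was not able to actually retrain the classifier on good and bad words: https://github.com/rrenaud/Gibberish-Detector

Approach 2: This is the second approach, but it only works on longer text. In some cases only one word is extracted, and with this approach it is always considered gibberish: https://www.codeproject.com/Articles/894766/Gibberish-Classification-Algorithm-and-Implementat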
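
Because the second approach above struggles with single words, a check that works on one token at a time may be a useful complement. The sketch below assumes the unwanted strings are hexadecimal digests (as the MD5-like example suggests) of variable length. `looks_like_hex_digest` and its `min_length` default are hypothetical and chosen for illustration; hashes in other encodings would need a different character class.

```python
import re

# Only hexadecimal characters, in either case
HEX_ONLY = re.compile(r"[0-9a-fA-F]+")


def looks_like_hex_digest(token: str, min_length: int = 16) -> bool:
    """Return True for long hex-only tokens that contain both digits and letters.

    min_length is an assumed cutoff; it keeps short words such as 'cafe' or 'deed'
    from being flagged while still allowing variable-length digests.
    """
    if len(token) < min_length:
        return False
    if HEX_ONLY.fullmatch(token) is None:
        return False
    has_digit = any(ch.isdigit() for ch in token)
    has_letter = any(ch.isalpha() for ch in token)
    return has_digit and has_letter


for candidate in ["15321408dd97beb7b5a94f0957b215cf", "photography", "cafe"]:
    print(candidate, looks_like_hex_digest(candidate))
```

Requiring both a digit and a letter means purely numeric identifiers are left to the existing `is_digit` filter.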

Answer 1

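In addition to the good probabilistic approach offered by the Gibberish-Detector module, you can also use several signals provided by the corpora built into NLTK and spaCy. An NLTK example (using the brown and wordnet corpora):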

>>> from nltk.corpus import wordnet
>>> from nltk.corpus import brown
>>> from nltk.corpus import webtext
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> 'brad' in brown.words()
True
>>> 'pitt' in brown.words()
False
>>> '15321408dd97beb7b5a94f0957b215cf' in brown.words()
False
>>> 'photography' in brown.words()
True
>>> wordnet.synsets('photography')
[Synset('photography.n.01'), Synset('photography.n.02'), Synset('photography.n.03')]
>>> wordnet.synsets('brad')
[Synset('brad.n.01'), Synset('brad.v.01')]
>>> wordnet.synsets('pitt')
[Synset('pitt.n.01'), Synset('pitt.n.02'), Synset('pitt.n.03')]
>>> wordnet.synsets('15321408dd97beb7b5a94f0957b215cf')
[]
>>> wordnet.synsets('anothergibberishhash')
[]

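However, these lookups are not optimized: you would have to build optimized data structures to look up these signals in real time, rather than iterating over every word in the corpus on each check.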
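
One simple way to get such a structure is to load the corpus words into a set once, so that each membership test is a hash lookup instead of a scan. This is a minimal sketch: `build_vocabulary` and `is_known_word` are hypothetical helper names, and the tiny `sample_words` list stands in for a real corpus so the example runs without downloading NLTK data. With NLTK available, `brown.words()` could be passed instead. Note that this version lowercases everything, which differs from the case-sensitive `in brown.words()` checks shown above.

```python
from typing import FrozenSet, Iterable


def build_vocabulary(words: Iterable[str]) -> FrozenSet[str]:
    """Lowercase and deduplicate the words once, up front."""
    return frozenset(word.lower() for word in words)


def is_known_word(token: str, vocabulary: FrozenSet[str]) -> bool:
    """Average-case constant-time membership test against the prepared vocabulary."""
    return token.lower() in vocabulary


# Stand-in corpus (assumption); replace with e.g. brown.words() when NLTK data is installed
sample_words = ["The", "Fulton", "County", "Grand", "Jury", "photography", "Brad"]
vocabulary = build_vocabulary(sample_words)

for candidate in ["photography", "brad", "15321408dd97beb7b5a94f0957b215cf"]:
    print(candidate, is_known_word(candidate, vocabulary))
```

The set is built a single time at startup, and every later check avoids walking the corpus.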
