I am trying to solve an NLP problem. I have a dictionary of words, like:
list_1 = {'phone': 'android', 'chair': 'netflit', 'charger': 'macbook', 'laptop': 'sony'}
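Now, if the input is 'phone', I can easily use the `in` operator to look up the key and get the description of phone and its data. The problem is when the input is 'phones' or 'Phones'.
What I want is that if I enter 'phone', I get words like: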
'phone' ==> 'Phones','phones','Phone','Phone's','phone's'
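I don't know which word2vec model I could use, or which NLP module can provide such a solution.
My second question: if I give the word 'dog', can I get words like 'puppy', 'kitty', 'Dog', 'dog', etc.?
I tried something like this, but it gives synonyms: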
from nltk.corpus import wordnet as wn
for ss in wn.synsets('phone'): # Each synset represents a diff concept.
    print(ss)
But it returns:
Synset('telephone.n.01')
Synset('phone.n.02')
Synset('earphone.n.01')
Synset('call.v.03')
Instead, I want:
'phone' ==> 'Phones','phones','Phone','Phone's','phone's'
WordNet indexes concepts (aka Synsets), not words.
Use lemma_names() to access the root words (aka Lemmas) in WordNet.
>>> from nltk.corpus import wordnet as wn
>>> for ss in wn.synsets('phone'): # Each synset represents a diff concept.
...     print(ss.lemma_names())
...
['telephone', 'phone', 'telephone_set']
['phone', 'speech_sound', 'sound']
['earphone', 'earpiece', 'headphone', 'phone']
['call', 'telephone', 'call_up', 'phone', 'ring']
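A lemma, being the root form of a word, should not carry additional affixes, so you won't find the plurals or other inflected forms that you listed in your desired word list.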
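Since lemmas carry no affixes, one option is to go the other way: reduce whatever the user types to a lemma-like key and then look that key up. The following is only a minimal sketch using the standard library; `normalize_key` and `lookup` are hypothetical helper names, and the plural rule is a deliberately naive assumption (it strips one trailing "s"). A real lemmatizer such as NLTK's `WordNetLemmatizer` would be needed for irregular forms.

```python
import re

def normalize_key(word):
    """Reduce a surface form such as "Phone's" or "phones" to a lookup key."""
    key = word.strip().lower()
    # Drop a trailing possessive marker: "phone's" -> "phone".
    key = re.sub(r"['’]s$", "", key)
    # Naive plural handling (assumption): strip a single trailing "s",
    # but leave short words and words ending in "ss" (e.g. "glass") alone.
    if key.endswith("s") and not key.endswith("ss") and len(key) > 3:
        key = key[:-1]
    return key

def lookup(word, table):
    """Return the entry for `word` after normalization, or None if absent."""
    return table.get(normalize_key(word))

list_1 = {'phone': 'android', 'chair': 'netflit',
          'charger': 'macbook', 'laptop': 'sony'}

for form in ['Phones', 'phones', 'Phone', "Phone's", "phone's"]:
    print(form, '->', lookup(form, list_1))
```

With this approach the dictionary keeps a single key per word, and all the listed variants of 'phone' resolve to the same entry.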
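Also, words are ambiguous and may need to be disambiguated by context or by part of speech (POS) before you can get "similar" words. For example, you can see that "phone" in its verb sense is not quite the same "phone" as in its noun senses.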
>>> for ss in wn.synsets('phone'): # Each synset represents a diff concept.
...     print(ss.lemma_names(), '\t', ss.definition())
...
['telephone', 'phone', 'telephone_set'] electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds
['phone', 'speech_sound', 'sound'] (phonetics) an individual sound unit of speech without concern as to whether or not it is a phoneme of some language
['earphone', 'earpiece', 'headphone', 'phone'] electro-acoustic transducer for converting electric signals into sounds; it is held over or inserted into the ear
['call', 'telephone', 'call_up', 'phone', 'ring'] get or try to get into communication (with someone) by telephone
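To act on the POS distinction above, the synsets can be filtered by part of speech before their lemma names are collected. This is a rough sketch rather than a definitive recipe: `lemmas_for_pos` is a hypothetical helper, and it only assumes that each synset exposes `pos()` and `lemma_names()`, as NLTK's WordNet synsets do.

```python
def lemmas_for_pos(synsets, pos):
    """Collect lemma names from the synsets whose part of speech matches `pos`.

    `pos` is a WordNet tag such as 'n' (noun) or 'v' (verb).
    Duplicates are dropped while the first-seen order is kept.
    """
    names = []
    for ss in synsets:
        if ss.pos() != pos:
            continue  # skip senses with a different part of speech
        for name in ss.lemma_names():
            if name not in names:
                names.append(name)
    return names
```

Calling `lemmas_for_pos(wn.synsets('phone'), 'v')` would keep only the verb sense shown above, while passing 'n' would merge the lemma names of the three noun senses. NLTK can also do the filtering itself via `wn.synsets('phone', pos=wn.NOUN)`.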