I am trying to build a word2vec similarity dictionary. I am able to build the dictionary, but the similarities are not populated correctly. Am I missing something in my code?
Sample input data text:
TAK PO LUN UNIT 3 15/F WAYSON COMMERCIAL G 28 CONNAUGHT RD WEST SHEUNG WAN
- EDDY SUSANTO YAHYA ROOM 1503-05 WESTERN CENTRE 40-50 DES VOEUX W. SHEUNG WAN
DNA FINANCIAL SYSTEMS INC UNIT 10 19F WAYSON COMMERCIAL 28 CONNAUGHT RD SHEUNG WAN
G/F 60 PO HING FONG SHEUNG WAN
10B CENTRAL MANSION 270 QUEENS RD CENTRAL SHEUNG WAN
AKAMAI INTERNATIONAL BV C/O IADVANTAGE 28/F OF MEGA I-ADVANTAGE 399 CHAI WAN RD CHAI WAN HONG KO HONG KONG
VICTORIA CHAN F/5E 1-3 FLEMING RD WANCHI WAN CHAI
HISTREND 365 5/F FOO TAK BUILDING 365 HENNESSY RD WAN CHAI H WAN CHAI
ROOM 1201 12F CHINACHEM JOHNSO PLAZA 178 186 JOHNSTON RD WAN CHAI
LUEN WO BUILDING 339 HENNESSY RD 9 FLOOR WAN CHAI HONG KONG
My code:
import gensim
from gensim import corpora, similarities, models

class AccCorpus(object):
    def __init__(self):
        self.path = ''

    def __iter__(self):
        for sentence in data["Adj_Addr"]:
            yield [word.lower() for word in sentence.split()]

def build_corpus():
    model = gensim.models.word2vec.Word2Vec(alpha=0.05, min_alpha=0.05, window=2, sg=1)
    sentences = AccCorpus()
    model.build_vocab(sentences)
    for epoch in range(1):
        model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
        model.alpha -= 0.002  # decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate, no decay
    model_name = "word2vec_model"
    model.save(model_name)
    return model

model = build_corpus()
My results:
model.most_similar("wan")
[('want', 0.6867533922195435),
('puiwan', 0.6323356032371521),
('wan.', 0.6132887005805969),
('wanstreet', 0.5945449471473694),
('aupuiwan', 0.594132661819458),
('futan', 0.5883135199546814),
('fotan', 0.5817855000495911),
('shanmei', 0.5807071924209595),
('30-33', 0.5789132118225098),
('61-63au', 0.5711270570755005)]
Here is the output I expected for the similar words: Sheung Wan, Wan Chai, Chai Wan. I guess my skip-grams are not working properly. How can I fix this?
As already suggested in the comments, there is no need to tweak alpha and the other internal parameters unless you are sure it is necessary (and in your case it most likely is not).
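If you drop the manual loop, a single train() call is enough and gensim handles the learning-rate decay itself. A minimal sketch of just that change, keeping the question's AccCorpus iterator and the gensim 3.x API (model.iter) used in the question:

    model = gensim.models.word2vec.Word2Vec(window=2, sg=1)  # leave alpha/min_alpha at their defaults
    sentences = AccCorpus()
    model.build_vocab(sentences)
    # One call; gensim decays alpha internally over the configured number of epochs.
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)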
You are getting all those extra results because they are in your data. I don't know what Adj_Addr is, but it contains more than just the text you provided: puiwan, futan, fotan, ... none of these appear in the text above.
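One quick way to confirm that is to compare the model's vocabulary with the tokens of the sample you pasted. A minimal sketch, assuming the model trained on Adj_Addr is still available as model, the pasted lines are in a string called sample_text, and gensim 3.x (vocabulary exposed as model.wv.vocab):

    sample_tokens = {w.lower() for line in sample_text.split('\n') for w in line.split()}
    # Words the model learned that never occur in the pasted sample --
    # a non-empty list (puiwan, fotan, ...) means Adj_Addr holds more data than shown.
    extra = sorted(w for w in model.wv.vocab if w not in sample_tokens)
    print(extra[:20])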
Here is a clean test that works the way you want it to (I kept only the relevant parts; feel free to add sg=1 back, that works too):
import gensim

text = """TAK PO LUN UNIT 3 15/F WAYSON COMMERCIAL G 28 CONNAUGHT RD WEST SHEUNG WAN
- EDDY SUSANTO YAHYA ROOM 1503-05 WESTERN CENTRE 40-50 DES VOEUX W. SHEUNG WAN
DNA FINANCIAL SYSTEMS INC UNIT 10 19F WAYSON COMMERCIAL 28 CONNAUGHT RD SHEUNG WAN
G/F 60 PO HING FONG SHEUNG WAN
10B CENTRAL MANSION 270 QUEENS RD CENTRAL SHEUNG WAN
AKAMAI INTERNATIONAL BV C/O IADVANTAGE 28/F OF MEGA I-ADVANTAGE 399 CHAI WAN RD CHAI WAN HONG KO HONG KONG
VICTORIA CHAN F/5E 1-3 FLEMING RD WANCHI WAN CHAI
HISTREND 365 5/F FOO TAK BUILDING 365 HENNESSY RD WAN CHAI H WAN CHAI
ROOM 1201 12F CHINACHEM JOHNSO PLAZA 178 186 JOHNSTON RD WAN CHAI
LUEN WO BUILDING 339 HENNESSY RD 9 FLOOR WAN CHAI HONG KONG"""

sentences = text.split('\n')

class AccCorpus(object):
    def __init__(self):
        self.path = ''

    def __iter__(self):
        for sentence in sentences:
            yield [word.lower() for word in sentence.split()]

def build_corpus():
    model = gensim.models.word2vec.Word2Vec()
    sentences = AccCorpus()
    model.build_vocab(sentences)
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    return model

model = build_corpus()
print(model.most_similar("wan"))
The result is:
[('chai', 0.04687393456697464), ('rd', -0.03181878849864006), ('sheung', -0.06769674271345139)]
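Only three neighbours come back because, with Word2Vec's default min_count=5, every token that appears fewer than five times in these ten lines is dropped; on this tiny sample that should leave just wan, chai, sheung and rd in the vocabulary (the low similarity scores are also expected with so little data). You can check what survived with (gensim 3.x, vocabulary on model.wv.vocab):

    print(sorted(model.wv.vocab))  # expected given the token counts: ['chai', 'rd', 'sheung', 'wan']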