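I'm new to gensim, especially gensim 4. Honestly, I find it hard to work out from the documentation how to fine-tune a pretrained word2vec model. I have a pretrained model saved locally in binary format, and I would like to fine-tune it on new data.
My question is this:
So far, I have written the following code: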
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors, Word2Vec
# path to pretrained model
pretrained_path = '../models/german.model'
# new data
sentences = df.stem_token_wo_sw.to_list() # Pandas column containing text data
# Create new model
w2v_de = Word2Vec(
    min_count=min_count,
    vector_size=vector_size,
    window=window,
    workers=workers,
)
# Build vocab
w2v_de.build_vocab(sentences)
# Extract number of examples
total_examples = w2v_de.corpus_count
# Load pretrained model
model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)
# Add previous words from pretrained model
w2v_de.build_vocab([list(model.key_to_index.keys())], update=True)
# Train model
w2v_de.train(sentences, total_examples=total_examples, epochs=2)
# create array of vectors
vectors = np.asarray(w2v_de.wv.vectors)
# create array of labels
labels = np.asarray(w2v_de.wv.index_to_key)
# create dataframe of vectors for each word
w_emb = pd.DataFrame(
    index=labels,
    columns=[f'X{n}' for n in range(1, vectors.shape[1] + 1)],
    data=vectors,
)
After training, I use PCA to reduce the dimensionality from 300 to 2 so that I can plot the word embedding space.
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
# create pipeline
pipeline = Pipeline(
    steps=[
        # ('scaler', StandardScaler()),
        ('pca', PCA(n_components=2)),
    ]
)
# fit pipeline
pipeline.fit(w_emb)
# Transform vectors
vectors_transformed = pipeline.transform(w_emb)
w_emb_transformed = pd.DataFrame(
    index=labels,
    columns=['PC1', 'PC2'],
    data=vectors_transformed,
)
Both labels and vectors contain only the new words, rather than the old and the new words together, and the same goes for my plot and PCA values.
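One possible cause, offered as a guess rather than a confirmed diagnosis: the pretrained vocabulary is passed to build_vocab(..., update=True) as a single pseudo-sentence, so every pretrained word is seen exactly once. If min_count is greater than 1 (its value is not shown above), a frequency threshold of that kind would discard all of those words. The sketch below imitates that filtering in plain Python. It is not gensim's own code, and surviving_words, new_sentences and pretrained_words are hypothetical names and data made up for the illustration.

```python
from collections import Counter


def surviving_words(sentences, min_count):
    """Return the words whose total frequency in `sentences` is >= min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


# Hypothetical stand-ins for the real data in the question.
new_sentences = [
    ["haus", "garten"],
    ["haus", "garten", "baum"],
    ["baum", "haus"],
]
pretrained_words = ["haus", "auto", "strasse"]  # stand-in for model.key_to_index

min_count = 2  # assumed value; the question does not state it

# Words from the new corpus occur often enough to pass the threshold.
kept_from_new = surviving_words(new_sentences, min_count)

# The pretrained words arrive as ONE sentence, so each has a count of 1
# and none of them reaches the threshold.
kept_from_update = surviving_words([pretrained_words], min_count)

print(sorted(kept_from_new))
print(sorted(kept_from_update))
```

Even if the pretrained words did pass the threshold, build_vocab only registers words and counts, so they would start from freshly initialised vectors. The vectors loaded into model via KeyedVectors.load_word2vec_format are never copied into w2v_de by the code above.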