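I'm trying to build a multi-class, multi-label model that classifies movie genres from the plot. There are 24 different genres; here is the number of movies per genre: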
genre number_of_movies
Drama 3965
Comedy 3046
Thriller 2024
Romance 1892
Crime 1447
Action 1303
Adventure 1024
Horror 954
Mystery 759
Sci-Fi 723
Fantasy 707
Family 682
Documentary 419
Biography 373
War 348
Music 341
History 273
Musical 271
Sport 261
Animation 260
Western 237
Film-Noir 168
Short 92
News 7
I'm creating the features with CountVectorizer() as follows:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features=4412, stop_words='english', ngram_range=(1, 3), binary=True)
X = vect.fit_transform(df['plot'])
X.shape
Output:
(7895, 4412)
and MultiLabelBinarizer() to create y_genres:
from sklearn.preprocessing import MultiLabelBinarizer
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])
y_genres.shape
Output:
(7895, 24)
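The goal is to use RandomOverSampler and SMOTE from imblearn.over_sampling to resample every class except the majority class. However, when I use:
from imblearn.over_sampling import RandomOverSampler, SMOTE
ros = RandomOverSampler(random_state=42)
X_resampled, Y_resampled = ros.fit_sample(X, y_genres)
Y_resampled.shape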
Output (note that only 22 of the 24 label columns remain):
(52690, 22)
sm = SMOTE(random_state=42)
X_resampled, Y_resampled = sm.fit_sample(X, y_genres)
Error:
Expected n_neighbors <= n_samples, but n_samples = 2, n_neighbors = 6
What can I do to solve the two problems described above?
sm.fit_resample might come to the rescue.
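For context on why both symptoms can show up, here is a minimal plain-Python sketch. It assumes (not confirmed for the exact imblearn version used in the question) that the sampler reduces a 2-D indicator target to one class per row by taking the argmax, i.e. the first column holding a 1, and then balances those collapsed classes. The toy matrix `y_toy` and the helper names are hypothetical and not part of imblearn.

```python
from collections import Counter


def collapse_multilabel(rows):
    """Map each multi-hot row to the index of its first maximum (argmax-style)."""
    return [row.index(max(row)) for row in rows]


def max_safe_k_neighbors(labels):
    """Largest k_neighbors for which k_neighbors + 1 <= size of the smallest class."""
    smallest = min(Counter(labels).values())
    return smallest - 1


# Toy multi-hot target with 4 label columns (hypothetical data, not the movie set).
y_toy = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
]

collapsed = collapse_multilabel(y_toy)
counts = Counter(collapsed)

# Column 3 is never the first label of any row, so it disappears as a class.
n_classes_after_collapse = len(counts)

# Random oversampling brings every collapsed class up to the majority size.
rows_after_oversampling = n_classes_after_collapse * max(counts.values())

print("collapsed labels:", collapsed)
print("classes left:", n_classes_after_collapse, "of", len(y_toy[0]))
print("rows after oversampling:", rows_after_oversampling)
print("largest usable k_neighbors:", max_safe_k_neighbors(collapsed))
```

Under that assumption the numbers in the question fit together: 52690 = 22 × 2395, which is what 22 collapsed classes oversampled to 2395 rows each would give, and the SMOTE error reports a collapsed class with only 2 samples while the default k_neighbors=5 asks for 6 neighbors. Renaming fit_sample to fit_resample on its own would probably not change either result; a smaller k_neighbors, or a resampling scheme that handles several labels per row, looks like the thing to try.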