Scikit-learn: Investigating incorrectly classified data

Question

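Context. My workflow looks like this:

Start with a bunch of JSON blobs, the raw data. This is a Pandas DataFrame.

Extract the pieces relevant for modeling and call that data. This is a Pandas DataFrame.

Train a forest on the training data, X_train, which is a large sparse matrix.
Get the indices of the false positive and false negative results. These are indices into X_test, a sparse matrix.
Pass an array of indices to the train_test_split() function, which applies the same randomizer to the index array and returns it split into train and test indices (idx_test).
Get the indices for the false positives and false negatives; these are NumPy ndarrays.

Use these to look up the original location in the index array, e.g. index = idx_test[false_example] for false_example in the false_neg array.

Use that index to look up the original data: data.iloc[index] is the original data.

data.index[index] will return the index value into the raw data, if needed.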

Here is the associated code for an example using tweets. Again, this works, but is there a more direct/smarter way to do this?

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    # tweet_df, tweet_fns and state are defined elsewhere

    # take a sample of our original data
    data = tweet_df[0:100]['texts']
    y = tweet_df[0:100]['truth']

    # create the feature vectors
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    X = vec.fit_transform(data)  # this is now the feature matrix

    # split the feature matrix into train/test subsets, keeping the indices
    # back into the original X using the array indices
    indices = np.arange(X.shape[0])
    X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
        X, y, indices, test_size=0.2, random_state=state)

    # fit and test a model
    forest = RandomForestClassifier()
    forest.fit(X_train, y_train)
    predictions = forest.predict(X_test)

    # get the indices for false negatives and false positives in the test set
    false_neg, false_pos = tweet_fns.check_predictions(predictions, y_test)

    # map the false negative indices in the test set (which is features)
    # back to its original data (text)
    print("False negatives: \n")
    pd.options.display.max_colwidth = 140
    for i in false_neg:
        original_index = idx_test[i]
        print(data.iloc[original_index])

And the check_predictions function:

    import numpy as np

    def check_predictions(predictions, truth):
        # take a 1-dim array of predictions from a model, and a 1-dim truth
        # vector and calculate similarity
        # returns the indices of the false negatives and false positives
        # in the predictions.
        # np.asarray drops any pandas index so the comparison is positional
        truth = np.asarray(truth).astype(bool)
        predictions = np.asarray(predictions).astype(bool)
        matches = np.sum(predictions == truth)
        print(matches, 'of', len(truth), 'or', matches / len(truth), 'match')

        # false positives
        print("false positives:", np.sum(predictions & ~truth))
        # false negatives
        print("false negatives:", np.sum(~predictions & truth))

        false_neg = np.nonzero(~predictions & truth)  # these are tuples of arrays
        false_pos = np.nonzero(predictions & ~truth)
        return false_neg[0], false_pos[0]  # we just want the arrays to return
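
The index bookkeeping in the workflow above can be checked on a toy example. This is only a sketch: the arrays `raw`, `X` and `y` are made-up stand-ins for the tweets, the feature matrix and the labels, and the "predictions" are faked by flipping one test label rather than coming from a real model.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy stand-ins for the raw texts and labels.
raw = np.array(["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
X = np.arange(20).reshape(10, 2)  # placeholder feature matrix, one row per text

# Split an index array alongside X and y so it receives the same shuffle.
indices = np.arange(X.shape[0])
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, test_size=0.3, random_state=0)

# Fake predictions: flip the first test label so there is exactly one error.
predictions = y_test.copy()
predictions[0] = 1 - predictions[0]

# Positions within the test set where prediction and truth disagree.
wrong = np.flatnonzero(predictions != y_test)

# Map test-set positions back to rows of the original data.
original_rows = idx_test[wrong]
print(raw[original_rows])
```

Because `idx_test` is shuffled together with `X` and `y`, `X[idx_test]` reproduces `X_test` row for row, which is what makes the lookup back into the original data valid.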

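Answer 1

raw data -> features -> split -> train -> predict -> error analysis on the labels

There is a row-for-row correspondence between the prediction matrix and the feature matrix, so if you want to do error analysis on the features, there should be no problem. If you want to see which raw data is associated with the errors, you have to either do the split on the raw data or keep track of which data rows mapped to which test rows (your current approach).

The first option looks like:

fit the transformer on the raw data -> split the raw data -> transform train/test separately -> train/test -> ...

That is, it uses fit before splitting and transform after splitting, so your raw data ends up split the same way as the labels.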
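
A minimal sketch of that first option, assuming a small made-up corpus (`texts`, `truth`) in place of the tweets and reusing the vectorizer settings from the question. The forest parameters are arbitrary choices for the illustration, not something taken from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the tweets and their labels.
texts = np.array(["good day", "bad day", "great fun", "awful mess",
                  "nice work", "poor show", "lovely sun", "nasty rain",
                  "happy news", "sad news"])
truth = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Fit the transformer on the raw data, before splitting.
vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
vec.fit(texts)

# Split the raw data, so texts and labels stay aligned row for row.
text_train, text_test, y_train, y_test = train_test_split(
    texts, truth, test_size=0.3, random_state=0)

# Transform train and test separately with the already fitted transformer.
X_train = vec.transform(text_train)
X_test = vec.transform(text_test)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)

# Boolean masks over the test rows index the raw test texts directly.
false_neg = (predictions == 0) & (y_test == 1)
false_pos = (predictions == 1) & (y_test == 0)
print("False negatives:", text_test[false_neg])
print("False positives:", text_test[false_pos])
```

No separate index array is needed here, because `text_test`, `X_test`, `y_test` and `predictions` all share the same row order. Fitting the vectorizer only on the training split is a common variant, and the row alignment works the same way.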

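This question is exactly what I have been thinking about, but I would like to take it a step further. Once we have all the "raw data" indices of the mislabeled examples, what do we do with them? What kinds of things can be done to analyze why something was mislabeled?

There is plenty of information on how to score model performance, but what is the next level of troubleshooting? How do we start analyzing why things were mislabeled?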
