Question

Then I called .toarray() on X_train and got the following:

TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

From a previous question I learned that I need to reduce the dimensionality of the numpy array, so I did the following:

from sklearn.decomposition.truncated_svd import TruncatedSVD        
pca = TruncatedSVD(n_components=300)                                
X_reduced_train = pca.fit_transform(X_train)               

from sklearn.ensemble import RandomForestClassifier                 
classifier=RandomForestClassifier(n_estimators=10)                  
classifier.fit(X_reduced_train, y_train)                            
prediction = classifier.predict(X_testing)

Then I got this exception:

  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X)
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 192, in __len__
    raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

I tried the following:

prediction = classifier.predict(X_train.getnnz())

And got:

  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X)
TypeError: object of type 'int' has no len()

Two questions arise from this: how do I use a random forest to classify correctly, and what is happening with X_train?

Then I tried the following:

df = pd.read_csv('/path/file.csv',
header=0, sep=',', names=['id', 'text', 'label'])



X = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values



from sklearn.decomposition.truncated_svd import TruncatedSVD
pca = TruncatedSVD(n_components=2)
X = pca.fit_transform(X)

a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier

classifier=RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)

from sklearn.metrics.metrics import precision_score, recall_score, confusion_matrix, classification_report
print '\nscore:', classifier.score(a_test, b_test)
print '\nprecision:', precision_score(b_test, prediction)
print '\nrecall:', recall_score(b_test, prediction)
print '\n confusion matrix:\n', confusion_matrix(b_test, prediction)
print '\n classification report:\n', classification_report(b_test, prediction)

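Answer 1

I don't know much about sklearn, though I vaguely recall some earlier issues triggered by the use of sparse matrices. Internally, some matrices had to be replaced with m.toarray() or m.todense().

But to give you an idea of what the error message is about, consider a dense 2D array A and a sparse matrix M built from it.

len() is usually used in Python to count the number of first-level items in a list. Applied to a 2D array, it gives the number of rows. But A.shape[0] is a better way of counting rows, and M.shape[0] gives the same result. In this case you are not interested in .getnnz, which is the number of nonzero entries in a sparse matrix. A does not have this method, though the count can be derived from A.nonzero().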
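The difference described above can be reproduced with a few lines. This is a minimal sketch, assuming numpy and scipy are installed; the array contents are hypothetical toy data, not taken from the question.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: a small dense array and its sparse counterpart.
A = np.array([[1, 0, 1],
              [0, 0, 1]])
M = sparse.csr_matrix(A)

print(len(A))        # number of rows of the dense array
print(A.shape[0])    # the same count, stated explicitly
print(M.shape[0])    # shape[0] also works on the sparse matrix

try:
    len(M)
except TypeError as exc:
    # scipy refuses to guess whether "length" means rows or stored entries
    print("TypeError:", exc)

print(M.getnnz())            # number of stored nonzero entries
print(len(A.nonzero()[0]))   # the same count, derived from the dense array

dense = M.toarray()  # plain ndarray, so len() is well defined again
print(len(dense))
```

Passing M.getnnz() to predict cannot work, because it is a single integer rather than a matrix of samples.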

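Answer 2

It's a bit unclear whether you are passing the same data structure (type and shape) to the classifier's fit method and its predict method. Random forests take a long time to run with a large number of features, hence the suggestion to reduce the dimensionality in the post you link to.

You should apply the SVD to both the training and the test data, so that the classifier is trained on data of the same shape as the data you want to predict on. Check that the input to fit and the input to predict have the same number of features, and that both are arrays rather than sparse matrices.

Update, with an example that uses a DataFrame: note that the SVD happens before the split into training and test sets, so the array passed to predict has the same number of features as the array that fit was called on.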
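The advice above might look roughly like this. It is only a sketch under stated assumptions: the toy corpus and labels are hypothetical stand-ins for df['text'] and df['label'], the variable names are made up for illustration, and the imports use the current scikit-learn module paths rather than the older ones shown in the question.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for df['text'] and df['label'].
texts = [
    "cheap pills buy now", "limited offer buy cheap", "win money now",
    "cheap money offer", "meeting at noon tomorrow", "project report attached",
    "lunch with the team", "notes from the meeting", "buy pills win money",
    "team report for the project", "offer ends now", "tomorrow lunch notes",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0])

X_sparse = TfidfVectorizer().fit_transform(texts)  # scipy sparse matrix

# Reduce dimensionality first; fit_transform returns a dense ndarray.
svd = TruncatedSVD(n_components=2, random_state=42)
X = svd.fit_transform(X_sparse)

# Split after the reduction so both parts have the same columns.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42)

clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)  # same type and feature count as in fit

print(X_train.shape, X_test.shape)
print(prediction)
```

A common variation is to split first, call fit_transform on the training part only, and then call svd.transform on the test part. That also gives both arrays the same number of features, and it keeps the test data out of the decomposition.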

I ran into the same problem with OneHotEncoder and solved it by using sparse_output=False.
