I'm stumped by a Python/scikit-learn/Pipelines problem. I'm getting an error indicating that the shape of the data coming through the pipeline is not what is expected.
The specific error:
blocks[0,:] has incompatible row dimensions. Got blocks[0,6].shape[0] == 4, expected 794.
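As I understand it, FeatureUnion horizontally stacks the output of each of its branches, so every branch has to return one row per input sample; a toy illustration with made-up shapes (not my actual data) of how a row-count mismatch triggers this message:

import numpy as np
from scipy import sparse

left = sparse.csr_matrix(np.ones((794, 3)))   # branch with one row per sample
right = sparse.csr_matrix(np.ones((4, 2)))    # branch whose output collapsed to 4 rows
sparse.hstack([left, right])                  # raises ValueError: incompatible row dimensions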
What I'm sending to TFIDF:
X['Subject'].head()
5 FW: Customer PO 345 \\ HAC 73054 and 7345
8 Insured return request, o# 35693
10 Issue with a new Feature - QAR
13 FW: ABC / TSS PO catchup
15 WTM request - 1deaSe sales orders for CDE PO TSSe9-1r9
The relevant code is below:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score

# FeatureSelector selects a list of columns from a data
# frame and returns them to a pipeline step for processing
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys, description=""):
        self.keys = keys
        self.description = description

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        print("Keys out", self.keys)
        return X[self.keys]

# Custom transformer to help with debugging.
class Debugger(BaseEstimator, TransformerMixin):
    def __init__(self, stepName=""):
        self.stepName = stepName

    def transform(self, data):
        print("Step Name", self.stepName)
        print("Contents of data", data.columns)
        print("Shape of Pre-processed Data:", data.shape)
        #print(pd.DataFrame(data).head())
        return data

    def fit(self, data, y=None, **fit_params):
        # No need to fit anything, because this is not an actual transformation.
        return self
y = df['Assigned To']
X = df
# Construct Pipelines
ohe_pipe = Pipeline([
    ("feature_selector", FeatureSelector(df.select_dtypes(exclude=['int64', 'object']).columns.to_list())),
    ('feature selector debugger', Debugger()),
    ("ohe", OneHotEncoder()),
    ('ohe debugger', Debugger())],
    verbose=True)

text_pipe = Pipeline([
    ("feature_selector", FeatureSelector(df.select_dtypes(include='object').columns.to_list())),
    ('feature selector debugger', Debugger()),
    ("tfidf_vectorizer", TfidfVectorizer()),
    ('tfidf debugger', Debugger())],
    verbose=True)

knn_pipe = Pipeline([
    ("feature_union", FeatureUnion([
        ("ohe_pipe", ohe_pipe),
        ("text_pipe", text_pipe)
    ])),
    ("classifier", KNeighborsClassifier())
])

knn_grid = GridSearchCV(
    estimator=knn_pipe,
    param_grid={
        'classifier__n_neighbors': [x for x in range(5, 20, 1) if x % 2 != 0],
        'classifier__weights': ['distance'],
        'classifier__leaf_size': range(20, 40, 1),
        'feature_union__text_pipe__tfidf_vectorizer__min_df': [.01, .02, .03, .04, .05],
        'feature_union__text_pipe__tfidf_vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'feature_union__text_pipe__tfidf_vectorizer__stop_words': ['english'],
        'feature_union__text_pipe__tfidf_vectorizer__lowercase': [True],
        'feature_union__ohe_pipe__ohe__sparse_output': [False]
    },
    scoring={'accuracy': make_scorer(accuracy_score)
             # ,
             # 'f1': make_scorer(f1_score),
             # 'precision': make_scorer(precision_score),
             # 'recall': make_scorer(recall_score),
             # 'roc_auc': make_scorer(roc_auc_score)
             },
    cv=3,
    refit='accuracy',
    error_score='raise')
knn_grid.fit(X, y)
From my debug trace, you can see (shown in bold below) that the TFIDF step returns a shape of (1, 1), not (529, 1). If I run all of these steps, including TFIDF, outside the pipeline, TFIDF returns (529, 1). I want to use pipelines so that I can use grid search.
Keys out ['Issue Classification', 'Application', 'Case Submitter']
[Pipeline] .. (step 1 of 4) Processing feature_selector, total= 0.0s
Step Name
Shape of Pre-processed Data: (529, 3)
[Pipeline] (step 2 of 4) Processing feature selector debugger, total= 0.0s
[Pipeline] ............... (step 3 of 4) Processing ohe, total= 0.0s
Step Name
Shape of Pre-processed Data: (529, 66)
[Pipeline] ...... (step 4 of 4) Processing ohe debugger, total= 0.0s
Keys out ['Subject']
[Pipeline] .. (step 1 of 4) Processing feature_selector, total= 0.0s
Step Name
Shape of Pre-processed Data: (529, 1)
[Pipeline] (step 2 of 4) Processing feature selector debugger, total= 0.0s
[Pipeline] .. (step 3 of 4) Processing tfidf_vectorizer, total= 0.0s
Step Name
**Shape of Pre-processed Data: (1, 1)
[Pipeline] .... (step 4 of 4) Processing tfidf debugger, total= 0.0s**
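To isolate the mismatch without running the whole grid search, I can fit the text branch on its own; a sketch, assuming the same X and text_pipe defined above:

# Fitting just the text branch reproduces the unexpected (1, 1) shape.
print(text_pipe.fit_transform(X).shape)                     # (1, 1)

# The standalone call that works outside the pipeline: one row per document.
print(TfidfVectorizer().fit_transform(X['Subject']).shape)  # (529, n_terms)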
I think I've figured it out. After following the source definition of TfidfVectorizer, which inherits from CountVectorizer (if that's the right way to put it in Python), I was able to drop some print statements into the transform and _count_vocab methods of TfidfVectorizer and CountVectorizer, respectively.
I could see that while my dataframe column was successfully passed to the TfidfVectorizer transform method, then on to the CountVectorizer transform method, and then into the CountVectorizer _count_vocab method, the loop in _count_vocab could not iterate over that dataframe column the way it expects to. The relevant excerpt from that implementation is below:
#print("REMOVE****** RawDocs in _count_vocab->", raw_documents)
values = _make_int_array()
indptr.append(0)
for doc in raw_documents:
#MY ERROR Showed up Here, this for loop only iterated once
#print("REMOVE****** Individual Doc in _count_vocab->", doc)
feature_counter = {}
for feature in analyze(doc):
try:
feature_idx = vocabulary[feature]
if feature_idx not in feature_counter:
feature_counter[feature_idx] = 1
else:
feature_counter[feature_idx] += 1
except KeyError:
# Ignore out-of-vocabulary items for fixed_vocab=True
continue
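The loop body ran only once because of how pandas objects iterate: iterating over a DataFrame yields its column labels, while iterating over a Series yields its values, and _count_vocab treats each item it is handed as one document. A quick demonstration with a stand-in two-row frame:

import pandas as pd

frame = pd.DataFrame({'Subject': ['FW: Customer PO 345', 'Issue with a new Feature - QAR']})

# The one-column DataFrame that FeatureSelector returns iterates over its
# column label, so the vectorizer sees a single "document" named 'Subject'.
print(list(frame))             # ['Subject']

# The Series iterates over one string per row, which is what the for-loop expects.
print(list(frame['Subject']))  # ['FW: Customer PO 345', 'Issue with a new Feature - QAR']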
The solution was to modify my FeatureSelector to include an option to return an iterable list rather than a dataframe column.
I don't know how I was able to pass a dataframe column when not using the pipeline, but this seems to work inside the pipeline.
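A minimal sketch of that option, using an illustrative as_list flag (the exact code I ended up with isn't shown here) and assuming the text branch selects a single column:

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys, description="", as_list=False):
        self.keys = keys
        self.description = description
        self.as_list = as_list  # illustrative flag: return a 1-D iterable of strings

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if self.as_list:
            # Hand back one string per row so the vectorizer iterates over
            # documents rather than over column labels.
            return X[self.keys[0]].tolist()
        # Default behaviour: return the selected columns as a DataFrame.
        return X[self.keys]

With that in place, the selector in text_pipe would be constructed with as_list=True, while ohe_pipe keeps the DataFrame behaviour.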