Data shape problem in a scikit-learn Pipeline that uses TF-IDF


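I'm stumped by a Python/scikit-learn Pipeline problem. I get an error indicating that the shape of the data, as it passes through the pipeline, is not what is expected.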

The specific error:

blocks[0,:] has incompatible row dimensions. Got blocks[0,6].shape[0] == 4, expected 794.
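
For context, a minimal sketch of how this kind of error can arise. As far as I know, FeatureUnion places the outputs of its branches side by side (via scipy.sparse.hstack when any output is sparse), so every branch has to return the same number of rows. The matrices below are stand-ins built for illustration, not the real pipeline outputs, and the exact wording of the exception may differ between SciPy versions.

```python
import numpy as np
from scipy import sparse

# Stand-ins (assumed shapes, taken from the debug trace further down):
# one branch yields 529 rows, the other only 1 row.
ohe_like = sparse.csr_matrix(np.ones((529, 66)))
tfidf_like = sparse.csr_matrix(np.ones((1, 1)))

try:
    # Horizontal stacking requires matching row counts.
    sparse.hstack([ohe_like, tfidf_like])
    stacked_ok = True
except ValueError as exc:
    stacked_ok = False
    print("hstack failed:", exc)
```

So a branch that collapses to a single row is enough to make the union fail, even if the other branch is fine.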

What I am sending to TFIDF:

X['Subject'].head()
5          FW: Customer PO 345 \\ HAC 73054 and 7345
8                            Insured return request, o# 35693
10                             Issue with a new Feature - QAR
13                                       FW: ABC / TSS PO catchup
15    WTM request - 1deaSe sales orders for CDE PO TSSe9-1r9

The relevant code is below:

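from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder

# FeatureSelector selects a list of columns from a DataFrame
# and returns them to a pipeline step for processing.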
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys, description = ""):
        self.keys = keys
        self.description = description
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        print("Keys out", self.keys)
        return X[self.keys]

# Custom transformer to help with debugging. 
class Debugger(BaseEstimator, TransformerMixin):

    def __init__(self, stepName = ""):
        self.stepName = stepName

    def transform(self, data): 
        print("Step Name", self.stepName)
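        # Only DataFrames have .columns; encoder/vectorizer outputs are arrays or sparse matrices.
        print("Contents of data", getattr(data, "columns", None))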
        print("Shape of Pre-processed Data:", data.shape)
        #print(pd.DataFrame(data).head())
        return data

    def fit(self, data, y=None, **fit_params):
        # No need to fit anything, because this is not an actual  transformation. 
        return self


y = df['Assigned To']
X = df

# Construct Pipelines
ohe_pipe = Pipeline([
    ("feature_selector", FeatureSelector(df.select_dtypes(exclude=['int64','object']).columns.to_list())),
    ('feature selector debugger', Debugger()),
    ("ohe", OneHotEncoder()),
    ('ohe debugger', Debugger())],
    verbose = True)
text_pipe = Pipeline([
    ("feature_selector", FeatureSelector(df.select_dtypes(include='object').columns.to_list())),
    ('feature selector debugger', Debugger()),
    ("tfidf_vectorizer", TfidfVectorizer()),
    ('tfidf debugger', Debugger())], 
    verbose = True)

knn_pipe = Pipeline([
    ("feature_union", FeatureUnion([
        ("ohe_pipe", ohe_pipe),
        ("text_pipe", text_pipe)
        ])),
    ("classifier", KNeighborsClassifier())
    ])

knn_grid = GridSearchCV(
                     estimator=knn_pipe,
                     param_grid = {'classifier__n_neighbors': [x for x in range(5, 20, 1) if x % 2 != 0],
                                   'classifier__weights': ['distance'],
                                   'classifier__leaf_size': range(20,40,1),
                                   'feature_union__text_pipe__tfidf_vectorizer__min_df': [.01, .02, .03, .04, .05],
                                   'feature_union__text_pipe__tfidf_vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)], 
                                   'feature_union__text_pipe__tfidf_vectorizer__stop_words': ['english'],
                                   'feature_union__text_pipe__tfidf_vectorizer__lowercase': [True],
                                   'feature_union__ohe_pipe__ohe__sparse_output': [False]
                                   },
                     scoring = {'accuracy': make_scorer(accuracy_score)
                                # ,
                                # 'f1': make_scorer(f1_score),
                                # 'precision': make_scorer(precision_score),
                                # 'recall': make_scorer(recall_score),
                                # 'roc_auc': make_scorer(roc_auc_score)
                                },
                     cv = 3, 
                     refit = 'accuracy', 
                     error_score='raise')
knn_grid.fit(X, y)

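From my debug trace you can see (in bold below) that the TFIDF step returns a shape of (1, 1) rather than (529, 1). If I run all of these steps (including TFIDF) outside the pipeline, TFIDF returns (529, 1). I want to use a pipeline so that I can use the grid search functionality.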

Keys out ['Issue Classification', 'Application', 'Case Submitter']
[Pipeline] .. (step 1 of 4) Processing feature_selector, total=   0.0s
Step Name 
Shape of Pre-processed Data: (529, 3)
[Pipeline]  (step 2 of 4) Processing feature selector debugger, total=   0.0s
[Pipeline] ............... (step 3 of 4) Processing ohe, total=   0.0s
Step Name 
Shape of Pre-processed Data: (529, 66)
[Pipeline] ...... (step 4 of 4) Processing ohe debugger, total=   0.0s
Keys out ['Subject']
[Pipeline] .. (step 1 of 4) Processing feature_selector, total=   0.0s
Step Name 
Shape of Pre-processed Data: (529, 1)
[Pipeline]  (step 2 of 4) Processing feature selector debugger, total=   0.0s
[Pipeline] .. (step 3 of 4) Processing tfidf_vectorizer, total=   0.0s
Step Name 
**Shape of Pre-processed Data: (1, 1)
[Pipeline] .... (step 4 of 4) Processing tfidf debugger, total=   0.0s**
python machine-learning scikit-learn pipeline tf-idf
1 Answer

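I think I have figured it out. After following the source definition of TfidfVectorizer, which inherits from CountVectorizer (is that how you say it in Python?), I was able to put some print statements into the transform and _count_vocab methods of TFIDF and CV respectively.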

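I saw that my DataFrame column was passed successfully to the TFIDF transform method, then to the CV transform method, and then to the CV _count_vocab method. But when _count_vocab went to iterate over that DataFrame column, it could not. The code for that implementation is below: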

#print("REMOVE****** RawDocs in _count_vocab->", raw_documents)
values = _make_int_array()
indptr.append(0)
for doc in raw_documents:
    #MY ERROR Showed up Here, this for loop only iterated once
    #print("REMOVE****** Individual Doc in _count_vocab->", doc)
    feature_counter = {}
    for feature in analyze(doc):
        try:
            feature_idx = vocabulary[feature]
            if feature_idx not in feature_counter:
                feature_counter[feature_idx] = 1
            else:
                feature_counter[feature_idx] += 1
        except KeyError:
            # Ignore out-of-vocabulary items for fixed_vocab=True
            continue

The solution was to modify my FeatureSelector to include an option that returns an iterable list instead of a DataFrame column.
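
A minimal sketch of what such a modification could look like. The `as_list` flag is a hypothetical name introduced here for illustration (the modified class was not posted), and the three-row frame is toy data borrowed from the subjects shown in the question.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Select columns; optionally return one text column as a plain list."""

    def __init__(self, keys, description="", as_list=False):
        self.keys = keys
        self.description = description
        self.as_list = as_list  # hypothetical option, not from the original post

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        selected = X[self.keys]
        if not self.as_list:
            return selected
        if isinstance(selected, pd.DataFrame):
            if selected.shape[1] != 1:
                raise ValueError("as_list=True expects exactly one column")
            # Squeeze the single-column frame down to a Series.
            selected = selected.iloc[:, 0]
        # One string per row, which is what the vectorizer iterates over.
        return selected.astype(str).tolist()


# Toy data for illustration only.
toy = pd.DataFrame({"Subject": [
    "Insured return request, o# 35693",
    "Issue with a new Feature - QAR",
    "FW: ABC / TSS PO catchup",
]})

text_pipe = Pipeline([
    ("feature_selector", FeatureSelector(["Subject"], as_list=True)),
    ("tfidf_vectorizer", TfidfVectorizer()),
])
tfidf_matrix = text_pipe.fit_transform(toy)
print(tfidf_matrix.shape)  # one row per subject line
```

With the list output, the vectorizer sees one document per row, so the row count matches the other branch of the FeatureUnion.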

I don't know how I was able to pass a DataFrame column when not using the pipeline, but this does seem to work inside the pipeline.
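
A likely explanation, sketched below on toy data: indexing with a list of keys (as FeatureSelector does with `X[self.keys]`) returns a DataFrame, and iterating a DataFrame yields its column labels, so the vectorizer sees a single "document" named Subject. Indexing with a single label (`X['Subject']`, as in the standalone run) returns a Series, and iterating a Series yields the row values. That the standalone run used the Series form is an assumption based on the `X['Subject'].head()` call in the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data for illustration only.
toy = pd.DataFrame({"Subject": [
    "Insured return request, o# 35693",
    "Issue with a new Feature - QAR",
    "FW: ABC / TSS PO catchup",
]})

as_frame = toy[["Subject"]]   # list of keys -> DataFrame, shape (3, 1)
as_series = toy["Subject"]    # single key   -> Series, shape (3,)

# Iterating a DataFrame yields column labels; iterating a Series yields values.
frame_docs = list(as_frame)
series_docs = list(as_series)

frame_shape = TfidfVectorizer().fit_transform(as_frame).shape
series_shape = TfidfVectorizer().fit_transform(as_series).shape

print(frame_docs, frame_shape)
print(len(series_docs), series_shape)
```

On this toy frame the DataFrame input produces a (1, 1) matrix, the same shape seen in the debug trace, while the Series input produces one row per subject line.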
