How do I run cross-validation and grid search when I have a custom ensemble model in a Python pipeline?

Question

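I'm building a custom ensemble model and want to run cross-validation and grid search on it with a pipeline in Python. How should I go about it?

I have a dataset of web page content. What I want to do is:

1. Split the content of each web page into two parts. The reason for the split is that the text comes from different locations on the page, and I want to process the parts separately.
2. Train model 1 using only the features from part 1, and train model 2 using only the features from part 2.
3. Say model 1 gives me a score S1 and model 2 gives me a score S2. I then train another model, a logistic regression, to combine the two scores into a final score S.

Is there a way to run cross-validation and grid search over the whole process using the ML pipeline in sklearn?

I appreciate Dev's reply below, but when I tried to do the same thing I ran into a new problem. My code is as follows: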

data = pd.DataFrame(columns = ['landingVector', 'contentVector', 'label'])

def extractLandingData(X):
        return X['landingVector']

def extractContentData(X):
        return X['contentVector']



svm_landing = Pipeline([
    ("extractLanding", FunctionTransformer(extractLandingData)),
    ("svmLanding", SVC(random_state=0, class_weight='balanced', kernel='linear', probability=True)),
])
svm_content = Pipeline([
    ("extractContent", FunctionTransformer(extractContentData)),
    ("svmContent", SVC(random_state=0, class_weight='balanced', kernel='linear', probability=True)),
])

stage_pipeline = FeatureUnion([
    ("svmForLanding", svm_landing),
    ("svmForContent", svm_content),
])

full_pipeline = Pipeline([
    ("stagePipeline", stage_pipeline),
    ("lr", LogisticRegression())
])

params = [
    {
        "stagePipeline__svmForLanding__svmLanding__C": [3,5,10],
        "full_pipeline__lr__C": [1, 5, 10],
        "full_pipeline__lr__penalty": ['l1', 'l2']
    }
]

grid_search = GridSearchCV(full_pipeline, params, cv=3, verbose=3, return_train_score=True, n_jobs=-1)
X_train = df[['landingVector', 'contentVector']]
y_train = df['label']
grid_search.fit(X_train, y_train)

Then I get this error message:

--------------------------------------------------------------------------- 
TypeError                                 Traceback (most recent call last)
<ipython-input-15-d8c4e61918e3> in <module>
     23 stage_pipeline = FeatureUnion([
     24     ("svmForLanding", svm_landing),
---> 25     ("svmForContent", svm_content),
     26 ])
     27 

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in __init__(self, transformer_list, n_jobs, transformer_weights)
    672         self.n_jobs = n_jobs
    673         self.transformer_weights = transformer_weights
--> 674         self._validate_transformers()
    675 
    676     def get_params(self, deep=True):

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _validate_transformers(self)
    716                 raise TypeError("All estimators should implement fit and "
    717                                 "transform. '%s' (type %s) doesn't" %
--> 718                                 (t, type(t)))
    719 
    720     def _iter(self):

TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None,
     steps=[('extractLanding', FunctionTransformer(accept_sparse=False, check_inverse=True,
          func=<function extractLandingData at 0x7fdc48700950>,
          inv_kw_args=None, inverse_func=None, kw_args=None,
          pass_y='deprecated', validate=None)), ('svmLanding', SVC(C=1.0, cache_size=200...inear', max_iter=-1, probability=True, random_state=0,   shrinking=True, tol=0.001, verbose=False))])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't

Answer 1

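Suppose you split your ensemble into two stages: 1. the first-stage models, i.e. model 1 and model 2; 2. a logistic regression model built on the outputs of the first-stage models.

You can then use GridSearchCV on the first stage to find the best parameters. GridSearchCV uses cross-validation internally, and its `cv` parameter sets the number of folds, so the best parameters are chosen by their scores across the different folds of the data.

For the stage 2 model, i.e. the logistic regression, you don't really need to run GridSearchCV. You can still use `cross_val_score` to compute scores on different subsets of the data.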
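The two stages described above can be sketched roughly as follows. This is only an illustration under assumptions that are not in the question: the data is synthetic (`make_classification`), the two "parts" are simply the first and last five columns, and the helper `tune_stage1` is a made-up name. It also uses `cross_val_predict` to produce out-of-fold S1/S2 scores, so that the logistic regression is not evaluated on scores the SVMs produced for rows they were trained on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the two feature blocks (assumption: 5 columns each).
X, y = make_classification(n_samples=120, n_features=10, random_state=0)
X_part1, X_part2 = X[:, :5], X[:, 5:]


def tune_stage1(X_part, y):
    """Grid-search C for a linear SVC on one feature block (hypothetical helper)."""
    search = GridSearchCV(SVC(kernel="linear", random_state=0), {"C": [0.1, 1, 10]}, cv=3)
    search.fit(X_part, y)
    return search.best_estimator_


model1 = tune_stage1(X_part1, y)
model2 = tune_stage1(X_part2, y)

# Out-of-fold scores: each S1/S2 value comes from a clone that did not see that row.
s1 = cross_val_predict(model1, X_part1, y, cv=3, method="decision_function")
s2 = cross_val_predict(model2, X_part2, y, cv=3, method="decision_function")
stage2_features = np.column_stack([s1, s2])

# Stage 2: cross-validated accuracy of the logistic regression on (S1, S2).
stage2_scores = cross_val_score(LogisticRegression(), stage2_features, y, cv=3)
print(stage2_scores.mean())
```

Because the stage 1 hyperparameters are chosen on the same data, the stage 2 estimate is probably a little optimistic; a held-out test set would give a cleaner number.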

Answer 2

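Yes, you can use GridSearchCV or RandomizedSearchCV to find the best hyperparameters for your pipeline model.

You can define the model as a combination of sequential and parallel pipelines.
Then you can use the final pipeline in GridSearchCV.
In grid_params, you refer to the hyperparameters of each inner pipeline by joining the pipeline names with a double underscore, "__".

Please see the example below, which is a case similar to yours. Look at how the pipelines are chained and how the hyperparameters of pipeline items are referenced in grid_params.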

email_body_to_wordcount = Pipeline([
    ("convert_to_text", MapTransformer(email_to_text)),
    ("strip_html", MapTransformer(strip_html)),
    ("replace_urls", MapTransformer(replace_urls)),
    ("replace_numbers", MapTransformer(replace_numbers)),
    ("replace_non_word_characters", MapTransformer(replace_non_word_characters)),
    ("count_word_stem", CountStemmedWord()),   
], memory="cache")

subject_to_wordcount =  Pipeline([

    ("process_text", Pipeline([
        ("get_subject", MapTransformer(get_email_subject)),
        ("replace_numbers", MapTransformer(replace_numbers)),
        ("replace_non_word_characters", MapTransformer(replace_non_word_characters)),
    ], memory="cache")),
    ("count_word_stem", CountStemmedWord(importance=5)),
])

email_to_word_count = FeatureUnion([
    ("email_to_wordcount", email_body_to_wordcount),
    ("subject_to_wordcount", subject_to_wordcount)
])

content_type_pipeline = Pipeline([
   ("get_content_type", MapTransformer(email.message.EmailMessage.get_content_type)),
    ("binarize", LblBinarizer())
])

email_len_transform = Pipeline([
    ("convert_to_text", MapTransformer(email_to_text)),
    ("get_email_len", MapTransformer(len)),

])

email_to_word_vector = Pipeline([
    ("email_to_word_count", email_to_word_count),
    ("word_count_to_vector", WordCountsToVector())
])

full_pipeline = FeatureUnion([
    ("email_to_word_vector", email_to_word_vector),
    ("content_type_pipeline", content_type_pipeline),
    ("email_len_transform", email_len_transform)
])

predict_pipeline = Pipeline([
    ("full_pipeline", full_pipeline),
    ("predict", RandomForestClassifier(n_estimators = 5))
])

params = [
    {
        "full_pipeline__email_to_word_vector__email_to_word_count__email_to_wordcount" +
        "__count_word_stem__importance": [3,5],
        "full_pipeline__email_to_word_vector" +
        "__word_count_to_vector__vocabulary_len": [500,1000,1500]
    }
]

grid_search = GridSearchCV(predict_pipeline, params, cv=3, verbose=3, return_train_score=True)
grid_search.fit(X_train, y_train)
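If you are unsure which double-underscore names are valid for a nested pipeline, `get_params()` lists them. The sketch below uses a small made-up pipeline (the step names `features`, `scaled`, `pca`, `reduce` and `clf` are illustrative only, not from the code above). Note that only step names appear in the path; the Python variable that holds the pipeline does not.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Toy nested pipeline: two parallel branches feeding a classifier.
features = FeatureUnion([
    ("scaled", Pipeline([("scale", StandardScaler())])),
    ("pca", Pipeline([("reduce", PCA(n_components=2))])),
])
model = Pipeline([("features", features), ("clf", LogisticRegression())])

# Every key returned here is a legal key for a GridSearchCV parameter grid.
valid_names = sorted(model.get_params().keys())
for name in valid_names:
    if name.endswith("__C") or name.endswith("__n_components"):
        print(name)
```

Any key in a parameter grid that is not in this list makes GridSearchCV fail with an invalid-parameter error, so printing the list first is a cheap sanity check.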

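Edit: Pipeline calls the fit and transform methods on its transformers, so your transformers need to implement both. You can implement a custom transformer like the one below and use it in place of the SVC classifier.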

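from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.svm import SVC

class CustomTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y):
        # Initialize your SVC here, then train it; fit has to fit the wrapped model.
        self.svc = SVC()
        self.svc.fit(X, y)
        return self

    def transform(self, X, y=None):
        # FeatureUnion stacks outputs column-wise, so return a 2-D column.
        return self.svc.predict(X).reshape(-1, 1)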
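Putting the pieces together, here is one possible end-to-end sketch, not a definitive implementation. Several things in it are assumptions: the wrapper class `ScoreTransformer` and the helpers `first_half`, `second_half` and `make_branch` are hypothetical names, the data is synthetic, and the two "parts" are just column slices of a NumPy array rather than DataFrame columns. The wrapper takes the classifier as a constructor argument so that GridSearchCV can reach its hyperparameters through `estimator__...`.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


class ScoreTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: exposes a classifier's decision scores via transform."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Fit a fresh clone so the estimator passed to __init__ stays untouched.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # One score column per branch; FeatureUnion concatenates the columns.
        return self.estimator_.decision_function(X).reshape(-1, 1)


def first_half(X):
    return X[:, :5]


def second_half(X):
    return X[:, 5:]


def make_branch(selector):
    """Select one feature block, then turn it into an SVM score column."""
    return Pipeline([
        ("select", FunctionTransformer(selector)),
        ("svm", ScoreTransformer(SVC(kernel="linear", random_state=0))),
    ])


full_pipeline = Pipeline([
    ("stage1", FeatureUnion([
        ("part1", make_branch(first_half)),
        ("part2", make_branch(second_half)),
    ])),
    ("lr", LogisticRegression()),
])

# Keys are built from step names only, joined with double underscores.
params = {
    "stage1__part1__svm__estimator__C": [0.1, 1],
    "stage1__part2__svm__estimator__C": [0.1, 1],
    "lr__C": [1, 10],
}

X, y = make_classification(n_samples=120, n_features=10, random_state=0)
grid_search = GridSearchCV(full_pipeline, params, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

One caveat with this design: within each training fold the logistic regression is trained on scores the SVMs produced for their own training rows. If that matters for your data, scikit-learn also ships `sklearn.ensemble.StackingClassifier`, which trains the final estimator on cross-validated predictions and may be worth a look.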
