Pipeline Module (API Reference)

This class is used to train different algorithms.

It contains only the training logic: preprocessing plus algorithm training, which produces trained models and dumps them to disk.

To get the models’ best score and predictions, the best process is to load the trained model from disk, then run transformation and prediction. For testing, first transform the data with the saved processor, then predict with the highest-scoring model. Three things are important here: 1. save the processor; 2. dump the trained data; 3. dump all trained models.
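The save-then-reload workflow above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the file names, the dictionary-based "processor" and "models", and the helpers dump, load, transform and predict are all hypothetical stand-ins.

```python
import pickle
import tempfile
from pathlib import Path


def dump(obj, path):
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)


def transform(processor, rows):
    # Hypothetical preprocessing: centre values on the fitted mean.
    return [r - processor["mean"] for r in rows]


def predict(model, rows):
    # Hypothetical model: a simple threshold classifier.
    return [int(r > model["threshold"]) for r in rows]


workdir = Path(tempfile.mkdtemp())
train = [1.0, 2.0, 3.0, 6.0]

# Training side: 1. save the processor, 2. dump the processed data,
# 3. dump every trained model together with its score.
processor = {"mean": sum(train) / len(train)}
dump(processor, workdir / "processor.pkl")
dump(transform(processor, train), workdir / "train_processed.pkl")
trained = {
    "low": {"threshold": -1.0, "score": 0.71},
    "high": {"threshold": 0.5, "score": 0.93},
}
for name, model in trained.items():
    dump(model, workdir / f"model_{name}.pkl")

# Test side: reload, transform with the saved processor, then predict
# with the highest-scoring model.
loaded_processor = load(workdir / "processor.pkl")
loaded_models = [load(p) for p in sorted(workdir.glob("model_*.pkl"))]
best = max(loaded_models, key=lambda m: m["score"])
predictions = predict(best, transform(loaded_processor, [0.0, 5.0]))
```

Keeping the processor on disk next to the models means test data goes through exactly the transformation that was fitted during training.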

@author: Guangqiang.lu

class automl.pipeline_training.ClassificationPipeline(backend=None, include_estimators=None, exclude_estimators=None, include_preprocessors=None, exclude_preprocessors=None, **kwargs)

Bases: automl.pipeline_training.PipelineTrain

Classification pipeline class that can be used as a pipeline; the ensemble logic should also happen here.

__getitem__(ind)

Returns a sub-pipeline or a single estimator in the pipeline

Indexing with an integer will return an estimator; using a slice returns another Pipeline instance which copies a slice of this Pipeline. This copy is shallow: modifying (or fitting) estimators in the sub-pipeline will affect the larger pipeline and vice-versa. However, replacing a value in step will not affect a copy.
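The indexing behaviour can be checked on a plain sklearn.pipeline.Pipeline, which is the base class documented further down. The step names "scale" and "clf" and the chosen estimators are illustrative assumptions; the pipeline classes of this module are not used here.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

first_step = pipe[0]   # integer index -> the estimator itself
sub_pipe = pipe[:1]    # slice -> a new Pipeline holding the same objects

# Fitting the step inside the slice also fits it in the original pipeline,
# because both pipelines reference the same StandardScaler instance.
sub_pipe[0].fit([[0.0], [2.0]])
shared_mean = pipe[0].mean_[0]
```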

__len__()

Returns the length of the Pipeline

build_preprocessing_pipeline(data=None)

The preprocessing pipeline is split from the training pipeline because the whole set of preprocessing steps is re-used when the data contains null values. The real pipeline is therefore split into 2 parts: preprocessing and training. After all steps finish, the processed data should be stored on disk so that it can be re-used.

This preprocessing instance also needs to be stored, possibly combined with the training pipeline instance.

One more variant without the processing steps should also be added, as the models may do better without processing than with it. 3 versions of the data should therefore be stored:

1: original data; 2: data processed with imputation; 3: data processed with the whole preprocessing.
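The three stored versions of the data can be illustrated on a single toy column. This is only a sketch: mean imputation and standardisation are assumed as stand-ins for the real preprocessing steps, and the dictionary keys are hypothetical.

```python
from statistics import mean, pstdev

raw = [2.0, None, 4.0, 6.0]

# 1: original data, kept untouched.
origin = list(raw)

# 2: data with imputation only (missing values replaced by the column mean).
observed = [v for v in raw if v is not None]
fill = mean(observed)
imputed = [fill if v is None else v for v in raw]

# 3: data after the whole preprocessing (imputation, then standardisation).
mu, sigma = mean(imputed), pstdev(imputed)
processed = [(v - mu) / sigma for v in imputed]

datasets = {"origin": origin, "imputed": imputed, "processed": processed}
```

Keeping all three makes it possible to train on each version and compare whether the preprocessing actually helps.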

Parameters

data

Returns

build_training_pipeline(y=None, use_neural_network=True)

The real pipeline steps should be built here. Child classes do the actual build with their different steps and add the step instances to the pipeline object. This should also be a lazy instantiation step that happens when the real fit logic runs, so that the steps can be modified based on the data.

Important things to notice:

Even with many algorithm instances, the first step should combine the processing step with each algorithm; the best-scoring models can then be selected and saved to disk.

They can then be loaded from disk and combined with the ensemble logic.

A model_selection module has been created to get the best models based on the training data, so a list of pipeline objects is not needed here.

Returns

a list of algorithm instance objects.
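The idea of combining the processing step with each algorithm and keeping the best-scoring ones can be sketched as follows. Everything here is hypothetical (the preprocess function, the candidate "algorithms", the accuracy metric); the real selection is delegated to the model_selection module mentioned above.

```python
def preprocess(rows):
    # Hypothetical shared preprocessing step: rescale values to [0, 1].
    lo, hi = min(rows), max(rows)
    return [(r - lo) / (hi - lo) for r in rows]


# Hypothetical candidate algorithms: each maps a scaled value to a label.
candidates = {
    "threshold_0.3": lambda v: int(v > 0.3),
    "threshold_0.5": lambda v: int(v > 0.5),
    "always_one": lambda v: 1,
}

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 0, 1, 1]


def accuracy(algorithm, rows, labels):
    # Combine the preprocessing step with the algorithm, then score it.
    predicted = [algorithm(v) for v in preprocess(rows)]
    return sum(p == t for p, t in zip(predicted, labels)) / len(labels)


scores = {name: accuracy(alg, x, y) for name, alg in candidates.items()}

# Keep only the best-scoring candidates; these are the ones that would be
# saved to disk and later combined by the ensemble logic.
top_two = sorted(scores, key=scores.get, reverse=True)[:2]
```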

decision_function(X)

Apply transforms, and decision_function of the final estimator

Parameters

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.

Returns

y_score

Return type

array-like of shape (n_samples, n_classes)

fit(x, y, n_jobs=None, use_neural_network=True)

The real pipeline training steps happen here.

Parameters
  • x

  • y

fit_predict(X, y=None, **fit_params)

Applies fit_predict of last step in pipeline after transforms.

Applies fit_transforms of a pipeline to the data, followed by the fit_predict method of the final estimator in the pipeline. Valid only if the final estimator implements fit_predict.

Parameters
  • X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.

  • y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.

  • **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
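The s__p naming convention can be illustrated with a small helper that groups such keys by step. route_fit_params is a hypothetical function written for this example, not the routine scikit-learn uses internally, and the step names are assumptions.

```python
def route_fit_params(fit_params):
    """Group ``step__param`` keys into one dict of parameters per step."""
    routed = {}
    for key, value in fit_params.items():
        # Split on the first double underscore only, so that a parameter
        # name may itself contain "__".
        step, param = key.split("__", 1)
        routed.setdefault(step, {})[param] = value
    return routed


routed = route_fit_params({"clf__sample_weight": [1, 2], "scale__copy": False})
```

With this convention a single flat keyword dictionary is enough to address any step of the pipeline.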

Returns

y_pred

Return type

array-like

fit_transform(X, y=None, **fit_params)

Fit the model and transform with the final estimator

Fits all the transforms one after the other and transforms the data, then uses fit_transform on transformed data with the final estimator.

Parameters
  • X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.

  • y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.

  • **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns

Xt – Transformed samples

Return type

array-like of shape (n_samples, n_transformed_features)

get_params(deep=True)

Get parameters for this estimator.

Returns the parameters given in the constructor as well as the estimators contained within the steps of the Pipeline.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

get_sorted_models_scores(x, y, reverse=True)

Gets the score of every trained model, so that after the time spent on training the testing result of each model can be inspected.

Loads all trained models from disk, processes the new data, and scores it with each model according to the type of problem.

Parameters
  • x

  • y

  • reverse – Whether or not to sort the result in reverse order.

Returns

sorted dictionary: {‘lr-0.982’: 0.87, …}
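A dictionary of that shape can be ordered by score as sketched below. sort_scores and the model names and scores are made up for the example; only the {name: score} layout is taken from the description above.

```python
def sort_scores(scores, reverse=True):
    """Return a new dict ordered by score (highest first when reverse=True)."""
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=reverse))


# Hypothetical model names and test scores.
test_scores = {"lr-0.982": 0.87, "svc-0.975": 0.91, "tree-0.990": 0.80}
ranked = sort_scores(test_scores)
```

Since Python 3.7 a dict preserves insertion order, so the first key of the result is the best-scoring model when reverse=True.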

property inverse_transform

Apply inverse transformations in reverse order

All estimators in the pipeline must support inverse_transform.

Parameters

Xt (array-like of shape (n_samples, n_transformed_features)) – Data samples, where n_samples is the number of samples and n_features is the number of features. Must fulfill input requirements of last step of pipeline’s inverse_transform method.

Returns

Xt

Return type

array-like of shape (n_samples, n_features)
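Why the inverse transformations run in reverse order can be seen with two toy steps. Scale and Shift are hypothetical transformers written for this sketch; they only mimic the transform / inverse_transform pair that each pipeline step must provide.

```python
class Scale:
    """Hypothetical transformer that multiplies every value by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, values):
        return [v * self.factor for v in values]

    def inverse_transform(self, values):
        return [v / self.factor for v in values]


class Shift:
    """Hypothetical transformer that adds an offset to every value."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, values):
        return [v + self.offset for v in values]

    def inverse_transform(self, values):
        return [v - self.offset for v in values]


steps = [Scale(2.0), Shift(3.0)]


def apply_forward(values):
    for step in steps:
        values = step.transform(values)
    return values


def apply_inverse(values):
    # Undo the last step first, then work back towards the first step.
    for step in reversed(steps):
        values = step.inverse_transform(values)
    return values


transformed = apply_forward([1.0, 4.0])
restored = apply_inverse(transformed)
```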

predict(x)

Gets predictions based on the training pipeline.

Parameters

x

predict_log_proba(X)

Apply transforms, and predict_log_proba of the final estimator

Parameters

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.

Returns

y_score

Return type

array-like of shape (n_samples, n_classes)

predict_proba(x)

Gets probabilities based on the training pipeline.

Parameters

x

score(x, y)

By default, uses the best trained estimator to compute the score.

Parameters
  • x

  • y

score_samples(X)

Apply transforms, and score_samples of the final estimator.

Parameters

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.

Returns

y_score

Return type

ndarray of shape (n_samples,)

set_params(**kwargs)

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in steps.

Returns

Return type

self
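Setting parameters of contained estimators uses the same <step>__<parameter> keys that get_params() lists. The sketch below uses a plain sklearn.pipeline.Pipeline with assumed step names "scale" and "clf", not the pipeline classes of this module.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Keys of the form <step name>__<parameter> reach into the contained steps.
returned = pipe.set_params(clf__C=0.5, scale__with_mean=False)
params = pipe.get_params(deep=True)
```

Because set_params returns the pipeline itself, calls can be chained.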

property transform

Apply transforms, and transform with the final estimator

This also works where final estimator is None: all prior transformations are applied.

Parameters

X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.

Returns

Xt

Return type

array-like of shape (n_samples, n_transformed_features)

class automl.pipeline_training.PipelineTrain(include_estimators=None, exclude_estimators=None, include_preprocessors=None, exclude_preprocessors=None, use_imputation=True, use_onehot=True, use_standard=True, use_norm=False, use_pca=True, use_minmax=False, use_feature_seletion=False, max_feature_num=80, use_ensemble=True, ensemble_alg='stacking', voting_logic='soft', backend=None)

Bases: sklearn.pipeline.Pipeline

Parent class for both the classification and regression pipelines.

__getitem__(ind)

Returns a sub-pipeline or a single estimator in the pipeline

Indexing with an integer will return an estimator; using a slice returns another Pipeline instance which copies a slice of this Pipeline. This copy is shallow: modifying (or fitting) estimators in the sub-pipeline will affect the larger pipeline and vice-versa. However, replacing a value in step will not affect a copy.

__len__()

Returns the length of the Pipeline

build_preprocessing_pipeline(data=None)

The preprocessing pipeline is split from the training pipeline because the whole set of preprocessing steps is re-used when the data contains null values. The real pipeline is therefore split into 2 parts: preprocessing and training. After all steps finish, the processed data should be stored on disk so that it can be re-used.

This preprocessing instance also needs to be stored, possibly combined with the training pipeline instance.

One more variant without the processing steps should also be added, as the models may do better without processing than with it. 3 versions of the data should therefore be stored:

1: original data; 2: data processed with imputation; 3: data processed with the whole preprocessing.

Parameters

data

Returns

build_training_pipeline(y=None, use_neural_network=True)

The real pipeline steps should be built here. Child classes do the actual build with their different steps and add the step instances to the pipeline object. This should also be a lazy instantiation step that happens when the real fit logic runs, so that the steps can be modified based on the data.

Important things to notice:

Even with many algorithm instances, the first step should combine the processing step with each algorithm; the best-scoring models can then be selected and saved to disk.

They can then be loaded from disk and combined with the ensemble logic.

A model_selection module has been created to get the best models based on the training data, so a list of pipeline objects is not needed here.

Returns

a list of algorithm instance objects.

decision_function(X)

Apply transforms, and decision_function of the final estimator

Parameters

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.

Returns

y_score

Return type

array-like of shape (n_samples, n_classes)

fit(x, y, n_jobs=None, use_neural_network=True)

The real pipeline training steps happen here.

Parameters
  • x

  • y

fit_predict(X, y=None, **fit_params)

Applies fit_predict of last step in pipeline after transforms.

Applies fit_transforms of a pipeline to the data, followed by the fit_predict method of the final estimator in the pipeline. Valid only if the final estimator implements fit_predict.

Parameters
  • X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.

  • y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.

  • **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns

y_pred

Return type

array-like

fit_transform(X, y=None, **fit_params)

Fit the model and transform with the final estimator

Fits all the transforms one after the other and transforms the data, then uses fit_transform on transformed data with the final estimator.

Parameters
  • X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.

  • y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.

  • **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns

Xt – Transformed samples

Return type

array-like of shape (n_samples, n_transformed_features)

get_params(deep=True)

Get parameters for this estimator.

Returns the parameters given in the constructor as well as the estimators contained within the steps of the Pipeline.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

get_sorted_models_scores(x, y, reverse=True)

Gets the score of every trained model, so that after the time spent on training the testing result of each model can be inspected.

Loads all trained models from disk, processes the new data, and scores it with each model according to the type of problem.

Parameters
  • x

  • y

  • reverse – Whether or not to sort the result in reverse order.

Returns

sorted dictionary: {‘lr-0.982’: 0.87, …}

property inverse_transform

Apply inverse transformations in reverse order

All estimators in the pipeline must support inverse_transform.

Parameters

Xt (array-like of shape (n_samples, n_transformed_features)) – Data samples, where n_samples is the number of samples and n_features is the number of features. Must fulfill input requirements of last step of pipeline’s inverse_transform method.

Returns

Xt

Return type

array-like of shape (n_samples, n_features)

predict(x)

Gets predictions based on the training pipeline.

Parameters

x

predict_log_proba(X)

Apply transforms, and predict_log_proba of the final estimator

Parameters

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.

Returns

y_score

Return type

array-like of shape (n_samples, n_classes)

predict_proba(x)

Gets probabilities based on the training pipeline.

Parameters

x

score(x, y)

By default, uses the best trained estimator to compute the score.

Parameters
  • x

  • y

score_samples(X)

Apply transforms, and score_samples of the final estimator.

Parameters

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.

Returns

y_score

Return type

ndarray of shape (n_samples,)

set_params(**kwargs)

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in steps.

Returns

Return type

self

property transform

Apply transforms, and transform with the final estimator

This also works where final estimator is None: all prior transformations are applied.

Parameters

X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.

Returns

Xt

Return type

array-like of shape (n_samples, n_transformed_features)

class automl.pipeline_training.RegressionPipeline(backend=None, include_estimators=None, exclude_estimators=None, include_preprocessors=None, exclude_preprocessors=None, **kwargs)

Bases: automl.pipeline_training.PipelineTrain

__getitem__(ind)

Returns a sub-pipeline or a single estimator in the pipeline

Indexing with an integer will return an estimator; using a slice returns another Pipeline instance which copies a slice of this Pipeline. This copy is shallow: modifying (or fitting) estimators in the sub-pipeline will affect the larger pipeline and vice-versa. However, replacing a value in step will not affect a copy.

__len__()

Returns the length of the Pipeline

build_preprocessing_pipeline(data=None)

The preprocessing pipeline is split from the training pipeline because the whole set of preprocessing steps is re-used when the data contains null values. The real pipeline is therefore split into 2 parts: preprocessing and training. After all steps finish, the processed data should be stored on disk so that it can be re-used.

This preprocessing instance also needs to be stored, possibly combined with the training pipeline instance.

One more variant without the processing steps should also be added, as the models may do better without processing than with it. 3 versions of the data should therefore be stored:

1: original data; 2: data processed with imputation; 3: data processed with the whole preprocessing.

Parameters

data

Returns

build_training_pipeline(y=None, use_neural_network=True)

The real pipeline steps should be built here. Child classes do the actual build with their different steps and add the step instances to the pipeline object. This should also be a lazy instantiation step that happens when the real fit logic runs, so that the steps can be modified based on the data.

Important things to notice:

Even with many algorithm instances, the first step should combine the processing step with each algorithm; the best-scoring models can then be selected and saved to disk.

They can then be loaded from disk and combined with the ensemble logic.

A model_selection module has been created to get the best models based on the training data, so a list of pipeline objects is not needed here.

Returns

a list of algorithm instance objects.

decision_function(X)

Apply transforms, and decision_function of the final estimator

Parameters

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.

Returns

y_score

Return type

array-like of shape (n_samples, n_classes)

fit(x, y, n_jobs=None, use_neural_network=True)

The real pipeline training steps happen here.

Parameters
  • x

  • y

fit_predict(X, y=None, **fit_params)

Applies fit_predict of last step in pipeline after transforms.

Applies fit_transforms of a pipeline to the data, followed by the fit_predict method of the final estimator in the pipeline. Valid only if the final estimator implements fit_predict.

Parameters
  • X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.

  • y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.

  • **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns

y_pred

Return type

array-like

fit_transform(X, y=None, **fit_params)

Fit the model and transform with the final estimator

Fits all the transforms one after the other and transforms the data, then uses fit_transform on transformed data with the final estimator.

Parameters
  • X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.

  • y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.

  • **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns

Xt – Transformed samples

Return type

array-like of shape (n_samples, n_transformed_features)

get_params(deep=True)

Get parameters for this estimator.

Returns the parameters given in the constructor as well as the estimators contained within the steps of the Pipeline.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

get_sorted_models_scores(x, y, reverse=True)

Gets the score of every trained model, so that after the time spent on training the testing result of each model can be inspected.

Loads all trained models from disk, processes the new data, and scores it with each model according to the type of problem.

Parameters
  • x

  • y

  • reverse – Whether or not to sort the result in reverse order.

Returns

sorted dictionary: {‘lr-0.982’: 0.87, …}

property inverse_transform

Apply inverse transformations in reverse order

All estimators in the pipeline must support inverse_transform.

Parameters

Xt (array-like of shape (n_samples, n_transformed_features)) – Data samples, where n_samples is the number of samples and n_features is the number of features. Must fulfill input requirements of last step of pipeline’s inverse_transform method.

Returns

Xt

Return type

array-like of shape (n_samples, n_features)

predict(x)

Gets predictions based on the training pipeline.

Parameters

x

predict_log_proba(X)

Apply transforms, and predict_log_proba of the final estimator

Parameters

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.

Returns

y_score

Return type

array-like of shape (n_samples, n_classes)

predict_proba(x)

Gets probabilities based on the training pipeline.

Parameters

x

score(x, y)

By default, uses the best trained estimator to compute the score.

Parameters
  • x

  • y

score_samples(X)

Apply transforms, and score_samples of the final estimator.

Parameters

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.

Returns

y_score

Return type

ndarray of shape (n_samples,)

set_params(**kwargs)

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in steps.

Returns

Return type

self

property transform

Apply transforms, and transform with the final estimator

This also works where final estimator is None: all prior transformations are applied.

Parameters

X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.

Returns

Xt

Return type

array-like of shape (n_samples, n_transformed_features)