Pipeline Module (API Reference)¶
This module trains models for different algorithms.
It contains only the training logic: preprocessing plus algorithm training, which produces trained models and dumps them to disk.
To get the models’ best score and predictions, load the trained models from disk, then transform the data and predict. For testing, first transform the data with the saved processor, then predict with the highest-scoring model. Three things must be persisted: 1. the processor; 2. the data used for training; 3. all trained models.
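The save-then-load workflow described above can be sketched roughly as follows. This is a minimal illustration using only the standard library; the helper functions (`fit_imputer`, `apply_imputer`, `fit_model`, `predict`), the toy data, and the file names are hypothetical stand-ins and are not part of `automl.pipeline_training`.

```python
import os
import pickle
import tempfile


def fit_imputer(values):
    """Hypothetical processor: learn the mean of the non-missing values."""
    known = [v for v in values if v is not None]
    return {"mean": sum(known) / len(known)}


def apply_imputer(state, values):
    """Replace missing values with the mean learned during fitting."""
    return [state["mean"] if v is None else v for v in values]


def fit_model(x):
    """Hypothetical model: a threshold at the mean of the training values."""
    return {"threshold": sum(x) / len(x)}


def predict(state, x):
    """Predict 1 for values above the stored threshold, else 0."""
    return [1 if v > state["threshold"] else 0 for v in x]


def dump(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


workdir = tempfile.mkdtemp()
train_x = [1.0, None, 3.0, 5.0]

# Training side: fit, then persist the processor, the processed data, and the model.
processor = fit_imputer(train_x)
processed_x = apply_imputer(processor, train_x)
model = fit_model(processed_x)
dump(processor, os.path.join(workdir, "processor.pkl"))
dump(processed_x, os.path.join(workdir, "train_data.pkl"))
dump(model, os.path.join(workdir, "model.pkl"))

# Test side: reload from disk, transform with the saved processor, then predict.
loaded_processor = load(os.path.join(workdir, "processor.pkl"))
loaded_model = load(os.path.join(workdir, "model.pkl"))
predictions = predict(loaded_model, apply_imputer(loaded_processor, [None, 4.0]))
```

The point of the sketch is only the ordering: new data must pass through the same saved processor before it reaches a reloaded model.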
@author: Guangqiang.lu
- class automl.pipeline_training.ClassificationPipeline(backend=None, include_estimators=None, exclude_estimators=None, include_preprocessors=None, exclude_preprocessors=None, **kwargs)¶
Bases:
automl.pipeline_training.PipelineTrain
Classification pipeline class that can be used as a pipeline; the ensemble logic should also happen here.
- __getitem__(ind)¶
Returns a sub-pipeline or a single estimator in the pipeline
Indexing with an integer will return an estimator; using a slice returns another Pipeline instance which copies a slice of this Pipeline. This copy is shallow: modifying (or fitting) estimators in the sub-pipeline will affect the larger pipeline and vice-versa. However, replacing a value in step will not affect a copy.
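The shallow-copy behaviour of slicing can be illustrated with a small stand-in. `MiniPipeline` and `Step` below are hypothetical, simplified classes that mimic only the indexing semantics described above; they are not the real `Pipeline` implementation.

```python
class Step:
    """Hypothetical estimator stand-in with one mutable attribute."""

    def __init__(self, scale):
        self.scale = scale


class MiniPipeline:
    """Simplified stand-in that mimics only the indexing behaviour."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) tuples

    def __len__(self):
        return len(self.steps)

    def __getitem__(self, ind):
        if isinstance(ind, slice):
            # New container holding the same estimator objects: a shallow copy.
            return MiniPipeline(self.steps[ind])
        return self.steps[ind][1]


pipe = MiniPipeline([("a", Step(1)), ("b", Step(2)), ("c", Step(3))])
sub = pipe[:2]

sub[0].scale = 10               # mutate an estimator through the sub-pipeline
shared = pipe[0].scale          # the larger pipeline sees the change

sub.steps[1] = ("b", Step(99))  # replace a step in the copy only
untouched = pipe[1].scale       # the original still holds its own step
```

Mutating a shared estimator is visible from both pipelines, while replacing an entry in the copy's step list leaves the original alone.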
- __len__()¶
Returns the length of the Pipeline
- build_preprocessing_pipeline(data=None)¶
The preprocessing pipeline is split from the training pipeline so that the full set of preprocessing steps can be re-used, for example when the data contains null values. The real pipeline therefore has two parts: preprocessing and training. After all steps finish, the processed data is stored on disk so that it can be re-used.
This preprocessing instance also needs to be stored, possibly combined with the training pipeline instance.
One more variant without the processing steps is also kept, since models may do better without preprocessing. Three versions of the data should be stored:
1: original data; 2: data processed with imputation; 3: data processed with all preprocessing steps.
- Parameters
data –
- Returns
- build_training_pipeline(y=None, use_neural_network=True)¶
The real pipeline steps should be built here. Child classes do the actual build with their own steps and add the step instances to the pipeline object. This should be a lazy instantiation step that happens when fit is actually called, so that the steps can also be modified based on the data.
- Important thing to notice:
Even with many algorithm instances, the first step should combine the processing step with each algorithm; the best-scoring models can then be selected and saved to disk.
They can then be loaded from disk and combined with the ensemble logic.
A model_selection module has been created to select the best models based on the training data, so a list of pipeline objects is not needed here.
- Returns
a list of algorithm instance objects.
- decision_function(X)¶
Apply transforms, and decision_function of the final estimator
- Parameters
X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
- Returns
y_score
- Return type
array-like of shape (n_samples, n_classes)
- fit(x, y, n_jobs=None, use_neural_network=True)¶
The real pipeline training steps happen here.
- Parameters
x –
y –
- Returns
- fit_predict(X, y=None, **fit_params)¶
Applies fit_predict of last step in pipeline after transforms.
Applies fit_transforms of a pipeline to the data, followed by the fit_predict method of the final estimator in the pipeline. Valid only if the final estimator implements fit_predict.
- Parameters
X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
**fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
- Returns
y_pred
- Return type
array-like
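The `s__p` prefix convention for `**fit_params` can be sketched as a small routing function. `route_fit_params` and the step names `scaler` and `clf` are hypothetical and used only for illustration; the real pipeline performs this routing internally.

```python
def route_fit_params(step_names, fit_params):
    """Group 'step__param' keys into one dict of parameters per step."""
    routed = {name: {} for name in step_names}
    for key, value in fit_params.items():
        if "__" not in key:
            raise ValueError(f"Expected a key of the form step__param, got {key!r}")
        # Split on the first double underscore only, so parameter names
        # that themselves contain '__' stay intact.
        step, param = key.split("__", 1)
        if step not in routed:
            raise KeyError(f"Unknown step {step!r}")
        routed[step][param] = value
    return routed


routed = route_fit_params(
    ["scaler", "clf"],
    {"clf__sample_weight": [1.0, 2.0]},
)
```

Here `routed["clf"]` receives `sample_weight`, while `routed["scaler"]` stays empty because no key carries its prefix.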
- fit_transform(X, y=None, **fit_params)¶
Fit the model and transform with the final estimator
Fits all the transforms one after the other and transforms the data, then uses fit_transform on transformed data with the final estimator.
- Parameters
X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
**fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
- Returns
Xt – Transformed samples
- Return type
array-like of shape (n_samples, n_transformed_features)
- get_params(deep=True)¶
Get parameters for this estimator.
Returns the parameters given in the constructor as well as the estimators contained within the steps of the Pipeline.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
- get_sorted_models_scores(x, y, reverse=True)¶
Gets the score of every trained model, so that, after the time spent on training, the test result of each model can be inspected.
Loads all trained models from disk, processes the new data, and scores it with each model according to the problem type.
- Parameters
x –
y –
reverse – Whether or not to sort the result in reverse order.
- Returns
sorted dictionary: {‘lr-0.982’: 0.87, …}
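A sorted dictionary of the shape shown above could be produced along these lines. This is a sketch, assuming that `reverse` is forwarded to Python's built-in `sorted`; `sort_model_scores` and every key other than `lr-0.982` are hypothetical.

```python
def sort_model_scores(scores, reverse=True):
    """Return a new dict ordered by score; highest first when reverse=True."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=reverse)
    return dict(ordered)  # dicts preserve insertion order in Python 3.7+


scores = {"lr-0.982": 0.87, "svc-0.975": 0.91, "rf-0.990": 0.89}
best_first = sort_model_scores(scores)
worst_first = sort_model_scores(scores, reverse=False)
```

With `reverse=True` the first key of the result is the best-scoring model, which makes it easy to pick the model to use for prediction.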
- property inverse_transform¶
Apply inverse transformations in reverse order
All estimators in the pipeline must support inverse_transform.
- Parameters
Xt (array-like of shape (n_samples, n_transformed_features)) – Data samples, where n_samples is the number of samples and n_features is the number of features. Must fulfill input requirements of last step of pipeline’s inverse_transform method.
- Returns
Xt
- Return type
array-like of shape (n_samples, n_features)
- predict(x)¶
Gets predictions from the training pipeline.
- Parameters
x –
- Returns
- predict_log_proba(X)¶
Apply transforms, and predict_log_proba of the final estimator
- Parameters
X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
- Returns
y_score
- Return type
array-like of shape (n_samples, n_classes)
- predict_proba(x)¶
Gets probabilities from the training pipeline.
- Parameters
x –
- Returns
- score(x, y)¶
By default, uses the best trained estimator to compute the score.
- Parameters
x –
y –
- Returns
- score_samples(X)¶
Apply transforms, and score_samples of the final estimator.
- Parameters
X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
- Returns
y_score
- Return type
ndarray of shape (n_samples,)
- set_params(**kwargs)¶
Set the parameters of this estimator.
Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in steps.
- Returns
- Return type
self
- property transform¶
Apply transforms, and transform with the final estimator
This also works where final estimator is None: all prior transformations are applied.
- Parameters
X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.
- Returns
Xt
- Return type
array-like of shape (n_samples, n_transformed_features)
- class automl.pipeline_training.PipelineTrain(include_estimators=None, exclude_estimators=None, include_preprocessors=None, exclude_preprocessors=None, use_imputation=True, use_onehot=True, use_standard=True, use_norm=False, use_pca=True, use_minmax=False, use_feature_seletion=False, max_feature_num=80, use_ensemble=True, ensemble_alg='stacking', voting_logic='soft', backend=None)¶
Bases:
sklearn.pipeline.Pipeline
Parent class for both classification and regression.
- __getitem__(ind)¶
Returns a sub-pipeline or a single estimator in the pipeline
Indexing with an integer will return an estimator; using a slice returns another Pipeline instance which copies a slice of this Pipeline. This copy is shallow: modifying (or fitting) estimators in the sub-pipeline will affect the larger pipeline and vice-versa. However, replacing a value in step will not affect a copy.
- __len__()¶
Returns the length of the Pipeline
- build_preprocessing_pipeline(data=None)¶
The preprocessing pipeline is split from the training pipeline so that the full set of preprocessing steps can be re-used, for example when the data contains null values. The real pipeline therefore has two parts: preprocessing and training. After all steps finish, the processed data is stored on disk so that it can be re-used.
This preprocessing instance also needs to be stored, possibly combined with the training pipeline instance.
One more variant without the processing steps is also kept, since models may do better without preprocessing. Three versions of the data should be stored:
1: original data; 2: data processed with imputation; 3: data processed with all preprocessing steps.
- Parameters
data –
- Returns
- build_training_pipeline(y=None, use_neural_network=True)¶
The real pipeline steps should be built here. Child classes do the actual build with their own steps and add the step instances to the pipeline object. This should be a lazy instantiation step that happens when fit is actually called, so that the steps can also be modified based on the data.
- Important thing to notice:
Even with many algorithm instances, the first step should combine the processing step with each algorithm; the best-scoring models can then be selected and saved to disk.
They can then be loaded from disk and combined with the ensemble logic.
A model_selection module has been created to select the best models based on the training data, so a list of pipeline objects is not needed here.
- Returns
a list of algorithm instance objects.
- decision_function(X)¶
Apply transforms, and decision_function of the final estimator
- Parameters
X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
- Returns
y_score
- Return type
array-like of shape (n_samples, n_classes)
- fit(x, y, n_jobs=None, use_neural_network=True)¶
The real pipeline training steps happen here.
- Parameters
x –
y –
- Returns
- fit_predict(X, y=None, **fit_params)¶
Applies fit_predict of last step in pipeline after transforms.
Applies fit_transforms of a pipeline to the data, followed by the fit_predict method of the final estimator in the pipeline. Valid only if the final estimator implements fit_predict.
- Parameters
X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
**fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
- Returns
y_pred
- Return type
array-like
- fit_transform(X, y=None, **fit_params)¶
Fit the model and transform with the final estimator
Fits all the transforms one after the other and transforms the data, then uses fit_transform on transformed data with the final estimator.
- Parameters
X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
**fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
- Returns
Xt – Transformed samples
- Return type
array-like of shape (n_samples, n_transformed_features)
- get_params(deep=True)¶
Get parameters for this estimator.
Returns the parameters given in the constructor as well as the estimators contained within the steps of the Pipeline.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
- get_sorted_models_scores(x, y, reverse=True)¶
Gets the score of every trained model, so that, after the time spent on training, the test result of each model can be inspected.
Loads all trained models from disk, processes the new data, and scores it with each model according to the problem type.
- Parameters
x –
y –
reverse – Whether or not to sort the result in reverse order.
- Returns
sorted dictionary: {‘lr-0.982’: 0.87, …}
- property inverse_transform¶
Apply inverse transformations in reverse order
All estimators in the pipeline must support inverse_transform.
- Parameters
Xt (array-like of shape (n_samples, n_transformed_features)) – Data samples, where n_samples is the number of samples and n_features is the number of features. Must fulfill input requirements of last step of pipeline’s inverse_transform method.
- Returns
Xt
- Return type
array-like of shape (n_samples, n_features)
- predict(x)¶
Gets predictions from the training pipeline.
- Parameters
x –
- Returns
- predict_log_proba(X)¶
Apply transforms, and predict_log_proba of the final estimator
- Parameters
X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
- Returns
y_score
- Return type
array-like of shape (n_samples, n_classes)
- predict_proba(x)¶
Gets probabilities from the training pipeline.
- Parameters
x –
- Returns
- score(x, y)¶
By default, uses the best trained estimator to compute the score.
- Parameters
x –
y –
- Returns
- score_samples(X)¶
Apply transforms, and score_samples of the final estimator.
- Parameters
X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
- Returns
y_score
- Return type
ndarray of shape (n_samples,)
- set_params(**kwargs)¶
Set the parameters of this estimator.
Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in steps.
- Returns
- Return type
self
- property transform¶
Apply transforms, and transform with the final estimator
This also works where final estimator is None: all prior transformations are applied.
- Parameters
X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.
- Returns
Xt
- Return type
array-like of shape (n_samples, n_transformed_features)
- class automl.pipeline_training.RegressionPipeline(backend=None, include_estimators=None, exclude_estimators=None, include_preprocessors=None, exclude_preprocessors=None, **kwargs)¶
Bases:
automl.pipeline_training.PipelineTrain
- __getitem__(ind)¶
Returns a sub-pipeline or a single estimator in the pipeline
Indexing with an integer will return an estimator; using a slice returns another Pipeline instance which copies a slice of this Pipeline. This copy is shallow: modifying (or fitting) estimators in the sub-pipeline will affect the larger pipeline and vice-versa. However, replacing a value in step will not affect a copy.
- __len__()¶
Returns the length of the Pipeline
- build_preprocessing_pipeline(data=None)¶
The preprocessing pipeline is split from the training pipeline so that the full set of preprocessing steps can be re-used, for example when the data contains null values. The real pipeline therefore has two parts: preprocessing and training. After all steps finish, the processed data is stored on disk so that it can be re-used.
This preprocessing instance also needs to be stored, possibly combined with the training pipeline instance.
One more variant without the processing steps is also kept, since models may do better without preprocessing. Three versions of the data should be stored:
1: original data; 2: data processed with imputation; 3: data processed with all preprocessing steps.
- Parameters
data –
- Returns
- build_training_pipeline(y=None, use_neural_network=True)¶
The real pipeline steps should be built here. Child classes do the actual build with their own steps and add the step instances to the pipeline object. This should be a lazy instantiation step that happens when fit is actually called, so that the steps can also be modified based on the data.
- Important thing to notice:
Even with many algorithm instances, the first step should combine the processing step with each algorithm; the best-scoring models can then be selected and saved to disk.
They can then be loaded from disk and combined with the ensemble logic.
A model_selection module has been created to select the best models based on the training data, so a list of pipeline objects is not needed here.
- Returns
a list of algorithm instance objects.
- decision_function(X)¶
Apply transforms, and decision_function of the final estimator
- Parameters
X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
- Returns
y_score
- Return type
array-like of shape (n_samples, n_classes)
- fit(x, y, n_jobs=None, use_neural_network=True)¶
The real pipeline training steps happen here.
- Parameters
x –
y –
- Returns
- fit_predict(X, y=None, **fit_params)¶
Applies fit_predict of last step in pipeline after transforms.
Applies fit_transforms of a pipeline to the data, followed by the fit_predict method of the final estimator in the pipeline. Valid only if the final estimator implements fit_predict.
- Parameters
X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
**fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
- Returns
y_pred
- Return type
array-like
- fit_transform(X, y=None, **fit_params)¶
Fit the model and transform with the final estimator
Fits all the transforms one after the other and transforms the data, then uses fit_transform on transformed data with the final estimator.
- Parameters
X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
**fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
- Returns
Xt – Transformed samples
- Return type
array-like of shape (n_samples, n_transformed_features)
- get_params(deep=True)¶
Get parameters for this estimator.
Returns the parameters given in the constructor as well as the estimators contained within the steps of the Pipeline.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
- get_sorted_models_scores(x, y, reverse=True)¶
Gets the score of every trained model, so that, after the time spent on training, the test result of each model can be inspected.
Loads all trained models from disk, processes the new data, and scores it with each model according to the problem type.
- Parameters
x –
y –
reverse – Whether or not to sort the result in reverse order.
- Returns
sorted dictionary: {‘lr-0.982’: 0.87, …}
- property inverse_transform¶
Apply inverse transformations in reverse order
All estimators in the pipeline must support inverse_transform.
- Parameters
Xt (array-like of shape (n_samples, n_transformed_features)) – Data samples, where n_samples is the number of samples and n_features is the number of features. Must fulfill input requirements of last step of pipeline’s inverse_transform method.
- Returns
Xt
- Return type
array-like of shape (n_samples, n_features)
- predict(x)¶
Gets predictions from the training pipeline.
- Parameters
x –
- Returns
- predict_log_proba(X)¶
Apply transforms, and predict_log_proba of the final estimator
- Parameters
X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
- Returns
y_score
- Return type
array-like of shape (n_samples, n_classes)
- predict_proba(x)¶
Gets probabilities from the training pipeline.
- Parameters
x –
- Returns
- score(x, y)¶
By default, uses the best trained estimator to compute the score.
- Parameters
x –
y –
- Returns
- score_samples(X)¶
Apply transforms, and score_samples of the final estimator.
- Parameters
X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
- Returns
y_score
- Return type
ndarray of shape (n_samples,)
- set_params(**kwargs)¶
Set the parameters of this estimator.
Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in steps.
- Returns
- Return type
self
- property transform¶
Apply transforms, and transform with the final estimator
This also works where final estimator is None: all prior transformations are applied.
- Parameters
X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.
- Returns
Xt
- Return type
array-like of shape (n_samples, n_transformed_features)