Automl Module (API Reference)

Entry point for the full AutoML training pipeline, with support for classification and regression.

Supports training of both base machine learning models and neural networks. Beyond training a single model, trained models can be combined into an ensemble to produce a more robust model that reduces both variance and bias.

High level steps:

1. Load training and testing data from files or in-memory objects.
2. Run feature engineering to process the data.
3. Train models on the processed data.
4. Train neural network models on the processed data.
5. Combine the trained models into an ensemble and compare it against the individual models to check whether it performs better.
6. Dump the trained models to disk at a user-defined path.
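
The ensemble comparison step can be illustrated with a minimal standalone sketch. It does not use this module's implementation; the helper names `mean_squared_error` and `average_ensemble`, the unweighted averaging, and the sample data are all assumptions made for illustration.

```python
from statistics import mean


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def average_ensemble(predictions):
    """Average per-sample predictions across models (unweighted ensemble)."""
    return [mean(sample) for sample in zip(*predictions)]


# Hypothetical validation targets and predictions from two trained models.
y_val = [1.0, 2.0, 3.0, 4.0]
model_preds = {
    "model_a": [1.5, 2.5, 3.5, 4.5],  # consistently too high
    "model_b": [0.5, 1.5, 2.5, 3.5],  # consistently too low
}

# Error of each individual model, and the best (lowest) of them.
single_errors = {
    name: mean_squared_error(y_val, preds) for name, preds in model_preds.items()
}
best_single = min(single_errors.values())

# Error of the averaged ensemble.
ensemble_pred = average_ensemble(list(model_preds.values()))
ensemble_error = mean_squared_error(y_val, ensemble_pred)

# Keep the ensemble only if it beats the best individual model.
use_ensemble = ensemble_error < best_single
```

In this toy case the two models err in opposite directions, so averaging cancels their errors and the ensemble is kept.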

author: Guangqiang.lu

class automl.estimator.AutoML(models_path=None, time_left_for_this_task=3600, n_ensemble=10, n_best_model=5, include_estimators=None, exclude_estimators=None, include_preprocessors=None, exclude_preprocessors=None, keep_models=True, model_dir=None, precision=32, delete_models=True)

Bases: sklearn.base.BaseEstimator

Parent class for both the classification and regression auto-training classes.

Initializes the AutoML class. All configuration is set up here, such as which algorithms to use and how many models to select.

Parameters

backend – backend object used to save and load models.
time_left_for_this_task – how long the models are allowed to train.
n_ensemble – how many models to select for the ensemble.
n_best_model – how many models to keep during training.
include_estimators – which algorithms to include.
exclude_estimators – which algorithms to exclude.
include_preprocessors – which preprocessing steps to include.
exclude_preprocessors – which preprocessing steps to exclude.
keep_models – whether to keep the trained models.
model_dir – folder in which to keep models; if None, the backend creates one.
precision – numeric precision of the data, used to save memory.
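
As a rough illustration of how include and exclude lists can narrow a set of candidates, consider the sketch below. The helper `select_names`, the candidate names, and the rule that the include list is applied before the exclude list are all assumptions; the names actually accepted by AutoML and its precedence rules may differ.

```python
def select_names(available, include=None, exclude=None):
    """Filter candidate names using optional include and exclude lists."""
    selected = list(available)
    if include is not None:
        # Keep only the explicitly requested names.
        selected = [name for name in selected if name in include]
    if exclude is not None:
        # Drop any names the caller ruled out.
        selected = [name for name in selected if name not in exclude]
    return selected


# Hypothetical algorithm names, used only for this example.
candidates = ["LogisticRegression", "RandomForest", "GradientBoosting", "SVC"]

only_trees = select_names(candidates, include=["RandomForest", "GradientBoosting"])
without_svc = select_names(candidates, exclude=["SVC"])
```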

fit(x=None, y=None, file_load=None, xval=None, yval=None, val_split=0.2, n_jobs=None, use_neural_network=True, *args, **kwargs)

Main training entry point, with support for both file and in-memory objects.

The full training step, including the pre-processing pipeline and the training pipeline, happens here. Various types of data are supported and are converted into a plain array for the training algorithms. A training pipeline is instantiated with different algorithms and their selected hyper-parameters, grid search is used to find the best hyper-parameters, and the trained models are stored with their validation scores attached to the algorithm name.

Parameters

x (array, optional) – Training data. Defaults to None.
y (array, optional) – Training labels. Defaults to None.
file_load (FileLoad, optional) – FileLoad object containing the data and labels. Defaults to None.
xval (array, optional) – Validation data. Defaults to None.
yval (array, optional) – Validation labels. Defaults to None.
val_split (float, optional) – Fraction of the data held out for validation if xval and yval are not provided. Defaults to 0.2.
n_jobs (int, optional) – How many cores to use. Defaults to None.
use_neural_network (bool, optional) – Whether to use neural networks. Defaults to True.

Returns

The trained object.

Return type

self
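
When xval and yval are not supplied, a fraction of the training data given by val_split is held out for validation. The sketch below shows one way such a split could work; the helper `holdout_split`, its `seed` argument, and the shuffling strategy are assumptions, not the module's actual implementation.

```python
import random


def holdout_split(x, y, val_split=0.2, seed=0):
    """Shuffle the rows and hold out a `val_split` fraction for validation."""
    if not 0.0 < val_split < 1.0:
        raise ValueError("val_split must be between 0 and 1")
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_val = max(1, int(round(len(x) * val_split)))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    x_train = [x[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    x_val = [x[i] for i in val_idx]
    y_val = [y[i] for i in val_idx]
    return x_train, y_train, x_val, y_val


# Ten toy rows with two features each and a binary label.
x = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]
x_train, y_train, x_val, y_val = holdout_split(x, y, val_split=0.2)
```

With ten rows and val_split=0.2, two rows go to validation and eight remain for training.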

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

get_sorted_models_scores(xtest=None, ytest=None, file_load=None, reverse=True, **kwargs)

Get the scores of the best trained models on the test data, in sorted order.

This provides the list of best scores for later display in a front end.
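
The effect of the reverse flag can be pictured with a small sketch. The return format of the real method is not documented here, so the list of (name, score) pairs, the helper `sort_model_scores`, and the scores themselves are hypothetical.

```python
def sort_model_scores(scores, reverse=True):
    """Return (model_name, score) pairs ordered by score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=reverse)


# Hypothetical test scores keyed by algorithm name.
scores = {"RandomForest": 0.91, "LogisticRegression": 0.87, "SVC": 0.89}

ranked = sort_model_scores(scores)                      # highest score first
ranked_ascending = sort_model_scores(scores, reverse=False)  # lowest score first
```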

predict(x=None, file_load=None, **kwargs)

Get predictions for in-memory data or a file, using the best trained models.

Parameters

x (array, optional) – Test data. Defaults to None.
file_load (FileLoad, optional) – FileLoad object containing the data and labels. Defaults to None.

Returns

Predictions.

Return type

array

predict_proba(x=None, file_load=None, **kwargs)

Get predicted probabilities from the best trained model.

Parameters

x (array, optional) – Test data. Defaults to None.
file_load (FileLoad, optional) – FileLoad object. Defaults to None.

Raises

NotImplementedError – Raised if the model does not support predict_proba.

Returns

Predicted probabilities for the test data.

Return type

array
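
The NotImplementedError case can be sketched as a capability check on the underlying model. The two model classes and the helper `safe_predict_proba` below are hypothetical stand-ins; how AutoML itself detects missing support is not specified in this reference.

```python
class ProbabilisticModel:
    """Hypothetical model that can report class probabilities."""

    def predict_proba(self, x):
        return [[0.5, 0.5] for _ in x]


class HardLabelModel:
    """Hypothetical model that only returns hard labels."""

    def predict(self, x):
        return [0 for _ in x]


def safe_predict_proba(model, x):
    """Call predict_proba if the model provides it, otherwise raise."""
    predict_proba = getattr(model, "predict_proba", None)
    if not callable(predict_proba):
        raise NotImplementedError(
            f"{type(model).__name__} does not support predict_proba"
        )
    return predict_proba(x)


probabilities = safe_predict_proba(ProbabilisticModel(), [[1], [2]])
```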

classmethod reconstruct(models_path=None, *args, **kwargs)

Used by the RESTful API to re-create an object from trained models.

Parameters: models_path (str, optional) – Where the trained models are stored. Defaults to None.
Returns: A reconstructed object for API use cases.
Return type: AutoML

score(x=None, y=None, file_load=None, **kwargs)

Get the score based on test data and labels.

Classification uses accuracy; regression uses the R2 score.

Parameters

x (array, optional) – Test data. Defaults to None.
y (array, optional) – Test labels. Defaults to None.
file_load (FileLoad, optional) – FileLoad object containing the data and labels. Defaults to None.

Returns

evaluation score

Return type

float
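
For reference, the two metrics named above can be written out from their standard definitions. This is a plain-Python sketch with made-up data, not the code path used by score; the function names are chosen for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # SS_tot is zero for constant targets; that case is not handled here.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Three of four labels are correct.
acc = accuracy([0, 1, 1, 0], [0, 1, 0, 0])

# Residual sum of squares is 1.0 and total sum of squares is 5.0.
r2 = r2_score([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```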

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance
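
The `<component>__<parameter>` naming can be illustrated by splitting keys on the first double underscore. The helper `split_nested_params` and the component name `clf` are hypothetical; this is a simplified sketch of the convention, not the estimator's own code.

```python
def split_nested_params(params):
    """Group `<component>__<parameter>` keys by component name."""
    top_level, nested = {}, {}
    for key, value in params.items():
        # partition splits on the first "__" only, so deeper nesting
        # stays in sub_key for the component to resolve itself.
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(component, {})[sub_key] = value
        else:
            top_level[key] = value
    return top_level, nested


top_level, nested = split_nested_params(
    {"n_ensemble": 5, "clf__C": 0.1, "clf__kernel": "rbf"}
)
```

Here `n_ensemble` is applied to the estimator itself, while the two `clf__` entries are forwarded to the component named `clf`.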

class automl.estimator.ClassificationAutoML(models_path=None, include_estimators=None, exclude_estimators=None, include_preprocessors=None, exclude_preprocessors=None, **kwargs)

Bases: automl.estimator.AutoML

Adds algorithm selection and preprocessing selection, along with other options as needed.

Parameters: models_path (str, optional) – Where to store the models. Defaults to None.

fit(x=None, y=None, file_load=None, xval=None, yval=None, val_split=0.2, n_jobs=None, use_neural_network=True, *args, **kwargs)

Main training entry point, with support for both file and in-memory objects.

The full training step, including the pre-processing pipeline and the training pipeline, happens here. Various types of data are supported and are converted into a plain array for the training algorithms. A training pipeline is instantiated with different algorithms and their selected hyper-parameters, grid search is used to find the best hyper-parameters, and the trained models are stored with their validation scores attached to the algorithm name.

Parameters

x (array, optional) – Training data. Defaults to None.
y (array, optional) – Training labels. Defaults to None.
file_load (FileLoad, optional) – FileLoad object containing the data and labels. Defaults to None.
xval (array, optional) – Validation data. Defaults to None.
yval (array, optional) – Validation labels. Defaults to None.
val_split (float, optional) – Fraction of the data held out for validation if xval and yval are not provided. Defaults to 0.2.
n_jobs (int, optional) – How many cores to use. Defaults to None.
use_neural_network (bool, optional) – Whether to use neural networks. Defaults to True.

Returns

The trained object.

Return type

self

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

get_sorted_models_scores(xtest=None, ytest=None, file_load=None, reverse=True, **kwargs)

Get the scores of the best trained models on the test data, in sorted order.

This provides the list of best scores for later display in a front end.

predict(x=None, file_load=None, **kwargs)

Get predictions for in-memory data or a file, using the best trained models.

Parameters

x (array, optional) – Test data. Defaults to None.
file_load (FileLoad, optional) – FileLoad object containing the data and labels. Defaults to None.

Returns

Predictions.

Return type

array

predict_proba(x=None, file_load=None, **kwargs)

Get predicted probabilities from the best trained model.

Parameters

x (array, optional) – Test data. Defaults to None.
file_load (FileLoad, optional) – FileLoad object. Defaults to None.

Raises

NotImplementedError – Raised if the model does not support predict_proba.

Returns

Predicted probabilities for the test data.

Return type

array

classmethod reconstruct(models_path=None, *args, **kwargs)

Used by the RESTful API to re-create an object from trained models.

Parameters: models_path (str, optional) – Where the trained models are stored. Defaults to None.
Returns: A reconstructed object for API use cases.
Return type: AutoML

score(x=None, y=None, file_load=None, **kwargs)

Get the score based on test data and labels.

Classification uses accuracy; regression uses the R2 score.

Parameters

x (array, optional) – Test data. Defaults to None.
y (array, optional) – Test labels. Defaults to None.
file_load (FileLoad, optional) – FileLoad object containing the data and labels. Defaults to None.

Returns

evaluation score

Return type

float

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

class automl.estimator.FileLoad(file_name, file_path=None, file_sep=',', label_name='label', use_for_pred=False, service_account_file_name=None, service_account_file_path=None, except_columns=None)

Bases: object

Loads data from a file; supports local files as well as GCS.

This class acts as the main container for file-based datasets, for later use.

Parameters

file_name (str) – Name of the file.
label_name (str, optional) – Name of the label column. Defaults to ‘label’.
file_path (str, optional) – Where the file is located. Defaults to None.
file_sep (str, optional) – File separator. Defaults to ‘,’.
use_for_pred (bool, optional) – Whether the file is used for prediction. Note: if the file doesn’t contain a label column, this parameter must be set to True. Defaults to False.
service_account_file_name (str, optional) – Service account file name. Defaults to None.
service_account_file_path (str, optional) – Service account file path. Defaults to None.
except_columns (list, optional) – Columns that need to be used. Defaults to None.

Raises

ValueError – [description]
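
To make the roles of file_sep, label_name and use_for_pred concrete, here is a minimal standalone sketch that parses delimited text with the standard library. The helper `load_table` is hypothetical and reads from a string rather than a file or GCS; raising ValueError for a missing label column is an assumption about when that error might occur.

```python
import csv
import io


def load_table(text, file_sep=",", label_name="label", use_for_pred=False):
    """Parse delimited text into feature rows and, when training, labels."""
    reader = csv.DictReader(io.StringIO(text), delimiter=file_sep)
    rows = list(reader)
    if use_for_pred:
        # Prediction data carries no label column, so only features are returned.
        return rows, None
    if label_name not in (reader.fieldnames or []):
        raise ValueError(f"label column {label_name!r} not found")
    # Remove the label from each row so the rows hold features only.
    labels = [row.pop(label_name) for row in rows]
    return rows, labels


# Training-style input: the last column is the label.
train_text = "f1,f2,label\n1,2,a\n3,4,b\n"
features, labels = load_table(train_text)

# Prediction-style input: semicolon-separated and without a label column.
pred_features, pred_labels = load_table("f1;f2\n5;6\n", file_sep=";", use_for_pred=True)
```

Values stay as strings in this sketch; converting them to numeric types is left out for brevity.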

class automl.estimator.RegressionAutoML(models_path=None, include_estimators=None, exclude_estimators=None, include_preprocessors=None, exclude_preprocessors=None, **kwargs)

Bases: automl.estimator.AutoML

Adds algorithm selection and preprocessing selection, along with other options as needed.

Parameters: models_path (str, optional) – Where to store the models. Defaults to None.

fit(x=None, y=None, file_load=None, xval=None, yval=None, val_split=0.2, n_jobs=None, use_neural_network=True, *args, **kwargs)

Main training entry point, with support for both file and in-memory objects.

The full training step, including the pre-processing pipeline and the training pipeline, happens here. Various types of data are supported and are converted into a plain array for the training algorithms. A training pipeline is instantiated with different algorithms and their selected hyper-parameters, grid search is used to find the best hyper-parameters, and the trained models are stored with their validation scores attached to the algorithm name.

Parameters

x (array, optional) – Training data. Defaults to None.
y (array, optional) – Training labels. Defaults to None.
file_load (FileLoad, optional) – FileLoad object containing the data and labels. Defaults to None.
xval (array, optional) – Validation data. Defaults to None.
yval (array, optional) – Validation labels. Defaults to None.
val_split (float, optional) – Fraction of the data held out for validation if xval and yval are not provided. Defaults to 0.2.
n_jobs (int, optional) – How many cores to use. Defaults to None.
use_neural_network (bool, optional) – Whether to use neural networks. Defaults to True.

Returns

The trained object.

Return type

self

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

get_sorted_models_scores(xtest=None, ytest=None, file_load=None, reverse=True, **kwargs)

Get the scores of the best trained models on the test data, in sorted order.

This provides the list of best scores for later display in a front end.

predict(x=None, file_load=None, **kwargs)

Get predictions for in-memory data or a file, using the best trained models.

Parameters

x (array, optional) – Test data. Defaults to None.
file_load (FileLoad, optional) – FileLoad object containing the data and labels. Defaults to None.

Returns

Predictions.

Return type

array

predict_proba(x=None, file_load=None, **kwargs)

Get predicted probabilities from the best trained model.

Parameters

x (array, optional) – Test data. Defaults to None.
file_load (FileLoad, optional) – FileLoad object. Defaults to None.

Raises

NotImplementedError – Raised if the model does not support predict_proba.

Returns

Predicted probabilities for the test data.

Return type

array

classmethod reconstruct(models_path=None, *args, **kwargs)

Used by the RESTful API to re-create an object from trained models.

Parameters: models_path (str, optional) – Where the trained models are stored. Defaults to None.
Returns: A reconstructed object for API use cases.
Return type: AutoML

score(x=None, y=None, file_load=None, **kwargs)

Get the score based on test data and labels.

Classification uses accuracy; regression uses the R2 score.

Parameters

x (array, optional) – Test data. Defaults to None.
y (array, optional) – Test labels. Defaults to None.
file_load (FileLoad, optional) – FileLoad object containing the data and labels. Defaults to None.

Returns

evaluation score

Return type

float

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance