
Automate Machine Learning Workflows with Pipelines in Python and scikit-learn

There are standard workflows in a machine learning project that can be automated.

In Python scikit-learn, Pipelines help you clearly define and automate these workflows.

In this post you will discover Pipelines in scikit-learn and how you can automate common machine learning workflows.

Let’s get started.


Photo by Brian Cantoni, some rights reserved.

Pipelines for Automating Machine Learning Workflows

There are standard workflows in applied machine learning. Standard because they overcome common problems like data leakage in your test harness.

Python scikit-learn provides a Pipeline utility to help automate machine learning workflows.

Pipelines work by chaining a linear sequence of data transforms together, culminating in a model that can be evaluated.

The goal is to ensure that all of the steps in the pipeline are constrained to the data available for the evaluation, such as the training dataset or each fold of the cross validation procedure.

You can learn more about Pipelines in scikit-learn by reading the Pipeline section of the user guide. You can also review the API documentation for the Pipeline and FeatureUnion classes and the pipeline module.
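As a quick illustration of the interface before the larger examples below, the sketch here chains a scaler and a classifier into a single estimator (the toy data and step names are my own, not from the examples that follow):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# toy data: 4 samples, 2 features, binary target
X = np.array([[1., 2.], [3., 4.], [5., 6.], [7., 8.]])
y = np.array([0, 0, 1, 1])

# each step is a (name, estimator) tuple; the final step is the model
pipe = Pipeline([('standardize', StandardScaler()),
                 ('clf', LogisticRegression())])

# fitting the pipeline fits the scaler, transforms X, then fits the classifier
pipe.fit(X, y)
print(pipe.predict(X))
```

The fitted pipeline behaves like any other estimator: calling predict runs the new data through the same fitted transforms before the classifier sees it.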

Pipeline 1: Data Preparation and Modeling

An easy trap to fall into in applied machine learning is leaking data from your training dataset to your test dataset.

To avoid this trap you need a robust test harness with strong separation of training and testing. This includes data preparation.

Data preparation is one easy way to leak knowledge of the whole dataset to the algorithm. For example, normalizing or standardizing your entire dataset before evaluation would not be a valid test, because the training data would have been influenced by the scale of the data in the test set.
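The point can be made concrete with a small sketch (toy data and variable names of my choosing): the scaler is fit on the training rows only, and the held-out rows are transformed with the training statistics, so nothing about the held-out data influences the preparation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy data: first 7 rows play the role of training data, last 3 are held out
X = np.arange(20, dtype=float).reshape(10, 2)
X_train, X_test = X[:7], X[7:]

# fit the scaler on the training rows only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # uses the training mean and std

# the scaler's statistics come from the training rows alone,
# so the held-out rows cannot leak into the preparation
print(scaler.mean_)
```

Fitting the scaler on all 10 rows instead would bake the held-out rows' scale into the transform, which is exactly the leak a Pipeline prevents inside cross validation.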

Pipelines help you prevent data leakage in your test harness by ensuring that data preparation like standardization is constrained to each fold of your cross validation procedure.

The example below demonstrates this important data preparation and model evaluation workflow. The pipeline is defined with two steps:

  1. Standardize the data.
  2. Learn a Linear Discriminant Analysis model.

The pipeline is then evaluated using 10-fold cross validation.

# Create a pipeline that standardizes the data then creates a model
from pandas import read_csv
from sklearn.cross_validation import KFold
from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# create pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
model = Pipeline(estimators)
# evaluate pipeline
num_folds = 10
num_instances = len(X)
seed = 7
kfold = KFold(n=num_instances, n_folds=num_folds, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Running the example provides a summary of accuracy of the setup on the dataset.

0.773462064252 
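Note that the listing above uses the sklearn.cross_validation module, which was deprecated in scikit-learn 0.18 and later removed in favour of sklearn.model_selection, where KFold takes n_splits rather than the dataset size. The same evaluation can be sketched against the newer API as follows (synthetic data stands in for the dataset here, so the score will differ from the one above):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# synthetic stand-in for an 8-feature dataset with a binary class
rng = np.random.RandomState(7)
X = rng.randn(100, 8)
Y = (X[:, 0] + 0.5 * rng.randn(100) > 0).astype(int)

model = Pipeline([('standardize', StandardScaler()),
                  ('lda', LinearDiscriminantAnalysis())])

# n_splits replaces the old n/n_folds pair; random_state requires shuffle=True
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
```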

Pipeline 2: Feature Extraction and Modeling

Feature extraction is another procedure that is susceptible to data leakage.

Like data preparation, feature extraction procedures must be restricted to the data in your training dataset.

The pipeline provides a handy tool called the FeatureUnion which allows the results of multiple feature selection and extraction procedures to be combined into a larger dataset on which a model can be trained. Importantly, all the feature extraction and the feature union occurs within each fold of the cross validation procedure.
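A minimal sketch (random stand-in data, not the dataset used below) shows what a FeatureUnion produces: each branch transforms the input independently and the outputs are concatenated column-wise, so 3 principal components plus 6 selected features yield a 9-column matrix:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# random stand-in data: 50 samples, 8 features, binary target
rng = np.random.RandomState(0)
X = rng.randn(50, 8)
y = rng.randint(0, 2, size=50)

# each branch sees the full X; results are stacked side by side
union = FeatureUnion([('pca', PCA(n_components=3)),
                      ('select_best', SelectKBest(k=6))])
X_combined = union.fit_transform(X, y)
print(X_combined.shape)  # 3 PCA components + 6 selected features = 9 columns
```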

The example below demonstrates the pipeline defined with four steps:

  1. Feature Extraction with Principal Component Analysis (3 features)
  2. Feature Extraction with Statistical Selection (6 features)
  3. Feature Union
  4. Learn a Logistic Regression Model

The pipeline is then evaluated using 10-fold cross validation.

# Create a pipeline that extracts features from the data then creates a model
from pandas import read_csv
from sklearn.cross_validation import KFold
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
# load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)
# create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression()))
model = Pipeline(estimators)
# evaluate pipeline
num_folds = 10
num_instances = len(X)
seed = 7
kfold = KFold(n=num_instances, n_folds=num_folds, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Running the example provides a summary of accuracy of the pipeline on the dataset.

0.776042378674 


Summary

In this post you discovered the difficulties of data leakage in applied machine learning.

You discovered the Pipeline utilities in Python scikit-learn and how they can be used to automate standard applied machine learning workflows.

You learned how to use Pipelines in two important use cases:

  1. Data preparation and modeling constrained to each fold of the cross validation procedure.
  2. Feature extraction and feature union constrained to each fold of the cross validation procedure.

Do you have any questions about data leakage, Pipelines or this post? Ask your questions in the comments and I will do my best to answer.

