



Ensemble Learning Methods for Deep Learning Neural Networks

Last Updated on August 6, 2019

How to Improve Performance by Combining Predictions From Multiple Models.

Deep learning neural networks are nonlinear methods.

They offer increased flexibility and can scale in proportion to the amount of training data available. A downside of this flexibility is that they learn via a stochastic training algorithm, which means that they are sensitive to the specifics of the training data and may find a different set of weights each time they are trained, which in turn produces different predictions.

Generally, this is referred to as neural networks having a high variance, and it can be frustrating when trying to develop a final model to use for making predictions.

A successful approach to reducing the variance of neural network models is to train multiple models instead of a single model and to combine the predictions from these models. This is called ensemble learning and not only reduces the variance of predictions but also can result in predictions that are better than any single model.

In this post, you will discover methods for deep learning neural networks to reduce variance and improve prediction performance.

After reading this post, you will know:

  • Neural network models are nonlinear and have a high variance, which can be frustrating when preparing a final model for making predictions.
  • Ensemble learning combines the predictions from multiple neural network models to reduce the variance of predictions and reduce generalization error.
  • Techniques for ensemble learning can be grouped by the element that is varied, such as the training data, the model, and how predictions are combined.

Kick-start your project with my new book Better Deep Learning, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.

Overview

This tutorial is divided into four parts; they are:

  1. High Variance of Neural Network Models
  2. Reduce Variance Using an Ensemble of Models
  3. How to Ensemble Neural Network Models
  4. Summary of Ensemble Techniques

High Variance of Neural Network Models

Training deep neural networks can be very computationally expensive.

Very deep networks trained on millions of examples may take days, weeks, and sometimes months to train.

Google's baseline model […] was a deep convolutional neural network […] that had been trained for about six months using asynchronous stochastic gradient descent on a large number of cores.

— Distilling the Knowledge in a Neural Network, 2015.

After the investment of so much time and resources, there is no guarantee that the final model will have low generalization error, performing well on examples not seen during training.

… train many different candidate networks and then to select the best, […] and to discard the rest. There are two disadvantages with such an approach. First, all of the effort involved in training the remaining networks is wasted. Second, […] the network which had best performance on the validation set might not be the one with the best performance on new test data.

— Pages 364-365, Neural Networks for Pattern Recognition, 1995.

Neural network models are a nonlinear method. This means that they can learn complex nonlinear relationships in the data. A downside of this flexibility is that they are sensitive to initial conditions, both in terms of the initial random weights and in terms of the statistical noise in the training dataset.

This stochastic nature of the learning algorithm means that each time a neural network model is trained, it may learn a slightly (or dramatically) different version of the mapping function from inputs to outputs, which in turn will have different performance on the training and holdout datasets.

As such, we can think of a neural network as a method that has a low bias and high variance. Even when trained on large datasets to satisfy the high variance, having any variance in a final model that is intended to be used to make predictions can be frustrating.

Want Better Results with Deep Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Reduce Variance Using an Ensemble of Models

A solution to the high variance of neural networks is to train multiple models and combine their predictions.

The idea is to combine the predictions from multiple good but different models.

A good model has skill, meaning that its predictions are better than random chance. Importantly, the models must be good in different ways; they must make different prediction errors.

The reason that model averaging works is that different models will usually not make all the same errors on the test set.

— Page 256, Deep Learning, 2016.

Combining the predictions from multiple neural networks adds a bias that in turn counters the variance of a single trained neural network model. The results are predictions that are less sensitive to the specifics of the training data, the choice of training scheme, and the serendipity of a single training run.

In addition to reducing the variance in the prediction, the ensemble can also result in better predictions than any single best model.

… the performance of a committee can be better than the performance of the best single network used in isolation.

— Page 365, Neural Networks for Pattern Recognition, 1995.

This approach belongs to a general class of methods called "ensemble learning" that describes methods that attempt to make the best use of the predictions from multiple models prepared for the same problem.

Generally, ensemble learning involves training more than one network on the same dataset, then using each of the trained models to make a prediction before combining the predictions in some way to make a final outcome or prediction.

In fact, ensembling of models is a standard approach in practical machine learning to ensure that the most stable and best possible prediction is made.

For example, Alex Krizhevsky, et al. in their famous 2012 paper titled "ImageNet Classification with Deep Convolutional Neural Networks" that introduced very deep convolutional neural networks for photo classification (i.e. AlexNet) used model averaging across multiple well-performing CNN models to achieve state-of-the-art results at the time. The performance of one model was compared to ensemble predictions averaged over two, five, and seven different models.

Averaging the predictions of five similar CNNs gives an error rate of 16.4%. […] Averaging the predictions of two CNNs that were pre-trained […] with the aforementioned five CNNs gives an error rate of 15.3%.

Ensembling is also the approach used by winners in machine learning competitions.

Another powerful technique for obtaining the best possible results on a task is model ensembling. […] If you look at machine-learning competitions, in particular on Kaggle, you'll see that the winners use very large ensembles of models that inevitably beat any single model, no matter how good.

— Page 264, Deep Learning With Python, 2017.

How to Ensemble Neural Network Models

Perhaps the oldest and still most commonly used ensembling approach for neural networks is called a "committee of networks."

A collection of networks with the same configuration and different initial random weights is trained on the same dataset. Each model is then used to make a prediction and the actual prediction is calculated as the average of the predictions.

The number of models in the ensemble is often kept small both because of the computational expense of training models and because of the diminishing returns in performance from adding more ensemble members. Ensembles may be as small as three, five, or 10 trained models.
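
As a minimal sketch of this idea (not code from the original post), the snippet below trains a small committee of Keras models that differ only in their random initial weights on a synthetic classification problem, then averages their predicted probabilities; the `fit_model()` helper and the blobs dataset are illustrative assumptions.

```python
# Sketch: a "committee of networks" (model averaging ensemble).
# The dataset and model configuration are illustrative.
import numpy as np
from sklearn.datasets import make_blobs
from tensorflow import keras

# small synthetic multi-class classification problem
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=1)
y_cat = keras.utils.to_categorical(y)

def fit_model(X, y):
    # same configuration each time; only the random initial weights differ
    model = keras.Sequential([
        keras.layers.Dense(25, activation='relu', input_shape=(X.shape[1],)),
        keras.layers.Dense(y.shape[1], activation='softmax'),
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.fit(X, y, epochs=100, verbose=0)
    return model

# train a small committee, e.g. five members
members = [fit_model(X, y_cat) for _ in range(5)]

# average the predicted probabilities and take the argmax as the final class
yhats = np.array([m.predict(X, verbose=0) for m in members])
ensemble_classes = np.argmax(np.mean(yhats, axis=0), axis=1)
```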

The field of ensemble learning is well studied and there are many variations on this simple theme.

It can be helpful to think of varying each of the three major elements of the ensemble method; for example:

  • Training Data: Vary the choice of data used to train each model in the ensemble.
  • Ensemble Models: Vary the choice of the models used in the ensemble.
  • Combinations: Vary the choice of the way that outcomes from ensemble members are combined.

Let's take a closer look at each element in turn.

Varying Training Data

The data used to train each member of the ensemble can be varied.

The simplest approach would be to use k-fold cross-validation to estimate the generalization error of the chosen model configuration. In this procedure, k different models are trained on k different subsets of the training data. These k models can then be saved and used as members of an ensemble.
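
The sketch below illustrates this procedure under the same assumptions as the committee example above (the synthetic data and `fit_model()` helper are reused and remain hypothetical); each fold contributes both an estimate of generalization error and a saved ensemble member.

```python
# Sketch: k-fold cross-validation ensemble -- each fold's model is kept as a member.
import numpy as np
from sklearn.model_selection import KFold

members, scores = [], []
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X):
    # train on the fold's training split, estimate error on the held-out split
    model = fit_model(X[train_ix], y_cat[train_ix])
    scores.append(model.evaluate(X[test_ix], y_cat[test_ix], verbose=0))
    members.append(model)  # keep the model as an ensemble member

print('Estimated loss: %.3f' % np.mean(scores))
# the k saved members can now be combined exactly as in the committee example
```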

Another popular approach involves resampling the training dataset with replacement, then training a network using the resampled dataset. The resampling procedure means that the composition of each training dataset is different, with the possibility of duplicated examples, allowing the model trained on the dataset to have a slightly different expectation of the density of the samples, and in turn a different generalization error.

This approach is called bootstrap aggregation, or bagging for short, and was designed for use with unpruned decision trees that have high variance and low bias. Typically a large number of decision trees are used, such as hundreds or thousands, given that they are fast to prepare.

… a natural way to reduce the variance and hence increase the prediction accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions. […] Of course, this is not practical because we generally do not have access to multiple training sets. Instead, we can bootstrap, by taking repeated samples from the (single) training data set.

— Pages 216-317, An Introduction to Statistical Learning with Applications in R, 2013.
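
A minimal sketch of bootstrap aggregation for neural networks, again reusing the hypothetical data and `fit_model()` helper from the committee example; scikit-learn's `resample()` draws each bootstrap sample with replacement.

```python
# Sketch: bagging ensemble -- each member is trained on a bootstrap sample
# of the training data, drawn with replacement.
import numpy as np
from sklearn.utils import resample

n_members = 10
members = []
for _ in range(n_members):
    # resample row indices with replacement; duplicated rows are expected
    ix = resample(np.arange(len(X)), replace=True, n_samples=len(X))
    members.append(fit_model(X[ix], y_cat[ix]))

# combine by averaging predicted probabilities, as before
yhats = np.array([m.predict(X, verbose=0) for m in members])
bagged_classes = np.argmax(np.mean(yhats, axis=0), axis=1)
```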

An equivalent approach might be to use a smaller subset of the training dataset without regularization to allow faster training and some overfitting.

The desire for slightly under-optimized models applies to the selection of ensemble members more generally.

… the members of the committee should not individually be chosen to have optimal trade-off between bias and variance, but should have relatively smaller bias, since the extra variance can be removed by averaging.

— Page 366, Neural Networks for Pattern Recognition, 1995.

Other approaches may involve selecting a random subspace of the input space to allocate to each model, such as a subset of the hyper-volume in the input space or a subset of input features.

Ensemble Tutorials

For examples of deep learning ensembles that vary training data see:

  • How to Develop a Random-Split, Cross-Validation, and Bagging Ensemble for Deep Learning

Varying Models

Training the same under-constrained model on the same data with different initial conditions will result in different models, given the difficulty of the problem and the stochastic nature of the learning algorithm.

This is because the optimization problem that the network is trying to solve is so challenging that there are many "good" and "different" solutions to map inputs to outputs.

Most neural network algorithms achieve sub-optimal performance specifically due to the existence of an overwhelming number of sub-optimal local minima. If we have a set of neural networks which have converged to local minima and apply averaging we can construct an improved estimate. One way to understand this fact is to consider that, in general, networks which have fallen into different local minima will perform poorly in different regions of feature space and thus their error terms will not be strongly correlated.

— When networks disagree: Ensemble methods for hybrid neural networks, 1995.

This may result in a reduced variance, but may not dramatically improve generalization error. The errors made by the models may still be too highly correlated because the models have all learned similar mapping functions.

An alternative approach might be to vary the configuration of each ensemble model, such as using networks with different capacity (e.g. number of layers or nodes) or models trained under different conditions (e.g. learning rate or regularization).

The result may be an ensemble of models that have learned a more heterogeneous collection of mapping functions and in turn have a lower correlation in their predictions and prediction errors.

Differences in random initialization, random selection of minibatches, differences in hyperparameters, or different outcomes of non-deterministic implementations of neural networks are often enough to cause different members of the ensemble to make partially independent errors.

— Pages 257-258, Deep Learning, 2016.

Such an ensemble of differently configured models can be achieved through the normal process of developing the network and tuning its hyperparameters. Each model could be saved during this process and a subset of the better models chosen to comprise the ensemble.

Slightly inferiorly trained networks are a free by-product of most tuning algorithms; it is desirable to use such extra copies even when their performance is significantly worse than the best performance found. Better performance nonetheless can be achieved through careful planning for an ensemble classification by using the best available parameters and training different copies on different subsets of the available database.

— Neural Network Ensembles, 1990.
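
As an illustrative sketch (assuming the same synthetic data as the earlier examples), ensemble members might differ only in a single hyperparameter such as the number of hidden nodes, with each configuration trained on the same dataset:

```python
# Sketch: vary the model configuration (here, hidden layer size) across members.
from tensorflow import keras

def fit_configured_model(X, y, n_nodes):
    model = keras.Sequential([
        keras.layers.Dense(n_nodes, activation='relu', input_shape=(X.shape[1],)),
        keras.layers.Dense(y.shape[1], activation='softmax'),
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.fit(X, y, epochs=100, verbose=0)
    return model

# differently configured members should learn less correlated mapping functions
members = [fit_configured_model(X, y_cat, n) for n in [5, 10, 25, 50, 100]]
```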

In cases where a single model may take weeks or months to train, another alternative may be to periodically save the best model during the training process, called snapshot or checkpoint models, and then select ensemble members among the saved models. This provides the benefits of having multiple models trained on the same data, although collected during a single training run.

Snapshot Ensembling produces an ensemble of accurate and diverse models from a single training procedure. At the heart of Snapshot Ensembling is an optimization process which visits several local minima before converging to a final solution. We take model snapshots at these various minima, and average their predictions at test time.

— Snapshot Ensembles: Train 1, get M for free, 2017.

A variation on the Snapshot ensemble is to save models from a range of epochs, perhaps identified by reviewing learning curves of model performance on the train and validation datasets during training. Ensembles from such contiguous sequences of models are referred to as horizontal ensembles.

First, networks trained for a relatively stable range of epoch are selected. The predictions of the probability of each label are produced by standard classifiers [over] the selected epoch[s], and then averaged.

— Horizontal and vertical ensemble with deep representation for classification, 2013.
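
A minimal sketch of a checkpoint-based (horizontal) ensemble using the Keras `ModelCheckpoint` callback; the file naming, number of epochs, and choice of the last ten snapshots are illustrative assumptions, and the data are the same hypothetical `X`, `y_cat` arrays used earlier.

```python
# Sketch: save a snapshot of the model at the end of every epoch, then
# ensemble a contiguous block of the most recent snapshots (horizontal ensemble).
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(25, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dense(y_cat.shape[1], activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')

# one saved file per epoch, e.g. model_001.h5, model_002.h5, ...
checkpoint = keras.callbacks.ModelCheckpoint('model_{epoch:03d}.h5')
model.fit(X, y_cat, epochs=50, verbose=0, callbacks=[checkpoint])

# load the snapshots from the last 10 epochs and average their predictions
members = [keras.models.load_model('model_%03d.h5' % e) for e in range(41, 51)]
yhats = np.array([m.predict(X, verbose=0) for m in members])
horizontal_classes = np.argmax(np.mean(yhats, axis=0), axis=1)
```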

A further enhancement of the snapshot ensemble is to systematically vary the optimization procedure during training to force different solutions (i.e. sets of weights), the best of which can be saved to checkpoints. This might involve injecting an oscillating amount of noise over training epochs or oscillating the learning rate during training epochs. A variation of this approach called Stochastic Gradient Descent with Warm Restarts (SGDR) demonstrated faster learning and state-of-the-art results for standard photo classification tasks.

Our SGDR simulates warm restarts by scheduling the learning rate to achieve competitive results […] roughly two to four times faster. We also achieved new state-of-the-art results with SGDR, mainly by using even wider [models] and ensembles of snapshots from SGDR's trajectory.

— SGDR: Stochastic Gradient Descent with Warm Restarts, 2016.
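
A simplified sketch of a cosine annealing schedule with warm restarts in the spirit of the SGDR paper, written as a Keras `LearningRateScheduler` callback; the learning rate bounds and fixed cycle length are illustrative assumptions, and the paper's own schedule (with cycle lengthening) is more elaborate.

```python
# Sketch: cosine annealing with warm restarts (SGDR-style) as a Keras callback.
# At each restart the learning rate jumps back to LR_MAX and decays towards
# LR_MIN, encouraging the optimizer to settle into a different minimum each
# cycle; a snapshot can be checkpointed at the end of each cycle.
import math
from tensorflow import keras

LR_MAX, LR_MIN, CYCLE_LENGTH = 0.01, 0.0001, 10

def cosine_annealing(epoch, lr):
    # the current lr argument is ignored; the schedule depends only on the epoch
    t = (epoch % CYCLE_LENGTH) / CYCLE_LENGTH
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1.0 + math.cos(math.pi * t))

model = keras.Sequential([
    keras.layers.Dense(25, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dense(y_cat.shape[1], activation='softmax'),
])
model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.SGD(momentum=0.9))
model.fit(X, y_cat, epochs=50, verbose=0,
          callbacks=[keras.callbacks.LearningRateScheduler(cosine_annealing)])
```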

A benefit of very deep neural networks is that the intermediate hidden layers provide a learned representation of the low-resolution input data. The hidden layers can output their internal representations directly, and the output from one or more hidden layers from one very deep network can be used as input to a new classification model. This is perhaps most effective when the deep model is trained using an autoencoder model. This type of ensemble is referred to as a vertical ensemble.

This method ensembles a series of classifiers whose inputs are the representation of intermediate layers. A lower error rate is expected because these features seem diverse.

— Horizontal and vertical ensemble with deep representation for classification, 2013.
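
As an illustrative sketch (assuming an already-trained Keras `model` and the integer labels `y` from the earlier examples), hidden-layer activations can be extracted with small sub-models and used to fit separate classifiers whose predicted probabilities are then averaged; the choice of logistic regression as the per-layer classifier is an assumption.

```python
# Sketch: vertical ensemble -- use hidden-layer representations as inputs
# to separate classifiers, then average the classifiers' predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from tensorflow import keras

# `model` is an already-trained deep network; build one feature extractor
# per hidden layer (all layers except the output layer)
extractors = [keras.Model(inputs=model.inputs, outputs=layer.output)
              for layer in model.layers[:-1]]

# fit one simple classifier per learned representation
pairs = []
for extractor in extractors:
    features = extractor.predict(X, verbose=0)
    clf = LogisticRegression(max_iter=1000).fit(features, y)
    pairs.append((extractor, clf))

# average the class probabilities from all per-layer classifiers
probs = np.mean([clf.predict_proba(ext.predict(X, verbose=0))
                 for ext, clf in pairs], axis=0)
vertical_classes = np.argmax(probs, axis=1)
```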

Ensemble Tutorials

For examples of deep learning ensembles that vary models see:

  • How to Develop a Snapshot Ensemble for Deep Learning
  • How to Develop a Horizontal Voting Ensemble for Deep Learning

Varying Combinations

The simplest way to combine the predictions is to calculate the average of the predictions from the ensemble members.

This can be improved slightly by weighting the predictions from each model, where the weights are optimized using a hold-out validation dataset. This provides a weighted average ensemble that is sometimes called model blending.

… we might expect that some members of the committee will typically make better predictions than other members. We would therefore expect to be able to reduce the error still further if we give greater weight to some committee members than to others. Thus, we consider a generalized committee prediction given by a weighted combination of the predictions of the members …

— Page 367, Neural Networks for Pattern Recognition, 1995.
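
A minimal sketch of a weighted average ensemble, with the member weights chosen by a crude random search over a hold-out validation set; the `members` list, the validation arrays `X_val`/`y_val`, and the search strategy are all illustrative assumptions.

```python
# Sketch: weighted average ensemble -- weight each member's predicted
# probabilities, keeping the weights that score best on a validation set.
import numpy as np

# predictions from each member on the hold-out validation set
yhats_val = np.array([m.predict(X_val, verbose=0) for m in members])

best_weights, best_acc = None, 0.0
rng = np.random.default_rng(1)
for _ in range(1000):
    w = rng.random(len(members))
    w = w / w.sum()  # normalize so the weights sum to one
    combined = np.average(yhats_val, axis=0, weights=w)
    acc = np.mean(np.argmax(combined, axis=1) == y_val)
    if acc > best_acc:
        best_weights, best_acc = w, acc

print('Best validation accuracy: %.3f' % best_acc)
```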

One further step in complexity involves using a new model to learn how to best combine the predictions from each ensemble member.

The model could be a simple linear model (e.g. much like the weighted average), but could be a sophisticated nonlinear method that also considers the specific input sample in addition to the predictions provided by each member. This general approach of learning a new model is called model stacking, or stacked generalization.

Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess. […] When used with a single generalizer, stacked generalization is a scheme for estimating (and then correcting for) the error of a generalizer which has been trained on a particular learning set and then asked a particular question.

— Stacked generalization, 1992.
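
As a minimal sketch of stacked generalization (assuming the `members` list and hypothetical hold-out arrays `X_val`, `y_val`, `X_test` from the earlier sketches, and a scikit-learn logistic regression as the meta-learner), the meta-model is fit on the concatenated member predictions:

```python
# Sketch: stacked generalization -- a meta-model learns how to combine
# the ensemble members' predictions (here a simple logistic regression).
import numpy as np
from sklearn.linear_model import LogisticRegression

def stacked_inputs(members, X):
    # concatenate each member's predicted probabilities into one row per sample
    return np.hstack([m.predict(X, verbose=0) for m in members])

# fit the meta-learner on a hold-out set not used to train the members
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(stacked_inputs(members, X_val), y_val)

# final predictions route the members' outputs through the meta-model
stacked_classes = meta_model.predict(stacked_inputs(members, X_test))
```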

There are more sophisticated methods for stacking models, such as boosting, where ensemble members are added one at a time in order to correct the mistakes of prior models. The added complexity means this approach is less often used with large neural network models.

Another combination that is a little bit different is to combine the weights of multiple neural networks with the same structure. The weights of multiple networks can be averaged, to hopefully result in a new single model that has better overall performance than any original model. This approach is called model weight averaging.

… suggests it is promising to average these points in weight space, and use a network with these averaged weights, instead of forming an ensemble by averaging the outputs of networks in model space

— Averaging Weights Leads to Wider Optima and Better Generalization, 2018.
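
A minimal sketch of model weight averaging for Keras models that share the same architecture, assuming the `members` list from the committee example; a simple equal average is shown, whereas the referenced paper averages weights collected along an SGD trajectory.

```python
# Sketch: average the weights of several models with identical architecture
# to form a single new model (model weight averaging).
import numpy as np
from tensorflow import keras

def average_weights(members):
    # each model contributes a list of weight arrays; average them layer by layer
    per_model = [m.get_weights() for m in members]
    return [np.mean(layer_weights, axis=0) for layer_weights in zip(*per_model)]

# clone the architecture of one member and load the averaged weights into it
avg_model = keras.models.clone_model(members[0])
avg_model.set_weights(average_weights(members))
avg_model.compile(loss='categorical_crossentropy', optimizer='adam')
```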

Ensemble Tutorials

For examples of deep learning ensembles that vary combinations see:

  • How to Develop a Model Averaging Ensemble for Deep Learning
  • How to Develop a Weighted Average Ensemble for Deep Learning
  • How to Develop a Stacking Ensemble for Deep Learning
  • How to Create a Polyak-Ruppert Ensemble for Deep Learning

Summary of Ensemble Techniques

In summary, we can list some of the more common and interesting ensemble methods for neural networks organized by each element of the method that can be varied, as follows:

  • Varying Training Data
    • k-fold Cross-Validation Ensemble
    • Bootstrap Aggregation (bagging) Ensemble
    • Random Training Subset Ensemble
  • Varying Models
    • Multiple Training Run Ensemble
    • Hyperparameter Tuning Ensemble
    • Snapshot Ensemble
    • Horizontal Epochs Ensemble
    • Vertical Representational Ensemble
  • Varying Combinations
    • Model Averaging Ensemble
    • Weighted Average Ensemble
    • Stacked Generalization (stacking) Ensemble
    • Boosting Ensemble
    • Model Weight Averaging Ensemble

There is no single best ensemble method; perhaps experiment with a few approaches or let the constraints of your project guide you.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Section 9.6 Committees of Networks, Neural Networks for Pattern Recognition, 1995.
  • Section 7.11 Bagging and Other Ensemble Methods, Deep Learning, 2016.
  • Section 7.3.3 Model ensembling, Deep Learning With Python, 2017.
  • Section 8.2 Bagging, Random Forests, Boosting, An Introduction to Statistical Learning with Applications in R, 2013.

Papers

  • Neural Network Ensembles, 1990.
  • Neural Network Ensembles, Cross Validation, and Active Learning, 1994.
  • When networks disagree: Ensemble methods for hybrid neural networks, 1995.
  • Snapshot Ensembles: Train 1, get M for free, 2017.
  • SGDR: Stochastic Gradient Descent with Warm Restarts, 2016.
  • Horizontal and vertical ensemble with deep representation for classification, 2013.
  • Stacked generalization, 1992.
  • Averaging Weights Leads to Wider Optima and Better Generalization, 2018.

Articles

  • Ensemble learning, Wikipedia.
  • Bootstrap aggregating, Wikipedia.
  • Boosting (machine learning), Wikipedia.

Summary

In this post, you discovered ensemble methods for deep learning neural networks to reduce variance and improve prediction performance.

Specifically, you learned:

  • Neural network models are nonlinear and have a high variance, which can be frustrating when preparing a final model for making predictions.
  • Ensemble learning combines the predictions from multiple neural network models to reduce the variance of predictions and reduce generalization error.
  • Techniques for ensemble learning can be grouped by the element that is varied, such as the training data, the model, and how predictions are combined.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Develop Better Deep Learning Models Today!

Better Deep Learning

Train Faster, Reduce Overfitting, and Build Ensembles

...with just a few lines of python code

Discover how in my new Ebook:
Better Deep Learning

It provides self-study tutorials on topics like:
weight decay, batch normalization, dropout, model stacking and much more...

Bring better deep learning to your projects!

Skip the Academics. Just Results.

See What's Inside
