How bagging and boosting work and which is best
In this guest article, our learner Kristina goes through the basics of ensemble learning. She takes an in-depth look at both bagging and boosting, before comparing their similarities and differences and explaining which option is best in different situations.
Author Kristina Grigaitytė, Turing College learner
Everything should be made as simple as possible, but not simpler.
The same approach of starting with a very simple model can be applied to machine learning engineering, and it usually proves very valuable. This article is exactly about that, so let us dig in! 🤗
To start with, let's identify the problems. The main sources of error in machine learning are noise, bias, and variance, and a single model is rarely enough to battle all three. Instead of building only one model to predict the target (or future), how about combining several? This is the main idea behind ensemble learning. We build multiple machine learning models, and we call these models weak learners. By combining these weak learners, we make a strong learner, which generalizes to predict all the target classes with a decent amount of accuracy. By using ensemble methods, we're able to increase the stability of the final model and reduce the errors mentioned previously. What are these methods, you ask? Well, there are three main ways to look at ensemble learning:
- Bagging (to decrease the model’s variance).
- Boosting (to decrease the model’s bias).
- Stacking (to increase the predictive force of the classifier).
As mentioned, we build multiple models, so how do these models differ from one another?
- Homogeneous ensemble methods: models are built using the same machine learning algorithm.
- Heterogeneous ensemble methods: models are built using different machine learning algorithms.
But today I will focus on homogeneous ensemble methods: in particular, their similarities and differences. If you want to learn more about Stacking, you can read this article, “Stacking”.
So let's start with the bagging method. Bagging is shorthand for the combination of bootstrapping and aggregating.
As previously stated, the idea behind bagging is to combine the results of multiple models (for instance, all decision trees) to get a generalized result. Here's a question: if you create all the models on the same set of data and combine them, will that be useful? There is a high chance that these models will give the same result, since they are getting the same input. So how can we solve this problem? One of the techniques is bootstrapping.
Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement. The simplest approach with bagging is to use a couple of small subsamples and bag them. If the ensemble accuracy is much higher than the base models, it’s working!
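As a small illustration (a sketch using NumPy, with a toy array standing in for a real dataset), drawing one bootstrap sample the same size as the original data looks like this:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # a toy "dataset" of 10 observations

# Sample with replacement: some observations repeat, others are left out
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)
```

Because sampling is done with replacement, each subset overlaps with the original data but differs from the other subsets, which is what makes the base models disagree.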
But there is a tradeoff between base model accuracy and the gain you get through bagging. The aggregation from bagging may improve the ensemble greatly when you have an unstable model; however, when your base models are more stable — being trained on larger subsamples with higher accuracy — the improvements from bagging reduce.
Once the bagging is done, and all the models have been created on (mostly) different data, the final prediction is made by combining the models: typically a majority vote for classification, or a (possibly weighted) average for regression.
Bagging steps:
1. Multiple subsets are created from the original dataset.
2. A base model (weak model) is created on each of these subsets.
3. The final predictions are determined by combining the predictions from all the models.
Here is a more practical example of how we can use it in Python: scikit-learn implements a BaggingClassifier in sklearn.ensemble.
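A minimal sketch (the iris dataset is used here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With no estimator specified, BaggingClassifier defaults to a decision tree
bagging = BaggingClassifier(n_estimators=10, random_state=0)
bagging.fit(X_train, y_train)
print(bagging.score(X_test, y_test))
```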
The code above is using the default base_estimator, a decision tree. It fits each tree on random samples drawn with replacement, and those samples are taken across all of the features in the training set.
When you grow many such random trees, the result is called Random Forest. Random Forest is an extension of bagging: it takes one extra step where, in addition to taking a random subset of the data, it also takes a random selection of features, rather than using all features to grow the trees.
Let’s look at the steps taken to implement Random forest:
1. Suppose there are N observations and M features in the training data set. First, a sample is taken from the training data set randomly with replacement.
2. A subset of the M features is selected randomly, and whichever feature gives the best split is used to split the node; this is repeated at each node.
3. The tree is grown.
4. The above steps are repeated, and the prediction is given by aggregating the predictions from all the trees.
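These steps are exactly what scikit-learn's RandomForestClassifier implements (again sketched on the toy iris data):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features controls the random subset of features tried at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```

The `max_features` parameter is the knob that distinguishes a random forest from plain bagging of trees: set it to the total number of features and you are essentially back to bagging.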
As you can see, the overall steps are far from complicated, which is a big plus when you are just starting with machine learning in general. With that thought in mind, let's move on to boosting!
Compared to bagging, boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model.
When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. When the whole set is combined at the end, the weak learners are converted into a better-performing model. The individual models would not perform well on the entire dataset, but each works well for some part of it. Thus, each model actually boosts the performance of the ensemble.
Boosting process:
1. A model is built on the training data, and predictions are made on the dataset.
2. The observations which are incorrectly predicted are given higher weights.
3. Another model is created, and the process repeats; the models are then combined into a final ensemble.
Unlike in the bagging example, subset creation in classical boosting is not random: it depends on the performance of the previous models.
Let’s see how this idea works through the example, shall we?
One of the simplest boosting algorithms is Adaptive boosting or AdaBoost (usually, decision trees are used for modeling). AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting.
The AdaBoost technique follows a decision tree model with a depth equal to one. It is nothing but a forest of stumps (one node and two leaves) rather than full trees. A stump is not great at making accurate classifications on its own, so it is a weak learner. A combination of many weak classifiers makes a strong classifier, and this is the principle behind the AdaBoost algorithm. Some stumps perform, or classify, better than others, so consecutive stumps are made by taking the previous stump's mistakes into account.
scikit-learn offers a module that we can use to gain the benefits of the Adaptive Boosting algorithm (other examples of boosting algorithms include XGBoost, LightGBM, CatBoost, LPBoost, GradientBoost, and BrownBoost):
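A minimal sketch using that module, AdaBoostClassifier from sklearn.ensemble (once more on the toy iris data):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default base estimator is a depth-1 decision tree (a "stump")
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```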
In the end, the accuracy of the algorithm can be far superior to that attained by a single decision tree.
So far, we have looked into details about bagging and boosting separately. It would be a good idea to sum up the similarities and differences in a more general way.
Bagging and Boosting: Similarities
- Bagging and Boosting are ensemble methods focused on getting N learners from a single learner.
- Bagging and Boosting use random sampling to generate several training data sets.
- Bagging and Boosting arrive at the final decision by averaging the N learners' predictions or by taking a majority vote among them.
- Bagging and Boosting both aim to reduce error and provide a more stable final model than any single learner.
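As a toy illustration of how an ensemble arrives at its final decision, combining the class predictions of several learners by majority vote might look like this (the learner outputs here are made up for the example):

```python
from collections import Counter

# Hypothetical predictions from five weak learners for one observation
predictions = ["cat", "dog", "cat", "cat", "dog"]

# Majority vote: the most common prediction wins
final_prediction = Counter(predictions).most_common(1)[0][0]
print(final_prediction)  # → "cat"
```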
Bagging and Boosting: Differences
- In the case of Bagging, any element has the same probability of appearing in a new data set. For Boosting, however, the observations are weighted, and therefore some of them will take part in the new sets more often.
- Bagging is a method of merging the same type of predictions. Boosting is a method of merging different types of predictions.
- Bagging decreases variance, not bias, and solves over-fitting issues in a model. Boosting decreases bias, not variance.
- Models are built independently in Bagging. New models are affected by a previously built model’s performance in Boosting.
Bagging and Boosting: which is better?
The question may come to your mind of whether you should select Bagging or Boosting for a particular problem (spoiler alert: it depends on the data, the simulation, and the circumstances).
If the problem is that the single model performs very poorly, Bagging will rarely give you a better bias. Boosting, however, could generate a combined model with lower errors, as it optimizes the advantages and reduces the pitfalls of the single model.
By contrast, if the difficulty with the single model is over-fitting, then Bagging is the best option. Boosting, for its part, doesn't help you avoid over-fitting; in fact, the technique faces this problem itself. For this reason, Bagging is the more effective choice in that situation.
So the truth is that we don’t have any hard rules for which method to use in most cases. This is where experience and subject matter expertise comes in! It may seem easy to jump on the first model that works. However, it is important to analyze the algorithm and all the features it selects. It is not just about trying AdaBoost, or Random forests on various datasets. The final algorithm is driven depending on the results it is getting and the support provided.
Now that we have thoroughly described the concepts of Bagging and Boosting, we have arrived at the end of the article and can conclude that both are equally important. It is no secret that ensemble methods generally outperform a single model. This is why many Kaggle winners have utilized ensemble methodologies. Today we learned how bagging and boosting methods are different by understanding ensemble learning. In this process, we learned about bootstrapping, weak learner concepts, and how these methods vary in each level of modeling. Thank you for reading! ☀