Ensemble Learning: Key Techniques Explained
Author: Kristina Grigaitytė, STL at Turing College
Albert Einstein once said, "Everything should be made as simple as possible, but not simpler." This philosophy works great in machine learning engineering, too. Starting with a basic model often pays off big time.
In this article, I'll cover the basics of ensemble machine learning. I'll also examine the bagging and boosting techniques in depth, comparing their similarities and differences and explaining which option is best in different situations. Let's get started!
What Is Ensemble Learning?
Ensemble learning is a machine learning technique combining multiple learning algorithms to achieve better predictive accuracy than any single learning algorithm alone. In essence, ensemble techniques in machine learning are like getting advice from several experts before making a big decision, like buying a house. Relying on just one opinion isn't always the best move.
Instead of depending on only one model, which may fall short due to noise, bias, or variance, why not use several? The underlying concept behind ensemble learning is blending the outputs of different models to make more accurate predictions. By using the strengths of various models, ensemble learning improves overall performance and resilience against data uncertainties. Unlike an ensemble in statistical mechanics, which is in principle infinite, a machine learning ensemble consists of a concrete, finite set of alternative models, which keeps it practical and adaptable.
Research consistently shows that ensemble learning delivers better results than individual models, whether the base learners are classical machine learning models or convolutional neural networks (CNNs).
Ensemble Learning Applications
Ensemble learning's flexibility and reliability make it a top choice for tackling complex problems across various industries and tasks.
Fraud Detection
Fraud detection deals with identifying scams like credit card fraud, money laundering, and telecommunication fraud. These areas are rich in research and machine learning applications. Ensemble learning strengthens normal behavior modeling, making it a powerful tool for spotting fraud in banking and credit card systems.
Healthcare Diagnostics
Ensemble learning boosts diagnostic accuracy by combining multiple algorithms to analyze medical data. This approach reduces the risk of misdiagnosis, offering healthcare professionals a reliable second opinion. Ensemble models have been successfully applied in areas like neuroscience, proteomics, and medical diagnosis, such as detecting neuro-cognitive disorders (e.g., Alzheimer's or myotonic dystrophy) from MRI data and classifying cervical cytology.
Malware Detection
In cybersecurity, machine learning techniques are used to classify malware such as viruses, worms, trojans, ransomware, and spyware, a task closely related to document categorization. Machine learning classifiers have proven to be quite effective in this area.
Financial Decision-Making
Predicting business failures is crucial in finance. That's why we use different ensemble models to foresee financial crises. In cases of trade-based manipulation, where some traders try to influence prices through their buying and selling activities, ensemble classifiers help analyze market data and spot suspicious behavior.
Emotion Recognition
Speech recognition mainly relies on deep learning, with big players like Google, Microsoft, and IBM leading the way. However, ensemble learning also performs well in recognizing emotions from speech. It's also used effectively for detecting emotions through facial expressions.
Introduction to Ensemble Learning Techniques
Getting started with ensemble learning involves building multiple machine learning algorithms, known as base models, base learners, or weak learners. These can be small decision trees (commonly used), linear models, support vector machines (SVM), neural networks, or any other algorithms that are able to make predictions better than random guessing. By combining them, we create a strong learner that can predict target classes with a decent amount of accuracy. Ensemble methods help increase the stability of the final model and reduce errors.
So, what are these methods? There are three main ensemble learning methods:
- Bagging creates multiple subsets of the dataset through bootstrapping to decrease the model's variance.
- Boosting trains ensemble members sequentially, with each new model focusing on the mistakes of the previous ones. The goal is to decrease the model's bias.
- Stacking blends different weak learners into an ensemble model to combine predictions. The goal is to increase the classifier's predictive force.
Now, how do these models differ from one another?
- Homogeneous ensembles: Bagging and boosting build many instances of the same type of base model; bagging trains them in parallel, while boosting trains them sequentially.
- Heterogeneous ensembles: Stacking uses different types of base models, each trained independently, and combines their predictions.
This article focuses on homogeneous ensemble methods, highlighting their similarities and differences. If you want to learn more about stacking, check out this article.
Voting
How do ensemble methods combine base models to create a strong learner? Some techniques – e.g., stacking – use separate machine learning algorithms to train a strong ensemble learner from the weak ones. However, a common method for combining predictions is voting, specifically majority voting.
Majority voting takes each base model's prediction for a given data point and determines the final prediction based on the majority. For example, in a binary classification problem, majority voting gathers predictions from each base classifier for a data instance and uses the majority's choice as the final prediction. Weighted majority voting extends this approach by giving more importance to certain models' predictions over others.
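As a quick, hedged illustration, here is how majority voting and weighted majority voting might look with scikit-learn's VotingClassifier; the base models and the weights below are illustrative choices, not recommendations:
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=3)

# Hard voting: each base classifier casts one vote and the majority wins.
# The optional weights turn this into weighted majority voting
# (here the logistic regression's vote counts twice).
vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
    weights=[2, 1, 1],
)
vote.fit(X, y)
print(vote.score(X, y))
Setting voting="soft" would average the predicted class probabilities instead of counting votes.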
Bagging
Bagging is a supervised learning technique used for both regression and classification. Its name comes from "bootstrapping" and "aggregating."
As mentioned earlier, the idea is to combine the results of multiple models, like decision trees, to get a more generalized result. But if you train all these models on the same dataset, they'll likely produce the same results. So, how can you fix this? That's where bootstrapping comes in.
Bootstrapping creates random subsets of the original training data by sampling with replacement. The most straightforward approach is to draw a number of small subsamples and bag them: if the ensemble's accuracy is noticeably higher than that of the individual base models, the approach is working.
There is a tradeoff, though, between the accuracy of the base models and the gain from aggregation. Smaller subsamples increase diversity in the ensemble, which helps most with unstable models such as deep decision trees. If your base models are stable and trained on larger, more representative subsamples, the improvement from aggregation is less pronounced.
After training models on (mostly) different data, their outputs are combined: an average of the predictions (for regression) or a majority vote (for classification) determines the final result.
Bagging steps:
- Multiple subsets are created from the original dataset.
- A base model is built on each of these subsets.
- The final prediction is made by combining the predictions from all these models.
Here is a more practical example of using it in Python. scikit-learn implements a bagging classifier in sklearn.ensemble.BaggingClassifier.
from sklearn.ensemble import BaggingClassifier
The code above uses the default base estimator, a decision tree. Each base model is fit on a random sample drawn with replacement, and by default the samples keep all of the features in the training set.
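To make this concrete, here is a minimal sketch of fitting and evaluating a bagging classifier on synthetic data; the dataset and hyperparameters are illustrative, not tuned:
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 50 decision trees, each trained on its own bootstrap sample of the data
bagging = BaggingClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(bagging, X, y, cv=5)
print(scores.mean())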
Random Forests
A random forest is a popular ensemble method that combines multiple decision trees to make a final prediction. Instead of using the entire dataset and all features to grow trees, it relies on random subsets of features and data.
Steps to implement a random forest:
- Start with training data that has N observations and M features. Randomly select a sample from this data with replacement.
- Randomly choose a subset of the M features at each node, and use the best feature from that subset to split the node. Repeat this process to grow the tree.
- Repeat these steps for multiple trees, and combine the predictions from all the trees for the final result.
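Here is a minimal sketch of those steps using scikit-learn's built-in implementation; the settings are illustrative rather than tuned:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees; each split considers only a random subset of the features (sqrt(M) here)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))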
These steps are straightforward, making random forests a great starting point for those new to machine learning. With that thought in mind, let's move on to boosting!
Boosting
Boosting is an ensemble technique that turns multiple weak learners into a strong one. It's a sequential ensemble technique: the ensemble of weak models is trained in series, with each new model trying to fix the errors of the previous one. If an input is misclassified, its weight is increased so the next model is more likely to get it right. By the end, these base models combine to form a better-performing model. Each individual model may not perform well on its own, but together they improve the ensemble's performance.
Here’s how boosting works:
- Train an initial model and increase the weights of the observations it predicts incorrectly.
- Create a new model and make predictions based on the reweighted dataset.
- Repeat, with each model's performance influenced by the previous ones.
Unlike in bagging, the subsets in classical boosting are not created at random: each one depends on the performance of the models that came before it.
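To see what "increasing the weights" means in practice, here is a toy, AdaBoost-style reweighting round with made-up predictions (the numbers are purely illustrative):
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, 1, 1])       # the weak learner misclassifies samples 2 and 4
w = np.full(len(y_true), 1 / len(y_true))  # start with uniform weights

err = w[y_true != y_pred].sum()            # weighted error rate of the weak learner
alpha = 0.5 * np.log((1 - err) / err)      # how much say this learner gets in the final vote
w *= np.exp(-alpha * y_true * y_pred)      # up-weight mistakes, down-weight correct predictions
w /= w.sum()                               # renormalize so the weights sum to 1
print(w)                                   # the misclassified samples now carry more weight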
Boosting Methods
One of the simplest boosting methods is adaptive boosting, or AdaBoost, usually using decision trees. AdaBoost was the first successful boosting algorithm for binary classification and is a great starting point for learning about boosting.
AdaBoost typically builds decision trees with a depth of one, so the result is a forest of stumps (a single split node with two leaves) rather than full trees. A lone stump is a weak learner that rarely classifies accurately on its own, but by combining many of them, AdaBoost forms a strong ensemble classifier. Each stump addresses the mistakes of the previous ones, improving overall performance.
scikit-learn offers a module that you can use to take advantage of the adaptive boosting algorithm:
from sklearn.ensemble import AdaBoostClassifier
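A minimal sketch of fitting it on synthetic data might look like this; the default base learner is already a depth-one tree (a stump), and the settings are illustrative:
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 100 stumps trained sequentially, each focusing on the previous ones' mistakes
ada = AdaBoostClassifier(n_estimators=100, random_state=1)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))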
Since AdaBoost, many other methods have emerged. Techniques like stochastic gradient boosting trees have proven to be highly effective for classification and regression tasks, especially with structured data.
- XGBoost (Extreme Gradient Boosting) features tree pruning, regularization, and parallel processing.
- LightGBM is a gradient boosting algorithm that uses leaf-wise tree growth and other techniques for efficiency and speed.
- CatBoost handles categorical data directly and reduces overfitting with techniques like random permutations.
- LPBoost (linear programming boosting) maximizes the margin between classes by solving a linear program, re-weighting all of the base learners at each iteration.
- BrownBoost handles noisy data by adjusting the weight of misclassified instances, improving accuracy while resisting overfitting.
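If you want to experiment with gradient boosting without installing a separate library, scikit-learn ships its own implementation; here is a minimal sketch with illustrative, untuned settings:
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# 200 shallow trees added sequentially, each one fitting the errors of the ensemble so far
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))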
Comparing Boosting and Bagging
Both techniques aim to improve model performance by combining multiple base learners, yet they do so in distinct ways. Knowing their similarities and differences helps in using their strengths effectively.
Similarities:
- They are both ensemble methods that build N learners from a single base algorithm.
- They both use random sampling to generate several training data sets.
- They both arrive at the final decision by averaging the N learners' outputs or by taking a majority vote among them.
- They both aim for a more stable final model with a lower error than any single base learner.
Differences:
- In bagging, any element has the same probability of appearing in a new data set. In boosting, observations are weighted, so some of them will appear in the new sets more often.
- In bagging, every model's prediction carries equal weight when the results are merged. In boosting, models are weighted by their performance, so stronger learners have more say in the final prediction.
- Bagging reduces variance, not bias, and helps with overfitting in a model. Boosting reduces bias, not variance.
- Bagging builds models independently. In boosting, each new model is influenced by the performance of the previous one.
Which Technique Is Better?
You may wonder which technique to choose for a particular problem. The answer? It depends on the data, the simulation, and the circumstances.
If the problem is that a single model underfits and performs poorly, bagging won't improve the bias much. Boosting, on the other hand, can reduce errors by building on the strengths and correcting the weaknesses of the individual models.
By contrast, if the problem is overfitting, then bagging is the better option. Boosting doesn't help you avoid overfitting; in fact, the technique is prone to it itself.
The truth is that there are no hard rules for which method to use in most cases. This is where experience and subject matter expertise come into play. It's tempting to stick with the first model that shows promise, but it's crucial to evaluate the algorithm and the features it selects. It's not just about trying AdaBoost or random forests on different datasets: the final choice depends on the results each algorithm delivers on your data and on the requirements of your project.
Master Ensemble Algorithms for a Rewarding Career in Machine Learning
Now that we've covered bagging and boosting, it should be clear that both play an equally important role in a machine learning engineer's toolkit. It is no secret that ensemble methods generally outperform a single model, which is why many Kaggle winners use them.
Looking to elevate your data science and machine learning skills? Explore the Turing College program in Data Science & AI. This program equips you with the knowledge and skills to excel in data analysis, data visualization, and machine learning. Learn from industry experts, work on real-world projects, and get practical experience with the latest tools and techniques. Take advantage of this opportunity to advance your data science career.