Author: Linda Oranya, Turing College learner
The first time I heard about PCA, I found the explanation so confusing that I decided it wasn’t necessary to use it when modelling data. If you’re like I was, fear not. In this article, I’ll clear up a lot of the confusion around PCA and help you understand its basics. I’ll also explain why it’s an especially important tool when it comes to modelling large datasets.
What is PCA? Put simply, PCA stands for Principal Component Analysis.
But what does that mean exactly? Well, PCA is an unsupervised machine learning method that is used for two things: to perform dimensionality reduction, and to visualize datasets that have too many dimensions to plot directly. PCA transforms the variables in our original dataset into a new set of variables, each of which is a linear combination of the original ones. These combined variables are called PCs (Principal Components).
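To make "combining the existing variables" concrete, here is a minimal sketch using scikit-learn's bundled copy of the Iris data. The variable names (`X_demo`, `pca_demo`, `weights`) are my own, and keeping 2 components is just an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_demo = load_iris().data                   # 150 rows, 4 original variables
pca_demo = PCA(n_components=2).fit(X_demo)

# Each row of components_ holds the weights given to the 4 original
# variables when building one principal component.
weights = pca_demo.components_              # shape (2, 4)

# Combining the centered variables with those weights reproduces
# what pca_demo.transform returns.
pcs_manual = (X_demo - pca_demo.mean_) @ weights.T
pcs_sklearn = pca_demo.transform(X_demo)

print(weights.shape, pcs_manual.shape)
print(np.allclose(pcs_manual, pcs_sklearn))
```

Each PC is therefore nothing more than a weighted sum of the original (centered) columns.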
The variance explained by each PC decreases as we move down the list, i.e. PC1 ≥ PC2 ≥ PC3 ≥ PC4 ≥ … ≥ PCn.
It is also important to note that there is no fixed rule (at least not to the best of my knowledge) for how many PCs you should select. The idea is to reduce the number of variables as much as possible, while a common rule of thumb is that the selected PCs should together capture at least 90% of the variance in your data (this will become clearer as you read further).
The number of resulting components will be less than or equal to the initial number of variables, i.e. PCs ≤ no. of variables in the original dataset.
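The three points above can be checked in a few lines. This is a sketch under my own assumptions: it standardizes the bundled Iris data with `StandardScaler` and uses scikit-learn's option of passing a fraction as `n_components` (with `svd_solver="full"`) to keep just enough PCs to reach 90% of the variance.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

# Keep every component and look at the share of variance of each one.
pca_all = PCA().fit(X_std)
ratios = pca_all.explained_variance_ratio_
print(ratios)                                    # largest share first
print(pca_all.n_components_, "<=", X_std.shape[1])

# Ask for the smallest number of PCs that captures at least 90% of the variance.
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X_std)
kept = pca_90.n_components_
captured = pca_90.explained_variance_ratio_.sum()
print(kept, captured)
```

The 90% threshold is a rule of thumb, not a requirement, so treat the value passed to `n_components` as something to tune for your own data.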
How does PCA work?
I will state the steps as clearly as possible and illustrate everything using the Iris dataset.
Steps for performing dimensionality reduction with PCA
Step 1: Normalize the data
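The first step is to normalize the data that we have so that PCA works properly. Yes! You heard me right, I used the phrase “works properly.” You see, if you don’t normalize your dataset, PCA will prioritize the columns with larger values because it deals with variance. In order to avoid this, we subtract each column’s mean from its values and divide by the column’s standard deviation. This gives us a dataset whose mean is zero and whose standard deviation is 1. Note that the code in this article uses MinMaxScaler, which instead rescales every column to the range 0 to 1; this also puts the columns on a comparable scale, but it is scikit-learn’s StandardScaler that produces zero mean and unit standard deviation.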
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
Let’s read in our Iris data. You can either download the data here or load it from scikit-learn’s bundled datasets as shown below.
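# Option 1: load the copy of Iris bundled with scikit-learn
iris_data = datasets.load_iris()
# Option 2: read the downloaded CSV instead (uncomment to use; it replaces Option 1)
# iris_data = pd.read_csv("Iris.csv", index_col='Id')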
Before you proceed, you can perform some basic EDA on the dataset. I have already written an article on that, and you can find it here.
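# normalizing the data: rescale every feature to the range 0 to 1
scaler = MinMaxScaler()
# X holds the four scaled measurement columns of the bundled Iris data
X = scaler.fit_transform(datasets.load_iris().data)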
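The normalization described in this step can be sketched as follows. The variable names are my own; the sketch compares the manual "subtract the mean, divide by the standard deviation" recipe with scikit-learn's `StandardScaler`, and shows for contrast what `MinMaxScaler` does.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_raw = load_iris().data

# Manual standardization: zero mean and unit standard deviation per column.
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# StandardScaler performs the same computation.
X_standard = StandardScaler().fit_transform(X_raw)

# MinMaxScaler rescales each column to [0, 1] instead.
X_minmax = MinMaxScaler().fit_transform(X_raw)

print(X_manual.mean(axis=0).round(6), X_manual.std(axis=0).round(6))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

Either scaler stops a column from dominating simply because it is measured in larger numbers; which one suits your data best is a judgment call.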
Step 2: Calculate the covariance matrix
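The covariance matrix has one row and one column per variable. For a dataset with two variables, X1 and X2, it is the 2x2 matrix:
[ Var[X1]      Cov[X1,X2] ]
[ Cov[X2,X1]   Var[X2]    ]
where: Var[X1] = Cov[X1,X1] and Var[X2] = Cov[X2,X2]. The Iris dataset has four features, so its covariance matrix is 4x4.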
pca = PCA()
X_new = pca.fit_transform(X)
Covariance of the matrix
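If you want to see the covariance matrix as numbers, here is a minimal sketch. It assumes the MinMax-scaled, bundled Iris data and compares NumPy's `np.cov` with the `get_covariance()` method of a fitted scikit-learn `PCA` that keeps all components; the variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X_cov = MinMaxScaler().fit_transform(load_iris().data)

# rowvar=False tells NumPy that each column (not each row) is a variable.
cov_manual = np.cov(X_cov, rowvar=False)

# A PCA fitted with all components can report the same matrix.
cov_sklearn = PCA().fit(X_cov).get_covariance()

print(cov_manual.shape)          # one row and one column per feature
print(np.round(cov_manual, 4))
```

The diagonal holds the variance of each feature and the off-diagonal entries hold the covariance between pairs of features.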
Step 3: Calculate the eigenvalues and eigenvectors
The next step is to calculate the eigenvalues and eigenvectors of the covariance matrix. λ is an eigenvalue of a matrix A if it is a solution of the characteristic equation:
det(λI - A) = 0
Here “I” is the identity matrix of the same dimension as A (the dimensions must match for the subtraction to be defined), and “det” is the determinant of the matrix. For each eigenvalue λ, a corresponding eigenvector v can be found by solving:
(λI - A)v = 0
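Both equations can be verified numerically. The sketch below is one way to do it, assuming the MinMax-scaled, bundled Iris data and NumPy's `np.linalg.eigh` (meant for symmetric matrices such as a covariance matrix); names like `residual_norms` are my own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X_eig = MinMaxScaler().fit_transform(load_iris().data)
A = np.cov(X_eig, rowvar=False)              # 4x4 covariance matrix

# eigh returns eigenvalues in ascending order; flip to largest first.
eigenvalues, eigenvectors = np.linalg.eigh(A)
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]         # column i pairs with eigenvalues[i]

identity = np.eye(A.shape[0])
determinants = []
residual_norms = []
for lam, v in zip(eigenvalues, eigenvectors.T):
    determinants.append(np.linalg.det(lam * identity - A))              # should be ~0
    residual_norms.append(np.linalg.norm((lam * identity - A) @ v))     # should be ~0

print(eigenvalues)
print(determinants)
print(residual_norms)
```

Because of floating-point rounding the determinants and residuals come out as tiny numbers rather than exact zeros.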
Now we will obtain the explained variance, which shows how much of the data’s variance each principal component accounts for. Remember when I wrote about the PCs explaining your data variance? This is it right here.
Result of PCs
Basically, the eigenvectors give the directions of the PCs, while the eigenvalues tell us how much variance lies along each of those directions.
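Here is a small sketch of how eigenvalues turn into explained variance. It assumes the MinMax-scaled, bundled Iris data, and it checks the hand-computed ratios against scikit-learn's `explained_variance_ratio_` and the eigenvectors against `components_` (which can differ by a sign flip); the variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(load_iris().data)

# Eigen-decomposition of the covariance matrix, largest eigenvalue first.
eigvals, eigvecs = np.linalg.eigh(np.cov(X_scaled, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Share of the total variance carried by each eigenvalue.
ratio_manual = eigvals / eigvals.sum()

pca_check = PCA().fit(X_scaled)
ratio_sklearn = pca_check.explained_variance_ratio_

# |dot product| of each sklearn component with the matching eigenvector;
# a value of 1 means same direction up to sign.
alignment = np.abs(np.sum(pca_check.components_ * eigvecs.T, axis=1))

print(ratio_manual)
print(ratio_sklearn)
print(alignment)
```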
Step 4: Choose components
Ever heard of the scree plot? It helps us to choose components because it shows the eigenvalues ordered from largest to smallest, giving us a picture of the components in order of significance. Now comes the dimensionality reduction part. We will lose some information by dropping components, but if the eigenvalues of the components we drop are small, we will not lose much.
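# share of the total variance captured by each principal component
explained_variance = pca.explained_variance_ratio_
plt.bar(range(4), explained_variance, alpha=0.5, align='center',
        label='individual explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')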
Scree plot showing PCs and their variance ratio
We can select 3 components from the plot above, since the remaining component explains very little of the variance.
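How much information is lost by keeping 3 of the 4 components can be measured directly. The following is a sketch under my own assumptions (MinMax-scaled, bundled Iris data; my own variable names): it reduces the data, maps it back with `inverse_transform`, and compares the reconstruction error with the variance that was dropped.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(load_iris().data)

pca_3 = PCA(n_components=3).fit(X_scaled)
X_reduced = pca_3.transform(X_scaled)             # 4 columns become 3
X_restored = pca_3.inverse_transform(X_reduced)   # back to 4 columns, approximately

# Fraction of the total variation that the reconstruction fails to recover.
centered = X_scaled - X_scaled.mean(axis=0)
lost = np.sum((X_scaled - X_restored) ** 2) / np.sum(centered ** 2)

# Fraction of variance kept by the 3 selected components.
kept = pca_3.explained_variance_ratio_.sum()

print(X_scaled.shape, "->", X_reduced.shape)
print(lost, 1 - kept)
```

The two printed numbers agree: the share of information lost is exactly the share of variance carried by the dropped component.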
Now that we have successfully reduced our dataset, we can perform data modelling on it using the ML algorithm of our choice. With fewer variables, the model will usually train faster, and it will often perform as well as, or better than, it would have on the original data, although this is worth checking for your own dataset.
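As a closing illustration, here is a hedged sketch of modelling on the reduced data. The choice of `KNeighborsClassifier` with 5 neighbours, the 70/30 split, the `random_state`, and the use of `make_pipeline` are all my assumptions for the example rather than a prescribed recipe.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_all, y_all = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42, stratify=y_all
)

# Baseline: scale, then classify on all 4 features.
full_model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))

# Reduced: scale, project onto 3 PCs, then classify.
reduced_model = make_pipeline(
    MinMaxScaler(), PCA(n_components=3), KNeighborsClassifier(n_neighbors=5)
)

acc_full = full_model.fit(X_train, y_train).score(X_test, y_test)
acc_reduced = reduced_model.fit(X_train, y_train).score(X_test, y_test)

print("accuracy with 4 features:", acc_full)
print("accuracy with 3 PCs:     ", acc_reduced)
```

Putting the scaler and PCA inside the pipeline means they are fitted on the training split only, so the test split stays unseen until scoring. Comparing the two accuracies is the honest way to find out whether the reduction helped on a given dataset.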