An overview of Principal Component Analysis
Principal component analysis (PCA) is an unsupervised machine learning algorithm that calculates the principal components of a given dataset. But what are principal components and why do they matter? What can be achieved with PCA and why should we learn it? Let’s explore the math and the meaning behind this algorithm.
The basic linear algebra
In mathematical language, PCA is a linear transformation from a set of variables A to another set T where the new variables of T are orthogonal and therefore uncorrelated. To go from A to T, we need some linear algebra.
Let’s start by assuming that the set of variables A is an n × m matrix, where n is the number of rows (which, in terms of datasets, are the measurements) and m is the number of columns (the variables, or features). For the sake of demonstration, matrix A will be a 4 × 3 matrix, representing a dataset with 4 measurements and 3 features:
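As a purely illustrative example, with hypothetical values chosen only so that the columns have clearly different scales, A could look like:

$$A = \begin{pmatrix} 1.0 & 420 & 0.02 \\ 2.5 & 380 & 0.05 \\ 3.1 & 150 & 0.01 \\ 4.2 & 510 & 0.08 \end{pmatrix}$$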
We can see that matrix A has different scales in its columns. The next step is therefore to standardize A, forming a new matrix B whose columns have zero mean and unit variance. We standardize by subtracting from each element the mean of its column and dividing by the standard deviation of that column.
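In symbols, with $\mu_j$ and $\sigma_j$ denoting the mean and standard deviation of column $j$:

$$B_{ij} = \frac{A_{ij} - \mu_j}{\sigma_j}$$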
Transforming the data to a common scale is very important for PCA, since linear transformations are sensitive to scale: if a variable deviates much more from its mean than the others, it will dominate the results (a discussion on whether we should standardize or normalize data to obtain better computational results can be found here).
Now we calculate the covariance matrix C of B.
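For standardized data, this is the usual sample covariance, where n is the number of rows of B:

$$C = \frac{1}{n-1}\, B^{\top} B$$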
Covariance is a measure of how a pair of variables change together. If both increase or decrease together, their covariance is positive and we say they are correlated. If one decreases while the other increases, their covariance is negative and they are inversely correlated. The diagonal elements of C are the variances (the covariance of a variable with itself):
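For our three standardized variables $x_1$, $x_2$ and $x_3$, C has the general structure:

$$C = \begin{pmatrix} \mathrm{var}(x_1) & \mathrm{cov}(x_1, x_2) & \mathrm{cov}(x_1, x_3) \\ \mathrm{cov}(x_2, x_1) & \mathrm{var}(x_2) & \mathrm{cov}(x_2, x_3) \\ \mathrm{cov}(x_3, x_1) & \mathrm{cov}(x_3, x_2) & \mathrm{var}(x_3) \end{pmatrix}$$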
We must now diagonalize matrix C, which corresponds to finding its eigenvalues and eigenvectors; these give us the linear transformation we seek. Since C is the covariance matrix, its eigenvectors are the directions in which there is the most variance in the data, and its eigenvalues are the amount of variance along each of those directions. The eigenvalue-eigenvector equation is given by:
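$$C\,v = \lambda\,v,$$

where v is an eigenvector of C and λ is its eigenvalue.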
Eigenvectors associated with distinct eigenvalues are orthogonal for symmetric matrices (which is the case for matrix C). Solving the eigenvalue-eigenvector equation, we obtain a matrix V whose columns are the eigenvectors v1, v2 and v3, each with its corresponding eigenvalue λ1, λ2 and λ3.
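In matrix form, this diagonalization can be written as:

$$C = V\,\Lambda\,V^{\top}, \qquad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3).$$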
The principal components [1] are the linear combinations of the standardized variables whose coefficients are the eigenvectors (although some authors refer to the eigenvectors themselves as the principal components). To each eigenvector (column of the eigenvector matrix V) there corresponds an eigenvalue. If we sort the eigenvalues in decreasing order (in this case, λ2 > λ3 > λ1) and then sort the eigenvectors according to their respective eigenvalues (v2, v3, v1), we can write an ordered eigenvector matrix W.
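With this ordering, W is simply V with its columns rearranged:

$$W = \begin{pmatrix} v_2 & v_3 & v_1 \end{pmatrix},$$

where each $v_i$ denotes a column eigenvector.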
Finally, the transformed matrix T we were looking for is given by the projection of matrix B (the standardized version of matrix A) onto matrix W:
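$$T = B\,W,$$

so T has one row per measurement and one column per principal component.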
Since each eigenvalue is proportional to the explained variance, which is given by
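$$\mathrm{EV}_i = \frac{\lambda_i}{\sum_{j=1}^{m} \lambda_j}$$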
(so each eigenvalue directly measures the fraction of the total variance explained by its component), we can lower the dimension of matrix T by choosing only the first i eigenvalues and their respective eigenvectors, writing the matrix W with just these chosen eigenvectors and projecting B onto it. The choice of the number of eigenvalues depends on how much explained variance we are looking for (90%, 80%, and so on). Usually, more than 80% of explained variance is already good, but this is a factor to analyze depending on the dataset and your goals.
By choosing only the first i eigenvectors, we still keep most of the information in the original dataset. This is the beauty of PCA! Besides having a reduced dimension, T also has variables that are orthogonal to each other (remember this is one thing we were seeking at the beginning: uncorrelated variables). We have therefore reduced the dimensionality of the matrix (dataset) A and generated uncorrelated variables. Thus, when dealing with datasets that contain many features, we can use PCA to obtain a smaller dataset that is easier to work with, retains most of the information of the original, and whose variables are uncorrelated.
Application on a dataset with Python
We will consider the Classification in Asteroseismology Kaggle dataset. In this dataset, a sample of red giant branch stars is analyzed and the objective is to predict whether a red giant is becoming a helium-burning star. There are only three feature columns (Dnu, numax and epsilon), and the target column POP is 1 for helium-burning stars and 0 for red giants, so we will be able to reduce this dataset to two PCs and see the clustering.
We first import the dataset and divide it into features (X) and target (y; remember we worked with a matrix of features in the previous sections, so if your dataset has a target column, it must be separated):
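A minimal sketch of this step with pandas (the CSV file name below is a placeholder for wherever you saved the Kaggle download; the column names are taken from the dataset description above):

```python
import pandas as pd

# Load the downloaded Kaggle CSV (placeholder file name -- adjust to your local path).
df = pd.read_csv("classification_in_asteroseismology.csv")

# Separate the feature matrix (X) from the target column (y).
X = df[["Dnu", "numax", "epsilon"]]
y = df["POP"]
```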
The next step is to standardize the features since, as we can see in Fig. 1, they have different scales, which can jeopardize the results of PCA. We use the function StandardScaler():
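Continuing from the snippet above, a minimal sketch of the scaling step with scikit-learn's StandardScaler, which performs the zero-mean, unit-variance standardization described in the linear algebra section:

```python
from sklearn.preprocessing import StandardScaler

# Rescale each feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```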
Now, we must apply PCA to the standardized dataset by calling PCA(). Let us look at the cumulative variance explained and the variance explained as a function of the number of PCs:
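One way to produce these curves, continuing from the snippets above (the plotting details are one possible choice, not necessarily the exact code behind Fig. 3):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA with all components to see how much variance each PC explains.
pca_full = PCA()
pca_full.fit(X_scaled)

explained = pca_full.explained_variance_ratio_   # variance explained by each PC
cumulative = np.cumsum(explained)                # cumulative variance explained

n_pcs = np.arange(1, len(explained) + 1)
plt.plot(n_pcs, cumulative, marker="o", label="Cumulative explained variance")
plt.plot(n_pcs, explained, marker="o", label="Explained variance per PC")
plt.xlabel("Number of principal components")
plt.ylabel("Explained variance ratio")
plt.legend()
plt.show()
```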
Fig. 3 shows the percentage of explained variance as a function of the number of PCs. The blue line is the cumulative explained variance, which reaches 100%, while the orange line is the explained variance per component, with the first component having the greatest value. There is an elbow in each curve that indicates the number of components we should pick; this is the elbow method. In this case, 2 is the number of PCs we will use. For the next step, we apply PCA again with the parameter n_components = 2, construct a new dataset with the PCs as columns, and add the target column:
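A possible implementation, continuing from the previous snippets (the column names PC1 and PC2 are just a naming choice):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Keep only the first two principal components.
pca = PCA(n_components=2)
pcs = pca.fit_transform(X_scaled)

# New dataset with the PCs as columns, plus the target column.
df_pca = pd.DataFrame(pcs, columns=["PC1", "PC2"])
df_pca["POP"] = y.values
```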
Plotting the PCs with the target column as hue to see the clustering:
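One way to draw this plot with seaborn, continuing from the previous snippets (an illustrative sketch, not necessarily the exact code behind Fig. 5):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of the two PCs, colored by the target class POP.
sns.scatterplot(data=df_pca, x="PC1", y="PC2", hue="POP")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```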
From Fig. 5, we see the red giants are predominantly distributed along a diagonal line, while the helium-burning stars sit in the upper left corner. Some helium-burning stars can be found close to the red giant branch, and I believe this is because they share some similarities (after all, helium burning is one phase of a red giant's evolution). What is most important here is that PCA was able to reduce the dimension of the dataset while keeping 100% of the explained variance, and we could see the clustering. For larger datasets, we may not be able to choose only 2 or 3 components, but we can still use PCA to reduce the dataset size.
Key steps and some limitations
Let’s now summarize the key steps to perform the PCA:
- Divide the dataset into features and target;
- Standardize the features;
- Apply PCA and use the elbow method to see how many components to choose;
- Choose the number of components and apply PCA to obtain the final reduced dataset.
There is one important limitation in applying PCA: the new variables are less interpretable than the original features, as we could see with the dataset generated previously. And if we need to keep more than three components, it will not be possible to visualize the PCs in a single graph and see the clustering.
Conclusions
PCA is a great technique to reduce the dimension of the dataset while keeping as much information as possible. If the dataset is not so big and the number of PCs is 2 or 3, we can even see clustering by plotting the PCs. It can also be used to reduce the dimension of images.
I hope this article was helpful! Please leave comments and suggestions :)
References
[1] Jolliffe, I. T. and Cadima, J. (2016). Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A 374: 20150202. https://doi.org/10.1098/rsta.2015.0202