Principal Component Analysis

PCA is an algorithm for finding the principal components of a dataset.
Principal components are the directions of greatest variance, that is, the directions along which the data is most spread out.
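
As a minimal sketch of that idea, we can fit scikit-learn's PCA on some toy two-dimensional data (the data here is made up for illustration): when the second feature closely tracks the first, the first principal component comes out close to the diagonal direction, and it captures most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# toy data: the second feature tracks the first, so most of the
# variance lies along the diagonal direction [1, 1]
data = np.column_stack([x, x + rng.normal(scale=0.3, size=200)])

pca = PCA(n_components=2).fit(data)
print(pca.components_[0])       # first principal component (a unit direction)
print(pca.explained_variance_)  # variance captured along each component
```

The first row of `components_` is the direction of most variance, and `explained_variance_` confirms it dominates the second component.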

Eigenvectors and Eigenvalues

The covariance matrix of a dataset can be decomposed into eigenvectors and eigenvalues.
Every eigenvector has a corresponding eigenvalue.
An eigenvector is a direction, while its eigenvalue is a number that tells how much variance the data has in that direction.
The eigenvector with the highest eigenvalue is, therefore, the principal component.
The number of eigenvector/eigenvalue pairs equals the number of dimensions of the dataset.
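
This can be checked directly with NumPy (a sketch on made-up two-dimensional data): decompose the covariance matrix, count one eigenpair per dimension, and pick the eigenvector with the largest eigenvalue as the principal component.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=300)
# toy 2-D data whose main spread is along the direction [1, 2]
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=300)])

cov = np.cov(data, rowvar=False)        # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # one eigenpair per dimension

# eigh returns eigenvalues in ascending order, so the last is the largest
principal = eigvecs[:, np.argmax(eigvals)]
print(eigvals)     # variance along each eigenvector
print(principal)   # direction of most variance: the principal component
```

With two dimensions we get exactly two eigenpairs, and the recovered principal direction is close to the [1, 2] axis the data was generated along.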

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3-D projection

from sklearn import decomposition  
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

# quick 2-D look at the first two raw features
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

# reduce the four iris features to three principal components
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

fig = plt.figure(1, figsize=(11, 8))
plt.clf()
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:  
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.Spectral)

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])

plt.show()  
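
To see how much of the original variance the three retained components preserve, we can inspect the fitted model's `explained_variance_ratio_` attribute. A self-contained sketch on the same iris data:

```python
from sklearn import datasets, decomposition

X = datasets.load_iris().data
pca = decomposition.PCA(n_components=3).fit(X)
# fraction of total variance captured by each principal component,
# in decreasing order
print(pca.explained_variance_ratio_)
```

The first component alone accounts for the large majority of the variance, which is why the 3-D plot above already separates the three species well.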

Davide Andreazzini