Principal Component Analysis

PCA is an algorithm for finding the principal components of a dataset.
Principal components are the directions of greatest variance, that is, the directions along which the data is most spread out.
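
As a minimal sketch of that idea, we can fit scikit-learn's PCA on some toy two-dimensional data (the data here is made up for illustration): when the second feature closely tracks the first, the first principal component comes out close to the diagonal direction, and it captures most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# toy data: the second feature tracks the first, so most of the
# variance lies along the diagonal direction [1, 1]
data = np.column_stack([x, x + rng.normal(scale=0.3, size=200)])

pca = PCA(n_components=2).fit(data)
print(pca.components_[0])       # first principal component (a unit direction)
print(pca.explained_variance_)  # variance captured along each component
```

The first row of `components_` is the direction of most variance, and `explained_variance_` confirms it dominates the second component.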

Eigenvectors and Eigenvalues

The covariance matrix of a dataset can be decomposed into eigenvectors and eigenvalues.
Every eigenvector has a corresponding eigenvalue.
An eigenvector is a direction, while its eigenvalue is a number that tells how much variance the data has in that direction.
The eigenvector with the highest eigenvalue is, therefore, the principal component.
The number of eigenvector/eigenvalue pairs equals the number of dimensions of the dataset.
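
This can be checked directly with NumPy (a sketch on made-up two-dimensional data): decompose the covariance matrix, count one eigenpair per dimension, and pick the eigenvector with the largest eigenvalue as the principal component.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=300)
# toy 2-D data whose main spread is along the direction [1, 2]
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=300)])

cov = np.cov(data, rowvar=False)        # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # one eigenpair per dimension

# eigh returns eigenvalues in ascending order, so the last is the largest
principal = eigvecs[:, np.argmax(eigvals)]
print(eigvals)     # variance along each eigenvector
print(principal)   # direction of most variance: the principal component
```

With two dimensions we get exactly two eigenpairs, and the recovered principal direction is close to the [1, 2] axis the data was generated along.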

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3-D projection

from sklearn import decomposition  
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

# quick 2-D look at the first two raw features
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

# reduce the four iris features to three principal components
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

fig = plt.figure(1, figsize=(11, 8))
plt.clf()
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:  
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.Spectral)

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])

plt.show()  
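
To see how much of the original variance the three retained components preserve, we can inspect the fitted model's `explained_variance_ratio_` attribute. A self-contained sketch on the same iris data:

```python
from sklearn import datasets, decomposition

X = datasets.load_iris().data
pca = decomposition.PCA(n_components=3).fit(X)
# fraction of total variance captured by each principal component,
# in decreasing order
print(pca.explained_variance_ratio_)
```

The first component alone accounts for the large majority of the variance, which is why the 3-D plot above already separates the three species well.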

Davide Andreazzini