
Principal Component Analysis (PCA) Explained

Learn how Principal Component Analysis (PCA) reduces the dimensionality of datasets while preserving important information. Understand the intuition, mathematics, and practical uses of PCA in machine learning and data science.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026



🧊 Principal Component Analysis (PCA)

PCA finds the most important directions in the data and compresses the data into fewer numbers while trying to keep the important information.

PCA is a dimensionality reduction algorithm that:

  • finds directions of maximum variance
  • projects data onto lower-dimensional space
  • minimizes projection error

Run PCA only on the inputs to learn a mapping:

$$x \rightarrow z$$

where:

  • $x \in \mathbb{R}^n$
  • $z \in \mathbb{R}^k$
  • $k \ll n$

Original training set:

$$(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$$

Becomes:

$$(z^{(1)}, y^{(1)}), \dots, (z^{(m)}, y^{(m)})$$

Now the learning algorithm trains on lower-dimensional data.

Important:

  • PCA maps data from $n$ dimensions to $k$ dimensions
  • PCA is an unsupervised algorithm
  • PCA does not use labels $y$
  • Do NOT fit PCA on:
    • cross-validation set
    • test set
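
A minimal sketch of the points above, assuming scikit-learn is available and using made-up feature matrices `X_train`, `X_cv`, and `X_test`: the $x \rightarrow z$ mapping is learned from the training inputs only and then reused on the other sets.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrices (rows = examples, columns = features)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 50))   # training inputs only; labels y are not used
X_cv = rng.normal(size=(20, 50))       # cross-validation inputs
X_test = rng.normal(size=(20, 50))     # test inputs

# Fit the x -> z mapping on the training set ONLY
pca = PCA(n_components=10)
Z_train = pca.fit_transform(X_train)   # learn U_reduce and project: n=50 -> k=10

# Reuse the SAME mapping for the cross-validation and test sets (no refitting)
Z_cv = pca.transform(X_cv)
Z_test = pca.transform(X_test)

print(Z_train.shape, Z_cv.shape, Z_test.shape)  # (100, 10) (20, 10) (20, 10)
```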

The most widely used algorithm for dimensionality reduction.

Example:

Given a messy training set with

  • 1000 features

PCA says:

“Maybe only 20 directions are really important.”

Advantages:

  • Speed Up Learning: lower-dimensional data makes training faster.
  • Compression: reduce storage and memory requirements.
  • Visualization: data becomes much easier to plot in 2 or 3 dimensions.

Bad Use of PCA

PCA is NOT a good method for preventing overfitting.

Some people think:

fewer dimensions = less overfitting

but this reasoning is flawed.

Reason:

  • PCA ignores labels $y$
  • PCA may throw away useful predictive information

Instead:

✅ Use regularization to reduce overfitting.

Do NOT automatically add PCA to every ML pipeline.

Bad habit:

Training Data
    ↓
PCA
    ↓
Logistic Regression
    ↓
Predictions

Before using PCA, first try:

Training Data
    ↓
Learning Algorithm
    ↓
Predictions

Use PCA only if:

  • training is too slow
  • memory usage is too large
  • dimensionality is extremely high

How to select $k$ in PCA?

A common way to choose the number of PCA components $k$ is by checking how much variance is retained.

PCA tries to minimize the projection error:

$$\frac{1}{m}\sum_{i=1}^{m} \left\|x^{(i)} - x^{(i)}_{approx}\right\|^2$$

where:

  • $x^{(i)}$ = original data point
  • $x^{(i)}_{approx}$ = projected/reconstructed point

The total variation in data is:

$$\frac{1}{m}\sum_{i=1}^{m} \left\|x^{(i)}\right\|^2$$

A standard rule is to choose the smallest $k$ such that:

$$\frac{\frac{1}{m}\sum_{i=1}^{m} \left\|x^{(i)} - x^{(i)}_{approx}\right\|^2}{\frac{1}{m}\sum_{i=1}^{m} \left\|x^{(i)}\right\|^2} \le 0.01$$

This means:

  • projection error $\le 1\%$
  • equivalently, 99% variance retained

People usually describe PCA quality as:

  • 99% variance retained
  • 95% variance retained
  • 90% variance retained
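
As a sketch of this rule (assuming scikit-learn and a synthetic matrix `X`), $k$ can be chosen by accumulating the explained variance ratio until the desired threshold is reached; scikit-learn also accepts a fractional `n_components` that performs this selection internally.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 examples, 30 features (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30)) @ rng.normal(size=(30, 30))

# Fit a full PCA and accumulate the fraction of variance each component retains
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.99) + 1)   # smallest k with >= 99% variance retained
print("smallest k retaining 99% variance:", k)

# Equivalent shortcut: a float n_components keeps enough components for that variance
pca_99 = PCA(n_components=0.99).fit(X)
print("components chosen by sklearn:", pca_99.n_components_)
```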

How to select Projection Direction?

A good projection line is one where:

  • When we project each point onto the line
  • The distance between the original point and its projection is small

⚠️ Projection error

The orthogonal distance from a point to the line.

  • This is the quantity PCA minimizes when selecting the projection direction.

The projection error is:

$$\| x^{(i)} - \hat{x}^{(i)} \|^2$$

So the goal of PCA is to find the direction that minimizes the total squared orthogonal distance:

$$\min_{u^{(1)}} \sum_{i=1}^{m} \| x^{(i)} - \text{projection of } x^{(i)} \text{ onto } u^{(1)} \|^2$$

where:

  • $\hat{x}^{(i)}$ is the projected version of $x^{(i)}$

PCA minimizes:

$$\sum_{i=1}^{m} \| x^{(i)} - \hat{x}^{(i)} \|^2$$

Important:

  • If PCA returns $u^{(1)}$ or $-u^{(1)}$, it does not matter.
  • Both define the same line.
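
A quick numpy check of this point, using synthetic 2D data: take the first direction from the SVD of the mean-centered data and verify that $u^{(1)}$ and $-u^{(1)}$ give the same total squared projection error.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated 2D data
Xc = X - X.mean(axis=0)                    # mean-normalize

# First right-singular vector of the centered data = first principal direction u1
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
u1 = Vt[0]

def projection_error(X, u):
    # Project each point onto the line spanned by u and sum squared residuals
    proj = np.outer(X @ u, u)
    return np.sum((X - proj) ** 2)

print(projection_error(Xc, u1))    # same value ...
print(projection_error(Xc, -u1))   # ... whether we use u1 or -u1
```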

General Case: nD → kD

Now suppose:

$$x^{(i)} \in \mathbb{R}^n$$

where $n$ is the original number of dimensions (for example, $n = 3$ for 3D data),

and we want to reduce to $k$ dimensions, e.g. $k = 2$ when projecting 3D onto 2D:

$$z^{(i)} \in \mathbb{R}^k \quad \text{where } k < n$$

Instead of finding one vector, we find $k$ vectors:

$$u^{(1)}, u^{(2)}, \dots, u^{(k)}$$

These vectors:

  • Define a k-dimensional surface
  • Span a k-dimensional linear subspace

We then project each point onto that subspace.

3D → 2D Example

If:

$$x^{(i)} \in \mathbb{R}^3$$

and we reduce to 2D:

  • We find two vectors: $u^{(1)}, u^{(2)} \in \mathbb{R}^3$
  • These define a plane.
  • Each point is projected onto that plane.

2D → 1D Example

Suppose we have:

$$x^{(i)} \in \mathbb{R}^2$$

and we want to reduce the data from 2 dimensions to 1 dimension.

That means:

  • We want to find a line
  • Onto which we project all data points

$$u^{(1)} \in \mathbb{R}^2$$


💡 PCA Algorithm

Suppose we have supervised learning data:

$$(x^{(i)}, y^{(i)})$$

where:

  • $x^{(i)}$ = input features
  • $y^{(i)}$ = labels

Step 1: Ignore Labels Temporarily

Extract only the input vectors:

$$x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}$$

Before applying PCA, it is standard to:

1. Perform mean normalization

For each feature:

$$x_j := x_j - \mu_j$$

This makes each feature have zero mean.

2. Perform feature scaling (recommended)

Especially when features have different ranges.

$$x_j := \frac{x_j - \mu_j}{s_j}$$

where:

  • $\mu_j$ = mean of feature $j$
  • $s_j$ = standard deviation or range of feature $j$

So that:

  • Each feature has zero mean
  • Features have comparable ranges

This prevents one feature from dominating purely due to scale.
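
A minimal numpy sketch of this preprocessing (the raw feature values below are made up for illustration):

```python
import numpy as np

# Hypothetical raw features with very different scales, e.g. [size_m2, rooms, price_k]
X = np.array([[104.0, 3.0, 399.0],
              [ 68.0, 2.0, 250.0],
              [143.0, 5.0, 612.0]])

mu = X.mean(axis=0)            # per-feature mean mu_j
s = X.std(axis=0)              # per-feature standard deviation s_j (a range would also work)

X_norm = (X - mu) / s          # zero mean, comparable scales

print(X_norm.mean(axis=0))     # ~0 for every feature
print(X_norm.std(axis=0))      # 1 for every feature
```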

Step 2: Compute Covariance Matrix

Covariance Matrix ($\Sigma$)

is a square matrix giving the covariance between each pair of elements of a given random vector.

If:

  • $m$ = number of examples
  • $x^{(i)} \in \mathbb{R}^n$

then covariance matrix is:

$$\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}(x^{(i)})^T$$

Vectorized implementation:

$$\Sigma = \frac{1}{m}X^TX$$
import numpy as np

# Assuming data is (observations, features)
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Set rowvar=False to treat columns as variables.
# Note: np.cov mean-centers the data and divides by (m - 1),
# while the formula above uses 1/m on already mean-normalized data.
cov_matrix = np.cov(data, rowvar=False)

print(cov_matrix)
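
For comparison, here is a sketch of the vectorized formula applied directly to mean-normalized data; it agrees with `np.cov` up to the $\frac{1}{m}$ versus $\frac{1}{m-1}$ normalization (reusing the same small array).

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
m = data.shape[0]

Xc = data - data.mean(axis=0)          # mean normalization
Sigma = (1.0 / m) * Xc.T @ Xc          # Sigma = (1/m) * X^T X on centered data

print(Sigma)
print(np.cov(data, rowvar=False) * (m - 1) / m)   # same matrix after rescaling
```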

Step 3: Apply Singular Value Decomposition (SVD)

Compute the eigenvectors of the covariance matrix ($\Sigma$):

$$[U, S, V] = \mathrm{SVD}(\Sigma)$$

where

  • $U$ is an $n \times n$ matrix: $U = \begin{bmatrix}u_1 & u_2 & \dots & u_n\end{bmatrix}$

To reduce to $k$ dimensions, select the first $k$ columns:

$$U_{reduce} = \begin{bmatrix}u_1 & u_2 & \dots & u_k\end{bmatrix}$$

  • $S$ = diagonal matrix of singular values:

$$S = \begin{bmatrix} S_{11} & 0 & 0 \\ 0 & S_{22} & 0 \\ 0 & 0 & \ddots \end{bmatrix}$$

  • $V$ = right singular vectors (not used in PCA)

Then variance retained can be computed efficiently as:

$$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}}$$

Choose the smallest $k$ such that:

$$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$$

for 99% variance retained.

Typical values:

  • 90%
  • 95%
  • 99%

Most commonly:

  • 95% to 99% variance retained.



import numpy as np

# Define your matrix (for PCA, this would be the covariance matrix Sigma)
A = np.array([[1, 2], [3, 4], [5, 6]])

# Perform SVD
U, S, Vt = np.linalg.svd(A)

print("U (Left Singular Vectors):\n", U)
print("\nS (Singular Values as 1D array):\n", S)
print("\nVt (Right Singular Vectors - Transposed):\n", Vt)


Step 4: Choose Top K Components

Take the first kkk columns:

$$U_{reduce} = [u_1 \ u_2 \ \dots \ u_k]$$

This reduces data from:

  • $n$ dimensions to
  • $k$ dimensions

Step 5: Project Data

Compute reduced representation:

$$z = U_{reduce}^T x$$

Which is equivalent to

$$z_j = (u^{(j)})^{T} x$$

where:

  • $x \in \mathbb{R}^n$ : represents the original input values in $n$ dimensions
  • $z \in \mathbb{R}^k$ : represents the coordinates of $x$ in the reduced $k$-dimensional space
  • $U_{reduce}^T = \begin{bmatrix}u_1^T \\ u_2^T \\ \vdots \\ u_k^T\end{bmatrix}$
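
A short numpy sketch of this projection on synthetic arrays: the mapping can be applied to one example at a time or, equivalently, to the whole mean-normalized data matrix at once.

```python
import numpy as np

# Assume Xc (m x n, mean-normalized) and U_reduce (n x k) as in the previous step
rng = np.random.default_rng(4)
Xc = rng.normal(size=(10, 5))
U_reduce = np.linalg.svd(Xc.T @ Xc / len(Xc))[0][:, :2]

z_single = U_reduce.T @ Xc[0]      # z = U_reduce^T x for one example, shape (k,)
Z = Xc @ U_reduce                  # all examples at once, shape (m, k)

print(z_single)
print(Z[0])                        # identical to z_single
```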

Reconstruct $x$ from a given $z$

We know

$$z = U_{reduce}^T x$$

so we can approximately recover $x$ as

$$x_{approx} = U_{reduce} \, z$$

where

  • $z$ is a $k \times 1$ vector
  • $U_{reduce}$ is an $n \times k$ matrix
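
A sketch of the round trip on synthetic data, projecting with $U_{reduce}$ and reconstructing with $x_{approx} = U_{reduce}\,z$; the reconstruction error shrinks as $k$ approaches $n$.

```python
import numpy as np

rng = np.random.default_rng(5)
Xc = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
Xc = Xc - Xc.mean(axis=0)

U, S, _ = np.linalg.svd(Xc.T @ Xc / len(Xc))
k = 2
U_reduce = U[:, :k]

Z = Xc @ U_reduce                  # project:     z = U_reduce^T x (row-wise)
X_approx = Z @ U_reduce.T          # reconstruct: x_approx = U_reduce z (row-wise)

error = np.mean(np.sum((Xc - X_approx) ** 2, axis=1))
total = np.mean(np.sum(Xc ** 2, axis=1))
print("fraction of variance lost:", error / total)   # small when k captures most variance
```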

PCA vs Linear Regression (Very Important)

PCA is NOT linear regression.

Linear Regression:

  • Predicts a special variable $y$
  • Minimizes vertical squared errors
  • Error is measured in the y-direction only

PCA:

  • Has no special target variable
  • All features $x_1, x_2, \dots, x_n$ are treated equally
  • Minimizes orthogonal (shortest) distance to a line/plane

Linear regression minimizes:

vertical distance

PCA minimizes:

orthogonal distance

These are completely different objectives.
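
A small illustration of the difference on synthetic 2D data: an ordinary least-squares fit (vertical error) and the first principal direction (orthogonal error) generally give different slopes for the same points.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)          # noisy linear relationship
X = np.column_stack([x, y]) - np.array([x.mean(), y.mean()])

# Linear regression slope: minimizes vertical squared errors in y only
slope_lr = np.polyfit(X[:, 0], X[:, 1], deg=1)[0]

# PCA slope: first principal direction minimizes orthogonal squared distances
_, _, Vt = np.linalg.svd(X, full_matrices=False)
u1 = Vt[0]
slope_pca = u1[1] / u1[0]

print("least-squares slope:", slope_lr)
print("PCA (first component) slope:", slope_pca)        # generally different values
```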


Final Summary

PCA:

  • Finds a lower-dimensional subspace
  • Projects data onto that subspace
  • Minimizes squared orthogonal projection error
  • Treats all features symmetrically
  • Is not a predictive model

Formally, PCA solves:

$$\min \sum_{i=1}^{m} \| x^{(i)} - \hat{x}^{(i)} \|^2$$

where $\hat{x}^{(i)}$ is the projection of $x^{(i)}$ onto a $k$-dimensional subspace.

