
K-Means Clustering

K-Means is a powerful unsupervised learning algorithm for clustering data into coherent subsets. It iteratively assigns points to the nearest centroid and updates centroids to minimize distortion, making it widely used in practice.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026



Given an *unlabeled* dataset, automatically group the data into coherent subsets, called *clusters*.

Cluster analysis is a powerful unsupervised learning technique for discovering groups in data.

K-Means is used to solve the clustering problem:

  • It is one of the most widely used clustering algorithms.
  • It is an iterative algorithm.

Input:

  • $K$ = number of clusters
  • Dataset $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, where $x^{(i)} \in \mathbb{R}^n$

Output: cluster assignments for each example and cluster centroids.

K-Means works as follows:

  • Takes $K$ and unlabeled data as input.

  • Repeats:

    • Assign points to the nearest centroid.
    • Move centroids to the mean of assigned points.
  • Minimizes the cost function (distortion):

    $$J = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$
  • Stops when assignments stabilize.

It is simple, fast, and widely used in practice.
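The loop above can be sketched directly in NumPy. A minimal example of one full iteration (assignment plus centroid move) on a tiny made-up dataset:

```python
# One K-Means iteration in NumPy; the dataset and centroids are illustrative.
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.8]])  # m = 4 examples
mu = np.array([[0.0, 0.0], [9.0, 9.0]])                          # K = 2 centroids

# Cluster assignment: index of the nearest centroid for each example.
dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (m, K) squared distances
c = dists.argmin(axis=1)

# Move centroid: each centroid becomes the mean of its assigned points.
mu = np.array([X[c == k].mean(axis=0) for k in range(len(mu))])

print(c)   # [0 0 1 1]
print(mu)  # [[1.25 1.9 ], [8.25 7.9 ]]
```

Repeating these two steps until `c` stops changing is the whole algorithm.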

When K-Means Works Well

K-Means works best when:

  • Clusters are roughly spherical.
  • Clusters have similar sizes.
  • Features are properly scaled.

Feature scaling is important:

$$x_j := \frac{x_j - \mu_j}{\sigma_j}$$

Without scaling, features with larger magnitude dominate distance calculations.
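As a quick sketch, z-score scaling per feature (column) in NumPy, using a toy matrix for illustration:

```python
# Z-score scaling sketch: x_j := (x_j - mu_j) / sigma_j, applied per column.
import numpy as np

X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```

After scaling, every feature has zero mean and unit variance, so no single feature dominates the distance calculation.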

When K-Means Struggles

K-Means performs poorly when:

  • Clusters are non-convex.
  • Clusters have very different densities.
  • Data contains significant outliers.
  • Clusters are not well separated.

K-Means Algorithm

Suppose we want to group data into $K = 2$ clusters.

Input 🍓🍡🍬🍭:

  • $K$ (number of clusters) = 2 in this case
  • Dataset $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, where
    • $x^{(i)} \in \mathbb{R}^n$
    • $m$ = number of examples

Step 1: 🥣 Initialize Cluster Centroids ($\mu$)

Randomly initialize $K$ cluster centroids

  • $\mu_1, \mu_2, \mu_3, \dots, \mu_K$, where $\mu_k \in \mathbb{R}^n$

  • $K$ = total number of clusters

  • $k \in \{1, 2, \dots, K\}$

  • $\mu_k$ = centroid of cluster $k$

  • $\mu_k$ is a point in the same space as the data (i.e., $\mathbb{R}^n$)

Choosing the Number of Clusters (K)

There is no universal rule for choosing $K$.

Proper Random Initialization

When running K-Means:

$$K < m$$

Where:

  • $K$ = number of clusters
  • $m$ = number of training examples

It does not make sense to choose $K \ge m$.

Recommended Initialization Method

  1. Randomly select $K$ distinct training examples.
  2. Set initial centroids equal to those examples:

$$\mu_1 = x^{(i_1)}, \quad \mu_2 = x^{(i_2)}, \quad \dots, \quad \mu_K = x^{(i_K)}$$

where $i_1, \dots, i_K$ are randomly chosen distinct indices.

This ensures centroids start within the data distribution.
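A minimal sketch of this initialization in NumPy (the helper name `init_centroids` is illustrative, not a library function):

```python
# Initialization sketch: pick K distinct training examples as starting centroids.
import numpy as np

def init_centroids(X, K, seed=None):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=K, replace=False)  # K distinct indices
    return X[idx].copy()

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
mu = init_centroids(X, K=2, seed=0)
print(mu)  # two distinct rows of X, chosen at random
```

Using `replace=False` guarantees the chosen indices are distinct, so no two centroids start at the same point.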

Elbow Method:

  1. Run K-Means for different values of $K$.
  2. Compute distortion $J$ for each.
  3. Plot $J$ vs $K$.
  4. Look for an “elbow” where improvement slows down.
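A rough elbow-method sketch, assuming scikit-learn's `KMeans` is available; the blob data is synthetic and only for illustration:

```python
# Elbow-method sketch; assumes scikit-learn is installed (sklearn.cluster.KMeans).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs, so the elbow should sit near K = 3.
X = np.vstack([rng.normal(center, 0.3, size=(30, 2))
               for center in ([0, 0], [5, 5], [0, 5])])

distortions = []
for K in range(1, 7):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    distortions.append(km.inertia_ / len(X))  # J = average squared distance

# J drops sharply up to K = 3, then flattens: the "elbow".
print([round(j, 3) for j in distortions])
```

Plotting `distortions` against `K` (e.g., with matplotlib) makes the elbow visible; numerically, the drop from $K=1$ to $K=3$ is large and the improvement beyond $K=3$ is small.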

Repeat Steps 2-3 until convergence:

Step 2. 📏 Cluster Assignment Step

For each training example $x^{(i)}$, where $i = 1, \dots, m$:

Assign $x^{(i)}$ to the cluster with the closest centroid.

In mathematical terms:

$c^{(i)}$ = index of the cluster whose centroid is closest to $x^{(i)}$, with $c^{(i)} \in \{1, \dots, K\}$

Examples:

  • If $x^{(i)}$ is closest to $\mu_1$, then $c^{(i)} = 1$
  • If $x^{(i)}$ is closest to $\mu_2$, then $c^{(i)} = 2$

$\mu_{c^{(i)}}$ = centroid of the cluster assigned to $x^{(i)}$:

  • If $c^{(i)} = 1$, then $\mu_{c^{(i)}} = \mu_1$
  • If $c^{(i)} = 2$, then $\mu_{c^{(i)}} = \mu_2$

Which can be expressed as:

$$c^{(i)} := \arg\min_{k \in \{1, \dots, K\}} \left\| x^{(i)} - \mu_k \right\|^2$$

Where:

  • $\mu_k$ = centroid of cluster $k$
  • $c^{(i)}$ = cluster assigned to example $i$

We typically use squared Euclidean distance.

Example

Given

Centroids:

$$\mu_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad \mu_2 = \begin{bmatrix} -3 \\ 0 \end{bmatrix}, \quad \mu_3 = \begin{bmatrix} 4 \\ 2 \end{bmatrix}$$

Training example:

$$x^{(i)} = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$$

Distance to $\mu_1$:

$$\| x^{(i)} - \mu_1 \|^2 = (-1 - 1)^2 + (2 - 2)^2 = (-2)^2 + 0^2 = 4$$

Distance to $\mu_2$:

$$\| x^{(i)} - \mu_2 \|^2 = (-1 + 3)^2 + (2 - 0)^2 = 2^2 + 2^2 = 8$$

Distance to $\mu_3$:

$$\| x^{(i)} - \mu_3 \|^2 = (-1 - 4)^2 + (2 - 2)^2 = (-5)^2 + 0^2 = 25$$

Assigning Cluster

  • Distance to $\mu_1$: 4
  • Distance to $\mu_2$: 8
  • Distance to $\mu_3$: 25

The smallest distance is to $\mu_1$, so $c^{(i)} = 1$.
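The same worked example, checked numerically in NumPy:

```python
# The worked example above: squared distances from x^(i) to each centroid.
import numpy as np

mu = np.array([[1, 2], [-3, 0], [4, 2]])  # mu_1, mu_2, mu_3
x = np.array([-1, 2])                      # training example x^(i)

sq_dists = ((x - mu) ** 2).sum(axis=1)
print(sq_dists)               # [ 4  8 25]
print(sq_dists.argmin() + 1)  # c^(i) = 1 (1-indexed, as in the text)
```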


Step 3. 🧿 Move Centroid Step

For each cluster $k$, where $k = 1, \dots, K$:

  • $K$ = total number of clusters
  • $k \in \{1, 2, \dots, K\}$

Update centroid $\mu_k$ to be the mean of all points assigned to it:

$$\mu_k = \frac{1}{|C_k|} \sum_{i \,:\, c^{(i)} = k} x^{(i)}$$

Where:

  • $C_k$ = set of points assigned to cluster $k$
  • $|C_k|$ = number of points in cluster $k$

This moves the centroid to the center of its assigned points.

  • Each centroid becomes the mean of the points assigned to it.

⛔ Stop Condition

The algorithm converges when:

  • Cluster assignments $c^{(i)}$ no longer change, or
  • Centroids $\mu_k$ stop moving.

What If a Cluster Gets No Points?

If some centroid has zero assigned points:

  1. Eliminate the cluster (resulting in $K - 1$ clusters), or
  2. Reinitialize the centroid randomly

In practice, empty clusters are uncommon with reasonable initialization.
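A centroid-update sketch that also handles the empty-cluster case by re-seeding from a random training example (the helper name and data are illustrative):

```python
# Centroid-update step with an empty-cluster fallback (reinitialize from data).
import numpy as np

def update_centroids(X, c, K, seed=None):
    rng = np.random.default_rng(seed)
    mu = np.empty((K, X.shape[1]))
    for k in range(K):
        members = X[c == k]
        if len(members) == 0:
            # Empty cluster: re-seed the centroid from a random training example.
            mu[k] = X[rng.integers(len(X))]
        else:
            mu[k] = members.mean(axis=0)
    return mu

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
c = np.array([0, 0, 0])  # all points assigned to cluster 0; cluster 1 is empty
mu = update_centroids(X, c, K=2, seed=0)
print(mu[0])  # mean of all three points: ~[0.667, 0.667]
```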


💰 Cost Function (Distortion)

Minimize the distance between points and their assigned centroids:

$$J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

Or equivalently:

$$J(\mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

Where:

  • $J$ = cost function (distortion)
  • $m$ = number of training examples
  • $x^{(i)}$ = $i$-th training example
  • $\mu_{c^{(i)}}$ = centroid of the cluster assigned to $x^{(i)}$
  • $c^{(i)}$ = index of the cluster assigned to $x^{(i)}$
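The distortion can be computed in one line of NumPy (toy values for illustration):

```python
# Distortion sketch: J = (1/m) * sum of ||x^(i) - mu_{c^(i)}||^2.
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
mu = np.array([[1.0, 0.0], [10.0, 0.0]])
c = np.array([0, 0, 1])  # cluster index per example

J = ((X - mu[c]) ** 2).sum(axis=1).mean()
print(J)  # (1 + 1 + 0) / 3 ≈ 0.667
```

Indexing `mu[c]` gathers the assigned centroid for every example at once, so no explicit loop is needed.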

Each iteration consists of:

  1. Assignment step
    Minimizes $J$ with respect to $c^{(i)}$ (the cluster assignments).

  2. Centroid update step
    Minimizes $J$ with respect to $\mu_k$ (the centroid locations).

Since neither step can increase $J$, each iteration of K-Means guarantees that $J$ does not increase, and the algorithm is guaranteed to converge.

However, it may converge to a local minimum, not necessarily the global minimum.

A good solution:

  • Each natural cluster is captured by one centroid.

A bad solution:

  • Merges two true clusters into one
  • Splits one true cluster into multiple parts
  • Assigns very few points to some clusters

All of these correspond to higher distortion:

$$J_{\text{bad}} > J_{\text{good}}$$

To reduce the risk of poor local minima, run K-Means multiple times with different random initializations:

Algorithm

Repeat for $t = 1, \dots, T$:

  1. Randomly initialize centroids
  2. Run K-Means to convergence
  3. Compute distortion:

$$J^{(t)} = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

Finally:

Choose the clustering with the smallest $J^{(t)}$.
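Putting the whole pipeline together, a multi-restart K-Means sketch in plain NumPy (illustrative helper names, not production code):

```python
# Multi-restart K-Means sketch in plain NumPy (names are made up for the example).
import numpy as np

def kmeans(X, K, rng):
    # Step 1: initialize centroids as K distinct training examples.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    while True:
        # Step 2: assign each example to its nearest centroid.
        c = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_mu = np.array([X[c == k].mean(axis=0) if (c == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # stop when centroids stop moving
            break
        mu = new_mu
    J = ((X - mu[c]) ** 2).sum(axis=1).mean()  # distortion
    return c, mu, J

def kmeans_best_of(X, K, T=50, seed=0):
    rng = np.random.default_rng(seed)
    runs = [kmeans(X, K, rng) for _ in range(T)]
    return min(runs, key=lambda run: run[2])  # clustering with smallest J^(t)

X = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0], [5.5, 5.0]])
c, mu, J = kmeans_best_of(X, K=2)
print(sorted(mu.tolist()))  # [[0.25, 0.0], [5.25, 5.0]]
```

Each restart may land in a different local minimum; keeping the run with the smallest $J^{(t)}$ is exactly the procedure described above.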