
K-Means Clustering

K-Means is a powerful unsupervised learning algorithm for clustering data into coherent subsets. It iteratively assigns points to the nearest centroid and updates centroids to minimize distortion, making it widely used in practice.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026



Given an *unlabeled* dataset, automatically group the data into coherent subsets, called *clusters*.

Cluster analysis is a powerful unsupervised learning technique for discovering groups in data.

K-Means is used to solve the clustering problem:

  • It is one of the most widely used clustering algorithms.
  • It is an iterative algorithm.

Input:

  • $K$ = number of clusters
  • Dataset $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, where $x^{(i)} \in \mathbb{R}^n$

Output: cluster assignments for each example and cluster centroids.

K-Means works as follows:

  • Takes $K$ and unlabeled data as input.

  • Repeats:

    • Assign points to the nearest centroid.
    • Move centroids to the mean of assigned points.
  • Minimizes the cost function (distortion):

    $$J = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$
  • Stops when assignments stabilize.

It is simple, fast, and widely used in practice.
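The loop above can be sketched directly in NumPy. A minimal example of one full iteration (assignment plus centroid move) on a tiny made-up dataset:

```python
# One K-Means iteration in NumPy; the dataset and centroids are illustrative.
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.8]])  # m = 4 examples
mu = np.array([[0.0, 0.0], [9.0, 9.0]])                          # K = 2 centroids

# Cluster assignment: index of the nearest centroid for each example.
dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (m, K) squared distances
c = dists.argmin(axis=1)

# Move centroid: each centroid becomes the mean of its assigned points.
mu = np.array([X[c == k].mean(axis=0) for k in range(len(mu))])

print(c)   # [0 0 1 1]
print(mu)  # [[1.25 1.9 ], [8.25 7.9 ]]
```

Repeating these two steps until `c` stops changing is the whole algorithm.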

When K-Means Works Well

K-Means works best when:

  • Clusters are roughly spherical.
  • Clusters have similar sizes.
  • Features are properly scaled.

Feature scaling is important:

$$x_j := \frac{x_j - \mu_j}{\sigma_j}$$

Without scaling, features with larger magnitude dominate distance calculations.
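As a quick sketch, z-score scaling per feature (column) in NumPy, using a toy matrix for illustration:

```python
# Z-score scaling sketch: x_j := (x_j - mu_j) / sigma_j, applied per column.
import numpy as np

X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```

After scaling, every feature has zero mean and unit variance, so no single feature dominates the distance calculation.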

When K-Means Struggles

K-Means performs poorly when:

  • Clusters are non-convex.
  • Clusters have very different densities.
  • Data contains significant outliers.
  • Clusters are not well separated.

K-Means Algorithm

Suppose we want to group data into $K = 2$ clusters.

Input 🍓🍡🍬🍭:

  • $K$ (number of clusters) = 2 in this case
  • Dataset $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, where
    • $x^{(i)} \in \mathbb{R}^n$
    • $m$ = number of examples

Step 1: 🥣 Initialize Cluster Centroids ($\mu$)

Randomly initialize $K$ cluster centroids

  • $\mu_1, \mu_2, \mu_3, \dots, \mu_K$, where $\mu_k \in \mathbb{R}^n$

  • $K$ = total number of clusters

  • $k \in \{1, 2, \dots, K\}$

  • $\mu_k$ = centroid of cluster $k$

  • $\mu_k$ is a point in the same space as the data (i.e., $\mathbb{R}^n$)

Choosing the Number of Clusters (K)

There is no universal rule for choosing $K$.

Proper Random Initialization

When running K-Means:

$$K < m$$

Where:

  • $K$ = number of clusters
  • $m$ = number of training examples

It does not make sense to choose $K \ge m$.

Recommended Initialization Method

  1. Randomly select $K$ distinct training examples.
  2. Set initial centroids equal to those examples:

$$\mu_1 = x^{(i_1)}, \quad \mu_2 = x^{(i_2)}, \quad \dots, \quad \mu_K = x^{(i_K)}$$

where $i_1, \dots, i_K$ are randomly chosen distinct indices.

This ensures centroids start within the data distribution.
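A minimal sketch of this initialization in NumPy (the helper name `init_centroids` is illustrative, not a library function):

```python
# Initialization sketch: pick K distinct training examples as starting centroids.
import numpy as np

def init_centroids(X, K, seed=None):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=K, replace=False)  # K distinct indices
    return X[idx].copy()

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
mu = init_centroids(X, K=2, seed=0)
print(mu)  # two distinct rows of X, chosen at random
```

Using `replace=False` guarantees the chosen indices are distinct, so no two centroids start at the same point.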

Elbow Method:

  1. Run K-Means for different values of $K$.
  2. Compute distortion $J$ for each.
  3. Plot $J$ vs $K$.
  4. Look for an “elbow” where improvement slows down.
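A rough elbow-method sketch, assuming scikit-learn's `KMeans` is available; the blob data is synthetic and only for illustration:

```python
# Elbow-method sketch; assumes scikit-learn is installed (sklearn.cluster.KMeans).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs, so the elbow should sit near K = 3.
X = np.vstack([rng.normal(center, 0.3, size=(30, 2))
               for center in ([0, 0], [5, 5], [0, 5])])

distortions = []
for K in range(1, 7):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    distortions.append(km.inertia_ / len(X))  # J = average squared distance

# J drops sharply up to K = 3, then flattens: the "elbow".
print([round(j, 3) for j in distortions])
```

Plotting `distortions` against `K` (e.g., with matplotlib) makes the elbow visible; numerically, the drop from $K=1$ to $K=3$ is large and the improvement beyond $K=3$ is small.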

Repeat Steps 2-3 until convergence:

Step 2. 📏 Cluster Assignment Step

For each training example $x^{(i)}$, where $i = 1, \dots, m$:

Assign $x^{(i)}$ to the cluster with the closest centroid.

In mathematical terms:

$c^{(i)}$ = index of the cluster whose centroid is closest to $x^{(i)}$, with $c^{(i)} \in \{1, \dots, K\}$

Examples:

  • If $x^{(i)}$ is closest to $\mu_1$, then $c^{(i)} = 1$
  • If $x^{(i)}$ is closest to $\mu_2$, then $c^{(i)} = 2$

$\mu_{c^{(i)}}$ = centroid of the cluster assigned to $x^{(i)}$:

  • If $c^{(i)} = 1$, then $\mu_{c^{(i)}} = \mu_1$
  • If $c^{(i)} = 2$, then $\mu_{c^{(i)}} = \mu_2$

Which can be expressed as:

$$c^{(i)} := \arg\min_{k \in \{1, \dots, K\}} \left\| x^{(i)} - \mu_k \right\|^2$$

Where:

  • $\mu_k$ = centroid of cluster $k$
  • $c^{(i)}$ = cluster assigned to example $i$

We typically use squared Euclidean distance.

Example

Given

Centroids:

$$\mu_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad \mu_2 = \begin{bmatrix} -3 \\ 0 \end{bmatrix}, \quad \mu_3 = \begin{bmatrix} 4 \\ 2 \end{bmatrix}$$

Training example:

$$x^{(i)} = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$$

Distance to $\mu_1$:

$$\| x^{(i)} - \mu_1 \|^2 = (-1 - 1)^2 + (2 - 2)^2 = (-2)^2 + 0^2 = 4$$

Distance to $\mu_2$:

$$\| x^{(i)} - \mu_2 \|^2 = (-1 + 3)^2 + (2 - 0)^2 = 2^2 + 2^2 = 8$$

Distance to $\mu_3$:

$$\| x^{(i)} - \mu_3 \|^2 = (-1 - 4)^2 + (2 - 2)^2 = (-5)^2 + 0^2 = 25$$

Assigning Cluster

  • Distance to $\mu_1$: 4
  • Distance to $\mu_2$: 8
  • Distance to $\mu_3$: 25

The smallest distance is to $\mu_1$, so $c^{(i)} = 1$.
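The same worked example, checked numerically in NumPy:

```python
# The worked example above: squared distances from x^(i) to each centroid.
import numpy as np

mu = np.array([[1, 2], [-3, 0], [4, 2]])  # mu_1, mu_2, mu_3
x = np.array([-1, 2])                      # training example x^(i)

sq_dists = ((x - mu) ** 2).sum(axis=1)
print(sq_dists)               # [ 4  8 25]
print(sq_dists.argmin() + 1)  # c^(i) = 1 (1-indexed, as in the text)
```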


Step 3. 🧿 Move Centroid Step

For each cluster $k$, where $k = 1, \dots, K$:

  • $K$ = total number of clusters
  • $k \in \{1, 2, \dots, K\}$

Update centroid $\mu_k$ to be the mean of all points assigned to it:

$$\mu_k = \frac{1}{|C_k|} \sum_{i \,:\, c^{(i)} = k} x^{(i)}$$

Where:

  • $C_k$ = set of points assigned to cluster $k$
  • $|C_k|$ = number of points in cluster $k$

This moves the centroid to the center of its assigned points.

  • Each centroid becomes the mean of the points assigned to it.

⛔ Stop Condition

The algorithm converges when:

  • Cluster assignments $c^{(i)}$ no longer change, or
  • Centroids $\mu_k$ stop moving.

What If a Cluster Gets No Points?

If some centroid has zero assigned points:

  1. Eliminate the cluster (resulting in $K - 1$ clusters), or
  2. Reinitialize the centroid randomly

In practice, empty clusters are uncommon with reasonable initialization.
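A centroid-update sketch that also handles the empty-cluster case by re-seeding from a random training example (the helper name and data are illustrative):

```python
# Centroid-update step with an empty-cluster fallback (reinitialize from data).
import numpy as np

def update_centroids(X, c, K, seed=None):
    rng = np.random.default_rng(seed)
    mu = np.empty((K, X.shape[1]))
    for k in range(K):
        members = X[c == k]
        if len(members) == 0:
            # Empty cluster: re-seed the centroid from a random training example.
            mu[k] = X[rng.integers(len(X))]
        else:
            mu[k] = members.mean(axis=0)
    return mu

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
c = np.array([0, 0, 0])  # all points assigned to cluster 0; cluster 1 is empty
mu = update_centroids(X, c, K=2, seed=0)
print(mu[0])  # mean of all three points: ~[0.667, 0.667]
```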


💰 Cost Function (Distortion)

Minimize the distance between points and their assigned centroids:

$$J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

Or equivalently:

$$J(\mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

Where:

  • $J$ = cost function (distortion)
  • $m$ = number of training examples
  • $x^{(i)}$ = $i$-th training example
  • $\mu_{c^{(i)}}$ = centroid of the cluster assigned to $x^{(i)}$
  • $c^{(i)}$ = index of the cluster assigned to $x^{(i)}$
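The distortion can be computed in one line of NumPy (toy values for illustration):

```python
# Distortion sketch: J = (1/m) * sum of ||x^(i) - mu_{c^(i)}||^2.
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
mu = np.array([[1.0, 0.0], [10.0, 0.0]])
c = np.array([0, 0, 1])  # cluster index per example

J = ((X - mu[c]) ** 2).sum(axis=1).mean()
print(J)  # (1 + 1 + 0) / 3 ≈ 0.667
```

Indexing `mu[c]` gathers the assigned centroid for every example at once, so no explicit loop is needed.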

Each iteration consists of:

  1. Assignment step
    Minimizes $J$ with respect to $c^{(i)}$ (the cluster assignments).

  2. Centroid update step
    Minimizes $J$ with respect to $\mu_k$ (the centroid locations).

Since neither step can increase $J$, each iteration of K-Means guarantees that $J$ does not increase, and the algorithm is guaranteed to converge.

However, it may converge to a local minimum, not necessarily the global minimum.

A good solution:

  • Each natural cluster is captured by one centroid.

A bad solution:

  • Merges two true clusters into one
  • Splits one true cluster into multiple parts
  • Assigns very few points to some clusters

All of these correspond to higher distortion:

$$J_{\text{bad}} > J_{\text{good}}$$

To reduce the risk of poor local minima, run K-Means multiple times with different random initializations:

Algorithm

Repeat for $t = 1, \dots, T$:

  1. Randomly initialize centroids
  2. Run K-Means to convergence
  3. Compute distortion:

$$J^{(t)} = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

Finally:

Choose the clustering with the smallest $J^{(t)}$.
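Putting the whole pipeline together, a multi-restart K-Means sketch in plain NumPy (illustrative helper names, not production code):

```python
# Multi-restart K-Means sketch in plain NumPy (names are made up for the example).
import numpy as np

def kmeans(X, K, rng):
    # Step 1: initialize centroids as K distinct training examples.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    while True:
        # Step 2: assign each example to its nearest centroid.
        c = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_mu = np.array([X[c == k].mean(axis=0) if (c == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # stop when centroids stop moving
            break
        mu = new_mu
    J = ((X - mu[c]) ** 2).sum(axis=1).mean()  # distortion
    return c, mu, J

def kmeans_best_of(X, K, T=50, seed=0):
    rng = np.random.default_rng(seed)
    runs = [kmeans(X, K, rng) for _ in range(T)]
    return min(runs, key=lambda run: run[2])  # clustering with smallest J^(t)

X = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0], [5.5, 5.0]])
c, mu, J = kmeans_best_of(X, K=2)
print(sorted(mu.tolist()))  # [[0.25, 0.0], [5.25, 5.0]]
```

Each restart may land in a different local minimum; keeping the run with the smallest $J^{(t)}$ is exactly the procedure described above.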