


Logistic Regression for Classification: Concept, Sigmoid Function, Cost Function, and Implementation

Complete guide to logistic regression for binary classification, including the sigmoid function, hypothesis model, cost function, decision boundary, gradient descent, and practical machine learning implementation.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


📊 Logistic Regression for Classification

In classification problems, the output variable $y$ takes discrete values.

Classification Types

1. Binary Classification

Two classes:

$$y \in \{0, 1\}$$

We usually call:

  • $0$ → Negative class: represents the absence of something
  • $1$ → Positive class: represents the presence of something (e.g., disease)

2. Multi-class Classification

More than two classes:

$$y \in \{0, 1, 2, 3, \dots\}$$

The Sigmoid Function $\sigma(z)$

The sigmoid function (also called the logistic function) maps any real-valued number into the $(0, 1)$ interval.

  • It is commonly used in logistic regression to model probabilities.

The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1} = 1 - \frac{1}{1 + e^z} = \frac{1}{2}\left(1 + \tanh\left(\frac{z}{2}\right)\right) = 1 - \sigma(-z)$$

Where:

  • $z$ is the input to the function (can be any real number)
  • $e$ is the base of the natural logarithm (approximately 2.71828)

Output:

$\sigma(z)$ is always between 0 and 1, making it suitable for modeling probabilities.

  • When $z$ is large and positive, $\sigma(z) \approx 1$: as $z \to +\infty$, $\sigma(z) \to 1$
  • When $z$ is large and negative, $\sigma(z) \approx 0$: as $z \to -\infty$, $\sigma(z) \to 0$
  • When $z = 0$, $\sigma(z) = 0.5$
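These properties can be checked with a few lines of NumPy (a minimal sketch; the helper name `sigmoid` is ours, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued z into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```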

💡 Logistic Regression Hypothesis $h_\theta(x)$

Logistic regression ensures:

$$0 \le h_\theta(x) \le 1$$

Where

  • Input: any real number in $(-\infty, +\infty)$
  • Output: always in $(0, 1)$

Instead of: $h_\theta(x) = \theta^T x$

We apply a transformation that squashes outputs into the probability range $[0, 1]$:

$$h_\theta(x) = g(\theta^T x)$$

So the output becomes a probability: $h_\theta(x) = P(y = 1 \mid x)$

This can be simplified to:

$$h_\theta(x) = g(z)$$

Where

$$z = \theta^T x$$

and $g(z)$ is the sigmoid function:

$$g(z) = \frac{1}{1 + e^{-z}}$$

Final Hypothesis

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

This ensures:

$$0 \le h_\theta(x) \le 1, \qquad h_\theta(x) = P(y = 1 \mid x; \theta)$$

So:

  • If $h_\theta(x) = 0.7$, there is a 70% probability that $y = 1$

Since probabilities must sum to 1:

$$P(y = 0 \mid x; \theta) = 1 - P(y = 1 \mid x; \theta)$$
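As a quick numeric sketch (the parameter values and feature vector here are invented for illustration; $x_0 = 1$ is the usual bias feature):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one feature vector; x_0 = 1 is the bias term
theta = np.array([-1.0, 0.5, 0.25])
x = np.array([1.0, 2.0, 4.0])

p_y1 = sigmoid(theta @ x)  # h_theta(x) = P(y = 1 | x; theta)
p_y0 = 1.0 - p_y1          # P(y = 0 | x; theta)
print(round(p_y1, 3), round(p_y0, 3))  # 0.731 0.269 — the two sum to 1
```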

Decision Boundary

The decision boundary is the line that separates the region where $y = 0$ from the region where $y = 1$.

  • It is created by our hypothesis function / model.

Decision Boundary is a Property of the Model

The decision boundary depends only on:

  • The hypothesis form
  • The parameters $\theta$

It does not depend on the training data once $\theta$ is fixed.

The training set is used only to learn $\theta$.

Classification Rule

In order to get a discrete 0 or 1 classification, we translate the output of the hypothesis function as follows:

  • ➕ If $h_\theta(x) \ge 0.5$, we predict $y = 1$
  • ➖ If $h_\theta(x) < 0.5$, we predict $y = 0$

When Is $h_\theta(x) \ge 0.5$?

Since:

$$g(z) \ge 0.5 \quad \text{when} \quad z \ge 0$$

and

$$h_\theta(x) = g(\theta^T x),$$

we predict:

$$y = 1 \quad \text{when} \quad \theta^T x \ge 0$$

and

$$y = 0 \quad \text{when} \quad \theta^T x < 0$$

Linear Decision Boundary

Suppose:

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$$

Let:

$$\theta_0 = -3, \quad \theta_1 = 1, \quad \theta_2 = 1$$

Then:

$$\theta^T x = -3 + x_1 + x_2$$

We predict $y = 1$ when:

$$-3 + x_1 + x_2 \ge 0$$

Rewriting:

$$x_1 + x_2 \ge 3$$

Decision Boundary

The decision boundary occurs when:

$$x_1 + x_2 = 3$$

This is a straight line.

It separates the plane into:

  • Region where $y = 1$
  • Region where $y = 0$

The decision boundary corresponds to:

$$h_\theta(x) = 0.5$$
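The worked example can be verified directly (a sketch; `predict` is our own helper, and the test points are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])  # theta_0, theta_1, theta_2 from the example

def predict(x1, x2):
    """Apply the classification rule: predict 1 iff h_theta(x) >= 0.5."""
    return 1 if sigmoid(theta @ np.array([1.0, x1, x2])) >= 0.5 else 0

print(predict(2, 2))  # x1 + x2 = 4 >= 3 -> 1
print(predict(1, 1))  # x1 + x2 = 2 <  3 -> 0
print(predict(1, 2))  # exactly on the boundary, h = 0.5 -> 1
```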

Nonlinear Decision Boundaries

We can add polynomial features.

Example:

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$$

Suppose:

$$\theta_0 = -1, \quad \theta_1 = 0, \quad \theta_2 = 0, \quad \theta_3 = 1, \quad \theta_4 = 1$$

Then:

$$\theta^T x = -1 + x_1^2 + x_2^2$$

We predict $y = 1$ when:

$$-1 + x_1^2 + x_2^2 \ge 0$$

Rewriting:

$$x_1^2 + x_2^2 \ge 1$$

Decision Boundary

The boundary is:

$$x_1^2 + x_2^2 = 1$$

This is a circle of radius 1.

So logistic regression can produce nonlinear boundaries using polynomial features.
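The circular boundary above can be reproduced with squared features (a sketch; the feature order `[1, x1, x2, x1^2, x2^2]` matches the example's hypothesis):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])  # parameters from the example

def predict(x1, x2):
    """Features: [1, x1, x2, x1^2, x2^2]; predict 1 iff h >= 0.5."""
    features = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return 1 if sigmoid(theta @ features) >= 0.5 else 0

print(predict(0.0, 0.0))  # inside the unit circle  -> 0
print(predict(2.0, 0.0))  # outside the unit circle -> 1
```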


More Complex Boundaries

By adding higher-order terms such as:

  • $x_1^3$
  • $x_1 x_2$
  • $x_1^2 x_2$
  • etc.

Logistic regression can represent:

  • Ellipses
  • Complex curves
  • Highly nonlinear shapes

💰 Cost Function / Optimal Objective

The overall cost is:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}\big(h_\theta(x^{(i)}), y^{(i)}\big)$$

where:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Why Not Use Squared Error Cost?

In linear regression, we use:

$$J(\theta) = \frac{1}{2m}\sum (h_\theta(x) - y)^2$$

If we use the same squared error with the sigmoid:

  • The cost function becomes non-convex
  • Optimization may get stuck in local minima
  • Training may fail to find the best parameters

So we need a better cost function.

We define cost separately for the two classes.

The cost function is defined as:

$$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

Why This Cost Function Is Better

It is convex

  • No local minima
  • A single global minimum
  • Optimization is reliable
  • Smooth and well-behaved

Because $J(\theta)$ is convex:

  • Gradient descent will converge to the global minimum
  • We do not get stuck in bad local optima
  • Training is stable

It penalizes wrong predictions heavily

  • Cost = 0 when prediction is correct
  • Cost → ∞ when prediction is very wrong
  • Encourages the model to be confident and correct

Case 1: When $y = 1$

We want $h_\theta(x)$ to be close to 1.

Cost:

$$-\log(h_\theta(x))$$

If prediction is close to 1 → cost is small

  • If $h_\theta(x) = 1$ → cost = 0

If prediction is close to 0 → cost is very large

  • If $h_\theta(x) \to 0$ → cost $\to \infty$

So:

  • Correct confident prediction → small cost
  • Wrong confident prediction → very large cost

Case 2: When $y = 0$

We want $h_\theta(x)$ to be close to 0.

Cost:

$$-\log(1 - h_\theta(x))$$

If prediction is close to 1 → cost is very large

  • If $h_\theta(x) \to 1$ → cost $\to \infty$

If prediction is close to 0 → cost is small

  • If $h_\theta(x) = 0$ → cost = 0

Again:

  • Correct prediction → small cost
  • Wrong confident prediction → large penalty

Simplified Cost Function (Single Formula)

We can combine the two cases into one equation:

$$\text{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$$

  • If $y = 1$:
    • The second term becomes 0
    • Cost reduces to $-\log(h_\theta(x))$
  • If $y = 0$:
    • The first term becomes 0
    • Cost reduces to $-\log(1 - h_\theta(x))$

So this single formula covers both cases.
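The behaviour of this unified formula can be sanity-checked numerically (a minimal sketch; `cost` is our own helper, and the probability values are arbitrary):

```python
import numpy as np

def cost(h, y):
    """Unified per-example logistic cost; h is the predicted probability."""
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# Confident and correct -> small cost
print(cost(0.99, 1))  # ~0.01
print(cost(0.01, 0))  # ~0.01
# Confident and wrong -> large cost
print(cost(0.01, 1))  # ~4.61
```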


Full Cost Function Over Dataset

For $m$ training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$

This is called:

  • Log loss
  • Cross-entropy loss
  • Logistic loss

Where:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Vectorized Cost Function

Let:

  • $X$ = design matrix
  • $y$ = vector of labels
  • $h = g(X\theta)$

Then:

$$J(\theta) = \frac{1}{m} \left( -y^T \log(h) - (1 - y)^T \log(1 - h) \right)$$
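The vectorized formula translates almost symbol-for-symbol into NumPy (a sketch; the toy dataset is invented for illustration, with the bias feature $x_0 = 1$ as the first column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y):
    """Vectorized log loss: J = (1/m) * (-y^T log(h) - (1-y)^T log(1-h))."""
    m = len(y)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m

# Toy dataset: first column of X is the bias feature x_0 = 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# At theta = 0 every h is 0.5, so J = -log(0.5) = log(2) ≈ 0.693
print(compute_cost(np.zeros(2), X, y))
```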

🧠 Key Takeaways: Cost Function

  • Logistic regression uses a convex cost function
  • The simplified cost formula works for both $y = 0$ and $y = 1$
  • The gradient descent update looks the same as in linear regression
  • Vectorization makes the implementation efficient
  • Always include the $\frac{1}{m}$ factor in the gradient update

🎢 Gradient Descent

General gradient descent:

Repeat:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Where:

  • $\alpha$ = learning rate
  • $J(\theta)$ = cost function

Logistic Regression Gradient

After computing the derivative, we get:

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Important Notes

  • This is identical in form to linear regression gradient descent.
  • We must update all $\theta_j$ simultaneously.
  • The difference lies in the hypothesis function:

$$h_\theta(x) = g(\theta^T x)$$

where

$$g(z) = \frac{1}{1 + e^{-z}}$$

Vectorized Gradient Descent

Let:

  • $X$ = design matrix
  • $\vec{y}$ = vector of labels
  • $h = g(X\theta)$

Then the update rule becomes:

$$\theta := \theta - \frac{\alpha}{m} X^T \left( g(X\theta) - \vec{y} \right)$$

Where:

  • $\vec{y}$ is the vector of labels
  • $X^T$ is the transpose of the design matrix
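Putting the pieces together, the vectorized update can be run in a short loop (a sketch under assumed hyperparameters; the toy data is invented and linearly separable, with the label 1 when the single feature is at least 2):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent: theta := theta - (alpha/m) * X^T (g(X theta) - y).

    All components of theta are updated simultaneously by the vector operation.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

# Toy data: bias column plus one feature; label is 1 when the feature >= 2
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
print(preds)  # recovers the training labels: [0 0 1 1]
```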

🧠 Key Takeaways: Gradient Descent

  • Logistic regression uses gradient descent just like linear regression.
  • The update formula is structurally the same.
  • The cost function is different.
  • The model is convex, so gradient descent converges to the global minimum.
  • Vectorized form makes implementation efficient and clean.
