Logistic Regression for Classification: Concept, Sigmoid Function, Cost Function, and Implementation
Complete guide to logistic regression for binary classification, including the sigmoid function, hypothesis model, cost function, decision boundary, gradient descent, and practical machine learning implementation.
📊 Logistic Regression for Classification
In classification problems, the output variable $y$ takes discrete values rather than continuous ones.
Classification Types
1. Binary Classification
Two classes: $y \in \{0, 1\}$.
We usually call:
- → Negative class: $0$ represents the absence of something.
- → Positive class: $1$ represents the presence of something (e.g., a disease).
2. Multi-class Classification
More than two classes: $y \in \{0, 1, 2, \dots\}$.
The Sigmoid Function
The sigmoid function (also called the logistic function) maps any real-valued number into the $(0, 1)$ interval.
- It is commonly used in logistic regression to model probabilities.
The sigmoid function is defined as:

$$g(z) = \frac{1}{1 + e^{-z}}$$

Where:
- $z$ is the input to the function (can be any real number)
- $e$ is the base of the natural logarithm (approximately 2.71828)

Output:
$g(z)$ is always between 0 and 1, making it suitable for modeling probabilities.
- When $z$ is large and positive, $e^{-z} \to 0$, so $g(z) \to 1$.
- When $z$ is large and negative, $e^{-z} \to \infty$, so $g(z) \to 0$.
- When $z = 0$, $g(z) = 0.5$.
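As a quick sanity check, the sigmoid and its limiting behavior can be sketched in a few lines of NumPy (the function name `sigmoid` is my own, not from the text):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid g(z) = 1 / (1 + e^{-z}): maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # large positive z -> close to 1
print(sigmoid(-10))  # large negative z -> close to 0
```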
💡 Logistic Regression Hypothesis
Logistic regression ensures:

$$0 \le h_\theta(x) \le 1$$

Where:
- Input: any real number $\theta^T x$
- Output: always between 0 and 1

Instead of the linear regression hypothesis:

$$h_\theta(x) = \theta^T x$$

We apply a transformation that squashes outputs into the probability range $(0, 1)$.

So the output becomes a probability:

$$h_\theta(x) = P(y = 1 \mid x; \theta)$$

This can be written as:

$$h_\theta(x) = g(\theta^T x)$$

Where $z = \theta^T x$ and $g$ is the sigmoid function:

$$g(z) = \frac{1}{1 + e^{-z}}$$

Final Hypothesis

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

This ensures:

$$0 \le h_\theta(x) \le 1$$

So:
- If $h_\theta(x) = 0.7$ → there is a 70% probability that $y = 1$

Since probabilities must sum to 1:

$$P(y = 0 \mid x; \theta) = 1 - P(y = 1 \mid x; \theta)$$
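A minimal sketch of the hypothesis as a probability. The parameter and feature values here are illustrative assumptions, not from the text; the first feature is the intercept term $x_0 = 1$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    return sigmoid(theta @ x)

theta = np.array([-3.0, 1.0, 1.0])  # illustrative parameters
x = np.array([1.0, 2.0, 2.0])       # x0 = 1 is the bias term

p1 = h(theta, x)   # P(y = 1 | x; theta)
p0 = 1.0 - p1      # P(y = 0 | x; theta): probabilities sum to 1
print(p1, p0)
```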
Decision Boundary
The decision boundary is the line that separates the region where $y = 0$ from the region where $y = 1$.
- It is created by our hypothesis function / model.
Decision Boundary is a Property of the Model
The decision boundary depends only on:
- The hypothesis form $h_\theta(x) = g(\theta^T x)$
- The parameters $\theta$
It does not depend on the training data once $\theta$ is fixed.
The training set is used only to learn $\theta$.
Classification Rule
In order to get a discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
- ➕ We predict $y = 1$ when $h_\theta(x) \ge 0.5$
- ➖ We predict $y = 0$ when $h_\theta(x) < 0.5$
When Is $h_\theta(x) \ge 0.5$?
Since:

$$g(z) \ge 0.5 \quad \text{when} \quad z \ge 0$$

and $z = \theta^T x$,
we predict:
- $y = 1$ when $\theta^T x \ge 0$
- $y = 0$ when $\theta^T x < 0$
Linear Decision Boundary
Suppose:

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$$

Let:

$$\theta_0 = -3, \quad \theta_1 = 1, \quad \theta_2 = 1$$

Then:
We predict $y = 1$ when:

$$-3 + x_1 + x_2 \ge 0$$

Rewriting:

$$x_1 + x_2 \ge 3$$

Decision Boundary
The decision boundary occurs when:

$$x_1 + x_2 = 3$$

This is a straight line.
It separates the plane into:
- Region where $y = 1$ ($x_1 + x_2 \ge 3$)
- Region where $y = 0$ ($x_1 + x_2 < 3$)
The decision boundary corresponds to:

$$h_\theta(x) = 0.5$$
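A linear decision boundary can be checked numerically. This sketch assumes the illustrative parameters $\theta = [-3, 1, 1]$, which give the boundary $x_1 + x_2 = 3$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])  # assumed example values

def predict(x1, x2):
    """Predict y = 1 exactly when theta^T x = -3 + x1 + x2 >= 0."""
    z = theta @ np.array([1.0, x1, x2])
    return 1 if sigmoid(z) >= 0.5 else 0

print(predict(1, 1))  # 1 + 1 < 3  -> predicts y = 0
print(predict(3, 2))  # 3 + 2 >= 3 -> predicts y = 1
```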
Nonlinear Decision Boundaries
We can add polynomial features.
Example:
Suppose:

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$$

Then, with:

$$\theta_0 = -1, \quad \theta_1 = 0, \quad \theta_2 = 0, \quad \theta_3 = 1, \quad \theta_4 = 1$$

We predict $y = 1$ when:

$$-1 + x_1^2 + x_2^2 \ge 0$$

Rewriting:

$$x_1^2 + x_2^2 \ge 1$$

Decision Boundary
The boundary is:

$$x_1^2 + x_2^2 = 1$$

This is a circle of radius 1.
So logistic regression can produce nonlinear boundaries using polynomial features.
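The circular boundary can be sketched the same way, assuming the parameters $\theta = [-1, 0, 0, 1, 1]$ over the feature vector $[1, x_1, x_2, x_1^2, x_2^2]$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])  # assumed example values

def predict(x1, x2):
    """Polynomial features turn the boundary into x1^2 + x2^2 = 1."""
    features = np.array([1.0, x1, x2, x1**2, x2**2])
    return 1 if sigmoid(theta @ features) >= 0.5 else 0

print(predict(0.0, 0.0))  # inside the unit circle  -> y = 0
print(predict(2.0, 0.0))  # outside the unit circle -> y = 1
```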
More Complex Boundaries
By adding higher-order polynomial terms such as:
- $x_1^2 x_2$, $x_1^2 x_2^2$, $x_1^3 x_2$, etc.
Logistic regression can represent:
- Ellipses
- Complex curves
- Highly nonlinear shapes
💰 Cost Function / Optimal Objective
The overall cost is:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$$

where $\mathrm{Cost}(h_\theta(x), y)$ measures how far the prediction is from the true label.
Why Not Use Squared Error Cost?
In linear regression, we use:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

If we use the same squared error cost with the sigmoid hypothesis:
- The cost function becomes non-convex
- Optimization may get stuck in local minima
- Training may fail to find the best parameters
So we need a better cost function.
We define the cost separately for the two classes.
The cost function is defined as:

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
Why This Cost Function Is Better
It is convex
- No local minima
- Guarantees only one global minimum
- Optimization is reliable
- Smooth and well-behaved
Because $J(\theta)$ is convex:
- Gradient descent will converge to the global minimum
- We do not get stuck in bad local optima
- Training is stable
It penalizes wrong predictions heavily
- Cost = 0 when prediction is correct
- Cost → ∞ when prediction is very wrong
- Encourages the model to be confident and correct
Case 1: When $y = 1$
We want $h_\theta(x)$ to be close to 1.
Cost:

$$\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$$

If the prediction is close to 1 → cost is small
- If $h_\theta(x) = 1$ → cost = 0
If the prediction is close to 0 → cost is very large
- If $h_\theta(x) \to 0$ → cost → $\infty$
So:
- Correct confident prediction → small cost
- Wrong confident prediction → very large cost
Case 2: When $y = 0$
We want $h_\theta(x)$ to be close to 0.
Cost:

$$\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$$

If the prediction is close to 1 → cost is very large
- If $h_\theta(x) \to 1$ → cost → $\infty$
If the prediction is close to 0 → cost is small
- If $h_\theta(x) = 0$ → cost = 0
Again:
- Correct prediction → small cost
- Wrong confident prediction → large penalty
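The two cost cases can be sketched directly; evaluating a few predictions shows the asymmetric penalty described above:

```python
import numpy as np

def cost(h, y):
    """Per-example logistic cost for a prediction h in (0, 1) and label y."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

print(cost(0.99, 1))  # correct and confident -> small cost
print(cost(0.01, 1))  # wrong and confident   -> large cost
print(cost(0.01, 0))  # correct and confident -> small cost
print(cost(0.99, 0))  # wrong and confident   -> large cost
```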
Unified Logistic Cost Function
Simplified Cost Function (Single Formula)
We can combine the two cases into one equation:

$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$

- If $y = 1$:
  - The second term becomes 0
  - Cost reduces to: $-\log(h_\theta(x))$
- If $y = 0$:
  - The first term becomes 0
  - Cost reduces to: $-\log(1 - h_\theta(x))$
So this single formula covers both cases.
Full Cost Function Over Dataset
For $m$ training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

This is called:
- Log loss
- Cross-entropy loss
- Logistic loss
Where:
- $m$ = number of training examples
- $x^{(i)}$, $y^{(i)}$ = features and label of the $i$-th example
Vectorized Cost Function
Let:
- $X$ = design matrix (one training example per row)
- $y$ = vector of labels
Then:

$$h = g(X\theta)$$

and

$$J(\theta) = -\frac{1}{m} \left( y^T \log(h) + (1 - y)^T \log(1 - h) \right)$$
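The vectorized cost translates almost directly into NumPy. The tiny dataset below is an illustrative assumption (first column of `X` is the intercept term); at $\theta = 0$ every prediction is 0.5, so the cost is $\log 2 \approx 0.693$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y):
    """Vectorized log loss: J = -(1/m) (y^T log h + (1-y)^T log(1-h))."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
print(compute_cost(theta, X, y))  # log(2) ~= 0.693
```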
🧠 Key Takeaways: Cost Function
- Logistic regression uses a convex cost function
- The simplified cost formula works for both $y = 0$ and $y = 1$
- Gradient descent update looks the same as linear regression
- Vectorization makes implementation efficient
- Always include the $\frac{1}{m}$ factor in the gradient update
🎢 Gradient Descent
General gradient descent:
Repeat:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Where:
- $\alpha$ = learning rate
- $J(\theta)$ = cost function
Logistic Regression Gradient
After computing the derivative, we get:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
Important Notes
- This is identical in form to linear regression gradient descent.
- We must update all $\theta_j$ simultaneously.
- The difference lies in the hypothesis function:

$$h_\theta(x) = g(\theta^T x)$$

where

$$g(z) = \frac{1}{1 + e^{-z}}$$
Vectorized Gradient Descent
Let:
- $X$ = design matrix
- $y$ = vector of labels
Then the update rule becomes:

$$\theta := \theta - \frac{\alpha}{m} X^T \left( g(X\theta) - y \right)$$

Where:
- $y$ is the vector of labels
- $X^T$ is the transpose of the design matrix
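Putting the pieces together, a minimal training loop is a sketch of the vectorized update rule. The dataset, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=5000):
    """Repeat theta := theta - (alpha/m) X^T (g(X theta) - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        gradient = X.T @ (sigmoid(X @ theta) - y) / m  # all theta_j at once
        theta -= alpha * gradient
    return theta

# Tiny linearly separable dataset (first column is the intercept term)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
print(preds)  # [0 0 1 1]: matches the labels on this separable data
```

Because the cost is convex, any reasonable learning rate converges to the same global minimum; vectorizing the update avoids an explicit loop over the parameters $\theta_j$.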
🧠 Key Takeaways: Gradient Descent
- Logistic regression uses gradient descent just like linear regression.
- The update formula is structurally the same.
- The cost function is different.
- The cost function is convex, so gradient descent converges to the global minimum.
- Vectorized form makes implementation efficient and clean.
