
Linear Regression Explained: Single Variable and Multivariate Models with Gradient Descent

Learn linear regression in machine learning, including single-variable and multivariate models, hypothesis function, cost function (MSE), gradient descent optimization, feature scaling, assumptions, and real-world implementation examples.

Written by Hitesh Sahu, a passionate developer and blogger.

Thu Feb 26 2026


📐 Linear Regression

Linear regression is a supervised learning algorithm used to predict a continuous output variable based on one or more input features.

  • It is widely used for prediction, forecasting, and as a baseline model.
  • It assumes that the relationship between inputs and output is linear.
$$y = \theta_0 + \theta_1 x$$

When to Use Linear Regression

Linear regression is ideal for:

  • Price prediction
  • Trend analysis
  • Baseline modeling
  • Interpretable relationships
  • Fast and simple forecasting

It is often used as a baseline before trying more complex models.

Key Assumptions

Linear regression works best when:

  • The relationship is approximately linear
  • Errors are independent
  • Variance of errors is constant
  • Residuals are normally distributed

Understanding these assumptions is important for reliable modeling.


🧮 Training Set $(x, y)$

The training set is the data fed to the learning algorithm.

The learning algorithm outputs a hypothesis function.

$$X = \begin{bmatrix} 1 & x_1^{(1)} & \dots & x_n^{(1)} \\ 1 & x_1^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ 1 & x_1^{(m)} & \dots & x_n^{(m)} \end{bmatrix}$$

  • $x$ = inputs
  • $y$ = expected output
  • $(x, y)$ = row in the training set: a single training example
  • $(x^{(i)}, y^{(i)})$ = i-th training example
  • $m$ = number of training examples (rows in the training set)
  • $n$ = number of features (columns in the training set)

Example: 🏠 House price

Size (x₁)   Rooms (x₂)   Price (y)
50          1            150
80          2            230
120         3            310

  • 📏 $x_1$ → size of house
  • 🛏 $x_2$ → number of rooms
  • $m$ = 3 training examples
  • $n$ = 2 features

In machine learning, it’s often not who has the best algorithm, but who has the most data.

Experiments that tested several learning algorithms while steadily increasing the training set size found:

  1. Different algorithms often performed similarly.
  2. Performance improved steadily as training set size increased.
  3. An “inferior” algorithm with more data often outperformed a “superior” algorithm with less data.

More data helps when:

$$\text{Sufficient features} + \text{Low-bias model} + \text{Large dataset} \Rightarrow \text{Low bias and low variance}$$
  1. Use a rich model → low bias
  2. Use a massive dataset → low variance

If both are true, then:

  • Training error is small
  • Test error is close to training error
  • Test error is also small

1. Use Features with Enough Signal

Your input features $x$ must contain enough information to predict $y$ accurately.

A useful test: given only the features $x$, could a human expert confidently predict $y$?

  • If yes, then the problem likely has enough signal.
  • If not, no amount of data will fix it.

2. Use a High-Capacity (Low-Bias) Model

You need a powerful learning algorithm, such as:

  • Logistic regression with many features
  • Linear regression with many features
  • Neural networks with many hidden units

These models have many parameters and can represent complex functions.

This helps ensure low bias.

3. Use a Very Large Training Set

If:

  • The model has many parameters
  • The training set is much larger than the number of parameters

Then overfitting becomes less likely.

This helps reduce variance.


💡 Hypothesis $h_\theta(x)$

The function that maps input $x$ to output $y$ is called the hypothesis function.

Supervised learning works like this:

Training Set → Learning Algorithm → Hypothesis Function

The algorithm outputs a function called h (the hypothesis):

  • $h$ = hypothesis, a trained model that maps $X$ to $Y$

$$h_\theta(x) = \theta^T x$$

Finding $\theta$

Our goal is to find the best values of $\theta$ that minimize prediction error.

1. Single-Variable Linear Regression: $h_\theta(x)$

Linear regression is a method of finding a continuous linear relationship between $y$ and $x$.

When there is only one feature, the model is:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

This represents a straight line, where:

  • $y$ = the predicted value
  • $x$ = the input feature
  • $\theta_0$ = the y-intercept
  • $\theta_1$ = the slope of the line

Example:

  • $\theta_0 = 0$ → line passes through the origin
  • $\theta_1 = 0$ → horizontal line

2. Multivariate Linear Regression: $h_\theta(x)$

Linear regression with multiple variables.

For multiple features:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$$

Matrix form:

$$X = \begin{bmatrix} 1 & 50 & 1 \\ 1 & 80 & 2 \\ 1 & 120 & 3 \end{bmatrix}$$

Parameter Vector

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}$$

Target Vector

$$y = \begin{bmatrix} 150 \\ 230 \\ 310 \end{bmatrix}$$

This form is computationally efficient.
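The efficiency of the matrix form is easy to see in code. Below is a minimal NumPy sketch (the parameter vector here is hypothetical, chosen only to show the shape of the computation): a single matrix-vector product evaluates the hypothesis for every training example at once.

```python
import numpy as np

# Design matrix from the house example, with a bias column of 1s.
X = np.array([
    [1.0,  50.0, 1.0],
    [1.0,  80.0, 2.0],
    [1.0, 120.0, 3.0],
])

# Hypothetical parameter vector [theta_0, theta_1, theta_2].
theta = np.array([20.0, 2.0, 30.0])

# One matrix-vector product computes h_theta(x) for all examples.
predictions = X @ theta
print(predictions)  # [150. 240. 350.]
```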

Calculating the Hypothesis Function

Using the closed-form (normal equation) solution:

$$\theta = (X^T X)^{-1} X^T y$$

Suppose, for illustration, the fitted parameters come out as:

  • $\theta_0 \approx 20$ : base price of a house
  • $\theta_1 \approx 2$ : price increase per unit of size
  • $\theta_2 \approx 30$ : price increase per additional room

Final Hypothesis Function:

$$\text{House Price} = 20 + 2 \cdot \text{Size} + 30 \cdot \text{Rooms}$$

Model Inference

Test for:

  • Size = 100
  • Rooms = 2

$$\text{Price} = 20 + 2(100) + 30(2) = 20 + 200 + 60 = 280$$

Predicted price = 280
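The normal equation is a one-liner in NumPy. A minimal sketch follows; note that with only three training examples and three parameters the linear system has an exact solution, so the fitted values depend entirely on this toy data and need not match the illustrative parameters above.

```python
import numpy as np

X = np.array([
    [1.0,  50.0, 1.0],
    [1.0,  80.0, 2.0],
    [1.0, 120.0, 3.0],
])
y = np.array([150.0, 230.0, 310.0])

# Normal equation: theta = (X^T X)^(-1) X^T y.
# Solving the linear system is numerically safer than inverting X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# With as many parameters as examples, the fit is exact:
# the hypothesis reproduces the training targets.
print(X @ theta)  # ≈ [150. 230. 310.]
```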


💰 Cost Function $J(\theta)$

How bad are our guesses?

The goal of the algorithm is to choose $\theta_0$ and $\theta_1$ such that $h_\theta(x)$ comes close to $y$.

  • Minimizing $J(\theta_0, \theta_1)$ minimizes the error
  • $J$ measures how well the model performs

So, for a given hypothesis

$$h_\theta(x) = \theta_0 + \theta_1 x$$

find the $\theta$ that minimizes $J(\theta)$.

Mean Squared Error Cost Function (MSE)

The squared error works well for regression problems because it:

  • Penalizes large errors
  • Is mathematically convenient
  • Produces a convex function

The cost function is defined as:

$$J(\theta_0, \theta_1) = J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Where:

  • $m$ = number of training examples
  • $h_\theta(x^{(i)})$ = prediction for input $x^{(i)}$
  • $y^{(i)}$ = actual value for input $x^{(i)}$

The objective is to minimize this function.
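The MSE cost translates directly into a few lines of NumPy. A sketch (the function and variable names are mine):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = (1 / 2m) * sum((h_theta(x) - y)^2)."""
    m = len(y)
    errors = X @ theta - y  # prediction minus actual, per example
    return (errors @ errors) / (2 * m)

# Sanity check on data from the line y = 1 + 2x:
# the true parameters give zero cost, wrong parameters a larger one.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
print(compute_cost(X, y, np.array([1.0, 2.0])))  # 0.0
print(compute_cost(X, y, np.array([0.0, 0.0])))  # larger cost
```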

Plotting the Cost Function

One feature ($x$) and two parameters ($\theta_0$, $\theta_1$):

$$h = \theta_0 + \theta_1 x$$

Parabola:

  • $J(\theta_0, \theta_1)$ as Y-axis
  • $\theta_1$ as X-axis

3D paraboloid:

  • $J(\theta_0, \theta_1)$ as Z-axis
  • $\theta_0$ as X-axis
  • $\theta_1$ as Y-axis

(figure: surface plot of the cost function)

Two features ($x_1, x_2$) and three parameters ($\theta_0$, $\theta_1$, $\theta_2$):

$$h = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

This cost surface lives in 4 dimensions and is impossible to visualize:

  • $J(\theta_0, \theta_1, \theta_2)$ as W-axis
  • $\theta_0$ as X-axis
  • $\theta_1$ as Y-axis
  • $\theta_2$ as Z-axis

Contour Figure ⛰️

A contour plot is what you get by slicing the surface plot with horizontal 2D clip planes.

Each ring represents:

  • All points that have the same height, i.e. the same cost $J(\theta)$.

In the contour plot:

  • X-axis → $\theta_0$
  • Y-axis → $\theta_1$

(figure: contour plot of the cost function)

Why Are the Contours Circular?

The contours form nested closed rings because the cost function is convex; they are circular when the features are on similar scales.

  • Each ring represents all points with the same cost
  • Smaller rings → smaller cost
  • Larger rings → larger cost

Gradient descent moves towards the center of the contour plot, where cost is minimum.

  • It moves perpendicular to the contour lines because that is the direction of steepest descent.

🎢 Gradient Descent

Gradient descent is an optimization algorithm used to minimize the cost function by iteratively moving towards the minimum.

That’s just a fancy name for:

Try numbers → see error → improve numbers → repeat.

It works like this:

  • Start somewhere, just like walking down from a hill.
  • Look around, take a step in the steepest downhill direction, and repeat until you reach a minimum.
  • On a non-convex surface, different starting points can lead to different local minima (the MSE cost of linear regression is convex, so it has a single global minimum).

Start somewhere → Take steps downhill → Reach minimum

(figure: gradient descent on a cost surface)

Algorithm: Single-Variable Linear Regression

For feature index $j = 0, 1$, repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

For $j = 0, 1$:

  • Simultaneously compute $\theta_0$, $\theta_1$ and store them in temporary values
  • Simultaneously update $\theta_0$, $\theta_1$

Where:

  • $\alpha$ = learning rate: how big a step we take downhill
  • $j$ = the parameter index
  • $:=$ = assignment operation, e.g. a := a + 1
  • $=$ = truth assertion, e.g. a == a

This process is repeated until convergence.

(figure: cost function during gradient descent)

Algorithm: Multivariate Linear Regression

Steps:

  • For feature index $j = 0, 1, \dots, n$, repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

For $j = 0, 1, 2, \dots, n$:

  • Simultaneously compute $\theta_0, \theta_1, \dots, \theta_n$ and store them in temporary values
  • Simultaneously update $\theta_0, \theta_1, \dots, \theta_n$

Cost Function Being Minimized

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Which is equivalent to:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2$$
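The simultaneous-update rule above can be sketched as vectorized batch gradient descent in NumPy. This is a minimal sketch, not a production implementation; the data is a made-up line $y = 1 + 2x$.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Batch gradient descent for linear regression.

    theta_j := theta_j - alpha * (1/m) * sum((h(x) - y) * x_j),
    computed for all j in one vectorized step, so the update
    of every parameter is simultaneous.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        errors = X @ theta - y                      # h_theta(x^(i)) - y^(i)
        theta = theta - alpha * (X.T @ errors) / m  # simultaneous update
    return theta

# Made-up data on the line y = 1 + 2x (bias column of 1s included).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = gradient_descent(X, y)
print(theta)  # ≈ [1. 2.]
```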


Learning Rate ($\alpha$)

Alpha defines the rate of learning.

The update rule is:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Or, simplified:

$$\theta := \theta - \alpha \nabla J(\theta)$$

  • Small $\alpha$ → slow learning
  • Large $\alpha$ → overshooting the minimum
  • Proper $\alpha$ → fast convergence

(figure: effect of different learning rates)

Deciding the Learning Rate ($\alpha$)

Plot the cost function $J(\theta)$ over the number of iterations of gradient descent.

If $J(\theta)$ decreases every iteration, you are probably using a good learning rate:

  • Smooth, steadily decreasing
  • Flattening near the minimum
  • No wild jumps and no upward trend
       # Good Learning Rate
       
       |
       |\
       | \
       |  \
Cost   |   \
       |    \____
       |
       +----------------
       Iterations

If $J(\theta)$ continuously increases, you probably need to decrease $\alpha$.

  • Reason: a large $\alpha$ overshoots the minimum, leading to divergence or oscillation.
  • Fix: reduce the learning rate.
        |
        |      /
        |     /
Cost    |    /
        |   /
        |  /
        +----------------
        Iterations

If $J(\theta)$ decreases but very slowly, you probably need to increase $\alpha$.

  • Reason: a small $\alpha$ causes slow convergence, taking many iterations to approach the minimum.
  • Fix: increase the learning rate.
      Cost
        |
        |\
        | \
        |  \
        |   \
        |    \______
        +----------------
             Iterations

If $J(\theta)$ oscillates, you probably need to reduce $\alpha$ and scale the features.

  • Reason:
    • A large $\alpha$ leads to oscillation around the minimum.
    • Features are on different scales.
  • Fix: reduce the learning rate and apply feature scaling.

    Cost
    |
    |   /\   /\   /\
    |  /  \ /  \ /  \
    | /
    +----------------
    Iterations

Debugging the Learning Rate ($\alpha$)

Behavior             Problem               Fix
Cost increases       α too large           Reduce α
Oscillates           α too large           Reduce α + scale features
Very slow decrease   α too small           Increase α
No improvement       Features not scaled   Apply scaling
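The diagnostics in the table can be automated by recording $J(\theta)$ at every iteration. A sketch under toy assumptions (the helper name is mine, and the value of a "too large" $\alpha$ depends on the data; 0.6 happens to diverge for this particular dataset):

```python
import numpy as np

def descend_with_history(X, y, alpha, iterations=50):
    """Gradient descent that records J(theta) before each update,
    so the cost curve can be inspected for the patterns above."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(iterations):
        errors = X @ theta - y
        history.append((errors @ errors) / (2 * m))  # J at current theta
        theta = theta - alpha * (X.T @ errors) / m
    return theta, history

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

_, good = descend_with_history(X, y, alpha=0.1)
_, bad = descend_with_history(X, y, alpha=0.6)

print(good[-1] < good[0])  # True: cost shrinking -> alpha is reasonable
print(bad[-1] > bad[0])    # True: cost growing -> alpha too large
```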

Derivative Term

The derivative term defines the rate of change of the cost function $J(\theta_1)$ with respect to $\theta_1$.

  • At a local minimum, the derivative term is 0:

$$\frac{\partial J(\theta)}{\partial \theta} = 0$$

  • As $\theta$ approaches a local minimum, the derivative shrinks, so the steps automatically become smaller. This is why a fixed $\alpha$ still works.
  • The derivative term drives $\theta_1$ towards its local minimum from both positive and negative slopes.

(figure: gradient descent update formula)


Feature Scaling

Feature scaling makes the cost function contours more circular, which allows gradient descent to converge faster.

  • Make sure features are on the same scale; otherwise the contours will be skewed ellipses.
  • Aim to get each feature into the range $-1 \le x_i \le 1$.
  • Ranges up to roughly $-3 \le x_i \le 3$ are still acceptable.

Problem

A difference in the scale of features creates skewed ellipses in the cost function contours.

  • Example:
    • Size of house (0–1000) vs. number of rooms (1–10)

A skewed ellipse has a long axis and a short axis.

  • Gradient descent will oscillate across the long axis and take a long time to converge to the minimum.

(figure: feature scaling)

📏 Solution

1. Min-Max Normalization

Subtract the minimum value of the feature and divide by the range (max − min).

  • Scales features to [0, 1]

$$x_i = \frac{x_i - \min(X)}{\max(X) - \min(X)}$$

2. Mean Normalization

Subtract the mean of the feature and divide by the range (max − min).

  • Scales features to have mean 0 and values within $[-1, 1]$

$$x_i = \frac{x_i - \text{Avg}(X)}{\max(X) - \min(X)}$$

Alternatively (standardization, or z-score scaling):

$$x_i = \frac{x_i - \mu}{\sigma}$$

Where:

  • $\mu$ = mean of the feature = Avg(X)
  • $\sigma$ = standard deviation of the feature
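The three scaling formulas can be sketched in NumPy (the helper names are mine), applied to the size and rooms features from the house example:

```python
import numpy as np

def min_max_scale(x):
    """Min-max normalization -> values in [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def mean_normalize(x):
    """Mean normalization -> mean 0, values within [-1, 1]."""
    return (x - x.mean()) / (x.max() - x.min())

def standardize(x):
    """Z-score scaling -> mean 0, standard deviation 1."""
    return (x - x.mean()) / x.std()

size = np.array([50.0, 80.0, 120.0])   # large scale
rooms = np.array([1.0, 2.0, 3.0])      # small scale

print(min_max_scale(size))    # [0.     0.4286 1.    ] approx
print(mean_normalize(rooms))  # [-0.5  0.   0.5]
print(standardize(size).std())
```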