Vectorized Neural Networks Model Representation
From Scalar Equations to Vector Form
Previously, we wrote each neuron separately.
For the hidden layer:
$$a^{(2)}_1 = g\left(\Theta^{(1)}_{10}x_0 + \Theta^{(1)}_{11}x_1 + \Theta^{(1)}_{12}x_2 + \Theta^{(1)}_{13}x_3\right)$$

$$a^{(2)}_2 = g\left(\Theta^{(1)}_{20}x_0 + \Theta^{(1)}_{21}x_1 + \Theta^{(1)}_{22}x_2 + \Theta^{(1)}_{23}x_3\right)$$

$$a^{(2)}_3 = g\left(\Theta^{(1)}_{30}x_0 + \Theta^{(1)}_{31}x_1 + \Theta^{(1)}_{32}x_2 + \Theta^{(1)}_{33}x_3\right)$$
Final hypothesis:
$$h_\Theta(x) = a^{(3)}_1 = g\left(\Theta^{(2)}_{10}a^{(2)}_0 + \Theta^{(2)}_{11}a^{(2)}_1 + \Theta^{(2)}_{12}a^{(2)}_2 + \Theta^{(2)}_{13}a^{(2)}_3\right)$$
This works, but writing one equation per neuron does not scale to larger networks. So we vectorize.
Step 1 — Define the Intermediate Variable
Define the weighted sum before activation:
$$z^{(j)}_k = \Theta^{(j-1)}_{k,0} a^{(j-1)}_0 + \Theta^{(j-1)}_{k,1} a^{(j-1)}_1 + \dots + \Theta^{(j-1)}_{k,n} a^{(j-1)}_n$$
Activation becomes:
$$a^{(j)}_k = g\left(z^{(j)}_k\right)$$
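As a quick sanity check, the weighted sum $z_k$ is just a dot product between one row of $\Theta$ and the activation vector. A minimal NumPy sketch (the numbers below are illustrative, not from the text):

```python
import numpy as np

# Hypothetical values for one hidden unit: a 3-feature input with bias x0 = 1
# and one row of weights Theta_k (illustrative numbers only).
x = np.array([1.0, 0.5, -1.2, 2.0])        # [x0, x1, x2, x3]
theta_k = np.array([0.1, 0.4, -0.3, 0.2])  # [Theta_k0, ..., Theta_k3]

# z_k is the dot product of the weight row with the activation vector.
z_k = theta_k @ x

# The scalar sum written out term by term gives the same number.
z_loop = sum(theta_k[i] * x[i] for i in range(4))
```

Both forms compute the same quantity; the dot product simply packages the sum.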
Step 2 — Vector Representation
Input layer:
$$x = a^{(1)} = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$$
Weighted sum vector:
$$z^{(j)} = \begin{bmatrix} z^{(j)}_1 \\ z^{(j)}_2 \\ \vdots \\ z^{(j)}_{s_j} \end{bmatrix}$$
Where:
$$s_j = \text{number of units in layer } j$$
Step 3 — The Key Vectorized Equation
The entire layer becomes a single matrix multiplication:
$$z^{(j)} = \Theta^{(j-1)} a^{(j-1)}$$
Dimensions:
$$\Theta^{(j-1)} \in \mathbb{R}^{s_j \times (n+1)}$$

$$a^{(j-1)} \in \mathbb{R}^{(n+1) \times 1}$$

$$z^{(j)} \in \mathbb{R}^{s_j \times 1}$$

Here $n = s_{j-1}$ is the number of units in layer $j-1$, and the extra $+1$ accounts for the bias unit.
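These dimensions can be verified with a small NumPy sketch (the layer sizes and random weights are illustrative, not from the text):

```python
import numpy as np

# Illustrative sizes: 3 input features (n = 3), 3 units in the next layer.
n, s_j = 3, 3
rng = np.random.default_rng(0)

Theta = rng.standard_normal((s_j, n + 1))              # shape (s_j, n+1)
a_prev = np.concatenate(([1.0], rng.standard_normal(n)))  # (n+1,) with bias a0 = 1

# One matrix-vector product computes every z_k of the layer at once.
z = Theta @ a_prev
```

The product of a $(s_j, n+1)$ matrix with an $(n+1,)$ vector yields an $(s_j,)$ vector, matching $z^{(j)} \in \mathbb{R}^{s_j \times 1}$.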
Step 4 — Apply Activation Function
Activation is applied element-wise:
$$a^{(j)} = g\left(z^{(j)}\right)$$
If using sigmoid:
$$g(z) = \frac{1}{1 + e^{-z}}$$
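A minimal NumPy implementation of the sigmoid; because NumPy broadcasts, the same function works element-wise on a whole $z^{(j)}$ vector:

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))
```

For example, `sigmoid(0.0)` is exactly 0.5, and large positive or negative inputs saturate toward 1 and 0.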
Step 5 — Add Bias Unit
After computing the activations, prepend the bias unit $a^{(j)}_0 = 1$:

$$a^{(j)} = \begin{bmatrix} 1 \\ a^{(j)}_1 \\ \vdots \\ a^{(j)}_{s_j} \end{bmatrix}$$
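In NumPy, prepending the bias unit is a one-liner (the activation values below are illustrative):

```python
import numpy as np

a = np.array([0.8, 0.1, 0.6])  # activations a_1 .. a_{s_j} (illustrative)
a = np.insert(a, 0, 1.0)       # prepend the bias unit a_0 = 1
```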
Step 6 — Output Layer
Repeat the same process:
$$z^{(j+1)} = \Theta^{(j)} a^{(j)}$$

$$a^{(j+1)} = g\left(z^{(j+1)}\right)$$
Final hypothesis:
$$h_\Theta(x) = a^{(j+1)}$$
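Putting Steps 1 through 6 together, forward propagation is a short loop. A sketch under illustrative assumptions (the `forward` helper, the 3-4-1 architecture, and the zero weights are all hypothetical, chosen so the output is easy to verify by hand):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Vectorized forward propagation.

    x      : input features (without the bias unit)
    thetas : list of weight matrices Theta^{(1)}, Theta^{(2)}, ...,
             each of shape (s_{j+1}, s_j + 1)
    """
    a = x
    for theta in thetas:
        a = np.insert(a, 0, 1.0)  # Step 5: add bias unit
        z = theta @ a             # Step 3: linear transformation
        a = sigmoid(z)            # Step 4: element-wise nonlinearity
    return a                      # h_Theta(x)

# With all-zero weights every z is 0, so every activation is g(0) = 0.5,
# including the final output.
thetas = [np.zeros((4, 4)), np.zeros((1, 5))]
h = forward(np.array([1.0, 2.0, 3.0]), thetas)
```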
The Big Picture
Each layer performs a linear transformation

$$z = \Theta a$$

followed by a nonlinearity

$$a = g(z)$$
Stacking these layers allows neural networks to represent complex nonlinear functions.