$$
\definecolor{input}{RGB}{66, 133, 244}
\definecolor{output}{RGB}{219, 68, 55}
\definecolor{dinput}{RGB}{244, 180, 0}
\definecolor{doutput}{RGB}{15, 157, 88}
\definecolor{dweight}{RGB}{102, 0, 255}
$$
Backpropagation algorithm
The backpropagation algorithm is essential for training large neural networks
quickly. This article explains how the algorithm works.
Simple neural network
Consider a neural network with one input node, one output node,
and two hidden layers of two nodes each.
Nodes in neighboring layers are connected with weights \(w_{ij}\), which are the network
parameters.
Activation function
Each node has a total input \(\color{input}x\), an activation function \(f(\color{input}x\color{black})\)
and an output \(\color{output}y\color{black}=f(\color{input}x\color{black})\).
\(f(\color{input}x\color{black})\) has to be a non-linear function, otherwise the neural network will only
be able to learn linear models.
A commonly used activation function is
the Sigmoid function:
\(f(\color{input}x\color{black}) = \frac{1}{1+e^{-\color{input}x}}\).
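As a concrete illustration, here is a minimal Python sketch of the sigmoid; the function name and the sample inputs are chosen for this example, not taken from the article:

```python
import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)): squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))  # 0.5
print(sigmoid(2.0))  # ~0.88
```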
Error function
The goal is to learn the weights of the network automatically from data such that the predicted output \(\color{output}y_{output}\)
is close to the target \(\color{output}y_{target}\) for all inputs \(\color{input}x_{input}\).
To measure how far we are from the goal, we use an error function \(E\).
A commonly used error function is \(E(\color{output}y_{output}\color{black},\color{output}y_{target}\color{black}) = \frac{1}{2}(\color{output}y_{output}\color{black} - \color{output}y_{target}\color{black})^2 \).
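The same error function as a short Python sketch (function and variable names are illustrative):

```python
def squared_error(y_output, y_target):
    # E = 1/2 * (y_output - y_target)^2
    return 0.5 * (y_output - y_target) ** 2

print(squared_error(0.8, 1.0))  # ~0.02: a small penalty for being 0.2 away from the target
```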
Forward propagation
We begin by taking an input example \((\color{input}x_{input}\color{black},\color{output}y_{target}\color{black})\) and updating the input layer of the network.
For consistency, we consider the input to be like any other node, but without an activation function, so its output is equal to its input, i.e. \( \color{output}y_1 \color{black} = \color{input} x_{input} \).
Forward propagation
Now, we update the first hidden layer. We take the output \(\color{output}y\) of the nodes in the previous layer
and use the weights to compute the input \(\color{input}x\) of the nodes in the next layer.
$$ \color{input} x_j \color{black} = \sum_{i\in in(j)} w_{ij}\color{output} y_i \color{black} + b_j$$
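As a sketch, this weighted sum for a single node could look like the following, assuming the incoming outputs, weights, and bias are given as plain Python values (all names and numbers are illustrative):

```python
def total_input(incoming_outputs, incoming_weights, bias):
    # x_j = sum_i(w_ij * y_i) + b_j
    return sum(w * y for w, y in zip(incoming_weights, incoming_outputs)) + bias

# A node with two incoming connections:
print(total_input([0.5, 0.9], [0.4, -0.2], 0.1))  # 0.4*0.5 + (-0.2)*0.9 + 0.1 ≈ 0.12
```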
Forward propagation
Then we update the output of the nodes in the first hidden layer.
For this we use the activation function, \( f(x) \).
$$ \color{output} y \color{black} = f(\color{input} x \color{black})$$
Forward propagation
Using these two formulas, we propagate through the rest of the network and get the final output of the network.
$$ \color{output} y \color{black} = f(\color{input} x \color{black})$$
$$ \color{input} x_j \color{black} = \sum_{i\in in(j)} w_{ij}\color{output} y_i \color{black} + b_j$$
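Putting the two formulas together, here is a hedged sketch of a full forward pass through a 1-2-2-1 network like the one described above. The weight and bias values, the variable names, and the input value are all assumptions made for this example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_layer(prev_outputs, weights, biases):
    # weights[j][i] is w_ij: the weight from node i in the previous layer
    # to node j in this layer.
    xs = [sum(row[i] * prev_outputs[i] for i in range(len(prev_outputs))) + b
          for row, b in zip(weights, biases)]
    return [sigmoid(x) for x in xs]  # y = f(x) for every node in the layer

# Illustrative weights and biases for a 1-2-2-1 network (arbitrary values).
layers = [
    ([[0.3], [-0.1]],           [0.0, 0.0]),   # input    -> hidden 1
    ([[0.5, 0.2], [0.4, -0.3]], [0.1, -0.1]),  # hidden 1 -> hidden 2
    ([[0.7, 0.6]],              [0.0]),        # hidden 2 -> output
]

y = [1.5]  # y_1 = x_input: the input node just passes its value through
for weights, biases in layers:
    y = forward_layer(y, weights, biases)

print(y[0])  # the network's prediction, y_output
```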
Error derivative
The backpropagation algorithm decides how much to
update each weight of the network after comparing the predicted output with the desired output for a particular example.
For this, we need to compute how the error changes
with respect to each weight \(\color{dweight}\frac{dE}{dw_{ij}}\).
Once we have the error derivatives, we can update the weights using a simple update rule:
$$w_{ij} = w_{ij} - \alpha \color{dweight}\frac{dE}{dw_{ij}}$$
where \(\alpha\) is a positive constant, referred to as the learning rate, which we need to fine-tune empirically.
[Note] The update rule is very simple: if the error goes down when the weight increases (\(\color{dweight}\frac{dE}{dw_{ij}}\color{black} < 0\)),
then increase the weight; if the error goes up when the weight increases (\(\color{dweight}\frac{dE}{dw_{ij}} \color{black} > 0\)),
then decrease the weight.
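A minimal sketch of the update rule for a single weight (the learning rate and the sample gradients are illustrative):

```python
def update_weight(w, dE_dw, learning_rate=0.1):
    # Move the weight against the gradient: w = w - alpha * dE/dw
    return w - learning_rate * dE_dw

print(update_weight(0.5, -0.2))  # gradient < 0, so the weight increases: 0.52
print(update_weight(0.5, 0.2))   # gradient > 0, so the weight decreases: 0.48
```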
Additional derivatives
To help compute \(\color{dweight}\frac{dE}{dw_{ij}}\), we additionally store two more derivatives for each node,
describing how the error changes with:
- the total input of the node \(\color{dinput}\frac{dE}{dx}\) and
- the output of the node \(\color{doutput}\frac{dE}{dy}\).
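One way to picture this per-node bookkeeping is a small record type; this is purely illustrative, as the article does not prescribe any particular data structure:

```python
from dataclasses import dataclass

@dataclass
class Node:
    x: float = 0.0      # total input
    y: float = 0.0      # output, y = f(x)
    dE_dx: float = 0.0  # how the error changes with the total input
    dE_dy: float = 0.0  # how the error changes with the output
```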
Backpropagation
Let's begin backpropagating the error derivatives.
Since we have the predicted output of this particular input example, we can compute how the error changes with that output.
Given our error function \(E = \frac{1}{2}(\color{output}y_{output}\color{black} - \color{output}y_{target}\color{black})^2\) we have:
$$ \color{doutput} \frac{\partial E}{\partial y_{output}} \color{black} = \color{output} y_{output} \color{black} - \color{output} y_{target}$$
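In code, this first backward step is a one-liner (names and sample values are illustrative):

```python
def dE_dy_output(y_output, y_target):
    # Derivative of E = 1/2 * (y_output - y_target)^2 with respect to y_output
    return y_output - y_target

print(dE_dy_output(0.8, 1.0))  # ~ -0.2: the error would shrink if the output grew
```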
Backpropagation
Now that we have \(\color{doutput} \frac{dE}{dy}\), we can get \(\color{dinput}\frac{dE}{dx}\) using the chain rule.
$$\color{dinput} \frac{\partial E}{\partial x} \color{black} = \frac{dy}{dx}\color{doutput}\frac{\partial E}{\partial y} \color{black} = \frac{d}{dx}f(\color{input}x\color{black})\color{doutput}\frac{\partial E}{\partial y}$$
where \(\frac{d}{dx}f(\color{input}x\color{black}) = f(\color{input}x\color{black})(1 - f(\color{input}x\color{black}))\) when
\(f(\color{input}x\color{black})\) is the Sigmoid activation function.
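A sketch of this step for the sigmoid case. Because \(f(\color{input}x\color{black})(1 - f(\color{input}x\color{black})) = \color{output}y\color{black}(1-\color{output}y\color{black})\), the derivative can be computed directly from the node's stored output; all names and numbers are illustrative:

```python
def sigmoid_derivative(y):
    # dy/dx = f(x) * (1 - f(x)), which equals y * (1 - y) when y = f(x)
    return y * (1.0 - y)

def dE_dx(y, dE_dy):
    # Chain rule: dE/dx = dy/dx * dE/dy
    return sigmoid_derivative(y) * dE_dy

print(dE_dx(0.8, -0.2))  # 0.8 * 0.2 * (-0.2) ≈ -0.032
```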
Backpropagation
As soon as we have the error derivative with respect to the total input of a node,
we can get the error derivative with respect to the weights coming into that node.
$$\color{dweight} \frac{\partial E}{\partial w_{ij}} \color{black} = \frac{\partial x_j}{\partial w_{ij}} \color{dinput}\frac{\partial E}{\partial x_j} \color{black} = \color{output}y_i \color{dinput} \frac{\partial E}{\partial x_j}$$
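The corresponding sketch is a single multiplication (illustrative names and values):

```python
def dE_dw(y_i, dE_dx_j):
    # dE/dw_ij = y_i * dE/dx_j: the source node's output times the error
    # derivative with respect to the destination node's total input
    return y_i * dE_dx_j

print(dE_dw(0.6, -0.032))  # ≈ -0.0192
```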
Backpropagation
And using the chain rule, we can also get \(\color{doutput}\frac{dE}{dy}\color{black}\) for the nodes in the previous layer. We have made a full circle.
$$ \color{doutput} \frac{\partial E}{\partial y_i} \color{black} = \sum_{j\in out(i)} \frac{\partial x_j}{\partial y_i} \color{dinput} \frac{\partial E}{\partial x_j} \color{black} = \sum_{j\in out(i)} w_{ij} \color{dinput} \frac{\partial E}{\partial x_j}$$
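As a sketch, assuming a node's outgoing weights and the matching \(\color{dinput}\frac{\partial E}{\partial x_j}\color{black}\) values of the next layer are given as parallel lists (illustrative names and values):

```python
def dE_dy(outgoing_weights, dE_dx_next):
    # dE/dy_i = sum over j in out(i) of w_ij * dE/dx_j
    return sum(w * d for w, d in zip(outgoing_weights, dE_dx_next))

print(dE_dy([0.5, -0.3], [-0.032, 0.01]))  # 0.5*(-0.032) + (-0.3)*0.01 ≈ -0.019
```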
Backpropagation
All that is left to do is repeat the previous three formulas until we have computed all the error derivatives.
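To tie everything together, here is a hedged end-to-end sketch: one forward pass, one full backward pass applying the three formulas above, and a gradient-descent step on every weight and bias. The 1-2-2-1 layout, the starting weights, the input/target pair, and the learning rate are all assumptions made for this example, not values from the article:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(weights, biases, x_input, y_target, alpha=0.5):
    # weights[l][j][i] is w_ij from node i in layer l to node j in layer l + 1.
    # Forward pass: store every layer's outputs y for use in the backward pass.
    ys = [[x_input]]  # the input node's output equals its input
    for W, b in zip(weights, biases):
        prev = ys[-1]
        xs = [sum(W[j][i] * prev[i] for i in range(len(prev))) + b[j]
              for j in range(len(W))]
        ys.append([sigmoid(x) for x in xs])

    y_output = ys[-1][0]

    # Backward pass: start with dE/dy at the output, then apply the three
    # formulas layer by layer, going backwards.
    dE_dy = [y_output - y_target]  # dE/dy_output = y_output - y_target
    for l in reversed(range(len(weights))):
        # dE/dx_j = f'(x_j) * dE/dy_j, with f'(x) = y * (1 - y) for the sigmoid
        dE_dx = [y * (1.0 - y) * d for y, d in zip(ys[l + 1], dE_dy)]
        # dE/dy_i = sum_j(w_ij * dE/dx_j), computed before the weights change
        dE_dy = [sum(weights[l][j][i] * dE_dx[j] for j in range(len(weights[l])))
                 for i in range(len(ys[l]))]
        # dE/dw_ij = y_i * dE/dx_j, followed by the gradient-descent update
        for j in range(len(weights[l])):
            for i in range(len(ys[l])):
                weights[l][j][i] -= alpha * ys[l][i] * dE_dx[j]
            biases[l][j] -= alpha * dE_dx[j]

    return y_output  # prediction made before this step's update

# Illustrative 1-2-2-1 network with arbitrary starting weights.
weights = [
    [[0.3], [-0.1]],
    [[0.5, 0.2], [0.4, -0.3]],
    [[0.7, 0.6]],
]
biases = [[0.0, 0.0], [0.1, -0.1], [0.0]]

for _ in range(200):
    y_pred = train_step(weights, biases, x_input=1.5, y_target=0.2)
print("prediction after training:", y_pred)  # should have moved close to 0.2
```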