Multi-Layer Perceptron
This tutorial is based on http://deeplearning.net/tutorial/mlp.html but uses Latte to implement the code examples.
A Multi-Layer Perceptron (MLP) can be described as a logistic regression classifier where the input is first transformed using a learnt non-linear transformation $\Phi$. This intermediate layer performing the transformation is referred to as a hidden layer. One hidden layer is sufficient to make MLPs a universal approximator [1].
The Model
An MLP with a single hidden layer can be represented graphically as follows:
To understand this representation, we'll first define the properties of a single neuron. A Neuron, depicted in the figure as a circle, computes an output (typically called an activation) as a function of its inputs. In this figure, an input is depicted as a directed edge flowing into a Neuron. For an MLP, the computation of a Neuron's output begins with a weighted sum of the inputs. Formally, if we describe the input as a vector of values $x$ and a vector of weights $w$, this operation can be written as $w \cdot x$ (a dot product). Next, we shift this value by a learned bias $b$, then apply an activation function $s$ such as $\tanh$, the sigmoid, or the ReLU. The entire computation is written as $s(w \cdot x + b)$.
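As a concrete reference, this computation can be sketched in plain Julia as follows. This is only an illustrative sketch; the function neuron_output and its arguments are hypothetical and not part of Latte.

# Illustrative sketch of a single neuron's output s(w ⋅ x + b); not part of Latte.
function neuron_output(w, x, b; s=tanh)
    total = 0.0f0
    for i in 1:length(w)
        total += w[i] * x[i]   # weighted sum of the inputs
    end
    return s(total + b)        # shift by the bias, then apply the activation s
end

For example, passing s = z -> max(zero(z), z) would apply a ReLU activation instead of tanh.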
Backpropagation
Forthcoming...
Stochastic Gradient Descent
Gradient Descent is a simple algorithm that repeatedly makes small steps downward on an error surface defined by a loss function of some parameters. Traditional Gradient Descent computes the gradient after a pass over the entire input data set. In practice, Gradient Descent proceeds more quickly when the gradient is estimated from just a few examples at a time. This extreme form computes the gradient for a single training input at a time:
for (input, expected_result) in training_set
    loss = f(params, input, expected_result)
    ∇_loss_wrt_params = ...  # compute gradients
    params -= learning_rate * ∇_loss_wrt_params
    if stop_condition_met()
        return params
    end
end
In practice, SGD for deep learning is performed using minibatches, where a minibatch of inputs is used to estimate the gradient. This technique typically reduces the variance of the gradient estimate, but more importantly allows implementations to make better use of the memory hierarchy in modern computers.
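In the same pseudocode style as the loop above, a minibatch variant might look like the following sketch (minibatches, f, learning_rate, and stop_condition_met are hypothetical placeholders):

# Minibatch SGD sketch; minibatches() is a hypothetical helper that groups
# training_set into batches of batch_size examples.
for batch in minibatches(training_set, batch_size)
    ∇_loss_wrt_params = zero(params)
    for (input, expected_result) in batch
        loss = f(params, input, expected_result)
        ∇_loss_wrt_params += ...  # accumulate this example's gradient
    end
    # take one step using the gradient averaged over the minibatch
    params -= learning_rate * ∇_loss_wrt_params / length(batch)
    if stop_condition_met()
        return params
    end
end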
Defining a WeightedNeuron
Defining a WeightedNeuron begins with a subtype of the abstract Neuron type. The Neuron type contains 4 default fields:

value – contains the output value of the neuron
∇ – contains the gradient of the neuron
inputs – a vector of vectors of input values
∇inputs – a vector of gradients for connected neurons
For our WeightedNeuron we will define the following additional fields:
weights – a vector of learned weights
∇weights – a vector of gradients for the weights
bias – the bias value
∇bias – the gradient for the bias value
@neuron type WeightedNeuron
    weights  :: DenseArray{Float32}
    ∇weights :: DenseArray{Float32}
    bias     :: DenseArray{Float32}
    ∇bias    :: DenseArray{Float32}
end
Note
We do not define the default fields because the @neuron macro used for the type definition specifies them for us. Furthermore, the macro defines a constructor function that automatically initializes the default fields.
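Conceptually, the result is a neuron type carrying both the default fields and the fields we declared. The sketch below is only an illustration of that layout, not the actual output of the @neuron macro, and the concrete types shown for the default fields are guesses.

# Conceptual layout only -- not the actual expansion of @neuron; the
# types of the default fields are guesses for illustration.
type WeightedNeuron <: Neuron
    # default fields added by @neuron
    value    :: Float32                  # output value of the neuron
    ∇        :: Float32                  # gradient of the neuron
    inputs   :: Vector{Vector{Float32}}  # vector of vectors of input values
    ∇inputs  :: Vector{Vector{Float32}}  # gradients for connected neurons
    # fields we declared
    weights  :: DenseArray{Float32}
    ∇weights :: DenseArray{Float32}
    bias     :: DenseArray{Float32}
    ∇bias    :: DenseArray{Float32}
end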
Next we define the forward computation for the neuron:
@neuron forward(neuron::WeightedNeuron) do
    # dot product of weights and inputs
    for i in 1:length(neuron.inputs[1])
        neuron.value += neuron.weights[i] * neuron.inputs[1][i]
    end
    # shift by the bias value
    neuron.value += neuron.bias[1]
end
And finally we define the backward computation for the backpropagation algorithm:
@neuron backward(neuron::WeightedNeuron) do
    # propagate the gradient to the inputs
    for i in 1:length(neuron.inputs[1])
        neuron.∇inputs[1][i] += neuron.weights[i] * neuron.∇
    end
    # accumulate the gradient for each weight
    for i in 1:length(neuron.inputs[1])
        neuron.∇weights[i] += neuron.inputs[1][i] * neuron.∇
    end
    # accumulate the gradient for the bias
    neuron.∇bias[1] += neuron.∇
end
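To see where these updates come from, recall that the forward computation above produces $y = w \cdot x + b$, so $\partial y / \partial x_i = w_i$, $\partial y / \partial w_i = x_i$, and $\partial y / \partial b = 1$. The chain rule therefore scales the gradient flowing into the neuron (stored in neuron.∇) by each of these partial derivatives and accumulates the result, which is exactly what the three statements above compute.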
Building a Layer using Ensembles and Connections
In Latte, a layer can be described as an Ensemble of Neurons with a specific set of connections to one or more input Ensembles. To construct a hidden layer for our MLP, we will use an Ensemble of WeightedNeurons with each neuron connected to all of the neurons in the input Ensemble.
Our fully connected layer, FCLayer, will be a Julia function that instantiates an Ensemble of WeightedNeurons and applies a full connection structure to the input_ensemble. The signature looks like this:
function FCLayer(name::Symbol, net::Net, input_ensemble::AbstractEnsemble, num_neurons::Int)
To construct a hidden layer with num_neurons, we begin by instantiating a 1-d Array to hold our WeightedNeurons.
neurons = Array(WeightedNeuron, num_neurons)
Next we instantiate the parameters for our WeightedNeurons. Note that xavier refers to a function that initializes a random set of values using the Xavier (TODO: Reference) initialization scheme. xavier and other initialization routines are provided as part of the Latte standard library.
num_inputs = length(input_ensemble)
weights  = xavier(Float32, num_inputs, num_neurons)
∇weights = zeros(Float32, num_inputs, num_neurons)
bias     = zeros(Float32, 1, num_neurons)
∇bias    = zeros(Float32, 1, num_neurons)
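For reference, a Xavier-style initializer in the spirit of xavier could be sketched in plain Julia as follows. This is only an illustrative sketch of one common variant of the scheme, not Latte's actual implementation.

# Illustrative Xavier-style initializer -- not Latte's xavier implementation.
# Draws uniform values in [-scale, scale] with scale = sqrt(3 / num_inputs),
# one common variant of the Xavier/Glorot scheme.
function xavier_sketch(::Type{Float32}, num_inputs, num_neurons)
    scale = sqrt(3.0f0 / num_inputs)
    return rand(Float32, num_inputs, num_neurons) .* 2scale .- scale
end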
With our parameters initialized, we are ready to initialize our neurons. Note that each WeightedNeuron instance uses a different column of the parameter arrays.
for i in 1:num_neurons
    neurons[i] = WeightedNeuron(view(weights, :, i), view(∇weights, :, i),
                                view(bias, :, i), view(∇bias, :, i))
end
Next, we are ready to instantiate our Ensemble:
ens = Ensemble(net, name, neurons,
               [Param(name, :weights, 1.0f0, 1.0f0),
                Param(name, :bias,    2.0f0, 0.0f0)])
Then we add connections so that each neuron in ens is connected to every neuron in input_ensemble:
add_connections(net, input_ensemble, ens,
                (i) -> tuple([Colon() for d in size(input_ensemble)]...))
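The anonymous function passed to add_connections maps each neuron's index i to the region of input_ensemble it connects to. Here it returns a tuple containing a Colon() for every dimension of input_ensemble, i.e. the entire input ensemble, which yields the full connection structure described above; for a 2-d input ensemble the mapping is effectively (i) -> (Colon(), Colon()).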
Finally, we return the constructed Ensemble so it can be used as an input to another layer.
    return ens
end
Constructing an MLP using Net
To construct an MLP we instantiate the Net type with a batch size of $100$. Then we use the HDF5DataLayer provided by the Latte standard library, which constructs an input ensemble that reads from HDF5 datasets. (TODO: Link to explanation of Latte's HDF5 format). Then we construct two FCLayers using the function that we defined. Finally we use two more Latte standard library layers as output layers: the SoftmaxLossLayer is used to train the network and the AccuracyLayer is used to test the network.
using Latte

net = Net(100)
data, label = HDF5DataLayer(net, "data/train.txt", "data/test.txt")
fc1 = FCLayer(:fc1, net, data, 100)
fc2 = FCLayer(:fc2, net, fc1, 10)
loss = SoftmaxLossLayer(:loss, net, fc2, label)
accuracy = AccuracyLayer(:accuracy, net, fc2, label)

params = SolverParameters(
    lr_policy = LRPolicy.Inv(0.01, 0.0001, 0.75),
    mom_policy = MomPolicy.Fixed(0.9),
    max_epoch = 50,
    regu_coef = .0005)
sgd = SGD(params)
solve(sgd, net)
Training
We will train the above MLP on the MNIST digit recognition dataset. For your convenience, the code in this tutorial has been provided in tutorials/mlp/mlp.jl. Note that the name WeightedNeuron was replaced with MLPNeuron to resolve conflicts with the existing WeightedNeuron definition in the Latte standard library. To train the network, first download and convert the dataset by running tutorials/mlp/data/get-data.sh. Then train by running julia mlp.jl. You should see output like the following, showing the loss values and test results:
...
INFO: 07-Apr 15:15:22 - Entering solve loop
INFO: 07-Apr 15:15:23 - Iter 20 - Loss: 1.4688001
INFO: 07-Apr 15:15:24 - Iter 40 - Loss: 0.6913204
INFO: 07-Apr 15:15:25 - Iter 60 - Loss: 0.6053091
INFO: 07-Apr 15:15:26 - Iter 80 - Loss: 0.6043377
INFO: 07-Apr 15:15:27 - Iter 100 - Loss: 0.57204634
INFO: 07-Apr 15:15:28 - Iter 120 - Loss: 0.500179
INFO: 07-Apr 15:15:28 - Iter 140 - Loss: 0.40663132
INFO: 07-Apr 15:15:29 - Iter 160 - Loss: 0.3704785
INFO: 07-Apr 15:15:29 - Iter 180 - Loss: 0.3620596
INFO: 07-Apr 15:15:30 - Iter 200 - Loss: 0.46897307
INFO: 07-Apr 15:15:30 - Iter 220 - Loss: 0.45075363
INFO: 07-Apr 15:15:31 - Iter 240 - Loss: 0.3376474
INFO: 07-Apr 15:15:31 - Iter 260 - Loss: 0.5301368
INFO: 07-Apr 15:15:32 - Iter 280 - Loss: 0.28490248
INFO: 07-Apr 15:15:32 - Iter 300 - Loss: 0.33110633
INFO: 07-Apr 15:15:33 - Iter 320 - Loss: 0.26910272
INFO: 07-Apr 15:15:33 - Iter 340 - Loss: 0.32226878
INFO: 07-Apr 15:15:33 - Iter 360 - Loss: 0.3838666
INFO: 07-Apr 15:15:34 - Iter 380 - Loss: 0.24588501
INFO: 07-Apr 15:15:34 - Iter 400 - Loss: 0.4209111
INFO: 07-Apr 15:15:35 - Iter 420 - Loss: 0.25582874
INFO: 07-Apr 15:15:35 - Iter 440 - Loss: 0.3958639
INFO: 07-Apr 15:15:36 - Iter 460 - Loss: 0.27812394
INFO: 07-Apr 15:15:36 - Iter 480 - Loss: 0.45379284
INFO: 07-Apr 15:15:37 - Iter 500 - Loss: 0.35272872
INFO: 07-Apr 15:15:38 - Iter 520 - Loss: 0.39787623
INFO: 07-Apr 15:15:38 - Iter 540 - Loss: 0.30763283
INFO: 07-Apr 15:15:39 - Iter 560 - Loss: 0.35435736
INFO: 07-Apr 15:15:40 - Iter 580 - Loss: 0.33140996
INFO: 07-Apr 15:15:41 - Iter 600 - Loss: 0.34410283
INFO: 07-Apr 15:15:41 - Epoch 1 - Testing...
INFO: 07-Apr 15:15:44 - Epoch 1 - Test Result: 90.88118%
...