Multi-Layer Perceptron

This tutorial is based on http://deeplearning.net/tutorial/mlp.html but uses Latte to implement the code examples.

A Multi-Layer Perceptron (MLP) can be described as a logistic regression classifier where the input is first transformed using a learnt non-linear transformation $\Phi$. This intermediate layer performing the transformation is referred to as a hidden layer. One hidden layer is sufficient to make MLPs a universal approximator [1].

The Model

An MLP with a single hidden layer can be represented graphically as follows:

Single Hidden Layer MLP

To understand this representation, we'll first define the properties of a singular neuron. A Neuron, depicted in the figure as a circle, computes an output (typically called an activation) as a function of its inputs. In this figure, an input is depicted as a directed edge flowing into a Neuron. For an MLP, the function to compute the output of a Neuron begins with a weighted sum of the inputs. Formally, if we describe the input as a vector of values $x$ and a vector of weights $w$, this operation can be written as $w \cdot x$ (dot product). Next, we will shift this value by a learned bias $b$, then apply an activation function $s$ such as $tanh$, $sigmoid$ or $relu$. The entire computation is written as $s(w \cdot x + b)$.

Backpropogation

Forthcoming...

Stochastic Gradient Descent

Gradient Descent is a simple algorithm that repeatedly makes small steps downward on an error surface defined by a loss function of some parameters. Traditional Gradient Descent computes the gradient after a pass over the entire input data set. In practice, Gradient Descent proceeds more quickly when the gradient is estimated from just a few examples at time. This extreme form computes the gradient for a single training input at a time:

for (input, expected_result) in training_set:
    loss = f(params, input, expected_result)
    ∇_loss_wrt_params = ...  # compute gradients
    params -= learning_rate * ∇_loss_wrt_params
    if stop_condition_met()
        return params
    end
end

In practice, SGD for deep learning is performed using minibatches, where a minibatch of inputs are used to estimate the gradients. This technique typically reduces variance in the estimation of the gradient, but more importantly allows implementations to make better use of the memory hierarchy in modern computers.

Defining a WeightedNeuron

Defining a WeightedNeuron begins with a subtype of the abstract Neuron type. The Neuron type contains 4 default fields:

value – contains the output value of the neuron
∇ – contains the gradient of the neuron
inputs – a vector of vectors of input values.
∇inputs – a vector of gradients for connected neurons

For our WeightedNeuron we will define the following additional fields:

weights – a vector of learned weights
∇weights – a vector of gradients for the weights
bias – the bias value
∇bias – the gradient for the bias value

@neuron type WeightedNeuron
    weights  :: DenseArray{Float32}
    ∇weights :: DenseArray{Float32}

    bias     :: DenseArray{Float32}
    ∇bias    :: DenseArray{Float32}
end

Note

We do not define the default fields as using the @neuron macro for the type definition will specify them for us. Furthermore the macro defines a constructor function that automatically initializes the default fields.

Next we define the forward computation for the neuron:

@neuron forward(neuron::WeightedNeuron) do
    # dot product of weights and inputs
    for i in 1:length(neuron.inputs[1])
        neuron.value += neuron.weights[i] * neuron.inputs[1][i]
    end
    # shift by the bias value
    neuron.value += neuron.bias[1]
end

And finally we define the backward computation for the back-propogation algorithm:

@neuron backward(neuron::WeightedNeuron) do
    for i in 1:length(neuron.inputs[1])
        neuron.∇inputs[1][i] += neuron.weights[i] * neuron.∇
    end
    for i in 1:length(neuron.inputs[1])
        neuron.∇weights[i] += neuron.inputs[1][i] * neuron.∇
    end
    neuron.∇bias[1] += neuron.∇
end

Building a Layer using Ensembles and Connections

In Latte, a layer can be described as an Ensemble of Neurons with a specific set of connections to one or more input Ensembles. To construct a Hidden Layer for our MLP, we will use an Ensemble of WeightedNeurons with each neuron connected to each all the neurons in the input Ensemble.

Our FullyConnectedLayer will be a Julia Function that instantiates an Ensemble of WeightedNeurons and applies a full connection structure to the input_ensemble. The signature looks like this:

function FCLayer(name::Symbol, net::Net, input_ensemble::AbstractEnsemble, num_neurons::Int)

To construct a hidden layer with num_neurons, we begin by instantiating a 1-d Array to hold our WeightedNeurons.

    neurons = Array(WeightedNeuron, num_neurons)

Next we instantiate the parameters for our WeightedNeurons. Note xavier refers to a function to initialize a random set of values using the Xavier (TODO: Reference) initialization scheme. xavier and other initialization routines are provided as part of the Latte standard library.

    num_inputs = length(input_ensemble)
    weights = xavier(Float32, num_inputs, num_neurons)
    ∇weights = zeros(Float32, num_inputs, num_neurons)

    bias = zeros(Float32, 1, num_neurons)
    ∇bias = zeros(Float32, 1, num_neurons)

With our parameters initialized, we are ready to initialize our neurons. Note that each WeightedNeuron instance uses a different row of parameter values.

    for i in 1:num_neurons
        neurons[i] = WeightedNeuron(view(weights, :, i), view(∇weights, :, i),
                                    view(bias, :, i), view(∇bias, :, i))
    end

Finally, we are ready to instantiate our Ensemble.

    ens = Ensemble(net, name, neurons, [Param(name,:weights, 1.0f0, 1.0f0), 
                                        Param(name,:bias, 2.0f0, 0.0f0)])

Then we add connections to each neuron in input_ensemble

    add_connections(net, input_ensemble, ens,
                    (i) -> (tuple([Colon() for d in size(input_ensemble)]... )))

Finally, we return the constructed Ensemble so it can be used as an input to another layer.

    return ens
end

Constructing an MLP using Net

To construct an MLP we instantiate the Net type with a batch size of $100$. Then we use the Latte standard library provided HDF5DataLayer that constructs an input ensemble that reads from HDF5 datasets. (TODO: Link to explanation of Latte's HDF5 format). Then we construct two FCLayers using the function that we defined. Finally we use two more Latte standard library layers as output layers. The SoftmaxLoss layer is used for train the network and the AccuracyLayer is used for test the network.

using Latte

net = Net(100)
data, label = HDF5DataLayer(net, "data/train.txt", "data/test.txt")

fc1 = FCLayer(:fc1, net, data, 100)
fc2 = FCLayer(:fc2, net, fc1, 10)

loss     = SoftmaxLossLayer(:loss, net, fc2, label)
accuracy = AccuracyLayer(:accuracy, net, fc2, label)

params = SolverParameters(
    lr_policy    = LRPolicy.Inv(0.01, 0.0001, 0.75),
    mom_policy   = MomPolicy.Fixed(0.9),
    max_epoch    = 50,
    regu_coef    = .0005)
sgd = SGD(params)
solve(sgd, net)

Training

We will train the above MLP on the MNIST digit recognition dataset. For your convenience the code in this tutorial has been provided in tutorials/mlp/mlp.jl. Note that the name WeightedNeuron was replaced with MLPNeuron to resolve conflicts with the existing WeightedNeuron definition in the Latte standard library. To train the network, first download and convert the dataset by running tutorials/mlp/data/get-data.sh. Then train by running the script julia mlp.jl. You should the following output that shows the loss values and test results:

...
INFO: 07-Apr 15:15:22 - Entering solve loop
INFO: 07-Apr 15:15:23 - Iter 20 - Loss: 1.4688001
INFO: 07-Apr 15:15:24 - Iter 40 - Loss: 0.6913204
INFO: 07-Apr 15:15:25 - Iter 60 - Loss: 0.6053091
INFO: 07-Apr 15:15:26 - Iter 80 - Loss: 0.6043377
INFO: 07-Apr 15:15:27 - Iter 100 - Loss: 0.57204634
INFO: 07-Apr 15:15:28 - Iter 120 - Loss: 0.500179
INFO: 07-Apr 15:15:28 - Iter 140 - Loss: 0.40663132
INFO: 07-Apr 15:15:29 - Iter 160 - Loss: 0.3704785
INFO: 07-Apr 15:15:29 - Iter 180 - Loss: 0.3620596
INFO: 07-Apr 15:15:30 - Iter 200 - Loss: 0.46897307
INFO: 07-Apr 15:15:30 - Iter 220 - Loss: 0.45075363
INFO: 07-Apr 15:15:31 - Iter 240 - Loss: 0.3376474
INFO: 07-Apr 15:15:31 - Iter 260 - Loss: 0.5301368
INFO: 07-Apr 15:15:32 - Iter 280 - Loss: 0.28490248
INFO: 07-Apr 15:15:32 - Iter 300 - Loss: 0.33110633
INFO: 07-Apr 15:15:33 - Iter 320 - Loss: 0.26910272
INFO: 07-Apr 15:15:33 - Iter 340 - Loss: 0.32226878
INFO: 07-Apr 15:15:33 - Iter 360 - Loss: 0.3838666
INFO: 07-Apr 15:15:34 - Iter 380 - Loss: 0.24588501
INFO: 07-Apr 15:15:34 - Iter 400 - Loss: 0.4209111
INFO: 07-Apr 15:15:35 - Iter 420 - Loss: 0.25582874
INFO: 07-Apr 15:15:35 - Iter 440 - Loss: 0.3958639
INFO: 07-Apr 15:15:36 - Iter 460 - Loss: 0.27812394
INFO: 07-Apr 15:15:36 - Iter 480 - Loss: 0.45379284
INFO: 07-Apr 15:15:37 - Iter 500 - Loss: 0.35272872
INFO: 07-Apr 15:15:38 - Iter 520 - Loss: 0.39787623
INFO: 07-Apr 15:15:38 - Iter 540 - Loss: 0.30763283
INFO: 07-Apr 15:15:39 - Iter 560 - Loss: 0.35435736
INFO: 07-Apr 15:15:40 - Iter 580 - Loss: 0.33140996
INFO: 07-Apr 15:15:41 - Iter 600 - Loss: 0.34410283
INFO: 07-Apr 15:15:41 - Epoch 1 - Testing...
INFO: 07-Apr 15:15:44 - Epoch 1 - Test Result: 90.88118%
...