# Compression of Google Neural Machine Translation Model

## Overview

Google Neural Machine Translation (GNMT) is a sequence-to-sequence (Seq2Seq) model that learns a mapping from an input text to an output text.

The example below demonstrates how to train a highly sparse GNMT model with minimal loss in accuracy. The model is based on the GNMT model presented in the paper Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation [1], and consists of approximately 210M floating-point parameters.

## GNMT Model

The GNMT architecture is an encoder-decoder architecture with attention as presented in the original paper [1].

The encoder consists of an embedding layer followed by 1 bi-directional and 3 uni-directional LSTM layers with residual connections between them. The decoder consists of an embedding layer followed by 4 uni-directional LSTM layers and a linear softmax layer. The attention mechanism connects the output of the encoder's bi-directional LSTM layer to all of the decoder's LSTM layers.
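
For orientation, the encoder stack can be sketched with tf.keras layers as follows. This is a schematic sketch only, with illustrative layer sizes, not the actual GNMTModel implementation; the decoder and attention are omitted for brevity.

```python
import tensorflow as tf

VOCAB_SIZE = 32000  # illustrative BPE vocabulary size
HIDDEN = 1024       # illustrative LSTM hidden size


def build_encoder(source_ids):
    """Embedding -> 1 bi-directional LSTM -> 3 uni-directional LSTMs.

    The first uni-directional layer consumes the concatenated
    forward/backward outputs; residual connections are added between
    the remaining uni-directional layers.
    """
    x = tf.keras.layers.Embedding(VOCAB_SIZE, HIDDEN)(source_ids)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(HIDDEN, return_sequences=True))(x)
    x = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(x)
    for _ in range(2):
        y = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(x)
        x = tf.keras.layers.Add()([x, y])  # residual connection
    return x
```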

The GNMT model was adapted from the model shown in Neural Machine Translation (seq2seq) Tutorial [2] and from its repository.

The sparse model implementation can be found in GNMTModel, which offers several options for building the GNMT model.

## Sparsity - Pruning GNMT

Sparse neural networks are networks in which a portion of the weights are zero. A high sparsity ratio can help compress the model, accelerate inference, and reduce the power consumed by memory transfers and compute.

To produce a sparse network, the weights are pruned during training by forcing them to zero. There are a number of methods for pruning neural networks; for example, the paper To prune, or not to prune: exploring the efficacy of pruning for model compression [3] presents a method for gradually pruning low-magnitude weights.
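
Concretely, [3] ramps sparsity from an initial value s_i to a final target s_f over n pruning steps with a cubic schedule. A minimal sketch, with illustrative default values:

```python
def sparsity_at_step(step, s_i=0.0, s_f=0.9, t_0=0, n=100000):
    """Target sparsity at a training step, per the cubic schedule of [3].

    Sparsity ramps from s_i at step t_0 to s_f at step t_0 + n,
    then stays constant.
    """
    if step < t_0:
        return s_i
    progress = min(1.0, (step - t_0) / float(n))
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3
```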

The example below demonstrates how to prune the GNMT model up to 90% sparsity with minimal loss in BLEU score using the TensorFlow model_pruning package, which implements the method presented in [3].
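
As a rough sketch of how the model_pruning package (TensorFlow 1.x contrib) is wired into a training graph; the hyper-parameter values here are illustrative, and the actual GNMT integration lives in GNMTModel:

```python
import tensorflow as tf
from tensorflow.contrib import model_pruning

# Wrap a weight variable with a binary pruning mask
w = tf.get_variable("w", shape=[1024, 1024])
masked_w = model_pruning.apply_mask(w)

# Build the op that gradually updates the masks during training
global_step = tf.train.get_or_create_global_step()
hparams = model_pruning.get_pruning_hparams().parse(
    "begin_pruning_step=2000,end_pruning_step=100000,"
    "pruning_frequency=1000,target_sparsity=0.9")
pruning_obj = model_pruning.Pruning(hparams, global_step=global_step)
mask_update_op = pruning_obj.conditional_mask_update_op()

# Run mask_update_op alongside the train op so low-magnitude weights
# are progressively zeroed out.
```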

## Post Training Weight Quantization

The weights of pre-trained GNMT models are usually represented in 32-bit floating-point format. The highly sparse pre-trained model below can be further compressed by uniform quantization of the weights to 8-bit integers, gaining a further compression ratio of 4x with negligible accuracy loss. The weight quantization implementation is based on the TensorFlow API. When using the model for inference, the int8 weights of the sparse and quantized model are de-quantized back to fp32.
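
A minimal numpy sketch of one common uniform quantization scheme and the corresponding de-quantization; this illustrates the idea, not the exact TensorFlow code path used here:

```python
import numpy as np

def quantize_uint8(w):
    """Uniform (affine) quantization of fp32 weights to 8-bit integers."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = max(w_max - w_min, 1e-8) / 255.0
    q = np.round((w - w_min) / scale).astype(np.uint8)
    return q, w_min, scale

def dequantize(q, w_min, scale):
    """Map the stored 8-bit weights back to fp32 before inference."""
    return q.astype(np.float32) * scale + w_min
```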

## Dataset

The models below were trained using the following datasets:

- Europarl v7 [4]
- Common Crawl Corpus
- News Commentary 11
- Development and test sets

All datasets are provided by the WMT Shared Task: Machine Translation of News.

You can use the wmt16_en_de.sh script to download and prepare the data for training and evaluating your model.

## Results & Pre-Trained Models

The following table presents some of our experiments and results. We provide pre-trained checkpoints for a 90% sparse GNMT model and for a similar model with 90% sparsity in a 2x2 block pattern. See the table below and our Model Zoo. You can use these models to Run Inference using our Pre-Trained Models and evaluate them.

| Model | Sparsity | BLEU | Non-Zero Parameters | Data Type |
|---|---|---|---|---|
| Baseline | 0% | 29.9 | ~210M | Float32 |
| Sparse | 90% | 28.4 | ~22M | Float32 |
| 2x2 Block Sparse | 90% | 27.8 | ~22M | Float32 |
| Quantized Sparse | 90% | 28.4 | ~22M | Integer8 |
| Quantized 2x2 Block Sparse | 90% | 27.6 | ~22M | Integer8 |

1. Pruning is applied to the embedding, the decoder projection layer, and all LSTM layers in both the encoder and decoder.
2. BLEU score is measured on the newstest2015 test set provided by the Shared Task.
3. The accuracy of the quantized models was measured by converting the 8-bit weights back to floating point during inference.

## Running Modalities

Below are simple examples for training a 90% sparse GNMT model, running inference using a pre-trained/trained model, quantizing a model to 8-bit integers, and running inference using a quantized model. Before inference, the int8 weights of the sparse and quantized model are de-quantized back to fp32.

### Training

Train a German to English GNMT model with 90% sparsity using the WMT16 dataset:

```
# Download the dataset
wmt16_en_de.sh /tmp/wmt16_en_de

# Go to examples directory
cd <nlp_architect root>/examples

# Train the sparse GNMT
python -m sparse_gnmt.nmt \
    --src=de --tgt=en \
    --hparams_path=sparse_gnmt/standard_hparams/sparse_wmt16_gnmt_4_layer.json \
    --out_dir=<output directory> \
    --vocab_prefix=/tmp/wmt16_en_de/vocab.bpe.32000 \
    --train_prefix=/tmp/wmt16_en_de/train.tok.clean.bpe.32000 \
    --dev_prefix=/tmp/wmt16_en_de/newstest2013.tok.bpe.32000 \
    --test_prefix=/tmp/wmt16_en_de/newstest2015.tok.bpe.32000
```

- Train using GPUs by adding --num_gpus=<n>
- Model configuration JSON files are found in the examples/sparse_gnmt/standard_hparams directory.
- The sparsity policy can be re-configured by changing the parameters given in --pruning_hparams, e.g. change target_sparsity=0.7 to train a 70% sparse GNMT (see the example after this list).
- All pruning hyper-parameters are listed in model_pruning.
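
For example, a hypothetical 70% sparsity run (the hyper-parameter names follow the model_pruning package; step values are illustrative):

```
python -m sparse_gnmt.nmt \
    <other flags as in the training command above> \
    --pruning_hparams=target_sparsity=0.7,begin_pruning_step=2000,end_pruning_step=100000
```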

While training, TensorFlow checkpoints, TensorBoard events, the hyper-parameters used, and log files are saved in the given output directory.

### Inference

Run inference using a trained model:

```
# Go to examples directory
cd <nlp_architect root>/examples

# Run Inference
python -m sparse_gnmt.nmt \
    --src=de --tgt=en \
    --hparams_path=sparse_gnmt/standard_hparams/sparse_wmt16_gnmt_4_layer.json \
    --ckpt=<path to a trained checkpoint> \
    --vocab_prefix=/tmp/wmt16_en_de/vocab.bpe.32000 \
    --out_dir=<output directory> \
    --inference_input_file=<file with lines in the source language> \
    --inference_output_file=<target file to place translations>
```

- Measure performance and BLEU score against a reference file by adding --inference_ref_file=<reference file in the target language>, as shown in the example after this list
- Run inference using GPUs by adding --num_gpus=<n>
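
For example, scoring against the newstest2015 reference; the .en suffix on the reference path is an assumption about how the data-preparation script names the target-side file:

```
python -m sparse_gnmt.nmt \
    <other flags as in the inference command above> \
    --inference_input_file=/tmp/wmt16_en_de/newstest2015.tok.bpe.32000.de \
    --inference_ref_file=/tmp/wmt16_en_de/newstest2015.tok.bpe.32000.en
```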

#### Run Inference using our Pre-Trained Models

Run inference using our pre-trained models:

```
# Download pre-trained model zip file, e.g. gnmt_sparse.zip
wget https://d2zs9tzlek599f.cloudfront.net/models/sparse_gnmt/gnmt_sparse.zip

# Unzip checkpoint + vocabulary files
unzip gnmt_sparse.zip -d /tmp/gnmt_sparse_checkpoint

# Go to examples directory
cd <nlp_architect root>/examples

# Run Inference
python -m sparse_gnmt.nmt \
    --src=de --tgt=en \
    --hparams_path=sparse_gnmt/standard_hparams/sparse_wmt16_gnmt_4_layer.json \
    --ckpt=/tmp/gnmt_sparse_checkpoint/gnmt_sparse.ckpt \
    --vocab_prefix=/tmp/gnmt_sparse_checkpoint/vocab.bpe.32000 \
    --out_dir=<output directory> \
    --inference_input_file=<file with lines in the source language> \
    --inference_output_file=<target file to place translations>
```


Important note: when using our pre-trained models, use the vocabulary files provided with the checkpoint.

#### Quantized Inference

Add the following flags to the inference command line in order to quantize a pre-trained model and run inference with the quantized model:

- --quantize_ckpt=true: Produce a quantized checkpoint. The checkpoint will be saved in the output directory, and inference will run using the produced checkpoint.
- --from_quantized_ckpt=true: Run inference using an already quantized checkpoint (see the examples after this list).
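
For example, with all other flags identical to the inference commands above:

```
# First run: quantize the checkpoint (saved to the output directory)
# and run inference with it
python -m sparse_gnmt.nmt <inference flags as above> --quantize_ckpt=true

# Subsequent runs: point --ckpt at the quantized checkpoint and load it directly
python -m sparse_gnmt.nmt <inference flags as above> --from_quantized_ckpt=true
```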

### Custom Training/Inference Parameters

All customizable parameters can be obtained by running python -m sparse_gnmt.nmt -h from the examples directory.