Model Compression to Gain Speed and Reduce your Cloud bill

Héloïse Baudhuin
Dec 1, 2022

It’s a beautiful Thursday afternoon…

You, an incredibly talented ML engineer, are sipping your coffee, a cappuccino with almond milk and a dash of cocoa. Your model is done, the client is happy with its performance, it is neatly packaged in a Docker image, and you sent it 10 minutes ago to the MLOps engineer who will put it in production. You stare wistfully through the window, thinking of all the Machine Learning mysteries you will solve next.

Suddenly a blood-curdling scream comes from behind you, making you drop your coffee mug to the ground. You spin around, hot coffee splattered on your trousers, and there is your MLOps engineer, who let out the scream. “YOUR MODEL WEIGHS 10 GB AND THE RAM CONSUMPTION… THE RAM CONSUMPTION!! AARGL”. He is almost foaming at the mouth. You back against the window. He seems incapable of wrapping his head around that RAM consumption. In all honesty, you didn’t even check. It’s too much for him and he runs out of the office, sobbing “Why do the ML engineers do this to me! It costs so much in production, they never listen, bouhouhou”.

Your boss turns around. “It’s the third one this month, do something about the size of your models!” Fine. Maybe it's time to do a bit of model compression.

So you want to compress your model…

There is a multitude of model compression techniques available. The most widely known are parameter pruning and quantization. Both can reduce the size of a model while preserving its performance, and quantization in particular can also reduce latency.

Parameter pruning

The idea behind parameter pruning is that networks are over-parameterized and extract redundant information. We can set to zero, or completely remove, the weights that are not important to the decision-making process of the model. There are two types of pruning: structured and unstructured.

In unstructured pruning, you remove individual weights. Traditionally, you set to zero the weights that are very small relative to the others in that layer, for example all weights below the xth percentile of magnitudes in that layer. You can reasonably assume that weights this small have very little influence on the network. Such pruning managed to downsize a VGG-16 from 138M to just 10.3M parameters with minimal loss of accuracy.
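To make the percentile idea concrete, here is a minimal sketch in plain Python. It is an illustration only: the function name `prune_by_magnitude`, the flat list of weights and the 50% sparsity are assumptions made for the example, not what a particular library does internally.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices of the weights, sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    # The layer keeps its shape; pruned positions simply hold 0.0.
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# A hypothetical layer flattened to ten weights.
layer = [0.8, -0.01, 0.3, 0.002, -0.6, 0.05, -0.09, 1.2, 0.004, -0.4]
pruned = prune_by_magnitude(layer, sparsity=0.5)
print(pruned)
```

Note that the list has the same length before and after: unstructured pruning produces a sparse layer, not a smaller one.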

In structured pruning, you remove entire blocks of weights instead of individual ones. You can remove filters or channels, for example. Contrary to unstructured pruning, this doesn't result in sparse weight matrices: the layers simply become smaller and stay dense.
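As a rough sketch of structured pruning, the snippet below drops whole filters, ranking them by their L1 norm. The ranking criterion, the helper name `prune_filters` and the toy filters are assumptions chosen for illustration; other criteria exist.

```python
def prune_filters(filters, keep_ratio):
    """Keep the `keep_ratio` fraction of filters with the largest L1 norm."""
    n_keep = max(1, round(len(filters) * keep_ratio))
    norms = [sum(abs(w) for w in f) for f in filters]
    # Rank filter indices by norm, largest first, then restore original order.
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [filters[i] for i in kept], kept


# Four hypothetical filters, each flattened to three weights.
filters = [
    [0.9, -0.7, 0.5],
    [0.01, 0.02, -0.01],
    [-0.4, 0.6, 0.3],
    [0.03, -0.02, 0.0],
]
smaller, kept_indices = prune_filters(filters, keep_ratio=0.5)
print(kept_indices)
print(smaller)
```

The result is a smaller dense layer. In a real network, the next layer must also drop the input channels that correspond to the removed filters.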

Pruning is an easy and straightforward way to lighten your model, but keep a few points in mind. It is unclear how different pruning strategies generalize across different architectures. You will have to fine-tune your model to maintain its performance. And it is more effective to use an efficient architecture than to prune a suboptimal one.

Quantization


Quantization decreases the size of the weights of your model. Initially, your weights are encoded as 32-bit floats. With quantization, you can store them as 16-bit floats or even 8-bit integers. As you can guess, halving the size of your weights halves the size of your model, and so on. Even better, integer computation is usually faster than float32 computation, which means you can reduce latency and power consumption.
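The arithmetic can be checked in a few lines. This is a back-of-the-envelope sketch for a hypothetical model with 10 million parameters (not the model used later in this post), and it only counts the weights themselves.

```python
BITS_PER_WEIGHT = {"float32": 32, "float16": 16, "int8": 8}


def weights_size_mb(n_params, dtype):
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return n_params * BITS_PER_WEIGHT[dtype] / 8 / 1e6


n_params = 10_000_000  # hypothetical model, chosen only for round numbers
sizes = {dtype: weights_size_mb(n_params, dtype) for dtype in BITS_PER_WEIGHT}
for dtype, size in sizes.items():
    print(f"{dtype}: {size:.0f} MB")
```

A saved model file is typically somewhat larger than this estimate, since it can also contain the architecture, metadata and sometimes the optimizer state.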

You can perform quantization at two moments:

  • During training: with quantization-aware training (also called training-aware quantization), the network uses quantized weights throughout training. The errors due to quantization are mixed with the “normal” error of the model and the optimizer minimizes both at the same time. This approach generally preserves performance better.
  • After training: post-training quantization is performed on the frozen weights of the final model. Quantizing to float16 can usually be done with almost no performance loss. Integer quantization is not advised without a bit of fine-tuning.
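The difference between float16 and integer quantization of frozen weights can be sketched in plain Python. The example below assumes a simple symmetric 8-bit scheme with one scale per tensor (one possible choice among several) and uses the standard library's half-precision `struct` format to emulate float16; the helper names and the toy weights are made up for the illustration.

```python
import struct


def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


def to_float16(weights):
    """Round-trip each value through IEEE 754 half precision ('e' format)."""
    return [struct.unpack("<e", struct.pack("<e", w))[0] for w in weights]


# Hypothetical frozen weights of a trained layer.
weights = [-1.0, -0.42, 0.0, 0.13, 0.57, 0.99]

q, scale = quantize_int8(weights)
int8_error = max(abs(w - r) for w, r in zip(weights, dequantize_int8(q, scale)))
fp16_error = max(abs(w - r) for w, r in zip(weights, to_float16(weights)))

print("int8 values:", q)
print("max int8 error:", int8_error)
print("max float16 error:", fp16_error)
```

On this toy tensor the float16 rounding error is much smaller than the int8 one, which is consistent with float16 being the safer post-training option.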

Keep in mind that not all hardware can handle every level of precision. Accelerators such as Coral Edge TPUs are typically integer-only, while most CPUs will convert float16 models back to float32 to do the computation.


For experimentation purposes, we start with a small UNet, roughly half a million parameters big. In “.hdf5” format, it weighs 17 MB and takes up to 550 MB of RAM for one inference on my laptop’s CPU.

For weight pruning and quantization, TensorFlow has a library called the TensorFlow Model Optimization Toolkit, which contains all the necessary functions and easy-to-understand tutorials on the steps to follow.

I start with my pre-trained UNet, which has a loss (Focal-Tversky) of 0.17 on my test set. I apply sparsity levels ranging from 10% to 80% and, for each, fine-tune the model for 20 epochs. Each resulting model has a loss nearly identical to that of the original. The 80% sparse model is 2x smaller.
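One thing worth keeping in mind: zeroed weights still occupy 4 bytes each when stored as float32, so a sparse model usually only shrinks on disk once the file is compressed. The sketch below illustrates this with random stand-in weights and `zlib`. It is an assumption about where the size gain comes from, not a description of how the 2x above was measured.

```python
import random
import struct
import zlib

random.seed(0)
n = 50_000
# Random values standing in for the weights of a trained layer.
dense = [random.gauss(0.0, 0.1) for _ in range(n)]

# Zero the 80% of weights with the smallest magnitude.
threshold = sorted(abs(w) for w in dense)[int(0.8 * n)]
sparse = [w if abs(w) >= threshold else 0.0 for w in dense]


def sizes_in_bytes(weights):
    """Return (raw float32 size, zlib-compressed size) in bytes."""
    raw = struct.pack(f"<{len(weights)}f", *weights)
    return len(raw), len(zlib.compress(raw, 9))


dense_raw, dense_zip = sizes_in_bytes(dense)
sparse_raw, sparse_zip = sizes_in_bytes(sparse)

print("raw bytes:       ", dense_raw, "vs", sparse_raw)
print("compressed bytes:", dense_zip, "vs", sparse_zip)
```

The raw sizes are identical, while the compressed sparse weights are smaller, because long runs of zeros compress well.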

Hopefully, quantization will reduce my 80% sparse model further. The TensorFlow toolkit only allows quantization to uint8. Moreover, not all layers are supported for quantization (Conv1D, for example), which means that between quantized and non-quantized layers, the weights need to be "dequantized". This will hamper inference speed.

Once again, I fine-tune my newly quantized model for 20 epochs and reach a comparable loss. My 80% sparse model quantized to uint8 now weighs only about 3 MB on disk, roughly a 6x reduction in size for identical performance. However, it turns out that the inference time tripled after quantization! There are multiple hypotheses as to why quantization to uint8 slows down some models on CPU. The first is that not all CPUs can take advantage of uint8 computation, so they convert to float32 anyway. The second is that the inputs and outputs of my quantized layers have to be quantized and dequantized, which takes some time.

Quantization to uint8 being a dead end, I turn to TensorFlow Lite, with which I can quantize my whole model to float16. I won’t gain latency, but my model should be twice as small. Indeed, after quantizing my sparse model, it is now 4x smaller than the original on disk. At inference time, it needs 200 MB less RAM than the original 550 MB, a reduction of roughly a third.

Overall, I ended up with a smaller and less memory-hungry model.


Model compression is a critical part of machine learning, and even more so of deep learning. Memory in production settings is expensive and not always available at every scale. This blog post presented quantization and weight pruning as two common techniques that can shrink your model without sacrificing performance.

In my experiment, I applied both techniques with simple configurations and obtained promising results. Of course, there are many other configurations to try and play with. As with most things in Machine Learning, there is no one-size-fits-all solution.