anwaar-khalid - Compression

Model Compression: A Survey of Techniques

2024-08-10T00:00:00+05:30

Machine Learning (ML) has witnessed a surge in interest in recent years, driven by the availability of large-scale datasets, advances in ML frameworks such as PyTorch and TensorFlow, the rise of hardware accelerators (e.g., GPUs and TPUs) that enable large-scale training, and the development of increasingly powerful neural network architectures. Together, these factors have enabled the creation of highly capable models with a wide range of applications.

Scaling laws have demonstrated that model performance improves predictably as model size, dataset size, and compute increase. This has led companies to train ever larger models on massive datasets, which requires vast computational resources. As a consequence, the cost of training models is expected to rise from $100 million to $500 million by 2030. At the same time, the International Energy Agency projects that the energy requirement for AI could double over the same period.

Beyond training costs, deployment introduces another critical constraint for these models: user experience. Models must operate with low latency and high throughput to be practical in real-world applications. Therefore, it has become increasingly important to develop techniques that optimize these models and make them faster, greener and more cost-efficient.

Enter Model Compression.

Model compression refers to a set of algorithms that aim to reduce the size, memory and power requirements of neural networks without significantly impacting their accuracy. These include pruning redundant connections in the model, representing the parameters through lower precision datatypes, exploiting the structure of data to represent the layers more efficiently, or training smaller and more efficient models.

The goal of this blogpost is to provide a high-level overview of these techniques.

Quantization: Making the most of approximations¶

What is Quantization?¶

Modern models are typically trained using mixed precision, i.e., some operations in float16 or bfloat16, and others in float32. As a rule of thumb, a model with N billion parameters requires roughly 2 × N GB of memory when stored in 16 bits. For instance, a 3B parameter model needs about 6 GB of memory, a 7B model needs 14 GB, and so on.
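The rule of thumb is plain arithmetic: bytes per parameter times the number of parameters. Here is a minimal sketch, where the helper name `weight_memory_gb` is ours, 1 GB is taken as 10^9 bytes, and only the weights are counted (activations and runtime overhead are ignored):

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return params_billions * 1e9 * bytes_per_param / 1e9


for size in (3, 7):
    print(f"{size}B params: {weight_memory_gb(size, 16):.0f} GB at 16 bits, "
          f"{weight_memory_gb(size, 8):.0f} GB at 8 bits")
```

The same helper shows why lower bit widths matter: every halving of `bits_per_param` halves the storage needed for the weights.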

Quantization reduces this footprint by representing weights and activations using lower-precision data types such as int8, which halves memory requirements relative to 16-bit storage. Beyond memory savings, prior work (e.g., Horowitz, 2014) has shown that integer arithmetic is generally faster and more energy-efficient than floating-point computation.

Quantization Steps: Source: A White Paper On Neural Network Quantization

Quantization is therefore a key model compression technique for efficient deployment of neural networks on mobile phones and other resource-constrained edge devices.

What are the different types of quantization algorithms?¶

Quantization can be categorized along several axes, depending on when it is applied, how parameters are represented, and how computations are performed.

Based on when it is applied: During Training or Inference?¶

Quantization-aware training (QAT) simulates inference-time quantization during training: the forward pass uses quantized weights and activations, while the backward pass updates the full-precision weights. Because the forward pass reproduces the quantization error the model will see at inference time, the full-precision weights gradually become more robust to it. This approach generally preserves accuracy better, as the model learns to adapt to the precision loss during training.

Quantization Aware Training. Source: Quantization and Deployment of Deep Neural Networks on Microcontrollers
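To make the forward/backward split concrete, here is a toy sketch of QAT on a single weight. It assumes a straight-through estimator (the gradient computed for the quantized weight is applied unchanged to the full-precision weight), which is a common choice but not the only one. The function names, the coarse scale of 0.25 and the data are all illustrative.

```python
def fake_quantize(w: float, scale: float, qmin: int = -128, qmax: int = 127) -> float:
    """Quantize to an integer grid and immediately dequantize ("fake" quantization)."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train_qat(xs, ys, scale=0.25, lr=0.05, steps=200):
    w = 0.0  # full-precision "shadow" weight that the optimizer updates
    for _ in range(steps):
        w_q = fake_quantize(w, scale)  # the forward pass sees the quantized weight
        # Gradient of the mean squared error with respect to w_q.
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: treat d(w_q)/d(w) as 1 and update w directly.
        w -= lr * grad
    return w, fake_quantize(w, scale)


xs = [0.5, 1.0, 1.5, 2.0]
ys = [0.6 * x for x in xs]  # data generated by a "true" weight of 0.6
w_fp, w_q = train_qat(xs, ys)
print(f"full-precision weight: {w_fp:.3f}, quantized weight used at inference: {w_q}")
```

With such a coarse grid, the true weight 0.6 is not representable, so the quantized weight settles on one of the two neighbouring grid points (0.5 or 0.75) while the full-precision weight hovers near the boundary between them.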

Unlike quantization aware training, where the model learns to adjust to lower precision during training, post-training quantization (PTQ) directly applies this reduction in bit width after the model has been trained. While it may not offer the same level of accuracy preservation as quantization aware training, it is widely used in practice since it doesn’t require retraining the model.

Post Training Quantization. Source: Practical Quantization in PyTorch

Based on how quantization parameters are applied:¶

Per-tensor quantization: a single scale and offset are used for the entire tensor.
Per-channel quantization: each channel has its own scale and offset.
Block quantization: groups of elements share quantization parameters.

In general, finer-grained schemes (more parameters) yield better accuracy but introduce additional storage and compute overhead.
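The accuracy side of this trade-off is easy to see on a tensor whose channels have very different magnitudes. The sketch below uses symmetric 8-bit quantization and made-up weights; the helper names are ours.

```python
def quantize_dequantize(values, num_bits=8):
    """Symmetric round trip of a list of floats using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / qmax
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Two output channels with very different magnitudes (illustrative values).
weights = [
    [0.010, -0.020, 0.015, 0.005],
    [1.500, -2.000, 0.750, 1.250],
]

flat = [v for row in weights for v in row]
per_tensor = quantize_dequantize(flat)  # one scale for the whole tensor
per_channel = [v for row in weights for v in quantize_dequantize(row)]  # one scale per row

err_tensor = mean_abs_error(flat, per_tensor)
err_channel = mean_abs_error(flat, per_channel)
print(f"per-tensor error:  {err_tensor:.6f}")
print(f"per-channel error: {err_channel:.6f}")
```

With a single scale, the small channel is squeezed into only a couple of integer levels. Giving it its own scale removes most of that error, at the cost of storing one extra scale per channel.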

Based on how activations are quantized:¶

Since weights are known before inference, they can be quantized offline. Activations, however, depend on the input to the model, and there are two ways of quantizing them.

In static quantization, the min and max ranges of activations are computed ahead of time using a small subset of the training data called a calibration dataset, which typically contains around 300-500 samples. This technique is typically used when both memory bandwidth and compute savings are important, with CNNs being a typical use case.

In dynamic quantization, the min-max ranges are calculated on the fly during inference. This is useful in situations where the model execution time is dominated by loading weights from memory rather than computing the matrix multiplications. For example, this is true for LSTM and Transformer type models with small batch size.
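The difference between the two modes comes down to where the activation range comes from. Below is a minimal sketch using symmetric int8 and max-abs calibration; both are simplifying assumptions, and the numbers are made up.

```python
def scale_from(values, qmax=127):
    """Symmetric int8 scale derived from the largest absolute value seen."""
    return (max(abs(v) for v in values) or 1.0) / qmax


def quantize_dequantize(values, scale, qmax=127):
    """Round to the int8 grid, clip to [-qmax, qmax], and map back to floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


calibration_set = [[-1.0, 0.2, 0.8], [-0.5, 0.4, 1.0]]
# Static: the range is fixed once, offline, from the calibration samples.
static_scale = scale_from([v for sample in calibration_set for v in sample])

new_activations = [-0.3, 0.9, 3.0]  # 3.0 lies outside the calibrated range
static_out = quantize_dequantize(new_activations, static_scale)
# Dynamic: the range is recomputed from each incoming tensor at inference time.
dynamic_out = quantize_dequantize(new_activations, scale_from(new_activations))

print("static: ", [round(v, 3) for v in static_out])
print("dynamic:", [round(v, 3) for v in dynamic_out])
```

The static path clips the outlier to the calibrated maximum but needs no range computation at inference time. The dynamic path represents the outlier at the price of computing the range on every call and of a coarser grid for the small values.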

Based on quantization mapping:¶

Affine quantization combines a scale (S) with a non-zero shift (zero point Z), which allows a wide range of values to be represented flexibly by adjusting both magnitude and position before rounding.

Symmetric quantization is a special case of affine quantization in which values are mapped to a symmetric range, i.e., [-a, a]. In this case, the integer space is usually [-127, 127], meaning that -128 is dropped from the regular [-128, 127] signed int8 range. Making both ranges symmetric allows Z = 0. While one of the 256 representable values is lost, this can provide a speedup since many additional operations can be skipped.
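Both mappings can be written as q = clamp(round(x / S) + Z) and x ≈ S · (q − Z), with Z fixed at 0 in the symmetric case. The sketch below compares them on a one-sided range of the kind produced by ReLU-style activations. The function names and input values are illustrative, and min-max range estimation is an assumption.

```python
def affine_quantize(values, qmin=-128, qmax=127):
    """Affine mapping: real range [min, max] -> [qmin, qmax] via scale S and zero point Z."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # keep 0.0 representable
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def symmetric_quantize(values, qmax=127):
    """Symmetric mapping: [-a, a] -> [-qmax, qmax] with Z fixed at 0."""
    scale = (max(abs(v) for v in values) or 1.0) / qmax
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale, 0


def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]


activations = [0.0, 0.37, 1.42, 2.9, 4.4, 6.0]
for name, fn in (("affine", affine_quantize), ("symmetric", symmetric_quantize)):
    q, s, z = fn(activations)
    restored = dequantize(q, s, z)
    worst = max(abs(a - b) for a, b in zip(activations, restored))
    print(f"{name:9s} S={s:.4f} Z={z:4d} q={q} max error={worst:.4f}")
```

On this one-sided input the affine mapping uses all 256 levels, so its step size is roughly half that of the symmetric mapping, which leaves the negative half of the integer range unused. In exchange, the symmetric mapping needs no zero-point arithmetic.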

Pruning: Keeping what matters¶

What is Pruning?¶

Pruning is a model compression technique that involves removing redundant or less important weights from a neural network to reduce model size and compute while preserving accuracy. Typically, weights are pruned based on criteria such as magnitude or their estimated impact on the model’s output.

A neural network structure, before and after pruning. Source: Learning both Weights and Connections for Efficient Neural Networks

What are the different types of pruning algorithms?¶

Pruning can also be categorized along several dimensions:

Based on how weights are removed¶

Unstructured pruning: removes individual weights without constraints, resulting in sparse weight matrices. While this can achieve high compression, it often requires specialized hardware or libraries to realize speedups.
Structured pruning: removes entire structures such as neurons, channels, or filters. This produces dense, smaller models that are more compatible with standard hardware and typically lead to practical speedups.
Semi-structured pruning: imposes regular sparsity, i.e., N:M patterns. For example, with 2:4 sparsity, 2 out of every group of 4 weights are zeroed out. It’s a middle ground between structured and unstructured pruning and provides real speedups (e.g., on NVIDIA sparse tensor cores).
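The difference between unstructured and N:M sparsity shows up clearly on a single row of weights. The sketch below uses weight magnitude as the pruning criterion, which is one common choice; the function names and values are illustrative.

```python
def prune_unstructured(weights, sparsity):
    """Zero the smallest-magnitude fraction of weights, wherever they are."""
    k = int(len(weights) * sparsity)
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def prune_n_m(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in every consecutive group of m."""
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        order = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.6, 0.5, 0.7, 0.02, 0.01, -0.3, 0.05]
print("unstructured 50%:", prune_unstructured(row, 0.5))
print("2:4 sparsity:    ", prune_n_m(row))
```

Both results are 50% sparse, but the unstructured version is free to empty the second group entirely, while the 2:4 version must keep two weights in every group of four. That regularity is what lets hardware skip the zeros efficiently.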

Based on how pruning is performed¶

One-shot pruning: removes weights in a single step. It is simple but can lead to significant accuracy degradation.
Iterative pruning: gradually removes weights over multiple steps, often interleaved with fine-tuning. This approach generally preserves accuracy better.

Based on scope¶

Local pruning: applies pruning independently within each layer.
Global pruning: selects weights to prune across the entire network based on a global criterion, often leading to better overall allocation of sparsity.
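Here is a small sketch of how the two scopes allocate sparsity differently. It assumes magnitude pruning at 50% sparsity and ignores ties; the layer names and weights are made up.

```python
def magnitude_threshold(values, sparsity):
    """Magnitude below which weights are pruned to reach the requested sparsity."""
    ranked = sorted(abs(v) for v in values)
    k = int(len(ranked) * sparsity)
    return ranked[k] if k < len(ranked) else float("inf")


def prune(values, threshold):
    return [v if abs(v) >= threshold else 0.0 for v in values]


layers = {
    "layer_a": [0.9, -0.8, 0.7, 0.3],     # mostly large weights
    "layer_b": [0.05, -0.02, 0.4, 0.01],  # mostly small weights
}
sparsity = 0.5

# Local: every layer gets its own threshold and ends up 50% sparse.
locally_pruned = {name: prune(w, magnitude_threshold(w, sparsity)) for name, w in layers.items()}

# Global: one threshold for the whole network; sparsity per layer can differ.
all_weights = [v for w in layers.values() for v in w]
global_threshold = magnitude_threshold(all_weights, sparsity)
globally_pruned = {name: prune(w, global_threshold) for name, w in layers.items()}

print("local: ", locally_pruned)
print("global:", globally_pruned)
```

The global criterion keeps three of the four large weights in `layer_a` and only one weight in `layer_b`, whereas local pruning is forced to discard the large weight 0.7 just to meet the per-layer quota.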

Tensorization / Factorization: Higher ranking doesn’t mean better¶

What is Tensorization / Factorization?¶

Tensorization (or factorization) decomposes weight tensors into lower-rank components, reducing parameter count while preserving the most important structure in the data.

Visual representation of the tensorization process applied to image data. Source: Tensor Contraction and Regression Networks

What are the different types of tensorization/factorization algorithms?¶

There are four popular tensor decomposition algorithms which are used extensively:

Singular Value Decomposition: Decomposes a matrix A into UΣV⊤, where a low-rank approximation retains only the top singular values. This provides the optimal rank-r approximation of the matrix (Eckart-Young theorem). It is directly applicable to 2D weights (e.g., linear layers); higher-dimensional tensors must be reshaped.
Canonical Polyadic Decomposition (CP or PARAFAC): Factorizes a tensor into a sum of rank-1 tensors. Each component captures a separable pattern across all modes, enabling direct decomposition of multi-dimensional weights without flattening.
Tensor Train Decomposition: Represents a high-dimensional tensor as a sequence of smaller, interconnected tensors (“cores”). The number of parameters in this factorization scales linearly with the number of dimensions, making it efficient for very large tensors.
Tucker Decomposition: Decomposes a tensor into a smaller core tensor multiplied by factor matrices along each mode. It can be seen as a higher-order generalization of SVD, allowing different ranks per dimension.

Visual representation of CP & Tucker Decomposition. Source: Research Gate
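To get a feel for the savings, it helps to count the parameters each factorization stores. The sketch below uses the standard parameter-count formulas for each format. The kernel shape and the ranks are arbitrary illustrative choices, not recommendations.

```python
from math import prod


def dense_params(shape):
    return prod(shape)


def svd_params(rows, cols, rank):
    """U (rows x rank) and V (cols x rank); singular values folded into U."""
    return rank * (rows + cols)


def cp_params(shape, rank):
    """One (dim x rank) factor matrix per mode."""
    return rank * sum(shape)


def tucker_params(shape, ranks):
    """Core tensor plus one (dim x rank) factor matrix per mode."""
    return prod(ranks) + sum(d * r for d, r in zip(shape, ranks))


def tensor_train_params(shape, ranks):
    """Cores of shape (r[k-1], dim[k], r[k]) with boundary ranks equal to 1."""
    bonds = [1, *ranks, 1]
    return sum(bonds[k] * d * bonds[k + 1] for k, d in enumerate(shape))


conv_kernel = (256, 256, 3, 3)  # (out_channels, in_channels, height, width)
print("dense:       ", dense_params(conv_kernel))
print("SVD (r=32):  ", svd_params(256, 256 * 3 * 3, 32))  # kernel reshaped to 2D
print("CP (R=32):   ", cp_params(conv_kernel, 32))
print("Tucker:      ", tucker_params(conv_kernel, (64, 64, 3, 3)))
print("Tensor Train:", tensor_train_params(conv_kernel, (16, 16, 3)))
```

Parameter count is only half of the picture: how much accuracy survives at a given rank depends on the layer, which is why rank selection is usually the hard part in practice.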

How is Tensorization used in practice?¶

TensorLy is one of the most popular libraries for working with tensor decompositions. It works with various frameworks such as NumPy, PyTorch and TensorFlow, and provides an off-the-shelf API for all the algorithms we discussed. Extending the core features of TensorLy, TensorLy-Torch provides out-of-the-box tensor layers that can be used to implement and train tensorized networks from scratch, or to fine-tune existing models by replacing their layers with tensorized counterparts. Some examples are:

Factorized Convolutions: which decompose the convolution filter into two or more smaller filters, reducing parameter count and compute.
Tensorized Linear Layers: where the 2D weight matrix of a linear layer is first tensorized (reshaped into a higher dimensional tensor) and then factored using a high-dimensional decomposition / tensorization method.
Factorized Embedding Layers: which can act as a drop-in replacement for PyTorch’s embeddings, but use an efficient tensor parametrization that doesn’t need to reconstruct the table.
Tensor Regression Layers: eliminate fully connected layers by regressing directly on convolutional activations using low-rank tensor factorization.

Knowledge Distillation: Why reinvent the wheel?¶

What is Knowledge Distillation?¶

Knowledge distillation involves transferring knowledge from a large, complex model (the teacher network) to a smaller, simpler neural network (the student model). The goal is to leverage the knowledge learned by the large model to train a small model that mimics its behavior.

At a high-level, knowledge distillation involves two main steps:

Training the teacher model: Involves training the teacher network to a desired level of accuracy
Training the student model: Using the outputs of the teacher network as soft targets, i.e., probability distributions over the classes instead of hard labels

An illustration of the knowledge distillation process. Source: Knowledge Distillation: A Survey
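A common way to turn soft targets into a training objective is the temperature-scaled loss from Hinton et al. (2015). It combines a KL term between softened teacher and student distributions with the usual cross-entropy on the hard label. The sketch below assumes that formulation; the temperature, the mixing weight `alpha` and the logits are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """alpha * soft-target loss + (1 - alpha) * hard-label cross-entropy."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(t * math.log(t / s) for t, s in zip(soft_teacher, soft_student) if t > 0)
    hard = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps soft-target gradients on a scale comparable to the hard loss.
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard


teacher = [4.0, 1.0, 0.5]
print("student agrees with teacher:   ", round(distillation_loss([4.0, 1.0, 0.5], teacher, label=0), 4))
print("student disagrees with teacher:", round(distillation_loss([0.5, 1.0, 4.0], teacher, label=0), 4))
```

Raising the temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to the wrong classes.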

What are the different types of knowledge distillation algorithms?¶

Knowledge distillation techniques generally fall into one of the following categories:

Offline Distillation: is the most common distillation technique and involves using a trained model (teacher) to guide another smaller model (student).
Online Distillation: is used when a trained teacher model isn’t available. In online distillation, both teacher and student models are updated simultaneously in a single end-to-end training process.
Self Distillation: is a special case of online distillation and involves using the same model as the teacher as well as the student. In self-distillation, knowledge from deeper layers of the network is used to guide the shallow layers.

Conclusion¶

Models may keep scaling, but so do the bills. Model compression is how we push back. Hope this gave you a useful overview. Thank you for reading!

Case Study: Compressing DeepMind’s RepNet for Edge Deployment

2024-06-10T00:00:00+05:30

In this case study, we explore compressing neural networks for efficient deployment on edge devices with limited resources, using practical techniques like quantization, pruning, and tensorization with off-the-shelf open-source tools.

Our aim is to illustrate a typical model compression workflow, highlighting the approaches and techniques used to analyse a model from a compression perspective, and how to select and apply the appropriate compression techniques expected to yield the best results. We begin our experiment by understanding the target model and how it behaves, before diving into sequentially applying several state-of-the-art techniques and explaining the thought process at each step.

Join us as we go through this process to learn useful insights and optimization strategies to use when deploying models on mobile and edge devices!

NB: If you’re interested in trying out these techniques, check out our demo notebook here which you can use to follow along. It provides all the code used to reproduce and experiment with the results presented here.

Understanding the Model¶

For this experiment, we’ll attempt to compress Google DeepMind’s RepNet model. RepNet is a model that predicts the count of repeating actions in a video. Because the model is class agnostic, it can be used to detect various repetitive actions in different contexts.

We chose RepNet for this demonstration because the base model performance is not suitable for running on mobile (as we’ll see below) making it a good candidate for compression. The architecture includes a diverse set of layers that we use to illustrate the use of different compression techniques.

Figure 2. RepNet Model Architecture. Source RepNet.

RepNet is composed of three fundamental building blocks, namely:

An encoder layer that generates per-frame embeddings from the input videos.
A temporal self-similarity layer that computes all pairwise similarities (distances) between the calculated frame embeddings.
A period predictor block that outputs the (a) per-frame binary periodicity classification (whether the frame relates to a repeated action or not) and (b) per-frame period length estimation (how many times the action was carried until the frame).

Building Intuition for Compression¶

Profiling¶

Instead of blindly applying model compression techniques to the model, it is common practice to first identify the latency and memory bottlenecks within the network so that compression can be targeted. This involves timing inference on each major block of the architecture and counting each block’s trainable parameters.

To measure this, we run inference on a sample of inputs while separately timing each submodule and counting its trainable parameters. This allows us to track how the overall inference runtime and parameter count evolve from one layer to the next, and to assess the contribution of each layer to the total inference time and model weight. We start with a warm-up of 10 initial runs and follow up with 32 timed runs, averaging the results to reduce noise. Using PyTorch on a Tesla V100-PCIE-16GB GPU, the script focuses on the encoder, temporal convolutional layers, TSM convolutional layer, period length head, and periodicity head, as the main building blocks of the architecture.

To ensure accuracy, we explicitly define the input sizes for these components based on the expected sequence of input transformation events during inference. For example, the encoder begins with an input size of 64 which is subsequently transformed to an input size of 1 at the temporal convolution layer, and so on. Consequently, we assume an input size of 1 for the temporal convolution layer and 64 for the encoder, precisely aligning with the actual input transformation process.
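The timing procedure described above can be sketched in a framework-agnostic way. The helper names and the stand-in workloads below are ours, not the profiling script used in this case study. When timing real PyTorch modules on a GPU, the device also has to be synchronised before reading the clock, because kernels are launched asynchronously.

```python
import time
from statistics import mean


def time_block(fn, *args, warmup=10, runs=32):
    """Average wall-clock latency of fn(*args) in seconds, after discarding warm-up calls."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return mean(samples)


def profile(blocks, inputs):
    """blocks: name -> callable; inputs: name -> example input for that block."""
    latency = {name: time_block(fn, inputs[name]) for name, fn in blocks.items()}
    total = sum(latency.values())
    return {name: (t, t / total) for name, t in latency.items()}


# Stand-in workloads; in the real setup these would be the model's submodules.
blocks = {
    "encoder": lambda n: sum(i * i for i in range(n)),
    "temporal_conv": lambda n: sum(range(n)),
}
inputs = {"encoder": 20_000, "temporal_conv": 2_000}
report = profile(blocks, inputs)
for name, (seconds, share) in report.items():
    print(f"{name:14s} {seconds * 1e3:8.3f} ms  {share:6.1%}")
```

Feeding each block an input of the shape it actually receives during inference, as described above, is what makes the per-block shares comparable with the end-to-end latency.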

The analysis reveals that the temporal convolution submodule makes up half of the model’s weights but the encoder submodule is responsible for more than two-thirds of the overall latency. It is thus natural to focus our efforts on these two submodules.

Fig 3. Parameter and Latency Contribution of each Submodule

Baseline Benchmarks¶

To assess the impact of applying model compression techniques, we track three key metrics: model size, accuracy, and latency. For accuracy, we look at the average OBO (Off-By-One) error, which is the fraction of videos whose predicted count differs from the true count by more than one repetition. Additionally, we use the average MAE (Mean Absolute Error), the absolute counting error normalised by the true count and averaged over videos. Both are error measures, so lower is better. Tracking these metrics helps us understand how the model performs in terms of size, accuracy, and speed, and how these values change when we apply compression techniques.
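For reference, here is how the two accuracy metrics can be computed under the definitions commonly used for repetition counting. OBO error is the share of videos whose predicted count is off by more than one, and MAE is the absolute error divided by the true count, averaged over videos. The exact formulas used by the benchmark code are an assumption on our part, and the counts below are made up.

```python
def obo_error(predicted, actual):
    """Fraction of videos whose predicted count misses the true count by more than one."""
    misses = sum(1 for p, a in zip(predicted, actual) if abs(p - a) > 1)
    return misses / len(actual)


def normalized_mae(predicted, actual):
    """Mean of |predicted - actual| / actual over all videos."""
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


predicted = [4, 10, 7, 12]  # illustrative counts, not taken from the benchmark
actual = [4, 8, 6, 12]
print(f"OBO error: {obo_error(predicted, actual):.2f}")
print(f"MAE:       {normalized_mae(predicted, actual):.3f}")
```

Only the second video counts as an OBO miss (10 vs 8). The third (7 vs 6) is within one repetition, but it still contributes to the MAE.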

We use the QUVA Repetition dataset for benchmarking. The dataset consists of 100 short videos of various repeating actions and was also used by the authors of the model to benchmark its performance against comparable architectures.

Since compression is highly hardware-dependent, we begin by running the benchmarks on the target device early on. For this, we use PyTorch’s benchmarking binary to run a scripted PyTorch model on the device without having to learn anything about mobile development.

The table below shows benchmark results for the baseline model on a Samsung S22 Plus phone with batch sizes of 1, 8, and 16. Each frame has dimensions 112x112 and the model processes 64 frames per iteration. The model size is 97.97 MB, with OBO at 0.32 and MAE at 0.17.

Input Dimensions	Seconds Per Iteration	Iterations Per Second
1,3,64,112,112	2.77	0.36
8,3,64,112,112	22.5	0.04
16,3,64,112,112	49.3	0.02

Compressing the Model¶

We can begin our compression experiments now that we have a better idea of various memory and latency bottlenecks in our model and how well it runs on the target device. We first explore pruning, tensorization, and quantization in isolation before experimenting with various combinations of these techniques.

Tensorization¶

Tensorization involves representing the weights of neural network layers in a factorized form using tensor decomposition methods like Tucker, CP, etc. TensorLy-Torch provides off-the-shelf factorized layers that can directly replace their standard PyTorch counterparts, with the pre-trained weights used to initialise the factorized layers. However, TensorLy-Torch is not compatible with torch.jit.script, as highlighted in this issue, so we use TensorLy to perform the decompositions and implement custom factorized layers.

Tensorization works especially well with convolutional layers, where kernels of any order (2D, 3D, or more!) can be parametrized efficiently using tensor decomposition algorithms. For example, a CP decomposition results in a separable convolution that replaces the original convolution with a series of small, efficient ones that work better on mobile phones.

Tensorizing the Convolution Layer. Because the Temporal Convolution layer contains the largest number of parameters, we begin by tensorizing it. Running experiments with different decomposition ranks shows that a rank of 0.1 yields the best results in this case. With this rank, the model size decreases to 62.85MB, and the inference time is cut in half. Notably, prediction accuracy improves, with OBO decreasing to 0.29 and MAE to 0.167, without fine-tuning the model. While this is not a common result, we believe the model’s original performance may be impacted by inherent noise, and we attribute the increase in prediction accuracy to the regularisation effect of tensorization, which reduces the noise in the network and allows it to generalise better.

Tensorizing the Feature Extractor. Tensorizing this part was challenging as it includes numerous convolutional layers, making grid searching for optimal ranks impractical. Rank selection for tensor decompositions is a hard problem because determining the optimal rank involves finding the right balance between capturing the essential structure of the underlying data and avoiding overfitting. Selecting a rank that is too low may result in a loss of important information, leading to a poor approximation of the original tensor. On the other hand, selecting a rank that is too high can lead to overfitting, where the model captures noise or irrelevant details, hindering generalisation to new data. Additionally, the cost of searching for the optimal rank grows quickly, because the number of possible rank combinations grows exponentially.

Practitioners generally rely on heuristics or meta-learning to determine the ranks. The authors of Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications have used Variational Bayesian Matrix Factorization (VBMF) successfully on various networks, including VGG-16 and AlexNet. VBMF is a complicated technique and out of the scope of this post, but at a high level it approximates an L×M matrix V as the sum of a low-rank matrix BA⊤, where B is L×H and A is M×H, and Gaussian noise. After A and B are found, H acts as an upper bound on the rank. VBMF is a very promising tool because it can automatically find the noise variance and the rank, and even provides a theoretical condition for perfect rank recovery (Nakajima et al., 2012). We determined the ranks R3 and R4 by applying global analytic VBMF to the mode-3 and mode-4 matricizations of the convolutional kernel tensors K, and managed to reduce the model size to 84.19 MB. In this case, MAE increased to 0.27 while OBO decreased to 0.29.

Tensorizing the Entire Encoder. Finally, we experiment with tensorizing the entire encoder, as it accounts for the bulk of the latency and memory footprint of the model. We use VBMF ranks for the ResNet50 part and a set of different ranks for the Conv3D layer, and get the best results with a rank of 0.1. We see an average speedup of 1.7x across all the batch sizes. This is mainly due to the reduction in arithmetic intensity as a result of tensorization. The model size is reduced to 51.8MB, OBO decreases to 0.30 while MAE increases to 0.21. The results are summarised in the following table:

Input Dimensions	Seconds Per Iteration	Iterations Per Second
1,3,64,112,112	1.51	0.66
8,3,64,112,112	12.9	0.08
16,3,64,112,112	32.4	0.03

Pruning¶

Pruning is a technique aimed at reducing the size of a deep neural network by eliminating non-essential parameters, or weights. Structured Pruning involves removing specific structures like neurons in a fully connected layer, channels or filters in a convolutional layer, or self-attention heads in transformers. Conversely, Unstructured Pruning targets and removes individual weights based on criteria such as magnitude or their contribution to the overall loss function during training, without adhering to a specific structural pattern.

Given the intended scope of our solution, where we want to deploy on as wide a range of devices as possible, we focus on structured pruning as it is a more hardware-agnostic technique. Because unstructured pruning introduces sparsity in the model, it is not suitable for all hardware. For example, GPUs are generally not well-suited for processing sparse matrices, so using unstructured pruning on devices with an embedded GPU won’t yield significant performance improvements. Conversely, structured pruning removes entire neurons, layers, or other structured components from the neural network to reduce size and complexity, and is thus a viable solution regardless of the device. For our experiments, we use the Torch-Pruner library, which provides off-the-shelf utilities for structured pruning.
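To illustrate why structured pruning is hardware-agnostic, the sketch below removes whole output channels from a small dense layer and returns a smaller layer that is still dense. Ranking channels by the L1 norm of their weights is a common heuristic that we assume here for illustration. It is not necessarily the criterion used by the library or in our experiments, and the weights are made up.

```python
def prune_output_channels(weight, bias, keep_ratio):
    """Drop the output channels (rows) with the smallest L1 norm.

    weight: list of rows, one row of input weights per output channel.
    Returns the smaller dense weight and bias, plus the indices that were kept.
    """
    n_keep = max(1, round(len(weight) * keep_ratio))
    norms = [sum(abs(w) for w in row) for row in weight]
    ranked = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original channel order
    return [weight[i] for i in kept], [bias[i] for i in kept], kept


weight = [
    [0.8, -0.5, 0.3],
    [0.01, 0.02, -0.01],
    [-0.6, 0.4, 0.9],
    [0.05, -0.03, 0.02],
]
bias = [0.1, 0.0, -0.2, 0.0]
new_weight, new_bias, kept = prune_output_channels(weight, bias, keep_ratio=0.5)
print("kept channels:", kept)
print("new shape:", len(new_weight), "x", len(new_weight[0]))
```

In a real network, the input channels of the following layer have to be sliced with the same `kept` indices. Tracking these dependencies across the graph is the main thing structured pruning libraries automate.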

In the following table, benchmark results demonstrate an average 2.19x latency reduction achieved by the pruned model compared to the baseline. Notably, we reduced the model size to approximately 58% of the original model size by pruning 35% of the ResNet encoder layer and 64% of the Temporal Convolution layer. There is a minimal drop in accuracy with MAE at 0.22 and OBO at 0.34.

Input Dimensions	Seconds Per Iteration	Iterations Per Second
1,3,64,112,112	1.25	0.79
8,3,64,112,112	10.8	0.08
16,3,64,112,112	22.2	0.05

Quantization¶

Quantization refers to the process of converting the weights of a neural network from a high-precision bit width (e.g. float32) to a lower-precision bit width (e.g. float16 or int8). We use PyTorch’s built-in Quantization API for all of our quantization experiments.

To determine which parts of the model to target for quantization, we first test-quantize the whole model and measure the Signal-to-Quantization-Noise Ratio (SQNR) of each layer of the quantized model to quantify the sensitivity of the weights and activations to quantization. SQNR compares the power of the original signal with the power of the noise introduced by quantization. A higher SQNR indicates that the layer is less affected by quantization noise, and should therefore be targeted for quantization to minimise accuracy loss. Running this sensitivity analysis reveals that, consistent with the profiling we conducted above, both the temporal convolution and the encoder layers are a good fit for quantization.
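As a reference for the metric, SQNR in decibels is 10 · log10 of the signal power divided by the power of the quantization error. The sketch below computes it for one made-up layer output at two bit widths, using a simple symmetric fake-quantizer that we assume for illustration.

```python
import math


def sqnr_db(original, quantized):
    """Signal-to-quantization-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    signal = sum(x * x for x in original)
    noise = sum((x - q) ** 2 for x, q in zip(original, quantized))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)


def fake_quantize(values, num_bits):
    """Symmetric quantize-dequantize round trip with a max-abs scale."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = (max(abs(v) for v in values) or 1.0) / qmax
    return [round(v / scale) * scale for v in values]


layer_output = [0.12, -0.47, 0.93, -0.08, 0.55, -0.71, 0.33, 0.02]
for bits in (8, 4):
    value = sqnr_db(layer_output, fake_quantize(layer_output, bits))
    print(f"int{bits}: {value:.1f} dB")
```

In a sensitivity analysis the same comparison is made per layer, between the float model’s outputs and the quantized model’s outputs, and the layers are then ranked by the resulting SQNR.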

We run the final quantization experiments using post-training static quantization, which calibrates the quantization parameters ahead of time and allows us to compress the model more than dynamic quantization, which determines the quantization parameters for activations during inference. With post-training static quantization, we reduce the model size by 12% and speed up inference by 1.2x compared to the baseline model. Accuracy drop is minimal with OBO at 0.38 and MAE at 0.31. The results are summarised in the following table:

Input Dimensions	Seconds Per Iteration	Iterations Per Second
1,3,64,112,112	2.64	0.380
8,3,64,112,112	16.3	0.060
16,3,64,112,112	46.9	0.022

Combining Compression Techniques¶

Having studied the impacts on the performance of each group of compression techniques, we move into combining them to achieve the best overall performance improvement.

We begin with pruning the model since the use of structured pruning allows us to contain the impact on the model architecture to the removal of layers and blocks of neurons, and doesn’t introduce new components that may be incompatible with the other techniques.

We proceed to apply tensorization on the pruned model, leaving quantization at the end since quantizing the model replaces all the target layers with quantized versions that are not supported by the tensorization operations we use. Combining tensorization with pruning only minimally impacts accuracy but already allows us to achieve almost double the speed-up obtained from pruning only. Interestingly, we also separately tried tensorizing the model first before pruning it and this resulted in a significant dip in accuracy, which highlights the importance of conducting thorough tests when designing optimization pipelines.

Finally, applying SQNR-guided post-training static quantization to the pruned and tensorized model allows us to achieve an overall speed-up of 3.5x compared to the baseline model and a 70.5% decrease in model size. Impact on accuracy compounds but remains within acceptable boundaries with OBO at 0.40 and MAE at 0.21. This lost accuracy can be recovered by subsequent fine-tuning.

Input Dimensions	Seconds Per Iteration	Iterations Per Second
1,3,64,112,112	0.85	1.17
8,3,64,112,112	6.3	0.142
16,3,64,112,112	13.5	0.073

Conclusion¶

Through this blog post, we have outlined the overall process for compressing a model and optimising it for edge deployment by using various techniques and libraries. The following table summarises the results achieved with all the techniques:

Setup	Model Size (MB)	Speedup	OBO	MAE
Baseline	97.97	-	0.32	0.17
Tensorized	51.80	1.7x	0.30	0.21
Pruned	57.56	2.19x	0.34	0.22
Quantized	86.14	1.2x	0.38	0.31
Pruning + Tensorization + Quantization	28.92	3.5x	0.40	0.21
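The size reductions and speedups in this table can be recomputed from the latency tables above. The sketch below assumes that each reported speedup is the mean of the per-batch-size speedups, which is our reading of “average speedup”. Since the tables are rounded, the recomputed values only approximately match the reported ones.

```python
baseline_mb = 97.97
baseline_s = [2.77, 22.5, 49.3]  # seconds per iteration at batch sizes 1, 8, 16

# (model size in MB, seconds per iteration at batch sizes 1, 8, 16)
setups = {
    "Tensorized": (51.80, [1.51, 12.9, 32.4]),
    "Pruned": (57.56, [1.25, 10.8, 22.2]),
    "Quantized": (86.14, [2.64, 16.3, 46.9]),
    "Pruning + Tensorization + Quantization": (28.92, [0.85, 6.3, 13.5]),
}


def summarize(size_mb, seconds):
    """Return (fractional size reduction, mean speedup across batch sizes)."""
    size_reduction = 1 - size_mb / baseline_mb
    speedup = sum(b / s for b, s in zip(baseline_s, seconds)) / len(seconds)
    return size_reduction, speedup


for name, (size_mb, seconds) in setups.items():
    reduction, speedup = summarize(size_mb, seconds)
    print(f"{name:40s} size -{reduction:.1%}  avg speedup {speedup:.2f}x")
```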

Of course, the proposed solution isn’t necessarily the best for every use case. Selecting the appropriate hyperparameters and compression modes often boils down to running several tests to find the combination that strikes the best balance between performance and accuracy, taking into account one’s own objectives, which may put more emphasis on either metric.

The libraries and tools we used are also not the only ones offering the utilities we needed, and we encourage you to explore other tools as well! A good place to get started would be our blog post series on model compression, where we outline the various tools available and what they uniquely offer. Feel free to give it a read to learn more on the topic!

We hope this was informative and that it gave you some insights into model optimization. See you in the next one!

Acknowledgements¶

This case study was conducted at Unify with the collaborative efforts of Muhammad Elaseyad, Nassim Berrada, and Guillermo Brizeula. I extend my gratitude to their valuable contributions and insights throughout the project.