<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>anwaar-khalid - Quantization</title><link href="https://hello-fri-end.github.io/" rel="alternate"/><link href="https://hello-fri-end.github.io/feeds/quantization.atom.xml" rel="self"/><id>https://hello-fri-end.github.io/</id><updated>2026-06-18T00:00:00+05:30</updated><entry><title>Integer Quantization: Deep Dive 🤿</title><link href="https://hello-fri-end.github.io/2026/06/integer-quantization-deep-dive/" rel="alternate"/><published>2026-06-18T00:00:00+05:30</published><updated>2026-06-18T00:00:00+05:30</updated><author><name>Anwaar Khalid</name></author><id>tag:hello-fri-end.github.io,2026-06-18:/2026/06/integer-quantization-deep-dive/</id><summary type="html">&lt;p&gt;A lot has happened in transformer quantization over the past few years, from barely being able to quantize a 7B model in INT8 without destroying accuracy, to routinely fitting a 70B model in 4-bits on a single GPU. But existing guides on the topic are fragmented: either focused on a …&lt;/p&gt;</summary><content type="html">&lt;p&gt;A lot has happened in transformer quantization over the past few years, from barely being able to quantize a 7B model in INT8 without destroying accuracy, to routinely fitting a 70B model in 4-bits on a single GPU. But existing guides on the topic are fragmented: either focused on a specific technique or on how to use a library. I&amp;rsquo;ve been working on integer quantization for fixed-point hardware for a while now and my goal with this series is to bridge that gap: building the core ideas carefully and tracing how the field has evolved, each technique motivated by the problems of what came before. This first post covers the foundations: what quantization is, why it&amp;rsquo;s hard, and the math behind it.&lt;/p&gt;
&lt;h2 id="what-is-quantization-why-should-you-care"&gt;&lt;strong&gt;What is Quantization &amp;amp; why should you care?&lt;/strong&gt;&lt;a class="headerlink" href="#what-is-quantization-why-should-you-care" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Quantization is the process of representing high-precision values using fewer bits. In practice, this means storing weights and (optionally) activations in lower precision (e.g., int8 instead of fp16), introducing a small approximation error.&lt;/p&gt;
&lt;p&gt;The most immediate and easy-to-realize benefit of quantization is memory reduction. As a rule of thumb, a model with N billion parameters requires roughly &lt;strong&gt;2 × N GB&lt;/strong&gt; of memory when stored in 16-bit precision. Quantizing to 8-bit or 4-bit reduces this footprint by 2× and 4×, respectively.&lt;/p&gt;
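&lt;p&gt;As a rough sketch of that rule of thumb, the helper below (a hypothetical name, not from any library) estimates weight memory only, assuming 1 GB = 10^9 bytes and ignoring activations, KV cache, and any quantization metadata:&lt;/p&gt;

```python
def model_memory_gb(num_params_billion, bits_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes); weights only."""
    bytes_per_param = bits_per_param / 8
    return num_params_billion * bytes_per_param

for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: ~{model_memory_gb(70, bits):.0f} GB")
```

&lt;p&gt;For a 70B model this gives roughly 140 GB, 70 GB, and 35 GB at 16, 8, and 4 bits.&lt;/p&gt;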
&lt;p&gt;There is also a hardware advantage. In 2014, Mark Horowitz of Stanford University published &lt;a href="https://gwern.net/doc/cs/hardware/2014-horowitz-2.pdf"&gt;Computing&amp;rsquo;s Energy Problem&lt;/a&gt;, a paper that compared the energy cost of floating-point and integer operations:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Energy Consumption: Integer vs Floating Point" src="/assets/images/Horowitz.png"&gt;
 &lt;em&gt;Energy Costs for various operations on a 45nm CMOS node. Source: &lt;a href="https://gwern.net/doc/cs/hardware/2014-horowitz-2.pdf"&gt;Computing&amp;rsquo;s Energy Problem&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;So, integer arithmetic consumes &lt;strong&gt;less energy&lt;/strong&gt;: specifically, an int8 add consumes about 30x less energy than an fp32 add, and an int8 multiply about 18x less than an fp32 multiply. Lower-precision hardware is also &lt;strong&gt;faster&lt;/strong&gt; and &lt;strong&gt;takes up less silicon area&lt;/strong&gt; than floating-point hardware.&lt;/p&gt;
&lt;p&gt;How do these benefits translate to real-world gains? It depends on the bottleneck:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compute-bound workloads (e.g., CNNs, LLM prefill)&lt;/strong&gt;:
Quantization can improve throughput since lower-precision arithmetic is faster and consumes less energy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Memory-bandwidth-bound workloads (e.g., LLM decoding)&lt;/strong&gt;:
Quantization reduces the amount of data moved, improving performance by lowering memory bandwidth pressure.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By this point, the motivation should be clear: quantization reduces memory, lowers energy consumption, and can improve performance. Next, we will look at the hardware unit that executes fixed-point arithmetic.&lt;/p&gt;
&lt;h2 id="multiply-accumulate-unit"&gt;&lt;strong&gt;Multiply Accumulate Unit&lt;/strong&gt;&lt;a class="headerlink" href="#multiply-accumulate-unit" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The dominant operation in neural networks is &lt;strong&gt;matrix multiplication&lt;/strong&gt;. Modern hardware accelerators optimize this using specialized units called &lt;strong&gt;Multiply–Accumulate (MAC) Units&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Multiply Accumulate Unit" src="/assets/images/MAC.png"&gt;&lt;br&gt;
&lt;em&gt;Matrix–vector computation in neural network accelerator hardware. Source: &lt;a href="https://arxiv.org/pdf/2106.08295"&gt;A White Paper on Neural Network Quantization&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The diagram represents a typical &lt;strong&gt;matrix–vector multiply&lt;/strong&gt; unit in neural network accelerators. This is the building block for matrix multiplications and convolutions. The two fundamental components are the &lt;em&gt;processing elements&lt;/em&gt; &lt;span class="math"&gt;\(C_{n,m}\)&lt;/span&gt; and the &lt;em&gt;accumulators&lt;/em&gt; &lt;span class="math"&gt;\(A_n\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;The computation proceeds as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The accumulators are first initialized with the bias value &lt;span class="math"&gt;\(b_n\)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In the next cycle, weights &lt;span class="math"&gt;\(W_{n,m}\)&lt;/span&gt; and input values &lt;span class="math"&gt;\(x_m\)&lt;/span&gt; are loaded&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Their product is computed at each processing element:&lt;br&gt;
&lt;div class="math"&gt;$$C_{n,m} = W_{n,m} \cdot x_m$$&lt;/div&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The results are then accumulated:&lt;br&gt;
&lt;div class="math"&gt;$$A_n = b_n + \sum_{m} C_{n,m}$$&lt;/div&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
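&lt;p&gt;The steps above can be sketched in plain Python. This is only a software model of the dataflow (the function name is made up for illustration); real hardware performs the multiplies of a row in parallel rather than in a loop:&lt;/p&gt;

```python
def mac_matvec(W, x, b):
    """Compute A_n = b_n + sum_m W[n][m] * x[m], one accumulator per output row."""
    accumulators = list(b)                 # initialise each A_n with the bias b_n
    for n, row in enumerate(W):
        for m, w_nm in enumerate(row):
            c_nm = w_nm * x[m]             # processing element C_{n,m}
            accumulators[n] += c_nm        # accumulate into A_n
    return accumulators

W = [[1, 2, 3],
     [4, 5, 6]]
x = [1, 0, -1]
b = [10, 20]
print(mac_matvec(W, x, b))  # [8, 18]
```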
&lt;h2 id="how-is-quantization-done"&gt;&lt;strong&gt;How is quantization done?&lt;/strong&gt;&lt;a class="headerlink" href="#how-is-quantization-done" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Starting from a real-valued vector &lt;span class="math"&gt;\(x\)&lt;/span&gt;, we map it to an integer grid &lt;span class="math"&gt;\(\{x_{\text{int}}^{\min}, \ldots, x_{\text{int}}^{\max}\}\)&lt;/span&gt;:&lt;/p&gt;
&lt;div class="math"&gt;$$
x_{\text{int}} =
\mathrm{clamp}
\left(
\left\lfloor \frac{x}{s} \right\rceil + z,\;
x_{\text{int}}^{\min},\;
x_{\text{int}}^{\max}
\right)
$$&lt;/div&gt;
&lt;p&gt;Here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;span class="math"&gt;\(s\)&lt;/span&gt; is the &lt;strong&gt;scale&lt;/strong&gt; &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;span class="math"&gt;\(z\)&lt;/span&gt; is the &lt;strong&gt;zero-point (offset)&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;span class="math"&gt;\(\lfloor \cdot \rceil\)&lt;/span&gt; denotes &lt;strong&gt;rounding to the nearest integer&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
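&lt;p&gt;The &lt;strong&gt;clamp&lt;/strong&gt; operation ensures the result lies within the valid integer range: values below &lt;span class="math"&gt;\(x_{\text{int}}^{\min}\)&lt;/span&gt; are set to &lt;span class="math"&gt;\(x_{\text{int}}^{\min}\)&lt;/span&gt;, and values above &lt;span class="math"&gt;\(x_{\text{int}}^{\max}\)&lt;/span&gt; are set to &lt;span class="math"&gt;\(x_{\text{int}}^{\max}\)&lt;/span&gt;.&lt;/p&gt;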
&lt;p&gt;So, the idea is to scale and shift the floating-point value, then clamp it to fit within the integer grid.&lt;/p&gt;
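&lt;p&gt;A minimal sketch of this formula in plain Python follows. The function names and the signed 4-bit grid are illustrative assumptions; ties are rounded up via &lt;code&gt;floor(v + 0.5)&lt;/code&gt; because the built-in &lt;code&gt;round&lt;/code&gt; rounds ties to the nearest even integer:&lt;/p&gt;

```python
import math

def round_half_up(v):
    """Round to the nearest integer, with ties going up."""
    return math.floor(v + 0.5)

def quantize(x, s, z, int_min, int_max):
    """Map real values to the integer grid: clamp(round(x / s) + z, int_min, int_max)."""
    return [min(max(round_half_up(v / s) + z, int_min), int_max) for v in x]

x = [-3.0, -0.6, 0.0, 0.3, 5.0]
x_int = quantize(x, s=0.25, z=0, int_min=-8, int_max=7)   # hypothetical signed 4-bit grid
print(x_int)  # [-8, -2, 0, 1, 7]
```

&lt;p&gt;The first and last inputs fall outside the grid and are clamped to -8 and 7; the others are only rounded.&lt;/p&gt;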
&lt;h2 id="quantization-simulation-fake-quantization"&gt;&lt;strong&gt;Quantization Simulation (Fake Quantization)&lt;/strong&gt;&lt;a class="headerlink" href="#quantization-simulation-fake-quantization" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Instead of running quantized models directly on target hardware, we often &lt;strong&gt;simulate quantization&lt;/strong&gt; on general-purpose hardware using high-level frameworks like PyTorch. This is commonly referred to as &lt;strong&gt;fake quantization&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The key idea is simple: we mimic the effects of quantization while still executing operations in floating point. This allows us to study accuracy and perform experiments like Quantization Aware Training (QAT) without requiring specialized hardware.&lt;/p&gt;
&lt;p&gt;To do this, we:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Quantize&lt;/strong&gt; the input to an integer grid  &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dequantize&lt;/strong&gt; it back to floating point  &lt;/li&gt;
&lt;li&gt;Perform all computations in floating point on standard hardware (e.g., GPUs)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The dequantization step maps integers back to real values:&lt;/p&gt;
&lt;div class="math"&gt;$$
\widehat{\mathbf{x}} = s \left( \mathbf{x}_{\text{int}} - z \right)
$$&lt;/div&gt;
&lt;p&gt;Combining quantization and dequantization, we get:&lt;/p&gt;
&lt;div class="math"&gt;$$
\widehat{\mathbf{x}} = q(\mathbf{x}; s, z) =
s \left[
\mathrm{clamp}
\left(
\left\lfloor \frac{\mathbf{x}}{s} \right\rceil + z,\;
x_{\text{int}}^{\min},\;
x_{\text{int}}^{\max}
\right)
- z
\right]
$$&lt;/div&gt;
&lt;p&gt;In practice, frameworks insert these &lt;strong&gt;quant–dequant (Q/DQ) pairs&lt;/strong&gt; around operations in the model graph. While the computation is still carried out in floating point, the values are constrained to a &lt;strong&gt;discrete set&lt;/strong&gt;, effectively simulating quantization effects during inference.&lt;/p&gt;
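&lt;p&gt;The combined quantize and dequantize step can be sketched as a single function. This is a plain-Python illustration with made-up names and a signed 4-bit grid, not the implementation of any particular framework:&lt;/p&gt;

```python
import math

def fake_quantize(x, s, z, int_min, int_max):
    """Quantize to the integer grid, then map straight back to floating point."""
    out = []
    for v in x:
        v_int = min(max(math.floor(v / s + 0.5) + z, int_min), int_max)  # quantize
        out.append(s * (v_int - z))                                      # dequantize
    return out

x = [-0.6, 0.0, 0.3, 0.8, 5.0]
x_hat = fake_quantize(x, s=0.25, z=0, int_min=-8, int_max=7)
print(x_hat)  # [-0.5, 0.0, 0.25, 0.75, 1.75]
```

&lt;p&gt;The outputs are still floats, but every one of them is a multiple of the scale, and applying the function a second time leaves them unchanged.&lt;/p&gt;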
&lt;h3 id="what-are-q_min-q_max"&gt;&lt;strong&gt;What are &lt;span class="math"&gt;\(q_{\min}\)&lt;/span&gt; &amp;amp; &lt;span class="math"&gt;\(q_{\max}\)&lt;/span&gt; ?&lt;/strong&gt;&lt;a class="headerlink" href="#what-are-q_min-q_max" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;At this point, it’ll be helpful to build some intuition around what the quantization formula is actually doing.&lt;/p&gt;
&lt;p&gt;Think about the &lt;strong&gt;minimum value&lt;/strong&gt; that can come out of the quantization operation.&lt;br&gt;
This corresponds to &lt;span class="math"&gt;\(x_{\text{int}}^{\min}\)&lt;/span&gt;. Plugging this into the de-quantization formula:&lt;/p&gt;
&lt;div class="math"&gt;$$
q_{\min} = s \left( x_{\text{int}}^{\min} - z \right)
$$&lt;/div&gt;
&lt;p&gt;Similarly, the &lt;strong&gt;maximum value&lt;/strong&gt; corresponds to &lt;span class="math"&gt;\(x_{\text{int}}^{\max}\)&lt;/span&gt;:&lt;/p&gt;
&lt;div class="math"&gt;$$
q_{\max} = s \left( x_{\text{int}}^{\max} - z \right)
$$&lt;/div&gt;
&lt;p&gt;&lt;img alt="Fake Quantized Grid" src="/assets/images/fake_quantized_grid.png"&gt;  &lt;/p&gt;
&lt;p&gt;The key thing to note here is that we no longer operate over a continuous floating-point range, but over a &lt;strong&gt;discrete set of &lt;span class="math"&gt;\(2^b\)&lt;/span&gt; values&lt;/strong&gt; (where &lt;span class="math"&gt;\(b\)&lt;/span&gt; is the bit-width), each separated by the scale &lt;span class="math"&gt;\(s\)&lt;/span&gt;.&lt;/p&gt;
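&lt;p&gt;To make the grid concrete, the sketch below enumerates every representable real value for a small, hypothetical 3-bit unsigned grid (the helper name and parameter values are chosen only for illustration):&lt;/p&gt;

```python
def dequantized_grid(s, z, bits, signed=False):
    """List every real value s * (k - z) that the integer grid can represent."""
    int_min = -(2 ** (bits - 1)) if signed else 0
    int_max = int_min + 2 ** bits - 1
    return [s * (k - z) for k in range(int_min, int_max + 1)]

grid = dequantized_grid(s=0.5, z=3, bits=3)
print(grid)  # [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
```

&lt;p&gt;The first and last entries are the minimum and maximum representable values, there are 2^3 = 8 points in total, and neighbouring points differ by exactly the scale.&lt;/p&gt;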
&lt;h2 id="quantization-error"&gt;&lt;strong&gt;Quantization Error&lt;/strong&gt;&lt;a class="headerlink" href="#quantization-error" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There are two “bad guys” in our quantization formula:&lt;br&gt;
(1) the &lt;strong&gt;rounding operator&lt;/strong&gt;, which introduces rounding error, and&lt;br&gt;
(2) the &lt;strong&gt;clamp operator&lt;/strong&gt;, which introduces clipping error.  &lt;/p&gt;
&lt;p&gt;These are the reasons why a quantized network deviates from the original network. The resulting deviation is called &lt;strong&gt;quantization error&lt;/strong&gt; (or quantization noise). As we’ll see, improving one often worsens the other, so quantization is really about balancing the two.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rounding Error&lt;/strong&gt;: the difference between the original floating-point value and the value it gets mapped to on the quantized grid.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Clipping Error&lt;/strong&gt;: occurs when values fall outside the representable range and get clipped to the minimum or maximum value.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s think about the extremes again. When would the rounding error be the maximum? 
The worst case occurs when the fp value lies exactly halfway between two grid points. In this case:&lt;/p&gt;
&lt;div class="math"&gt;$$
\text{max rounding error} = \frac{s}{2}
$$&lt;/div&gt;
&lt;p&gt;So, you might think: “why not make &lt;span class="math"&gt;\(s\)&lt;/span&gt; very small to minimize this for each value?”. Unfortunately, there&amp;rsquo;s no free lunch when it comes to quantization: reducing &lt;span class="math"&gt;\(s\)&lt;/span&gt; also shrinks the representable range from &lt;span class="math"&gt;\(q_{\min} = s \left( x_{\text{int}}^{\min} - z \right)\)&lt;/span&gt; to &lt;span class="math"&gt;\(q_{\max} = s \left( x_{\text{int}}^{\max} - z \right)\)&lt;/span&gt;, so more values fall outside it and the clipping error increases.&lt;/p&gt;
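&lt;p&gt;The tradeoff is easy to see numerically. The sketch below uses a hypothetical five-element tensor whose last entry is an outlier, a signed 4-bit grid with zero-point 0, and two candidate scales; all names and numbers are illustrative:&lt;/p&gt;

```python
import math

def fake_quantize(v, s, int_min=-8, int_max=7):
    """Quantize then dequantize one value on a signed 4-bit grid with z = 0."""
    v_int = min(max(math.floor(v / s + 0.5), int_min), int_max)
    return s * v_int

def error_profile(x, s):
    """Return (largest error among the small values, error on the last, largest value)."""
    errors = [abs(v - fake_quantize(v, s)) for v in x]
    return max(errors[:-1]), errors[-1]

x = [0.1, -0.2, 0.3, 0.45, 6.0]   # hypothetical tensor whose last entry is an outlier

for s in (1.0, 0.0625):
    small_err, outlier_err = error_profile(x, s)
    print(f"s={s}: range [{-8 * s}, {7 * s}], "
          f"max error on small values {small_err:.4f}, error on outlier {outlier_err:.4f}")
```

&lt;p&gt;With the large scale the outlier is represented exactly but every small value rounds to zero; with the small scale the small values are represented closely but the outlier is clipped to the top of the range.&lt;/p&gt;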
&lt;p&gt;So how do you choose quantization parameters optimally?&lt;/p&gt;
&lt;h3 id="how-are-quantization-params-scale-and-offset-calculated"&gt;&lt;strong&gt;How are quantization params ( scale and offset) calculated?&lt;/strong&gt;&lt;a class="headerlink" href="#how-are-quantization-params-scale-and-offset-calculated" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The simplest approach is min–max quantization, where we use an &lt;strong&gt;unsigned integer grid &lt;span class="math"&gt;\((0 \text{ to } 2^b - 1)\)&lt;/span&gt;&lt;/strong&gt; and set the scale so that the entire floating-point range fits within it, avoiding clipping.
That is, you want:&lt;/p&gt;
&lt;p&gt;&lt;span class="math"&gt;\(q_{\max}\)&lt;/span&gt; to map to &lt;span class="math"&gt;\(fp_{\max}\)&lt;/span&gt; &amp;amp;  &lt;span class="math"&gt;\(q_{\min}\)&lt;/span&gt; to map to &lt;span class="math"&gt;\(fp_{\min}\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;So you solve:&lt;/p&gt;
&lt;div class="math"&gt;$$
q_{\max} = s \left(2^b - 1 - z\right) = fp_{\max}
$$&lt;/div&gt;
&lt;div class="math"&gt;$$
q_{\min} = -s z = fp_{\min}
$$&lt;/div&gt;
&lt;p&gt;Solving these gives:&lt;/p&gt;
&lt;div class="math"&gt;$$
s = \frac{fp_{\max} - fp_{\min}}{2^b - 1}
$$&lt;/div&gt;
&lt;div class="math"&gt;$$
z = \frac{fp_{\min} \cdot (2^b - 1)}{fp_{\min} - fp_{\max}}
$$&lt;/div&gt;
&lt;p&gt;For a &lt;strong&gt;signed grid &lt;span class="math"&gt;\((-2^{b-1} \text{ to } 2^{b-1} - 1)\)&lt;/span&gt;&lt;/strong&gt;, the corresponding method is abs-max quantization, where we scale using the maximum absolute value so that the range is symmetric around zero, avoiding clipping.&lt;/p&gt;
&lt;p&gt;So you solve:&lt;/p&gt;
&lt;div class="math"&gt;$$
q_{\max} = s \cdot (2^{b-1} - 1) = \max\left( |fp_{\min}|,\; |fp_{\max}| \right)
$$&lt;/div&gt;
&lt;p&gt;Solving gives:&lt;/p&gt;
&lt;div class="math"&gt;$$
s = \frac{\max\left( |fp_{\min}|,\; |fp_{\max}| \right)}{2^{b-1} - 1}
$$&lt;/div&gt;
&lt;div class="math"&gt;$$
z = 0
$$&lt;/div&gt;
&lt;p&gt;These are the most common formulas for &lt;em&gt;scale&lt;/em&gt; and &lt;em&gt;offset&lt;/em&gt; on the internet &amp;ndash; now you know where they come from!&lt;/p&gt;
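&lt;p&gt;Both sets of formulas translate directly into code. In the sketch below the function names are hypothetical, and the zero-point is rounded to the nearest integer, which is a practical step added here and not part of the derivation above:&lt;/p&gt;

```python
import math

def minmax_params(fp_min, fp_max, bits):
    """Asymmetric parameters for the unsigned grid 0 .. 2^b - 1."""
    s = (fp_max - fp_min) / (2 ** bits - 1)
    z = math.floor(-fp_min / s + 0.5)      # z = -fp_min / s, rounded to an integer
    return s, z

def absmax_params(fp_min, fp_max, bits):
    """Symmetric parameters for the signed grid -2^(b-1) .. 2^(b-1) - 1."""
    s = max(abs(fp_min), abs(fp_max)) / (2 ** (bits - 1) - 1)
    return s, 0

print(minmax_params(-1.0, 2.75, bits=4))   # (0.25, 4)
print(absmax_params(-1.0, 2.75, bits=4))
```

&lt;p&gt;For the range [-1.0, 2.75] at 4 bits, min&amp;ndash;max uses a step of 0.25 across all 16 grid points, while abs-max needs a larger step of 2.75 / 7 because it must also cover the unused negative range down to -2.75.&lt;/p&gt;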
&lt;p&gt;&lt;img alt="AbsMax Quantization" src="/assets/images/absmax.png"&gt;
&lt;em&gt;An example of AbsMax Quantization. Author &lt;a href="https://www.maartengrootendorst.com/blog/quantization/"&gt;Maarten Grootendorst&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Notice how some of the values in the integer grid get wasted with AbsMax Quantization.&lt;/p&gt;
&lt;p&gt;But what if your tensor contains &lt;strong&gt;outliers&lt;/strong&gt; - i.e., a very small fraction of values have large magnitudes while most are relatively small?&lt;/p&gt;
&lt;p&gt;If you use min–max or abs-max quantization, these outliers stretch the range of the grid. As a result, most values get mapped to a coarser grid, leading to &lt;strong&gt;higher rounding error&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Fortunately, there are alternatives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Loss-aware quantization&lt;/strong&gt;:&lt;br&gt;
Choose the quantization parameters (e.g., range, scale) to minimize a loss between the floating-point and quantized outputs - such as MSE, cross-entropy, or SQNR.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Range clipping&lt;/strong&gt;:&lt;br&gt;
Instead of covering the full range, clip extreme values (e.g., using percentiles) to reduce the impact of outliers and improve effective resolution.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
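&lt;p&gt;Here is a small sketch that combines both ideas: it tries a range of clipping thresholds and keeps the scale with the lowest mean squared error between the original and fake-quantized values. The tensor, the signed 4-bit grid, and the simple grid search are all assumptions made for illustration; real toolkits use more refined search procedures:&lt;/p&gt;

```python
import math

INT_MIN, INT_MAX = -8, 7   # signed 4-bit grid, z = 0

def fake_quantize(v, s):
    return s * min(max(math.floor(v / s + 0.5), INT_MIN), INT_MAX)

def mse(x, s):
    return sum((v - fake_quantize(v, s)) ** 2 for v in x) / len(x)

def search_scale(x, num_candidates=100):
    """Try clipping thresholds from abs_max/100 up to abs_max; keep the lowest-MSE scale."""
    abs_max = max(abs(v) for v in x)
    candidates = [abs_max * k / num_candidates / INT_MAX for k in range(1, num_candidates + 1)]
    return min(candidates, key=lambda s: mse(x, s))

# Hypothetical tensor: 1001 small values in [-0.5, 0.5] plus a single outlier at 2.0.
x = [0.001 * k for k in range(-500, 501)] + [2.0]

s_absmax = max(abs(v) for v in x) / INT_MAX
s_search = search_scale(x)
print(f"abs-max:  s = {s_absmax:.4f}, MSE = {mse(x, s_absmax):.5f}")
print(f"searched: s = {s_search:.4f}, MSE = {mse(x, s_search):.5f}")
```

&lt;p&gt;On this tensor the searched scale is smaller than the abs-max scale: it accepts some clipping error on the outlier in exchange for a finer grid for everything else.&lt;/p&gt;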
&lt;p&gt;Even these do not guarantee that quantization will work well, but thankfully we have a few more axes to play with. And that&amp;rsquo;s exactly what we&amp;rsquo;ll explore next.&lt;/p&gt;
&lt;h2 id="quantization-categories"&gt;&lt;strong&gt;Quantization Categories&lt;/strong&gt;&lt;a class="headerlink" href="#quantization-categories" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Quantization schemes can be categorized along several axes:&lt;/p&gt;
&lt;h3 id="quantization-mapping-affine-vs-symmetric"&gt;&lt;strong&gt;Quantization Mapping: Affine vs Symmetric&lt;/strong&gt;&lt;a class="headerlink" href="#quantization-mapping-affine-vs-symmetric" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Affine (asymmetric) quantization&lt;/strong&gt; uses a non-zero zero-point &lt;span class="math"&gt;\(z\)&lt;/span&gt;, allowing the quantized range to better align with the floating-point distribution. It is commonly used with unsigned integer grids and provides flexibility by shifting where real zero is represented.&lt;/p&gt;
&lt;p&gt;In contrast, &lt;strong&gt;symmetric quantization&lt;/strong&gt; enforces &lt;span class="math"&gt;\(z = 0\)&lt;/span&gt;, using a range centered around zero. It is typically paired with signed integer grids and works well when the data distribution is approximately zero-centered.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Asymmetric vs Symmetric Quantization" src="/assets/images/symmetric_vs_asymmetric.ppm"&gt;
&lt;em&gt;Symmetric vs Asymmetric Quantization. Source: &lt;a href="https://www.researchgate.net/figure/Examples-of-uniform-quantization-methods-8-bit-a-symmetric-and-b-asymmetric_fig2_387078361"&gt;Research Gate&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Let’s see how this choice affects computation.&lt;/p&gt;
&lt;p&gt;Recall:
&lt;/p&gt;
&lt;div class="math"&gt;$$
A_n = b_n + \sum_m (W_{n,m} \cdot x_m)
$$&lt;/div&gt;
&lt;p&gt;When &lt;span class="math"&gt;\(W_{n,m}\)&lt;/span&gt; and &lt;span class="math"&gt;\(x_m\)&lt;/span&gt; are quantized:&lt;/p&gt;
&lt;div class="math"&gt;$$
W = s_w (W_{\text{int}} - z_w),   \qquad  x = s_x (x_{\text{int}} - z_x)
$$&lt;/div&gt;
&lt;p&gt;Substituting, we get:&lt;/p&gt;
&lt;p&gt;&lt;img alt="MAC with quantization" src="/assets/images/affine_vs_symmetric.png"&gt;
&lt;em&gt;Source: &lt;a href="https://arxiv.org/pdf/2106.08295"&gt;A White Paper On Neural Network Quantization&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The first term matches the symmetric case.&lt;br&gt;
The third and fourth terms depend only on weights, scales, and offsets, so they can be precomputed and folded into the bias at negligible cost.  &lt;/p&gt;
&lt;p&gt;The second term, however, depends on the input &lt;span class="math"&gt;\(x\)&lt;/span&gt;, meaning it must be computed at inference time, adding latency and power overhead.  &lt;/p&gt;
&lt;p&gt;Hence, a common strategy is &lt;strong&gt;asymmetric activations + symmetric weights&lt;/strong&gt;, which avoids this data-dependent cost.&lt;/p&gt;
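&lt;p&gt;The expansion can be checked numerically. The sketch below multiplies out the product of the two dequantized values for one output row and returns the four resulting terms separately, assuming the ordering described above (integer product first, then the input-dependent term, then the two precomputable terms). The function name and the example values are made up:&lt;/p&gt;

```python
def four_terms(W_int, x_int, s_w, z_w, s_x, z_x):
    """Four terms of sum_m s_w (W_int - z_w) * s_x (x_int - z_x) for one output row."""
    t1 = s_w * s_x * sum(w * x for w, x in zip(W_int, x_int))   # integer MAC, as in the symmetric case
    t2 = -s_w * z_w * s_x * sum(x_int)                          # depends on the input: run-time cost
    t3 = -s_w * s_x * z_x * sum(W_int)                          # weights only: can be precomputed
    t4 = s_w * s_x * z_w * z_x * len(W_int)                     # constants only: can be precomputed
    return t1, t2, t3, t4

W_int, x_int = [3, -1, 4], [10, 20, 30]
s_w, s_x, z_x = 0.5, 0.25, 8

# Asymmetric weights (z_w = 2): the four terms add up to the direct computation.
direct = sum(s_w * (w - 2) * s_x * (x - z_x) for w, x in zip(W_int, x_int))
terms = four_terms(W_int, x_int, s_w, 2, s_x, z_x)
print(terms, sum(terms), direct)

# Symmetric weights (z_w = 0): the input-dependent second term vanishes.
print(four_terms(W_int, x_int, s_w, 0, s_x, z_x)[1] == 0)  # True
```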
&lt;h3 id="how-quantization-is-applied-quantization-granularity"&gt;&lt;strong&gt;How Quantization Is Applied (Quantization Granularity)&lt;/strong&gt;&lt;a class="headerlink" href="#how-quantization-is-applied-quantization-granularity" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h3&gt;
&lt;h4 id="per-tensor"&gt;&lt;strong&gt;Per Tensor&lt;/strong&gt;&lt;a class="headerlink" href="#per-tensor" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;One entire tensor shares a single scale (&lt;span class="math"&gt;\(s\)&lt;/span&gt;) and zero-point (&lt;span class="math"&gt;\(z\)&lt;/span&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Effective bits per value:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Assuming the scale and zero-point are stored in 16 bits each, the effective number of bits per value is:&lt;/p&gt;
&lt;div class="math"&gt;$$
\text{effective bits} = b + \frac{16 + 16}{N}
$$&lt;/div&gt;
&lt;p&gt;where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="math"&gt;\(b\)&lt;/span&gt; = quantized bit-width (e.g., 8)&lt;/li&gt;
&lt;li&gt;&lt;span class="math"&gt;\(N\)&lt;/span&gt; = total number of elements in the tensor&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For &lt;span class="math"&gt;\(N = 1024\)&lt;/span&gt; and &lt;span class="math"&gt;\(b = 8\)&lt;/span&gt;:&lt;/p&gt;
&lt;div class="math"&gt;$$
\text{effective bits}
= 8 + \frac{32}{1024}
= 8.03125
$$&lt;/div&gt;
&lt;p&gt;So, 8-bit quantization requires slightly more than 8 bits per value once the scale and zero-point are included.&lt;/p&gt;
&lt;h4 id="per-channel"&gt;&lt;strong&gt;Per Channel&lt;/strong&gt;&lt;a class="headerlink" href="#per-channel" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Each channel has its own scale (&lt;span class="math"&gt;\(s\)&lt;/span&gt;) and zero-point (&lt;span class="math"&gt;\(z\)&lt;/span&gt;).&lt;/p&gt;
&lt;p&gt;Assume a matrix of shape &lt;span class="math"&gt;\((C, K)\)&lt;/span&gt;, where each row (or column) is a channel.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Effective bits per value:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="math"&gt;$$
\text{effective bits}
= b + \frac{C \cdot (16 + 16)}{N}
$$&lt;/div&gt;
&lt;p&gt;where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="math"&gt;\(C\)&lt;/span&gt; = number of channels&lt;/li&gt;
&lt;li&gt;&lt;span class="math"&gt;\(N = C \cdot K\)&lt;/span&gt; = total number of elements&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For a matrix with &lt;span class="math"&gt;\(C = 64\)&lt;/span&gt; channels and &lt;span class="math"&gt;\(K = 16\)&lt;/span&gt; values per channel:&lt;/p&gt;
&lt;div class="math"&gt;$$
N = 64 \cdot 16 = 1024
$$&lt;/div&gt;
&lt;div class="math"&gt;$$
\text{effective bits}
= 8 + \frac{64 \cdot 32}{1024}
= 10
$$&lt;/div&gt;
&lt;h4 id="per-block"&gt;&lt;strong&gt;Per Block&lt;/strong&gt;&lt;a class="headerlink" href="#per-block" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A block of values shares a single scale (&lt;span class="math"&gt;\(s\)&lt;/span&gt;) and zero-point (&lt;span class="math"&gt;\(z\)&lt;/span&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Effective bits per value:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="math"&gt;$$
\text{effective bits}
= b + \frac{16 + 16}{B}
$$&lt;/div&gt;
&lt;p&gt;where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="math"&gt;\(b\)&lt;/span&gt; = quantized bit-width (e.g., 8)&lt;/li&gt;
&lt;li&gt;&lt;span class="math"&gt;\(B\)&lt;/span&gt; = block size&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For a block size of &lt;span class="math"&gt;\(B = 64\)&lt;/span&gt; and &lt;span class="math"&gt;\(b = 8\)&lt;/span&gt;:&lt;/p&gt;
&lt;div class="math"&gt;$$
\text{effective bits}
= 8 + \frac{32}{64}
= 8.5
$$&lt;/div&gt;
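&lt;p&gt;The three formulas differ only in how many (scale, zero-point) pairs are stored, so they can be folded into one hypothetical helper that reproduces the worked examples above:&lt;/p&gt;

```python
def effective_bits(bits, num_elements, num_groups, param_bits=16 + 16):
    """Bits per value once each group's scale and zero-point are counted."""
    return bits + num_groups * param_bits / num_elements

print(effective_bits(8, 1024, num_groups=1))           # per-tensor: 8.03125
print(effective_bits(8, 1024, num_groups=64))          # per-channel, C = 64: 10.0
print(effective_bits(8, 1024, num_groups=1024 // 64))  # per-block, B = 64: 8.5
```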
&lt;p&gt;In general, finer-grained schemes (more parameters) yield better accuracy but introduce additional storage and compute overhead. Per-channel quantization generally offers a good tradeoff for weights, while per-block quantization is applied to particularly sensitive weights. However, per-channel quantization for activations is hardware-inefficient and the reason becomes clear once you expand the MAC computation.&lt;/p&gt;
&lt;p&gt;With per-channel quantization, weights and activations each have their own per-channel scale vectors:&lt;/p&gt;
&lt;div class="math"&gt;$$
\mathbf{s}_w = [s_{w_1}, s_{w_2}, \ldots, s_{w_n}], \qquad \mathbf{s}_x = [s_{x_1}, s_{x_2}, \ldots, s_{x_m}]
$$&lt;/div&gt;
&lt;p&gt;Substituting the dequantized values into the accumulator equation (taking &lt;span class="math"&gt;\(z = 0\)&lt;/span&gt; for simplicity, with &lt;span class="math"&gt;\(W_{\text{int}}\)&lt;/span&gt; and &lt;span class="math"&gt;\(x_{\text{int}}\)&lt;/span&gt; denoting the integer values):&lt;/p&gt;
&lt;div class="math"&gt;$$
A_n = b_n + \sum_{m} \left( s_{w_n} \cdot W_{\text{int}_{n,m}} \right) \cdot \left( s_{x_m} \cdot x_{\text{int}_m} \right)
$$&lt;/div&gt;
&lt;div class="math"&gt;$$
A_n = b_n + s_{w_n} \sum_{m} W_{\text{int}_{n,m}} \cdot x_{\text{int}_m} \cdot s_{x_m}
$$&lt;/div&gt;
&lt;p&gt;Notice that &lt;span class="math"&gt;\(s_{w_n}\)&lt;/span&gt; does not depend on &lt;span class="math"&gt;\(m\)&lt;/span&gt;, so it can be factored out of the summation and applied as a single scalar multiply after accumulation. This is cheap and hardware-friendly.
&lt;span class="math"&gt;\(s_{x_m}\)&lt;/span&gt;, however, depends on &lt;span class="math"&gt;\(m\)&lt;/span&gt;, which means it cannot be factored out. Instead, each product &lt;span class="math"&gt;\(W_{\text{int}_{n,m}} \cdot x_{\text{int}_m}\)&lt;/span&gt; must be rescaled by a different &lt;span class="math"&gt;\(s_{x_m}\)&lt;/span&gt; before accumulation, which breaks the pure integer accumulation. This is why &lt;strong&gt;per-channel quantization is standard for weights but avoided for activations&lt;/strong&gt;.&lt;/p&gt;
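&lt;p&gt;A small sketch of the hardware-friendly case, with per-channel weight scales and a single per-tensor activation scale. It assumes symmetric quantization with zero-points of 0 and no bias; the function name and values are illustrative:&lt;/p&gt;

```python
def int_matvec_per_channel_weights(W_int, x_int, s_w, s_x):
    """Rows of W_int are output channels; s_w[n] scales row n, s_x is one per-tensor scale."""
    out = []
    for n, row in enumerate(W_int):
        acc = sum(w * x for w, x in zip(row, x_int))   # pure integer accumulation
        out.append(s_w[n] * s_x * acc)                 # one rescale per output, after the sum
    return out

W_int = [[1, -2], [3, 4]]
x_int = [5, 6]
s_w, s_x = [0.5, 0.25], 0.125

result = int_matvec_per_channel_weights(W_int, x_int, s_w, s_x)

# Floating-point reference: dequantize first, then multiply.
W = [[s_w[n] * w for w in row] for n, row in enumerate(W_int)]
x = [s_x * v for v in x_int]
reference = [sum(w * v for w, v in zip(row, x)) for row in W]
print(result, reference)  # [-0.4375, 1.21875] [-0.4375, 1.21875]
```

&lt;p&gt;With a separate scale per input element, the rescale would have to move inside the inner sum, one multiply per product.&lt;/p&gt;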
&lt;h4 id="based-on-how-activations-are-quantized"&gt;&lt;strong&gt;Based on how activations are quantized:&lt;/strong&gt;&lt;a class="headerlink" href="#based-on-how-activations-are-quantized" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Since weights are always known before inference, they can be quantized offline. Activations, however, depend on the input to the model, and there are two ways of quantizing them.&lt;/p&gt;
&lt;p&gt;In &lt;strong&gt;static quantization&lt;/strong&gt;, the min and max ranges of activations are calculated ahead of time using a small subset of the training data called a calibration dataset, which typically contains around 300-500 samples.&lt;/p&gt;
&lt;p&gt;In &lt;strong&gt;dynamic quantization,&lt;/strong&gt; the min-max ranges are calculated on the fly during inference. This is useful in situations where the model execution time is dominated by loading weights from memory rather than computing the matrix multiplications. &lt;/p&gt;
&lt;p&gt;For integer quantization, static quantization is most common.&lt;/p&gt;
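&lt;p&gt;The difference between the two comes down to when the scale is computed. The sketch below uses abs-max scaling on an 8-bit signed grid, with made-up calibration batches and a made-up runtime input:&lt;/p&gt;

```python
def absmax_scale(values, bits=8):
    """Symmetric scale so that the largest magnitude maps to the top of the signed grid."""
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)

# Static: the scale is fixed ahead of time from a (hypothetical) calibration set.
calibration_batches = [[0.2, -1.0, 0.5], [0.9, -0.3, 1.27]]
static_scale = absmax_scale([v for batch in calibration_batches for v in batch])

# Dynamic: the scale is recomputed from each incoming activation tensor at inference time.
def dynamic_scale(activations):
    return absmax_scale(activations)

runtime_input = [0.1, -2.54, 0.7]
print(f"static scale:  {static_scale:.4f}")
print(f"dynamic scale: {dynamic_scale(runtime_input):.4f}")
```

&lt;p&gt;Here the runtime input contains a value larger than anything seen during calibration. The static scale would clip it, while the dynamic scale adapts at the cost of an extra pass over the activations on every inference.&lt;/p&gt;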
&lt;h4 id="based-on-when-it-is-applied-during-training-or-inference"&gt;&lt;strong&gt;Based on when it is applied: During Training or Inference?&lt;/strong&gt;&lt;a class="headerlink" href="#based-on-when-it-is-applied-during-training-or-inference" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Quantization aware training (QAT)&lt;/strong&gt; simulates inference-time quantization during training by using quantized weights and activations in the forward pass, while using full precision weights in the backward pass. The idea is to emulate the inference time quantization error by using quantized values in the forward pass, and allowing the backward pass to update the full-precision weights such that they become more robust to quantization. This approach leads to better accuracy preservation as the model learns to adapt to the precision loss during training.&lt;/p&gt;
&lt;p&gt;&lt;img alt="An illustration of forward and backward pass for QAT" src="/assets/images/qat.webp"&gt;&lt;br&gt;
&lt;em&gt; Quantization Aware Training. Source: &lt;a href="https://www.researchgate.net/publication/351925867_Quantization_and_Deployment_of_Deep_Neural_Networks_on_Microcontrollers"&gt;Quantization and Deployment of Deep Neural Networks on Microcontrollers&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
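&lt;p&gt;A toy, single-weight sketch of this loop is shown below. It assumes the straight-through estimator, a common way to pass gradients through the rounding step by treating it as the identity in the backward pass; the post above does not name a specific method, and all names and numbers here are illustrative:&lt;/p&gt;

```python
import math

def fake_quantize(w, s=0.25, int_min=-8, int_max=7):
    return s * min(max(math.floor(w / s + 0.5), int_min), int_max)

def qat_step(w, x, target, lr=0.1):
    w_q = fake_quantize(w)        # forward pass uses the quantized weight
    y = w_q * x
    grad_y = 2 * (y - target)     # gradient of the squared error with respect to y
    grad_w = grad_y * x           # straight-through: treat d(w_q)/dw as 1
    return w - lr * grad_w        # update the full-precision weight

w = 0.9
for _ in range(20):
    w = qat_step(w, x=1.0, target=0.5)
print(w, fake_quantize(w))
```

&lt;p&gt;The full-precision weight keeps moving until its quantized version reaches the grid point 0.5 that fits the target, after which the gradient is zero and the weight stops changing.&lt;/p&gt;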
&lt;p&gt;Unlike quantization aware training, where the model learns to adjust to lower precision during training, &lt;strong&gt;post-training quantization (PTQ)&lt;/strong&gt; directly applies this reduction in bit width after the model has been trained. While it may not offer the same level of accuracy preservation as quantization aware training, it is widely used in practice since it doesn’t require retraining the model.&lt;/p&gt;
&lt;p&gt;In this series, we will primarily focus on post-training quantization (PTQ) techniques.&lt;/p&gt;
&lt;h2 id="enough-simulation-how-do-integer-models-actually-run-on-the-mac"&gt;&lt;strong&gt;Enough simulation — how do integer models actually run on the MAC?&lt;/strong&gt;&lt;a class="headerlink" href="#enough-simulation-how-do-integer-models-actually-run-on-the-mac" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img alt="Meme" src="/assets/images/Matrix-gif.webp"&gt;&lt;/p&gt;
&lt;p&gt;The weights and activations are loaded in low precision (e.g., int8 × int8, or int8 × int16 on some hardware). Since inputs are quantized, we move less data. Each cycle performs the multiply (e.g., 8×8 or 8×16), and the results are accumulated in higher precision, typically &lt;strong&gt;int32&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;So does the next layer receive 32-bit values? Not quite. We &lt;strong&gt;requantize&lt;/strong&gt; the accumulator back to low precision before passing it on.&lt;/p&gt;
&lt;p&gt;&lt;img alt="MAC with dequantize op" src="/assets/images/MAC_quant.png"&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://arxiv.org/pdf/2106.08295"&gt;A White Paper On Neural Network Quantization&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h4 id="requantization"&gt;&lt;strong&gt;Requantization&lt;/strong&gt;&lt;a class="headerlink" href="#requantization" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Recall that on the MAC, the accumulator computes:&lt;/p&gt;
&lt;div class="math"&gt;$$
A_n = b_n + \sum_{m} C_{n,m} = b_n + \sum_{m} W_{n,m} \cdot x_m
$$&lt;/div&gt;
&lt;p&gt;When weights and activations are quantized, the real values are recovered via dequantization:&lt;/p&gt;
&lt;div class="math"&gt;$$
W_{n,m} = s_w \left( W_{\text{int}_{n,m}} - z_w \right), \qquad x_m = s_x \left( x_{\text{int}_m} - z_x \right)
$$&lt;/div&gt;
&lt;p&gt;Substituting into the accumulator:&lt;/p&gt;
&lt;div class="math"&gt;$$
A_n = b_n + \sum_{m} s_w \left( W_{\text{int}_{n,m}} - z_w \right) \cdot s_x \left( x_{\text{int}_m} - z_x \right)
$$&lt;/div&gt;
&lt;div class="math"&gt;$$
A_n = b_n + s_w s_x \sum_{m} \left( W_{\text{int}_{n,m}} - z_w \right)\left( x_{\text{int}_m} - z_x \right)
$$&lt;/div&gt;
&lt;p&gt;The hardware computes the inner sum entirely in integer arithmetic, accumulating into a high-precision int32 register:&lt;/p&gt;
&lt;div class="math"&gt;$$
A_{\text{int}_n} = \sum_{m} \left( W_{\text{int}_{n,m}} - z_w \right)\left( x_{\text{int}_m} - z_x \right)
$$&lt;/div&gt;
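&lt;p&gt;Setting the bias aside for a moment (in practice it is usually quantized to int32 with scale &lt;span class="math"&gt;\(s_w s_x\)&lt;/span&gt; so that it can be added directly into the same accumulator), the real-valued accumulator output is simply:&lt;/p&gt;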
&lt;div class="math"&gt;$$
A_n \approx s_w s_x \cdot A_{\text{int}_n}
$$&lt;/div&gt;
&lt;p&gt;Now, this output needs to be passed to the next layer which expects quantized inputs with its own scale &lt;span class="math"&gt;\(s_y\)&lt;/span&gt; and zero-point &lt;span class="math"&gt;\(z_y\)&lt;/span&gt;. So we need to quantize &lt;span class="math"&gt;\(A_n\)&lt;/span&gt;:&lt;/p&gt;
&lt;div class="math"&gt;$$
y_{\text{int}_n} = \left\lfloor \frac{A_n}{s_y} \right\rceil + z_y = \left\lfloor \frac{s_w s_x}{s_y} \cdot A_{\text{int}_n} \right\rceil + z_y
$$&lt;/div&gt;
&lt;p&gt;Defining the &lt;strong&gt;requantization scale&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="math"&gt;$$
M = \frac{s_w s_x}{s_y}
$$&lt;/div&gt;
&lt;p&gt;The full requantization step becomes:&lt;/p&gt;
&lt;div class="math"&gt;$$
y_{\text{int}_n} = \mathrm{clamp}\!\left( \left\lfloor M \cdot A_{\text{int}_n} \right\rceil + z_y,\ y_{\text{int}}^{\min},\ y_{\text{int}}^{\max} \right)
$$&lt;/div&gt;
&lt;p&gt;This is the only floating-point operation in the pipeline - multiplying the int32 accumulator by &lt;span class="math"&gt;\(M\)&lt;/span&gt; to rescale it into the next layer&amp;rsquo;s quantization range.&lt;/p&gt;
&lt;p&gt;In practice, the multiplication by &lt;span class="math"&gt;\(M\)&lt;/span&gt; is itself implemented in fixed-point arithmetic (an integer multiply followed by a bit shift), which keeps the entire pipeline in integer arithmetic. See &lt;a href="https://arxiv.org/abs/1712.05877"&gt;Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference&lt;/a&gt;, Section 2.2, for the full derivation.&lt;/p&gt;
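&lt;p&gt;A rough sketch of the fixed-point idea, assuming &lt;span class="math"&gt;\(0 \lt M \lt 1\)&lt;/span&gt; and a 31-bit fractional multiplier. The function names are hypothetical, and division by a power of two stands in for the right shift that hardware would use (Python floor division matches an arithmetic right shift). This illustrates the multiply-then-shift scheme only; it is not the exact routine from the paper:&lt;/p&gt;

```python
import math


def quantize_multiplier(M, bits=31):
    """Split a real M in (0, 1) into (M0_int, shift) with M ~= M0_int * 2**-(bits + shift)."""
    mantissa, exponent = math.frexp(M)   # M = mantissa * 2**exponent, mantissa in [0.5, 1)
    M0_int = round(mantissa * 2**bits)   # mantissa stored as an integer
    if M0_int == 2**bits:                # mantissa rounded up to 1.0: renormalise
        M0_int //= 2
        exponent += 1
    return M0_int, -exponent


def fixed_point_multiply(a_int, M0_int, shift, bits=31):
    """Integer-only round(a_int * M): multiply, add half a unit, divide by a power of two."""
    total_shift = bits + shift
    rounding = 2 ** (total_shift - 1)    # adding this gives round-half-up after the division
    return (a_int * M0_int + rounding) // 2 ** total_shift


M0_int, shift = quantize_multiplier(0.01)   # hypothetical M = s_w * s_x / s_y
print(M0_int, shift)
print(fixed_point_multiply(7300, M0_int, shift))
```

&lt;p&gt;Both &lt;code&gt;M0_int&lt;/code&gt; and &lt;code&gt;shift&lt;/code&gt; depend only on the scales, so they can be computed once offline; at inference time only the integer multiply and the shift remain.&lt;/p&gt;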
&lt;h2 id="conclusion"&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;a class="headerlink" href="#conclusion" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;That was a lot to cover and, if you made it this far, well done! We looked at why quantization matters (memory, energy, throughput), how it works mathematically (the quantization and dequantization pipeline, and the tradeoff between rounding error and clipping error), and the key design choices (symmetric vs asymmetric, per-tensor vs per-channel vs per-block, static vs dynamic, PTQ vs QAT). We also saw why certain combinations, such as symmetric weights with asymmetric activations or per-tensor activation quantization, are preferred in practice.&lt;/p&gt;
&lt;p&gt;All of this assumes the weight and activation distributions are reasonably well-behaved. In the next post, we&amp;rsquo;ll see why transformers break that assumption and why that makes quantizing them surprisingly hard.&lt;/p&gt;
&lt;h2 id="references-further-reading"&gt;&lt;strong&gt;References &amp;amp; Further Reading&lt;/strong&gt;&lt;a class="headerlink" href="#references-further-reading" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2106.08295"&gt;A White Paper on Neural Network Quantization from Qualcomm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.maartengrootendorst.com/blog/quantization/"&gt;A Visual Guide to Quantization by Maarten Grootendorst&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/@OscarSavolainen/videos"&gt;Oscar Savolainen&amp;rsquo;s videos on Quantization in Pytorch&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;script type="text/javascript"&gt;if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width &lt; 768) ? "left" : align;
        indent = (screen.width &lt; 768) ? "0em" : indent;
        linebreak = (screen.width &lt; 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
&lt;/script&gt;</content><category term="Quantization"/><category term="Quantization"/><category term="Integer Quantization"/><category term="Deep Learning"/><category term="Inference"/><category term="Model Compression"/></entry></feed>