QLoRA — Mathematical Formulas

Stage 1 — NF4 Quantization

Each weight block is scaled by its absolute maximum, then each weight is mapped to the nearest level on the NF4 grid.

Step 1.1 — Compute the 32-bit absmax constant

c^{(1)}_b = \max_{i \in \text{block}_b} \left| w_i \right| \quad \in \mathbb{R}^{fp32}

Where $b$ is the block index and $w_i$ are the original fp32 weights in that block.

Step 1.2 — Normalize weights into [−1, +1]

\hat{w}_i = \frac{w_i}{c^{(1)}_b}

Step 1.3 — The NF4 grid

NF4 uses 16 non-uniformly spaced levels derived from the quantiles of a standard normal distribution $\mathcal{N}(0, 1)$ and rescaled to $[-1, 1]$:

\mathcal{Q}_{NF4} = \left\{ q_k \right\}_{k=0}^{15} = \left\{ -1,\; -0.6962,\; -0.5251,\; \ldots,\; 0,\; \ldots,\; 0.5626,\; 0.7230,\; 1 \right\}

The levels are denser near zero, matching where neural network weights concentrate. They are not mirror-symmetric: the grid has 7 negative levels, an exact zero, and 8 positive levels.

Step 1.4 — Quantize each weight to a 4-bit index

q_i = \underset{k \in \{0,\ldots,15\}}{\arg\min} \left| \hat{w}_i - q_k \right|

The result $q_i \in \{0, 1, \ldots, 15\}$ fits in 4 bits.
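Steps 1.1 to 1.4 can be sketched for a single block in plain Python. This is a minimal illustration, not a reference implementation: the name `quantize_block` is hypothetical, the level table holds the published NF4 values rounded to 7 decimals, and the handling of an all-zero block is an assumption the text does not cover.

```python
# NF4 levels rounded to 7 decimals (index 0..15, index 7 is the exact zero).
NF4_LEVELS = [
    -1.0, -0.6961928, -0.5250731, -0.3949175, -0.2844414, -0.1847734,
    -0.0910500, 0.0, 0.0795803, 0.1609302, 0.2461123, 0.3379152,
    0.4407098, 0.5626170, 0.7229568, 1.0,
]


def quantize_block(weights):
    """Return (absmax, indices) for one block of float weights."""
    absmax = max(abs(w) for w in weights)            # Step 1.1
    if absmax == 0.0:
        # Assumption: an all-zero block maps every weight to the zero level.
        return 0.0, [NF4_LEVELS.index(0.0)] * len(weights)
    indices = []
    for w in weights:
        w_hat = w / absmax                           # Step 1.2, in [-1, 1]
        # Step 1.4: index of the nearest grid level
        k = min(range(16), key=lambda j: abs(w_hat - NF4_LEVELS[j]))
        indices.append(k)
    return absmax, indices


absmax, idx = quantize_block([0.12, -0.87, 0.40, 0.0])
print(absmax, idx)
```

For this toy block the absmax is 0.87 and the indices come out as `[9, 0, 12, 7]`: the largest-magnitude weight lands on an end of the grid and the zero weight lands on the exact-zero level.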

Stage 1 storage

What	Precision	Per block
Weight indices $q_i$	4-bit uint	$n \times 0.5$ bytes
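Absmax constant $c^{(1)}_b$	fp32	4 bytes

Where $n$ is the number of weights in the block (the block size, written $B_s$ in the memory formulas).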
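The figure of 0.5 bytes per weight comes from storing two 4-bit indices in each byte. A minimal sketch of that packing follows; the function names and the nibble order (first index in the high nibble) are assumptions made for illustration, and real kernels may lay the bits out differently.

```python
def pack_nibbles(indices):
    """Pack 4-bit indices (0..15) two per byte, first index in the high nibble."""
    indices = list(indices)
    if len(indices) % 2:
        indices.append(0)            # pad odd-length input with a zero index
    return bytes((hi << 4) | lo for hi, lo in zip(indices[0::2], indices[1::2]))


def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles; `count` drops a padding nibble if one was added."""
    out = []
    for byte in packed:
        out.append(byte >> 4)        # high nibble
        out.append(byte & 0x0F)      # low nibble
    return out[:count]


packed = pack_nibbles([9, 0, 12, 7])
print(packed.hex(), unpack_nibbles(packed, 4))
```

Four indices occupy two bytes here, and a 64-weight block would occupy 32 bytes plus its constant.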

Stage 2 — Double Quantization

The fp32 absmax constants $c^{(1)}_b$ cost 32 bits per block, an overhead that grows as blocks get smaller and models get larger. Stage 2 quantizes them to 8 bits using a single shared super-constant.

Step 2.1 — Compute the 32-bit super-constant

c^{(2)} = \max_b \left| c^{(1)}_b \right| \quad \in \mathbb{R}^{fp32}

One scalar, kept exact.

Step 2.2 — Normalize absmax constants

\hat{c}^{(1)}_b = \frac{c^{(1)}_b}{c^{(2)}} \in [0, 1]

Step 2.3 — Quantize constants to 8-bit

\tilde{c}^{(1)}_b = \text{round}\!\left( \hat{c}^{(1)}_b \times 255 \right) \quad \in \{0, \ldots, 255\}

The result fits in 8 bits (uint8).
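Steps 2.1 to 2.3 amount to a linear 8-bit quantization of the per-block constants. The sketch below assumes the single shared super-constant described above; `quantize_constants` and `dequantize_constants` are hypothetical names, and the all-zero case is an added assumption.

```python
def quantize_constants(absmax_per_block):
    """Return (super_constant, uint8 codes) for a list of block absmax values."""
    c2 = max(abs(c) for c in absmax_per_block)             # Step 2.1
    if c2 == 0.0:
        # Assumption: an all-zero tensor gets all-zero codes.
        return 0.0, [0] * len(absmax_per_block)
    # Steps 2.2 and 2.3: normalize into [0, 1], then round onto 0..255.
    codes = [int(round(c / c2 * 255)) for c in absmax_per_block]
    return c2, codes


def dequantize_constants(c2, codes):
    """Map uint8 codes back to approximate absmax values."""
    return [code / 255 * c2 for code in codes]


c2, codes = quantize_constants([0.87, 0.43, 0.05])
recovered = dequantize_constants(c2, codes)
print(codes, recovered)
```

The codes here are `[255, 126, 15]`. The absolute error of each recovered constant is at most $c^{(2)} / 510$, so blocks whose absmax is much smaller than the super-constant lose the most relative precision.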

Stage 2 storage

What	Precision	Total
Weight indices $q_i$	4-bit uint	$N \times 0.5$ bytes
Quantized constants $\tilde{c}^{(1)}_b$	uint8	$B \times 1$ bytes
Super-constant $c^{(2)}$	fp32	4 bytes

Where $N$ = total weights, $B$ = number of blocks.

Decompression — Full Reconstruction

To use the weights during the forward pass, reconstruct in three steps:

Step D.1 — Recover the absmax constant

c^{(1)}_b \approx \frac{\tilde{c}^{(1)}_b}{255} \times c^{(2)}

Step D.2 — Look up the NF4 level

\hat{w}_i \approx \mathcal{Q}_{NF4}\!\left[ q_i \right]

Step D.3 — Rescale to original magnitude

\boxed{w_i \approx \mathcal{Q}_{NF4}\!\left[ q_i \right] \times \frac{\tilde{c}^{(1)}_b}{255} \times c^{(2)}}
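The boxed formula translates almost line for line into code. The sketch below reconstructs one block from its 4-bit indices, its 8-bit constant code and the super-constant; `dequantize_block` is a hypothetical name, and the level table again holds the published NF4 values rounded to 7 decimals.

```python
# NF4 levels rounded to 7 decimals (index 0..15, index 7 is the exact zero).
NF4_LEVELS = [
    -1.0, -0.6961928, -0.5250731, -0.3949175, -0.2844414, -0.1847734,
    -0.0910500, 0.0, 0.0795803, 0.1609302, 0.2461123, 0.3379152,
    0.4407098, 0.5626170, 0.7229568, 1.0,
]


def dequantize_block(indices, code, c2):
    """Reconstruct approximate weights for one block."""
    absmax = code / 255 * c2                       # Step D.1
    # Steps D.2 and D.3: look up each level, then rescale by the block absmax.
    return [NF4_LEVELS[q] * absmax for q in indices]


weights = dequantize_block([15, 7, 0], code=128, c2=2.0)
print(weights)
```

Index 7 always reconstructs to exactly zero, while indices 0 and 15 reconstruct to minus and plus the recovered absmax.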

Full Pipeline Summary

\underbrace{w_i}_{\text{fp32}} \xrightarrow{\div\, c^{(1)}_b} \underbrace{\hat{w}_i}_{\in [-1,1]} \xrightarrow{\text{NF4 snap}} \underbrace{q_i}_{\text{4-bit}} \quad+\quad \underbrace{\tilde{c}^{(1)}_b}_{\text{uint8}} \quad+\quad \underbrace{c^{(2)}}_{\text{fp32}}

\xrightarrow{\text{decompress}} w_i \approx \mathcal{Q}_{NF4}[q_i] \times \frac{\tilde{c}^{(1)}_b}{255} \times c^{(2)}

Numerical Example (w = 0.12, block absmax = 0.87)

Step	Operation	Value
Original weight	$w$	$0.1200$
32-bit constant	$c^{(1)}_0$	$0.8700$
Normalized	$\hat{w} = 0.12 \div 0.87$	$0.1379$
NF4 snap	nearest level → index 9	$\mathcal{Q}_{NF4}[9] \approx 0.1609$
Super-constant	$c^{(2)}$ (single block, so $c^{(2)} = c^{(1)}_0$)	$0.8700$
8-bit constant	$\tilde{c}^{(1)}_0$	$255$
Reconstructed	$0.1609 \times (255/255) \times 0.87$	$\approx 0.1400$
Absolute error	$\left| 0.1200 - 0.1400 \right|$	$\approx 0.0200$
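The single-weight example can be recomputed end to end as a check. The sketch below assumes the published NF4 level table (rounded to 7 decimals) and a single block, so the super-constant equals the block absmax; all variable names are illustrative.

```python
# NF4 levels rounded to 7 decimals (index 0..15, index 7 is the exact zero).
NF4_LEVELS = [
    -1.0, -0.6961928, -0.5250731, -0.3949175, -0.2844414, -0.1847734,
    -0.0910500, 0.0, 0.0795803, 0.1609302, 0.2461123, 0.3379152,
    0.4407098, 0.5626170, 0.7229568, 1.0,
]

w, absmax = 0.12, 0.87
w_hat = w / absmax                                   # normalized weight
q = min(range(16), key=lambda k: abs(w_hat - NF4_LEVELS[k]))   # NF4 snap
c2 = absmax                                          # single block assumed
code = int(round(absmax / c2 * 255))                 # 8-bit constant
w_rec = NF4_LEVELS[q] * code / 255 * c2              # reconstruction
error = abs(w - w_rec)

print(f"w_hat={w_hat:.4f} q={q} level={NF4_LEVELS[q]:.4f} "
      f"code={code} w_rec={w_rec:.4f} error={error:.4f}")
```

Because the block absmax coincides with the super-constant here, the 8-bit step introduces no error and the whole discrepancy comes from snapping to the 4-bit grid.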

Memory Savings Formula

For a weight matrix with $N$ parameters and block size $B_s$, giving $B = N / B_s$ blocks:

\text{Mem}_{fp32} = 4N \text{ bytes}

\text{Mem}_{stage1} = \frac{N}{2} + 4B \text{ bytes}

\text{Mem}_{stage2} = \frac{N}{2} + B + 4 \text{ bytes}

\text{Savings} = 1 - \frac{N/2 + B + 4}{4N} = \frac{7}{8} - \frac{1}{4 B_s} - \frac{1}{N} \approx 87.5\% \text{ for large } N \text{ and } B_s
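Plugging concrete numbers into these formulas shows how much the per-block constants matter. The helper below is a small sketch with a hypothetical name, `qlora_memory`; the 4096 by 4096 matrix and the block size of 64 are example values chosen for illustration, not figures from the text.

```python
def qlora_memory(n_params, block_size):
    """Byte counts for fp32, Stage 1 and Stage 2 storage, plus Stage 2 savings."""
    n_blocks = -(-n_params // block_size)        # ceiling division
    packed = (n_params + 1) // 2                 # two 4-bit indices per byte
    fp32 = 4 * n_params
    stage1 = packed + 4 * n_blocks               # fp32 constant per block
    stage2 = packed + n_blocks + 4               # uint8 per block + one fp32
    return {
        "fp32": fp32,
        "stage1": stage1,
        "stage2": stage2,
        "savings": 1 - stage2 / fp32,
    }


mem = qlora_memory(4096 * 4096, 64)
print(mem)
```

For these example values, Stage 1 needs 9,437,184 bytes and Stage 2 needs 8,650,756 bytes against 67,108,864 bytes in fp32, a saving of about 87.1%. The gap to 87.5% is the $1/(4 B_s)$ term, which does not vanish as $N$ grows.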