
QLoRA

QLoRA — Mathematical Formulas


Stage 1 — NF4 Quantization

Each weight block is scaled by its absolute maximum, then each weight is mapped to the nearest level on the NF4 grid.

Step 1.1 — Compute the 32-bit absmax constant

$$c^{(1)}_b = \max_{i \in \text{block}_b} \left| w_i \right| \quad \in \mathbb{R}^{fp32}$$

where $b$ is the block index and $w_i$ are the original fp32 weights in that block.

Step 1.2 — Normalize weights into [−1, +1]

$$\hat{w}_i = \frac{w_i}{c^{(1)}_b}$$
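Steps 1.1 and 1.2 can be sketched in a few lines of plain Python. The function name `normalize_block` and the sample block are illustrative assumptions, not part of any library; the all-zero guard is also an added assumption, since the formulas above do not cover that case.

```python
def normalize_block(block):
    """Return (absmax, normalized weights) for one block of floats."""
    # Step 1.1: the absmax constant c_b^(1) for this block.
    absmax = max(abs(w) for w in block)
    # Assumed edge case: an all-zero block would divide by zero.
    if absmax == 0.0:
        return 0.0, [0.0] * len(block)
    # Step 1.2: divide every weight by the absmax so it lands in [-1, +1].
    return absmax, [w / absmax for w in block]


# Hypothetical 4-weight block (real blocks are much larger).
block = [0.12, -0.87, 0.30, -0.05]
absmax, normalized = normalize_block(block)
print(absmax)       # 0.87
print(normalized)   # the weight with the largest magnitude maps to -1.0
```

The weight with the largest magnitude always maps to exactly $\pm 1$, which is why the NF4 grid needs levels at both endpoints.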

Step 1.3 — The NF4 grid

NF4 uses 16 non-linear levels derived from the quantiles of a standard normal distribution $\mathcal{N}(0, 1)$:

$$\mathcal{Q}_{NF4} = \left\{ q_k \right\}_{k=0}^{15} = \left\{ -1,\; -0.6962,\; -0.5251,\; \ldots,\; 0.7230,\; 1 \right\}$$

The levels are denser near zero, matching where neural network weights concentrate. The grid is not exactly symmetric: it contains an exact zero, with seven negative and eight positive levels, so the positive levels are not mirror images of the negative ones.

Step 1.4 — Quantize each weight to a 4-bit index

$$q_i = \underset{k \in \{0,\ldots,15\}}{\arg\min} \left| \hat{w}_i - q_k \right|$$

The result $q_i \in \{0, 1, \ldots, 15\}$ fits in 4 bits.
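The nearest-level search in Step 1.4 can be sketched as follows. The level values are the published NF4 constants rounded to seven decimals; the names `NF4_LEVELS` and `quantize_nf4` are illustrative, and the linear scan is chosen for clarity, not speed.

```python
# The 16 NF4 levels (index 0..15), rounded to seven decimals.
NF4_LEVELS = [
    -1.0, -0.6961928, -0.5250731, -0.3949175,
    -0.2844414, -0.1847734, -0.0910500, 0.0,
    0.0795803, 0.1609302, 0.2461123, 0.3379152,
    0.4407098, 0.5626170, 0.7229568, 1.0,
]


def quantize_nf4(w_hat):
    """Return the index k of the NF4 level closest to a normalized weight."""
    return min(range(16), key=lambda k: abs(w_hat - NF4_LEVELS[k]))


indices = [quantize_nf4(x) for x in (-1.0, 0.0, 0.1379, 1.0)]
print(indices)  # [0, 7, 9, 15]
```

Each index fits in 4 bits, so two of them can be packed into a single byte for storage.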

Stage 1 storage

| What | Precision | Per block |
|---|---|---|
| Weight indices $q_i$ | 4-bit uint | $n \times 0.5$ bytes |
| Absmax constant $c^{(1)}_b$ | fp32 | 4 bytes |

Here $n$ is the number of weights in the block.

Stage 2 — Double Quantization

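The fp32 absmax constants $c^{(1)}_b$ add up at scale: each block carries 32 extra bits. Stage 2 quantizes them to 8 bits. In the simplified scheme presented here, all blocks share a single super-constant; the QLoRA paper instead groups the constants into blocks of 256, each with its own fp32 constant.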

Step 2.1 — Compute the 32-bit super-constant

$$c^{(2)} = \max_b \left| c^{(1)}_b \right| \quad \in \mathbb{R}^{fp32}$$

One scalar, kept exact.

Step 2.2 — Normalize absmax constants

$$\hat{c}^{(1)}_b = \frac{c^{(1)}_b}{c^{(2)}} \in [0, 1]$$

Step 2.3 — Quantize constants to 8-bit

$$\tilde{c}^{(1)}_b = \text{round}\!\left( \hat{c}^{(1)}_b \times 255 \right) \quad \in \{0, \ldots, 255\}$$

The result fits in 8 bits (uint8).
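Steps 2.1 to 2.3 can be sketched as follows, assuming the single shared super-constant described above. The function name `double_quantize` and the three sample constants are hypothetical.

```python
def double_quantize(absmax_constants):
    """Quantize per-block absmax constants to 8-bit codes (0..255)."""
    # Step 2.1: one fp32 super-constant shared by every block.
    c2 = max(abs(c) for c in absmax_constants)
    # Assumed edge case: every block is all zeros.
    if c2 == 0.0:
        return 0.0, [0] * len(absmax_constants)
    # Steps 2.2 and 2.3: normalize into [0, 1], then round onto 0..255.
    # Note: Python's round() sends exact .5 ties to the nearest even integer.
    codes = [round(c / c2 * 255) for c in absmax_constants]
    return c2, codes


# Hypothetical absmax constants for three blocks.
c2, codes = double_quantize([0.87, 0.29, 0.10])
print(c2, codes)  # 0.87 [255, 85, 29]
```

The block with the largest absmax always receives code 255, so its constant is recovered exactly on decompression; the other blocks pick up a small rounding error.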

Stage 2 storage

| What | Precision | Total |
|---|---|---|
| Weight indices $q_i$ | 4-bit uint | $N \times 0.5$ bytes |
| Quantized constants $\tilde{c}^{(1)}_b$ | uint8 | $B \times 1$ bytes |
| Super-constant $c^{(2)}$ | fp32 | 4 bytes |

where $N$ = total weights and $B$ = number of blocks.


Decompression — Full Reconstruction

To use the weights during the forward pass, reconstruct in three steps:

Step D.1 — Recover the absmax constant

$$c^{(1)}_b \approx \frac{\tilde{c}^{(1)}_b}{255} \times c^{(2)}$$

Step D.2 — Look up the NF4 level

$$\hat{w}_i \approx \mathcal{Q}_{NF4}\!\left[ q_i \right]$$

Step D.3 — Rescale to original magnitude

$$\boxed{w_i \approx \mathcal{Q}_{NF4}\!\left[ q_i \right] \times \frac{\tilde{c}^{(1)}_b}{255} \times c^{(2)}}$$
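The three decompression steps reduce to one table lookup and two multiplications. A minimal sketch, with the published NF4 constants rounded to seven decimals and a hypothetical helper name `dequantize`:

```python
# The 16 NF4 levels (index 0..15), rounded to seven decimals.
NF4_LEVELS = [
    -1.0, -0.6961928, -0.5250731, -0.3949175,
    -0.2844414, -0.1847734, -0.0910500, 0.0,
    0.0795803, 0.1609302, 0.2461123, 0.3379152,
    0.4407098, 0.5626170, 0.7229568, 1.0,
]


def dequantize(q_index, c_tilde, c2):
    """Reconstruct one weight from its 4-bit index and the block constants."""
    absmax = c_tilde / 255 * c2          # Step D.1: recover the absmax constant
    level = NF4_LEVELS[q_index]          # Step D.2: look up the NF4 level
    return level * absmax                # Step D.3: rescale to original magnitude


# Index 15 is the level +1.0, so it reconstructs to the recovered absmax itself.
print(dequantize(15, 255, 0.87))
# Index 7 is the exact zero level, so it reconstructs to 0.0 for any constants.
print(dequantize(7, 200, 0.87))
```

In practice the recovered absmax would be computed once per block and reused for every weight in it.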

Full Pipeline Summary

$$\underbrace{w_i}_{\text{fp32}} \xrightarrow{\div\, c^{(1)}_b} \underbrace{\hat{w}_i}_{\in [-1,1]} \xrightarrow{\text{NF4 snap}} \underbrace{q_i}_{\text{4-bit}} \quad+\quad \underbrace{\tilde{c}^{(1)}_b}_{\text{uint8}} \quad+\quad \underbrace{c^{(2)}}_{\text{fp32}}$$

$$\xrightarrow{\text{decompress}} w_i \approx \mathcal{Q}_{NF4}[q_i] \times \frac{\tilde{c}^{(1)}_b}{255} \times c^{(2)}$$

Numerical Example (w = 0.12, block absmax = 0.87)

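| Step | Operation | Value |
|---|---|---|
| Original weight | $w$ | $0.1200$ |
| 32-bit constant | $c^{(1)}_0$ | $0.8700$ |
| Normalized | $\hat{w} = 0.12 \div 0.87$ | $0.1379$ |
| NF4 snap | nearest level → index 9 | $\mathcal{Q}_{NF4}[9] = 0.1609$ |
| Super-constant | $c^{(2)}$ | $0.8700$ |
| 8-bit constant | $\tilde{c}^{(1)}_0$ | $255$ |
| Reconstructed | $0.1609 \times (255/255) \times 0.87$ | $\approx 0.1400$ |
| Absolute error | $\lvert 0.1200 - 0.1400 \rvert$ | $\approx 0.0200$ |

The super-constant equals the block absmax here because the example treats this block as the one with the largest absmax.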
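The example inputs ($w = 0.12$, block absmax $0.87$) can be run through the whole pipeline as a check. This sketch uses the published NF4 constants rounded to seven decimals, where index 9 is about $0.1609$. The names `quantize_block` and `round_trip` are hypothetical, and the single two-weight block is a toy assumption.

```python
NF4_LEVELS = [
    -1.0, -0.6961928, -0.5250731, -0.3949175,
    -0.2844414, -0.1847734, -0.0910500, 0.0,
    0.0795803, 0.1609302, 0.2461123, 0.3379152,
    0.4407098, 0.5626170, 0.7229568, 1.0,
]


def quantize_block(block):
    """Stage 1: return (absmax, list of 4-bit NF4 indices) for one block."""
    absmax = max(abs(w) for w in block)
    scale = absmax if absmax > 0 else 1.0   # assumed guard for all-zero blocks
    indices = [
        min(range(16), key=lambda k: abs(w / scale - NF4_LEVELS[k]))
        for w in block
    ]
    return absmax, indices


def round_trip(blocks):
    """Quantize (stages 1 and 2), then decompress, a list of weight blocks."""
    stage1 = [quantize_block(b) for b in blocks]
    c2 = max(absmax for absmax, _ in stage1)          # super-constant
    scale2 = c2 if c2 > 0 else 1.0
    result = []
    for absmax, indices in stage1:
        code = round(absmax / scale2 * 255)           # 8-bit constant
        absmax_hat = code / 255 * c2                  # recovered absmax
        result.append([NF4_LEVELS[q] * absmax_hat for q in indices])
    return result


recon = round_trip([[0.12, 0.87]])
w_rec = recon[0][0]
error = abs(0.12 - w_rec)
print(round(w_rec, 4), round(error, 4))  # 0.14 0.02
```

The error comes entirely from the NF4 snap: the normalized value $0.1379$ falls between the levels at indices 8 and 9, and the 8-bit constant is exact for this block.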

Memory Savings Formula

For a weight matrix with $N$ parameters and block size $B_s$, giving $B = N / B_s$ blocks:

$$\text{Mem}_{fp32} = 4N \text{ bytes}$$

$$\text{Mem}_{stage1} = \frac{N}{2} + 4B \text{ bytes}$$

$$\text{Mem}_{stage2} = \frac{N}{2} + B + 4 \text{ bytes}$$

$$\text{Savings} = 1 - \frac{N/2 + B + 4}{4N} \approx 87.5\% - \frac{25\%}{B_s} \text{ for large } N$$

Because $B$ grows with $N$, the constants never vanish entirely: $B_s = 64$ gives about $87.1\%$, and $87.5\%$ is the limit for large blocks.
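The formulas can be checked numerically. The sketch below assumes a hypothetical $4096 \times 4096$ weight matrix and a block size of 64; both values are illustrative choices, and `memory_bytes` is not a library function.

```python
import math


def memory_bytes(n_params, block_size):
    """Return (fp32, stage 1, stage 2) storage in bytes for one weight matrix."""
    n_blocks = math.ceil(n_params / block_size)   # B = N / B_s, rounded up
    fp32 = 4 * n_params                           # 4 bytes per weight
    stage1 = n_params / 2 + 4 * n_blocks          # 4-bit indices + fp32 constants
    stage2 = n_params / 2 + n_blocks + 4          # 4-bit indices + uint8 constants + super-constant
    return fp32, stage1, stage2


n_params = 4096 * 4096    # assumed matrix size
fp32, stage1, stage2 = memory_bytes(n_params, block_size=64)
savings = 1 - stage2 / fp32
print(fp32, stage1, stage2)     # 67108864 9437184.0 8650756.0
print(f"{savings:.2%}")         # 87.11%
```

With these assumed sizes, double quantization trims the per-weight overhead of the constants from 0.5 bits to roughly 0.125 bits.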