QLoRA — Mathematical Formulas
In stage 1, each block of weights is scaled by its absolute maximum, and each scaled weight is then mapped to the nearest level on the NF4 grid.
$$c_b^{(1)} = \max_{i \in \text{block}_b} \lvert w_i \rvert \in \mathbb{R}_{\text{fp32}}$$
Where $b$ is the block index and $w_i$ are the original fp32 weights in that block.
$$\hat{w}_i = \frac{w_i}{c_b^{(1)}}$$
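The absmax scaling above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `absmax_normalize`, the sample weights, and the block size of 4 are all made up for the example, and the handling of an all-zero block is an assumption the formulas do not cover.

```python
def absmax_normalize(weights, block_size):
    """Split weights into blocks and scale each block by its absolute maximum."""
    constants = []   # one c_b^(1) per block
    normalized = []  # the scaled weights w_hat_i, each in [-1, 1]
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        c = max(abs(w) for w in block)
        constants.append(c)
        # Assumption: an all-zero block has c == 0, so keep its weights at zero
        # instead of dividing by zero.
        normalized.extend(w / c if c != 0 else 0.0 for w in block)
    return constants, normalized


# Hypothetical sample data: 8 weights, 2 blocks of 4.
weights = [0.12, -0.87, 0.40, 0.05, -0.02, 0.10, 0.01, -0.04]
constants, normalized = absmax_normalize(weights, block_size=4)
print(constants)   # one absmax constant per block
print(normalized)  # every value lies in [-1, 1]
```

The weight with the largest magnitude in each block always maps to exactly $\pm 1$, which is why the NF4 grid needs the endpoints $-1$ and $1$.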
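NF4 uses 16 non-uniformly spaced levels derived from the quantiles of a standard normal distribution $\mathcal{N}(0,1)$:
$$Q_{\text{NF4}} = \{q_k\}_{k=0}^{15} = \{-1, -0.6962, -0.5251, \dots, 0.7230, 1\}$$
The levels are denser near zero, matching where neural network weights concentrate. The grid contains an exact zero, so it is slightly asymmetric: seven negative levels and eight positive ones.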
$$q_i = \arg\min_{k \in \{0,\dots,15\}} \lvert \hat{w}_i - q_k \rvert$$
The result $q_i \in \{0, 1, \dots, 15\}$ fits in 4 bits.
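The nearest-level search can be sketched as below. The 16 level values are an assumption: they are copied from the NF4 table used by the bitsandbytes reference implementation, which the text above shows only in rounded, abbreviated form. The helper names `nf4_index` and `pack_nibbles` are hypothetical, and the high-nibble-first packing order is an arbitrary choice made for illustration.

```python
# NF4 codebook, assumed to match the bitsandbytes reference table.
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def nf4_index(w_hat):
    """Return the index k of the NF4 level closest to a normalized weight."""
    return min(range(len(NF4_LEVELS)), key=lambda k: abs(w_hat - NF4_LEVELS[k]))


def pack_nibbles(indices):
    """Pack two 4-bit indices per byte (high nibble first); pad odd lengths with 0."""
    indices = list(indices)
    if len(indices) % 2:
        indices.append(0)
    return bytes((hi << 4) | lo for hi, lo in zip(indices[0::2], indices[1::2]))


indices = [nf4_index(w) for w in (-1.0, 0.0, 0.1379, 1.0)]
print(indices)
print(pack_nibbles(indices).hex())
```

A linear scan over 16 levels is enough for a sketch; a real kernel would more likely use a small lookup or comparison tree, but the result is the same index.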
| What | Precision | Per block |
|---|---|---|
| Weight indices $q_i$ | 4-bit uint | $n \times 0.5$ bytes |
| Absmax constant $c_b^{(1)}$ | fp32 | 4 bytes |

Where $n$ is the number of weights in one block.
The fp32 absmax constants $c_b^{(1)}$ are themselves expensive at scale: each block carries one 32-bit value. Stage 2 quantizes them using a single shared super-constant.
$$c^{(2)} = \max_b c_b^{(1)} \in \mathbb{R}_{\text{fp32}}$$
One scalar, kept exact.
$$\hat{c}_b^{(1)} = \frac{c_b^{(1)}}{c^{(2)}} \in [0, 1]$$
$$\tilde{c}_b^{(1)} = \text{round}\left(\hat{c}_b^{(1)} \times 255\right) \in \{0, \dots, 255\}$$
The result fits in 8 bits (uint8).
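The second quantization stage can be sketched as follows, using the single shared super-constant described above. The function name `quantize_constants` and the sample constants are hypothetical. Two details are assumptions: the all-zero case is handled explicitly to avoid division by zero, and Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding mode of a real kernel.

```python
def quantize_constants(constants):
    """Quantize per-block absmax constants to uint8 codes with one fp32 scale."""
    c2 = max(constants)  # super-constant c^(2), kept in full precision
    if c2 == 0:
        # Assumption: every block is all zeros, so every code is zero too.
        return [0] * len(constants), 0.0
    # Absmax constants are non-negative, so c / c2 lies in [0, 1].
    codes = [round(c / c2 * 255) for c in constants]
    return codes, c2


# Hypothetical per-block constants.
codes, c2 = quantize_constants([0.87, 0.10, 0.29])
print(codes, c2)
```

The block holding the largest constant always receives code 255, so its constant is reconstructed exactly; the other blocks pick up a rounding error of at most $c^{(2)} / 510$.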
| What | Precision | Total |
|---|---|---|
| Weight indices $q_i$ | 4-bit uint | $N \times 0.5$ bytes |
| Quantized constants $\tilde{c}_b^{(1)}$ | uint8 | $B \times 1$ bytes |
| Super-constant $c^{(2)}$ | fp32 | 4 bytes |

Where $N$ is the total number of weights and $B$ is the number of blocks.
To use the weights during the forward pass, reconstruct in three steps:
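$$c_b^{(1)} \approx \frac{\tilde{c}_b^{(1)}}{255} \times c^{(2)}$$
$$\hat{w}_i = Q_{\text{NF4}}[q_i]$$
$$w_i \approx Q_{\text{NF4}}[q_i] \times \frac{\tilde{c}_b^{(1)}}{255} \times c^{(2)}$$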
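$$\underbrace{w_i}_{\text{fp32}} \xrightarrow{\;\div\, c_b^{(1)}\;} \underbrace{\hat{w}_i}_{\in [-1, 1]} \xrightarrow{\;\text{NF4 snap}\;} \underbrace{q_i}_{\text{4-bit}} + \underbrace{\tilde{c}_b^{(1)}}_{\text{uint8}} + \underbrace{c^{(2)}}_{\text{fp32}} \xrightarrow{\;\text{decompress}\;} w_i \approx Q_{\text{NF4}}[q_i] \times \frac{\tilde{c}_b^{(1)}}{255} \times c^{(2)}$$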
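The three reconstruction steps can be sketched in plain Python as below. The codebook values are again an assumption, copied from the bitsandbytes reference table. The function name `dequantize` and the sample indices and codes are hypothetical.

```python
# NF4 codebook, assumed to match the bitsandbytes reference table.
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def dequantize(indices, codes, c2, block_size):
    """Rebuild approximate fp32 weights from 4-bit indices, uint8 codes, and c2."""
    weights = []
    for i, q in enumerate(indices):
        c_b = codes[i // block_size] / 255 * c2  # step 1: recover the block constant
        w_hat = NF4_LEVELS[q]                    # step 2: look up the NF4 level
        weights.append(w_hat * c_b)              # step 3: rescale to the original range
    return weights


# Hypothetical stored data: two blocks of three weights each.
restored = dequantize([15, 0, 7, 12, 3, 8], codes=[255, 29], c2=0.87, block_size=3)
print(restored)
```

Only the lookup table and one multiply per weight are needed at inference time, since the block constant is shared by every weight in the block.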
| Step | Operation | Value |
|---|---|---|
| Original weight | $w$ | 0.1200 |
| 32-bit constant | $c_0^{(1)}$ | 0.8700 |
| Normalized | $\hat{w} = 0.12 \div 0.87$ | 0.1379 |
| NF4 snap | nearest level → index 9 | $q_9 = 0.1609$ |
| Super-constant | $c^{(2)}$ | 0.8700 |
| 8-bit constant | $\tilde{c}_0^{(1)}$ | 255 |
| Reconstructed | $0.1609 \times (255/255) \times 0.87$ | ≈ 0.1400 |
| Absolute error | $\lvert 0.1200 - 0.1400 \rvert$ | ≈ 0.0200 |
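For a weight matrix with $N$ parameters and block size $B_s$, giving $B = N / B_s$ blocks:
$$\text{Mem}_{\text{fp32}} = 4N \text{ bytes}$$
$$\text{Mem}_{\text{stage1}} = \frac{N}{2} + 4B \text{ bytes}$$
$$\text{Mem}_{\text{stage2}} = \frac{N}{2} + B + 4 \text{ bytes}$$
$$\text{Savings} = 1 - \frac{N/2 + B + 4}{4N} = \frac{7}{8} - \frac{1}{4B_s} - \frac{1}{N} \approx 87.5\% \text{ for large } N \text{ and } B_s$$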
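The memory formulas are easy to check numerically. The sketch below evaluates them for a hypothetical 4096 × 4096 weight matrix with a block size of 64; both numbers are assumptions chosen for illustration, and the function name `quantized_memory` is made up. Ceiling division is used so the function also handles sizes that do not divide evenly, a case the formulas above do not address.

```python
import math


def quantized_memory(n_weights, block_size):
    """Evaluate the fp32, stage-1, and stage-2 memory formulas in bytes."""
    n_blocks = math.ceil(n_weights / block_size)
    packed = math.ceil(n_weights / 2)    # two 4-bit indices per byte
    fp32 = 4 * n_weights
    stage1 = packed + 4 * n_blocks       # fp32 absmax constant per block
    stage2 = packed + n_blocks + 4       # uint8 constant per block + one fp32 scalar
    return {
        "fp32": fp32,
        "stage1": stage1,
        "stage2": stage2,
        "savings": 1 - stage2 / fp32,
    }


report = quantized_memory(n_weights=4096 * 4096, block_size=64)
print(report)
```

With these assumed sizes the savings come out near 87.1% rather than the 87.5% limit, because the per-block constants still cost $1 / (4 B_s)$ of the fp32 footprint.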