Random Thoughts

LongLoRA is an efficient fine-tuning method designed to expand the context window of Large Language Models (LLMs).

It lets us fine-tune a pretrained model to handle much longer text without the compute and memory cost that full attention would require at that length.

Unlike LoRA and QLoRA, which primarily focus on adapting or storing model weights efficiently, LongLoRA focuses on the attention mechanism used during long-context fine-tuning.

Normal transformer attention requires:

O(n^2)

computation, where n is the sequence length in tokens.

If we have 1,000 tokens, the attention matrix contains roughly:

1000^2 = 1,000,000

attention scores.

As the sequence length grows, the cost grows quadratically.

How Can We Reduce It?

Instead of allowing every token to attend to every other token, we can compute only local attention.

[The, cat, sat, on, the, mat, and, slept]

         ↓ split into windows

[The, cat, sat, on]   [the, mat, and, slept]

This reduces the computational complexity from approximately:

O(n^2)

to:

O(nw)

where $w$ is the local attention window size.
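To get a feel for the difference, the short sketch below counts query-key score entries for full attention (n^2) and for non-overlapping local windows (n·w). The sequence lengths and the window size of 2,048 are illustrative assumptions, not values taken from these notes.

```python
def full_attention_scores(n: int) -> int:
    """Number of query-key scores when every token attends to every token."""
    return n * n


def local_attention_scores(n: int, w: int) -> int:
    """Number of scores when the sequence is split into n // w windows of size w."""
    assert n % w == 0, "this sketch assumes n is a multiple of w"
    return (n // w) * w * w  # equals n * w


w = 2048  # hypothetical window size
for n in (8192, 32768):
    full = full_attention_scores(n)
    local = local_attention_scores(n, w)
    print(f"n={n}: full={full:,} local={local:,} ratio={full // local}x")
```

The ratio between the two counts is simply n / w, so the saving grows with the sequence length when w is held fixed.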

However, this introduces a problem.

The token "The" cannot directly interact with "slept".

Likewise, the token "on" cannot directly interact with the next token, "the", because they fall into different local windows.
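The window split and the two blocked pairs can be checked with a few lines of plain Python. The helper names split_into_windows and same_window are made up for this illustration.

```python
tokens = ["The", "cat", "sat", "on", "the", "mat", "and", "slept"]
w = 4  # window size used in the example


def split_into_windows(seq, w):
    """Split seq into consecutive, non-overlapping windows of length w."""
    return [seq[i:i + w] for i in range(0, len(seq), w)]


def same_window(i, j, w):
    """True if positions i and j fall into the same local window."""
    return i // w == j // w


windows = split_into_windows(tokens, w)
print(windows)
print(same_window(3, 4, w))  # "on" (index 3) vs "the" (index 4) -> False
print(same_window(0, 7, w))  # "The" (index 0) vs "slept" (index 7) -> False
```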

Shifting (Shift by 2)

To allow information to cross window boundaries, LongLoRA uses Shifted Short Attention (S²-Attn).

Suppose the window size is 4 and we shift by 2 tokens, that is, by half a window.

Original windows:

[The, cat, sat, on]   [the, mat, and, slept]

Shift the sequence left by 2, wrapping the first two tokens around to the end:

[sat, on, the, mat]   [and, slept, The, cat]

Now tokens that were previously separated by a window boundary become neighbors inside the same attention window.

For example:

on ↔ the

can now attend to each other.

After attention is computed, the outputs are shifted back to their original positions.

This creates overlapping attention regions while maintaining the efficiency of local attention.
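A minimal sketch of the shift on the example sentence, using plain Python lists. The helper names roll_left and roll_right are made up for this illustration, and no attention is computed here; the point is only to show the layout before and after the shift.

```python
tokens = ["The", "cat", "sat", "on", "the", "mat", "and", "slept"]
w, shift = 4, 2


def roll_left(seq, k):
    """Cyclically move every element k positions to the left."""
    k %= len(seq)
    return seq[k:] + seq[:k]


def roll_right(seq, k):
    """Inverse of roll_left: move every element k positions to the right."""
    return roll_left(seq, -k)


shifted = roll_left(tokens, shift)
shifted_windows = [shifted[i:i + w] for i in range(0, len(shifted), w)]
print(shifted_windows)  # "on" and "the" now share the first window

# After attention, the outputs are moved back to their original positions.
restored = roll_right(shifted, shift)
print(restored == tokens)
```

Note that the second shifted window wraps around and places the last tokens next to the first ones. How a causal model should treat that wrap-around window is not covered in these notes.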

Attention Masking

In standard causal attention, the attention output is computed as:

\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} + M \right)V

where:

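$Q$ = query matrix
$K$ = key matrix
$V$ = value matrix
$d_k$ = dimension of each key vector
$M$ = additive attention mask (0 where attention is allowed, $-\infty$ where it is blocked)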

The mask controls which tokens are allowed to attend to each other.

For decoder-only language models, $M$ is typically a causal mask that prevents tokens from looking into the future.
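The formula with a causal mask can be sketched in NumPy as follows. The toy sizes and random inputs are assumptions for illustration, and the mask is taken to be additive (0 for allowed pairs, -inf for blocked pairs).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_mask(n):
    """Additive mask: 0 where key j <= query i, -inf where j > i."""
    return np.triu(np.full((n, n), -np.inf), k=1)


def attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d_k) + M) V; also returns the attention weights."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k) + M)
    return weights @ V, weights


rng = np.random.default_rng(0)
n, d_k = 8, 16  # toy sizes, chosen only for illustration
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out, weights = attention(Q, K, V, causal_mask(n))
print(np.round(weights, 2))  # upper triangle is zero: no attention to future tokens
```

Because the first token can only see itself, its output row is exactly its own value vector.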

LongLoRA's local attention can be viewed as adding a local-attention mask on top of the causal one.

Instead of allowing every token pair to interact, the mask only enables attention within a fixed local window.

For example:

[The, cat, sat, on]   [the, mat, and, slept]

The mask allows:

The  ↔ cat ↔ sat ↔ on
the  ↔ mat ↔ and ↔ slept

but blocks:

on  ✗  the
The ✗ slept
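One possible way to build such a mask is sketched below. It combines the local-window restriction with the causal rule from above, which is an assumption about how the two masks are stacked; the function name local_causal_mask is made up for this example.

```python
import numpy as np


def local_causal_mask(n, w):
    """Additive mask: 0 if key j is in the same window as query i and j <= i, else -inf."""
    idx = np.arange(n)
    same_window = (idx[:, None] // w) == (idx[None, :] // w)
    causal = idx[None, :] <= idx[:, None]
    return np.where(same_window & causal, 0.0, -np.inf)


tokens = ["The", "cat", "sat", "on", "the", "mat", "and", "slept"]
M = local_causal_mask(len(tokens), w=4)

# Print 1 where attention is allowed and 0 where it is blocked.
print((M == 0).astype(int))
```

The printed matrix is block-diagonal with two lower-triangular 4×4 blocks: the row for "the" (index 4) has no entry for "on" (index 3), and the row for "slept" (index 7) has no entry for "The" (index 0).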

For normal local-attention heads:

O_{\text{normal}} = \text{softmax} \left( \frac{Q_1K_1^T}{\sqrt{d_k}} + M_{\text{normal}} \right) V_1

where $M_{\text{normal}}$ only permits interactions inside the local window.

For shifted heads, we first shift the hidden states:

X_s = \text{Shift}(X)

and then apply a shifted local mask:

O_{\text{shift}} = \text{Shift}^{-1} \left( \text{softmax} \left( \frac{Q_2K_2^T}{\sqrt{d_k}} + M_{\text{shifted}} \right) V_2 \right)

where $M_{\text{shifted}}$ is constructed using the shifted token layout.

The shifted mask creates overlapping attention windows, allowing information to cross local-window boundaries.
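The sketch below checks this equivalence numerically on toy data: shifting the tokens, attending inside plain windows and shifting back gives the same result as leaving the tokens in place and using a mask whose windows are moved. To keep it short it ignores the causal mask and positional encodings, and it rolls Q, K and V directly, which stands in for rolling X because the projections act on each token independently. The offset parameter of window_mask is a name invented for this example.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(Q, K, V, M):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]) + M) @ V


def window_mask(n, w, offset=0):
    """0 where two positions share a window, -inf otherwise.

    With offset > 0 the window boundaries are moved cyclically, which is
    what the rolled layout looks like when viewed from the original positions.
    """
    window_id = ((np.arange(n) - offset) % n) // w
    return np.where(window_id[:, None] == window_id[None, :], 0.0, -np.inf)


rng = np.random.default_rng(0)
n, d_k, w, shift = 8, 4, 4, 2
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))

# Shifted head: roll tokens left by `shift`, attend inside plain windows, roll back.
Qs, Ks, Vs = (np.roll(t, -shift, axis=0) for t in (Q, K, V))
O_shift = np.roll(attention(Qs, Ks, Vs, window_mask(n, w)), shift, axis=0)

# Same result without moving any token: use a mask with shifted windows.
O_masked = attention(Q, K, V, window_mask(n, w, offset=shift))
print(np.allclose(O_shift, O_masked))
```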

The key idea is that LongLoRA does not change the attention formula itself.

It changes the masks and token layout used by different attention heads.

Multi-Head Attention vs LongLoRA

Standard Multi-Head Attention

For input hidden states $X$:

Q_i = XW_i^Q

K_i = XW_i^K

V_i = XW_i^V

For each head:

\text{head}_i = \text{softmax} \left( \frac{Q_iK_i^T}{\sqrt{d_k}} \right)V_i

Concatenate all heads and project back:

\text{MHA}(X) = \text{Concat} ( \text{head}_1, \dots, \text{head}_h ) W^O

Every head can attend to all tokens.

This provides maximum context but requires:

O(n^2)

computation.
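For reference, the equations above can be written as a small NumPy function. This is a sketch for a single sequence with no mask, matching the formula as written; the per-head matrices W_i^Q, W_i^K and W_i^V are stored side by side in one matrix each, which is a common layout but an assumption here, as are the toy sizes.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Standard (unmasked) multi-head attention for one sequence X of shape (n, d_model)."""
    n, d_model = X.shape
    d_k = d_model // h

    def split_heads(T):
        # (n, d_model) -> (h, n, d_k): one slice of the projection per head
        return T.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n): the O(n^2) part
    heads = softmax(scores) @ V  # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o


rng = np.random.default_rng(0)
n, d_model, h = 8, 16, 4  # toy sizes for illustration
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)
```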

LongLoRA (S²-Attn)

Suppose there are $h$ attention heads.

Split them into two groups:

Normal heads: $(1, \dots, h/2)$
Shifted heads: $(h/2 + 1, \dots, h)$

For normal heads:

\text{head}_i = \text{LocalAttention}(X)

For shifted heads:

X_s = \text{Shift}(X)

\text{head}_i = \text{Unshift} ( \text{LocalAttention}(X_s) )

The local attention itself is still:

\text{LocalAttention}(Q,K,V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} + M_{\text{local}} \right)V

The only difference is that the mask limits attention to a local window.

Finally:

S^2\text{-Attn}(X) = \text{Concat} ( \text{head}_1, \dots, \text{head}_{h/2}, \text{head}_{h/2+1}, \dots, \text{head}_h ) W^O

A compact representation is:

O_{\text{normal}} = \text{LocalAttention}(X)

O_{\text{shift}} = \text{Unshift} ( \text{LocalAttention} ( \text{Shift}(X) ) )

S^2\text{-Attn}(X) = \text{Concat} ( O_{\text{normal}}, O_{\text{shift}} ) W^O
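Putting the pieces together, here is a rough NumPy sketch of the compact form above for a single sequence. It is not the reference implementation: it leaves out the causal mask and positional encodings, assumes n is divisible by w and that h is even, shifts by w // 2, and computes local attention by reshaping tokens into windows instead of building an explicit mask. The function name s2_attn and all sizes are made up for this example.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def s2_attn(X, W_q, W_k, W_v, W_o, h, w):
    """Shifted local attention sketch for one sequence X of shape (n, d_model)."""
    n, d_model = X.shape
    d_k = d_model // h
    shift = w // 2

    def split_heads(T):
        return T.reshape(n, h, d_k).transpose(1, 0, 2)  # (h, n, d_k)

    def shift_second_half(T, k):
        # Roll the token axis by k for heads h/2 .. h-1, leave the others alone.
        T = T.copy()
        T[h // 2:] = np.roll(T[h // 2:], k, axis=1)
        return T

    def to_windows(T):
        return T.reshape(h, n // w, w, d_k)  # (h, windows, w, d_k)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    Q, K, V = (shift_second_half(T, -shift) for T in (Q, K, V))

    Qw, Kw, Vw = to_windows(Q), to_windows(K), to_windows(V)
    scores = Qw @ Kw.transpose(0, 1, 3, 2) / np.sqrt(d_k)  # (h, windows, w, w)
    heads = (softmax(scores) @ Vw).reshape(h, n, d_k)

    heads = shift_second_half(heads, shift)  # unshift the shifted heads
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o


rng = np.random.default_rng(0)
n, d_model, h, w = 8, 16, 4, 4  # toy sizes for illustration
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = s2_attn(X, W_q, W_k, W_v, W_o, h, w)
print(out.shape)
```

A quick sanity check on the sketch: with w equal to n there is a single window, the shift has no effect on the result, and the output matches unmasked full multi-head attention.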

Key Takeaways

LoRA modifies weight updates using low-rank matrices.
QLoRA quantizes the frozen weights to reduce memory usage.
LongLoRA focuses on efficient long-context training.
Standard attention requires $O(n^2)$ computation.
LongLoRA replaces full attention with shifted local attention during fine-tuning.
Different attention heads see different shifted views of the sequence.
Attention masks restrict interactions to local windows.
Shifted windows allow information to cross local-window boundaries.
The attention formula itself remains unchanged.
LongLoRA changes which tokens each head can see, making long-context fine-tuning much more efficient.