Please note that the following content is intended for personal note-taking and is therefore rather unorganized. If you run into any issues, please email me or start a thread in the comment section below.

## Base Model

- Embedding: one-hot input, projected by a learned matrix.

- Add positional encoding.

- Multi-Head Self-Attention: $h$ heads computed in parallel, $\mathrm{head}_i = \mathrm{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right)$ with $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$; concatenate the heads, then apply a dense output projection.

- *Position-wise* FFN: for every time step, apply the same MLP, $\mathrm{FFN}(x) = W_2\, \mathrm{ReLU}(W_1 x + b_1) + b_2$, to improve the network's expressive power (promote, then reduce, the dimensionality).

### Computation Cost

Memory and computation cost: quadratic, $O(n^2)$, in the sequence length $n$.
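The quadratic cost is easy to see in a minimal numpy sketch of single-head scaled dot-product attention (names and shapes are mine, not from the survey): the score matrix is explicitly $(n, n)$.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (single head).

    Q, K, V: (n, d). The (n, n) score matrix is where the quadratic
    memory and compute cost comes from.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (n, n)
    return softmax(scores) @ V      # (n, d)

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out = attention(Q, K, V)
print(out.shape)  # (8, 4)
```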

### Modes

- (1) encoder-only (e.g. for classification);

- (2) decoder-only (e.g. for language modeling): causal, implemented with an upper-triangular mask;

- (3) encoder-decoder (e.g. for machine translation): the decoder part is causal due to auto-regressive generation.
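The upper-triangular mask in (2) can be sketched as follows (a minimal numpy illustration; variable names are mine): masked logits are set to $-\infty$ so they receive zero weight after the softmax.

```python
import numpy as np

n = 5
# Upper-triangular mask (strictly above the diagonal): position i may
# not attend to positions j > i, enforcing causality in a decoder.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)

scores = np.zeros((n, n))       # dummy logits for illustration
scores[mask] = -np.inf          # masked logits -> zero weight after softmax
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))     # row i is uniform over positions 0..i
```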

## Modifications

### Methodology

**FP** (Fixed Patterns): restrict the attention field to predetermined patterns such as local windows.
- Block-wise Patterns: chunk the sequence into fixed blocks; attend within each block.
- Strided Patterns: attend at fixed intervals, i.e. strided or dilated windows.
- Compressed Patterns: pooling to downsample the sequence length.

**CP** (Combinations of Patterns): aggregate / combine multiple **FP**s, e.g. axial patterns.

**LP** (Learnable Patterns): data-driven; establish token **relevance**, then assign tokens to buckets/clusters.
- relevance via hashing, k-means, or sorting blocks
- enables a global view on top of FP's efficiency

**M** (Neural Memory): a learnable side memory module.
- Set Transformer: pooling-like compression; information is restored via cross-attention with learned temporary (inducing) vectors.
- problem: information loss during compression
- a single global token (*e.g.* BERT's [CLS]) is a minimal special case

**LR** (Low-Rank methods): leverage low-rank approximations of the self-attention matrix.
- post-softmax, the attention matrix is approximately low-rank
- Linformer:

$$\mathrm{softmax}\!\left(\frac{Q (EK)^\top}{\sqrt{d_k}}\right) FV$$

where $E, F \in \mathbb{R}^{k \times n}$ are projection layers applied along the length dimension.

**KR** (Kernels): re-write the self-attention mechanism to avoid explicitly computing the $n \times n$ matrix; can be viewed as a form of **LR**.

**RC** (Recurrence): connect blocks via recurrence.

**DS** (Downsampling): e.g. patch merging in the Swin Transformer.

- Sparse Models and Conditional Computation: sparsely activate a subset of the parameters to improve FLOP efficiency.

## Detailed Walk-Through

### Image Transformer

**Local attention**

Self-attention is computed within blocks independently: for a query block of length $l_q$ attending to a memory block of length $l_m$, the per-block cost is $O(l_q \cdot l_m)$, i.e. $O(n \cdot l_m)$ overall instead of $O(n^2)$.

**Memory-Compressed Attention**

For $K, V \in \mathbb{R}^{n \times d}$, apply a convolution along axis 0 (the length axis) with kernel size and stride $k$, reducing the length to $n/k$. The computation cost of attention then becomes $O(n \cdot n/k)$.
However, this often either

- does not yield a significant improvement in computation cost, since $n$ and $n/k$ tend to be of similar orders of magnitude; or

- loses information during compression.

Two Attention Schemes

flattened in raster order, partition into non-overlapping query blocks of length , extends to memory blocks.
Loses global receptive field.
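The block-local idea can be sketched as a boolean mask (a simplified sketch: the extended memory blocks of the actual model are omitted, and the helper name is mine).

```python
import numpy as np

def block_local_mask(n, block):
    """Attention allowed only within non-overlapping blocks of length `block`."""
    ids = np.arange(n) // block          # block id of each position
    return ids[:, None] == ids[None, :]  # (n, n) boolean mask

m = block_local_mask(8, block=4)
# Each query sees `block` keys instead of n, so attention cost drops
# from O(n^2) to O(n * block).
print(m.sum(axis=1))  # [4 4 4 4 4 4 4 4]
```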

### Sparse Transformer

Assumption: in softmax attention, the effective weights are sparsely distributed.

Two heads with complementary sparse patterns $A^{(1)}, A^{(2)}$:

fixed: $A_i^{(1)} = \{j : \lfloor j/l \rfloor = \lfloor i/l \rfloor\}$ (the query's own block), $A_i^{(2)} = \{j : j \bmod l \in \{l-c, \dots, l-1\}\}$ (the last $c$ "summary" columns of every block);
strided: $A_i^{(1)} = \{j : i - l < j \le i\}$ (local window), $A_i^{(2)} = \{j : (i - j) \bmod l = 0\}$ (every $l$-th position).

**Usage**

- alternate $A^{(1)}$ and $A^{(2)}$ across layers. Justification: $A^{(1)}$ synthesizes block information, so the striding in $A^{(2)}$ does not shrink the receptive field.

- merge: every head attends to $A^{(1)} \cup A^{(2)}$.

- multi-head: different heads use different patterns, then concatenate the results.

**Computation cost**

Set $l \approx \sqrt{n}$; the cost becomes $O(n \sqrt{n})$.
Strided attention is better suited to images and audio (more local);
fixed attention is better suited to text (more global).
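The two pattern families can be sketched as causal boolean masks (a sketch under my own naming; the real implementation fuses these into block-sparse kernels):

```python
import numpy as np

def strided_masks(n, l):
    """Sparse Transformer 'strided' pattern: local window + every l-th position."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = causal & (i - j < l)          # head 1: previous l positions
    stride = causal & ((i - j) % l == 0)  # head 2: every l-th position
    return local, stride

def fixed_masks(n, l, c=1):
    """Sparse Transformer 'fixed' pattern: own block + summary columns."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    block = causal & (i // l == j // l)   # head 1: the query's own block
    col = causal & (j % l >= l - c)       # head 2: last c columns of each block
    return block, col

local, stride = strided_masks(16, l=4)
block, col = fixed_masks(16, l=4)
```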

### Axial Transformer

Compute attention along one axis at a time. For $H \times W$ image data, this saves a factor of $O(\sqrt{HW})$: the cost is $O(HW(H + W))$ instead of $O((HW)^2)$.
Generally, for an input of $N$ elements arranged as a $d$-dimensional tensor, the Axial Transformer saves a factor of $O(N^{(d-1)/d})$ over standard self-attention.

**Models**

Auto-regressive:

**Inner decoder: row-wise model.** Apply masked row-wise attention to $h = \mathrm{Embed}(\mathrm{ShiftRight}(x))$, where $h$ is the initial embedding of size $H \times W \times D$.
ShiftRight ensures the current pixel is outside its own receptive field.

**Outer decoder: capturing the rows above.** Unmasked row-wise plus masked column-wise attention layers produce a tensor that represents the context captured above the current pixel; it is added to the inner decoder's input.
Finally, pass through LayerNorm and a dense layer to produce the logits.
Effective for point clouds, etc.
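Attention along a single axis can be sketched by moving that axis into the sequence position (a simplified single-head sketch with no projections; names are mine): each row or column becomes an independent short sequence, so the score matrix is $H \times H$ or $W \times W$ rather than $(HW) \times (HW)$.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def axial_attention(x, axis):
    """Self-attention along one axis of x with shape (H, W, D)."""
    x = np.moveaxis(x, axis, -2)                  # (..., L, D)
    d = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)  # (..., L, L)
    out = softmax(scores) @ x
    return np.moveaxis(out, -2, axis)

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 6, 8))     # H=4, W=6, D=8
rows = axial_attention(img, axis=1)  # attend along the width
cols = axial_attention(img, axis=0)  # attend along the height
print(rows.shape, cols.shape)  # (4, 6, 8) (4, 6, 8)
```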

### Longformer

Dilated sliding windows, the attention analogue of dilated CNNs.
The receptive field grows layer by layer, as in a CNN, so this indirect approach performs similarly poorly at long-distance modeling.
Special tokens receive global attention (like BERT's [CLS]): they attend to, and are attended by, all tokens.

### Big Bird

**Crucial.** Global tokens + fixed patterns (local sliding windows) + **random attention** (queries attend to random keys).
Justification for the randomness: the standard Transformer is a *complete digraph*, which can be approximated by random graphs. The model remains Turing-complete.

### Routing Transformer

Learns the attention sparsity with *k-means* clustering: $A_i = \{j : \mu(q_i) = \mu(k_j)\}$, where the cluster centroid vectors $\mu$ are shared between queries and keys.
For the **decoder** (causal attention), solutions include:

- additional lower-triangular masking;

- sharing queries and keys, i.e. $Q = K$, which works better.
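The cluster-based sparsity can be sketched as a mask (a simplified sketch under my own naming: fixed centroids rather than online k-means updates, with the centroid set shared by queries and keys as in the paper):

```python
import numpy as np

def routing_mask(Q, K, centroids):
    """Attend only where query and key fall in the same cluster."""
    cq = np.argmin(((Q[:, None] - centroids[None]) ** 2).sum(-1), axis=-1)
    ck = np.argmin(((K[:, None] - centroids[None]) ** 2).sum(-1), axis=-1)
    return cq[:, None] == ck[None, :]   # (n_q, n_k) boolean mask

rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 4))
# Sharing queries and keys (Q == K) makes the mask symmetric and
# guarantees every token can attend to itself:
mask = routing_mask(Q, Q, centroids=rng.normal(size=(3, 4)))
assert mask.diagonal().all()
```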

### Reformer

Locality-Sensitive Hashing (**LSH**). For $b$ hash buckets, define a random matrix $R \in \mathbb{R}^{d \times b/2}$ and hash $h(x) = \arg\max([xR;\, -xR])$, so that vectors with high cosine similarity tend to share a bucket.
LSH attention:

$$o_i = \sum_{j \in P_i} \exp\left(q_i \cdot k_j - z(i, P_i)\right) v_j$$

where $P_i$ is the set of positions sharing a bucket with position $i$ and $z$ is the normalizing term in the softmax.
To avoid queries with no keys, set $k_j = \frac{q_j}{\|q_j\|}$ (shared-QK attention).
Multi-round: hash several times with independent hash functions and take $P_i = \bigcup_r P_i^{(r)}$.
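The angular hash can be sketched in numpy (function name and shapes are mine): vectors pointing in similar directions tend to land in the same bucket, so full attention can be restricted to within-bucket pairs.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Angular LSH in the spirit of Reformer: h(x) = argmax([xR ; -xR]).

    x: (n, d); R: (d, n_buckets/2) random Gaussian.
    """
    d = x.shape[-1]
    R = rng.normal(size=(d, n_buckets // 2))
    proj = x @ R                                  # (n, n_buckets/2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
buckets = lsh_buckets(q, n_buckets=4, rng=rng)

# Scaling by a positive constant preserves the angular hash:
x = np.stack([q[0], q[0] * 2.0])
b2 = lsh_buckets(x, n_buckets=4, rng=np.random.default_rng(1))
assert b2[0] == b2[1]
```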

**RevNet**

Reduces memory (activation) cost at the price of extra computation: with $y_1 = x_1 + F(x_2)$ and $y_2 = x_2 + G(y_1)$, activations can be reconstructed during the backward pass ($x_2 = y_2 - G(y_1)$, $x_1 = y_1 - F(x_2)$) instead of being stored.

In Reformer, $F$ is the LSH attention block and $G$ is the FFN.
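The reversible residual block is small enough to verify numerically (a sketch; `np.tanh` and `np.sin` stand in for the real sub-layers):

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    # y1 = x1 + F(x2); y2 = x2 + G(y1)
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    # Invert the two residual steps: activations are recomputed in the
    # backward pass instead of being stored.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# Stand-in sub-layers (in Reformer, F is the LSH attention block, G the FFN).
F, G = np.tanh, np.sin

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 5))
y1, y2 = rev_forward(x1, x2, F, G)
r1, r2 = rev_inverse(y1, y2, F, G)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```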

### Linformer

Performs an $n \to k$ reduction along the length dimension, instead of the feature dimension. Care is needed to maintain causal masking.

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q W_i^Q \left(E_i K W_i^K\right)^\top}{\sqrt{d_k}}\right) F_i V W_i^V$$

where $E_i, F_i \in \mathbb{R}^{k \times n}$ are projections.
Reminiscent of depth-wise convolutions / pooling.
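A single-head sketch of the length projection (my own naming, omitting the per-head weight matrices): the score matrix shrinks from $(n, n)$ to $(n, k)$.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Project the *length* dimension of K and V from n down to k.

    Q, K, V: (n, d); E, F: (k, n). The score matrix is (n, k)
    instead of (n, n), so cost is O(n*k) rather than O(n^2).
    """
    d = Q.shape[-1]
    scores = Q @ (E @ K).T / np.sqrt(d)   # (n, k)
    return softmax(scores) @ (F @ V)      # (n, d)

rng = np.random.default_rng(0)
n, k, d = 64, 8, 16
Q, K, V = rng.normal(size=(3, n, d))
E, F = rng.normal(size=(2, k, n)) / np.sqrt(n)
out = linformer_attention(Q, K, V, E, F)
print(out.shape)  # (64, 16)
```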

Performer

Fast Attention via Orthogonal Random Features (

**FAVOR**): With Kernel , where is a random feature map, we can write attention as , thenslower on causal(autoregressive) due to additional steps for masking, therefore unable to be parallelized.

### Linear Transformer

Kernel method; linear time; equivalent to an RNN with constant memory at generation time.
For a similarity score $\mathrm{sim}(\cdot, \cdot)$, observe:

$$V_i' = \frac{\sum_j \mathrm{sim}(Q_i, K_j)\, V_j}{\sum_j \mathrm{sim}(Q_i, K_j)}$$

With the kernel method, $\mathrm{sim}(q, k) = \phi(q)^\top \phi(k)$, then

$$V_i' = \frac{\phi(Q_i)^\top \sum_j \phi(K_j)\, V_j^\top}{\phi(Q_i)^\top \sum_j \phi(K_j)}$$

For the unmasked case, every query attends to the same keys, so we simply reuse $S = \sum_j \phi(K_j) V_j^\top$ and $Z = \sum_j \phi(K_j)$.
For the masked case, update incrementally: $S_i = S_{i-1} + \phi(K_i) V_i^\top,\quad Z_i = Z_{i-1} + \phi(K_i)$.

Computing $\phi(K)^\top V$ before multiplying by $\phi(Q)$ makes the cost $O(n \cdot d^2)$, linear in $n$.
Choose $\phi(x) = \mathrm{elu}(x) + 1$, which keeps the features positive.
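Both cases can be sketched in a few lines of numpy (names are mine; no projections or multi-head handling): the unmasked path reuses one $(d, d)$ summary, while the causal path maintains it as a running RNN-like state.

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1 > 0, the feature map chosen by the paper
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, causal=False):
    """O(n) attention: phi(Q) (phi(K)^T V) instead of (Q K^T) V."""
    Qf, Kf = elu_plus_one(Q), elu_plus_one(K)
    if not causal:
        S = Kf.T @ V              # (d, d) summary, shared by every query
        Z = Kf.sum(axis=0)        # (d,)  normalizer
        return (Qf @ S) / (Qf @ Z)[:, None]
    # Causal case: running sums act like an RNN state.
    n, d = Q.shape
    out = np.zeros_like(V)
    S = np.zeros((d, V.shape[-1]))
    Z = np.zeros(d)
    for i in range(n):
        S += np.outer(Kf[i], V[i])
        Z += Kf[i]
        out[i] = (Qf[i] @ S) / (Qf[i] @ Z)
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 6, 4))
full = linear_attention(Q, K, V)
causal = linear_attention(Q, K, V, causal=True)
# The last causal position has seen all keys, matching the full pass:
assert np.allclose(causal[-1], full[-1])
```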

## References

[1] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. Efficient Transformers: A Survey. ACM Comput. Surv. 55, 6, Article 109 (December 2022), 28 pages. https://doi.org/10.1145/3530811

**Author:** VernonWu
**URL:** https://vernonwu.com/article/EfficientTransformers
**Copyright:** All articles in this blog, except where specially stated, are released under the BY-NC-SA license. Please indicate the source!
