A brief derivation of the computational complexity of the Swin Transformer.
Overall Structure
The overall structure of Swin is illustrated below.
![notion image](https://s2.loli.net/2024/04/07/t2RjYHl5xGfdC1V.png?t=fa932e14-9469-4edf-a108-7cfdbfc91d95)
Patch merging
At the start of each subsequent stage, Swin performs patch merging to downsample the feature map from $H \times W \times C$ to $\frac{H}{2} \times \frac{W}{2} \times 2C$: each group of $2 \times 2$ neighboring patches is concatenated into a $4C$-dimensional feature and projected to $2C$ by a linear layer, producing a hierarchical representation.
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F9d951ede-b583-43ae-b13b-ac8e13442a35%2Fa039b029-cc20-4010-8833-b67c0829fd92%2FUntitled.png?table=block&id=7fd3269a-8a00-453a-934a-29b8e45b93dc&t=7fd3269a-8a00-453a-934a-29b8e45b93dc&width=432&cache=v2)
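For concreteness, here is a minimal PyTorch-style sketch of patch merging under the concatenate-then-project scheme described above. It assumes a `(B, H, W, C)` tensor layout with even `H` and `W`; the class and argument names are mine, not the official implementation's:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Downsample an (H, W, C) feature map to (H/2, W/2, 2C) by concatenating
    each 2x2 group of neighboring patches (-> 4C) and projecting to 2C."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):
        # x: (B, H, W, C), H and W assumed even
        x0 = x[:, 0::2, 0::2, :]   # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]   # bottom-left
        x2 = x[:, 0::2, 1::2, :]   # top-right
        x3 = x[:, 1::2, 1::2, :]   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)
```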
The W-MSA module
Matrix multiplication
$$A_{m \times n}\, B_{n \times p} = C_{m \times p}$$
can be implemented as three nested loops over $m$, $p$ and $n$ (a sketch is given below). Therefore its computational complexity is $\mathcal{O}(mnp)$.
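For illustration, a naive pure-Python sketch of the triple loop (real frameworks use heavily optimized kernels, but the asymptotic count of multiply-adds is the same):

```python
def matmul(A, B):
    """Naive multiplication of an m x n matrix A with an n x p matrix B."""
    m, n, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(m)]
    for i in range(m):          # m iterations
        for j in range(p):      # p iterations
            for k in range(n):  # n iterations
                C[i][j] += A[i][k] * B[k][j]
    return C  # total work: m * p * n multiply-adds -> O(mnp)
```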
Consider a global Multi-head Self-Attention (MSA) module operating on an $h \times w \times C$ feature map, i.e. $hw$ tokens of dimension $C$:
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F9d951ede-b583-43ae-b13b-ac8e13442a35%2F66ff4935-3d9d-4f71-9c93-71c5e5bb2692%2FUntitled.png?table=block&id=f618d350-cc4f-4192-9685-3955ea1d3893&t=f618d350-cc4f-4192-9685-3955ea1d3893&width=811&cache=v2)
We can divide its computation into the following 4 steps and calculate their respective computational complexity:
- For the input $X \in \mathbb{R}^{hw \times C}$, compute $Q = XW_Q$, $K = XW_K$, $V = XW_V$ with $W_Q, W_K, W_V \in \mathbb{R}^{C \times C}$: $3hwC^2$.
- Compute the attention scores $QK^T$, an $(hw \times C)$ by $(C \times hw)$ product: $(hw)^2C$.
- Multiply the softmax-normalized scores with $V$, an $(hw \times hw)$ by $(hw \times C)$ product: $(hw)^2C$.
- Project the output with $W_O \in \mathbb{R}^{C \times C}$: $hwC^2$.

In total,
$$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C.$$
Let each window contain $M \times M$ patches; applying the formula above within each of the $\frac{h}{M} \times \frac{w}{M}$ windows gives, for the local windows,
$$\Omega(\text{W-MSA}) = \frac{hw}{M^2}\left(4M^2C^2 + 2M^4C\right) = 4hwC^2 + 2M^2hwC.$$
With a fixed (and sufficiently small) window size $M$, the computational complexity of the Swin Transformer is linear with respect to the image size $hw$.
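As a quick sanity check of the two formulas, the small script below evaluates them for a typical Swin-T stage-1 configuration ($56 \times 56$ patches, $C = 96$, $M = 7$); these particular numbers are illustrative and not taken from the derivation above:

```python
def msa_flops(h, w, C):
    """Global MSA: 4*h*w*C^2 + 2*(h*w)^2*C (quadratic in the number of patches)."""
    return 4 * h * w * C**2 + 2 * (h * w)**2 * C

def wmsa_flops(h, w, C, M):
    """Window MSA: 4*h*w*C^2 + 2*M^2*h*w*C (linear in the number of patches for fixed M)."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

print(msa_flops(56, 56, 96))      # 2_003_828_736  (~2.0e9)
print(wmsa_flops(56, 56, 96, 7))  #   145_108_992  (~1.45e8)
```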
SW-MSA
To increase modeling power by introducing connections across windows, the authors proposed a shifted window partitioning approach, displacing the windows by $\left(\left\lfloor \frac{M}{2} \right\rfloor, \left\lfloor \frac{M}{2} \right\rfloor\right)$ pixels in the successive layer.
![notion image](https://s2.loli.net/2024/04/06/1mOPE4W8yUrYHz5.png?t=8532af03-40d7-434f-b0b8-76427bf48c91)
However, the shift increases the number of windows from $\left\lceil \frac{h}{M} \right\rceil \times \left\lceil \frac{w}{M} \right\rceil$ to $\left(\left\lceil \frac{h}{M} \right\rceil + 1\right) \times \left(\left\lceil \frac{w}{M} \right\rceil + 1\right)$, some of which are smaller than $M \times M$, hindering batched computation. Therefore, the proposed approach is to perform a cyclic shift and mask the self-attention between patches that were not adjacent in the original feature map by adding $-100$ to the corresponding logits, so that their weights become close to $0$ after passing through softmax.
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F9d951ede-b583-43ae-b13b-ac8e13442a35%2F347d03f9-7f5b-401b-9242-d01957e94b9d%2FUntitled.png?table=block&id=50b02885-ae3f-4552-b5e2-c5f6acee0fd9&t=50b02885-ae3f-4552-b5e2-c5f6acee0fd9&width=1237&cache=v2)
Suppose each local window covers $M \times M$ patches; the four attention masks (one for each type of window after the cyclic shift) are visualized below:
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F9d951ede-b583-43ae-b13b-ac8e13442a35%2Ffbbce3df-5a09-476c-9978-332976c9e024%2FUntitled.png?table=block&id=0d0e4148-6a8e-41fd-a6a9-300ed47c1631&t=0d0e4148-6a8e-41fd-a6a9-300ed47c1631&width=1211&cache=v2)
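A minimal sketch of how such masks can be built after the cyclic shift, following the label-and-subtract scheme used by the official implementation (the function name and the toy example sizes below are mine):

```python
import torch

def sw_msa_attention_mask(H, W, M, shift):
    """Additive attention masks for SW-MSA after a cyclic shift: pairs of patches
    coming from different original regions get -100 added to their logits, so
    their attention weights vanish after softmax."""
    # Label each patch of the (already cyclically shifted) map with its original region.
    img_mask = torch.zeros(1, H, W, 1)
    h_slices = (slice(0, -M), slice(-M, -shift), slice(-shift, None))
    w_slices = (slice(0, -M), slice(-M, -shift), slice(-shift, None))
    cnt = 0
    for hs in h_slices:
        for ws in w_slices:
            img_mask[:, hs, ws, :] = cnt
            cnt += 1
    # Partition into non-overlapping M x M windows: (num_windows, M*M)
    windows = img_mask.view(1, H // M, M, W // M, M, 1)
    windows = windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M)
    # Patches with different region labels must not attend to each other.
    attn_mask = windows.unsqueeze(1) - windows.unsqueeze(2)
    return attn_mask.masked_fill(attn_mask != 0, -100.0).masked_fill(attn_mask == 0, 0.0)

# e.g. sw_msa_attention_mask(H=8, W=8, M=4, shift=2) yields the 4 masks of shape (16, 16)
```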
Relative Position Bias
When computing self-attention within a window, a relative position bias is added to the attention logits:
$$\text{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V,$$
where the parameter matrix $B \in \mathbb{R}^{M^2 \times M^2}$ is the relative position bias.
Define the 2D relative position $(\Delta x, \Delta y)$ between two patches in a window; it is clear that $\Delta x, \Delta y \in [-M+1,\, M-1]$. We perform a simple linear transformation and use $(\Delta x + M - 1)(2M - 1) + (\Delta y + M - 1)$ as the 1D relative position index. For each patch, we obtain a matrix of size $M \times M$ holding its relative indices to all patches in the window. We flatten them individually and concatenate over the $M^2$ patches, resulting in an $M^2 \times M^2$ matrix.
We train a bias table of length $(2M-1)^2$, which is the number of possible index values, to map each index to a learnable bias.
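A minimal sketch of the index construction described above (mirroring the approach of the official implementation; the function name is mine):

```python
import torch

def relative_position_index(M):
    """1D relative position index for an M x M window. For every pair of patches,
    the index (dx + M - 1) * (2M - 1) + (dy + M - 1) lies in [0, (2M-1)^2), so a
    learnable bias table of length (2M-1)^2 covers all cases."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(M), torch.arange(M), indexing="ij"))          # (2, M, M)
    coords_flat = coords.flatten(1)                                 # (2, M*M)
    rel = coords_flat[:, :, None] - coords_flat[:, None, :]         # (2, M*M, M*M)
    rel = rel.permute(1, 2, 0).contiguous()                         # (M*M, M*M, 2), values in [-M+1, M-1]
    rel[:, :, 0] += M - 1          # shift dx to [0, 2M-2]
    rel[:, :, 1] += M - 1          # shift dy to [0, 2M-2]
    rel[:, :, 0] *= 2 * M - 1      # the "simple linear transformation"
    return rel.sum(-1)             # (M*M, M*M) integer indices into the bias table

# Usage: bias = bias_table[relative_position_index(M).view(-1)].view(M * M, M * M)
# where bias_table holds one learnable value per possible index (per attention head).
```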