VernonWu BLOG
https://vernonwu.com/

Solution for Project Euler [484]
https://vernonwu.com/article/solveuler/484
Tue, 16 Apr 2024

ℹ️

A collection of my Project Euler solutions. The problem archives can be found here.

Problem description

The arithmetic derivative $n'$ is defined by

$p' = 1$ for any prime $p$;

$(ab)' = a'b + ab'$ for all integers $a, b$ (Leibniz rule).

For example, $20' = 24$. Find $\sum \gcd(k, k')$ for $1 < k \le 5 \times 10^{15}$.

Note: $\gcd(x, y)$ denotes the greatest common divisor of $x$ and $y$.

Mathematical Derivation

Let $n = \prod_i p_i^{e_i}$, so that $n' = n \sum_i e_i / p_i$ (and $f(n) = \gcd(n, n')$ is a multiplicative function). Then,

Algorithm
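The original code listing is not reproduced here; as a sanity check, the following is a brute-force sketch of the quantity in question (function names and the small limit are illustrative, and the actual problem limit is far too large for brute force, which is why the multiplicative derivation above is needed):

```python
from math import gcd

def derivative(n: int) -> int:
    # Arithmetic derivative via trial division: for n = prod p_i^e_i,
    # n' = n * sum(e_i / p_i); each time a factor p is divided out, add n // p.
    result, m, p = 0, n, 2
    while p * p <= m:
        while m % p == 0:
            m //= p
            result += n // p
        p += 1
    if m > 1:
        result += n // m
    return result

def gcd_derivative_sum(limit: int) -> int:
    # Brute-force sum of gcd(k, k') for 1 < k <= limit.
    return sum(gcd(k, derivative(k)) for k in range(2, limit + 1))
```

For instance, `derivative(20)` returns 24, matching the worked example above.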

Visualizing Fourier Epicycles with Manim-CE
https://vernonwu.com/article/epicycles
Tue, 27 Feb 2024

ℹ️

Approximate and animate extracted contours from a given PNG using the DFT. Inspired by 3b1b.

Introduction

Epicycles

In general, epicycles are curves traced by a point on a circle that rotates around another rotating circle, and so on. In the complex plane, each circle is a rotating vector $c_k e^{2\pi i k t}$ with constant complex amplitude $c_k$ and integer frequency $k$; therefore the curve traced by a chain of such circles is

$z(t) = \sum_k c_k e^{2\pi i k t}.$

e.g. a single term $z(t) = c_1 e^{2\pi i t}$ is an ordinary circle of radius $|c_1|$.

The parametric path for the curve can be denoted as $z : [0, 1) \to \mathbb{C}$.

DFT

The Discrete Fourier Transform (DFT) is defined as

$X_k = \sum_{n=0}^{N-1} x_n e^{-2\pi i kn/N},$

where $k = 0, 1, \dots, N-1$.
Its inverse transform (IDFT) is given by:

$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k e^{2\pi i kn/N},$

where $n = 0, 1, \dots, N-1$.

Trigonometric interpolation

Consider a set of $N$ points $(t_n, x_n)$ with $t_n = n/N$ belonging to the parametric curve $z(t)$.

The goal of trigonometric interpolation is to find a trigonometric polynomial that passes through these points.
A trigonometric polynomial of degree $K$ (where typically $K = (N-1)/2$ for an odd number of points, or $K = N/2$ for an even number of points) is given by:

$p(t) = \sum_{k=-K}^{K} c_k e^{2\pi i k t}.$

To find the coefficients $c_k$, note that they can be obtained directly from the DFT coefficients by the relation $c_k = X_{k \bmod N} / N$.

Interpolation Process

Compute DFT: Apply the DFT to the given data points to obtain the coefficients $X_k$.

Construct Trigonometric Polynomial: Use the coefficients $c_k = X_{k \bmod N} / N$ to construct the trigonometric polynomial $p(t)$.

Interpolation: To interpolate or estimate the value at any point $t$, plug the value of $t$ into the polynomial $p(t)$. This gives the interpolated value at that point.
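The three-step process above can be sketched with NumPy's FFT (the sample curve below is an illustrative stand-in for contour points extracted from a PNG):

```python
import numpy as np

# N sample points of an illustrative closed curve in the complex plane
# (a stand-in for contour points extracted from a PNG).
N = 8
t = np.arange(N) / N
points = np.exp(2j * np.pi * t) + 0.3 * np.exp(-2j * np.pi * 3 * t)

# Step 1: DFT of the samples; the polynomial coefficients are c_k = X_k / N.
coeffs = np.fft.fft(points) / N

# Frequencies centred around zero, matching the degree of the polynomial.
freqs = np.fft.fftfreq(N, d=1.0 / N)  # 0, 1, 2, 3, -4, -3, -2, -1

def interpolate(tau):
    # Steps 2-3: evaluate the trigonometric polynomial at any tau in [0, 1).
    return np.sum(coeffs * np.exp(2j * np.pi * freqs * tau))
```

At the sample parameters $\tau = n/N$ this reproduces the input points exactly, which is the interpolation property used to drive the epicycle animation.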

Visualization with Manim

Manim was created by Grant Sanderson, better known as the person behind the YouTube channel "3Blue1Brown". It was developed to produce the distinctive and visually appealing animations seen in his videos, which explain a wide range of mathematical concepts.

ManimCE is a fork of the original Manim project. It's community-driven and open-source, with contributions from a broader community of developers.

Please note that the following content is intended for personal note-taking and is therefore rather unorganized. Should any issues arise, please email me or start a thread in the comment section below.

Base Model

token embedding + positional encoding

Multi-Head Self-Attention

$h$ heads computed in parallel, then concatenate and apply a dense projection:
$\mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V), \quad \mathrm{MHSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O$

Position-Wise FFN

for every time step, apply the same MLP to improve the network's expressive power (promote, then reduce dimensionality): $\mathrm{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2$.

Computation Cost

memory & cc (computational cost): quadratic in sequence length, $O(n^2)$.
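As a reference point, a minimal NumPy sketch of standard scaled dot-product attention, where the explicit $n \times n$ score matrix is the source of the quadratic cost:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention; the explicit (n, n) score matrix is
    # what makes both memory and compute quadratic in sequence length n.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # (n, n)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                # row-wise softmax
    return w @ V                                         # (n, d_v)

n, d = 16, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)
```

Every modification surveyed below attacks the `(n, n)` matrix in this sketch in one way or another.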

Mode

(1) encoder-only (e.g. for classification),

(2) decoder-only (e.g. for language modeling), causal, upper triangular mask

(3) encoder-decoder (e.g. for machine translation), decoder part causal due to auto-regressive decoding

Modifications

Methodology

FP (fixed patterns): local windows etc.

Block-wise Patterns:

Strided Patterns: attending at fixed intervals; strided or dilated windows.

Set Transformers: pooling-like; compress and restore information via cross-attention with temporary (learned) vectors.

problem: information loss, as with a single global token (e.g. BERT [CLS]).

LR (Low rank methods): leverage low-rank approximations of the self-attention matrix

post softmax, the attention matrix is not full-rank

Linformer:

$\mathrm{Attention}(Q, EK, FV)$, where $E, F \in \mathbb{R}^{k \times n}$ are projection layers.

KR (Kernels): re-writing of the self-attention mechanism to avoid explicitly computing the $n \times n$ attention matrix

can be viewed as a form of LR

RC (recurrence): connect blocks via recurrence

DS (downsampling): e.g. patch merging in Swin-Transformer

Sparse Models and Conditional Computation: sparsely activate a subset of the parameters to improve FLOPS

Detailed Walk-Through

Image Transformer

local attention

Self-attention computed within blocks independently.
for a block of length $b$, the per-block cost is $O(b^2 d)$, hence roughly $O(nbd)$ overall.

Memory-Compressed Attention

For $K, V \in \mathbb{R}^{n \times d}$, apply a convolution along axis 0 with kernel size and stride $k$ to reduce the length dimension to $n/k$. The cc of attention then becomes $O(n \cdot (n/k) \cdot d)$.
However, it often either

does not result in a significant improvement in cc, due to $n$ and $n/k$ being similar in order of magnitude; or

loses information during compression.

Two Attention Schemes

The input is flattened in raster order and partitioned into non-overlapping query blocks of fixed length; each query block attends to a memory block that extends it.
Loses global receptive field.

Sparse Transformer

Assumption: In softmax Attention, effective weights are sparsely distributed.

heads

fixed: each position attends within its own block, plus designated summary positions at fixed column offsets;
strided: each position attends to the previous $l$ positions and to every $l$-th position;
which can be visualized below:

usage

alternate the fixed and strided heads across layers.
Justification: the fixed head synthesizes block information, so striding in the following layer does not affect the receptive field.

merge:

multi-head, then concatenate:

cc

set $l \approx \sqrt{n}$, then the cc is $O(n\sqrt{n})$.
strided attention is more suited for images and audio (more local);
fixed attention is more suited for text (more global).

Axial Transformer

Compute attention along one axis at a time. For image data with $N = S \times S$ pixels, this saves a factor of $S = \sqrt{N}$ over full attention.
Generally, for a $d$-dimensional input with $N$ elements, the Axial Transformer saves a factor of $N^{(d-1)/d}$.

models

auto-regressive: Inner decode: row-wise model

where $h$ denotes the initial embedding of the input row.
ShiftRight ensures the current pixel is outside its own receptive field.
Outer Decoder: capturing the rows above

The resulting tensor represents the context captured above the current pixel.
Finally, pass through LayerNorm and a dense layer to produce logits.
Effective for point clouds, etc.

Longformer

dilated sliding windows, analogous to dilated CNNs.
The receptive field grows layer by layer as in a CNN, so this indirect approach performs similarly poorly at long-distance modeling.
special tokens for global attention (cf. BERT [CLS]), which need to attend to and be attended by all tokens. Crucial.

Big Bird

global tokens + fixed patterns (local sliding windows) + random attention (queries attend to random keys).
Justification for randomness:
The standard Transformer is a complete digraph, which can be approximated with random graphs.
The model is Turing-complete.

Routing Transformer

learns attention sparsity with $k$-means clustering: queries only attend to keys assigned to the same cluster.
Cluster centroid vectors are shared between $Q$ and $K$.

For decoder (causal attention), solutions include:

additional lower-triangular masking;

share queries and keys, i.e. set $Q = K$. Works better.

Reformer

Locality Sensitive Hashing (LSH): similar vectors are hashed into the same bucket with high probability, and each query attends only within its bucket.
For $b$ hash buckets, define a random matrix $R \in \mathbb{R}^{d \times b/2}$ and set $h(x) = \arg\max([xR; -xR])$, i.e.
similarity with random vectors.
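One common angular LSH construction (the $\arg\max([xR; -xR])$ scheme used in Reformer) can be sketched as follows; shapes and the bucket count are illustrative:

```python
import numpy as np

def lsh_hash(x, R):
    # Angular LSH: project onto random directions R of shape (d, b/2) and
    # take the argmax over the concatenation [xR; -xR], giving one of b buckets.
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
d, n_buckets = 8, 4
R = rng.standard_normal((d, n_buckets // 2))
x = rng.standard_normal((16, d))
buckets = lsh_hash(x, R)  # bucket id per vector, in [0, n_buckets)
```

Note the hash depends only on direction, not magnitude: scaling a vector by a positive constant never changes its bucket, which is why it pairs well with shared, normalized queries and keys.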

LSH attention:

where $z$ denotes the normalizing term of the softmax.
To avoid queries with no keys, share queries and keys s.t. each query can at least attend to itself.
Multi-round: repeat with several independent hash functions and take the union of buckets.

Revnet:

Reduce memory (activation) cost with extra computation.

In Reformer, set $F$ to the LSH attention block and $G$ to the FFN.
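A minimal sketch of a reversible residual block, with toy stand-ins for the two sub-blocks (in Reformer these are the LSH attention block and the FFN):

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    # Forward pass of a reversible residual block.
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    # Inputs are recovered exactly from the outputs, so intermediate
    # activations need not be stored for backpropagation.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

F = lambda x: np.tanh(x)          # toy stand-in for the attention block
G = lambda x: np.maximum(x, 0.0)  # toy stand-in for the FFN block

x1, x2 = np.ones(4), np.full(4, 2.0)
y1, y2 = rev_forward(x1, x2, F, G)
r1, r2 = rev_inverse(y1, y2, F, G)
```

`rev_inverse` recovers `x1, x2` exactly, which is the memory-for-compute trade the notes describe.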

Linformer

Reduction on the length dimension $n$ instead of the feature dimension $d$. Needs care to maintain causal masking.

$\mathrm{Attention}(Q, EK, FV)$, where $E, F \in \mathbb{R}^{k \times n}$ are projections.
Reminiscent of depth-wise convolutions / pooling.
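A sketch of the projection idea, with random stand-ins for the learned projections (named `E` and `Fp` below; illustrative, not the paper's implementation):

```python
import numpy as np

def linformer_attention(Q, K, V, E, Fp):
    # Linformer: project the length dimension n of K and V down to k with
    # matrices E, Fp (random stand-ins for the learned projections), so the
    # score matrix is (n, k) instead of (n, n).
    d = Q.shape[-1]
    Kp, Vp = E @ K, Fp @ V                   # (k, d)
    scores = Q @ Kp.T / np.sqrt(d)           # (n, k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ Vp                            # (n, d)

n, k, d = 32, 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, Fp = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
out = linformer_attention(Q, K, V, E, Fp)
```

With $k \ll n$ fixed, both memory and compute are linear in $n$, matching the low-rank argument above.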

Performer

Fast Attention via Orthogonal Random Features (FAVOR):
With a kernel $K(q, k) = \mathbb{E}[\phi(q)^\top \phi(k)]$, where $\phi$ is a random feature map, we can write attention as $\phi(Q)\left(\phi(K)^\top V\right)$ up to row normalization; evaluating right-to-left avoids forming the $n \times n$ matrix.

slower in the causal (autoregressive) case due to the additional prefix-sum steps required for masking, which cannot be fully parallelized.

For the unmasked case, every query attends to the same keys, therefore we simply reuse $\phi(K)^\top V$;
for the masked case, compute incrementally with prefix sums $S_i = \sum_{j \le i} \phi(k_j) v_j^\top$.

If $\phi$ maps to $r$ features, the cc to compute attention is $O(nrd)$;
choose $r \ll n$, then the cost is linear in sequence length.
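A sketch of the unmasked kernelized attention, using a simple positive feature map (elu + 1, as in related linear-attention work) instead of Performer's orthogonal random features:

```python
import numpy as np

def linear_attention(Q, K, V, phi):
    # Kernelized attention: softmax(Q K^T) V is replaced by
    # phi(Q) (phi(K)^T V) with a row-wise normalizer, computed
    # right-to-left so no n x n matrix is ever formed.
    Qp, Kp = phi(Q), phi(K)                  # (n, r) feature maps
    KV = Kp.T @ V                            # (r, d): shared across all queries
    normalizer = Qp @ Kp.sum(axis=0)         # (n,)
    return (Qp @ KV) / normalizer[:, None]

# Simple positive feature map (elu + 1); Performer instead uses
# orthogonal random features.
phi = lambda X: np.where(X > 0, X + 1.0, np.exp(X))

n, d = 16, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V, phi)
```

The `KV` product is the quantity reused across all queries in the unmasked case; the causal case replaces it with per-position prefix sums.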

References

[1] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. Efficient Transformers: A Survey. ACM Comput. Surv. 55, 6, Article 109 (December 2022), 28 pages. https://doi.org/10.1145/3530811

语丝 (Lingua Thread)
https://vernonwu.com/article/linguaThread
Wed, 07 Feb 2024

The Spiral of Silence

The Spiral of Silence is a theory in political science and mass communication. It describes the following phenomenon: when people express their ideas and opinions, if they see that views they agree with are widely welcomed, they will participate actively, and such views will be voiced and spread ever more boldly; whereas if they find that a view is ignored by most or all (and is sometimes even attacked collectively), they will keep silent even if they agree with it. The silence of one side amplifies the momentum of the other, and as this cycle repeats, one voice grows ever louder while the other falls ever more silent, forming a spiraling process. The theory rests on the assumption that most individuals try to avoid the isolation that comes from holding certain attitudes and beliefs alone.

Open-source alternative to the Kuudra gang discord /au command.

Introduction

Kuudra attributes

Attributes are a special type of buff exclusive to certain items obtained from the Crimson Isle. Similar to Enchantments, they can either increase Stats or grant other bonuses, such as increased 🪄Skill XP, ❤️ Health regeneration, and more. Items that support Attributes automatically come included with 2 random ones at level 1 when created.

Upgrading attributes

You can upgrade the levels of the Attributes on your items up to level 10 by combining the item with another one of its type, or with an Attribute Shard that has at least 1 matching attribute. Similarly to combining Enchantments in an Anvil, each matching attribute on the sacrifice item (also known as the fusion piece) must be equal to or higher in level than the same attribute on the input item. If the levels are equal, the attribute's level is upgraded, up to level 10. If the sacrifice's level is higher, the input's level is set to that higher value.

It is worth noting that a fusion piece is equivalent to an attribute shard with the same desired attribute, despite being constrained by its armor type.

Main objective

We aim to design an algorithm that calculates the most cost-efficient method to upgrade a certain attribute through purchasing fusion pieces / attribute shards from the Skyblock auction house.

Currently, similar functionality is provided by the Kuudra gang discord as a paid feature.

Assumptions

We will only consider attribute fusion between pieces of the same level.

Justification: It is easily observed that there is no point in fusing pieces of unequal levels, for the lower-level piece is simply wasted and does not contribute to the overall attribute level accumulation.

We will generate sufficient normally distributed data for each tier of attribute, with the mean and var doubling upon each tier upgrade.

Justification: In practice we will be fetching real-time data using the hypixel skyblock api, however for demonstration convenience we adopt the approach above which provides a decent approximation to the actual distribution.

We will only consider the case where there is sufficient supply (as will be defined later) for all tiers of attributes on the auction house.

Justification: The scenario above requires the largest amount of calculation, while only slight adjustments are needed to make the algorithm robust to all cases.

Algorithm

Import packages

Determine data size for each tier

Generate simulation data

We generate the data for each tier using normal distribution.
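A minimal sketch of the simulated data, with illustrative base mean/variance values and sample counts (assumptions, not real auction data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative base price statistics for tier 1 (assumptions, not game data).
BASE_MEAN = 100_000.0
BASE_VAR = 20_000.0 ** 2
N_TIERS = 10
SAMPLES_PER_TIER = 50

def tier_prices(tier, n):
    # Mean and variance double with each tier upgrade; clip at 0 since
    # auction prices cannot be negative.
    mean = BASE_MEAN * 2 ** (tier - 1)
    std = (BASE_VAR * 2 ** (tier - 1)) ** 0.5
    return np.maximum(rng.normal(mean, std, size=n), 0.0)

prices = {t: tier_prices(t, SAMPLES_PER_TIER) for t in range(1, N_TIERS + 1)}
```

Swapping `tier_prices` for a real auction-house fetch would leave the rest of the pipeline unchanged.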

Node fusion

Node fusion, which is the core of our algorithm, is implemented as follows.

Fused node insertion

Visualization

The uuid_map here is essentially a binary tree. Therefore we choose to visualize it using networkx.

Output example

The figure below displays an example fusing route, where light blue represents the original item while cyan marks the recommended bin auctions.

🪄 Playground

For those interested, detailed code is provided on GitHub.

You can also play with the code yourself via binder.

A brief derivation of the computational complexity of Swin-transformer.

Overall Structure

The overall structure of Swin is illustrated below.

Patch merging

At each stage, Swin performs patch merging to downsample the feature map (halving each spatial dimension) to produce a hierarchical representation.

The W-MSA module

Matrix multiplication of $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$

could be implemented as a triple loop over $m$, $n$ and $p$.

Therefore its computational complexity is $O(mnp)$.

For a global Multi-head Self Attention (MSA) module,

We can divide its computation into the following 4 steps and calculate their respective cc:

for input $X \in \mathbb{R}^{hw \times C}$, compute $Q = XW^Q, K = XW^K, V = XW^V$: $3hwC^2$; compute $QK^\top$: $(hw)^2C$; multiply the attention weights by $V$: $(hw)^2C$; apply the output projection: $hwC^2$.

In total, $\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2C$.

Let each window contain $M \times M$ patches; thus for the local windows, $\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$.

With a sufficiently small fixed $M$, we consider the computational complexity of the Swin Transformer linear with respect to the image size $hw$.
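The two totals ($\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2C$ versus $\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$, as given in the Swin paper) can be compared numerically; the $h$, $w$, $C$, $M$ values below are illustrative Swin-T-like settings:

```python
def msa_flops(h, w, C):
    # Global MSA: QKV/output projections (4hwC^2) + attention terms (2(hw)^2 C).
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    # W-MSA: same projections, but attention restricted to M x M windows.
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

# Illustrative first-stage settings: 56 x 56 patches, C = 96, window M = 7.
h = w = 56
C, M = 96, 7
ratio = msa_flops(h, w, C) / wmsa_flops(h, w, C, M)
```

For fixed $M$, the W-MSA term grows linearly in $hw$ while the global MSA term grows quadratically, which is the linearity claim above.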

SW-MSA

To increase modeling power by introducing connections across windows, the authors proposed a shifted window partitioning approach, displacing the windows by $\lfloor M/2 \rfloor$ patches in the successive layer.

However, the number of windows increases from $\lceil h/M \rceil \times \lceil w/M \rceil$ to $(\lceil h/M \rceil + 1) \times (\lceil w/M \rceil + 1)$, with windows of varying sizes, hindering synchronous calculation. Therefore the proposed approach is to perform a cyclic shift and mask the unrelated self-attention results in each window with a large negative value, so that they become close to $0$ after passing through softmax.

Suppose, e.g., each local window covers $M \times M$ patches; the four Attention Masks are visualized below:

Relative Position Bias

$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}(QK^\top/\sqrt{d} + B)V$, where the parameter matrix $B$ is the relative position bias.

Define a 2D relative position index $(\Delta h, \Delta w)$; it is clear that $\Delta h, \Delta w \in [-(M-1), M-1]$. We perform a simple linear transformation and use $(\Delta h + M - 1)(2M - 1) + (\Delta w + M - 1)$ as the 1D relative position index. For each patch, we obtain a matrix of size $M \times M$. We flatten them individually and concatenate over the $M^2$ patches, resulting in an $M^2 \times M^2$ matrix.

We train a bias table of length $(2M-1)^2$, which is the range of possible indices, to project the index values to bias values.
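The index construction can be sketched as follows (NumPy; window size $M = 7$ as used in the paper):

```python
import numpy as np

def relative_position_index(M):
    # Flattened 1D index for every pair of patches in an M x M window:
    # shift both axis offsets into [0, 2M-2] and combine as
    # (dh + M - 1) * (2M - 1) + (dw + M - 1).
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
    coords = coords.reshape(2, -1)                    # (2, M*M) patch coords
    rel = coords[:, :, None] - coords[:, None, :]     # (2, M*M, M*M) offsets
    return (rel[0] + M - 1) * (2 * M - 1) + (rel[1] + M - 1)

idx = relative_position_index(7)  # (49, 49), values in [0, 168]
```

Each entry of `idx` selects one row of the trained bias table of length $(2M-1)^2 = 169$, so every pair of patches in a window shares a bias with all pairs at the same relative offset.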