Lecture 3 - Architecture

Architecture

Pre-norm vs Post-norm

Almost all modern LMs use pre-norm.

Pre-norm improves training stability: it moves normalization out of the residual stream, so the input propagates through the network more smoothly.

The residual stream then carries only the identity connection $f(x) = x$.

double norm

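Double norm tries to combine pre-norm and post-norm: since putting normalization inside the residual stream works poorly, the post-norm is moved outside the residual connection as well, onto the output of the sublayer branch.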

[Figure: double-norm]
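The three placements can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `toy_norm`, `add`, and the three block functions are hypothetical names, and `toy_norm` is a gain-free RMS normalization assumed here only so the example runs on plain Python lists.

```python
import math

def toy_norm(x, eps=1e-6):
    """Assumed stand-in normalization: scale x to unit root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [u + v for u, v in zip(a, b)]

def post_norm_block(x, sublayer, norm=toy_norm):
    # Post-norm: the norm sits on the residual stream itself.
    return norm(add(x, sublayer(x)))

def pre_norm_block(x, sublayer, norm=toy_norm):
    # Pre-norm: the residual stream carries x unchanged (identity path).
    return add(x, sublayer(norm(x)))

def double_norm_block(x, sublayer, norm=toy_norm):
    # Double norm: normalize the sublayer input and its output,
    # both outside the residual stream.
    return add(x, norm(sublayer(norm(x))))
```

With a sublayer that outputs zeros, the pre-norm and double-norm blocks return `x` untouched, while the post-norm block still rescales it. That is the sense in which only post-norm disturbs the identity path.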

LayerNorm vs RMSNorm

LayerNorm:

$$y = \frac{x - E(x)}{\sqrt{Var(x) + \epsilon}} * \gamma + \beta$$

RMSNorm:

$$y = \frac{x}{\sqrt{mean(x^2) + \epsilon}} * \gamma$$
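As a rough sketch, the two formulas translate directly into code for a single vector. The function names and the default `eps` are assumptions made for illustration; real implementations operate on batched tensors.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm: subtract the mean, divide by the std, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: divide by the root-mean-square, then scale (no mean, no shift)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]
```

RMSNorm needs one statistic instead of two and has no $\beta$ parameter. For an input that already has zero mean, the two functions agree when $\beta = 0$.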

Why RMSNorm?

Modern explanation: RMSNorm is faster and just as good. It skips the mean computation and has no bias term $\beta$ to store.

Does this really make sense?

Matrix multiplies account for the vast majority of FLOPs and memory.

CAUTION

FLOPs are not runtime!

RMSNorm reduces overall runtime because it reduces data movement.

More generally: dropping bias terms

Most modern transformers don’t have bias terms.

Reasons:

Activations

TODO: GeLU

Gated activations (*GLU)

GLUs modify the first part of a FF (Feed Forward) layer.

From ReLU to ReGLU:

$$\max(0, xW_1) \Rightarrow \max(0, xW_1) \odot (xV)$$

GeGLU:

$$FFN_{GeGLU}(x, W, V, W_2) = (GeLU(xW) \odot xV)W_2$$

SwiGLU (swish is x * sigmoid(x)):

$$FFN_{SwiGLU}(x, W, V, W_2) = (Swish(xW) \odot xV)W_2$$
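The gated feed-forward layer can be sketched for a single row vector as follows. The helper `matvec` and the function names are hypothetical, biases are omitted, and matrices are plain nested lists so the example needs no external library.

```python
import math

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swish(z):
    """Swish: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def ffn_relu(x, W1, W2):
    """Ungated baseline: max(0, x W1) W2."""
    hidden = [max(0.0, h) for h in matvec(x, W1)]
    return matvec(hidden, W2)

def ffn_swiglu(x, W, V, W2):
    """SwiGLU: (Swish(x W) * (x V)) W2, with * taken element-wise."""
    gate = [swish(h) for h in matvec(x, W)]
    value = matvec(x, V)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(hidden, W2)
```

The only structural change from the ungated layer is the extra projection `V`, whose output multiplies the activated branch element-wise before the final projection `W2`.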

NOTE

Gated models scale $d_{ff}$ down by a factor of 2/3. A GLU uses one extra matrix, so each matrix has to be narrower to keep the total parameter count unchanged.

Serial vs Parallel layers

Normal transformer blocks are serial - they compute attention, then the MLP:

$$y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))$$

Try parallelization:

$$y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))$$

No extremely serious ablations, but has a compute win.
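The sketch below follows the two formulas above literally, with `attention`, `mlp`, and `norm` passed in as arbitrary callables. All names are hypothetical and the vectors are plain lists.

```python
def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [u + v for u, v in zip(a, b)]

def serial_block(x, attention, mlp, norm):
    # y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))
    h = add(x, attention(norm(x)))
    return add(x, mlp(norm(h)))

def parallel_block(x, attention, mlp, norm):
    # y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
    n = norm(x)  # computed once and shared by both branches
    return add(add(x, mlp(n)), attention(n))
```

In the parallel form the MLP no longer waits for the attention output, and the normalization runs once instead of twice. This is one plausible source of the compute win.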

RoPE: Rotary Position Embedding

TIP

Position embeddings add position information to the raw input. This information is used when attention computes $Q \times K$.

Many variations in position embeddings

$$Embed(x, i) = v_x + PE_{pos}$$

RoPE

A relative position embedding should be some f(x, i) s.t.

$$\langle f(x, i), f(y, j) \rangle = g(x, y, i - j)$$

The attention function only gets to depend on the relative position (i - j).

[Figure: RoPE]

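For a 2D vector, a rotation is just a multiplication by a $2 \times 2$ rotation matrix. For a vector produced by the word embedding, split the dimensions into pairs and multiply by the block-diagonal matrix below: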

$$R(\theta)= \begin{pmatrix} \cos\theta_1 & -\sin\theta_1 & & & \\ \sin\theta_1 & \cos\theta_1 & & & \\ & & \ddots & & \\ & & & \cos\theta_{d/2} & -\sin\theta_{d/2} \\ & & & \sin\theta_{d/2} & \cos\theta_{d/2} \end{pmatrix}$$

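RoPE is ultimately applied to the query-key product in the attention computation $Attn(i) = \Sigma_j softmax(Q_i \cdot K_j)V_j$:

$$Q_i \leftarrow R(\theta_i)Q_i \quad K_i \leftarrow R(\theta_i)K_i$$

Each position $i$ is rotated by the angle $i\theta$, and the query-key inner product transposes one of the two rotations, so the result depends only on the difference of the angles. That difference encodes the relative position.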
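A small sketch can check the relative-position property numerically. The function names are hypothetical, and the frequency schedule `base ** (-2k/d)` is an assumption (the common choice), since the notes do not specify how the per-pair angles are set.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair (x[2k], x[2k+1]) by the angle pos * theta_k."""
    d = len(x)
    out = []
    for k in range(d // 2):
        theta_k = base ** (-2.0 * k / d)  # assumed frequency schedule
        angle = pos * theta_k
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * k], x[2 * k + 1]
        out += [a * c - b * s, a * s + b * c]  # 2x2 rotation of one pair
    return out

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))
```

Rotating a query at position 3 and a key at position 1 gives the same inner product as positions 10 and 8, since both pairs are 2 apart. Rotations also preserve vector length, so RoPE changes only the angle between queries and keys.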

Hyperparameters

Feedforward

Empirically, the ratio of the feedforward dim ($d_{ff}$) to the model dim ($d_{model}$) is:

$$d_{ff} = 4 d_{model}$$

NOTE

For GLU variants, $d_{ff} = \frac{8}{3} \times d_{model}$, because $d_{ff}$ is scaled to $\frac{2}{3}$ of its original value ($4 \times \frac{2}{3} = \frac{8}{3}$).
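The parameter arithmetic behind the $\frac{8}{3}$ factor can be checked directly. The function name and the choice of `d_model = 768` are illustrative assumptions, and biases are ignored.

```python
def ffn_params(d_model, d_ff, gated):
    """Weight count of a feed-forward layer without biases.

    Ungated: W1 (d_model x d_ff) and W2 (d_ff x d_model).
    Gated:   W, V (d_model x d_ff each) and W2 (d_ff x d_model).
    """
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff

d_model = 768  # illustrative value, divisible by 3
standard_params = ffn_params(d_model, 4 * d_model, gated=False)
gated_params = ffn_params(d_model, 8 * d_model // 3, gated=True)
```

Both counts come out to $8 d_{model}^2$: two matrices of width $4 d_{model}$ hold as many weights as three matrices of width $\frac{8}{3} d_{model}$.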

Ratio of head_dim * head_num to model_dim

Most models have ratios around 1.

Aspect ratio

Deep vs. wide

Most models are surprisingly consistent here too: the ratio of d_model to n_layer is typically around 100 to 200, for example 12288 / 96 = 128 for GPT-3 175B.

Vocabulary size

Typically, monolingual models have vocab size around 30k to 50k, while multilingual models have vocab size around 100k to 250k.

Dropout and other regularization

Many older models use dropout during pretraining. Newer models rely only on weight decay.

Weight decay interacts with learning rates

Stability tricks

Softmaxes - can be ill-behaved due to exponential / division by zero

Output softmax stability – the ‘z-loss’

$$\begin{aligned} \log(P(x)) & = \log{\frac{e^{U_{r}(x)}}{Z(x)}} \\ Z(x) & = \sum_{r' = 1}^{|V|}e^{U_{r'}(x)} \end{aligned}$$

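The z-loss is introduced to address exponential overflow in the softmax:

$$L = \sum_i[\log(P(x_i)) - \alpha \log^2(Z(x_i))]$$

Because $L$ is maximized, the penalty term pushes $\log Z(x)$ toward 0 (that is, $Z(x)$ toward 1). This keeps the softmax from overflowing and improves overall stability.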
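A minimal sketch of the per-token objective, assuming the logits $U_r(x)$ are given as a plain list. The function names and the default `alpha=1e-4` are assumptions for illustration.

```python
import math

def log_z(logits):
    """log Z(x) = log sum_r exp(U_r(x)), using max subtraction to avoid overflow."""
    m = max(logits)
    return m + math.log(sum(math.exp(u - m) for u in logits))

def objective_with_z_loss(logits, target, alpha=1e-4):
    """One term of L: log P(x) - alpha * log^2 Z(x), to be maximized."""
    lz = log_z(logits)
    log_prob = logits[target] - lz  # log softmax of the target entry
    return log_prob - alpha * lz ** 2
```

Adding a constant to every logit leaves $\log P(x)$ unchanged but moves $\log Z(x)$ away from 0, so the penalty lowers the objective. Only logits with $\log Z(x) = 0$ pay no penalty.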

Attention softmax stability – the ‘QK norm’

The query and keys are Layer (RMS) normed before going into the softmax operation.
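A rough sketch of the idea for one query-key pair. It assumes a gain-free RMS norm (real implementations usually add a learnable scale) and the usual $1/\sqrt{d}$ scaling; the function names are hypothetical.

```python
import math

def rms_norm(x, eps=1e-6):
    """Gain-free RMS normalization (assumed simplification)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product logit for one query-key pair, optionally QK-normed."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
```

With the norm enabled, the logit no longer grows with the magnitude of `q` or `k`. In this gain-free sketch its absolute value is bounded by $\sqrt{d}$, so the softmax input stays in a controlled range.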

Logit soft-capping.

Soft-capping the logits to some maximum value via Tanh
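One way to write the tanh soft cap, with the cap value left as a free parameter (the notes do not give one, so any number used with this function is an assumption):

```python
import math

def soft_cap(logits, cap):
    """Squash logits smoothly into (-cap, cap): cap * tanh(logit / cap)."""
    return [cap * math.tanh(u / cap) for u in logits]
```

Small logits pass through almost unchanged because $\tanh(z) \approx z$ near 0, while large ones saturate at the cap instead of growing without bound.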

