Transformer Decoder Architecture

A detailed view of a Transformer decoder architecture with 100 billion parameters.

Overall Architecture

Inputs

  • At the bottom left of the figure are the inputs.

  • The inputs then pass through Positional Encoding, which adds positional information to each token in the input sequence, as sketched below.
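
One common choice for this step is the sinusoidal encoding from the original Transformer paper; the figure does not specify which scheme is used, so the function below is only a minimal sketch, with max_len and d_model as illustrative parameters (d_model assumed even).

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    # One row per position, one column per embedding dimension.
    position = torch.arange(max_len).unsqueeze(1)                          # (max_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position / div_term)   # odd dimensions
    return pe

# Added element-wise to the token embeddings before the first decoder layer:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```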

Decoders

  • There are multiple decoder layers (Decoder #1, Decoder #2, …, Decoder #n) in the figure.

  • Each decoder layer contains several key components (a code sketch follows this list):

    • Masked Multi-Head Attention: This is the first sub-layer of the decoder, performing causal self-attention over the input sequence.

    • Add & Norm: After each sub-layer, a residual connection and layer normalization aid gradient propagation and training stability.

    • Multi-Head Attention: This is the second sub-layer of the decoder, handling encoder-decoder (cross) attention.

    • Add & Norm: As before, a residual connection and layer normalization applied to the output of the multi-head attention.

    • Feed Forward: This is the final sub-layer of the decoder, typically two linear transformations with an activation function (e.g., ReLU) between them.

    • Add & Norm: Another residual connection and layer normalization.
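
A minimal PyTorch sketch of one decoder layer as described above, assuming the standard post-norm arrangement; d_model, n_heads, d_ff, and memory (the encoder output consumed by the cross-attention sub-layer) are hypothetical names, not values taken from the figure.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, encoder-decoder attention,
    and a feed-forward network, each followed by Add & Norm."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, memory, causal_mask):
        # Masked multi-head self-attention, then Add & Norm
        out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + out)
        # Encoder-decoder (cross) attention, then Add & Norm
        out, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + out)
        # Position-wise feed-forward, then Add & Norm
        return self.norm3(x + self.ffn(x))
```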

Linear Layer

  • After the decoder layers, there is a linear layer (labeled Linear) that maps the decoder's output to the required output dimension, typically the vocabulary size; see the sketch below.
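
A minimal sketch of that projection, assuming it maps each position to vocabulary-sized logits; the sizes below are hypothetical, as the figure does not state them.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the figure does not specify them.
d_model, vocab_size = 1024, 50000
to_logits = nn.Linear(d_model, vocab_size)

decoder_output = torch.randn(2, 16, d_model)               # (batch, seq_len, d_model)
logits = to_logits(decoder_output)                          # (batch, seq_len, vocab_size)
next_token_probs = torch.softmax(logits[:, -1], dim=-1)     # distribution over the vocabulary
```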

Outputs

  • At the top right of the figure are the outputs.

Attention Mechanism

  • The right side of the figure shows the internal structure of the attention mechanism in detail.

  • The attention mechanism includes three main operations (see the sketch after this list):

    • MatMul: Computes the dot products between the queries (Q) and keys (K), and later combines the resulting attention weights with the values (V).

    • Mask (opt.): Prevents the model from attending to future positions in the sequence.

    • Softmax: Normalizes the attention scores into weights between 0 and 1 that sum to 1 across each row.

  • Additionally, there is a Scale operation that divides the attention scores by √d_k (the key dimension) before the softmax to keep the values in a stable range.
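
A minimal sketch of this scaled dot-product attention pipeline (MatMul, Scale, optional Mask, Softmax, MatMul); the tensor shapes and names are illustrative.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """MatMul -> Scale -> Mask (opt.) -> Softmax -> MatMul."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)       # MatMul, then Scale by sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))    # hide future positions
    weights = torch.softmax(scores, dim=-1)                 # each row sums to 1
    return weights @ v                                      # MatMul with the values

# Causal mask: position i may only attend to positions <= i
seq_len, d_k = 4, 8
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
q = k = v = torch.randn(seq_len, d_k)
out = scaled_dot_product_attention(q, k, v, mask)           # (seq_len, d_k)
```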

Number of Parameters

  • At the bottom right of the figure, it is noted that the model has 100 billion parameters, indicating that it is an extremely large-scale model.
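
The figure states only the total, so the following is a rough, hypothetical back-of-the-envelope check using the common approximation of about 12·d_model² parameters per Transformer block; the depth and width values below are assumptions, not taken from the figure.

```python
# Rough estimate: each Transformer block has roughly 12 * d_model**2 parameters
# (attention projections plus the feed-forward network).
n_layers, d_model = 80, 10240                 # hypothetical values, not from the figure
params_per_layer = 12 * d_model ** 2          # ≈ 1.26e9
total = n_layers * params_per_layer           # ≈ 1.0e11
print(f"≈ {total / 1e9:.1f} billion parameters")   # -> ≈ 100.7 billion parameters
```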
