# Transformer Decoder Architecture

<figure><img src="https://2202465257-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FfeTJMmefEjVoLviFIGxM%2Fuploads%2FelvEAwtzjCREywjxUt3x%2Fimage.png?alt=media&#x26;token=f2ed3c27-6482-46c8-bd68-6da565b8c5ff" alt=""><figcaption><p>A Transformer decoder architecture with 100 billion parameters.</p></figcaption></figure>

### Overall Architecture

#### Inputs

* At the bottom-left of the figure are the inputs.
* Following the inputs is the Positional Encoding, which adds position information to each token in the input sequence; a sketch of one common scheme follows this list.
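A minimal sketch of the sinusoidal positional encoding from the original Transformer paper, assuming that is the scheme used here (large decoder-only models often use learned or rotary position embeddings instead):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) tensor of sinusoidal position encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                                   # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine
    return pe

# The encoding is simply added to the token embeddings:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```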

#### Decoders

* There are multiple decoder layers (Decoder #1, Decoder #2, …, Decoder #n) in the figure.
* Each decoder layer contains several key components (a PyTorch sketch of one layer follows this list):
  * **Masked Multi-Head Attention**: The first sub-layer of the decoder, performing self-attention within the input sequence; the mask prevents each position from attending to later positions.
  * **Add & Norm**: After each sub-layer, a residual connection and layer normalization aid gradient propagation and training stability.
  * **Multi-Head Attention**: The second sub-layer of the decoder, handling encoder-decoder (cross) attention, where queries come from the decoder and keys and values come from the encoder output.
  * **Add & Norm**: As before, a residual connection and layer normalization applied to the output of the multi-head attention.
  * **Feed Forward**: The final sub-layer of the decoder, typically two linear transformations with an activation function (e.g., ReLU) in between.
  * **Add & Norm**: Another residual connection and layer normalization.
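A minimal PyTorch sketch of one such decoder layer, using the post-norm ordering from the original Transformer; `d_model`, `n_heads`, and `d_ff` are illustrative assumptions, not values read from the figure:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask):
        # 1. Masked multi-head self-attention, then Add & Norm
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)
        # 2. Encoder-decoder (cross) attention, then Add & Norm
        attn_out, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + attn_out)
        # 3. Position-wise feed-forward network, then Add & Norm
        x = self.norm3(x + self.ffn(x))
        return x
```

A causal mask for `seq_len` positions can be built with `torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)`, where `True` marks positions that may not be attended to.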

#### Linear Layer

* After the decoder layers, there is a linear layer (labeled Linear) that projects the decoder's final hidden states to the output dimension, typically the vocabulary size; a short sketch follows.
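A hypothetical sketch of this projection; `d_model` and `vocab_size` are illustrative assumptions:

```python
import torch.nn as nn

d_model, vocab_size = 512, 32000
to_logits = nn.Linear(d_model, vocab_size)

# logits = to_logits(decoder_output)   # (batch, seq_len, vocab_size)
# probs = logits.softmax(dim=-1)       # distribution over the vocabulary
```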

#### Outputs

* At the top-right of the figure are the output layers.

### Attention Mechanism

* The right side of the figure shows the internal structure of the attention mechanism in detail.
* The attention mechanism includes three main operations (see the sketch after this list):
  * **MatMul**: Matrix multiplication, applied first to compute similarity scores between queries (Q) and keys (K), and again at the end to combine the resulting weights with the values (V).
  * **Mask (opt.)**: An optional mask that prevents the model from attending to future positions during training.
  * **Softmax**: Normalizes the scores into attention weights between 0 and 1 that sum to 1 along each row.
* Additionally, there is a Scale operation that divides the scores by the square root of the key dimension to keep the softmax gradients stable.
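A minimal sketch of scaled dot-product attention matching the diagram's MatMul → Scale → Mask (opt.) → Softmax → MatMul pipeline:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k: (..., seq_len, d_k); v: (..., seq_len, d_v); mask: bool, True = blocked."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # MatMul + Scale
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # Mask (opt.)
    weights = scores.softmax(dim=-1)                      # Softmax
    return weights @ v                                    # final MatMul with V
```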

### Number of Parameters

* At the bottom-right of the figure, it is noted that the model has 100 billion parameters, indicating that it is an extremely large-scale model.
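For reference, a parameter count like this can be verified in PyTorch with a one-line sum; `model` stands in for any `nn.Module` and is not the 100-billion-parameter model from the figure:

```python
def count_parameters(model) -> int:
    """Total number of trainable parameters in a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. count_parameters(DecoderLayer()) for the sketch above; a
# 100-billion-parameter model is one where this sum reaches ~1e11.
```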
