Transformer Decoder Architecture
A detailed view of a Transformer decoder architecture with 100 billion parameters.
At the bottom-left of the figure are the inputs.
Following the inputs is the Positional Encoding step, which adds position information to each token embedding in the input sequence.
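The figure does not say which encoding scheme this model uses; as a minimal sketch, the fixed sinusoidal encoding from the original Transformer can be computed as follows (seq_len and d_model are illustrative parameters, and an even d_model is assumed):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of fixed sinusoidal position encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    # Frequencies decrease geometrically across the embedding dimensions.
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# The encoding is added element-wise to the token embeddings before the first decoder layer:
# embeddings = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```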
There are multiple decoder layers (Decoder #1, Decoder #2, …, Decoder #n) in the figure.
Each decoder layer contains several key components (a sketch of one layer follows this list):
Masked Multi-Head Attention: This is the first sub-layer of the decoder, applying self-attention within the decoder's own input sequence, with masking so each position cannot attend to later positions.
Add & Norm: After each sub-layer, there is a residual connection and layer normalization operation to aid gradient propagation and training stability.
Multi-Head Attention: This is the second sub-layer of the decoder, handling encoder-decoder (cross-) attention over the encoder's output.
Add & Norm: As before, a residual connection and layer normalization applied to the output of the multi-head attention sub-layer.
Feed Forward: This is the final sub-layer of the decoder, typically consisting of two linear transformations and an activation function (e.g., ReLU).
Add & Norm: Another residual connection and layer normalization.
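A minimal PyTorch sketch of one such decoder layer, assuming the post-norm arrangement described above; the hyperparameters d_model, n_heads, and d_ff are illustrative and not taken from the figure:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Masked self-attention -> cross-attention -> feed-forward,
    each sub-layer followed by Add & Norm (residual connection + LayerNorm)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out, causal_mask):
        # 1) Masked multi-head self-attention over the decoder's own sequence.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)                       # Add & Norm
        # 2) Encoder-decoder (cross) attention over the encoder output.
        cross_out, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + cross_out)                      # Add & Norm
        # 3) Position-wise feed-forward network.
        return self.norm3(x + self.ff(x))                  # Add & Norm
```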
After the decoder layers, a linear layer (labeled as Linear) projects the decoder's output to the required output dimension, typically the vocabulary size.
At the top-right of the figure are the output layers.
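A minimal sketch of this output path, assuming the Linear layer produces vocabulary logits that a softmax turns into next-token probabilities; d_model and vocab_size below are placeholder values:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 1024, 50_000             # placeholder sizes, not from the figure
output_projection = nn.Linear(d_model, vocab_size)

decoder_output = torch.randn(2, 16, d_model)   # (batch, seq_len, d_model) from the last decoder layer
logits = output_projection(decoder_output)     # (batch, seq_len, vocab_size)
next_token_probs = torch.softmax(logits[:, -1, :], dim=-1)   # distribution over the next token
```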
The right side of the figure shows the internal structure of the attention mechanism in detail.
The attention mechanism includes three main operations:
MatMul: Matrix multiplications, first between the queries (Q) and keys (K) to compute attention scores, and then between the resulting weights and the values (V).
Mask (opt.): Used to prevent the model from seeing future information during training.
Softmax: Normalizes the attention scores into weights between 0 and 1 that sum to 1 over each query's positions.
Additionally, there is a Scale operation that divides the scores by the square root of the key dimension before the softmax to keep them in a stable range; a sketch of the full scaled dot-product attention follows.
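A minimal sketch of the scaled dot-product attention these operations describe (tensor shapes and the sequence length are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional mask as shown in the figure."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # MatMul + Scale
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # Mask (opt.)
    weights = torch.softmax(scores, dim=-1)               # Softmax
    return weights @ v                                    # final MatMul with V

# Causal mask: True marks future positions that must not be attended to.
seq_len = 8
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
```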
At the bottom-right of the figure, it is noted that the model has 100 billion parameters, indicating that it is an extremely large-scale model.
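The figure gives only the total, but a rough back-of-the-envelope estimate shows how a count of that order arises; every dimension below is hypothetical and chosen only so that the total lands near 100 billion, not read from the figure:

```python
# Hypothetical dimensions, for illustration only.
d_model = 8192          # hidden size
d_ff = 4 * d_model      # feed-forward inner size
n_layers = 92           # number of decoder layers
vocab_size = 50_000

# Per decoder layer (weight matrices only; biases and LayerNorms ignored):
#   self-attention  : 4 * d_model^2   (Q, K, V and output projections)
#   cross-attention : 4 * d_model^2
#   feed-forward    : 2 * d_model * d_ff
per_layer = 4 * d_model**2 + 4 * d_model**2 + 2 * d_model * d_ff
embedding = vocab_size * d_model

total = n_layers * per_layer + embedding
print(f"{total / 1e9:.1f}B parameters")   # roughly 99B with these assumed sizes
```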