# Transformer Decoder Architecture

<figure><img src="https://2202465257-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FfeTJMmefEjVoLviFIGxM%2Fuploads%2FelvEAwtzjCREywjxUt3x%2Fimage.png?alt=media&#x26;token=f2ed3c27-6482-46c8-bd68-6da565b8c5ff" alt=""><figcaption><p>A Transformer decoder architecture with 100 billion parameters.</p></figcaption></figure>

### Overall Architecture

#### Inputs

* At the bottom-left of the figure are the inputs.
* Following the inputs is the Positional Encoding, which adds position information to each token in the input sequence; a sketch of one common scheme follows this list.
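A minimal sketch of the sinusoidal positional encoding from the original Transformer paper, assuming that is the scheme used here (large decoder-only models often use learned or rotary position embeddings instead):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) tensor of sinusoidal position encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                                   # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine
    return pe

# The encoding is simply added to the token embeddings:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```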

#### Decoders

* There are multiple decoder layers (Decoder #1, Decoder #2, …, Decoder #n) in the figure.
* Each decoder layer contains several key components (a PyTorch sketch of one layer follows this list):
  * **Masked Multi-Head Attention**: The first sub-layer of the decoder, performing self-attention within the input sequence; the mask prevents each position from attending to later positions.
  * **Add & Norm**: After each sub-layer, a residual connection and layer normalization aid gradient propagation and training stability.
  * **Multi-Head Attention**: The second sub-layer of the decoder, handling encoder-decoder (cross) attention, where queries come from the decoder and keys and values come from the encoder output.
  * **Add & Norm**: As before, a residual connection and layer normalization applied to the output of the multi-head attention.
  * **Feed Forward**: The final sub-layer of the decoder, typically two linear transformations with an activation function (e.g., ReLU) in between.
  * **Add & Norm**: Another residual connection and layer normalization.
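A minimal PyTorch sketch of one such decoder layer, using the post-norm ordering from the original Transformer; `d_model`, `n_heads`, and `d_ff` are illustrative assumptions, not values read from the figure:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask):
        # 1. Masked multi-head self-attention, then Add & Norm
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)
        # 2. Encoder-decoder (cross) attention, then Add & Norm
        attn_out, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + attn_out)
        # 3. Position-wise feed-forward network, then Add & Norm
        x = self.norm3(x + self.ffn(x))
        return x
```

A causal mask for `seq_len` positions can be built with `torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)`, where `True` marks positions that may not be attended to.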

#### Linear Layer

* After the decoder layers, there is a linear layer (labeled Linear) that projects the decoder's final hidden states to the output dimension, typically the vocabulary size; a short sketch follows.
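A hypothetical sketch of this projection; `d_model` and `vocab_size` are illustrative assumptions:

```python
import torch.nn as nn

d_model, vocab_size = 512, 32000
to_logits = nn.Linear(d_model, vocab_size)

# logits = to_logits(decoder_output)   # (batch, seq_len, vocab_size)
# probs = logits.softmax(dim=-1)       # distribution over the vocabulary
```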

#### Outputs

* At the top-right of the figure are the output layers.

### Attention Mechanism

* The right side of the figure shows the internal structure of the attention mechanism in detail.
* The attention mechanism includes three main operations (see the sketch after this list):
  * **MatMul**: Matrix multiplication, applied first to compute similarity scores between queries (Q) and keys (K), and again at the end to combine the resulting weights with the values (V).
  * **Mask (opt.)**: An optional mask that prevents the model from attending to future positions during training.
  * **Softmax**: Normalizes the scores into attention weights between 0 and 1 that sum to 1 along each row.
* Additionally, there is a Scale operation that divides the scores by the square root of the key dimension to keep the softmax gradients stable.
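A minimal sketch of scaled dot-product attention matching the diagram's MatMul → Scale → Mask (opt.) → Softmax → MatMul pipeline:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k: (..., seq_len, d_k); v: (..., seq_len, d_v); mask: bool, True = blocked."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # MatMul + Scale
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # Mask (opt.)
    weights = scores.softmax(dim=-1)                      # Softmax
    return weights @ v                                    # final MatMul with V
```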

### Number of Parameters

* At the bottom-right of the figure, it is noted that the model has 100 billion parameters, indicating that it is an extremely large-scale model.
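For reference, a parameter count like this can be verified in PyTorch with a one-line sum; `model` stands in for any `nn.Module` and is not the 100-billion-parameter model from the figure:

```python
def count_parameters(model) -> int:
    """Total number of trainable parameters in a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. count_parameters(DecoderLayer()) for the sketch above; a
# 100-billion-parameter model is one where this sum reaches ~1e11.
```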
