
# Transformer Decoder Architecture

<figure><img src="/files/OdGOaQJJOiEzsyJLjF8M" alt=""><figcaption><p>A Transformer decoder architecture with 100 billion parameters.</p></figcaption></figure>

### Overall Architecture

#### Inputs

* At the bottom-left of the figure are the inputs to the model.
* The inputs pass through the Positional Encoding, which adds position information to the representation at each position in the input sequence, so that the model can take token order into account.
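
To make the positional encoding step concrete, here is a minimal sketch. It assumes the fixed sinusoidal scheme, which is one common choice; the page does not say which scheme the figure uses, and the function names are illustrative only.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings.

    Assumption: sinusoidal encodings; the figure may use a different scheme.
    """
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positional_encoding(embeddings):
    """Add the position table element-wise to a list of input embeddings."""
    seq_len, d_model = len(embeddings), len(embeddings[0])
    table = sinusoidal_positional_encoding(seq_len, d_model)
    return [[e + p for e, p in zip(emb_row, pe_row)]
            for emb_row, pe_row in zip(embeddings, table)]
```

Because the table depends only on the position and the embedding width, it can be computed once and reused for every input sequence of the same length.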

#### Decoders

* There are multiple decoder layers (Decoder #1, Decoder #2, …, Decoder #n) in the figure.
* Each decoder layer contains several key components:
  * **Masked Multi-Head Attention**: This is the first sub-layer of the decoder. It performs self-attention over the input sequence, with a mask that hides later positions.
  * **Add & Norm**: After each sub-layer, there is a residual connection followed by layer normalization, which aids gradient propagation and training stability.
  * **Multi-Head Attention**: This is the second sub-layer of the decoder, handling encoder-decoder attention, in which the queries come from the decoder and the keys and values come from the encoder output.
  * **Add & Norm**: As before, a residual connection and layer normalization are applied to the output of the multi-head attention sub-layer.
  * **Feed Forward**: This is the final sub-layer of the decoder, typically consisting of two linear transformations with an activation function (e.g., ReLU) between them.
  * **Add & Norm**: Another residual connection and layer normalization.
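
The order of sub-layers listed above can be sketched as follows. This is a simplified illustration, not the model's actual implementation: it uses plain Python lists, omits the learned gain and bias of layer normalization, and takes the two attention sub-layers as caller-supplied functions (`masked_self_attention` and `cross_attention` are hypothetical names).

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (learned gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(xs, sublayer_out):
    """Add & Norm: residual connection, then layer normalization per position."""
    return [layer_norm([a + b for a, b in zip(x, y)])
            for x, y in zip(xs, sublayer_out)]

def linear(x, weight, bias):
    """Compute x @ weight + bias for one vector; weight has shape in_dim x out_dim."""
    return [sum(xi * row[j] for xi, row in zip(x, weight)) + bias[j]
            for j in range(len(bias))]

def make_feed_forward(w1, b1, w2, b2):
    """Feed Forward: two linear transformations with a ReLU in between."""
    def feed_forward(xs):
        return [linear([max(0.0, h) for h in linear(x, w1, b1)], w2, b2)
                for x in xs]
    return feed_forward

def decoder_layer(xs, masked_self_attention, cross_attention, feed_forward):
    """Apply the three sub-layers in order, each followed by Add & Norm."""
    xs = add_and_norm(xs, masked_self_attention(xs))
    xs = add_and_norm(xs, cross_attention(xs))
    xs = add_and_norm(xs, feed_forward(xs))
    return xs

def identity(seq):
    """Stand-in for an attention sub-layer, used only to exercise the wiring."""
    return seq

# Toy sizes (hypothetical): d_model = 2, feed-forward width = 3.
ffn = make_feed_forward(
    w1=[[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]], b1=[0.0, 0.0, 0.0],
    w2=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], b2=[0.0, 0.0],
)
out = decoder_layer([[1.0, 2.0], [3.0, -1.0]], identity, identity, ffn)
```

Stacking Decoder #1 through Decoder #n then amounts to calling `decoder_layer` repeatedly, feeding each layer's output into the next, with separate weights per layer.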

#### Linear Layer

* After the decoder layers, there is a linear layer (labeled as Linear) that projects the output of the final decoder layer to the dimension required by the output, which in a language model is typically the vocabulary size.
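
As a rough illustration, the linear layer is an affine map applied to each position independently. The sketch below assumes a weight matrix of shape `d_model x out_dim` and a bias of length `out_dim`; the names and toy sizes are hypothetical and not taken from the figure.

```python
def linear_projection(hidden_states, weight, bias):
    """Map each d_model-sized vector to out_dim values: h @ weight + bias."""
    return [
        [sum(h_i * w_row[j] for h_i, w_row in zip(h, weight)) + bias[j]
         for j in range(len(bias))]
        for h in hidden_states
    ]

# Toy example: d_model = 2, out_dim = 3.
projected = linear_projection(
    hidden_states=[[1.0, 2.0]],
    weight=[[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
    bias=[0.0, 0.0, 0.5],
)
```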

#### Outputs

* At the top-right of the figure are the outputs of the model.

### Attention Mechanism

* The right side of the figure shows the internal structure of the attention mechanism in detail.
* The attention mechanism includes three main operations:
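  * **MatMul**: Matrix multiplication, used first to compute similarity scores between queries (Q) and keys (K), and then to combine the values (V) using the resulting attention weights.
  * **Mask (opt.)**: An optional step that prevents the model from seeing future information during training, so that each position attends only to itself and earlier positions.
  * **Softmax**: Normalizes the scores into attention weights between 0 and 1 that sum to 1 for each query.
* Additionally, there is a Scale operation that adjusts the magnitude of the scores before the softmax, typically by dividing them by the square root of the key dimension.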
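
The operations above combine into scaled dot-product attention. The following is a minimal single-head sketch in plain Python, assuming the common scaling factor of `1 / sqrt(d_k)` and a causal mask; multi-head attention runs several such computations in parallel on projected inputs, which is omitted here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q and K are lists of d_k-sized vectors; V is a list of d_v-sized vectors.

    Returns (outputs, weights), one row per query.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # MatMul + Scale: dot product of the query with every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Mask (opt.): positions after i get -inf, hence zero weight after softmax.
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        # Softmax: turn scores into weights in [0, 1] that sum to 1.
        w = softmax(scores)
        # MatMul: weighted sum of the value vectors.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                        for c in range(len(V[0]))])
        weights.append(w)
    return outputs, weights

# Toy example with two positions and d_k = d_v = 2.
X = [[1.0, 0.0], [0.0, 1.0]]
masked_out, masked_w = scaled_dot_product_attention(X, X, X, causal=True)
```

With the mask enabled, the first position can attend only to itself, so its weight row is `[1.0, 0.0]` and its output equals its own value vector.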

### Number of Parameters

* At the bottom-right of the figure, it is noted that the model has 100 billion parameters, indicating that it is an extremely large-scale model.
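
To give a sense of scale, the sketch below estimates the weight-matrix parameter count of a stack of decoder layers with the three sub-layers described on this page. The configuration used (60 layers, model width 10240, feed-forward width four times the model width) is purely hypothetical and chosen only because it lands near 100 billion; the page does not state the model's actual dimensions. Biases, layer normalization, embeddings, and the final linear layer are ignored.

```python
def decoder_parameter_estimate(n_layers, d_model, d_ff=None):
    """Rough count of weight-matrix parameters in a stack of decoder layers.

    Assumes each layer has two attention sub-layers and one feed-forward
    sub-layer; ignores biases, layer norm, embeddings, and the output layer.
    """
    d_ff = 4 * d_model if d_ff is None else d_ff
    attention = 4 * d_model * d_model    # Q, K, V, and output projections
    feed_forward = 2 * d_model * d_ff    # two linear transformations
    per_layer = 2 * attention + feed_forward
    return n_layers * per_layer

# Hypothetical configuration, not taken from the figure.
estimate = decoder_parameter_estimate(n_layers=60, d_model=10240)
print(f"{estimate:,}")  # 100,663,296,000
```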
