
Transformer Decoder Architecture

A detailed view of a Transformer decoder architecture with 100 billion parameters.


Overall Architecture

Inputs

  • At the bottom-left of the figure are the model inputs.

  • Following the inputs is the Positional Encoding, which injects position information into each token embedding in the input sequence, as sketched below.
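
For illustration, here is a minimal NumPy sketch of the sinusoidal positional encoding from the original Transformer paper. The figure does not specify which encoding scheme this model uses, so the sinusoidal form is an assumption.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding (Vaswani et al., 2017).

    Returns a (seq_len, d_model) matrix that is added to the token
    embeddings so each position carries a unique signal.
    """
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # broadcast

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# Usage: x = token_embeddings + positional_encoding(seq_len, d_model)
```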

Decoders

  • There are multiple decoder layers (Decoder #1, Decoder #2, …, Decoder #n) in the figure.

  • Each decoder layer contains several key components (a minimal sketch of one layer follows this list):

    • Masked Multi-Head Attention: This is the first sub-layer of the decoder, performing self-attention over the input sequence; the causal mask ensures each position attends only to earlier positions.

    • Add & Norm: After each sub-layer, there is a residual connection and layer normalization to aid gradient propagation and training stability.

    • Multi-Head Attention: This is the second sub-layer of the decoder, handling encoder-decoder attention (queries come from the decoder; keys and values come from the encoder output).

    • Add & Norm: Similar to before, it processes the output of the multi-head attention mechanism.

    • Feed Forward: This is the final sub - layer of the decoder, typically consisting of two linear transformations and an activation function (e.g., ReLU).

    • Add & Norm: Another residual connection and layer normalization.
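
To make the sub-layer order concrete, below is a minimal single-head NumPy sketch of one decoder layer. The single-head simplification and the parameter names (wq1, wk1, and so on) are illustrative assumptions for readability, not the figure's actual configuration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalization over the feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention (detailed in the Attention Mechanism section)."""
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # hide masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feed_forward(x, w1, b1, w2, b2):
    """Two linear transformations with a ReLU activation in between."""
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

def decoder_layer(x, enc_out, p, causal_mask):
    """One decoder layer, following the sub-layer order in the figure."""
    # 1. Masked multi-head attention: self-attention over the decoder input
    a = attention(x @ p["wq1"], x @ p["wk1"], x @ p["wv1"], causal_mask)
    x = layer_norm(x + a)                                    # Add & Norm

    # 2. Multi-head attention: queries from the decoder,
    #    keys and values from the encoder output
    a = attention(x @ p["wq2"], enc_out @ p["wk2"], enc_out @ p["wv2"])
    x = layer_norm(x + a)                                    # Add & Norm

    # 3. Feed forward network
    f = feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"])
    return layer_norm(x + f)                                 # Add & Norm
```

Stacking n such layers (Decoder #1 through Decoder #n) gives the full decoder stack: the output of each layer is the input of the next.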

Linear Layer

  • After the decoder layers, there is a linear layer (labeled Linear) that maps the decoder's output to the output dimension, typically the vocabulary size, producing one logit per token.
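
As a sketch, the projection might look like the following, where the output dimension is assumed to be the vocabulary size (the figure does not label it explicitly):

```python
import numpy as np

def output_logits(decoder_out, w_out, b_out):
    """Project the decoder output (seq_len, d_model) to logits
    (seq_len, vocab_size); w_out has shape (d_model, vocab_size).
    A softmax over the last axis then turns each row of logits into
    a probability distribution over the vocabulary."""
    return decoder_out @ w_out + b_out
```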

Outputs

  • At the top-right of the figure are the model outputs.

Attention Mechanism

  • The right side of the figure shows the internal structure of the attention mechanism in detail.

  • The attention mechanism includes three main operations (see the sketch after this list):

    • MatMul: Matrix multiplications that compute the similarity scores between queries (Q) and keys (K), and then the weighted sum over values (V).

    • Mask (opt.): Used to prevent the model from seeing future information during training.

    • Softmax: Used to normalize the attention scores into weights between 0 and 1 that sum to 1 across each row.

  • Additionally, there is a Scale operation that divides the scores by √d_k (the key dimension) so the softmax operates in a numerically stable range.
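
Putting these operations together, here is a minimal NumPy sketch of scaled dot-product attention with each step from the figure labeled; the single-head form is a simplification for readability.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k: (seq_len, d_k); v: (seq_len, d_v)."""
    d_k = q.shape[-1]

    scores = q @ k.T                  # MatMul: query-key similarity scores
    scores = scores / np.sqrt(d_k)    # Scale: divide by sqrt(d_k)

    if mask is not None:              # Mask (opt.): hide future positions
        scores = np.where(mask, scores, -1e9)

    # Softmax: normalize each row of scores into weights in [0, 1]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v                # MatMul: weighted sum of the values

# Causal mask for a sequence of length n:
# position i may attend only to positions j <= i.
# mask = np.tril(np.ones((n, n), dtype=bool))
```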

Number of Parameters

  • At the bottom-right of the figure, it is noted that the model has 100 billion parameters, indicating that it is an extremely large-scale model.
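
As a rough sanity check, a common estimate for a Transformer of this kind is about 12 · n_layers · d_model² parameters in the layers plus vocab_size · d_model for the embeddings. The configuration below is purely illustrative, chosen so the total lands near 100 billion; the figure does not state the model's actual hyperparameters.

```python
def count_params(n_layers, d_model, vocab_size, d_ff=None):
    """Approximate parameter count for a Transformer decoder stack.

    Counts one attention block per layer (decoder-only convention);
    a layer with encoder-decoder attention would add roughly another
    4 * d_model**2 parameters per layer.
    """
    d_ff = d_ff or 4 * d_model            # common convention: d_ff = 4 * d_model
    attn = 4 * d_model * d_model          # Q, K, V and output projections
    ffn = 2 * d_model * d_ff              # two linear transformations
    embed = vocab_size * d_model          # token embeddings
    return n_layers * (attn + ffn) + embed

# Illustrative configuration only (assumed, not from the figure):
print(count_params(n_layers=80, d_model=10240, vocab_size=50000))
# -> 101,175,296,000, i.e. roughly 100 billion parameters
```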
