Introduction to LLMs - Architecture and Components - BERT, RoBERTa, and T5
Language models have revolutionized natural language processing (NLP). In this article, we'll dive into the architectures and components of three major language models: BERT, RoBERTa, and T5. Understanding their inner workings will give you a better grasp of how they have shaped NLP tasks and the applications they enable.
BERT - Bidirectional Encoder Representations from Transformers
BERT, or Bidirectional Encoder Representations from Transformers, is a pre-trained language model developed by Google AI. BERT has had a significant impact on NLP tasks because it captures context from both the left and the right of every token in a sentence.
Architecture
BERT's architecture consists of a multi-layer bidirectional Transformer encoder, in which every token attends to every other token in the sequence rather than reading the text in a single direction as earlier language models did. There are two primary versions of BERT (a short sketch after the list shows how to inspect these settings):
- BERT-Base: 12 layers, 12 attention heads, hidden size 768, and roughly 110 million parameters
- BERT-Large: 24 layers, 16 attention heads, hidden size 1024, and roughly 340 million parameters
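As a quick illustration, the following is a minimal sketch that loads the publicly released bert-base-uncased checkpoint with the Hugging Face transformers library (an assumption about tooling; it also assumes the checkpoint can be downloaded) and prints the settings listed above.

```python
# Sketch: inspect BERT-Base's size with the Hugging Face `transformers` library
# (assumed tooling; requires `transformers` and `torch` to be installed).
from transformers import BertConfig, BertModel

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)      # 12 Transformer encoder layers
print(config.num_attention_heads)    # 12 attention heads per layer
print(config.hidden_size)            # 768-dimensional hidden states

model = BertModel.from_pretrained("bert-base-uncased")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # roughly 110M
```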
Components
The main components of BERT are:
- WordPiece Tokenization: BERT uses WordPiece tokenization to break input text into subword units, so rare or out-of-vocabulary words are represented as sequences of known pieces rather than discarded (see the tokenization sketch after this list).
- Positional Embeddings: Position embeddings are added to the token embeddings so the model knows where each token sits in the sequence; unlike the fixed sinusoidal encoding of the original Transformer, BERT learns these embeddings during pretraining.
- Multi-head Self-attention: Each token builds its representation by attending to every other token in the sequence, and the multiple heads let the model capture different kinds of relationships in parallel (a minimal implementation sketch also follows this list).
- Transformer Layers: Stacking many such encoder layers, each pairing self-attention with a feed-forward network, lets BERT learn complex language structure and long-range contextual relationships.
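To make the WordPiece point concrete, here is a small sketch using the Hugging Face tokenizer for bert-base-uncased (again an assumption about tooling; any WordPiece implementation behaves similarly).

```python
# Sketch: WordPiece tokenization with the Hugging Face tokenizer (assumed tooling).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A rare or unseen word is split into known subword pieces; continuation pieces
# are prefixed with "##", so nothing falls entirely out of vocabulary.
print(tokenizer.tokenize("bidirectionality"))

# encode() additionally adds the special [CLS] and [SEP] tokens and maps
# each piece to its vocabulary id.
print(tokenizer.encode("bidirectionality"))
```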
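And to illustrate the self-attention mechanism, here is a stripped-down, single-layer sketch of multi-head self-attention in plain PyTorch. It mirrors the mechanism inside each BERT layer but is a simplified illustration, not BERT's actual implementation.

```python
# Sketch: multi-head self-attention in plain PyTorch (illustrative, not BERT's code).
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # One linear projection each for queries, keys, values, and the output.
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):  # x: (batch, seq_len, hidden_size)
        batch, seq_len, _ = x.shape

        def split(t):  # -> (batch, heads, seq_len, head_dim)
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Scaled dot-product attention: every token attends to every other token.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        weights = scores.softmax(dim=-1)
        context = (weights @ v).transpose(1, 2).reshape(batch, seq_len, -1)
        return self.out_proj(context)

attn = MultiHeadSelfAttention()
out = attn(torch.randn(2, 16, 768))  # batch of 2 sequences, 16 tokens each
print(out.shape)                     # torch.Size([2, 16, 768])
```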
RoBERTa - Robustly Optimized BERT Pretraining Approach
Developed by Facebook AI, RoBERTa is an optimized version of BERT that addresses some of its limitations and achieves better performance on various NLP tasks.
Architecture
RoBERTa keeps BERT's Transformer-encoder architecture essentially unchanged; the gains come from how the model is pretrained. It likewise comes in base and large versions, with 12 and 24 layers respectively (roughly 125 million and 355 million parameters).
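As a quick check that the two encoders line up, the sketch below compares the released base checkpoints via the Hugging Face transformers library (again an assumption about tooling and checkpoint availability).

```python
# Sketch: compare BERT and RoBERTa base configurations (Hugging Face, assumed tooling).
from transformers import AutoConfig

bert = AutoConfig.from_pretrained("bert-base-uncased")
roberta = AutoConfig.from_pretrained("roberta-base")

for name, cfg in [("BERT-Base", bert), ("RoBERTa-Base", roberta)]:
    print(name, cfg.num_hidden_layers, "layers,",
          cfg.num_attention_heads, "heads,", cfg.hidden_size, "hidden size")
# Both report 12 layers, 12 heads, hidden size 768: the encoder itself is the same,
# and RoBERTa's differences lie in the tokenizer and the pretraining recipe.
```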
Components
RoBERTa's key components and improvements include:
- Dynamic Masking: Whereas BERT's original implementation chose the masked positions once during data preprocessing, RoBERTa samples a new masking pattern every time a sequence is fed to the model, so the model is asked to predict different tokens for the same sentence across training epochs (see the sketch after this list).
- Larger Batch Size & Training Data: RoBERTa pretrains with much larger batches, on roughly ten times as much text (about 160 GB versus BERT's 16 GB), and for more steps, which yields consistently better downstream performance.
- Removal of Next Sentence Prediction (NSP) Task: RoBERTa removes the NSP task used in BERT, as it was found to be less useful for downstream tasks.
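To make the dynamic-masking idea concrete, here is a minimal sketch: instead of fixing which tokens are masked when the corpus is preprocessed (static masking), a fresh mask is sampled every time a sequence is drawn. The 15% masking rate matches the published setup; everything else (including skipping the 80/10/10 mask/random/keep split) is simplified for illustration.

```python
# Sketch: static vs. dynamic masking (simplified illustration, not RoBERTa's actual code).
import random

MASK, MLM_PROB = "[MASK]", 0.15  # 15% of tokens are selected, as in BERT/RoBERTa

def mask_tokens(tokens):
    """Return a copy of `tokens` with a freshly sampled set of positions masked."""
    return [MASK if random.random() < MLM_PROB else tok for tok in tokens]

sentence = "the quick brown fox jumps over the lazy dog".split()

# Static masking (original BERT): the mask is sampled once at preprocessing time,
# so the model sees the same masked positions in every epoch.
static = mask_tokens(sentence)
for epoch in range(3):
    print("static :", static)

# Dynamic masking (RoBERTa): a new mask is sampled each time the sequence is used,
# so across epochs the model must predict different tokens.
for epoch in range(3):
    print("dynamic:", mask_tokens(sentence))
```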