Fine-tune Pre-trained Models using Hugging Face and Python
In this tutorial, we will explore how to fine-tune pre-trained models from the Hugging Face library using Python. Fine-tuning pre-trained models is an essential step to improve the performance of your machine learning projects, especially when working with Natural Language Processing (NLP) tasks.
Table of Contents
- Introduction to Hugging Face
- Prerequisites
- Loading a Pre-trained Model
- Preparing the Dataset
- Fine-tuning the Model
- Evaluating the Model
- Using the Fine-tuned Model
- Conclusion
Introduction to Hugging Face
Hugging Face provides open-source libraries and a hub of pre-trained models and datasets for NLP. It offers state-of-the-art models such as BERT, GPT-2, and RoBERTa, which can be fine-tuned for specific tasks like sentiment analysis and text classification.
Prerequisites
Before we start, make sure you have Python 3.x installed on your machine. Next, install the following packages:
pip install transformers
pip install datasets
transformers is the Hugging Face library that provides the pre-trained models and training utilities, and datasets is the companion library for loading and processing datasets.
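If you want to verify the installation, a quick check is to import both packages and print their versions (the exact numbers will vary with your environment):

import transformers
import datasets

# Both libraries expose a version string; the values depend on what pip installed.
print(transformers.__version__)
print(datasets.__version__)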
Loading a Pre-trained Model
First, let's load a pre-trained model and its tokenizer. In this example, we will use the distilbert-base-uncased model, a smaller, faster distilled variant of BERT.
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
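At this point the classification head on top of DistilBERT is newly initialized (you will typically see a warning about newly initialized weights), which is expected: fine-tuning is what trains it. As a quick sanity check, you can inspect the loaded model; the parameter count in the comment below is approximate:

# num_labels defaults to 2, which matches the binary positive/negative IMDB task.
print(model.config.num_labels)
# DistilBERT plus the classification head has roughly 66-67 million parameters.
print(model.num_parameters())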
Preparing the Dataset
Next, let's load a dataset to fine-tune the model on. We will use the imdb dataset of movie reviews labeled as positive or negative, which is available in the Hugging Face datasets library.
from datasets import load_dataset
raw_datasets = load_dataset("imdb")
Now, we need to tokenize the dataset using the tokenizer we loaded earlier.
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
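The map call tokenizes every split, including the full 25,000-review train and test sets, so the training step later on can take a while on modest hardware. If you just want to try the workflow end to end, one option is to fine-tune on smaller shuffled subsets; the names small_train and small_eval below are our own, and if you use them you would pass them to the Trainer in the next section instead of the full splits:

# Optional: smaller subsets for a quicker experimental run; the sizes are arbitrary.
small_train = tokenized_datasets["train"].shuffle(seed=42).select(range(2000))
small_eval = tokenized_datasets["test"].shuffle(seed=42).select(range(500))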
Fine-tuning the Model
To fine-tune the model, we need to create a Trainer object and a training configuration.
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    "test_trainer",                  # output directory for checkpoints and logs
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)
Now, let's start the fine-tuning process.
trainer.train()
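Depending on your hardware, this can take anywhere from several minutes to a few hours. Once training finishes, it can be handy to save the fine-tuned weights and tokenizer so they can be reloaded later without retraining; the directory name below is just an example:

# Save the fine-tuned model and tokenizer to a local directory (name is arbitrary).
trainer.save_model("fine-tuned-distilbert-imdb")
tokenizer.save_pretrained("fine-tuned-distilbert-imdb")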
Evaluating the Model
After fine-tuning, let's evaluate the model's performance on the test dataset.
trainer.evaluate()
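By default, evaluate() reports the evaluation loss along with runtime statistics. If you also want accuracy, one common approach is to define a metrics function and pass it to the Trainer via its compute_metrics argument when you construct it (before training). The sketch below is a minimal version of such a function:

import numpy as np

def compute_metrics(eval_pred):
    # The Trainer passes the model's logits and the true labels for the eval set.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# e.g. Trainer(model=model, args=training_args, ..., compute_metrics=compute_metrics)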
Using the Fine-tuned Model
Now that we have fine-tuned our model, let's use it to make predictions.
import torch

model.eval()
text = "This movie was amazing!"
# Move the inputs to the same device as the model (important if training ran on a GPU).
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = torch.argmax(logits, dim=1)
print(predicted_label)
This prints the predicted class index for the given text as a tensor.
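For the IMDB labels, index 0 corresponds to a negative review and index 1 to a positive one, so a small mapping of our own (not something stored in the model unless you set id2label in its config) makes the result readable:

# Translate the class index into a human-readable sentiment label.
id2label = {0: "negative", 1: "positive"}
print(id2label[predicted_label.item()])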
Conclusion
In this tutorial, we learned how to fine-tune pre-trained models using the Hugging Face library and Python. Fine-tuning is an essential step for improving performance on your NLP tasks, and Hugging Face makes it simple and efficient. Happy fine-tuning!