Fine-tune Pre-trained Models using Hugging Face and Python
In this tutorial, we will explore how to fine-tune pre-trained models from the Hugging Face library using Python. Fine-tuning pre-trained models is an essential step to improve the performance of your machine learning projects, especially when working with Natural Language Processing (NLP) tasks.
Table of Contents
- Introduction to Hugging Face
- Prerequisites
- Loading a Pre-trained Model
- Preparing the Dataset
- Fine-tuning the Model
- Evaluating the Model
- Using the Fine-tuned Model
- Conclusion
Introduction to Hugging Face
Hugging Face provides open-source libraries and a hub of pre-trained models and datasets for NLP. It offers state-of-the-art models such as BERT, GPT-2, and RoBERTa, which can be fine-tuned for specific tasks like sentiment analysis and text classification.
Prerequisites
Before we start, make sure you have Python 3.x installed on your machine. Next, install the following packages:
pip install transformers
pip install datasets
transformers is the Hugging Face library that provides the pre-trained models and training utilities, and datasets is the companion library for loading and processing datasets.
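If you want to verify the installation, a quick check is to import both packages and print their versions (the exact numbers will vary with your environment):

import transformers
import datasets

# Both libraries expose a version string; the values depend on what pip installed.
print(transformers.__version__)
print(datasets.__version__)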
Loading a Pre-trained Model
First, let's load a pre-trained model and its tokenizer. In this example, we will use the distilbert-base-uncased model, a smaller, faster distilled variant of BERT.
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
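At this point the classification head on top of DistilBERT is newly initialized (you will typically see a warning about newly initialized weights), which is expected: fine-tuning is what trains it. As a quick sanity check, you can inspect the loaded model; the parameter count in the comment below is approximate:

# num_labels defaults to 2, which matches the binary positive/negative IMDB task.
print(model.config.num_labels)
# DistilBERT plus the classification head has roughly 66-67 million parameters.
print(model.num_parameters())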
Preparing the Dataset
Next, let's load a dataset to fine-tune the model on. We will use the imdb dataset of movie reviews labeled as positive or negative, which is available in the Hugging Face datasets library.
from datasets import load_dataset
raw_datasets = load_dataset("imdb")
Now, we need to tokenize the dataset using the tokenizer we loaded earlier.
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
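The map call tokenizes every split, including the full 25,000-review train and test sets, so the training step later on can take a while on modest hardware. If you just want to try the workflow end to end, one option is to fine-tune on smaller shuffled subsets; the names small_train and small_eval below are our own, and if you use them you would pass them to the Trainer in the next section instead of the full splits:

# Optional: smaller subsets for a quicker experimental run; the sizes are arbitrary.
small_train = tokenized_datasets["train"].shuffle(seed=42).select(range(2000))
small_eval = tokenized_datasets["test"].shuffle(seed=42).select(range(500))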
Fine-tuning the Model
To fine-tune the model, we need to create a Trainer object and a training configuration.
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    "test_trainer",                  # output directory for checkpoints and logs
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)
Now, let's start the fine-tuning process.
trainer.train()
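Depending on your hardware, this can take anywhere from several minutes to a few hours. Once training finishes, it can be handy to save the fine-tuned weights and tokenizer so they can be reloaded later without retraining; the directory name below is just an example:

# Save the fine-tuned model and tokenizer to a local directory (name is arbitrary).
trainer.save_model("fine-tuned-distilbert-imdb")
tokenizer.save_pretrained("fine-tuned-distilbert-imdb")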
Evaluating the Model
After fine-tuning, let's evaluate the model's performance on the test dataset.
trainer.evaluate()
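By default, evaluate() reports the evaluation loss along with runtime statistics. If you also want accuracy, one common approach is to define a metrics function and pass it to the Trainer via its compute_metrics argument when you construct it (before training). The sketch below is a minimal version of such a function:

import numpy as np

def compute_metrics(eval_pred):
    # The Trainer passes the model's logits and the true labels for the eval set.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# e.g. Trainer(model=model, args=training_args, ..., compute_metrics=compute_metrics)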
Using the Fine-tuned Model
Now that we have fine-tuned our model, let's use it to make predictions.
import torch

model.eval()
text = "This movie was amazing!"
# Move the inputs to the same device as the model (important if training ran on a GPU).
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = torch.argmax(logits, dim=1)
print(predicted_label)
This prints the predicted class index for the given text as a tensor.
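For the IMDB labels, index 0 corresponds to a negative review and index 1 to a positive one, so a small mapping of our own (not something stored in the model unless you set id2label in its config) makes the result readable:

# Translate the class index into a human-readable sentiment label.
id2label = {0: "negative", 1: "positive"}
print(id2label[predicted_label.item()])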
Conclusion
In this tutorial, we learned how to fine-tune pre-trained models using the Hugging Face library and Python. Fine-tuning is an essential step for improving performance on your NLP tasks, and Hugging Face makes it simple and efficient. Happy fine-tuning!