Count Unique Tokens in Python using Tiktoken Library
Working with text data often requires you to count the unique tokens in a document or corpus. Tiktoken is OpenAI's fast byte-pair-encoding (BPE) tokenizer: it splits text into subword tokens, each represented by an integer ID, rather than into whole words or single characters. In this article, we will learn how to use Tiktoken to count unique tokens in Python.
Installing Tiktoken Library
To start, you need to install the Tiktoken library. You can do this using pip:
pip install tiktoken
Tokenizing Text with Tiktoken
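Before counting unique tokens, we need to tokenize the text. Tiktoken works with encodings: you load one with tiktoken.get_encoding() and call its encode() method, which returns a list of integer token IDs. To view the tokens as text, decode each ID individually. Here's a simple example:
import tiktoken

# Load a BPE encoding; cl100k_base is one of the encodings shipped with tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
text = "This is a sample sentence."
token_ids = encoding.encode(text)  # list of integer token IDs
# Decode each ID separately to see the piece of text it represents
tokens = [encoding.decode([token_id]) for token_id in token_ids]
print(tokens)
Output (with cl100k_base; other encodings may split the text differently, and the leading spaces are part of the tokens):
['This', ' is', ' a', ' sample', ' sentence', '.']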
Counting Unique Tokens
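Now that you know how to tokenize text with Tiktoken, let's move on to counting unique tokens. We will create a function that takes a text string as input and returns a Counter mapping each distinct token to the number of times it occurs. The number of unique tokens is the length of that Counter.
import tiktoken
from collections import Counter

# Load the encoding once and reuse it across calls
encoding = tiktoken.get_encoding("cl100k_base")

def count_unique_tokens(text):
    token_ids = encoding.encode(text)
    # Decode each ID separately so the counts are keyed by readable text
    tokens = [encoding.decode([token_id]) for token_id in token_ids]
    return Counter(tokens)

text = "This is a sample sentence. This is another sample sentence."
unique_tokens = count_unique_tokens(text)
print(unique_tokens)
print(len(unique_tokens))  # number of unique tokens
Output (with cl100k_base; other encodings may split the text differently):
Counter({' is': 2, ' sample': 2, ' sentence': 2, '.': 2, 'This': 1, ' a': 1, ' This': 1, ' another': 1})
8
The count_unique_tokens function tokenizes the input text and uses Python's Counter class to count how often each token occurs. Note that 'This' and ' This' are counted as different tokens, because a leading space is part of the token.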
Counting Unique Tokens in a File
To count unique tokens in a file, you can read the file content and pass it to the count_unique_tokens function. Here's an example:
def count_unique_tokens_in_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return count_unique_tokens(text)

file_path = 'sample.txt'
unique_tokens = count_unique_tokens_in_file(file_path)
print(unique_tokens)
Replace 'sample.txt' with the path to your text file.
Conclusion
In this article, we learned how to count unique tokens in strings and text files using Python's Tiktoken library. Tiktoken is a fast byte-pair-encoding tokenizer built for OpenAI models, so its counts reflect the subword tokens those models see rather than word counts. Combined with Python's Counter class, it lets you tokenize and count tokens in large text corpora efficiently.