Tokenization · The NLP Lab


Tokens

The small #number under each token is its ID — its slot in the model's dictionary (its “vocabulary”). The ## (in word-piece mode) is a different thing: it just means “join me to the piece before me”. Our chopper here is simple — a real tokenizer learns smarter pieces from data.
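To make the two ideas above concrete, here is a minimal sketch of a simple chopper in Python. It is only an illustration: the tiny vocabulary, the ID numbers, and the function names chop_word and tokenize are all made up for this example, and the greedy "longest piece first" rule is one common way to do it, not necessarily how the demo on this page works.

```python
# Toy vocabulary: piece -> ID. Pieces and IDs are invented for this sketch.
VOCAB = {
    "[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3, "##er": 4,
    "walk": 5, "un": 6, "##play": 7, "##able": 8,
}


def chop_word(word, vocab=VOCAB):
    """Split one word into word-pieces, always taking the longest piece that fits."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining chunk first, then shrink it one letter at a time.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at the word start: mark the join
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # nothing fits: the whole word becomes "unknown"
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab=VOCAB):
    """Return (piece, ID) pairs for every word in the text."""
    pieces = []
    for word in text.lower().split():
        pieces.extend(chop_word(word, vocab))
    return [(piece, vocab[piece]) for piece in pieces]


print(tokenize("playing"))     # [('play', 1), ('##ing', 2)]
print(tokenize("unplayable"))  # [('un', 6), ('##play', 7), ('##able', 8)]
print(tokenize("zebra"))       # [('[UNK]', 0)]
```

Notice that the ID is just a lookup in the vocabulary, while ## is part of the piece itself. A word the toy vocabulary cannot build, like "zebra", falls back to a single unknown token.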

🤖 In a real LLM: does it really use ##? Big models use word-pieces too, so they can build any word, even a brand-new one, from smaller pieces. But not every model marks the joins the same way:

BERT-style tokenizers do use ## on a joined piece: play + ##ing.
GPT-style and Llama-style tokenizers do not use ##. Instead, they put a small mark (a special space symbol) on each piece that starts a new word, and pieces that continue a word get no mark.

Either way, the mark only shows which pieces stick together. And real models learn their pieces from huge amounts of text, so they split words more cleverly than our simple chopper. The same tokenizer is used in training and when the model answers you. As a rough rule of thumb for English text, 1 token is about ¾ of a word.
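The two marking styles can be compared by gluing pieces back into text. The sketch below is an illustration under assumptions: the function names are made up, the pieces are typed in by hand rather than produced by a real tokenizer, and the ▁ character stands in for the special space symbol (real tokenizers each pick their own symbol).

```python
def join_hash_style(pieces):
    """BERT-style: a piece starting with '##' is glued to the piece before it."""
    text = ""
    for piece in pieces:
        if piece.startswith("##"):
            text += piece[2:]      # drop the marker and glue on
        elif text:
            text += " " + piece    # a new word: add a space first
        else:
            text += piece          # very first piece: no leading space
    return text


SPACE_MARK = "▁"  # assumed stand-in for the special space symbol


def join_space_style(pieces):
    """Space-mark style: glue everything, then turn each mark back into a space."""
    return "".join(pieces).replace(SPACE_MARK, " ").strip()


print(join_hash_style(["she", "is", "play", "##ing"]))     # she is playing
print(join_space_style(["▁she", "▁is", "▁play", "ing"]))   # she is playing
```

Both calls give back the same sentence. The only difference is where the mark sits: on the piece that continues a word, or on the piece that starts one.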

🔪 Tokenization — chopping words into pieces

Practice 🎯