🧪 The AI Lab
Lab 1

🔪 Tokenization — chopping words into pieces

A computer can't read letters the way you do. The very first thing it does is cut your text into small pieces called “tokens” and give each piece a number (an ID). A token can be a whole word, a piece of a word, or even a single letter.
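Here is a tiny sketch of that idea in Python: chop the text, then hand out one number per distinct piece. The splitting rule and the names `chop` and `build_vocab` are made up for this example. A real tokenizer works differently and comes with a fixed dictionary that was built ahead of time.

```python
import re

def chop(text):
    """Cut text into word tokens and punctuation tokens (a toy rule)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Give each distinct token an ID, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

tokens = chop("The cat sat. The cat ran!")
vocab = build_vocab(tokens)
ids = [vocab[tok] for tok in tokens]

print(tokens)  # ['the', 'cat', 'sat', '.', 'the', 'cat', 'ran', '!']
print(ids)     # [0, 1, 2, 3, 0, 1, 4, 5]
```

Notice that "the" and "cat" get the same ID both times they appear. The model only ever sees the list of numbers.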

Try it — type anything

The small #number under each token is its ID — its slot in the model's dictionary (its “vocabulary”). The ## (in word-piece mode) is a different thing: it just means “join me to the piece before me”. Our chopper here is simple — a real tokenizer learns smarter pieces from data.
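To see how ## pieces and IDs fit together, here is a rough sketch of a word-piece chopper. It assumes a tiny hand-written vocabulary (`VOCAB` is invented for this example) and uses a greedy "take the longest piece that fits" rule, similar in spirit to how WordPiece-style tokenizers split a word. It is not necessarily how the chopper on this page works.

```python
# A made-up mini vocabulary. A token's ID is its position in this list.
VOCAB = ["[UNK]", "play", "un", "##ing", "##ed", "##able", "##play", "token", "##s"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def word_pieces(word):
    """Split one word into pieces, always taking the longest piece that fits."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        found = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # not at the word start: mark as "join me"
            if piece in TOKEN_TO_ID:
                found = piece
                break
            end -= 1
        if found is None:
            return ["[UNK]"]  # no piece fits: give up on the whole word
        pieces.append(found)
        start = end
    return pieces

def join_pieces(pieces):
    """Undo the chopping: glue ## pieces to the piece before them."""
    text = ""
    for p in pieces:
        if p.startswith("##"):
            text += p[2:]
        else:
            text += (" " if text else "") + p
    return text

pieces = word_pieces("unplayable")
ids = [TOKEN_TO_ID[p] for p in pieces]
print(pieces)               # ['un', '##play', '##able']
print(ids)                  # [2, 6, 5]
print(join_pieces(pieces))  # unplayable
```

The ## is part of the token's spelling, while the ID is just where that token sits in the list. That is why "play" (ID 1) and "##play" (ID 6) are two different tokens here.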

🤖 In a real LLM: does it really use ##? Big models use word-pieces too, so they can build any word, even a brand-new one, from smaller pieces. Some tokenizers do write ## on a piece that continues a word. Many others mark the joins differently: they put a mark on the piece that starts a new word and leave the continuing pieces unmarked.
Either way, the mark only shows which pieces stick together. Real models learn their pieces from huge amounts of text, so they split words more cleverly than our simple chopper. The same tokenizer is used in training and when the model answers you. As a rough rule of thumb for English text, 1 token is about ¾ of a word.
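The sketch below shows why the style of the mark does not matter. It takes pieces marked with ## and rewrites them with a word-start mark instead. It uses ▁, the mark SentencePiece-style tokenizers use for the start of a word. The function names are invented for this example, and real tokenizers handle spaces and special characters with more care.

```python
def to_word_start_marks(pieces):
    """Swap '##' (continues a word) for '▁' (starts a word)."""
    out = []
    for p in pieces:
        if p.startswith("##"):
            out.append(p[2:])    # continuing piece: no mark needed
        else:
            out.append("▁" + p)  # first piece of a word: add the mark
    return out

def join_continuation(pieces):
    """Rebuild text from '##'-style pieces."""
    text = ""
    for p in pieces:
        if p.startswith("##"):
            text += p[2:]
        else:
            text += (" " if text else "") + p
    return text

def join_word_start(pieces):
    """Rebuild text from '▁'-style pieces: every mark becomes a space."""
    return "".join(pieces).replace("▁", " ").strip()

pieces = ["un", "##play", "##able", "token", "##s"]
marked = to_word_start_marks(pieces)

print(marked)                     # ['▁un', 'play', 'able', '▁token', 's']
print(join_continuation(pieces))  # unplayable tokens
print(join_word_start(marked))    # unplayable tokens
```

Both styles carry the same information, so both rebuild the same text.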

Practice 🎯
