A computer can't read letters the way you do. The very first thing it does is
cut your text into small pieces called “tokens”, and give each piece a number (an ID).
Words, word-pieces, or even single letters can be tokens.
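Here is a tiny sketch of that first step, for readers who like to see it in code. It is only a toy: the function names `chop` and `build_vocab` are made up for this example, and it splits on whole words and punctuation, which is simpler than what a real tokenizer does.

```python
import re

def chop(text):
    """Split lowercase text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Give each distinct token the next free ID, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

tokens = chop("The cat sat. The cat ran!")
vocab = build_vocab(tokens)
ids = [vocab[tok] for tok in tokens]

print(tokens)  # ['the', 'cat', 'sat', '.', 'the', 'cat', 'ran', '!']
print(ids)     # [0, 1, 2, 3, 0, 1, 4, 5]
```

Notice that the same token always gets the same ID: both copies of "the" become 0.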
Tokens
The small #number under each token is its ID: its slot in the model's dictionary
(its “vocabulary”). The ## (in word-piece mode) is a
different thing: it just means “join me to the piece before me”. Our chopper here is simple — a real
tokenizer learns smarter pieces from data.
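A word-piece chopper can be sketched in a few lines. This version is an assumption for illustration, not the code behind this page: the vocabulary `VOCAB` is hand-written (a real tokenizer learns it from data), and `word_pieces` uses a simple "take the longest piece that fits" rule.

```python
# Hand-written toy vocabulary. The list position of each piece is its ID.
VOCAB = ["[UNK]", "play", "jump", "un", "##ing", "##ed", "##s", "##er"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def word_pieces(word):
    """Split one word into pieces, always taking the longest piece that fits."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Not at the start of the word, so this piece joins the one before.
                candidate = "##" + candidate
            if candidate in TOKEN_TO_ID:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits here: give up on the whole word.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

print(word_pieces("playing"))  # ['play', '##ing']
print(word_pieces("players"))  # ['play', '##er', '##s']
print(word_pieces("cat"))      # ['[UNK]']
print([TOKEN_TO_ID[p] for p in word_pieces("playing")])  # [1, 4]
```

The word "cat" becomes `[UNK]` ("unknown") only because this toy vocabulary is so small. A real vocabulary has enough pieces to build almost any word.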
🤖 In a real LLM — does it really use ##?
Big models use word-pieces too, so they can build any word — even a new one — from smaller pieces.
But they mark the joins differently:
BERT-style tokenizers do use ## on a joined piece: play + ##ing.
GPT and Llama models do not use ##. Instead, they put a mark at the start of
each new word (a special symbol that stands for the space before it), and joined pieces get no mark.
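The two marking styles can be compared by gluing pieces back into text. This is a sketch with made-up helper names (`join_hash_style`, `join_space_style`), and it uses "▁" as a stand-in for the word-start symbol; real tokenizers differ in which symbol they use.

```python
def join_hash_style(pieces):
    """## style: a piece starting with ## is glued onto the piece before it."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)

SPACE_MARK = "▁"  # stand-in for the word-start symbol (an assumption for this sketch)

def join_space_style(pieces):
    """Word-start style: the mark stands for the space before a new word."""
    return "".join(pieces).replace(SPACE_MARK, " ").strip()

print(join_hash_style(["i", "like", "play", "##ing"]))     # i like playing
print(join_space_style(["▁i", "▁like", "▁play", "ing"]))   # i like playing
```

Both lists of pieces turn back into the same sentence.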
Either way, the mark only shows which pieces stick together. And real
models learn their pieces from huge amounts of text, so they split words more cleverly than our
simple chopper. The same tokenizer is used in training and when the model answers you.
As a rough rule for English text, 1 token is about ¾ of a word.
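The ¾ rule can be turned into a quick estimate. This is only a rough guess based on that rule of thumb; `estimate_tokens` is a made-up helper, and the real count depends on the tokenizer and the text.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text: 1 token is about 3/4 of a word

def estimate_tokens(text):
    """Guess the token count of a text from its word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)

print(estimate_tokens("the quick brown fox jumps over the lazy dog"))  # 12
```

Nine words come out as about 12 tokens, because some words are split into more than one piece.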