Before a simple program tries to understand text, it often tidies it: it drops the
boring common words (stop-words) and shrinks words to a root so that play, plays, playing, played
all count as the same idea. There are two ways to find the root: a fast, crude one (stemming)
and a smart one (lemmatization).
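Here is a minimal sketch of the "drop the boring common words" step. The stop-word list and the function name `remove_stop_words` are made up for illustration; real tools ship much longer lists.

```python
import re

# A tiny, hand-picked stop-word list (an assumption for illustration only).
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "on", "of", "and", "to"}


def remove_stop_words(text: str) -> list[str]:
    """Lowercase the text, split it into words, and drop stop-words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


print(remove_stop_words("The children are playing in the park"))
# ['children', 'playing', 'park']
```

Only the words that carry the topic survive; shrinking them to a root is a separate step.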
Stemming
Chops common endings with simple rules. Fast, but rough: it can make a non-word.
studies → studi, running → run.
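To make "chops common endings with simple rules" concrete, here is a toy stemmer. It is only a sketch with a handful of hand-written rules, loosely in the spirit of classic suffix-stripping stemmers, and not any library's actual algorithm. The name `crude_stem` and the exact rules are assumptions.

```python
# Letters that are left doubled after chopping an ending (vowels plus l, s, z).
KEEP_DOUBLED = set("aeioulsz")


def crude_stem(word: str) -> str:
    """Chop a few common English endings using fixed rules (toy example)."""
    word = word.lower()

    # "studies" -> "studi": the result does not have to be a real word.
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "i"

    for suffix in ("ing", "ed"):
        # Only chop if at least three letters would remain ("sing" stays "sing").
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in KEEP_DOUBLED:
                stem = stem[:-1]
            return stem

    # Plain plural "s": "plays" -> "play", but leave "glass" alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]

    return word


for w in ["studies", "running", "plays", "playing", "played"]:
    print(w, "->", crude_stem(w))
```

The rules never look anything up, which is why they are fast and why they happily produce "studi".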
Lemmatization
Uses a dictionary to find the real base word. Smarter (and slower).
studies → study, better → good, children → child.
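The dictionary idea can be sketched as a simple lookup. The table below is a hypothetical four-entry stand-in for a real dictionary, and `lemmatize` is an illustrative name; real lemmatizers use far larger word lists and usually also need to know the word's part of speech.

```python
# A toy lookup table standing in for a real dictionary (hypothetical data).
LEMMA_TABLE = {
    "studies": "study",
    "better": "good",
    "children": "child",
    "ran": "run",
}


def lemmatize(word: str) -> str:
    """Return the dictionary base form, or the word unchanged if unknown."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for w in ["studies", "better", "children", "table"]:
    print(w, "->", lemmatize(w))
```

Unlike rule-based chopping, a lookup can handle irregular words such as "better" and "children", but only if they are in the dictionary.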
In a real LLM
Surprise: big LLMs usually don't stem or remove stop-words! They keep every piece (even
"the" and "is") as tokens, because small words carry grammar and meaning the model needs. Stemming and
stop-word removal are classic NLP tricks still used in search engines and simpler text
tools, and they're perfect for understanding how "find the root idea" works.