🧪 The AI Lab
Lab 2

🧹 Cleaning words: roots, stems & stop-words

Before a simple program tries to understand text, it often tidies it first: it drops the boring common words (stop-words) and shrinks each word to a root, so that play, plays, playing, and played all count as the same idea. There are two ways to find the root: a fast, crude one (stemming) and a smarter one (lemmatization).
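To make the "drop the boring common words" step concrete, here is a minimal Python sketch. The stop-word list and the helper names `tokenize` and `remove_stop_words` are made up for illustration; real tools ship much longer lists.

```python
import re

# A tiny, hand-picked stop-word list (an assumption for illustration;
# real lists usually contain a few hundred words).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "in", "and", "it"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


tokens = tokenize("The cat is playing in the garden")
print(remove_stop_words(tokens))  # ['cat', 'playing', 'garden']
```

Only the words that carry the topic survive; shrinking "playing" to its root is a separate step.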


✂️ Stemming. Chops common endings with simple rules. Fast but rough: it can produce a non-word. Examples: studies → studi, running → run.
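The rule-chopping idea can be sketched in a few lines of Python. This is a toy with four hand-written suffix rules, not the algorithm any particular library uses; real stemmers such as the Porter stemmer apply many more rules and conditions. The function name `stem` and the rule table are assumptions for illustration.

```python
# Toy suffix rules, tried in order: (ending to remove, text to put back).
# These four rules are illustrative only.
RULES = (("ies", "i"), ("ing", ""), ("ed", ""), ("s", ""))


def stem(word: str) -> str:
    """Chop one common ending off a word using simple rules."""
    word = word.lower()
    for suffix, replacement in RULES:
        # Only chop if at least three letters remain, so "is" stays "is".
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)] + replacement
            break
    # Undo a doubled final consonant left behind by the chop: "runn" -> "run".
    if len(word) >= 4 and word[-1] == word[-2] and word[-1] not in "aeioulsz":
        word = word[:-1]
    return word


print([stem(w) for w in ["play", "plays", "playing", "played"]])
# ['play', 'play', 'play', 'play']
print(stem("studies"), stem("running"))  # studi run
```

Note that "studi" is not a real word. That is fine for counting or matching, because every form of "study" that these rules handle ends up at the same string.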
📖 Lemmatization. Uses a dictionary to find the real base word. Smarter (and slower). Examples: studies → study, better → good, children → child.
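A dictionary lookup is the heart of lemmatization, and a minimal sketch makes the contrast with rule-chopping visible. The mini-dictionary `LEMMAS` and the function `lemmatize` below are hypothetical; real lemmatizers rely on large lexicons and usually also consider the word's part of speech.

```python
# Hypothetical mini-dictionary mapping word forms to their base form (lemma).
# A real lemmatizer would use a full lexicon instead of a handful of entries.
LEMMAS = {
    "studies": "study",
    "better": "good",
    "children": "child",
    "playing": "play",
    "played": "play",
    "plays": "play",
}


def lemmatize(word: str) -> str:
    """Look the word up; return it unchanged (lowercased) if it is unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)


print([lemmatize(w) for w in ["studies", "better", "children", "Tables"]])
# ['study', 'good', 'child', 'tables']
```

Irregular forms such as "better" and "children" cannot be reached by chopping endings, which is why a dictionary is needed. The price is that words missing from the dictionary, like "tables" here, pass through unchanged.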
🤖 In a real LLM. Surprise: big LLMs usually don't stem or remove stop-words! They keep every piece (even "the" and "is") as tokens, because small words carry grammar and meaning the model needs. Stemming and stop-word removal are classic NLP tricks still used in search engines and simpler text tools, and they're perfect for understanding how "find the root idea" works.
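To see why search tools still use these tricks, here is a small Python sketch of keyword search in which the query and the documents go through the same cleaning. Everything in it (the two sample documents, the stop-word list, the three suffix rules, and the names `normalize` and `search`) is invented for illustration.

```python
import re

STOP_WORDS = {"a", "the", "is", "in", "for", "of", "to"}  # illustrative only

# Two made-up documents to search.
DOCS = {
    "doc1": "The children played in the park",
    "doc2": "A guide to cooking pasta",
}


def normalize(text: str) -> set[str]:
    """Lowercase, drop stop-words, and crudely strip a few endings."""
    terms = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        terms.add(word)
    return terms


def search(query: str) -> list[str]:
    """Return ids of documents that share at least one cleaned term with the query."""
    wanted = normalize(query)
    return sorted(doc_id for doc_id, text in DOCS.items() if wanted & normalize(text))


print(search("playing"))  # ['doc1']
print(search("the"))      # []
```

The query "playing" finds the document that says "played" because both shrink to "play", while a query made only of stop-words matches nothing.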

