Cleaning words · The NLP Lab



✂️ Stemming

Chops common endings off a word with simple rules. Fast, but rough: the result is not always a real word. studies → studi, running → run.
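To see what "simple rules" can look like, here is a minimal sketch of a toy stemmer. The function name `toy_stem` and its handful of rules are made up for this illustration. Real stemmers (such as the Porter stemmer) use many more rules, so treat this as a rough picture of the idea, not the actual algorithm.

```python
def toy_stem(word):
    """Chop a few common English endings. A toy, not a real stemmer."""
    word = word.lower()

    # Rule 1: "-ies" becomes "-i" (this is how "studies" turns into the non-word "studi").
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "i"

    # Rule 2: drop "-ing" or "-ed", then undo a doubled consonant ("runn" -> "run").
    for ending in ("ing", "ed"):
        if word.endswith(ending) and len(word) > len(ending) + 2:
            stem = word[: -len(ending)]
            doubled = len(stem) >= 2 and stem[-1] == stem[-2]
            if doubled and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem

    # Rule 3: drop a plain plural "-s", but leave words like "glass" alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]

    return word


for w in ["studies", "running", "jumped", "cats", "glass"]:
    print(w, "->", toy_stem(w))
```

Notice that the rules never check whether the result is a real word. That is why stemming is fast, and also why "studies" ends up as "studi".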

📖 Lemmatization

Uses a dictionary to find the real base word, so the result is a real word. Smarter (and slower) than stemming. studies → study, better → good, children → child.
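The dictionary idea can be sketched in a few lines. The tiny `TOY_LEMMAS` table and the `toy_lemmatize` function below are invented for this example. A real lemmatizer relies on a large word list and usually needs to know whether a word is a noun, verb, or adjective, so this is only a simplified picture.

```python
# A made-up, five-entry dictionary. Real lemmatizers use far larger ones.
TOY_LEMMAS = {
    "studies": "study",
    "better": "good",
    "children": "child",
    "mice": "mouse",
    "went": "go",
}


def toy_lemmatize(word):
    """Look the word up; if it is not in the dictionary, return it unchanged."""
    key = word.lower()
    return TOY_LEMMAS.get(key, key)


for w in ["studies", "better", "children", "table"]:
    print(w, "->", toy_lemmatize(w))
```

A chopping rule could never turn "better" into "good" or "children" into "child", because those pairs do not share an ending to cut off. Only a lookup can connect them.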

🤖 In a real LLM

Surprise: big LLMs usually don't stem or remove stop-words! They split text into tokens and keep every piece (even "the" and "is"), because small words carry grammar and meaning the model needs. Stemming and stop-word removal are classic NLP tricks still used in search engines and simpler text tools, and they're a good way to understand how "find the root idea" works.
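A small experiment shows why throwing away small words can hurt. This sketch assumes a made-up stop-word list (`TOY_STOP_WORDS`) and a very simple word splitter. Real LLM tokenizers work on sub-word pieces, not whole words, so this only illustrates the point about lost meaning.

```python
import re

# A hypothetical, very short stop-word list for this demo.
TOY_STOP_WORDS = {"the", "is", "a", "an", "to", "be", "or", "not", "of", "and", "in", "on"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in TOY_STOP_WORDS]


sentence = "The cat is not on the mat"
print(tokenize(sentence))                     # every word kept
print(remove_stop_words(tokenize(sentence)))  # ['cat', 'mat']
print(remove_stop_words(tokenize("To be or not to be")))  # []
```

After cleaning, "The cat is not on the mat" and "The cat is on the mat" both become `['cat', 'mat']`, so the word "not" and its meaning are gone. That is fine for a search engine matching keywords, but not for a model that has to understand the sentence.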

🧹 Cleaning words — roots, stems & stop-words
