Block 101: Text Preprocessing & Tokenization
Clean and tokenize raw text for NLP tasks.
Concepts
- Lowercasing, removing punctuation and numbers
- word_tokenize() and sent_tokenize() from NLTK
- Token frequency distribution: FreqDist
- Stopword removal with nltk.corpus.stopwords
Code Examples
See exercise below.
Exercise
- Tokenize a paragraph.
- Print the top 20 word frequencies before and after stopword removal.
- Write a function clean_text(text) that lowercases, removes punctuation, and strips stopwords.
Homework
Why do stopwords matter? Are there cases where you should NOT remove them?