Block 101: Text Preprocessing & Tokenization
Clean and tokenize raw text for NLP tasks.
Concepts
- Lowercasing, removing punctuation and numbers
- word_tokenize() and sent_tokenize() from NLTK
- Token frequency distribution: FreqDist
- Stopword removal with nltk.corpus.stopwords
Code Examples
See exercise below.
Exercise
- Tokenize a paragraph.
- Print the top 20 word frequencies before and after stopword removal.
- Write a function clean_text(text) that lowercases, removes punctuation, and strips stopwords.
Homework
Why do stopwords matter? Are there cases where you should NOT remove them?