Block 103: Bag-of-Words & TF-IDF
Convert text into numeric features for machine learning.
Concepts
- CountVectorizer: bag-of-words matrix
- TfidfVectorizer: term frequency-inverse document frequency
- Interpreting the feature matrix
- Sparse matrices and memory efficiency
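To make the concepts above concrete, here is a minimal pure-Python sketch of what a bag-of-words matrix and TF-IDF weights look like. The toy corpus, the whitespace tokenizer, and the textbook formula idf = log(N / df) are assumptions chosen for illustration. scikit-learn's vectorizers tokenize differently and, by default, use a smoothed idf plus row normalization, so their numbers will not match these.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]

# Naive tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: one matrix column per distinct word, in sorted order.
vocab = sorted({word for tokens in tokenized for word in tokens})

# Bag-of-words: each document becomes a row of raw word counts.
counts = [Counter(tokens) for tokens in tokenized]
bow_matrix = [[c[word] for word in vocab] for c in counts]

# Document frequency: in how many documents does each word appear?
n_docs = len(docs)
df = {word: sum(1 for c in counts if word in c) for word in vocab}

# Textbook (unsmoothed) inverse document frequency.
idf = {word: math.log(n_docs / df[word]) for word in vocab}

# TF-IDF: raw count scaled down for words that occur in many documents.
tfidf_matrix = [[c[word] * idf[word] for word in vocab] for c in counts]

# Sparse view: store only the nonzero entries of each row as {column: value}.
sparse_rows = [{j: v for j, v in enumerate(row) if v != 0} for row in bow_matrix]
nonzero = sum(len(row) for row in sparse_rows)
total = len(bow_matrix) * len(vocab)

print(vocab)
print(bow_matrix[0])
print(f"{nonzero} of {total} entries are nonzero")
```

Each row is a document and each column is a vocabulary word, so reading a row tells you which words that document contains and how often. Most entries are zero even in this tiny corpus, which is why real vectorizers return sparse matrices instead of dense arrays.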
Code Examples
See exercise below.
Exercise
Vectorize five example sentences with CountVectorizer and print the feature names and the resulting matrix. Then fit a TfidfVectorizer on two short documents about different topics and compare the highest-weighted words in each.
Homework
Why does TF-IDF often work better than raw word counts for classification? Explain intuitively.