Block 104: Text Classification with scikit-learn
Build a supervised text classifier.
Concepts
- Pipeline: TfidfVectorizer + classifier
- Naive Bayes, Logistic Regression for text
- Train/test split on text data
- classification_report and confusion matrix
Code Examples
See exercise below.
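As a starting point, the concepts above can be sketched end to end: a Pipeline that chains TfidfVectorizer with a classifier, a stratified train/test split on raw text, and evaluation with classification_report and a confusion matrix. This is a minimal sketch, not a reference solution. The toy messages and labels are invented for illustration, and the choices of test_size=0.25, random_state=0, and max_iter=1000 are assumptions rather than recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data, invented for illustration only.
texts = [
    "Win a free prize now, click here",
    "Congratulations, you won a cash reward",
    "Claim your free voucher today",
    "Urgent: your account has won a lottery",
    "Free entry to win tickets, text now",
    "Cheap loans approved, call now",
    "You have been selected for a free gift",
    "Limited offer, win cash instantly",
    "Are we still meeting for lunch today?",
    "Can you send me the report by Friday?",
    "I will call you when I get home",
    "Thanks for the help with the homework",
    "See you at the gym tomorrow morning",
    "Running late, be there in ten minutes",
    "Did you watch the game last night?",
    "Let me know when you land safely",
]
labels = ["spam"] * 8 + ["ham"] * 8

# Split the raw text first, so the vectorizer only ever sees training text.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits TF-IDF on the training fold, then fits the classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision, recall, and F1 on the held-out messages.
print(classification_report(y_test, y_pred, zero_division=0))

# Rows are true classes, columns are predicted classes, in the given order.
cm = confusion_matrix(y_test, y_pred, labels=["ham", "spam"])
print(cm)
```

Wrapping the vectorizer and the classifier in one Pipeline means the TF-IDF vocabulary is learned only from the training split, which avoids leaking test-set vocabulary into the model. With a dataset this small the reported scores are not meaningful; the point is the shape of the workflow.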
Exercise
Build a spam vs. ham classifier on a small labelled dataset (e.g., SMS spam corpus). Compare the test-set accuracy of Naive Bayes and Logistic Regression.
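One possible scaffold for the comparison is sketched below. It trains both models on the same split so that the accuracy figures are comparable. The helper name compare_classifiers, the toy messages, and the default parameters are all hypothetical; MultinomialNB is used here as the Naive Bayes variant, which is an assumption since the exercise does not name one. Replace the toy lists with the real corpus before drawing any conclusions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_classifiers(texts, labels, test_size=0.25, seed=0):
    """Hypothetical helper: return test accuracy per model on one shared split."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    candidates = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, model in candidates.items():
        # Each model gets its own freshly fitted TF-IDF vectorizer.
        pipeline = make_pipeline(TfidfVectorizer(), model)
        pipeline.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, pipeline.predict(X_test))
    return scores


# Hypothetical toy data, invented for illustration only.
toy_texts = [
    "Win a free prize now",
    "Claim your cash reward today",
    "Free voucher, click to win",
    "You won the lottery, call now",
    "Cheap loans approved instantly",
    "Selected for a free gift, reply now",
    "Limited offer, win tickets",
    "Urgent prize waiting, text back",
    "Lunch at noon tomorrow?",
    "Please send the report by Friday",
    "I will call you after work",
    "Thanks for helping with homework",
    "See you at the gym later",
    "Running late, there in ten minutes",
    "Did you watch the game?",
    "Let me know when you land",
]
toy_labels = ["spam"] * 8 + ["ham"] * 8

scores = compare_classifiers(toy_texts, toy_labels)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

A single split on a small dataset gives a noisy estimate, so differences of a few points between the two models may not be meaningful. Repeating the comparison with several values of seed, or using cross-validation, would give a steadier picture.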
Homework
Why is Naive Bayes surprisingly effective for text classification despite its naive assumption that words are conditionally independent given the class?