Block 76: Pipelines: Preventing Data Leakage
Build sklearn Pipelines to combine preprocessing and model training safely.
Concepts
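- Pipeline([('step_name', transformer), ..., ('model', estimator)]): a list of (name, step) pairs. Every step before the last must be a transformer; the last step is usually the estimator.
- Fit the pipeline on the training set only, so preprocessing statistics (for example the scaler's mean and standard deviation) are never computed from test data.
- pipeline.predict() applies each fitted transformer in order, then calls the final estimator's predict(), so no manual preprocessing is needed at prediction time.
- make_pipeline(): shortcut that builds a Pipeline without explicit step names; each name is generated from the lowercased class name.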
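The two construction styles above can be compared in a minimal sketch. The step names "scaler" and "model" are illustrative choices, not required names.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit step names chosen by us ("scaler" and "model" are illustrative).
explicit = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# make_pipeline() derives each step name from the lowercased class name.
shortcut = make_pipeline(StandardScaler(), LogisticRegression())

explicit_names = [name for name, _ in explicit.steps]
shortcut_names = [name for name, _ in shortcut.steps]
print(explicit_names)  # ['scaler', 'model']
print(shortcut_names)  # ['standardscaler', 'logisticregression']
```

Explicit names tend to be easier to read when steps are later referenced by name, for example through pipeline.named_steps.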
Code Examples
See exercise below.
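As a starting point, here is a minimal sketch of the fit-on-train, evaluate-on-test pattern. The synthetic data from make_classification, the 75/25 split, and max_iter=1000 are assumptions made for illustration; substitute your own dataset and settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Split first, so the test rows never influence any fitted statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaler's mean/std and the model's weights from train only.
pipe.fit(X_train, y_train)

# score() and predict() reuse the train-fitted scaler on the test rows.
test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

After fitting, pipe.named_steps["scaler"].mean_ holds the per-feature means of X_train only, which is the property that prevents leakage.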
Exercise
- Build a Pipeline: StandardScaler + LogisticRegression. Fit on train, evaluate on test.
- Build a Pipeline for a dataset with both numeric and categorical features using ColumnTransformer.
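For the mixed-type case, one possible shape of the solution is sketched below. The tiny DataFrame, its column names (age, income, city, bought), and the choice of OneHotEncoder for the categorical column are all hypothetical, made up for illustration; treat this as a hint rather than the expected answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up dataset; the column names and values are hypothetical.
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 63, 47, 33, 52, 26, 38, 60],
    "income": [28, 52, 75, 61, 40, 82, 66, 45, 71, 31, 57, 79],
    "city": ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"],
    "bought": [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1],
})
numeric_cols = ["age", "income"]
categorical_cols = ["city"]

X = df[numeric_cols + categorical_cols]
y = df["bought"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Route each column group to its own transformer.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    # handle_unknown="ignore" encodes categories unseen in train as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Scaling statistics and the category vocabulary come from train rows only.
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
print(predictions)
```

Because the ColumnTransformer sits inside the Pipeline, both the scaler and the encoder are fitted during pipe.fit(X_train, y_train) and only applied, never refitted, at prediction time.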
Homework
Explain why fitting a scaler on the entire dataset (before splitting) is a form of data leakage. Thursday