Week 8 • Wednesday

Block 76: Pipelines: Preventing Data Leakage

Build sklearn Pipelines to combine preprocessing and model training safely.

Concepts

Code Examples

See exercise below.
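As a starting point, here is a minimal sketch of the pattern this block is about. It assumes scikit-learn's bundled breast cancer dataset purely for illustration (the block does not prescribe a dataset), and the step names "scaler" and "clf" are arbitrary labels chosen here. The point it tries to show: the data is split first, and the scaler lives inside the Pipeline, so its mean and variance are learned from the training rows only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (an assumption; any numeric classification data works).
X, y = load_breast_cancer(return_X_y=True)

# Split BEFORE any preprocessing so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step names are arbitrary labels chosen for this sketch.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() fits the scaler on X_train only, then trains the classifier
# on the scaled training data.
pipe.fit(X_train, y_train)

# score() reuses the training-set statistics to scale X_test,
# then reports accuracy on the held-out rows.
test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the scaler is a step of the Pipeline, the same object can be passed to cross_val_score or GridSearchCV, and the scaler is then refit on the training portion of every fold.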

Exercise

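Build a Pipeline that chains StandardScaler and LogisticRegression. Fit it on the training split only, then evaluate it on the test split. Next, build a second Pipeline for a dataset with both numeric and categorical features, using ColumnTransformer to apply a different preprocessing step to each group of columns.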
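For the mixed-type part of the exercise, the sketch below may help as scaffolding. Everything about the data is invented for illustration: the column names (age, income, city, bought) and the values are hypothetical, and OneHotEncoder is one reasonable choice for the categorical column, not the only one. Substitute your own dataset and column lists.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data made up for this sketch; column names are hypothetical.
df = pd.DataFrame({
    "age":    [22, 35, 47, 51, 29, 63, 38, 44, 26, 57, 33, 49],
    "income": [28, 52, 61, 75, 40, 58, 47, 80, 31, 66, 45, 72],
    "city":   ["A", "B", "A", "C", "B", "C", "A", "B", "A", "C", "B", "C"],
    "bought": [0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X = df.drop(columns="bought")
y = df["bought"]

# Split first; all preprocessing is fitted later, inside the Pipeline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Route each column group to its own transformer.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Scaler statistics and one-hot categories are learned from X_train only.
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

With only twelve rows the predictions themselves mean little; the sketch is about the wiring, in particular that the encoder's category list and the scaler's statistics both come from the training split.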

Homework

Explain why fitting a scaler on the entire dataset (before splitting) is a form of data leakage.