Week 8 • Wednesday

Block 76: Pipelines: Preventing Data Leakage

Build sklearn Pipelines to combine preprocessing and model training safely.

Concepts

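Pipeline([('step_name', transformer), ..., ('model', estimator)]): chains named preprocessing steps and ends with a final estimator
Fit the pipeline on the training set only, so preprocessing statistics (e.g. the scaler's mean and variance) are never computed from test data
pipeline.predict() applies every fitted transformation to new data before calling the model
make_pipeline(): shortcut that builds the same object and names each step after its lowercased class name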
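
The four points above can be sketched in a few lines. This is a minimal illustration, not a reference solution: the breast cancer dataset, the 75/25 split, random_state=0, and max_iter=1000 are assumptions chosen only so the example runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data: any numeric classification dataset would do.
X, y = load_breast_cancer(return_X_y=True)

# Split BEFORE any preprocessing so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Named steps: every step but the last is a transformer, the last is the estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaler's mean/variance from the training rows only,
# then trains the model on the scaled training data.
pipe.fit(X_train, y_train)

# predict()/score() reuse the training-set statistics to scale the test rows.
y_pred = pipe.predict(X_test)
test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")

# make_pipeline() builds the same kind of object and names the steps itself.
auto_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
auto_step_names = [name for name, _ in auto_pipe.steps]
print(auto_step_names)
```

The fitted scaler can be inspected with pipe.named_steps["scaler"].mean_, which should match the column means of X_train rather than those of the full dataset.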

Code Examples

See exercise below.
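
For data with both numeric and categorical columns, a ColumnTransformer can serve as the preprocessing step of a Pipeline. The sketch below is one possible layout under stated assumptions: the DataFrame, its column names (age, income, city, bought), and its values are hypothetical toy data invented for illustration, and handle_unknown="ignore" is one choice among several for categories that appear only at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, made up for this example only.
df = pd.DataFrame({
    "age":    [22, 35, 47, 51, 29, 63, 41, 38, 56, 27, 33, 48],
    "income": [28, 52, 81, 90, 40, 75, 66, 58, 88, 35, 45, 79],
    "city":   ["A", "B", "A", "C", "B", "C", "A", "B", "C", "A", "B", "C"],
    "bought": [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1],
})
X = df[["age", "income", "city"]]
y = df["bought"]

# Split first; all preprocessing is learned later, inside the pipeline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Route each column group to its own transformer.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    # Categories unseen during fit are encoded as all zeros instead of raising.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

# Scaler statistics and the one-hot vocabulary come from X_train only.
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

With only twelve rows the accuracy figure is not meaningful; the point is the structure, in which the raw DataFrame goes in and no transformer is ever fitted outside the pipeline.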

Exercise

Build a Pipeline: StandardScaler + Logistic Regression. Fit on train, evaluate on test. Build a Pipeline for a dataset with both numeric and categorical features using ColumnTransformer.

Homework

Explain why fitting a scaler on the entire dataset (before splitting) is a form of data leakage.