The Cardinal Rules

Rule #1 — Never Leak Test Data Into Training. Fit ALL preprocessing (scalers, encoders, imputers) on TRAINING data only. Then transform both train and test with the fitted transformer. The sklearn Pipeline enforces this automatically.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

❌ Data leakage!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
# Scaler learned from ALL data, including the test rows!
✅ No leakage
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)  # fit on train only
X_test_sc = scaler.transform(X_test)        # reuse train statistics
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier())
])
pipe.fit(X_train, y_train)    # internally fits the scaler on X_train only
pipe.score(X_test, y_test)
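The pipeline matters most inside cross-validation: passing the pipe to cross_val_score re-fits the scaler on each fold's training portion, which a pre-scaled X can never do. A minimal sketch on synthetic data (dataset and model are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])

# cross_val_score clones and re-fits the WHOLE pipeline per fold,
# so the scaler only ever sees that fold's training rows
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```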

Reproducibility

# Set seeds EVERYWHERE
import numpy as np, random, os

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# sklearn
model = RandomForestClassifier(random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED)

# PyTorch
import torch
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # seeds all GPUs

# TensorFlow
import tensorflow as tf
tf.random.set_seed(SEED)
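The calls above can be collected into a single helper so no script forgets one. A minimal sketch (the `set_seed` name is ours; the torch/TensorFlow blocks are skipped when those libraries aren't installed):

```python
import os
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed every RNG in use; call once at the top of each script."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:  # optional: only if PyTorch is installed
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass
    try:  # optional: only if TensorFlow is installed
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass


set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)  # identical to `a`: same seed, same numbers
```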

Experiment Tracking

# Use MLflow or Weights & Biases
import mlflow

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("test_accuracy", accuracy)
    mlflow.log_metric("test_f1", f1)
    mlflow.sklearn.log_model(model, "model")

Feature Engineering Checklist

Evaluation Checklist

Hyperparameter Tuning

Don't tune on the test set! Use nested cross-validation for unbiased evaluation.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

outer_cv = StratifiedKFold(5)
inner_cv = StratifiedKFold(3)

param_grid = {"n_estimators": [100, 200], "max_depth": [5, None]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=inner_cv)

# Outer CV gives unbiased estimate of tuned model performance
scores = cross_val_score(grid, X, y, cv=outer_cv, scoring="roc_auc")
print(f"Nested CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")
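Nested CV only estimates performance; to obtain the model you actually deploy, run the inner search once more on the full training split. A sketch on synthetic data (dataset and grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=300, random_state=42)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=StratifiedKFold(3))

# Tune on ALL training data; the held-out test set stays untouched
grid.fit(X, y)
print(grid.best_params_)            # hyperparameters to deploy
final_model = grid.best_estimator_  # already refit on the full training data
```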

Data Leakage — The Silent Killer

Data leakage occurs when information from OUTSIDE the training set is used to build the model. The model appears amazing during development but fails catastrophically in production.

Scenario 1: Test-Time Leakage (Most Common)

The mistake: fitting the preprocessor on ALL data (train + test), then splitting.

Impact: Model shows 95% accuracy in development, 60% in production. Career-ending.
❌ Leakage
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
# Scaler learned from ALL data!
✅ Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)  # No leakage

Scenario 2: Target Leakage

Using information from AFTER the prediction time.

❌ Future information
# Predict loan default within 12 months
# 'last_payment_date' is AFTER origination!
df['days_since_last_payment'] = (
    df['last_payment_date']
    - df['origination_date']
).dt.days
✅ Known at prediction time
# Only features known at origination
df['credit_score'] = ...
df['income'] = ...
df['loan_age_months'] = (
    df['data_date']
    - df['origination_date']
).dt.days / 30

Real Case: Model trained to predict pneumonia risk with 98% accuracy → deployed at 52%. Root cause: oxygen saturation (measured in hospital AFTER diagnosis) was used as a feature.

Scenario 3: Proxy Variables

Using variables that encode the target rather than predict it.

❌ Lagging indicator
# Predict churn
features = ['support_tickets_open']
# Tickets open BECAUSE they're churning
# not BEFORE they churn (proxy!)
✅ Leading indicators
features = [
    'months_as_customer',
    'monthly_usage_hours',
    'feature_diversity',
    'sentiment_score'
]
# These LEAD to churn, not follow
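A practical guard against proxy features: compute every feature from events strictly BEFORE a snapshot date, so the model only sees what was knowable at prediction time. A sketch with illustrative column names and toy data:

```python
import pandas as pd

# Toy event log: one row per customer activity event
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-03-20",
        "2024-01-15", "2024-04-01"]),
    "usage_hours": [10, 8, 2, 5, 7],
})
snapshot = pd.Timestamp("2024-03-01")  # the prediction date

# Only events before the snapshot may contribute to features
past = events[events["event_date"] < snapshot]
features = past.groupby("customer_id")["usage_hours"].agg(
    monthly_usage_hours="mean")
print(features)
```

Events on or after the snapshot (like customer 1's March drop-off) are exactly the ones that leak the outcome, so they must be excluded.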

Scenario 4: Train-Test Contamination

Choosing features after seeing test performance (data snooping).

# ✅ CORRECT: Pre-register features before seeing test results
# 1. Choose features based on domain knowledge or train-only analysis
# 2. Train model
# 3. Evaluate on test ONCE
# 4. Report results (don't iterate based on test performance)

# Use cross-validation on TRAINING data only
gs = GridSearchCV(
    RandomForestClassifier(),
    {'max_depth': [5, 10, 15]},
    cv=5  # Cross-validate only on TRAIN
)
gs.fit(X_train, y_train)

# Final evaluation on test (ONCE)
test_score = gs.score(X_test, y_test)

How to Detect Leakage

# 1. Train-test gap
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
if train_score - test_score > 0.20:
    print("⚠️  Possible leakage! Large gap")

# 2. Surprisingly high performance
if test_score > 0.95:
    print("⚠️  Too good? Check for leakage")

# 3. Feature importance
importances = model.feature_importances_
top = importances.argmax()
if importances[top] > 0.5:
    print(f"⚠️  Single feature dominates: {feature_names[top]}")
    print("    Is this a proxy for the target?")

# 4. Time-based check
if (df['feature_date'] > df['target_date']).any():
    print("⚠️  Features computed AFTER target time!")
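One more probe worth running: shuffle the labels and retrain. If the model still scores well on permuted targets, something in the evaluation is leaking. sklearn ships this as permutation_test_score; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import permutation_test_score

X, y = make_classification(n_samples=200, random_state=42)

# Fits once on the real labels, then n_permutations times on shuffled ones
score, perm_scores, pvalue = permutation_test_score(
    RandomForestClassifier(random_state=42), X, y,
    cv=3, n_permutations=20, random_state=42)

print(f"True score: {score:.3f}, mean permuted: {perm_scores.mean():.3f}")
# Permuted scores should hover near chance (~0.5 for balanced binary);
# a high permuted score is a red flag for leakage in the CV setup
```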

Complete Safe Pipeline

from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

cv_scores = cross_val_score(
    pipe, X_train, y_train,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='roc_auc'
)
print(f"Cross-validated AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# Final: test set ONCE ONLY
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
# Pipeline ensures all preprocessing fit ONLY on training data
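Persisting the fitted pipeline as one object keeps preprocessing and model together, so serving code can never re-fit the scaler on production data. A minimal sketch using joblib on a synthetic pipeline (file name is illustrative):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=42)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", RandomForestClassifier(random_state=42))])
pipe.fit(X, y)

joblib.dump(pipe, "model_pipeline.joblib")     # save after training
loaded = joblib.load("model_pipeline.joblib")  # load in serving code
# The loaded object scales and predicts exactly like the original
preds = loaded.predict(X)
```

Load with the same library versions used at training time; pickled sklearn objects are not guaranteed compatible across versions.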

Leakage Checklist

Model Deployment Checklist