The Cardinal Rules

Rule #1 — Never Leak Test Data Into Training. Fit ALL preprocessing (scalers, encoders, imputers) on TRAINING data only. Then transform both train and test with the fitted transformer. The sklearn Pipeline enforces this automatically.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

❌ Data leakage!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
# Scaler learned from ALL data, including the test rows!
✅ No leakage
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)  # fit on train only
X_test_sc = scaler.transform(X_test)        # reuse train statistics
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier())
])
pipe.fit(X_train, y_train)    # internally fits the scaler on X_train only
pipe.score(X_test, y_test)
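The pipeline matters most inside cross-validation: passing the pipe to cross_val_score re-fits the scaler on each fold's training portion, which a pre-scaled X can never do. A minimal sketch on synthetic data (dataset and model are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])

# cross_val_score clones and re-fits the WHOLE pipeline per fold,
# so the scaler only ever sees that fold's training rows
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```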

Reproducibility

# Set seeds EVERYWHERE
import numpy as np, random, os

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# sklearn
model = RandomForestClassifier(random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED)

# PyTorch
import torch
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # seeds all GPUs

# TensorFlow
import tensorflow as tf
tf.random.set_seed(SEED)
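The calls above can be collected into a single helper so no script forgets one. A minimal sketch (the `set_seed` name is ours; the torch/TensorFlow blocks are skipped when those libraries aren't installed):

```python
import os
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed every RNG in use; call once at the top of each script."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:  # optional: only if PyTorch is installed
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass
    try:  # optional: only if TensorFlow is installed
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass


set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)  # identical to `a`: same seed, same numbers
```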

Experiment Tracking

# Use MLflow or Weights & Biases
import mlflow

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("test_accuracy", accuracy)
    mlflow.log_metric("test_f1", f1)
    mlflow.sklearn.log_model(model, "model")

Feature Engineering Checklist

Evaluation Checklist

Hyperparameter Tuning

Don't tune on the test set! Use nested cross-validation for unbiased evaluation.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

outer_cv = StratifiedKFold(5)
inner_cv = StratifiedKFold(3)

param_grid = {"n_estimators": [100, 200], "max_depth": [5, None]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=inner_cv)

# Outer CV gives unbiased estimate of tuned model performance
scores = cross_val_score(grid, X, y, cv=outer_cv, scoring="roc_auc")
print(f"Nested CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")
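Nested CV only estimates performance; to obtain the model you actually deploy, run the inner search once more on the full training split. A sketch on synthetic data (dataset and grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=300, random_state=42)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=StratifiedKFold(3))

# Tune on ALL training data; the held-out test set stays untouched
grid.fit(X, y)
print(grid.best_params_)            # hyperparameters to deploy
final_model = grid.best_estimator_  # already refit on the full training data
```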

Data Leakage — The Silent Killer

Data leakage occurs when information from OUTSIDE the training set is used to build the model. The model appears amazing during development but fails catastrophically in production.

Scenario 1: Test-Time Leakage (Most Common)

The mistake: fitting the preprocessor on ALL data (train + test), then splitting.

Impact: Model shows 95% accuracy in development, 60% in production. Career-ending.
❌ Leakage
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
# Scaler learned from ALL data!
✅ Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)  # No leakage

Scenario 2: Target Leakage

Using information from AFTER the prediction time.

❌ Future information
# Predict loan default within 12 months
# 'last_payment_date' is AFTER origination!
df['days_since_last_payment'] = (
    df['last_payment_date']
    - df['origination_date']
).dt.days
✅ Known at prediction time
# Only features known at origination
df['credit_score'] = ...
df['income'] = ...
df['loan_age_months'] = (
    df['data_date']
    - df['origination_date']
).dt.days / 30

Real Case: Model trained to predict pneumonia risk with 98% accuracy → deployed at 52%. Root cause: oxygen saturation (measured in hospital AFTER diagnosis) was used as a feature.

Scenario 3: Proxy Variables

Using variables that encode the target rather than predict it.

❌ Lagging indicator
# Predict churn
features = ['support_tickets_open']
# Tickets open BECAUSE they're churning
# not BEFORE they churn (proxy!)
✅ Leading indicators
features = [
    'months_as_customer',
    'monthly_usage_hours',
    'feature_diversity',
    'sentiment_score'
]
# These LEAD to churn, not follow
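A practical guard against proxy features: compute every feature from events strictly BEFORE a snapshot date, so the model only sees what was knowable at prediction time. A sketch with illustrative column names and toy data:

```python
import pandas as pd

# Toy event log: one row per customer activity event
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-03-20",
        "2024-01-15", "2024-04-01"]),
    "usage_hours": [10, 8, 2, 5, 7],
})
snapshot = pd.Timestamp("2024-03-01")  # the prediction date

# Only events before the snapshot may contribute to features
past = events[events["event_date"] < snapshot]
features = past.groupby("customer_id")["usage_hours"].agg(
    monthly_usage_hours="mean")
print(features)
```

Events on or after the snapshot (like customer 1's March drop-off) are exactly the ones that leak the outcome, so they must be excluded.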

Scenario 4: Train-Test Contamination

Choosing features after seeing test performance (data snooping).

# ✅ CORRECT: Pre-register features before seeing test results
# 1. Choose features based on domain knowledge or train-only analysis
# 2. Train model
# 3. Evaluate on test ONCE
# 4. Report results (don't iterate based on test performance)

# Use cross-validation on TRAINING data only
gs = GridSearchCV(
    RandomForestClassifier(),
    {'max_depth': [5, 10, 15]},
    cv=5  # Cross-validate only on TRAIN
)
gs.fit(X_train, y_train)

# Final evaluation on test (ONCE)
test_score = gs.score(X_test, y_test)

How to Detect Leakage

# 1. Train-test gap
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
if train_score - test_score > 0.20:
    print("⚠️  Possible leakage! Large gap")

# 2. Surprisingly high performance
if test_score > 0.95:
    print("⚠️  Too good? Check for leakage")

# 3. Feature importance
importances = model.feature_importances_
top = importances.argmax()
if importances[top] > 0.5:
    print(f"⚠️  Single feature dominates: {feature_names[top]}")
    print("    Is this a proxy for the target?")

# 4. Time-based check
if (df['feature_date'] > df['target_date']).any():
    print("⚠️  Features computed AFTER target time!")
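One more probe worth running: shuffle the labels and retrain. If the model still scores well on permuted targets, something in the evaluation is leaking. sklearn ships this as permutation_test_score; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import permutation_test_score

X, y = make_classification(n_samples=200, random_state=42)

# Fits once on the real labels, then n_permutations times on shuffled ones
score, perm_scores, pvalue = permutation_test_score(
    RandomForestClassifier(random_state=42), X, y,
    cv=3, n_permutations=20, random_state=42)

print(f"True score: {score:.3f}, mean permuted: {perm_scores.mean():.3f}")
# Permuted scores should hover near chance (~0.5 for balanced binary);
# a high permuted score is a red flag for leakage in the CV setup
```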

Complete Safe Pipeline

from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

cv_scores = cross_val_score(
    pipe, X_train, y_train,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='roc_auc'
)
print(f"Cross-validated AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# Final: test set ONCE ONLY
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
# Pipeline ensures all preprocessing fit ONLY on training data
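Persisting the fitted pipeline as one object keeps preprocessing and model together, so serving code can never re-fit the scaler on production data. A minimal sketch using joblib on a synthetic pipeline (file name is illustrative):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=42)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", RandomForestClassifier(random_state=42))])
pipe.fit(X, y)

joblib.dump(pipe, "model_pipeline.joblib")     # save after training
loaded = joblib.load("model_pipeline.joblib")  # load in serving code
# The loaded object scales and predicts exactly like the original
preds = loaded.predict(X)
```

Load with the same library versions used at training time; pickled sklearn objects are not guaranteed compatible across versions.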

Leakage Checklist

Model Deployment Checklist