Best practices for scikit-learn machine learning, model development, evaluation, and deployment in Python
Scikit-learn Best Practices
Expert guidelines for scikit-learn development, focusing on machine learning workflows, model development, evaluation, and best practices.
Code Style and Structure
Write concise, technical responses with accurate Python examples
Prioritize reproducibility in machine learning workflows
Use functional programming for data pipelines
Use object-oriented programming for custom estimators
Prefer vectorized operations over explicit loops
Follow PEP 8 style guidelines
Machine Learning Workflow
Data Preparation
Always split data before any preprocessing: train/validation/test
Use train_test_split() with random_state for reproducibility
Stratify splits for imbalanced classification: stratify=y
Keep test set completely separate until final evaluation
Feature Engineering
Scale features appropriately for distance-based algorithms
Use StandardScaler for normally distributed features
Use MinMaxScaler for bounded features
Use RobustScaler for data with outliers
Encode categorical variables: OneHotEncoder, OrdinalEncoder, LabelEncoder
Handle missing values: SimpleImputer, KNNImputer
Pipelines
Always use Pipeline to chain preprocessing and modeling
Prevents data leakage by fitting transformers only on training data
Makes code cleaner and more reproducible
Enables easy deployment and serialization
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))
])
Column Transformers
Use ColumnTransformer for different preprocessing per feature type
Combine numeric and categorical preprocessing in single pipeline
Model Selection and Tuning
Cross-Validation
Use cross-validation for reliable performance estimates
cross_val_score() for quick evaluation
cross_validate() for multiple metrics
Use appropriate CV strategy:
KFold for regression
StratifiedKFold for classification
TimeSeriesSplit for temporal data
GroupKFold for grouped data
Hyperparameter Tuning
Use GridSearchCV for exhaustive search
Use RandomizedSearchCV for large parameter spaces
Always tune on training/validation data, never test data
Set n_jobs=-1 for parallel processing
Model Evaluation
Classification Metrics
Use appropriate metrics for your problem:
accuracy_score for balanced classes
precision_score, recall_score, f1_score for imbalanced
roc_auc_score for ranking ability
Use classification_report() for comprehensive overview
Examine confusion_matrix() for error analysis
Regression Metrics
mean_squared_error (MSE) for general use
mean_absolute_error (MAE) for interpretability
r2_score for explained variance
Evaluation Best Practices
Report confidence intervals, not just point estimates
Use multiple metrics to understand model behavior
Compare against meaningful baselines
Evaluate on held-out test set only once, at the end
Handling Imbalanced Data
Use stratified splitting and cross-validation
Consider class weights: class_weight='balanced'
Use appropriate metrics (F1, AUC-PR, not accuracy)
Adjust decision threshold based on business needs
Feature Selection
Use SelectKBest with statistical tests
Use RFE (Recursive Feature Elimination)
Use model-based selection: SelectFromModel
Examine feature importances from tree-based models
Model Persistence
Use joblib for saving and loading models
Save entire pipelines, not just models
Version control model artifacts
Document model metadata
Performance Optimization
Use n_jobs=-1 for parallel processing where available
Consider warm_start=True for iterative training
Use sparse matrices for high-dimensional sparse data
Consider incremental learning with partial_fit() for large data
Key Conventions
Import from submodules: from sklearn.ensemble import RandomForestClassifier
Set random_state for reproducibility
Use pipelines to prevent data leakage
Document model choices and hyperparametersdon't have the plugin yet? install it then click "run inline in claude" again.
by @mindrally