Item: scikit-learn-best-practices
Rating: 6.9
Author: Implexa

Best practices for scikit-learn machine learning, model development, evaluation, and deployment in Python

SKILL.md

Scikit-learn Best Practices

Expert guidelines for scikit-learn development, focusing on machine learning workflows, model development, evaluation, and best practices.

Code Style and Structure

Write concise, technical responses with accurate Python examples

Prioritize reproducibility in machine learning workflows

Use functional programming for data pipelines

Use object-oriented programming for custom estimators

Prefer vectorized operations over explicit loops

Follow PEP 8 style guidelines

Machine Learning Workflow

Data Preparation

Always split data before any preprocessing: train/validation/test

Use train_test_split() with random_state for reproducibility

Stratify splits for imbalanced classification: stratify=y

Keep test set completely separate until final evaluation

Feature Engineering

Scale features appropriately for distance-based algorithms

Use StandardScaler for normally distributed features

Use MinMaxScaler for bounded features

Use RobustScaler for data with outliers

Encode categorical variables: OneHotEncoder, OrdinalEncoder, LabelEncoder

Handle missing values: SimpleImputer, KNNImputer

Pipelines

Always use Pipeline to chain preprocessing and modeling

Prevents data leakage by fitting transformers only on training data

Makes code cleaner and more reproducible

Enables easy deployment and serialization

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

Column Transformers

Use ColumnTransformer for different preprocessing per feature type

Combine numeric and categorical preprocessing in single pipeline

Model Selection and Tuning

Cross-Validation

Use cross-validation for reliable performance estimates

cross_val_score() for quick evaluation

cross_validate() for multiple metrics

Use appropriate CV strategy:

KFold for regression

StratifiedKFold for classification

TimeSeriesSplit for temporal data

GroupKFold for grouped data

Hyperparameter Tuning

Use GridSearchCV for exhaustive search

Use RandomizedSearchCV for large parameter spaces

Always tune on training/validation data, never test data

Set n_jobs=-1 for parallel processing

Model Evaluation

Classification Metrics

Use appropriate metrics for your problem:

accuracy_score for balanced classes

precision_score, recall_score, f1_score for imbalanced

roc_auc_score for ranking ability

Use classification_report() for comprehensive overview

Examine confusion_matrix() for error analysis

Regression Metrics

mean_squared_error (MSE) for general use

mean_absolute_error (MAE) for interpretability

r2_score for explained variance

Evaluation Best Practices

Report confidence intervals, not just point estimates

Use multiple metrics to understand model behavior

Compare against meaningful baselines

Evaluate on held-out test set only once, at the end

Handling Imbalanced Data

Use stratified splitting and cross-validation

Consider class weights: class_weight='balanced'

Use appropriate metrics (F1, AUC-PR, not accuracy)

Adjust decision threshold based on business needs

Feature Selection

Use SelectKBest with statistical tests

Use RFE (Recursive Feature Elimination)

Use model-based selection: SelectFromModel

Examine feature importances from tree-based models

Model Persistence

Use joblib for saving and loading models

Save entire pipelines, not just models

Version control model artifacts

Document model metadata

Performance Optimization

Use n_jobs=-1 for parallel processing where available

Consider warm_start=True for iterative training

Use sparse matrices for high-dimensional sparse data

Consider incremental learning with partial_fit() for large data

Key Conventions

Import from submodules: from sklearn.ensemble import RandomForestClassifier

Set random_state for reproducibility

Use pipelines to prevent data leakage

Document model choices and hyperparameters

scikit-learn-best-practices

SKILL.md

related skills