clawhub · by @delock

Deepspeed Finetune

Fine-tune large language models using DeepSpeed on local or remote GPUs.

SkillRank score: 7.6 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

deepspeed-finetune guides users through hardware assessment, plan selection, and training launch for large language models using deepspeed optimization, with explicit vram budgeting, activation checkpointing tuning, and multi-gpu support.

structure: 8.0
trigger phrases: 6.0
procedure: 8.0
edge cases: 8.0
documentation: 8.0

original SKILL.md from clawhub

---
name: deepspeed-finetune
description: Fine-tune large language models using DeepSpeed on local or remote GPUs.
version: 1.0.5
metadata:
  openclaw:
    requires:
      bins:
        - python3
        - deepspeed
        - sshpass
    emoji: "⚡"
    homepage: https://github.com/deepspeedai/deepspeed-skills/tree/main/openclaw/deepspeed-finetune
---

# DeepSpeed Fine-tuning Skill

This skill enables efficient model fine-tuning using DeepSpeed with various optimization strategies.

## Prerequisites

- Python 3.8+
- GPU(s) or accelerator(s) with DeepSpeed-supported backend (CUDA, ROCm, Intel XPU, etc.)
- DeepSpeed: `pip install deepspeed`
- Transformers, Datasets, PEFT (for LoRA support)
- sshpass: `sudo apt-get install sshpass` (for remote training)

## Plan Selection Workflow

**Never auto-select a plan.** List viable options based on user hardware and requirements, and let the user decide.

### Step 1: Gather Information

Confirm the following with the user:
- **Target model**: Model name and parameter count (e.g., Qwen2.5-7B)
- **Hardware environment**:
  - GPU VRAM x count (e.g., "single 24GB GPU")
  - CPU core count
  - RAM size
  - Free disk space
  - NVMe SSD availability (affects ZeRO NVMe offload)
- **Training goal**: Full fine-tuning or parameter-efficient? Dataset size? Expected quality?
- **Budget/time constraints**: Acceptable training duration?

If the user only provides an SSH or remote machine address, connect first and auto-detect hardware (`nvidia-smi`, `free -h`, `df -h`, `nproc`).

### Step 2: Evaluate Feasibility

Estimate VRAM requirements based on model size (bf16):

| Params | Model Weights (bf16) | + Adam Optimizer + Gradients |
|--------|---------------------|----------------------------|
| 0.5B | ~1 GB | ~5 GB |
| 1.5B | ~3 GB | ~15 GB |
| 3B | ~6 GB | ~30 GB |
| 7B | ~14 GB | ~70 GB |
| 14B | ~28 GB | ~140 GB |
| 32B | ~64 GB | ~320 GB |
| 72B | ~144 GB | ~720 GB |

**Breakdown**: Adam optimizer stores 2 fp32 state tensors (momentum + variance) = 8 bytes/param. Gradients = 2 bytes/param (bf16). Total approx. 10 bytes/param (5x model weight size).

**Activation memory**: Depends on sequence length and batch size, not model params alone.
- Formula: `activation approx. 34 x seq_len x hidden_size x batch_size x bytes_per_element`
- Example: 7B model (hidden=4096, 32 layers), seq_len=2048, batch_size=4, bf16 -> 34 x 2048 x 4096 x 4 x 2 bytes = ~2.3 GB per layer; ~73 GB total (can dominate VRAM)
- Gradient checkpointing reduces this by ~80% (recomputes instead of storing), but adds ~20% compute overhead
- **Rule of thumb**: if seq_len x batch_size > 8192, activation memory likely exceeds model weights

**LoRA/QLoRA**: VRAM depends on rank, target modules, and layer dimensions — not directly proportional to total model params. See [references/lora_guide.md](references/lora_guide.md) for LoRA-specific memory estimation.

### Step 2.5: Activation Checkpointing

If VRAM is tight, activation checkpointing is the most impactful knob — it can reduce activation memory by ~80%.

**How it works**: Instead of storing all intermediate activations for backprop, only save checkpoints at select layers. Remaining activations are recomputed during backward pass. Trades compute for memory.

**Two ways to enable:**

1. **HF Trainer flag** (simplest, works out of the box):
```bash
python scripts/ds_train.py --gradient_checkpointing ...
```

2. **DeepSpeed config** (fine-grained control):
```json
{
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4
  }
}
```

| Option | Effect | When to use |
|--------|--------|-------------|
| `partition_activations` | Shard checkpoints across model-parallel GPUs | Multi-GPU with model parallelism |
| `cpu_checkpointing` | Store checkpoints in CPU RAM instead of GPU | GPU memory very tight |
| `contiguous_memory_optimization` | Reduce memory fragmentation | Large models, many checkpoints |
| `number_checkpoints` | Control checkpoint frequency (fewer = less VRAM, more compute) | Tune based on VRAM budget |

### Step 3: List Options

Based on the VRAM assessment, list all viable approaches. Example:

```
Based on your hardware (single 24GB GPU, 64GB RAM, 500GB disk),
Qwen2.5-7B has these training options:

Option A: LoRA Fine-tuning (Recommended)
  - VRAM needed: ~22 GB
  - Speed: Fast
  - Quality: Good for instruction alignment, style adaptation
  - Trainable params: ~20M (0.4% of total)

Option B: QLoRA Fine-tuning (Saves VRAM)
  - VRAM needed: ~12 GB
  - Speed: Medium (quantization/dequantization overhead)
  - Quality: Slightly below LoRA, but gap is small

Option C: Full Fine-tuning (Not feasible)
  - VRAM needed: ~56 GB (exceeds 24GB)
  - Requires ZeRO-2 + CPU offload, or larger GPU

Which option do you prefer?
```

### Step 4: Hardware Insufficient? Make Recommendations

If no plan is viable on current hardware, recommend specs using generic hardware metrics (no brand names):

```
You want to fully fine-tune a 7B model, but current hardware (single 24GB GPU) is insufficient.
Recommended hardware specs:

Minimum:
  - GPU: single 80GB VRAM
  - CPU: 16+ cores
  - RAM: 128 GB+
  - Disk: 200 GB+ free space

Recommended:
  - GPU: 2x 80GB VRAM (ZeRO-2 doubles training speed)
  - CPU: 32+ cores
  - RAM: 256 GB+
  - Disk: 500 GB+ free space

Alternatively, use LoRA — 24GB VRAM is sufficient for 7B models.
```

### Key Principles

- **Never auto-select and start training** — always list options and wait for user confirmation
- **Recommend but don't decide** — say "I recommend Option A because..." but let the user choose
- **Use generic hardware metrics** — VRAM in GB, GPU count, CPU cores, RAM in GB, disk in GB. No brand names.
- **Leave VRAM headroom** — recommend at least 20% buffer to avoid OOM
- **If user picks an infeasible option, warn them clearly** rather than silently switching

## Core Capabilities

### 1. Training Configuration

Generate DeepSpeed ZeRO configurations:

```python
from scripts.generate_ds_config import generate_zero_config

# ZeRO Stage 3 with optimizer offloading to NVMe
# (DeepSpeed supports NVMe offload only with stage 3; stages 1 and 2 can offload to "cpu")
config = generate_zero_config(
    zero_stage=3,
    offload_optimizer=True,
    offload_device="nvme",
    nvme_path="/local_nvme"
)
```

### 2. Training Launch

Use the training launcher script:

```bash
python scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_path data/my_dataset \
  --output_dir ./outputs \
  --deepspeed assets/ds_config_zero2.json \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --learning_rate 2e-5 \
  --lora_r 16 \
  --lora_alpha 32
```

### 3. LoRA/QLoRA Integration

For parameter-efficient fine-tuning:

```python
# LoRA config is auto-generated based on arguments
peft_config = {
    "peft_type": "LORA",
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM"
}
```

### 4. Multi-GPU Training

Use the `deepspeed` launcher for multi-GPU training (recommended over `torchrun`):

```bash
# Multi-GPU on single node
deepspeed --num_gpus=4 scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --deepspeed assets/ds_config_zero3.json \
  ...

# Multi-node
deepspeed --hostfile hosts.txt scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --deepspeed assets/ds_config_zero3.json \
  ...
```

### 5. Training Monitoring

Monitor training progress:

```python
from scripts.monitor_training import TrainingMonitor

monitor = TrainingMonitor(log_dir="./outputs")
monitor.plot_loss()
monitor.get_latest_checkpoint()
```

### 6. Early Stopping

Automatically monitors eval loss and stops training early when there's no improvement across consecutive evaluations, then loads the best checkpoint.

**Parameters:**
- `--early_stopping_patience` — How many consecutive evals without improvement to tolerate. Set to 0 to disable (default). Recommended: 3-10.
- `--early_stopping_threshold` — Minimum eval loss improvement to count as an improvement. Default 0.0 (any decrease counts).

**Example:**

```bash
python scripts/ds_train.py \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_path tatsu-lab/alpaca \
  --use_peft True \
  --early_stopping_patience 5 \
  --early_stopping_threshold 0.001 \
  --eval_strategy steps \
  --eval_steps 100 \
  --num_train_epochs 3 \
  ...
```

**Auto-configuration:** When `early_stopping_patience > 0`, the script automatically:
1. Enables `load_best_model_at_end=True`
2. Sets `metric_for_best_model=eval_loss`, `greater_is_better=False`
3. Aligns `save_strategy` with `eval_strategy` (synced saving is needed to restore best checkpoint)

**Notes:**
- Must also set `eval_strategy` (e.g., `steps` + `eval_steps`), otherwise early stopping won't work
- Don't set `patience` too low (<3) — early training fluctuations may cause premature stopping
- For LoRA fine-tuning, `patience=5` with `eval_steps=100` typically works well

## Remote Training

When training needs to run on a remote GPU server, see [references/remote_training.md](references/remote_training.md) for the complete guide including agent guidelines, security model, and command reference.

## Troubleshooting

### OOM Errors
- Reduce batch size or increase gradient accumulation steps
- Enable gradient checkpointing: `--gradient_checkpointing`
- Use ZeRO-3 with CPU/NVMe offloading
- Reduce LoRA rank: `--lora_r 8`
- See [references/troubleshooting.md](references/troubleshooting.md) for detailed solutions

### Slow Training
- Ensure bf16/fp16 is enabled
- Check GPU utilization with `nvidia-smi`
- Use FlashAttention if available
- Optimize data loading with `--dataloader_num_workers`
- See [references/troubleshooting.md](references/troubleshooting.md) for detailed solutions

### Checkpoint Issues
- Use `--save_strategy steps` with `--save_steps`
- Enable `--save_total_limit` to cap checkpoint count
- For ZeRO-3, use `--zero3_save_16bit_model` to save FP16 weights
- See [references/troubleshooting.md](references/troubleshooting.md) for detailed solutions

### MPI Errors (multi-GPU only)
- Single-GPU training does **not** need MPI
- If you see MPI errors on single GPU, use `python3` directly instead of `deepspeed` launcher
- See [references/troubleshooting.md](references/troubleshooting.md#mpi-errors) for full MPI debugging guide

### Single-GPU Strategy
- See [references/single_gpu_strategy.md](references/single_gpu_strategy.md) for strategy selection, CPU/NVMe offload examples, and decision principles

## References

- **[Quick Start Guide](references/quick_start.md)** — Common training patterns and full examples
- **[DeepSpeed Guide](references/deepspeed_guide.md)** — DeepSpeed documentation and configuration reference
- **[LoRA/PEFT Best Practices](references/lora_guide.md)** — LoRA/QLoRA parameter tuning guide
- **[ZeRO Optimization Guide](references/zero_optimization.md)** — ZeRO stage comparison and optimization tips
- **[Single-GPU Strategy](references/single_gpu_strategy.md)** — Strategy selection for single-GPU training
- **[Remote Training Guide](references/remote_training.md)** — Remote training via SSH, agent guidelines, and security model
- **[Troubleshooting](references/troubleshooting.md)** — Common errors and solutions (OOM, NaN loss, MPI, NCCL, etc.)


DeepSpeed Fine-tuning Skill

intent

fine-tune large language models using deepspeed with hardware-aware optimization strategies. use this skill when you need to adapt a pretrained model to your data, domain, or style, and want to maximize training efficiency across single or multi-gpu setups. the skill walks you through hardware assessment, plan selection (lora, qlora, full fine-tuning, or zero-stage strategies), and safe launch procedures that avoid silent oom or performance cliffs.

inputs

required software:

python 3.8+
deepspeed: pip install deepspeed
transformers: pip install transformers
datasets: pip install datasets
peft (for lora/qlora): pip install peft
sshpass (for remote training): apt-get install sshpass or brew install sshpass

required hardware:

at least one gpu with deepspeed-supported backend: cuda (nvidia), rocm (amd), intel xpu, or cpu fallback (very slow)
for remote training: ssh access to target machine with passwordless ssh or sshpass credentials

required data and models:

target model name or path (e.g., meta-llama/Llama-2-7b-hf, Qwen/Qwen2.5-7B)
training dataset: huggingface dataset identifier or local file path (jsonl, csv, parquet, or arrow format)
optional: evaluation dataset for early stopping and validation metrics

environment variables (optional but recommended):

HF_TOKEN: hf hub token for gated models (set via huggingface-cli login or export HF_TOKEN=...)
DEEPSPEED_OFFLOAD_PATH: path for zero-offload nvme staging (defaults to /tmp)
CUDA_VISIBLE_DEVICES: limit gpus visible to training (e.g., 0,1 for first two gpus)

external connections:

huggingface model hub: requires internet access to download model weights and tokenizers
huggingface datasets: requires internet access (or local cache at ~/.cache/huggingface)

remote training setup (if applicable):

ssh host: user@hostname or ip:port
ssh credentials: password or key-based auth (key recommended for security)
target machine must have: python 3.8+, cuda toolkit or rocm, deepspeed, transformers installed

procedure

step 1: gather hardware and training context

confirm with the user (or auto-detect via ssh if remote):

target model: name and parameter count (e.g., "Qwen2.5-7B", "Llama-2-70B")
hardware environment:
- total gpu vram in gb (e.g., "single 24gb gpu" or "4x 80gb gpus")
- gpu count
- cpu core count
- total system ram in gb
- free disk space in gb
- nvme ssd available: yes/no (affects zero offload strategy)
training goal: full fine-tuning, lora, or qlora? dataset size in samples? quality expectations?
time budget: acceptable training duration in hours or days?
inference needs: after training, will you run inference on same hardware or different hardware?

input: user responses or remote ssh connection

output: structured hardware profile and training goals documented

if user provides only ssh address, auto-detect via remote commands:

gpu: nvidia-smi --query-gpu=memory.total --format=csv,noheader
cpu: nproc (logical cores)
ram: free -h (system memory)
disk: df -h / (free space in training directory)
nvme: lsblk | grep nvme (check for nvme drives)
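
a minimal sketch of the gpu detection step, assuming the nvidia-smi query above prints one line per gpu such as "24576 MiB". the helper names (parse_gpu_memory_mib, detect_gpus) are hypothetical and not part of the skill's scripts; for a remote host the same command would be run over ssh and its output passed to the parser.

```python
import subprocess
from typing import List


def parse_gpu_memory_mib(smi_output: str) -> List[int]:
    """Parse `nvidia-smi --query-gpu=memory.total --format=csv,noheader` output."""
    totals = []
    for line in smi_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is expected to look like "24576 MiB"; keep the integer part.
        totals.append(int(line.split()[0]))
    return totals


def detect_gpus() -> List[int]:
    """Run nvidia-smi locally; return [] if the tool is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_memory_mib(result.stdout)


# Parsing a captured sample instead of calling the tool, so this runs anywhere.
sample_output = "24576 MiB\n24576 MiB\n"
gpus = parse_gpu_memory_mib(sample_output)
print(f"{len(gpus)} GPU(s), {sum(gpus) / 1024:.0f} GiB total")
```

an empty list from detect_gpus() means no nvidia gpu was found locally, in which case the user should be asked for the hardware details directly.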

step 2: estimate vram requirements and feasibility

use the memory breakdown table below to compute worst-case vram needs. all estimates assume bf16 dtype (mixed precision).

model weight memory (bf16):

params	model weights (bf16)
0.5B	~1 GB
1.5B	~3 GB
3B	~6 GB
7B	~14 GB
14B	~28 GB
32B	~64 GB
72B	~144 GB

optimizer state memory (full fine-tuning only):

adam optimizer stores 2 fp32 tensors per parameter (momentum + variance) = 8 bytes/param
gradients in bf16 = 2 bytes/param
total for adam: ~10 bytes/param = ~5x model weight size
example: 7B model = ~70 gb for optimizer states and gradients, on top of ~14 gb of weights (~84 gb before activations)
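
the per-parameter byte counts above can be turned into a quick estimate. this is a sketch under the stated assumptions only (bf16 weights and gradients, two fp32 adam states); it ignores activations and any fp32 master copy of the weights that some mixed-precision setups keep. the function name is illustrative.

```python
from typing import Dict

BYTES_WEIGHTS_BF16 = 2  # bf16 model weights
BYTES_GRADS_BF16 = 2    # bf16 gradients
BYTES_ADAM_FP32 = 8     # momentum + variance, 4 bytes each


def full_finetune_memory_gb(params_billion: float) -> Dict[str, float]:
    """Model-state memory in GB (1e9 bytes) for full fine-tuning with Adam."""
    weights = params_billion * BYTES_WEIGHTS_BF16
    optimizer_and_grads = params_billion * (BYTES_ADAM_FP32 + BYTES_GRADS_BF16)
    return {
        "weights_gb": weights,
        "optimizer_and_grads_gb": optimizer_and_grads,
        "total_gb": weights + optimizer_and_grads,
    }


for size in (0.5, 7, 72):
    est = full_finetune_memory_gb(size)
    print(f"{size}B: weights {est['weights_gb']:.0f} GB, "
          f"optimizer+grads {est['optimizer_and_grads_gb']:.0f} GB, "
          f"total {est['total_gb']:.0f} GB")
```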

activation memory (sequence length and batch size dependent, not just model size):

formula: activation memory ~= 34 x sequence_length x hidden_size x batch_size x bytes_per_element
example: 7B model (hidden_size=4096), seq_len=2048, batch_size=4, bf16 = 34 x 2048 x 4096 x 4 x 2 bytes = ~2.3 gb per layer, ~73 gb total across 32 layers
rule of thumb: if (seq_len x batch_size) > 8192, activation memory likely exceeds model weights
gradient checkpointing (see step 2.5) reduces activation memory by ~80% at cost of ~20% compute overhead
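
a small sketch that evaluates the activation formula literally for the example values. the 32-layer count is an assumption for a llama-style 7b model, and the 80% checkpointing saving is the rough figure quoted above, so treat the results as order-of-magnitude estimates.

```python
def activation_bytes_per_layer(seq_len: int, hidden_size: int,
                               batch_size: int, bytes_per_element: int = 2) -> int:
    """Activation memory per transformer layer, using the formula as stated."""
    return 34 * seq_len * hidden_size * batch_size * bytes_per_element


NUM_LAYERS = 32               # assumption: Llama-style 7B model
CHECKPOINTING_SAVING = 0.80   # rough figure from the text, not a measurement

per_layer = activation_bytes_per_layer(seq_len=2048, hidden_size=4096, batch_size=4)
total = per_layer * NUM_LAYERS
with_checkpointing = total * (1 - CHECKPOINTING_SAVING)

print(f"per layer:          {per_layer / 1e9:.2f} GB")
print(f"all layers:         {total / 1e9:.1f} GB")
print(f"with checkpointing: {with_checkpointing / 1e9:.1f} GB")
```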

lora/qlora vram (parameter-efficient):

vram depends on rank, target module count, hidden dimensions, not total model params
rough estimate: lora needs model_weights + adapter params (rank x (in_features + out_features) per target module, per layer) + activations
example: 7B model with rank=16, 4 target modules (q,v,k,o), hidden=4096, 32 layers = 16 x 8192 x 4 x 32 = ~16.8M adapter params (~34mb in bf16); ~14gb (model) + ~34mb (lora params) + ~15gb (activations) = ~29gb
lora params overhead is negligible; activation memory dominates
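
to make the "negligible overhead" point concrete, here is a sketch that counts lora adapter parameters. it assumes each target module is a square hidden x hidden projection and that the model has 32 layers (both true for a llama-2-7b-style model, but assumptions for anything else); the function name is illustrative.

```python
def lora_param_count(rank: int, in_features: int, out_features: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


HIDDEN = 4096
NUM_LAYERS = 32          # assumption: Llama-style 7B model
MODULES_PER_LAYER = 4    # q_proj, k_proj, v_proj, o_proj

per_module = lora_param_count(rank=16, in_features=HIDDEN, out_features=HIDDEN)
total_params = per_module * MODULES_PER_LAYER * NUM_LAYERS
adapter_mb_bf16 = total_params * 2 / 1e6

print(f"adapter params: {total_params:,}")
print(f"adapter weights in bf16: {adapter_mb_bf16:.1f} MB")
```

the adapter weights come to tens of megabytes, which is why the frozen base weights and the activations decide whether a lora run fits.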

input: model name, gpu vram, sequence length, batch size

output: feasibility assessment with vram breakdown; list which strategies (lora, qlora, full, zero-x) fit within gpu vram budget with 20% headroom
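
the headroom check can be written as a one-line comparison. this is a sketch with illustrative numbers; the function name and the example requirements are hypothetical, and the 20% default mirrors the headroom mentioned above.

```python
def fits_with_headroom(required_gb: float, gpu_vram_gb: float,
                       headroom: float = 0.20) -> bool:
    """True if the estimate fits while leaving `headroom` of VRAM unused."""
    usable_gb = gpu_vram_gb * (1 - headroom)
    return required_gb <= usable_gb


gpu_vram_gb = 24
for name, required_gb in [("small run", 12), ("large run", 56)]:
    verdict = "fits" if fits_with_headroom(required_gb, gpu_vram_gb) else "does not fit"
    print(f"{name}: needs {required_gb} GB -> {verdict} on {gpu_vram_gb} GB")
```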

step 2.5: activate gradient checkpointing if vram is tight

if vram is tight (estimated usage > 70% of gpu vram), enable activation checkpointing. this trades compute (~20% slower) for memory (~80% reduction in activation memory).

two enable methods:

via trainer flag (simplest):

python scripts/ds_train.py --gradient_checkpointing True ...

via deepspeed config json (fine-grained control):

{
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4
  }
}

activation checkpointing options:

option	effect	when to use
`partition_activations`	shard checkpoints across model-parallel gpus	multi-gpu setup with model (tensor) parallelism
`cpu_checkpointing`	store checkpoints in cpu ram instead of gpu	gpu memory critically tight; cpu ram abundant
`contiguous_memory_optimization`	reduce memory fragmentation	large models (>30b params), many checkpoints
`number_checkpoints`	how many layers to checkpoint (fewer = less vram, more compute)	tune if still hitting oom after enabling checkpointing

input: tight vram assessment from step 2

output: deepspeed config with activation checkpointing enabled, or trainer arg --gradient_checkpointing True

step 3: list all viable training options

based on vram budget, list every option that fits (with 20% headroom). do not auto-select. present them as:

example for single 24gb gpu, qwen2.5-7b:

Qwen2.5-7B on your hardware (single 24GB GPU, 64GB RAM, 500GB disk):

Option A: LoRA Fine-tuning (Recommended)
  VRAM needed: ~22 GB
  Speed: Fast
  Quality: Good for style, instruction alignment, domain adaptation
  Trainable params: ~20M (0.4% of 7B total)
  Inference: Can run on same 24GB GPU or any smaller GPU
  Best for: Quick iteration, limited vram, preserving general knowledge

Option B: QLoRA Fine-tuning (Most VRAM-efficient)
  VRAM needed: ~12 GB
  Speed: Medium (quantization/dequantization overhead)
  Quality: Slightly below LoRA, gap is small for most tasks
  Trainable params: ~20M (0.4% of 7B total)
  Inference: Can run on same 24GB GPU or smaller
  Best for: Severe VRAM constraints, acceptable speed tradeoff

Option C: Full Fine-tuning (Requires ZeRO + Offload)
  VRAM needed: ~56 GB (with ZeRO-3 + CPU offload) or ~70 GB (ZeRO-2)
  Effective hardware: Exceeds your 24GB GPU; would require CPU offload (very slow)
  Speed: Not feasible without hardware upgrade
  Trainable params: All 7B (100%)
  Quality: Maximum quality, but not worth the slowdown
  Recommendation: Skip this. Use LoRA instead.

Which option do you prefer? (A, B, or C)

input: hardware profile, vram estimates, model size

output: numbered list of viable options with explicit vram, speed, quality, trainability, and inference notes; user selection of one option

step 4: if no option is feasible, recommend hardware

if user's hardware cannot fit any strategy (even qlora with checkpointing), give explicit hardware upgrade recommendations using generic metrics only (no brand names).

example:

Your goal (full fine-tune Qwen2.5-7B) is not feasible on current hardware (24GB GPU).

To achieve your goal, upgrade to:

Minimum spec:
  GPU: single 80GB VRAM
  CPU: 16+ cores
  RAM: 128GB+
  Disk: 200GB+ free

Recommended spec:
  GPU: 2x 80GB VRAM (doubles training speed via ZeRO-2)
  CPU: 32+ cores
  RAM: 256GB+
  Disk: 500GB+ free

Alternative: Switch to LoRA fine-tuning (feasible on 24GB GPU; quality is typically close to full fine-tuning for instruction and style adaptation).

if user picks infeasible option, warn explicitly: "Option C (full FT) needs ~56GB VRAM but you have 24GB. training will oom. pick option A or B instead, or upgrade hardware."

input: user selection of infeasible strategy, current hardware specs

output: hardware upgrade recommendations or fallback strategy suggestion; explicit oom warning if user insists

step 5: generate deepspeed config for selected strategy

based on selected option (lora, qlora, or full ft), generate or load appropriate deepspeed zero config.

lora/qlora config (minimal, single or multi-gpu):

{
  "train_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.0
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 2e-5,
      "warmup_num_steps": 500,
      "total_num_steps": 10000
    }
  },
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1,
    "offload_optimizer": {
      "device": "cpu"
    }
  }
}

full fine-tuning config (zero-2, single node multi-gpu):

{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 1e-4 }
  },
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}

full fine-tuning config (zero-3, multi-node or extreme vram constraint):

{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 1,
  "optimizer": { "type": "AdamW", "params": { "lr": 1e-4 } },
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}

input: selected strategy (lora/qlora/full ft), gpu count, nvme availability, batch size preference

output: deepspeed config json file (e.g., assets/ds_config_lora.json); save to disk in working directory
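
one detail worth checking before saving a config: deepspeed expects train_batch_size to equal the per-gpu micro batch size times gradient_accumulation_steps times the number of gpus. the sketch below derives a consistent set of values; the helper name is hypothetical. when the config is passed through the hugging face trainer, these fields can usually be set to "auto" instead so they follow the command-line arguments.

```python
import json
from typing import Dict


def build_batch_settings(micro_batch_per_gpu: int, grad_accum_steps: int,
                         num_gpus: int) -> Dict[str, int]:
    """Return batch fields that satisfy DeepSpeed's batch-size relationship."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * num_gpus,
    }


# Single GPU, micro batch 4, 4 accumulation steps -> effective batch of 16.
settings = build_batch_settings(micro_batch_per_gpu=4, grad_accum_steps=4, num_gpus=1)
print(json.dumps(settings, indent=2))
```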

step 6: prepare dataset and validate format

load dataset and check format. supported formats: huggingface dataset id, jsonl, csv, parquet, or arrow.

jsonl example (minimal: text column required):

{"text": "Your training text here. Can be single or multiple columns."}
{"text": "Another sample."}

csv example (columns: text, instruction, output, etc.):

text,label
"sample text 1","category1"
"sample text 2","category2"

huggingface dataset example:

from datasets import load_dataset
dataset = load_dataset("tatsu-lab/alpaca")  # or local path
print(dataset.column_names)  # check available columns

validate dataset:

has at least 100 samples (minimum; <1000 samples risks overfitting)
no empty or null text columns
text length reasonable (100-10000 tokens typically)
if using eval dataset, must be separate from train (no data leakage)
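
a minimal sketch of the first two checks for a jsonl dataset, using only the standard library. the function name, the "text" key default, and the message wording are illustrative; token-length and leakage checks are left out because they depend on the tokenizer and on how the eval split was built.

```python
import json
from typing import Iterable, List


def validate_jsonl_lines(lines: Iterable[str], text_key: str = "text",
                         min_samples: int = 100) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    count = 0
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(raw)
        except json.JSONDecodeError:
            problems.append(f"line {lineno}: invalid JSON")
            continue
        count += 1
        value = row.get(text_key) if isinstance(row, dict) else None
        if not isinstance(value, str) or not value.strip():
            problems.append(f"line {lineno}: empty or missing '{text_key}'")
    if count < min_samples:
        problems.append(f"only {count} samples (minimum {min_samples})")
    return problems


sample = ['{"text": "Another sample."}', '{"text": ""}', "not json"]
print(validate_jsonl_lines(sample, min_samples=2))
```

for a real file, pass an open file handle (for example `open(path, encoding="utf-8")`) as `lines`.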

input: dataset path or huggingface id, dataset format

output: loaded dataset with schema confirmed; sample row printed; train/eval split assigned

step 7: launch training

use the deepspeed launcher (or torchrun) for multi-gpu runs, never plain python. for a single gpu, plain python is enough, though the deepspeed launcher also works.

single gpu (no deepspeed launcher needed, but can use):

python scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_path data/my_dataset \
  --output_dir ./outputs \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --num_train_epochs 3 \
  --learning_rate 2e-5 \
  --use_peft True \
  --lora_r 16 \
  --lora_alpha 32 \
  --gradient_checkpointing True \
  --bf16 True \
  --logging_steps 10 \
  --save_steps 500 \
  --eval_strategy steps \
  --eval_steps 100 \
  --early_stopping_patience 5

multi-gpu single node (4 gpus):

deepspeed --num_gpus=4 scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_path data/my_dataset \
  --output_dir ./outputs \
  --deepspeed assets/ds_config_zero2.json \
  --per_device_train_batch_size 4 \
  --num_train_epochs 3 \
  --bf16 True \
  --gradient_checkpointing True

multi-node (2 nodes, 4 gpus each):

# step 1: create hosts.txt on node 0
# node0_ip slots=4
# node1_ip slots=4

# step 2: launch from node 0
deepspeed --hostfile hosts.txt scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --deepspeed assets/ds_config_zero3.json \
  --per_device_train_batch_size 4 \
  --num_train_epochs 3

input: deepspeed config, model name, dataset path, training hyperparams, output dir

output: training process started; logs streamed to stdout; checkpoints saved to output_dir

step 8: monitor training and detect issues early

during training, watch for:

good signs:

loss decreasing smoothly every 10-50 steps
gpu utilization 80-95% (lower = data loading bottleneck)
throughput consistent (samples/sec stable)

warning signs (stop and debug if seen):

loss nan/inf (usually indicates lr too high or numeric instability)
loss flat or increasing (lr too low or model not learning)
gpu utilization <50% (data loading is bottleneck, increase workers)
out-of-memory error (oom; reduce batch size or enable checkpointing)
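
the first two warning signs can be checked automatically from a list of logged loss values. this is a rough sketch: the function name, the window size, and the "stalled" rule (the mean of the most recent window is not lower than the mean of the window before it) are assumptions chosen for illustration, not the skill's own monitoring logic.

```python
import math
from typing import Sequence


def check_losses(losses: Sequence[float], window: int = 50,
                 min_drop: float = 0.0) -> str:
    """Classify a loss history as 'diverged', 'stalled', or 'ok'."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "diverged"
    if len(losses) >= 2 * window:
        earlier = sum(losses[-2 * window:-window]) / window
        recent = sum(losses[-window:]) / window
        # Flat or increasing loss over the last two windows.
        if earlier - recent <= min_drop:
            return "stalled"
    return "ok"


decreasing = [2.0 - 0.01 * step for step in range(100)]
print(check_losses(decreasing))
print(check_losses([1.0] * 100))
print(check_losses([2.0, 1.9, float("nan")]))
```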

monitoring commands:

# watch loss in real-time
tail -f outputs/training_logs.txt | grep --line-buffered loss

# check gpu memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1

# check training throughput
grep "samples/sec" outputs/training_logs.txt | tail -1

input: training process running, output logs available

output: visual confirmation that training is progressing normally; early detection of oom or divergence

step 9: evaluate with early stopping (optional)

if you specified eval_dataset and early_stopping_patience > 0, training will automatically:

evaluate on eval_dataset every eval_steps
track best eval_loss
load the best checkpoint when training ends (whether it stopped early or ran to completion)

parameters:

--eval_strategy steps or epoch: frequency of eval
--eval_steps 100: evaluate every n steps (if eval_strategy=steps)
--early_stopping_patience 5: stop if no improvement for 5 consecutive evals (0 = disabled)
--early_stopping_threshold 0.001: minimum improvement (in absolute loss) to count as improvement

example:

python scripts/ds_train.py \
  --model_name_or_path Qwen/Qwen2.5-7B \
  --dataset_path tatsu-lab/alpaca \
  --eval_dataset_path tatsu-lab/alpaca_eval \
  --use_peft True \
  --eval_strategy steps \
  --eval_steps 100 \
  --early_stopping_patience 5 \
  --early_stopping_threshold 0.001 \
  ...

auto-configuration (when early_stopping_patience > 0):

sets load_best_model_at_end=True
sets metric_for_best_model=eval_loss, greater_is_better=False
syncs save_strategy with eval_strategy

notes:

requires eval_strategy to be set (plus eval_steps when eval_strategy=steps)
patience too low (<3) risks premature stopping due to early training noise
for lora on 7b model, patience=5 with eval_steps=100 is typical
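
how patience and threshold interact is easiest to see on a short list of eval losses. the sketch below illustrates the semantics described above (an eval counts as an improvement only if it beats the best loss by more than the threshold); it is not the training script's implementation, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def early_stop_index(eval_losses: Sequence[float], patience: int,
                     threshold: float = 0.0) -> Optional[int]:
    """Index of the evaluation at which training would stop, or None."""
    best = float("inf")
    evals_without_improvement = 0
    for index, loss in enumerate(eval_losses):
        if best - loss > threshold:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return index
    return None


losses = [1.0, 0.9, 0.9005, 0.8995, 0.95, 0.93]
print(early_stop_index(losses, patience=3, threshold=0.001))
print(early_stop_index(losses, patience=5, threshold=0.001))
```

in this example the drop from 0.9 to 0.8995 is smaller than the 0.001 threshold, so it still counts against patience.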

input: eval_dataset, early_stopping hyperparams, eval_strategy set

output: eval_loss printed every eval_steps; best model checkpoint saved; training stops early if no improvement

step 10: save and validate final model

after training completes, validate the saved model:

# check final checkpoint
ls -lah outputs/checkpoint-final/

# verify model weights exist
ls outputs/checkpoint-final/pytorch_model.bin  # or model.safetensors

# for lora: check adapter weights
ls outputs/checkpoint-final/adapter_config.json
ls outputs/checkpoint-final/adapter_model.bin  # or adapter_model.safetensors

merge lora weights into base model (optional, for inference deployment):

from peft import AutoPeftModelForCausalLM

# load lora + base model
model = AutoPeftModelForCausalLM.from_pretrained(
    "outputs/checkpoint-final",
    device_map="auto"
)

# merge and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained("outputs/merged_model")
merged_model.push_to_hub("your-username/my-finetuned-model")  # optional

input: training output_dir with saved checkpoints

output: final model weights validated and accessible for inference; optionally merged (lora only) and pushed to huggingface hub

decision points

when to use lora vs qlora vs full fine-tuning:

use lora if: the estimated lora footprint fits your gpu (about 22gb for a 7b model in the step 3 example), you want quality close to full ft, need fast training, model size is 7b-70b
use qlora if: gpu has 12-24gb vram, you accept tiny quality hit vs lora (usually <2%), training speed is secondary
use full fine-tuning if: you have >80gb vram per gpu (or multi-gpu setup), need maximum quality, willing to train for hours/days, model is <30b params (larger needs zero-3)
recommendation: for most users, lora is the sweet spot (quality, speed, vram efficiency)

when to enable gradient checkpointing:

enable if: estimated activation memory + model weights > 70% of gpu vram, acceptable to lose 20% training speed
disable if: gpu vram is abundant (>50% unused), or speed is critical and vram is not bottleneck
auto-enable in procedure step 2.5 if vram is tight

when to use zero-1 vs zero-2 vs zero-3:

zero-1 (optimizer state partitioning): single or 2-gpu setup, limited multi-gpu benefit, use for lora/qlora
zero-2 (optimizer + gradient partitioning): 2-8 gpus per node, good speed/vram balance, use for full ft with moderate vram
zero-3 (parameter + optimizer + gradient partitioning): 8+ gpus or multi-node, extreme vram savings but slower (extra communication to gather partitioned parameters), use only if zero-2 oom or multi-node
recommendation: start with zero-1 or zero-2; only move to zero-3 if oom or multi-node setup
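
the effect of each stage on per-gpu memory can be sketched with the byte counts from step 2 (2 bytes/param for weights, 2 for gradients, 8 for adam states). this is an idealized estimate that ignores activations, buffers, and communication overhead; the function name is illustrative.

```python
def zero_model_state_gb_per_gpu(params_billion: float, num_gpus: int,
                                stage: int) -> float:
    """Per-GPU model-state memory in GB for ZeRO stage 0-3 (idealized)."""
    weights = 2 * params_billion
    grads = 2 * params_billion
    optimizer = 8 * params_billion
    if stage >= 1:
        optimizer /= num_gpus  # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus      # stage 2 also partitions gradients
    if stage >= 3:
        weights /= num_gpus    # stage 3 also partitions parameters
    return weights + grads + optimizer


for stage in range(4):
    per_gpu = zero_model_state_gb_per_gpu(params_billion=7, num_gpus=4, stage=stage)
    print(f"ZeRO-{stage}: {per_gpu:.1f} GB per GPU")
```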

when to offload to cpu vs nvme:

cpu offload: offload_device="cpu", best if cpu ram is abundant (>256gb) and the cpu-gpu link is fast (e.g., pcie gen4/gen5 x16)
nvme offload: offload_device="nvme" (deepspeed accepts only "cpu" or "nvme" as offload devices, and nvme offload requires zero-3), slower but necessary if cpu ram is limited and you must fit model; requires fast nvme (>3gb/sec read/write)
no offload: if model fits in vram (lora, small models), skip offloading entirely for speed

when to use remote training:

use remote if: local hardware insufficient, have ssh access to remote gpu server, network latency <100ms typical
avoid remote if: network is unreliable (interrupts training), you need real-time monitoring, latency >200ms (slow data transfer)
see: references/remote_training.md for complete remote setup and security guidelines

when to run validation/early stopping:

use if: eval_dataset available, want to avoid overfitting, training epochs > 1, hyperparams are uncertain
skip if: dataset small (<1000 samples), single epoch training, computational budget tight (eval adds overhead)
patience tuning: set patience=3-5 for most cases; higher (10+) if training is noisy or dataset small; lower (<3) if dataset very clean

output contract

final trained model saved at: {output_dir}/checkpoint-final/ or latest checkpoint (e.g., checkpoint-500/)

model structure:

pytorch_model.bin or model.safetensors: base model weights (full ft) or adapter weights (lora)
config.json: model architecture config
tokenizer.json or tokenizer_config.json: tokenizer config
adapter_config.json + adapter_model.bin: lora config and weights (