---
name: deepspeed-finetune
description: Fine-tune large language models using DeepSpeed on local or remote GPUs.
version: 1.0.5
metadata:
  openclaw:
    requires:
      bins:
        - python3
        - deepspeed
        - sshpass
    emoji: "⚡"
    homepage: https://github.com/deepspeedai/deepspeed-skills/tree/main/openclaw/deepspeed-finetune
---
# DeepSpeed Fine-tuning Skill
This skill enables efficient model fine-tuning using DeepSpeed with various optimization strategies.
## Prerequisites
- Python 3.8+
- GPU(s) or accelerator(s) with DeepSpeed-supported backend (CUDA, ROCm, Intel XPU, etc.)
- DeepSpeed: `pip install deepspeed`
- Transformers, Datasets, PEFT (for LoRA support)
- sshpass: `sudo apt-get install sshpass` (for remote training)
## Plan Selection Workflow
**Never auto-select a plan.** List viable options based on user hardware and requirements, and let the user decide.
### Step 1: Gather Information
Confirm the following with the user:
- **Target model**: Model name and parameter count (e.g., Qwen2.5-7B)
- **Hardware environment**:
  - GPU VRAM x count (e.g., "single 24GB GPU")
  - CPU core count
  - RAM size
  - Free disk space
  - NVMe SSD availability (affects ZeRO NVMe offload)
- **Training goal**: Full fine-tuning or parameter-efficient? Dataset size? Expected quality?
- **Budget/time constraints**: Acceptable training duration?
If the user only provides an SSH or remote machine address, connect first and auto-detect hardware (`nvidia-smi`, `free -h`, `df -h`, `nproc`).
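As an illustration of the auto-detect step, the sketch below parses GPU memory totals. It assumes the remote command is `nvidia-smi --query-gpu=memory.total --format=csv,noheader`, which prints one line per GPU such as `24576 MiB`; the helper name `parse_gpu_memory_mib` is hypothetical and not part of this skill's scripts.

```python
def parse_gpu_memory_mib(nvidia_smi_output: str) -> list:
    """Parse per-GPU memory totals (in MiB) from nvidia-smi CSV output.

    Assumes one line per GPU in the form "<number> MiB".
    """
    totals = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        value, unit = line.split()
        if unit != "MiB":
            raise ValueError(f"unexpected unit in line: {line!r}")
        totals.append(int(value))
    return totals


# Sample text standing in for the captured SSH output (not real measurements).
sample = "24576 MiB\n24576 MiB\n"
gpus = parse_gpu_memory_mib(sample)
print(f"{len(gpus)} GPU(s), {gpus[0] / 1024:.0f} GiB each")
```

The parsed values can then be reported back to the user as "GPU VRAM x count" before any plan is discussed.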
### Step 2: Evaluate Feasibility
Estimate VRAM requirements based on model size (bf16):
| Params | Model Weights (bf16) | Additional: Adam States + Gradients |
|--------|---------------------|----------------------------|
| 0.5B | ~1 GB | ~5 GB |
| 1.5B | ~3 GB | ~15 GB |
| 3B | ~6 GB | ~30 GB |
| 7B | ~14 GB | ~70 GB |
| 14B | ~28 GB | ~140 GB |
| 32B | ~64 GB | ~320 GB |
| 72B | ~144 GB | ~720 GB |
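**Breakdown**: Adam stores two fp32 state tensors per parameter (momentum and variance), which is 8 bytes/param. Gradients in bf16 add 2 bytes/param. Together that is about 10 bytes/param (5x the bf16 weight size), on top of the weights themselves. Mixed-precision setups that also keep an fp32 master copy of the weights need a further 4 bytes/param.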
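The table arithmetic can be reproduced with a few lines of Python. This is a minimal sketch that uses only the per-parameter byte counts stated above, treats 1 GB as 10^9 bytes, and ignores activations; the function name `estimate_training_memory_gb` is hypothetical.

```python
def estimate_training_memory_gb(params_billion: float) -> dict:
    """Rough memory for bf16 full fine-tuning with Adam, excluding activations."""
    weights = 2 * params_billion      # bf16 weights: 2 bytes/param
    adam_states = 8 * params_billion  # two fp32 tensors: 8 bytes/param
    gradients = 2 * params_billion    # bf16 gradients: 2 bytes/param
    return {
        "weights_gb": weights,
        "optimizer_and_gradients_gb": adam_states + gradients,
        "total_gb": weights + adam_states + gradients,
    }


for size in (0.5, 7, 72):
    est = estimate_training_memory_gb(size)
    print(f"{size}B: weights ~{est['weights_gb']:.0f} GB, "
          f"optimizer + gradients ~{est['optimizer_and_gradients_gb']:.0f} GB")
```

Activation memory still has to be added to `total_gb` before comparing against available VRAM.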
**Activation memory**: Depends on sequence length and batch size, not model params alone.
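- Formula (per transformer layer, 16-bit activations, attention-score terms excluded): `activation approx. 34 x seq_len x hidden_size x batch_size` bytes; the factor 34 already accounts for bytes per element
- Example: 7B model (hidden=4096, assuming 32 layers), seq_len=2048, batch_size=4 -> ~1.1 GB per layer; ~37 GB total (can dominate VRAM)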
- Gradient checkpointing reduces this by ~80% (recomputes instead of storing), but adds ~20% compute overhead
- **Rule of thumb**: if seq_len x batch_size > 8192, activation memory likely exceeds model weights
**LoRA/QLoRA**: VRAM depends on rank, target modules, and layer dimensions — not directly proportional to total model params. See [references/lora_guide.md](references/lora_guide.md) for LoRA-specific memory estimation.
### Step 2.5: Activation Checkpointing
If VRAM is tight, activation checkpointing is the most impactful knob — it can reduce activation memory by ~80%.
**How it works**: Instead of storing all intermediate activations for backprop, only save checkpoints at select layers. Remaining activations are recomputed during backward pass. Trades compute for memory.
**Two ways to enable:**
1. **HF Trainer flag** (simplest, works out of the box):
```bash
python scripts/ds_train.py --gradient_checkpointing ...
```
2. **DeepSpeed config** (fine-grained control):
```json
{
"activation_checkpointing": {
"partition_activations": true,
"cpu_checkpointing": true,
"contiguous_memory_optimization": true,
"number_checkpoints": 4
}
}
```
| Option | Effect | When to use |
|--------|--------|-------------|
| `partition_activations` | Shard checkpoints across model-parallel GPUs | Multi-GPU with model parallelism |
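| `cpu_checkpointing` | Offload partitioned checkpoints to CPU RAM instead of GPU (takes effect only with `partition_activations`) | GPU memory very tight |
| `contiguous_memory_optimization` | Copy partitioned activations into contiguous buffers to reduce memory fragmentation | Large models, many checkpoints |
| `number_checkpoints` | Total number of activation checkpoints, used to size the buffer for `contiguous_memory_optimization` | Set alongside `contiguous_memory_optimization` |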
### Step 3: List Options
Based on the VRAM assessment, list all viable approaches. Example:
```
Based on your hardware (single 24GB GPU, 64GB RAM, 500GB disk),
Qwen2.5-7B has these training options:
Option A: LoRA Fine-tuning (Recommended)
- VRAM needed: ~22 GB
- Speed: Fast
- Quality: Good for instruction alignment, style adaptation
- Trainable params: ~20M (0.4% of total)
Option B: QLoRA Fine-tuning (Saves VRAM)
- VRAM needed: ~12 GB
- Speed: Medium (quantization/dequantization overhead)
- Quality: Slightly below LoRA, but gap is small
Option C: Full Fine-tuning (Not feasible)
- VRAM needed: ~56 GB (exceeds 24GB)
- Requires ZeRO-2 + CPU offload, or larger GPU
Which option do you prefer?
```
### Step 4: Hardware Insufficient? Make Recommendations
If no plan is viable on current hardware, recommend specs using generic hardware metrics (no brand names):
```
You want to fully fine-tune a 7B model, but current hardware (single 24GB GPU) is insufficient.
Recommended hardware specs:
Minimum:
- GPU: single 80GB VRAM
- CPU: 16+ cores
- RAM: 128 GB+
- Disk: 200 GB+ free space
Recommended:
- GPU: 2x 80GB VRAM (ZeRO-2 doubles training speed)
- CPU: 32+ cores
- RAM: 256 GB+
- Disk: 500 GB+ free space
Alternatively, use LoRA — 24GB VRAM is sufficient for 7B models.
```
### Key Principles
- **Never auto-select and start training** — always list options and wait for user confirmation
- **Recommend but don't decide** — say "I recommend Option A because..." but let the user choose
- **Use generic hardware metrics** — VRAM in GB, GPU count, CPU cores, RAM in GB, disk in GB. No brand names.
- **Leave VRAM headroom** — recommend at least 20% buffer to avoid OOM
- **If user picks an infeasible option, warn them clearly** rather than silently switching
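The headroom and "list, don't decide" principles can be sketched as a small classifier. The VRAM figures are the ones from the Step 3 example; the function `classify_options` and its three labels are hypothetical, and the function only annotates options without ever choosing one.

```python
def classify_options(vram_gb: float, options: dict, headroom: float = 0.20) -> dict:
    """Label each option against available VRAM; the choice stays with the user.

    "fits"       -> within VRAM after reserving the headroom buffer
    "tight"      -> within raw VRAM but inside the headroom buffer
    "infeasible" -> exceeds raw VRAM
    """
    budget = vram_gb * (1 - headroom)
    labels = {}
    for name, needed_gb in options.items():
        if needed_gb <= budget:
            labels[name] = "fits"
        elif needed_gb <= vram_gb:
            labels[name] = "tight"
        else:
            labels[name] = "infeasible"
    return labels


options = {"LoRA": 22, "QLoRA": 12, "Full fine-tuning": 56}
for name, label in classify_options(24, options).items():
    print(f"{name}: {label}")
```

With a 20% buffer on a 24 GB GPU the budget is about 19.2 GB, so the ~22 GB LoRA estimate lands in the "tight" band. That is worth stating explicitly when presenting the options.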
## Core Capabilities
### 1. Training Configuration
Generate DeepSpeed ZeRO configurations:
```python
from scripts.generate_ds_config import generate_zero_config
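# ZeRO Stage 3 with optimizer state offloaded to NVMe.
# DeepSpeed documents NVMe offload as available only with ZeRO stage 3;
# for stage 2, offload to "cpu" instead.
config = generate_zero_config(
    zero_stage=3,
    offload_optimizer=True,
    offload_device="nvme",
    nvme_path="/local_nvme",
)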
```
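For readers who want to see what such a generator might produce, here is a minimal sketch. It is not the implementation of `scripts/generate_ds_config.py`; the helper `build_zero_config` and its arguments are hypothetical. The emitted keys (`zero_optimization.stage`, `offload_optimizer.device`, `nvme_path`, `pin_memory`) are standard DeepSpeed config keys, and the sketch rejects NVMe offload below stage 3 because DeepSpeed documents NVMe offload as ZeRO-3 only.

```python
import json
from typing import Optional


def build_zero_config(zero_stage: int,
                      offload_device: str = "none",
                      nvme_path: Optional[str] = None) -> dict:
    """Hypothetical helper: build a minimal DeepSpeed ZeRO config dict."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    if offload_device not in ("none", "cpu", "nvme"):
        raise ValueError("offload_device must be 'none', 'cpu', or 'nvme'")
    if offload_device != "none" and zero_stage == 0:
        raise ValueError("optimizer offload requires a ZeRO stage >= 1")
    if offload_device == "nvme":
        if zero_stage != 3:
            raise ValueError("NVMe offload requires ZeRO stage 3")
        if not nvme_path:
            raise ValueError("nvme_path is required for NVMe offload")

    zero = {"stage": zero_stage}
    if offload_device != "none":
        offload = {"device": offload_device, "pin_memory": True}
        if offload_device == "nvme":
            offload["nvme_path"] = nvme_path
        zero["offload_optimizer"] = offload
    return {"bf16": {"enabled": True}, "zero_optimization": zero}


print(json.dumps(build_zero_config(3, "nvme", "/local_nvme"), indent=2))
```

Failing early on an invalid stage/offload combination keeps the error visible at config-generation time rather than at launch.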
### 2. Training Launch
Use the training launcher script:
```bash
python scripts/ds_train.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--dataset_path data/my_dataset \
--output_dir ./outputs \
--deepspeed assets/ds_config_zero2.json \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--learning_rate 2e-5 \
--lora_r 16 \
--lora_alpha 32
```
### 3. LoRA/QLoRA Integration
For parameter-efficient fine-tuning:
```python
# LoRA config is auto-generated based on arguments
peft_config = {
"peft_type": "LORA",
"r": 16,
"lora_alpha": 32,
"target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
"lora_dropout": 0.05,
"bias": "none",
"task_type": "CAUSAL_LM"
}
```
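To get a feel for how the rank and target modules translate into trainable parameters, the sketch below counts LoRA weights. Each adapted linear layer of shape `d_in x d_out` gains `r * (d_in + d_out)` parameters (the two low-rank matrices). The 4096x4096 projection shapes and 32 layers are assumptions matching a Llama-2-7B-style architecture, not values read from this skill.

```python
def lora_trainable_params(rank: int, module_shapes: dict, num_layers: int) -> int:
    """Count LoRA parameters: r * (d_in + d_out) per adapted module, per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * num_layers


# Assumed shapes: four 4096x4096 attention projections per layer.
shapes = {name: (4096, 4096) for name in ("q_proj", "k_proj", "v_proj", "o_proj")}
total = lora_trainable_params(rank=16, module_shapes=shapes, num_layers=32)
print(f"{total:,} trainable parameters")
```

Halving the rank halves this count, which is why reducing `--lora_r` is a cheap first response to memory pressure from adapter weights and their optimizer state.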
### 4. Multi-GPU Training
Use the `deepspeed` launcher for multi-GPU training (recommended over `torchrun`):
```bash
# Multi-GPU on single node
deepspeed --num_gpus=4 scripts/ds_train.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--deepspeed assets/ds_config_zero3.json \
...
# Multi-node
deepspeed --hostfile hosts.txt scripts/ds_train.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--deepspeed assets/ds_config_zero3.json \
...
```
### 5. Training Monitoring
Monitor training progress:
```python
from scripts.monitor_training import TrainingMonitor
monitor = TrainingMonitor(log_dir="./outputs")
monitor.plot_loss()
monitor.get_latest_checkpoint()
```
### 6. Early Stopping
Automatically monitors eval loss and stops training early when there's no improvement across consecutive evaluations, then loads the best checkpoint.
**Parameters:**
- `--early_stopping_patience` — How many consecutive evals without improvement to tolerate. Set to 0 to disable (default). Recommended: 3-10.
- `--early_stopping_threshold` — Minimum eval loss improvement to count as an improvement. Default 0.0 (any decrease counts).
**Example:**
```bash
python scripts/ds_train.py \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_path tatsu-lab/alpaca \
--use_peft True \
--early_stopping_patience 5 \
--early_stopping_threshold 0.001 \
--eval_strategy steps \
--eval_steps 100 \
--num_train_epochs 3 \
...
```
**Auto-configuration:** When `early_stopping_patience > 0`, the script automatically:
1. Enables `load_best_model_at_end=True`
2. Sets `metric_for_best_model=eval_loss`, `greater_is_better=False`
3. Aligns `save_strategy` with `eval_strategy` (synced saving is needed to restore best checkpoint)
**Notes:**
- Must also set `eval_strategy` (e.g., `steps` + `eval_steps`), otherwise early stopping won't work
- Don't set `patience` too low (<3) — early training fluctuations may cause premature stopping
- For LoRA fine-tuning, `patience=5` with `eval_steps=100` typically works well
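The interaction between patience and threshold can be sketched in plain Python. This mirrors the behaviour described above (an evaluation counts as an improvement only if eval loss drops by more than the threshold; training stops after `patience` consecutive non-improving evaluations). It is an illustration, not the code in `scripts/ds_train.py`, and the function name is hypothetical.

```python
def early_stop_index(eval_losses, patience: int, threshold: float = 0.0):
    """Return the index of the evaluation that triggers a stop, or None."""
    if patience <= 0:
        return None  # early stopping disabled
    best = float("inf")
    stale = 0  # consecutive evaluations without sufficient improvement
    for index, loss in enumerate(eval_losses):
        if best - loss > threshold:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return index
    return None


losses = [1.0, 0.9, 0.8995, 0.8999, 0.91]
print(early_stop_index(losses, patience=3, threshold=0.001))
```

In this sequence the drop from 0.9 to 0.8995 is below the 0.001 threshold, so it counts as a non-improving evaluation even though the loss decreased.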
## Remote Training
When training needs to run on a remote GPU server, see [references/remote_training.md](references/remote_training.md) for the complete guide including agent guidelines, security model, and command reference.
## Troubleshooting
### OOM Errors
- Reduce batch size or increase gradient accumulation steps
- Enable gradient checkpointing: `--gradient_checkpointing`
- Use ZeRO-3 with CPU/NVMe offloading
- Reduce LoRA rank: `--lora_r 8`
- See [references/troubleshooting.md](references/troubleshooting.md) for detailed solutions
### Slow Training
- Ensure bf16/fp16 is enabled
- Check GPU utilization with `nvidia-smi`
- Use FlashAttention if available
- Optimize data loading with `--dataloader_num_workers`
- See [references/troubleshooting.md](references/troubleshooting.md) for detailed solutions
### Checkpoint Issues
- Use `--save_strategy steps` with `--save_steps`
- Enable `--save_total_limit` to cap checkpoint count
- For ZeRO-3, use `--zero3_save_16bit_model` to save FP16 weights
- See [references/troubleshooting.md](references/troubleshooting.md) for detailed solutions
### MPI Errors (multi-GPU only)
- Single-GPU training does **not** need MPI
- If you see MPI errors on single GPU, use `python3` directly instead of `deepspeed` launcher
- See [references/troubleshooting.md](references/troubleshooting.md#mpi-errors) for full MPI debugging guide
### Single-GPU Strategy
- See [references/single_gpu_strategy.md](references/single_gpu_strategy.md) for strategy selection, CPU/NVMe offload examples, and decision principles
## References
- **[Quick Start Guide](references/quick_start.md)** — Common training patterns and full examples
- **[DeepSpeed Guide](references/deepspeed_guide.md)** — DeepSpeed documentation and configuration reference
- **[LoRA/PEFT Best Practices](references/lora_guide.md)** — LoRA/QLoRA parameter tuning guide
- **[ZeRO Optimization Guide](references/zero_optimization.md)** — ZeRO stage comparison and optimization tips
- **[Single-GPU Strategy](references/single_gpu_strategy.md)** — Strategy selection for single-GPU training
- **[Remote Training Guide](references/remote_training.md)** — Remote training via SSH, agent guidelines, and security model
- **[Troubleshooting](references/troubleshooting.md)** — Common errors and solutions (OOM, NaN loss, MPI, NCCL, etc.)
by @davila7
Fine-tune large language models using DeepSpeed with hardware-aware optimization strategies. Use this skill when you need to adapt a pretrained model to your data, domain, or style, and want to maximize training efficiency across single or multi-GPU setups. The skill walks you through hardware assessment, plan selection (LoRA, QLoRA, full fine-tuning, or ZeRO-stage strategies), and safe launch procedures that avoid silent OOM or performance cliffs.

Required software:
- `pip install deepspeed`
- `pip install transformers`
- `pip install datasets`
- `pip install peft`
- `apt-get install sshpass` or `brew install sshpass`

Required hardware:

Required data and models:
- Model (e.g., `meta-llama/Llama-2-7b-hf`, `Qwen/Qwen2.5-7B`)

Environment variables (optional but recommended):
- `HUGGINGFACE_TOKEN`: HF Hub token for gated models (set via `huggingface-cli login` or `export`)
- `DEEPSPEED_OFFLOAD_PATH`: path for ZeRO-Offload NVMe staging (defaults to `/tmp`)
- `CUDA_VISIBLE_DEVICES`: limit GPUs visible to training (e.g., `0,1` for the first two GPUs)

External connections:
- Hugging Face cache: `~/.cache/huggingface`

Remote training setup (if applicable):

Confirm with the user (or auto-detect via SSH if remote):

- **Input**: user responses or remote SSH connection
- **Output**: structured hardware profile and training goals documented

If the user provides only an SSH address, auto-detect via remote commands:
- `nvidia-smi --query-gpu=memory.total --format=csv,noheader`
- `nproc` (logical cores)
- `free -h` (system memory)
- `df -h /` (free space in training directory)
- `lsblk | grep nvme` (check for NVMe drives)

Use the memory breakdown table below to compute worst-case VRAM needs. All estimates assume bf16 dtype (mixed precision).

Model weight memory (bf16):

| Params | Model Weights (bf16) |
|---|---|
| 0.5B | ~1 GB |
| 1.5B | ~3 GB |
| 3B | ~6 GB |
| 7B | ~14 GB |
| 14B | ~28 GB |
| 32B | ~64 GB |
| 72B | ~144 GB |

Optimizer state memory (full fine-tuning only):

Activation memory (depends on sequence length and batch size, not just model size):

LoRA/QLoRA VRAM (parameter-efficient):

- **Input**: model name, GPU VRAM, sequence length, batch size
- **Output**: feasibility assessment with VRAM breakdown; list of the strategies (LoRA, QLoRA, full, ZeRO-x) that fit within the GPU VRAM budget with 20% headroom

If VRAM is tight (estimated usage > 70% of GPU VRAM), enable activation checkpointing. This trades compute (~20% slower) for memory (~80% reduction in activation memory).

Two ways to enable:

1. Trainer flag:
```bash
python scripts/ds_train.py --gradient_checkpointing True ...
```
2. DeepSpeed config:
```json
{
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4
  }
}
```

Activation checkpointing options:

| Option | Effect | When to use |
|---|---|---|
| `partition_activations` | Shard checkpoints across model-parallel GPUs | Multi-GPU setup with model parallelism (ZeRO-3) |
| `cpu_checkpointing` | Store checkpoints in CPU RAM instead of GPU (takes effect only with `partition_activations`) | GPU memory critically tight; CPU RAM abundant |
| `contiguous_memory_optimization` | Reduce memory fragmentation | Large models (>30B params), many checkpoints |
| `number_checkpoints` | Total number of activation checkpoints, used to size the buffer for `contiguous_memory_optimization` | Set alongside `contiguous_memory_optimization` |

- **Input**: tight VRAM assessment from step 2
- **Output**: DeepSpeed config with activation checkpointing enabled, or trainer arg `--gradient_checkpointing True`

Based on the VRAM budget, list every option that fits (with 20% headroom). Do not auto-select. Present them as:

Example for a single 24GB GPU, Qwen2.5-7B:

```
Qwen2.5-7B on your hardware (single 24GB GPU, 64GB RAM, 500GB disk):
Option A: LoRA Fine-tuning (Recommended)
- VRAM needed: ~22 GB
- Speed: ~150 samples/sec (typical)
- Quality: Good for style, instruction alignment, domain adaptation
- Trainable params: ~20M (0.4% of 7B total)
- Inference: Can run on same 24GB GPU or any smaller GPU
- Best for: Quick iteration, limited VRAM, preserving general knowledge
Option B: QLoRA Fine-tuning (Most VRAM-efficient)
- VRAM needed: ~12 GB
- Speed: ~80 samples/sec (quantization overhead)
- Quality: Slightly below LoRA, gap is small for most tasks
- Trainable params: ~20M (0.4% of 7B total)
- Inference: Can run on same 24GB GPU or smaller
- Best for: Severe VRAM constraints, acceptable speed tradeoff
Option C: Full Fine-tuning (Requires ZeRO + Offload)
- VRAM needed: ~56 GB (with ZeRO-3 + CPU offload) or ~70 GB (ZeRO-2)
- Effective hardware: Exceeds your 24GB GPU; would require CPU offload (very slow)
- Speed: Not feasible without hardware upgrade
- Trainable params: All 7B (100%)
- Quality: Maximum quality, but not worth the slowdown
- Recommendation: Skip this. Use LoRA instead.
Which option do you prefer? (A, B, or C)
```

- **Input**: hardware profile, VRAM estimates, model size
- **Output**: labeled list of viable options with explicit VRAM, speed, quality, trainability, and inference notes; user selection of one option

If the user's hardware cannot fit any strategy (even QLoRA with checkpointing), give explicit hardware upgrade recommendations using generic metrics only (no brand names).

Example:

```
Your goal (full fine-tune Qwen2.5-7B) is not feasible on current hardware (24GB GPU).
To achieve your goal, upgrade to:
Minimum spec:
- GPU: single 80GB VRAM
- CPU: 16+ cores
- RAM: 128GB+
- Disk: 200GB+ free
Recommended spec:
- GPU: 2x 80GB VRAM (doubles training speed via ZeRO-2)
- CPU: 32+ cores
- RAM: 256GB+
- Disk: 500GB+ free
Alternative: Switch to LoRA fine-tuning (feasible on 24GB GPU, 98% of full FT quality).
```

If the user picks an infeasible option, warn explicitly: "Option C (full FT) needs ~56GB VRAM but you have 24GB. Training will OOM. Pick Option A or B instead, or upgrade hardware."

- **Input**: user selection of infeasible strategy, current hardware specs
- **Output**: hardware upgrade recommendations or fallback strategy suggestion; explicit OOM warning if the user insists

Based on the selected option (LoRA, QLoRA, or full fine-tuning), generate or load the appropriate DeepSpeed ZeRO config. Gradient checkpointing is not a top-level DeepSpeed config key; enable it with the trainer flag `--gradient_checkpointing True`.

LoRA/QLoRA config (minimal, single or multi-GPU):

```json
{
  "train_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.0
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 2e-5,
      "warmup_num_steps": 500,
      "total_num_steps": 10000
    }
  },
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1,
    "offload_optimizer": {
      "device": "cpu"
    }
  }
}
```

Full fine-tuning config (ZeRO-2, single-node multi-GPU). Optimizer state is offloaded to CPU here, because DeepSpeed supports NVMe offload and parameter offload only with ZeRO-3:

```json
{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 1e-4 }
  },
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```

Full fine-tuning config (ZeRO-3, multi-node or extreme VRAM constraint):

```json
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 1,
  "optimizer": { "type": "AdamW", "params": { "lr": 1e-4 } },
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

- **Input**: selected strategy (LoRA/QLoRA/full FT), GPU count, NVMe availability, batch size preference
- **Output**: DeepSpeed config JSON file (e.g., `assets/ds_config_lora.json`), saved to disk in the working directory
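
The configs above hard-code `train_batch_size`, which DeepSpeed requires to equal micro-batch-per-GPU x gradient accumulation steps x number of GPUs. A small check like the following can catch a mismatch before launch. The helper name `micro_batch_per_gpu` is hypothetical, and the sketch assumes a pure data-parallel run where the world size equals the GPU count.

```python
def micro_batch_per_gpu(config: dict, world_size: int) -> int:
    """Derive the per-GPU micro batch implied by a DeepSpeed config.

    Uses train_batch_size = micro_batch * gradient_accumulation_steps * world_size
    and raises ValueError if no integer micro batch satisfies it.
    """
    total = config["train_batch_size"]
    accum = config.get("gradient_accumulation_steps", 1)
    divisor = accum * world_size
    if total % divisor != 0:
        raise ValueError(
            f"train_batch_size={total} is not divisible by "
            f"gradient_accumulation_steps * world_size = {divisor}"
        )
    return total // divisor


lora_cfg = {"train_batch_size": 4, "gradient_accumulation_steps": 4}
print(micro_batch_per_gpu(lora_cfg, world_size=1))
```

Under this relation the LoRA config shown earlier is consistent on one GPU but not on four, so `train_batch_size` needs adjusting when the GPU count changes.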

Load the dataset and check its format. Supported formats: Hugging Face dataset ID, JSONL, CSV, Parquet, or Arrow.

JSONL example (minimal: a `text` column is required):

```json
{"text": "Your training text here. Can be single or multiple columns."}
{"text": "Another sample."}
```

CSV example (columns: text, instruction, output, etc.):

```
text,label
"sample text 1","category1"
"sample text 2","category2"
```

Hugging Face dataset example:

```python
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca")  # or a local path
print(dataset.column_names)  # available columns, keyed by split
```

Validate dataset:

- **Input**: dataset path or Hugging Face ID, dataset format
- **Output**: loaded dataset with schema confirmed; sample row printed; train/eval split assigned
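
A minimal sketch of the JSONL check, using only the standard library. It assumes the one requirement stated above (every row has a `text` key); the function name `validate_jsonl` is hypothetical and not part of this skill's scripts.

```python
import json


def validate_jsonl(lines, required=("text",)):
    """Count rows that parse as JSON objects and contain every required key."""
    count = 0
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)  # raises json.JSONDecodeError on malformed rows
        if not isinstance(row, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = [key for key in required if key not in row]
        if missing:
            raise ValueError(f"line {lineno}: missing keys {missing}")
        count += 1
    return count


sample = ['{"text": "Your training text here."}', '{"text": "Another sample."}']
print(validate_jsonl(sample))
```

For a file on disk, pass the open file handle as `lines`; iterating a text file yields one row per line.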

Use the `deepspeed` launcher (or `torchrun`) for multi-GPU training; never launch a multi-GPU job with plain `python`. On a single GPU the launcher is optional.

Single GPU:

```bash
python scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_path data/my_dataset \
  --output_dir ./outputs \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --num_train_epochs 3 \
  --learning_rate 2e-5 \
  --use_peft True \
  --lora_r 16 \
  --lora_alpha 32 \
  --gradient_checkpointing True \
  --bf16 True \
  --logging_steps 10 \
  --save_steps 500 \
  --eval_strategy steps \
  --eval_steps 100 \
  --early_stopping_patience 5
```

Multi-GPU, single node (4 GPUs):

```bash
deepspeed --num_gpus=4 scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_path data/my_dataset \
  --output_dir ./outputs \
  --deepspeed assets/ds_config_zero2.json \
  --per_device_train_batch_size 4 \
  --num_train_epochs 3 \
  --bf16 True \
  --gradient_checkpointing True
```

Multi-node (2 nodes, 4 GPUs each):

```bash
# Step 1: create hosts.txt on node 0, one line per node:
#   node0_ip slots=4
#   node1_ip slots=4
# Step 2: launch from node 0
deepspeed --hostfile hosts.txt scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --deepspeed assets/ds_config_zero3.json \
  --per_device_train_batch_size 4 \
  --num_train_epochs 3
```

- **Input**: DeepSpeed config, model name, dataset path, training hyperparameters, output dir
- **Output**: training process started; logs streamed to stdout; checkpoints saved to `output_dir`

During training, watch for:

Good signs:

Warning signs (stop and debug if seen):

Monitoring commands:

```bash
# Watch loss in real time
tail -f outputs/training_logs.txt | grep loss
# Check GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
# Check training throughput
grep "samples/sec" outputs/training_logs.txt | tail -1
```

- **Input**: training process running, output logs available
- **Output**: visual confirmation that training is progressing normally; early detection of OOM or divergence

If you specified an eval dataset and `early_stopping_patience > 0`, training will automatically:

Parameters:
- `--eval_strategy steps` or `epoch`: frequency of eval
- `--eval_steps 100`: evaluate every N steps (if `eval_strategy=steps`)
- `--early_stopping_patience 5`: stop if no improvement for 5 consecutive evals (0 = disabled)
- `--early_stopping_threshold 0.001`: minimum improvement (in absolute loss) to count as an improvement

Example:

```bash
python scripts/ds_train.py \
  --model_name_or_path Qwen/Qwen2.5-7B \
  --dataset_path tatsu-lab/alpaca \
  --eval_dataset_path tatsu-lab/alpaca_eval \
  --use_peft True \
  --eval_strategy steps \
  --eval_steps 100 \
  --early_stopping_patience 5 \
  --early_stopping_threshold 0.001 \
  ...
```

Auto-configuration (when `early_stopping_patience > 0`):
- `load_best_model_at_end=True`
- `metric_for_best_model=eval_loss`, `greater_is_better=False`
- `save_strategy` aligned with `eval_strategy`

Notes:

- **Input**: eval dataset, early stopping hyperparameters, `eval_strategy` set
- **Output**: eval loss printed every `eval_steps`; best model checkpoint saved; training stops early if there is no improvement

After training completes, validate the saved model:

```bash
# Check the final checkpoint
ls -lah outputs/checkpoint-final/
# Verify model weights exist
ls outputs/checkpoint-final/pytorch_model.bin  # or model.safetensors
# For LoRA: check adapter weights
ls outputs/checkpoint-final/adapter_config.json
ls outputs/checkpoint-final/adapter_model.bin  # or adapter_model.safetensors
```

Merge LoRA weights into the base model (optional, for inference deployment):

```python
from peft import AutoPeftModelForCausalLM

# Load the LoRA adapter together with its base model
model = AutoPeftModelForCausalLM.from_pretrained(
    "outputs/checkpoint-final",
    device_map="auto",
)

# Merge the adapter into the base weights and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained("outputs/merged_model")
merged_model.push_to_hub("your-username/my-finetuned-model")  # optional
```

- **Input**: training `output_dir` with saved checkpoints
- **Output**: final model weights validated and accessible for inference; optionally merged (LoRA only) and pushed to the Hugging Face Hub

When to use LoRA vs QLoRA vs full fine-tuning:

When to enable gradient checkpointing:

When to use ZeRO-1 vs ZeRO-2 vs ZeRO-3:

When to offload to CPU vs NVMe:

When to use remote training:

When to run validation/early stopping:

Final trained model saved at: `{output_dir}/checkpoint-final/` or the latest checkpoint (e.g., `checkpoint-500/`)

Model structure:
- `pytorch_model.bin` or `model.safetensors`: base model weights (full FT) or adapter weights (LoRA)
- `config.json`: model architecture config
- `tokenizer.json` or `tokenizer_config.json`: tokenizer config
- `adapter_config.json` + `adapter_model.bin`: LoRA config and weights (