scvi-tools

Deep learning for single-cell analysis using scvi-tools. This skill should be used when users need (1) data integration and batch correction with scVI/scANVI, (2) ATAC-seq analysis with PeakVI, (3) CITE-seq multi-modal analysis with totalVI, (4) multiome RNA+ATAC analysis with MultiVI, (5) spatial transcriptomics deconvolution with DestVI, (6) label transfer and reference mapping with scANVI/scArches, (7) RNA velocity with veloVI, or (8) any deep learning-based single-cell method. Triggers include mentions of scVI, scANVI, totalVI, PeakVI, MultiVI, DestVI, veloVI, sysVI, scArches, variational autoencoder, VAE, batch correction, data integration, multi-modal, CITE-seq, multiome, reference mapping, latent space.

SkillRank score: 7.3 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

This skill covers deep learning workflows for single-cell genomics, including batch correction, multi-modal integration, reference mapping, and spatial analysis. It provides model selection guidance, modular scripts, and decision trees for workflow routing.

structure: 9.0
trigger phrases: 9.0
procedure: 7.0
edge cases: 5.0
documentation: 7.0

original SKILL.md (from smithery)

---
name: scvi-tools
description: Deep learning for single-cell analysis using scvi-tools. This skill should be used when users need (1) data integration and batch correction with scVI/scANVI, (2) ATAC-seq analysis with PeakVI, (3) CITE-seq multi-modal analysis with totalVI, (4) multiome RNA+ATAC analysis with MultiVI, (5) spatial transcriptomics deconvolution with DestVI, (6) label transfer and reference mapping with scANVI/scArches, (7) RNA velocity with veloVI, or (8) any deep learning-based single-cell method. Triggers include mentions of scVI, scANVI, totalVI, PeakVI, MultiVI, DestVI, veloVI, sysVI, scArches, variational autoencoder, VAE, batch correction, data integration, multi-modal, CITE-seq, multiome, reference mapping, latent space.
---

# scvi-tools Deep Learning Skill

This skill provides guidance for deep learning-based single-cell analysis using scvi-tools, a PyTorch-based framework for probabilistic models in single-cell genomics.

## How to Use This Skill

1. Identify the appropriate workflow from the model/workflow tables below
2. Read the corresponding reference file for detailed steps and code
3. Use scripts in `scripts/` to avoid rewriting common code
4. For installation or GPU issues, consult `references/environment_setup.md`
5. For debugging, consult `references/troubleshooting.md`

## When to Use This Skill

- When scvi-tools, scVI, scANVI, or related models are mentioned
- When deep learning-based batch correction or integration is needed
- When working with multi-modal data (CITE-seq, multiome)
- When reference mapping or label transfer is required
- When analyzing ATAC-seq or spatial transcriptomics data
- When learning latent representations of single-cell data

## Model Selection Guide

| Data Type | Model | Primary Use Case |
|-----------|-------|------------------|
| scRNA-seq | **scVI** | Unsupervised integration, DE, imputation |
| scRNA-seq + labels | **scANVI** | Label transfer, semi-supervised integration |
| CITE-seq (RNA+protein) | **totalVI** | Multi-modal integration, protein denoising |
| scATAC-seq | **PeakVI** | Chromatin accessibility analysis |
| Multiome (RNA+ATAC) | **MultiVI** | Joint modality analysis |
| Spatial + scRNA reference | **DestVI** | Cell type deconvolution |
| RNA velocity | **veloVI** | Transcriptional dynamics |
| Cross-technology | **sysVI** | System-level batch correction |

## Workflow Reference Files

| Workflow | Reference File | Description |
|----------|---------------|-------------|
| Environment Setup | `references/environment_setup.md` | Installation, GPU, version info |
| Data Preparation | `references/data_preparation.md` | Formatting data for any model |
| scRNA Integration | `references/scrna_integration.md` | scVI/scANVI batch correction |
| ATAC-seq Analysis | `references/atac_peakvi.md` | PeakVI for accessibility |
| CITE-seq Analysis | `references/citeseq_totalvi.md` | totalVI for protein+RNA |
| Multiome Analysis | `references/multiome_multivi.md` | MultiVI for RNA+ATAC |
| Spatial Deconvolution | `references/spatial_deconvolution.md` | DestVI spatial analysis |
| Label Transfer | `references/label_transfer.md` | scANVI reference mapping |
| scArches Mapping | `references/scarches_mapping.md` | Query-to-reference mapping |
| Batch Correction | `references/batch_correction_sysvi.md` | Advanced batch methods |
| RNA Velocity | `references/rna_velocity_velovi.md` | veloVI dynamics |
| Troubleshooting | `references/troubleshooting.md` | Common issues and solutions |

## CLI Scripts

Modular scripts for common workflows. Chain together or modify as needed.

### Pipeline Scripts

| Script | Purpose | Usage |
|--------|---------|-------|
| `prepare_data.py` | QC, filter, HVG selection | `python scripts/prepare_data.py raw.h5ad prepared.h5ad --batch-key batch` |
| `train_model.py` | Train any scvi-tools model | `python scripts/train_model.py prepared.h5ad results/ --model scvi` |
| `cluster_embed.py` | Neighbors, UMAP, Leiden | `python scripts/cluster_embed.py adata.h5ad results/` |
| `differential_expression.py` | DE analysis | `python scripts/differential_expression.py model/ adata.h5ad de.csv --groupby leiden` |
| `transfer_labels.py` | Label transfer with scANVI | `python scripts/transfer_labels.py ref_model/ query.h5ad results/` |
| `integrate_datasets.py` | Multi-dataset integration | `python scripts/integrate_datasets.py results/ data1.h5ad data2.h5ad` |
| `validate_adata.py` | Check data compatibility | `python scripts/validate_adata.py data.h5ad --batch-key batch` |

### Example Workflow

```bash
# 1. Validate input data
python scripts/validate_adata.py raw.h5ad --batch-key batch --suggest

# 2. Prepare data (QC, HVG selection)
python scripts/prepare_data.py raw.h5ad prepared.h5ad --batch-key batch --n-hvgs 2000

# 3. Train model
python scripts/train_model.py prepared.h5ad results/ --model scvi --batch-key batch

# 4. Cluster and visualize
python scripts/cluster_embed.py results/adata_trained.h5ad results/ --resolution 0.8

# 5. Differential expression
python scripts/differential_expression.py results/model results/adata_clustered.h5ad results/de.csv --groupby leiden
```

### Python Utilities

The `scripts/model_utils.py` module provides importable functions for custom workflows:

| Function | Purpose |
|----------|---------|
| `prepare_adata()` | Data preparation (QC, HVG, layer setup) |
| `train_scvi()` | Train scVI or scANVI |
| `evaluate_integration()` | Compute integration metrics |
| `get_marker_genes()` | Extract DE markers |
| `save_results()` | Save model, data, plots |
| `auto_select_model()` | Suggest best model |
| `quick_clustering()` | Neighbors + UMAP + Leiden |

## Critical Requirements

1. **Raw counts required**: scvi-tools models require integer count data
   ```python
   import scvi
   # Copy raw integer counts into a layer before any normalization touches adata.X
   adata.layers["counts"] = adata.X.copy()
   scvi.model.SCVI.setup_anndata(adata, layer="counts")
   ```

2. **HVG selection**: Use 2000-4000 highly variable genes
   ```python
   import scanpy as sc
   # flavor="seurat_v3" expects raw counts, so point it at the counts layer
   sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="batch", layer="counts", flavor="seurat_v3")
   adata = adata[:, adata.var["highly_variable"]].copy()
   ```

3. **Batch information**: Specify batch_key for integration
   ```python
   scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
   ```

## Quick Decision Tree

```
Need to integrate scRNA-seq data?
├── Have cell type labels? → scANVI (references/label_transfer.md)
└── No labels? → scVI (references/scrna_integration.md)

Have multi-modal data?
├── CITE-seq (RNA + protein)? → totalVI (references/citeseq_totalvi.md)
├── Multiome (RNA + ATAC)? → MultiVI (references/multiome_multivi.md)
└── scATAC-seq only? → PeakVI (references/atac_peakvi.md)

Have spatial data?
└── Need cell type deconvolution? → DestVI (references/spatial_deconvolution.md)

Have pre-trained reference model?
└── Map query to reference? → scArches (references/scarches_mapping.md)

Need RNA velocity?
└── veloVI (references/rna_velocity_velovi.md)

Strong cross-technology batch effects?
└── sysVI (references/batch_correction_sysvi.md)
```
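
As a rough illustration, the decision tree above can be written as a small lookup function. This is only a sketch: `suggest_model`, its argument names, and the modality strings are hypothetical, and it is not the skill's `auto_select_model()` helper, whose signature is not shown here. Checking for a pre-trained reference first is a design choice of the sketch, not something the tree prescribes.

```python
def suggest_model(modality, has_labels=False, map_to_pretrained_reference=False,
                  cross_technology=False):
    """Map a coarse description of the data to a model name from the tree above.

    All argument names and modality strings are illustrative assumptions.
    """
    if map_to_pretrained_reference:
        return "scArches"
    by_modality = {
        "cite-seq": "totalVI",
        "multiome": "MultiVI",
        "scatac-seq": "PeakVI",
        "spatial": "DestVI",
        "velocity": "veloVI",
    }
    key = modality.lower()
    if key in by_modality:
        return by_modality[key]
    if key == "scrna-seq":
        if cross_technology:
            return "sysVI"
        return "scANVI" if has_labels else "scVI"
    raise ValueError(f"unrecognized modality: {modality!r}")


print(suggest_model("scRNA-seq", has_labels=True))  # scANVI
print(suggest_model("CITE-seq"))                     # totalVI
```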

## Key Resources

- [scvi-tools Documentation](https://docs.scvi-tools.org/)
- [scvi-tools Tutorials](https://docs.scvi-tools.org/en/stable/tutorials/index.html)
- [Model Hub](https://huggingface.co/scvi-tools)
- [GitHub Issues](https://github.com/scverse/scvi-tools/issues)

don't have the plugin yet? install it then click "run inline in claude" again.

expanded original skill to include explicit intent, detailed inputs with environment and data requirements, numbered procedure with edge cases and IO per step, six decision points (workflow selection, label transfer, hardware/environment, data prep), formal output contract with file locations and formats, and outcome signal criteria for success validation.

scvi-tools Deep Learning Skill

Item: scvi-tools
Rating: 7.3
Author: Implexa

intent

Use this skill when working with single-cell RNA-seq, ATAC-seq, CITE-seq, multiome, or spatial transcriptomics data and you need probabilistic deep learning models for batch correction, data integration, multi-modal analysis, label transfer, or RNA velocity. scvi-tools is the framework to reach for when variational autoencoders (VAEs) or related methods are mentioned, when integrating data across multiple technologies or batches, when mapping query samples to reference models, or when extracting latent representations for downstream analysis.

inputs

Python environment:

Python 3.8+
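scvi-tools 0.20.0+ (install via pip install scvi-tools or conda install -c conda-forge scvi-tools); newer models such as sysVI are only available in later releases, so check references/environment_setup.md for version details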
scanpy, anndata, numpy, scipy, pandas for data handling
Optional: GPU support (CUDA 11.8+, cuDNN) for faster training on large datasets

Data requirements:

AnnData object (.h5ad format) with raw integer count data stored in a dedicated layer (e.g., adata.layers["counts"])
Typically 2000-4000 highly variable genes selected before model setup
Batch/condition metadata as a column in adata.obs (e.g., batch_key="batch")
For label transfer workflows: pre-trained reference model (file path or HuggingFace model ID)
For spatial analysis: spatial coordinates and optionally a reference single-cell dataset

External resources (optional):

HuggingFace Model Hub access for downloading pre-trained models (https://huggingface.co/scvi-tools)
scvi-tools documentation (https://docs.scvi-tools.org/)

Model selection context:

Data modality (scRNA-seq, scATAC-seq, CITE-seq, multiome, spatial)
Availability of cell type labels for semi-supervised learning
Whether cross-technology batch correction is needed
Presence of pre-trained reference models for query mapping

procedure

Validate and prepare input data
- Input: raw .h5ad file or AnnData object
- Run QC: filter cells (min genes, max genes, max mitochondrial %, min counts), filter genes (min cells)
- Output: QC metrics logged, low-quality cells/genes removed
- Edge case: if data is already normalized or log-transformed, flag for warning and preserve raw counts in separate layer
Select highly variable genes (HVGs)
- Input: filtered AnnData, batch_key if multi-batch
- Run sc.pp.highly_variable_genes() with flavor="seurat_v3", n_top_genes=2000 (adjust 2000-4000 based on dataset size and sparsity)
- Output: adata.var["highly_variable"] boolean column, filtered AnnData subset to HVGs
- Edge case: if batch_key is provided, pass it to sc.pp.highly_variable_genes() so that genes are ranked within each batch and then combined across batches, which keeps a single batch from dominating the selection
Prepare raw count layer
- Input: HVG-filtered AnnData
- Ensure adata.layers["counts"] contains original integer counts (not normalized, not log-transformed)
- Store normalized/log counts in adata.X if needed for other downstream tools
- Output: adata.layers["counts"] verified as raw counts, shape (n_obs, n_vars_hvg)
- Edge case: if counts layer missing, check if adata.X is raw; if not, fail and request raw data
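
One way to approach the "is adata.X raw?" check in the edge case above is a simple heuristic on the matrix values. The sketch below assumes NumPy is available; `looks_like_raw_counts` is a hypothetical helper, not part of scvi-tools or of the skill's scripts. It can be fooled by data that was rounded after normalization, so treat a `True` result as "plausibly raw", not as proof.

```python
import numpy as np


def looks_like_raw_counts(matrix, sample_size=100_000, seed=0):
    """Return True if (sampled) entries are non-negative whole numbers."""
    if hasattr(matrix, "tocsr"):
        # scipy.sparse input: only the stored (non-zero) values need checking
        values = matrix.tocsr().data
    else:
        values = np.asarray(matrix).ravel()
    if values.size > sample_size:
        # Subsample large matrices to keep the check cheap
        rng = np.random.default_rng(seed)
        values = rng.choice(values, size=sample_size, replace=False)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


counts = np.array([[0, 3, 1], [2, 0, 7]])
log_norm = np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)
print(looks_like_raw_counts(counts))    # True
print(looks_like_raw_counts(log_norm))  # False
```

With an AnnData object, the same function could be applied to `adata.X` or `adata.layers["counts"]`.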
Select appropriate scvi-tools model based on data type and task
- Input: data modality, presence of labels, batch structure, downstream goal
- Decision branch (see "decision points" section below)
- Output: model class selected (scVI, scANVI, totalVI, PeakVI, MultiVI, DestVI, veloVI, or sysVI)
Setup AnnData for the selected model
- Input: AnnData, model class, batch_key, optional labels_key for scANVI
- Call model.setup_anndata() with appropriate arguments (e.g., scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch"))
- Output: setup registered for the model class; scvi-tools tracks it through identifiers in adata.uns (e.g., adata.uns["_scvi_uuid"] in recent versions)
- Edge case: if setup_anndata() fails due to missing columns, provide clear error indicating which .obs or .var columns are required
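
The missing-column edge case above can be caught before setup_anndata() is called. The following is a minimal sketch; `missing_columns` and `check_setup_columns` are hypothetical helpers, and the column names are placeholders.

```python
def missing_columns(available, required):
    """Return the required column names that are absent from `available`, in order."""
    present = set(available)
    return [name for name in required if name is not None and name not in present]


def check_setup_columns(obs_columns, batch_key=None, labels_key=None):
    """Raise a descriptive error naming every missing .obs column."""
    missing = missing_columns(obs_columns, [batch_key, labels_key])
    if missing:
        raise KeyError(
            f"adata.obs is missing required column(s): {', '.join(missing)}"
        )


# Stand-in for list(adata.obs.columns)
obs_columns = ["batch", "n_genes"]
check_setup_columns(obs_columns, batch_key="batch")  # passes silently
print(missing_columns(obs_columns, ["batch", "cell_type"]))  # ['cell_type']
```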
Initialize and train the model
- Input: setup AnnData, model hyperparameters (n_hidden, n_latent, n_layers, dropout_rate), training settings (max_epochs, batch_size, learning rate)
- Create model instance: model = scvi.model.SCVI(adata, ...)
- Train: model.train(max_epochs=N, early_stopping=True, early_stopping_patience=20)
- Output: trained model object, training history (loss curves in model.history), saved model checkpoint (a directory written by model.save())
- Edge case: if GPU out-of-memory, reduce batch_size or n_latent; if training loss plateaus early, increase max_epochs or adjust the learning rate (e.g., plan_kwargs={"lr": ...}); if training diverges, reduce the learning rate by 10x
Extract and visualize latent representations
- Input: trained model, AnnData
- Get latent codes and store them so downstream steps can find them: adata.obsm["X_scvi"] = model.get_latent_representation()
- Compute neighbors, UMAP, and Leiden clustering: sc.pp.neighbors(adata, use_rep="X_scvi"), sc.tl.umap(adata), sc.tl.leiden(adata, resolution=0.8)
- Output: latent representation stored in adata.obsm["X_scvi"], UMAP in adata.obsm["X_umap"], cluster labels in adata.obs["leiden"]
- Edge case: if UMAP computation is slow on large datasets (>100k cells), subsample for visualization or use alternative (e.g., trimap, PCA)
Perform task-specific downstream analysis (differential expression, label transfer, deconvolution, etc.)
- Input varies by task; see task-specific reference files
- Output: task-dependent (DE results table, transferred labels, deconvolution weights, velocity vectors)
- Edge case: ensure model is kept in memory or reloaded from checkpoint before downstream steps
Validate integration quality and save results
- Input: integrated AnnData, model, task-specific outputs
- Compute integration metrics if multi-batch: kBET score, iLISI, cLISI, batch correction metrics
- Save: model (model.save(dirpath)), AnnData with results (adata.write_h5ad(path)), plots (UMAP, DE volcano, etc.)
- Output: validated results directory with model checkpoint, processed AnnData, metrics CSV, visualizations
- Edge case: if model directory already exists, move or version old checkpoint to avoid overwrite; if save fails due to disk space, provide size estimate
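
For the overwrite edge case above, one option is to pick a versioned directory name before saving. The sketch below uses only the standard library; `next_free_dir` and the `_v2`, `_v3` naming scheme are illustrative assumptions, not part of the skill's scripts.

```python
import tempfile
from pathlib import Path


def next_free_dir(base):
    """Return `base` if it does not exist, else the first unused base_v2, base_v3, ..."""
    base = Path(base)
    if not base.exists():
        return base
    version = 2
    while True:
        candidate = base.with_name(f"{base.name}_v{version}")
        if not candidate.exists():
            return candidate
        version += 1


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "scvi_checkpoint"
    print(next_free_dir(target).name)  # scvi_checkpoint
    target.mkdir()
    print(next_free_dir(target).name)  # scvi_checkpoint_v2
```

The returned path could then be passed to `model.save(...)`, which by default refuses to write into an existing directory unless `overwrite=True` is given.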

decision points

Workflow selection (data type + task):

If data is scRNA-seq only and you have cell type labels, use scANVI for semi-supervised integration and label transfer (reference: label_transfer.md)
Else if data is scRNA-seq only and you have no labels, use scVI for unsupervised batch correction and integration (reference: scrna_integration.md)
If data is CITE-seq (RNA + surface proteins), use totalVI for multi-modal integration (reference: citeseq_totalvi.md)
Else if data is multiome (RNA + ATAC from same cells), use MultiVI for joint modality learning (reference: multiome_multivi.md)
Else if data is scATAC-seq only, use PeakVI for chromatin accessibility analysis (reference: atac_peakvi.md)
Else if data is spatial transcriptomics with a single-cell reference, use DestVI for cell type deconvolution (reference: spatial_deconvolution.md)
Else if you need RNA velocity / transcriptional dynamics, use veloVI (reference: rna_velocity_velovi.md)
Else if you have strong cross-technology batch effects (e.g., 10x + Drop-seq + SMART-seq), use sysVI (reference: batch_correction_sysvi.md)

Label transfer decision:

If you have a pre-trained reference model and a new query dataset, use scArches for query-to-reference mapping (reference: scarches_mapping.md)
Else if you have scANVI reference and query is unlabeled, retrain scANVI in semi-supervised mode on combined ref+query data

Hardware / environment decision:

If training stalls or GPU memory errors occur, check references/environment_setup.md for CUDA/cuDNN versions, reduce batch_size, or switch to CPU training
If you encounter dependency conflicts, consult references/troubleshooting.md for version pinning and virtual environment setup

Data preparation edge case:

If raw count layer is missing or data appears pre-normalized, stop and request raw counts; do not attempt to reverse-engineer normalization

output contract

Model checkpoint:

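Directory written by model.save(), containing model.pt (plus adata.h5ad when save_anndata=True); legacy releases split the checkpoint into several files, including var_names.csv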
Location: user-specified (default: results/<model_name>_checkpoint/)

Processed AnnData:

File: adata_processed.h5ad
Contains: original raw counts in adata.layers["counts"], HVG-filtered genes, latent representation in adata.obsm["X_scvi"] (or model-specific key), UMAP/clustering in adata.obsm["X_umap"], adata.obs["leiden"]
Format: HDF5-backed AnnData, lossless

Integration metrics (if multi-batch):

File: integration_metrics.csv
Columns: kBET_score, iLISI, cLISI, silhouette_score, batch_purity (if applicable)
Single row or per-batch rows depending on metric

Task-specific outputs:

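Differential expression: CSV with columns [gene, log2fc, p_value, q_value]; note that scvi-tools' own differential_expression() reports Bayesian quantities (e.g., lfc_mean, proba_de, bayes_factor) rather than p-values, so these columns depend on how the script derives them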
Label transfer: AnnData with transferred labels in adata.obs["transferred_labels"], confidence scores in adata.obs["transfer_confidence"]
Spatial deconvolution: CSV with cell type proportions per spatial location
RNA velocity: vector field visualization (PDF/PNG) and velocity vectors in adata.obsm["velocity"]

Logs and diagnostics:

File: training.log (loss curves, convergence info)
File: validation_report.txt (data QC summary, model hyperparameters, warnings)

outcome signal

Training converged: ELBO loss plateaus and does not spike; training log shows stable loss for final 10+ epochs
Latent space is meaningful: UMAP visualization clusters cells by cell type (if labels known) and batches are mixed (if integration is successful)
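Integration successful (multi-batch): kBET p-value > 0.05 (fails to reject null of random batch distribution), iLISI > 1.5, cLISI close to 1.0 (its minimum, meaning cell type neighborhoods are pure). iLISI ranges from 1 to the number of batches, so the 1.5 threshold is only a rough guide.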
Downstream task passes sanity checks: DE markers align with known biology, transferred labels match expected cell types (if reference is reliable), spatial deconvolution proportions sum to ~1.0 per location
Model saves without error: model checkpoint directory exists with all required files; AnnData file is valid and readable via sc.read_h5ad()
No warnings or errors in logs: validation_report.txt contains no "ERROR" or "CRITICAL" lines; any "WARNING" lines are expected (e.g., "convergence stalled but acceptable")
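
The "training converged" signal above can be approximated numerically. The sketch below is one possible check, with `has_plateaued` as a hypothetical helper and the 10-epoch window and 1% tolerance as illustrative defaults that are not prescribed by the skill.

```python
def has_plateaued(losses, window=10, rel_tol=0.01):
    """True if the last `window` losses vary by less than rel_tol of their mean."""
    if len(losses) < window:
        return False
    tail = losses[-window:]
    mean = sum(tail) / window
    if mean == 0:
        return max(tail) == min(tail)
    return (max(tail) - min(tail)) / abs(mean) < rel_tol


# Synthetic loss curve that decays toward 500
losses = [1000 / (epoch + 1) + 500 for epoch in range(200)]
print(has_plateaued(losses))       # True
print(has_plateaued(losses[:15]))  # False
```

With a trained scvi-tools model, the input could come from the training history, for example `model.history["elbo_train"].iloc[:, 0].tolist()`, assuming that key is present for the model in use.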
