Lossless DFlash speculative decoding for MLX on Apple Silicon — 1.7–4x faster LLM inference using block diffusion drafting with target model verification.
dflash-mlx Speculative Decoding

Skill by ara.so, Daily 2026 Skills collection.

DFlash implements lossless speculative decoding for MLX on Apple Silicon. A small draft model (~1B params) generates 16 tokens in parallel using block diffusion; the target model verifies all 16 in a single forward pass. Tokens are emitted only after target verification, so the output is lossless: every token is the target model's greedy argmax.

Typical speedups are 1.7x–4.1x over baseline mlx_lm, depending on model size and context length. Acceptance rates hover around 87–90% for Qwen3.5 models.

Installation

    pip install dflash-mlx
    # or isolated install
    pipx install dflash-mlx

Requires Python 3.10+, MLX 0.31.1+, and an Apple Silicon Mac.
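The draft-then-verify loop described above can be illustrated with a minimal pure-Python sketch. This is not the dflash-mlx API: `verify_block`, `speculative_generate`, `toy_target`, and `toy_draft` are hypothetical names invented for illustration, the "models" are toy functions over integers, and the target is queried once per position for readability, whereas a real implementation would score the whole block in one forward pass. The sketch only shows why greedy verification keeps the output identical to plain greedy decoding with the target.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-ins for real models (assumptions, not dflash-mlx types):
NextToken = Callable[[List[Token]], Token]              # target's greedy next token
DraftBlock = Callable[[List[Token], int], List[Token]]  # draft proposes k tokens


def verify_block(prefix: List[Token], draft: List[Token],
                 target_next: NextToken) -> List[Token]:
    """Return the tokens to emit for one drafted block.

    Every emitted token is the target's own greedy choice. The block ends at
    the first position where the draft disagrees with the target.
    """
    if not draft:
        # No proposal: fall back to a single ordinary target step.
        return [target_next(prefix)]
    emitted: List[Token] = []
    for proposed in draft:
        expected = target_next(prefix + emitted)
        emitted.append(expected)  # always the target's token, never the draft's
        if proposed != expected:
            break                 # the rest of the draft is discarded
    return emitted


def speculative_generate(prompt: List[Token], draft_block: DraftBlock,
                         target_next: NextToken, max_new_tokens: int,
                         block_size: int = 16) -> List[Token]:
    """Generate tokens by repeatedly drafting a block and verifying it."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)
        out.extend(verify_block(out, draft, target_next))
    return out[: len(prompt) + max_new_tokens]


def greedy_generate(prompt: List[Token], target_next: NextToken,
                    max_new_tokens: int) -> List[Token]:
    """Baseline: one target step per token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def toy_target(prefix: List[Token]) -> Token:
    """Toy 'target model': a deterministic function of the last token."""
    return (prefix[-1] * 3 + 1) % 7


def toy_draft(prefix: List[Token], k: int) -> List[Token]:
    """Toy 'draft model': matches the target except that it never predicts 5."""
    ctx, block = list(prefix), []
    for _ in range(k):
        tok = toy_target(ctx)
        if tok == 5:
            tok = 0  # deliberate draft error, so some blocks get cut short
        block.append(tok)
        ctx.append(tok)
    return block


if __name__ == "__main__":
    spec = speculative_generate([1], toy_draft, toy_target, max_new_tokens=20)
    base = greedy_generate([1], toy_target, max_new_tokens=20)
    print(spec == base)
```

Because a drafted token is kept only when it equals the target's greedy choice, and the target's own token is substituted at the first disagreement, a weaker draft model can only shorten the accepted prefix per block. It cannot change what is emitted; it only affects how much work each verification step saves.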