flash-moe-inference

Name: flash-moe-inference
Author: aradotso

flash-moe-inference — an installable skill for AI agents, published by aradotso/trending-skills.

SKILL.md

Flash-MoE Inference Engine

Skill by ara.so — Daily 2026 Skills collection.

Flash-MoE is a pure C/Objective-C/Metal inference engine that runs Qwen3.5-397B-A17B (397B parameter Mixture-of-Experts) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand — no Python, no ML frameworks, just C, Objective-C, and hand-tuned Metal shaders.
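To make the idea of on-demand expert streaming concrete, here is a minimal sketch in plain C. It is not Flash-MoE's actual loader: the function name `load_expert`, the fixed-size back-to-back expert layout, and the parameters are all assumptions made for illustration. It only shows the general technique of reading one expert's bytes from a file at a computed offset with `pread`, so that experts the router does not select never have to be resident in RAM.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical layout (an assumption, not the real Flash-MoE format):
 * experts are stored back to back in one file, each exactly
 * expert_bytes long, so expert k starts at byte k * expert_bytes.
 *
 * Reads expert `expert_id` into `dst`. Returns 0 on success,
 * -1 on I/O error or if the file ends before the expert is complete.
 */
static int load_expert(int fd, size_t expert_bytes, uint32_t expert_id, void *dst)
{
    off_t base = (off_t)expert_id * (off_t)expert_bytes;
    size_t done = 0;

    while (done < expert_bytes) {
        /* pread does not move the file offset, so several threads
         * could fetch different experts from the same descriptor. */
        ssize_t n = pread(fd, (char *)dst + done,
                          expert_bytes - done, base + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry */
            return -1;      /* real I/O error */
        }
        if (n == 0)
            return -1;      /* unexpected end of file */
        done += (size_t)n;  /* short read: keep going */
    }
    return 0;
}
```

In a sketch like this, the caller would open the weights file once with `open(path, O_RDONLY)`, then call `load_expert` only for the experts chosen for the current token, reusing a small pool of destination buffers.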

Requirements

Hardware: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free

OS: macOS 26+ (Darwin 25+)

Tools: Xcode Command Line Tools, Python 3.x (for weight extraction only)

Model: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace)

Installation & Build

# Clone the repo
git clone https://github.com/danveloper/flash-moe
cd flash-moe/metal_infer
