Expert‑Sliced GPU Scheduling for MoE
Nexus orchestrates Mixture-of-Experts (MoE) execution with Triton routing kernels, CUDA Graph replay, and dynamic GPU slices mapped across CUDA streams, with built-in energy telemetry.
pip install nexus
python -m nexus.demo
What Nexus Provides
Triton Routing Kernel
Fuses softmax → top‑k → atomic expert counts in a single pass to reduce memory traffic and create expert token buckets.
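For reference, the unfused computation looks roughly like the following in plain PyTorch; the function name and shapes are illustrative, not the kernel's actual interface.

import torch

def route_reference(logits: torch.Tensor, top_k: int = 2, num_experts: int = 8):
    # Unfused reference: softmax over router logits, then top-k, then counts.
    probs = torch.softmax(logits, dim=-1)                  # [tokens, experts]
    gates, expert_ids = torch.topk(probs, top_k, dim=-1)   # [tokens, top_k]
    # Per-expert token counts (what the fused kernel accumulates atomically).
    counts = torch.bincount(expert_ids.flatten(), minlength=num_experts)
    return gates, expert_ids, counts

gates, expert_ids, counts = route_reference(torch.randn(16, 8))  # 16 tokens, 8 experts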
Expert Profiler
Tracks per‑expert usage with a rolling window to distinguish hot/warm/cold experts for better placement.
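A minimal sketch of the rolling-window bookkeeping, assuming hypothetical hot/cold share thresholds; the real profiler's interface and cutoffs may differ.

from collections import deque

class RollingExpertProfiler:
    # Track per-expert token counts over the most recent `window` routing steps.
    def __init__(self, num_experts, window=128):
        self.num_experts = num_experts
        self.history = deque(maxlen=window)    # one count vector per routing step

    def update(self, counts):                  # counts[e] = tokens routed to expert e
        self.history.append(list(counts))

    def classify(self, hot_share=0.2, cold_share=0.02):   # hypothetical thresholds
        totals = [sum(step[e] for step in self.history) for e in range(self.num_experts)]
        grand = max(sum(totals), 1)
        return ["hot" if t / grand >= hot_share
                else "cold" if t / grand <= cold_share
                else "warm"
                for t in totals]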
GPU Slice Manager
Dynamically assigns slices of the SM (streaming multiprocessor), memory-bandwidth, and cache budgets; integrates with MIG (Multi-Instance GPU) partitions when available.
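The slice abstraction might look roughly like this; the fields and the MIG binding shown here are illustrative assumptions, not Nexus's actual data model.

from dataclasses import dataclass
from typing import Optional

@dataclass
class GPUSlice:
    # Illustrative resource budget for one group of experts.
    name: str                        # e.g. "hot", "warm", "cold", "aux"
    sm_fraction: float               # share of streaming multiprocessors
    bw_fraction: float               # share of memory bandwidth
    cache_fraction: float            # share of cache budget
    mig_uuid: Optional[str] = None   # bind to a MIG instance when one exists

# Hypothetical split: hot experts get the largest budget.
slices = [
    GPUSlice("hot",  0.6, 0.6, 0.5),
    GPUSlice("warm", 0.3, 0.3, 0.3),
    GPUSlice("cold", 0.1, 0.1, 0.2),
]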
CUDA Graph Manager
Captures warm‑started execution patterns and replays them to minimize kernel launch overhead.
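Capture and replay follow the standard PyTorch CUDA Graphs pattern; a minimal sketch with a placeholder expert and static-shape buffers:

import torch

expert = torch.nn.Linear(1024, 1024).cuda()          # placeholder expert kernel
static_in = torch.zeros(256, 1024, device="cuda")    # static-shape input buffer

# Warm up on a side stream so lazy initialization does not end up in the graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        expert(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture once, then replay with near-zero launch overhead.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = expert(static_in)

static_in.copy_(torch.randn(256, 1024, device="cuda"))  # refill the input buffer
graph.replay()                                           # static_out now holds the new result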
Stream Scheduler
Maps experts to a configurable number of CUDA streams for parallel execution across slices.
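A sketch of fanning expert work out over CUDA streams in PyTorch, assuming a simple round-robin expert-to-stream mapping:

import torch

num_streams = 4                                # configurable stream count
streams = [torch.cuda.Stream() for _ in range(num_streams)]
experts = [torch.nn.Linear(1024, 1024).cuda() for _ in range(8)]
inputs = [torch.randn(64, 1024, device="cuda") for _ in range(8)]

outputs = [None] * len(experts)
for i, (expert, x) in enumerate(zip(experts, inputs)):
    stream = streams[i % num_streams]          # round-robin expert -> stream mapping
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        outputs[i] = expert(x)

# Join before consuming the outputs on the default stream.
for stream in streams:
    torch.cuda.current_stream().wait_stream(stream)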
Energy Monitor
NVML-based power telemetry; reports tokens per joule (tokens/J) and feeds the profiler for energy-aware scheduling.
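Power sampling via NVML is straightforward with pynvml; a rough tokens-per-joule estimate, assuming the caller knows how many tokens each step processes:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def tokens_per_joule(run_step, tokens_per_step, steps=50):
    # Average power over the run, then divide tokens by energy in joules.
    start = time.time()
    samples = []
    for _ in range(steps):
        run_step()
        samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
    elapsed = time.time() - start
    energy_j = (sum(samples) / len(samples)) * elapsed
    return (tokens_per_step * steps) / energy_j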
Architecture Overview
Data flow and control flow are separated; slices can be resized or migrated at runtime.
- Data plane: Token ingress → Triton Routing → Expert kernels → Weighted aggregation → MoE output (see the sketch after this list)
- Control plane: Expert Profiler, GPU Slice Manager, CUDA Graph Manager, Stream Scheduler
- Execution plane: Dynamic GPU slices (Hot/Warm/Cold, Aux, Router‑Backward, optional MIG)
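A minimal end-to-end sketch of the data plane in plain PyTorch (routing, per-expert compute, weighted aggregation); it mirrors the flow above without using any Nexus classes.

import torch

tokens = torch.randn(16, 1024)                       # token ingress
experts = torch.nn.ModuleList(torch.nn.Linear(1024, 1024) for _ in range(8))
router = torch.nn.Linear(1024, 8)

gates, expert_ids = torch.topk(torch.softmax(router(tokens), dim=-1), k=2)  # routing
output = torch.zeros_like(tokens)
for e, expert in enumerate(experts):
    tok_idx, slot = (expert_ids == e).nonzero(as_tuple=True)   # tokens routed to expert e
    if tok_idx.numel():
        # Weighted aggregation: scale each expert output by its gate and accumulate.
        weighted = gates[tok_idx, slot].unsqueeze(1) * expert(tokens[tok_idx])
        output.index_add_(0, tok_idx, weighted)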
Quickstart
Install (Dev)
git clone https://github.com/Esmail-ibraheem/Nexus
cd Nexus
pip install -e .
python -m nexus.demo
Minimal Usage (Pseudo)
from nexus import Router, SliceManager, schedule

# `tokens` and router `logits` are produced upstream by the model (not shown).
router = Router(top_k=2)
buckets = router.route(tokens, logits)     # bucket tokens per expert
plan = SliceManager().allocate(buckets)    # map buckets onto GPU slices
schedule(plan).run()                       # execute the plan
Citation
@article{nexus2025,
  title   = {Nexus: Expert-Sliced GPU Scheduling for Mixture-of-Experts},
  author  = {Gumaan, Esmail},
  journal = {arXiv preprint arXiv:xxxx.xxxxx},
  year    = {2025}
}