Expert‑Sliced GPU Scheduling for MoE

Nexus orchestrates Mixture‑of‑Experts execution with Triton routing kernels, CUDA Graph replay, and dynamic GPU slices mapped across CUDA streams — with built‑in energy telemetry.

CUDA 12.x · Triton 2.x · A100/H100 ready · Streams: 8 (configurable)
Install via pip and run the demo:
pip install nexus
python -m nexus.demo
Or build from source:
git clone https://github.com/Esmail-ibraheem/Nexus
cd Nexus
pip install -e .

What Nexus Provides

Triton Routing Kernel

Fuses softmax → top‑k → atomic expert counts in a single pass to reduce memory traffic and create expert token buckets.
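
For reference, the routing semantics look like the following unfused PyTorch sketch (the actual Triton kernel performs all three steps in one pass; the function below is illustrative, not part of the Nexus API):

import torch

def route_reference(logits: torch.Tensor, top_k: int = 2):
    # logits: [num_tokens, num_experts] raw router scores.
    probs = torch.softmax(logits, dim=-1)                    # gate probabilities
    weights, expert_ids = torch.topk(probs, top_k, dim=-1)   # k-of-N selection
    weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize kept gates
    # The fused kernel accumulates these counts with atomics while routing;
    # a bincount over the flattened ids yields the same bucket sizes.
    counts = torch.bincount(expert_ids.flatten(), minlength=logits.shape[-1])
    return expert_ids, weights, counts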

Expert Profiler

Tracks per‑expert usage with a rolling window to distinguish hot/warm/cold experts for better placement.
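
A minimal sketch of such a profiler, assuming a fixed-length window of per-step expert token counts and illustrative hot/cold thresholds (the actual classification policy may differ):

from collections import deque

class RollingExpertProfiler:
    # Sketch only: tiers are decided by each expert's share of recent traffic.
    def __init__(self, num_experts, window=64, hot_frac=0.2, cold_frac=0.02):
        self.num_experts = num_experts
        self.window = deque(maxlen=window)   # keeps only the last `window` steps
        self.hot_frac, self.cold_frac = hot_frac, cold_frac

    def record(self, counts):
        # counts: per-expert token counts for one routing step
        self.window.append(list(counts))

    def classify(self):
        totals = [sum(step[e] for step in self.window) for e in range(self.num_experts)]
        grand = max(sum(totals), 1)
        return {
            e: "hot" if t / grand >= self.hot_frac
               else "cold" if t / grand <= self.cold_frac
               else "warm"
            for e, t in enumerate(totals)
        }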

GPU Slice Manager

Dynamically assigns slices of SM, memory-bandwidth, and cache budgets; integrates with MIG partitions when available.
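
As an illustration, a proportional SM-budget split could look like the sketch below (the 108-SM figure matches an A100; the allocation rule and floor are assumptions, not the Nexus policy):

def allocate_sm_slices(load_per_expert, total_sms=108, min_sms=4):
    # Give each expert SMs proportional to its recent token load, with a
    # small floor so cold experts are queued/time-sliced rather than starved.
    total = max(sum(load_per_expert.values()), 1)
    plan, spent = {}, 0
    for eid, load in sorted(load_per_expert.items(), key=lambda kv: -kv[1]):
        want = max(min_sms, round(total_sms * load / total))
        plan[eid] = min(want, total_sms - spent)   # never oversubscribe
        spent += plan[eid]
    return plan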

CUDA Graph Manager

Captures warm‑started execution patterns and replays them to minimize kernel launch overhead.
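
The warm-up → capture → replay sequence follows the standard torch.cuda.CUDAGraph pattern; the helper below is a hand-rolled approximation of what the manager automates, not its actual interface:

import torch

def capture_expert_graph(expert, static_input, warmup_iters=3):
    # CUDA Graph capture requires stable shapes and fixed (static) tensors
    # that are re-filled in place before each replay.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):            # warm up off the main stream
        for _ in range(warmup_iters):
            expert(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):            # record a single forward pass
        static_output = expert(static_input)
    return graph, static_output              # later: refill input, graph.replay()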

Stream Scheduler

Maps experts to CUDA streams (N configurable) for parallel execution across slices.
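
In simplified form, this is a round-robin assignment over torch.cuda.Stream objects (the expert modules and pre-bucketed token batches below are hypothetical stand-ins):

import torch

def run_on_streams(experts, buckets, num_streams=8):
    # experts: {expert_id: module}; buckets: {expert_id: token batch}
    streams = [torch.cuda.Stream() for _ in range(num_streams)]
    current = torch.cuda.current_stream()
    for s in streams:
        s.wait_stream(current)               # make routed inputs visible
    outputs = {}
    for i, (eid, tokens) in enumerate(buckets.items()):
        with torch.cuda.stream(streams[i % num_streams]):
            outputs[eid] = experts[eid](tokens)   # kernels overlap across streams
    torch.cuda.synchronize()                 # join before weighted aggregation
    return outputs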

Energy Monitor

NVML‑based power telemetry; reports tokens/J and feeds the profiler for energy‑aware scheduling.
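
A minimal tokens/J sampler over the NVML bindings (pynvml, published as nvidia-ml-py) looks like this; integrating instantaneous power over the sampling interval approximates the energy figure the monitor reports:

import time
import pynvml

class EnergyMonitor:
    def __init__(self, device_index=0):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        self.joules, self.tokens = 0.0, 0
        self.last = time.monotonic()

    def update(self, tokens_processed):
        now = time.monotonic()
        watts = pynvml.nvmlDeviceGetPowerUsage(self.handle) / 1000.0  # mW -> W
        self.joules += watts * (now - self.last)    # energy += P * dt
        self.tokens += tokens_processed
        self.last = now

    def tokens_per_joule(self):
        return self.tokens / self.joules if self.joules else 0.0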

Architecture Overview

The data plane and control plane are separated, so slices can be resized or migrated at runtime.

  • Data plane: Token ingress → Triton Routing → Expert kernels → Weighted aggregation → MoE output
  • Control plane: Expert Profiler, GPU Slice Manager, CUDA Graph Manager, Stream Scheduler
  • Execution plane: Dynamic GPU slices (Hot/Warm/Cold, Aux, Router‑Backward, optional MIG)
[Architecture diagram: token ingress and gating feed the Triton routing kernel (softmax → top-k → atomic expert counts in a single pass, no intermediate tensors, 3× lower memory traffic), whose buckets are dispatched onto dynamic GPU slices: Slice A (hot experts, larger SM share, fused Triton expert MLP), Slice B (warm experts, medium SM share, batched expert kernel), Slice C (cold experts, time-sliced/queued), Slice D (aux: aggregation/residuals), Slice E (router backward, training only), and Slice F (optional MIG instance on A100/H100). Expert outputs are aggregated (weighted sum/concat) into the MoE layer output. The Expert Profiler, GPU Slice Manager, CUDA Graph Manager, and Stream Scheduler form the control/feedback loop, and the NVML Energy Monitor feeds power and tokens/J telemetry back to the profiler.]

Quickstart

Install (Dev)

git clone https://github.com/Esmail-ibraheem/Nexus
cd Nexus
pip install -e .
python -m nexus.demo

Minimal Usage (Pseudo)

import torch
from nexus import Router, SliceManager, schedule

tokens = torch.randn(4096, 1024, device="cuda")  # token activations (placeholder shapes)
logits = torch.randn(4096, 64, device="cuda")    # router scores for 64 experts (placeholder)

router = Router(top_k=2)                 # k-of-N gating
buckets = router.route(tokens, logits)   # fused routing -> per-expert token buckets
plan = SliceManager().allocate(buckets)  # map buckets onto GPU slices
schedule(plan).run()                     # execute across CUDA streams

Citation

@article{nexus2025,
  title   = {Nexus: Expert-Sliced GPU Scheduling for Mixture-of-Experts},
  author  = {Gumaan, Esmail},
  journal = {arXiv preprint arXiv:xxxx.xxxxx},
  year    = {2025}
}