Expert‑Sliced GPU Scheduling for MoE

Nexus orchestrates Mixture‑of‑Experts execution with Triton routing kernels, CUDA Graph replay, and dynamic GPU slices mapped across CUDA streams — with built‑in energy telemetry.

CUDA 12.x · Triton 2.x · A100/H100 ready · Streams: 8 (configurable)
Install via pip and run the demo:
pip install nexus
python -m nexus.demo
Or build from source:
git clone https://github.com/Esmail-ibraheem/Nexus
cd Nexus
pip install -e .

What Nexus Provides

Triton Routing Kernel

Fuses softmax → top‑k → atomic expert counts in a single pass to reduce memory traffic and create expert token buckets.
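
For reference, the routing semantics look like the following unfused PyTorch sketch (the actual Triton kernel performs all three steps in one pass; the function below is illustrative, not part of the Nexus API):

import torch

def route_reference(logits: torch.Tensor, top_k: int = 2):
    # logits: [num_tokens, num_experts] raw router scores.
    probs = torch.softmax(logits, dim=-1)                    # gate probabilities
    weights, expert_ids = torch.topk(probs, top_k, dim=-1)   # k-of-N selection
    weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize kept gates
    # The fused kernel accumulates these counts with atomics while routing;
    # a bincount over the flattened ids yields the same bucket sizes.
    counts = torch.bincount(expert_ids.flatten(), minlength=logits.shape[-1])
    return expert_ids, weights, counts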

Expert Profiler

Tracks per‑expert usage with a rolling window to distinguish hot/warm/cold experts for better placement.
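
A minimal sketch of such a profiler, assuming a fixed-length window of per-step expert token counts and illustrative hot/cold thresholds (the actual classification policy may differ):

from collections import deque

class RollingExpertProfiler:
    # Sketch only: tiers are decided by each expert's share of recent traffic.
    def __init__(self, num_experts, window=64, hot_frac=0.2, cold_frac=0.02):
        self.num_experts = num_experts
        self.window = deque(maxlen=window)   # keeps only the last `window` steps
        self.hot_frac, self.cold_frac = hot_frac, cold_frac

    def record(self, counts):
        # counts: per-expert token counts for one routing step
        self.window.append(list(counts))

    def classify(self):
        totals = [sum(step[e] for step in self.window) for e in range(self.num_experts)]
        grand = max(sum(totals), 1)
        return {
            e: "hot" if t / grand >= self.hot_frac
               else "cold" if t / grand <= self.cold_frac
               else "warm"
            for e, t in enumerate(totals)
        }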

GPU Slice Manager

Dynamically assigns slices of SM, memory-bandwidth, and cache budgets; integrates with MIG partitions when available.
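
As an illustration, a proportional SM-budget split could look like the sketch below (the 108-SM figure matches an A100; the allocation rule and floor are assumptions, not the Nexus policy):

def allocate_sm_slices(load_per_expert, total_sms=108, min_sms=4):
    # Give each expert SMs proportional to its recent token load, with a
    # small floor so cold experts are queued/time-sliced rather than starved.
    total = max(sum(load_per_expert.values()), 1)
    plan, spent = {}, 0
    for eid, load in sorted(load_per_expert.items(), key=lambda kv: -kv[1]):
        want = max(min_sms, round(total_sms * load / total))
        plan[eid] = min(want, total_sms - spent)   # never oversubscribe
        spent += plan[eid]
    return plan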

CUDA Graph Manager

Captures warm‑started execution patterns and replays them to minimize kernel launch overhead.
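
The warm-up → capture → replay sequence follows the standard torch.cuda.CUDAGraph pattern; the helper below is a hand-rolled approximation of what the manager automates, not its actual interface:

import torch

def capture_expert_graph(expert, static_input, warmup_iters=3):
    # CUDA Graph capture requires stable shapes and fixed (static) tensors
    # that are re-filled in place before each replay.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):            # warm up off the main stream
        for _ in range(warmup_iters):
            expert(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):            # record a single forward pass
        static_output = expert(static_input)
    return graph, static_output              # later: refill input, graph.replay()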

Stream Scheduler

Maps experts to CUDA streams (N configurable) for parallel execution across slices.
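
In simplified form, this is a round-robin assignment over torch.cuda.Stream objects (the expert modules and pre-bucketed token batches below are hypothetical stand-ins):

import torch

def run_on_streams(experts, buckets, num_streams=8):
    # experts: {expert_id: module}; buckets: {expert_id: token batch}
    streams = [torch.cuda.Stream() for _ in range(num_streams)]
    current = torch.cuda.current_stream()
    for s in streams:
        s.wait_stream(current)               # make routed inputs visible
    outputs = {}
    for i, (eid, tokens) in enumerate(buckets.items()):
        with torch.cuda.stream(streams[i % num_streams]):
            outputs[eid] = experts[eid](tokens)   # kernels overlap across streams
    torch.cuda.synchronize()                 # join before weighted aggregation
    return outputs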

Energy Monitor

NVML‑based power telemetry; reports tokens/J and feeds the profiler for energy‑aware scheduling.
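
A minimal tokens/J sampler over the NVML bindings (pynvml, published as nvidia-ml-py) looks like this; integrating instantaneous power over the sampling interval approximates the energy figure the monitor reports:

import time
import pynvml

class EnergyMonitor:
    def __init__(self, device_index=0):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        self.joules, self.tokens = 0.0, 0
        self.last = time.monotonic()

    def update(self, tokens_processed):
        now = time.monotonic()
        watts = pynvml.nvmlDeviceGetPowerUsage(self.handle) / 1000.0  # mW -> W
        self.joules += watts * (now - self.last)    # energy += P * dt
        self.tokens += tokens_processed
        self.last = now

    def tokens_per_joule(self):
        return self.tokens / self.joules if self.joules else 0.0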

Architecture Overview

The data plane and control plane are separated, so slices can be resized or migrated at runtime.

  • Data plane: Token ingress → Triton Routing → Expert kernels → Weighted aggregation → MoE output
  • Control plane: Expert Profiler, GPU Slice Manager, CUDA Graph Manager, Stream Scheduler
  • Execution plane: Dynamic GPU slices (Hot/Warm/Cold, Aux, Router‑Backward, optional MIG)
[Architecture diagram: token ingress and gating feed the Triton routing kernel (softmax → top-k → atomic expert counts in a single pass, no intermediate tensors, 3× lower memory traffic), whose buckets are dispatched onto dynamic GPU slices: Slice A (hot experts, larger SM share, fused Triton expert MLP), Slice B (warm experts, medium SM share, batched expert kernel), Slice C (cold experts, time-sliced/queued), Slice D (aux: aggregation/residuals), Slice E (router backward, training only), and Slice F (optional MIG instance on A100/H100). Expert outputs are aggregated (weighted sum/concat) into the MoE layer output. The Expert Profiler, GPU Slice Manager, CUDA Graph Manager, and Stream Scheduler form the control/feedback loop, and the NVML Energy Monitor feeds power and tokens/J telemetry back to the profiler.]

Quickstart

Install (Dev)

git clone https://github.com/Esmail-ibraheem/Nexus
cd Nexus
pip install -e .
python -m nexus.demo

Minimal Usage (Pseudo)

import torch
from nexus import Router, SliceManager, schedule

tokens = torch.randn(4096, 1024, device="cuda")  # token activations (placeholder shapes)
logits = torch.randn(4096, 64, device="cuda")    # router scores for 64 experts (placeholder)

router = Router(top_k=2)                 # k-of-N gating
buckets = router.route(tokens, logits)   # fused routing -> per-expert token buckets
plan = SliceManager().allocate(buckets)  # map buckets onto GPU slices
schedule(plan).run()                     # execute across CUDA streams

Citation

@article{nexus2025,
  title   = {Nexus: Expert-Sliced GPU Scheduling for Mixture-of-Experts},
  author  = {Gumaan, Esmail},
  journal = {arXiv preprint arXiv:xxxx.xxxxx},
  year    = {2025}
}