Designed and developed a high-throughput sequential decision engine for a cloud computing platform, delivering real-time resource provisioning and scheduling decisions under high-frequency runtime workloads. This ongoing stealth pilot applies deep reinforcement learning at systems scale, bridging algorithm design, distributed GPU training, and safety-aware policy alignment.
Key Technologies: PyTorch · Branching Dueling Q-Network (BDQ) · HF Accelerate · DeepSpeed · Ray Train/Serve · DDP · Preference Alignment
System Overview & Sequential Decision Loop
The engine formulates provisioning and scheduling on the platform as a Markov Decision Process (MDP). A macro information encoder and a micro information encoder process heterogeneous platform observability signals in parallel: the former aggregates platform-level signals, the latter captures fast local observations. Their outputs are fused and passed to the Actor-Critic backbone. The primary head emits discrete control actions; a parallel preference alignment auxiliary network regularizes the shared policy representation for safety and preference constraints. Decisions run at millisecond-scale intervals, balancing throughput, tail latency, and resource utilization.
Decision Loop: Platform observability feeds macro and micro encoders in parallel; fused representations enter the BDQ backbone. The primary head emits control actions; a preference alignment auxiliary network runs in parallel on the shared representation; transitions land in a GPU-resident replay buffer for distributed off-policy updates.
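The parallel encode-then-fuse step can be sketched as follows. This is a minimal illustration, not the pilot's architecture: the class name `FusedStateEncoder`, the single linear layer per encoder, concatenation as the fusion operator, and all dimensions are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn


class FusedStateEncoder(nn.Module):
    """Hypothetical sketch: encode macro and micro observations separately, then fuse."""

    def __init__(self, macro_dim: int, micro_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Two independent branches; layer sizes are illustrative assumptions.
        self.macro_encoder = nn.Sequential(nn.Linear(macro_dim, hidden_dim), nn.ReLU())
        self.micro_encoder = nn.Sequential(nn.Linear(micro_dim, hidden_dim), nn.ReLU())
        # Fusion by concatenation followed by a projection (one of several options).
        self.fusion = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())

    def forward(self, macro_obs: torch.Tensor, micro_obs: torch.Tensor) -> torch.Tensor:
        z_macro = self.macro_encoder(macro_obs)  # platform-level signals
        z_micro = self.micro_encoder(micro_obs)  # fast local observations
        return self.fusion(torch.cat([z_macro, z_micro], dim=-1))


# Batch of 4 observations with made-up feature widths.
encoder = FusedStateEncoder(macro_dim=32, micro_dim=16)
fused = encoder(torch.randn(4, 32), torch.randn(4, 16))
```

The fused tensor is what a downstream backbone and both output heads would share in this sketch.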
Core Technical Contributions
1. Branching Dueling Q-Network (BDQ) Actor-Critic Engine
- Parallel State Encoding: A macro information encoder and a micro information encoder process heterogeneous platform observability signals in parallel, extracting platform-level macro signals and fast micro features before fusion into the BDQ backbone.
- Dual-Head Outputs: The Actor-Critic primary head emits discrete control actions; a preference alignment auxiliary network runs in parallel on the shared policy representation to enforce safety and preference constraints during training.
- Architecture: Implemented an Actor-Critic stack with Dueling Advantage decomposition to stabilize Q-value estimation when action spaces branch across concurrent control dimensions (e.g., capacity tiers, priority weights, policy levels).
- Branching Actions: Used a Branching Dueling Q-Network (BDQ) formulation so the agent factorizes multi-dimensional control signals instead of flattening them into an exponentially large discrete space—preserving sample efficiency under sparse, delayed rewards typical of platform sequential decision problems.
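A branching dueling head can be sketched along the lines of the published BDQ formulation: one shared state value plus one advantage stream per action dimension, with each branch's advantages mean-centered before being added to the value. The class name `BranchingDuelingHead`, the single linear layer per stream, and the branch sizes (5 capacity tiers, 4 priority weights, 3 policy levels) are illustrative assumptions, not the pilot's actual dimensions.

```python
import math

import torch
import torch.nn as nn


class BranchingDuelingHead(nn.Module):
    """Hypothetical sketch of a BDQ output head over a shared feature vector."""

    def __init__(self, feature_dim: int, branch_sizes):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)  # shared state value V(s)
        # One advantage stream per control dimension.
        self.advantages = nn.ModuleList(
            [nn.Linear(feature_dim, n) for n in branch_sizes]
        )

    def forward(self, features: torch.Tensor):
        v = self.value(features)  # (batch, 1)
        q_per_branch = []
        for adv_layer in self.advantages:
            a = adv_layer(features)  # (batch, n_d)
            # Dueling aggregation: Q_d = V + (A_d - mean over actions of A_d)
            q_per_branch.append(v + a - a.mean(dim=-1, keepdim=True))
        return q_per_branch


branch_sizes = (5, 4, 3)  # assumed sizes for three control dimensions
head = BranchingDuelingHead(feature_dim=128, branch_sizes=branch_sizes)
q_values = head(torch.randn(2, 128))

# Greedy joint action: an independent argmax per branch.
greedy_actions = torch.stack([q.argmax(dim=-1) for q in q_values], dim=-1)

branched_outputs = sum(branch_sizes)    # 12 outputs with branching
flat_outputs = math.prod(branch_sizes)  # 60 outputs if flattened
```

With these assumed sizes the head emits 12 Q-values instead of 60, and the gap widens multiplicatively as control dimensions are added.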
2. Distributed Training at Scale
- Multi-Node GPU Clusters: Scaled training across multi-node GPU clusters using Hugging Face Accelerate, Distributed Data Parallel (DDP), and DeepSpeed (ZeRO-2) to partition optimizer states and gradients across workers and sustain high throughput on large replay batches.
- Ray Train / Ray Serve: Orchestrated experiment sweeps and serving prototypes with Ray Train and Ray Serve, separating offline policy improvement from low-latency online inference paths during pilot evaluation.
Training stack (schematic): Ray Train schedules a multi-node job across GPU workers; within the same run, each worker executes PyTorch via Accelerate with data-parallel gradient synchronization and DeepSpeed ZeRO-2 sharding of optimizer states and gradients. Ray Serve (online inference) is a separate path and omitted here.
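For orientation, a ZeRO stage 2 setup is typically expressed as a DeepSpeed configuration along the following lines. The keys are standard DeepSpeed configuration fields, but every value here (batch size, precision, communication flags) is an illustrative assumption rather than the pilot's actual setting.

```python
import json

# Illustrative DeepSpeed configuration; all values are assumptions.
zero2_config = {
    "train_micro_batch_size_per_gpu": 256,  # assumed replay batch per GPU
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},              # assumed mixed-precision mode
    "zero_optimization": {
        "stage": 2,                          # shard optimizer states and gradients
        "overlap_comm": True,                # overlap gradient reduction with backward
        "contiguous_gradients": True,        # reduce memory fragmentation
    },
}

# Serialize to JSON, the format DeepSpeed configuration files use.
config_json = json.dumps(zero2_config, indent=2)
```

A file with this content can be supplied to the training launcher; the exact wiring depends on how Accelerate and Ray Train are configured in a given run.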
3. GPU-Resident Replay Buffer
- Zero-Copy Training Path: Designed a GPU-resident replay buffer so experience tuples remain device-local across sampling and gradient steps—eliminating CPU→GPU copy overheads that otherwise dominate short-horizon, high-frequency DRL workloads.
- Throughput Impact: Keeping transitions on-device improved effective training throughput during pilot benchmarks where decision cadence and batch sampling rates approached HPC-style duty cycles.
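A device-resident ring buffer of this kind can be sketched as below. The class name `DeviceReplayBuffer`, the field layout, and the uniform sampling are assumptions for illustration; the sketch falls back to CPU when no GPU is present so it stays runnable anywhere.

```python
import torch


class DeviceReplayBuffer:
    """Hypothetical sketch: preallocated ring buffer stored entirely on one device."""

    def __init__(self, capacity: int, obs_dim: int, num_branches: int, device):
        self.capacity = capacity
        self.device = device
        # Storage is allocated once on the target device.
        self.obs = torch.zeros(capacity, obs_dim, device=device)
        self.next_obs = torch.zeros(capacity, obs_dim, device=device)
        self.actions = torch.zeros(capacity, num_branches, dtype=torch.long, device=device)
        self.rewards = torch.zeros(capacity, device=device)
        self.dones = torch.zeros(capacity, dtype=torch.bool, device=device)
        self.pos = 0   # next write index
        self.size = 0  # number of valid transitions

    def add(self, obs, action, reward, next_obs, done):
        i = self.pos
        self.obs[i] = obs
        self.actions[i] = action
        self.rewards[i] = reward
        self.next_obs[i] = next_obs
        self.dones[i] = done
        self.pos = (self.pos + 1) % self.capacity  # overwrite oldest when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size: int):
        # Indices are generated on-device, so the gather never touches host memory.
        idx = torch.randint(0, self.size, (batch_size,), device=self.device)
        return (self.obs[idx], self.actions[idx], self.rewards[idx],
                self.next_obs[idx], self.dones[idx])


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
buffer = DeviceReplayBuffer(capacity=8, obs_dim=6, num_branches=3, device=device)
for _ in range(10):  # 10 writes into capacity 8 wraps the ring
    obs = torch.randn(6, device=device)
    next_obs = torch.randn(6, device=device)
    action = torch.randint(0, 3, (3,), device=device)
    buffer.add(obs, action, reward=1.0, next_obs=next_obs, done=False)

obs_b, act_b, rew_b, next_obs_b, done_b = buffer.sample(batch_size=4)
```

Because both the index tensor and the storage live on the same device, a sampled batch can go straight into the loss computation without a host round trip.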
4. RLHF-Style Preference Alignment Network
- Safety Regularization: Built an auxiliary preference alignment network (RLHF-style) to regularize shared policy representations against predefined safety bounds—penalizing action trajectories that violate latency SLOs, overload thresholds, or fairness constraints during exploration.
- Human-in-the-Loop Ready: The alignment module accepts ranked trajectory pairs from operator feedback, enabling iterative policy refinement without destabilizing the core BDQ critic during stealth pilot iterations.
Dual-head multi-objective (schematic): The shared BDQ representation feeds two branches. The main branch runs from the primary output network to control actions and the RL objective. The auxiliary preference-alignment network branches off the shared trunk into a preference regularizer that is optimized alongside the RL objective; ranked pairs (A ≻ B) feed the auxiliary head without destabilizing the core critic.
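One common way to train on ranked pairs is the Bradley-Terry pairwise loss used for RLHF reward models, sketched below. Whether the pilot uses this exact loss is not stated, so treat it as an assumption; `PreferenceHead`, `preference_loss`, `aux_weight`, the summed per-step scoring, and the placeholder `rl_loss` are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreferenceHead(nn.Module):
    """Hypothetical auxiliary head: scores a trajectory from shared features."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, traj_features: torch.Tensor) -> torch.Tensor:
        # (batch, steps, feature_dim) -> per-step scores -> summed score per trajectory
        return self.score(traj_features).squeeze(-1).sum(dim=1)


def preference_loss(head: nn.Module, preferred: torch.Tensor, rejected: torch.Tensor):
    """Bradley-Terry loss: -log sigmoid(score(A) - score(B)) for ranked pairs A > B."""
    return -F.logsigmoid(head(preferred) - head(rejected)).mean()


head = PreferenceHead(feature_dim=128)
preferred = torch.randn(4, 10, 128)  # stand-in features for preferred trajectories
rejected = torch.randn(4, 10, 128)   # stand-in features for rejected trajectories

pref_loss = preference_loss(head, preferred, rejected)

# Multi-objective combination with a placeholder RL loss and an assumed weight.
rl_loss = torch.tensor(0.0)
aux_weight = 0.1
total_loss = rl_loss + aux_weight * pref_loss
total_loss.backward()
```

A small `aux_weight` keeps the preference term acting as a regularizer on the shared representation rather than competing with the RL objective.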
Engineering Takeaways
- Systems + RL: Demonstrated end-to-end ownership from MDP formulation and branching action design through distributed PyTorch training and serving prototypes.