Designed and developed a sequential decision engine for a cloud computing platform, delivering real-time resource provisioning and scheduling decisions under high-throughput, high-frequency runtime workloads. This ongoing stealth pilot applies deep reinforcement learning at systems scale, bridging algorithm design, distributed GPU training, and safety-aware policy alignment.

Key Technologies: PyTorch · Branching Dueling Q-Network (BDQ) · HF Accelerate · DeepSpeed · Ray Train/Serve · DDP · Preference Alignment


System Overview & Sequential Decision Loop

The engine formulates provisioning and scheduling on the platform as a Markov Decision Process (MDP). A macro information encoder and a micro information encoder process heterogeneous platform observability data in parallel: the former aggregates platform-level signals, the latter captures fast local observations. Their outputs are fused and passed to the Actor-Critic backbone. The primary head emits discrete control actions; a parallel preference alignment auxiliary network regularizes the shared policy representation toward safety and preference constraints. Decisions run at millisecond-scale intervals, balancing throughput, tail latency, and resource utilization.

[Figure 1: High-Frequency DRL Decision Loop. Components: Platform Observability; Macro Encoder and Micro Encoder (parallel); BDQ Policy (Actor-Critic, shared representation); Control head (primary); Preference Alignment head (auxiliary); GPU Replay Buffer]

Decision Loop: Platform observability feeds macro and micro encoders in parallel; fused representations enter the BDQ backbone. The primary head emits control actions; a preference alignment auxiliary network runs in parallel on the shared representation; transitions land in a GPU-resident replay buffer for distributed off-policy updates.
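
The parallel encoding and fusion step can be illustrated with a minimal PyTorch sketch. It is not the pilot's implementation: the class name DualEncoderTrunk, the single-layer MLP encoders, concatenation as the fusion operator, and all dimensions are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn


class DualEncoderTrunk(nn.Module):
    """Hypothetical trunk: two encoders run side by side, then fuse."""

    def __init__(self, macro_dim: int, micro_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Macro encoder: platform-level signals (assumed to be a flat vector).
        self.macro_enc = nn.Sequential(nn.Linear(macro_dim, hidden_dim), nn.ReLU())
        # Micro encoder: fast local observations (assumed to be a flat vector).
        self.micro_enc = nn.Sequential(nn.Linear(micro_dim, hidden_dim), nn.ReLU())
        # Fusion by concatenation + projection (an assumption; the source
        # does not say how the two representations are fused).
        self.fuse = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())

    def forward(self, macro_obs: torch.Tensor, micro_obs: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.macro_enc(macro_obs), self.micro_enc(micro_obs)], dim=-1)
        return self.fuse(z)  # shared representation consumed by both heads


# Illustrative shapes only: batch of 4, 16 macro features, 8 micro features.
trunk = DualEncoderTrunk(macro_dim=16, micro_dim=8)
shared = trunk(torch.randn(4, 16), torch.randn(4, 8))
print(shared.shape)  # torch.Size([4, 128])
```

The output of the trunk is the shared representation that the primary control head and the auxiliary preference head would both read.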


Core Technical Contributions

1. Branching Dueling Q-Network (BDQ) Actor-Critic Engine

  • Parallel State Encoding: A macro information encoder and a micro information encoder process heterogeneous platform observability data in parallel, extracting platform-level macro signals and fast micro features before fusion into the BDQ backbone.
  • Dual-Head Outputs: The Actor-Critic primary head emits discrete control actions; a preference alignment auxiliary network runs in parallel on the shared policy representation to enforce safety and preference constraints during training.
  • Architecture: Implemented an Actor-Critic stack with Dueling Advantage decomposition to stabilize Q-value estimation when action spaces branch across concurrent control dimensions (e.g., capacity tiers, priority weights, policy levels).
  • Branching Actions: Used a Branching Dueling Q-Network (BDQ) formulation so the agent factorizes multi-dimensional control signals into per-dimension branches instead of flattening them into one joint discrete space, whose size grows as the product of the per-dimension action counts. The number of network outputs grows with their sum instead, which preserves sample efficiency under the sparse, delayed rewards typical of platform sequential decision problems.
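
As a rough illustration of the branching dueling decomposition, the sketch below computes one Q-vector per control dimension as Q_d(s, a) = V(s) + A_d(s, a) − mean_a A_d(s, a), with a state value shared across branches. The class name, feature size, and branch sizes (5 capacity tiers, 3 priority weights, 4 policy levels) are hypothetical; the pilot's actual layer layout is not described in this write-up.

```python
from typing import List, Sequence

import torch
import torch.nn as nn


class BranchingDuelingHead(nn.Module):
    """Hypothetical BDQ head: shared state value, one advantage stream per branch."""

    def __init__(self, feat_dim: int, branch_sizes: Sequence[int]):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)  # V(s), shared by all branches
        self.advantages = nn.ModuleList(
            [nn.Linear(feat_dim, n) for n in branch_sizes]  # A_d(s, .)
        )

    def forward(self, feats: torch.Tensor) -> List[torch.Tensor]:
        v = self.value(feats)  # (batch, 1)
        q_per_branch = []
        for adv_layer in self.advantages:
            a = adv_layer(feats)  # (batch, n_d)
            # Dueling aggregation: subtract the per-branch mean advantage.
            q_per_branch.append(v + a - a.mean(dim=-1, keepdim=True))
        return q_per_branch


branch_sizes = [5, 3, 4]  # hypothetical: capacity tiers, priority weights, policy levels
head = BranchingDuelingHead(feat_dim=64, branch_sizes=branch_sizes)
feats = torch.randn(2, 64)
q_values = head(feats)

# Greedy action: an independent argmax per branch, stacked into (batch, branches).
actions = torch.stack([q.argmax(dim=-1) for q in q_values], dim=-1)
print(actions.shape)  # torch.Size([2, 3])
```

With these illustrative sizes, a flattened action space would need 5 × 3 × 4 = 60 outputs, while the branching head emits 5 + 3 + 4 = 12 advantage outputs plus one state value.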

2. Distributed Training at Scale

  • Multi-Node GPU Clusters: Scaled training across multi-node GPU clusters using Hugging Face Accelerate, Distributed Data Parallel (DDP), and DeepSpeed (ZeRO-2) to partition optimizer states and gradients across workers and sustain high throughput on large replay batches.
  • Ray Train / Ray Serve: Orchestrated experiment sweeps and serving prototypes with Ray Train and Ray Serve, separating offline policy improvement from low-latency online inference paths during pilot evaluation.
[Figure 2: Distributed Training Topology. Distributed Training Job (Ray Train): Ray Head; Node 1 and Node 2; four GPU Workers connected by DDP sync. Each worker: Accelerate · DDP · DeepSpeed ZeRO-2]

Training stack (schematic): Ray Train schedules a multi-node job across GPU workers; within the same run, each worker executes PyTorch via Accelerate with DDP gradient synchronization and DeepSpeed ZeRO-2 sharding of optimizer states and gradients. Ray Serve (online inference) is a separate path and omitted here.
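
For orientation, a ZeRO stage 2 setup is usually expressed as a DeepSpeed JSON config. The fragment below builds one as a Python dict and prints it; the key names follow DeepSpeed's documented config schema, but every value (batch size, bf16, bucket behavior) is an assumption and not the pilot's actual configuration.

```python
import json

# Hypothetical values throughout; only the key names come from DeepSpeed's config schema.
ds_config = {
    "train_micro_batch_size_per_gpu": 256,  # assumed replay batch per GPU
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,                    # shard optimizer states and gradients
        "overlap_comm": True,          # overlap gradient reduction with backward
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
    "bf16": {"enabled": True},         # assumption: bf16-capable GPUs
}

print(json.dumps(ds_config, indent=2))
```

A config like this would typically be serialized to a JSON file and referenced from the Accelerate DeepSpeed settings used to launch each worker.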

3. GPU-Resident Replay Buffer

  • Zero-Copy Training Path: Designed a GPU-resident replay buffer so experience tuples remain device-local across sampling and gradient steps—eliminating CPU→GPU copy overheads that otherwise dominate short-horizon, high-frequency DRL workloads.
  • Throughput Impact: Keeping transitions on-device improved effective training throughput during pilot benchmarks where decision cadence and batch sampling rates approached HPC-style duty cycles.
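
A device-resident buffer of this kind can be sketched as preallocated tensors with ring-buffer writes and on-device index sampling. The class below is a simplified illustration, not the pilot's buffer: the name DeviceReplayBuffer, the field layout, and the uniform sampling strategy are assumptions, and it falls back to CPU when no GPU is present.

```python
import torch


class DeviceReplayBuffer:
    """Hypothetical ring buffer whose storage lives on one device."""

    def __init__(self, capacity: int, obs_dim: int, num_branches: int, device: torch.device):
        self.capacity, self.device = capacity, device
        # Preallocate once so sampling never allocates or crosses the PCIe bus.
        self.obs = torch.empty((capacity, obs_dim), device=device)
        self.next_obs = torch.empty((capacity, obs_dim), device=device)
        self.actions = torch.empty((capacity, num_branches), dtype=torch.long, device=device)
        self.rewards = torch.empty(capacity, device=device)
        self.dones = torch.empty(capacity, dtype=torch.bool, device=device)
        self.pos, self.size = 0, 0

    def add(self, obs, actions, rewards, next_obs, dones) -> None:
        n = obs.shape[0]  # assumes n <= capacity so write indices are unique
        idx = (self.pos + torch.arange(n, device=self.device)) % self.capacity
        # .to() is a no-op when the incoming batch is already on the device.
        self.obs[idx] = obs.to(self.device)
        self.actions[idx] = actions.to(self.device)
        self.rewards[idx] = rewards.to(self.device)
        self.next_obs[idx] = next_obs.to(self.device)
        self.dones[idx] = dones.to(self.device)
        self.pos = (self.pos + n) % self.capacity
        self.size = min(self.size + n, self.capacity)

    def sample(self, batch_size: int):
        # Indices are drawn on the device, so the batch is gathered in place.
        idx = torch.randint(0, self.size, (batch_size,), device=self.device)
        return (self.obs[idx], self.actions[idx], self.rewards[idx],
                self.next_obs[idx], self.dones[idx])


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
buf = DeviceReplayBuffer(capacity=8, obs_dim=4, num_branches=3, device=device)
for _ in range(2):  # two batches of 5 wrap around the capacity of 8
    buf.add(
        torch.randn(5, 4, device=device),
        torch.randint(0, 3, (5, 3), device=device),
        torch.randn(5, device=device),
        torch.randn(5, 4, device=device),
        torch.zeros(5, dtype=torch.bool, device=device),
    )
obs_b, act_b, rew_b, next_obs_b, done_b = buf.sample(16)
```

Because both the stored transitions and the sampled indices stay on the training device, a gradient step can consume the batch without a host-to-device copy.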

4. RLHF-Style Preference Alignment Network

  • Safety Regularization: Built an auxiliary preference alignment network (RLHF-style) to regularize shared policy representations against predefined safety bounds—penalizing action trajectories that violate latency SLOs, overload thresholds, or fairness constraints during exploration.
  • Human-in-the-Loop Ready: The alignment module accepts ranked trajectory pairs from operator feedback, enabling iterative policy refinement without destabilizing the core BDQ critic during stealth pilot iterations.
[Figure 3: Dual-Head Multi-Objective Policy (Primary + Auxiliary). Main branch: Shared Policy representation (BDQ) → Primary OutputNet (main head) → Control → RL objective (primary RL optimization). Aux branch: Shared Policy representation → Auxiliary Pref Align (aux head) → preference regularization loss over the preference space and safety bounds, penalizing violating trajectories; human feedback (Traj. A ≻ B) enters at the aux head, with gradient backprop through the aux head]

Dual-head multi-objective (schematic): The main branch runs from the shared BDQ representation through the Primary output network to control actions and the RL objective. The Auxiliary preference-alignment network branches off the shared trunk into a preference regularizer that sits alongside the RL objective; ranked pairs (A ≻ B) feed the aux head without destabilizing the core critic.
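
One common way to learn from ranked trajectory pairs is a Bradley-Terry style pairwise loss, −log σ(score(A) − score(B)) for A ≻ B. The sketch below applies that loss to a small auxiliary head and adds it to a placeholder RL loss. It is an assumption that the pilot uses this particular loss; the names PreferenceHead and aux_weight, the per-step score summation, and all shapes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreferenceHead(nn.Module):
    """Hypothetical aux head: scores a trajectory of shared-trunk features."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, traj_feats: torch.Tensor) -> torch.Tensor:
        # traj_feats: (batch, steps, feat_dim) -> one scalar score per trajectory.
        return self.score(traj_feats).squeeze(-1).sum(dim=-1)


def preference_loss(head: nn.Module, feats_preferred: torch.Tensor,
                    feats_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push score(preferred) above score(rejected)."""
    margin = head(feats_preferred) - head(feats_rejected)
    return -F.logsigmoid(margin).mean()


torch.manual_seed(0)
head = PreferenceHead(feat_dim=64)
feats_a = torch.randn(8, 10, 64)  # stand-in features for preferred trajectories A
feats_b = torch.randn(8, 10, 64)  # stand-in features for rejected trajectories B

aux_loss = preference_loss(head, feats_a, feats_b)
rl_loss = torch.tensor(0.0)  # placeholder for the primary BDQ loss
aux_weight = 0.1             # hypothetical regularization weight
total_loss = rl_loss + aux_weight * aux_loss
total_loss.backward()
```

Whether the auxiliary gradient is allowed to reach the shared trunk or is confined to the aux head is a design choice; detaching the trunk features before scoring would limit the update to the aux head.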


Engineering Takeaways

  • Systems + RL: Demonstrated end-to-end ownership from MDP formulation and branching action design through distributed PyTorch training and serving prototypes.