At Deepcell, I designed, developed, and scaled an ultra-high-performance cell image classification and intelligent screening platform. The platform enables biologists and pharmaceutical researchers to analyze, classify, and physically sort cell populations in minimal time based on raw morphological surface features—bridging laboratory hardware instruments and cloud deep learning clusters into an efficient end-to-end pipeline.


Core Engineering Optimizations & Compute Acceleration

Our core R&D focus was maximizing model throughput, aggressively minimizing cloud compute costs, and building highly robust ultra-low-latency inference pipelines.

1. Ultra-High-Throughput Real-Time Streaming Pipeline

To handle streaming cell-image data captured in real time by laboratory hardware, we built a heavily hardware-accelerated parallel streaming inference pipeline spanning capture, quantized inference, serving, and terminal 2D visualization:

[Figure 1: Real-Time AI Inference Acceleration Pipeline Flowchart]
Online hot pathTensorRT INT8 · dynamic batching · GPU queuesOffline buildONNX → TensorRT INT8 · PTQ calibrationMicroscopyData QueuePre-processPyTorch ModelEmbeddingsUMAP Terminal

Pipeline acceleration logic (schematic): The live hot path runs: optical hardware → data queue → parallel CV pre-processing, entering Triton’s online layer from the left; a TensorRT INT8 engine emits embeddings for terminal UMAP display. The trained PyTorch model also enters the offline build layer from the left, completing ONNX → TensorRT INT8 (PTQ calibration) at release—the artifact is loaded by the online layer, not rebuilt on the hot path.

2. Accelerated Distributed Model Training

To train large CNN and Vision Transformer models efficiently on GCP—and control cost on preemptible TPU Pods—we built a distributed training stack with checkpoint fault tolerance and graph compilation acceleration:

[Figure 2: Distributed TPU Training & Fault-Tolerant Scheduling Flowchart]
Job Queue · Pod DispatchCNN / ViT parallel trainingCheckpoint Persist · RecoveryLossless resume on preemptionTraining SetTPU Podstorch.compilePyTorch Model

Training acceleration logic (schematic): Morphology training data is dispatched by a preemptible scheduler to GCP TPU Pods for parallel training; on preemption, a checkpoint queue drives fault-tolerant restart (green feedback loop). torch.compile compiles and fuses the compute graph on the training side—the artifact feeds Figure 1’s offline build layer.

  • TPU training instance scheduling: On Google Cloud Platform (GCP), I fine-tuned large-scale models (CNNs, Vision Transformers). To control cost, I designed a distributed training scheduler built on preemptible TPU Pods, with automated persistent checkpoint queues and fault-tolerant auto-restart recovery—enabling lossless resume when cloud instances are preemptively terminated.
  • PyTorch graph compilation: Enabled torch.compile on the training side to compile the compute graph and fuse operators, optimizing memory access and kernel scheduling and reducing overall training time by 35%.

3. High-Throughput Model Inference & Quantization

  • ONNX & TensorRT compilation: Exported trained PyTorch models to ONNX and applied NVIDIA TensorRT for deep inference-side optimization on production NVIDIA GPUs.
  • INT8 model quantization: Implemented calibration-based post-training quantization (PTQ), compressing models to INT8 precision. While holding classification accuracy loss within 0.1% of the FP32 baseline, GPU inference throughput improved .
  • Triton inference serving: Deployed optimized models on NVIDIA Triton Inference Server. With dynamic batching, concurrent multi-model loading, and shared CUDA memory queues, the system achieved sub-millisecond tail response times under high-frequency instrument telemetry streams.
  • Terminal UMAP 2D projection: UMAP-reduced embeddings are rendered live at the instrument terminal as a 2D morphology space for immediate inspection and cell selection.

4. Data Pipelines & Embedding Engineering

  • Real-time instrument data ingestion: Built a low-latency real-time read queue wired directly to microscopy hardware image output. Applied real-time computer vision correction and augmentation, extracting dense semantic embeddings online.
  • Vector retrieval system: Built a high-performance vector retrieval index for offline storage and search over tens of millions of historical morphology vectors—complementing the live terminal UMAP projection in the streaming pipeline above—and keeping legacy embeddings compatible with, or cheaply mappable to, new model architectures and parameter updates.