Serving Applied ML Systems

Maxwell J. Yin

I design and build end-to-end Machine Learning systems, combining distributed model training with high-throughput, low-latency semantic retrieval pipelines.

📍 Toronto, Canada

About Me

I bridge the gap between academic Machine Learning research and high-performance product deployments.

My core expertise lies in designing scalable LLM pipelines, distributed model training configurations, and low-latency retrieval infrastructure. Whether tuning model serving configurations to reduce token latency or designing contrastive embedding adapters that scale, I focus on building reliable systems with concrete metric targets.

Production Systems & Scale Profile

Practical metric baselines achieved across real-world model training and low-latency serving pipelines.

Retrieval Latency ⚡
~12ms
P99 query latency achieved using a dual-encoder FAISS index rather than heavy cross-encoders.
Index Volume 📦
1.7M
Total passage embeddings indexed and updated in real time within low-latency limits.
Compute Scaling 🚀
8× H100
Stable distributed training of a 400M-parameter Transformer model over 40B tokens.
Throughput Gain 📈
5.2×
Serving QPS scaled 5.2× using optimized contrastive embedding adapters.
cineseek-inference-pipeline.flow
INPUT: Natural Query (raw user string)
➔ AGENT: LLM Rewriter (intent expansion)
➔ INDEX: FAISS ANN (top-100 recall)
➔ RANK: Cross-Encoder (top-10 reranking)
➔ OUTPUT: LLM Generator (12ms explaining)

Career timeline

Machine Learning Engineer

Huawei Noah's Ark Lab
2025 – 2026
Toronto, ON
  • Engineered large-scale LLM training and evaluation pipelines.
  • Trained a 400M-parameter Transformer model on 40B tokens across multi-node 8× H100 GPUs, stabilizing multi-GPU distributed operations.
  • Investigated training optimizations, distributed profiling, and performance evaluation pipelines to improve training stability and weigh latency against cost.
Distributed Training · Transformer Architecture · DeepSpeed · PyTorch · System Profiling

Graduate Research Assistant

Western University
2021 – 2025
London, ON
  • Designed and scaled semantic retrieval pipelines querying over 500K documents and 1.7M passages.
  • Replaced expensive multi-stage reranking pipelines with optimized FAISS-based Approximate Nearest Neighbor (ANN) retrieval in a unified embedding space.
  • Cut end-to-end retrieval latency by more than 5× while maintaining precision and result relevance.
FAISS Indexing · Semantic Search · Embedding Adapters · RAG Pipelines · Inference Optimization
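
As a rough illustration of retrieval in a unified embedding space, the sketch below ranks passages by cosine similarity to a query vector. It is a minimal stand-in under stated assumptions, not the pipeline described above: it uses exact brute-force search in pure Python where the real system uses FAISS ANN indexing, and the `PASSAGES` vectors, `normalize`, and `top_k` are hypothetical names and toy data introduced only for this example.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def top_k(query: Vector, passages: Dict[str, Vector], k: int) -> List[Tuple[str, float]]:
    """Return the k passages most similar to the query (exact search, not ANN)."""
    q = normalize(query)
    scored = [
        (pid, sum(a * b for a, b in zip(q, normalize(vec))))
        for pid, vec in passages.items()
    ]
    # Highest cosine similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; queries and passages share one vector space.
PASSAGES = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.0, 1.0, 0.0],
    "p3": [0.7, 0.7, 0.0],
}

results = top_k([1.0, 0.0, 0.0], PASSAGES, k=2)
for passage_id, score in results:
    print(passage_id, round(score, 3))
```

Because queries and passages are embedded once into the same space, a single similarity lookup can replace several scoring stages. An ANN index trades a small amount of recall for much lower latency than the exhaustive scan shown here.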

Featured Project

CineSeek: Agent-Enhanced Semantic Movie Search

production-ready

CineSeek combines LLM-based query expansion and rewriting, FAISS-based high-performance ANN retrieval, and an agentic cross-encoder reranker. It was built specifically to solve complex, long-tail queries without sacrificing production latency bounds.
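
The three stages named above (rewrite, recall, rerank) can be sketched as a small retrieve-then-rerank loop. This is a toy outline under loud assumptions rather than CineSeek's code: a lookup table stands in for the LLM rewriter, token overlap stands in for FAISS similarity, and Jaccard overlap stands in for the cross-encoder. All function names, `CORPUS`, and `EXPANSIONS` are hypothetical; only the top-100 recall and top-10 rerank defaults come from the pipeline diagram.

```python
from typing import Dict, List


def rewrite_query(query: str, expansions: Dict[str, List[str]]) -> str:
    """Stand-in for the LLM rewriter: append hand-written expansion terms."""
    words = query.lower().split()
    extra = [term for word in words for term in expansions.get(word, [])]
    return " ".join(words + extra)


def recall_score(query: str, doc: str) -> float:
    """Cheap first-stage score: shared-token count (stand-in for ANN similarity)."""
    return float(len(set(query.split()) & set(doc.lower().split())))


def rerank_score(query: str, doc: str) -> float:
    """Second-stage score: Jaccard overlap (stand-in for a cross-encoder)."""
    q, d = set(query.split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def search(
    query: str,
    corpus: List[str],
    expansions: Dict[str, List[str]],
    recall_n: int = 100,
    final_k: int = 10,
) -> List[str]:
    """Rewrite the query, recall recall_n candidates, then rerank down to final_k."""
    expanded = rewrite_query(query, expansions)
    candidates = sorted(
        corpus, key=lambda doc: recall_score(expanded, doc), reverse=True
    )[:recall_n]
    return sorted(
        candidates, key=lambda doc: rerank_score(expanded, doc), reverse=True
    )[:final_k]


# Hypothetical toy data for illustration only.
CORPUS = [
    "a heist thriller set inside a dream",
    "a romantic comedy in paris",
    "a space thriller about a lost astronaut",
]
EXPANSIONS = {"dreams": ["dream", "heist"]}

hits = search("movie about dreams", CORPUS, EXPANSIONS, recall_n=2, final_k=1)
print(hits)
```

The design point is that the expensive scorer only sees the short candidate list produced by the cheap stage, which is how a reranker can be added without giving up the latency budget.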

Vector Indexing · LLM Agents · FastAPI serving · Docker Compose
Live Demo · Source Code · All Projects

Research & Publications

First-author papers in TACL, NAACL, AAAI, and Expert Systems with Applications, bridging representation learning with information retrieval.

  • AAAI 2025
    MABR: Multilayer Adversarial Bias Removal Without Prior Bias Knowledge
    M. J. Yin, B. Wang, and C. Ling
  • TACL 2024
    Source-Free Domain Adaptation for Question Answering with Masked Self-training
    M. Yin, B. Wang, Y. Dong, and C. Ling
  • NAACL Findings 2024
    Source-Free Unsupervised Domain Adaptation for Question Answering via Prompt-Assisted Self-learning
    M. Yin, B. Wang, and C. Ling
  • ESWA 2024
    A Fast Local Citation Recommendation Algorithm Scalable to Multi-topics
    M. J. Yin, B. Wang, and C. Ling

Explore the full dataset & citations on Google Scholar ➔