As large language models move into production, inference performance is increasingly defined by systems-level decisions, not model architecture or prompts.
This event explores the infrastructure and low-level engineering challenges behind efficient LLM inference, including KV cache movement, memory bandwidth, cache efficiency, distributed execution, and long-context optimization.
Across three technical talks, we’ll cover disaggregated inference on modern cloud hardware, data-oriented design for high-performance inference engines, and structural sparsity techniques for KV cache compression.
The event is designed for engineers and researchers working on LLM infrastructure, inference engines, and ML systems, and concludes with networking, food, and drinks.
Agenda
18:00 Doors open
18:30 - 18:50 Disaggregated Inference with EFA and NIXL (Toshinobu Akazawa - Solutions Architect, AWS)
18:50 - 19:10 High-Performance Inference Execution, Caching, and Systems using Data-Oriented Design (Julie...
Platform: luma