Deploying Disaggregated LLM Inference Workloads on Kubernetes | NVIDIA Technical Blog
…Prefill is compute-intensive and benefits from high floating-point throughput (FLOPS), while decode is memory-bandwidth-bound and benefits from large, fast memory.

Disaggregated inference

Disaggregated architectures separate these stages into…
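
The contrast between compute-bound prefill and memory-bandwidth-bound decode can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not from the original post: it models a single square weight matrix, and the model width, FP16 value size, and accelerator peak figures are all hypothetical numbers chosen for illustration. It counts the FLOPs performed per byte moved when a layer processes a whole prompt at once (prefill) versus a single new token (decode), then compares each against the accelerator's FLOPs-to-bandwidth ratio.

```python
def arithmetic_intensity(tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte for one d_model x d_model matmul applied to `tokens` tokens."""
    # One multiply and one add per weight, per token.
    flops = 2 * tokens * d_model * d_model
    # The weights are read once regardless of how many tokens are processed.
    weight_bytes = d_model * d_model * bytes_per_value
    # Read the input activations and write the output activations.
    activation_bytes = 2 * tokens * d_model * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


def classify(intensity: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Label a workload by comparing it to the accelerator's FLOPs-per-byte balance."""
    machine_balance = peak_flops / peak_bandwidth
    if intensity >= machine_balance:
        return "compute-bound"
    return "memory-bandwidth-bound"


# Hypothetical values, for illustration only.
D_MODEL = 4096            # assumed hidden size
PEAK_FLOPS = 1.0e15       # assumed peak FLOP/s
PEAK_BANDWIDTH = 3.0e12   # assumed peak memory bandwidth in bytes/s

prefill = arithmetic_intensity(tokens=2048, d_model=D_MODEL)  # whole prompt at once
decode = arithmetic_intensity(tokens=1, d_model=D_MODEL)      # one new token

print(f"prefill: {prefill:.1f} FLOPs/byte, {classify(prefill, PEAK_FLOPS, PEAK_BANDWIDTH)}")
print(f"decode:  {decode:.3f} FLOPs/byte, {classify(decode, PEAK_FLOPS, PEAK_BANDWIDTH)}")
```

Under these assumptions the prefill pass reuses each weight across thousands of tokens, so its intensity sits well above the machine balance, while the decode step reads the full weight matrix to produce a single token and falls far below it. Batching many decode requests raises the decode intensity, so the picture is less stark in a real serving system.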