Deploying Disaggregated LLM Inference Workloads on Kubernetes | NVIDIA Technical Blog
…

Prefill workers (four replicas, 2-degree tensor parallelism):

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: prefill-workers
spec:
  replicas: 4
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: prefill-leader
…

…
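To make the sizing in the prefill manifest concrete, here is a minimal Python sketch of the pod arithmetic. It uses only the two values shown in the excerpt (`replicas: 4` and `size: 2`) and relies on the LeaderWorkerSet convention that `size` counts every pod in a group, leader included. The `LwsShape` class and the `gpus_per_pod` field are hypothetical names introduced for illustration. The default of one GPU per pod is an assumption, since the excerpt does not show the container resource requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LwsShape:
    """Hypothetical helper mirroring two LeaderWorkerSet fields."""

    replicas: int          # spec.replicas: number of leader/worker groups
    size: int              # spec.leaderWorkerTemplate.size: pods per group, leader included
    gpus_per_pod: int = 1  # ASSUMPTION: not shown in the excerpt

    def total_pods(self) -> int:
        # Every group contains `size` pods, and there are `replicas` groups.
        return self.replicas * self.size

    def workers_per_group(self) -> int:
        # One pod in each group is the leader; the rest are workers.
        return self.size - 1

    def gpus_per_group(self) -> int:
        # GPUs available to one tensor-parallel group under the assumption above.
        return self.size * self.gpus_per_pod


prefill = LwsShape(replicas=4, size=2)
print(prefill.total_pods())         # pods across all prefill groups
print(prefill.workers_per_group())  # non-leader pods in each group
print(prefill.gpus_per_group())     # GPUs per group if each pod holds one GPU
```

Under these assumptions, each of the four groups pairs one leader with one worker, which matches a tensor-parallel degree of 2 only if each pod holds a single GPU. With `restartPolicy: RecreateGroupOnPodRestart`, a restart of either pod recreates the whole group, so the group is the unit to reason about when estimating capacity.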