What metrics should you measure for LLM inference performance?
The prerequisite for sizing and TCO estimation is benchmarking the performance of each deployment unit, e.g., an inference server. The goal of this step is to measure the throughput a system can produce under load, and at what latency. These throughput and latency metrics, together with quality of service requirements (e.g., max latency) and expected peak demand (e.g., max concurrent users or requests per second), will help estimate the required hardware, such as sizing the deployment. In turn, sizing information is a prerequisite for estimating the total cost of ownership (TCO) of the given s