Showing top 43 results for "Kubernetes"

People also ask

What is the benefit of running Slurm on Kubernetes?

The operational payoff of running Slurm on Kubernetes comes from the ecosystem. Rather than building and maintaining separate toolchains for GPU management, monitoring, networking, and node lifecycle, you can use the Kubernetes tooling that already exists for these problems. Platform teams manage clusters with declarative YAML, Helm deployments, rolling updates, and Prometheus or Grafana for observability.

Running Large-Scale GPU Workloads on Kubernetes with Slurm | NVIDIA Technical Blog

How does NVSentinel work?

NVSentinel is installed in each Kubernetes cluster you run. Once deployed, NVSentinel continuously watches nodes for errors, analyzes events, and takes automated actions such as quarantining, draining, labeling, or triggering external remediation workflows. Specific NVSentinel features include continuous monitoring, data aggregation and analysis, and more.
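
The watch, analyze, act loop described in this answer can be illustrated with a small, self-contained sketch. This is not NVSentinel's API or configuration: the `NodeEvent` type, the severity levels, and the `plan_actions` helper are all hypothetical names chosen for illustration. It only shows one plausible way to aggregate per-node events and map the worst severity to actions such as labeling, quarantining, or draining.

```python
from dataclasses import dataclass

# Hypothetical severity ordering and action mapping (assumptions for this
# sketch; they do not reflect NVSentinel's real schema).
SEVERITY_ORDER = ["info", "warning", "fatal"]
ACTIONS_BY_SEVERITY = {
    "info": ["label"],
    "warning": ["label", "quarantine"],
    "fatal": ["label", "quarantine", "drain", "remediate"],
}


@dataclass(frozen=True)
class NodeEvent:
    """A single health event observed on a node (hypothetical shape)."""
    node: str
    severity: str
    message: str


def plan_actions(events):
    """Aggregate events per node and return actions for the worst severity seen."""
    worst = {}
    for ev in events:
        if ev.severity not in ACTIONS_BY_SEVERITY:
            continue  # skip events with an unrecognized severity
        current = worst.get(ev.node)
        if current is None or (
            SEVERITY_ORDER.index(ev.severity) > SEVERITY_ORDER.index(current)
        ):
            worst[ev.node] = ev.severity
    return {node: ACTIONS_BY_SEVERITY[sev] for node, sev in worst.items()}


if __name__ == "__main__":
    sample = [
        NodeEvent("gpu-node-1", "info", "driver version reported"),
        NodeEvent("gpu-node-1", "fatal", "uncorrectable GPU memory error"),
        NodeEvent("gpu-node-2", "warning", "thermal throttling"),
    ]
    for node, actions in plan_actions(sample).items():
        print(node, actions)
```

In a real cluster the event source and the actions would go through the Kubernetes API; the sketch keeps both as plain data so the decision logic stays easy to read.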

Automate Kubernetes AI Cluster Health with NVSentinel | NVIDIA Technical Blog

How does Slinky slurm-operator work?

Slinky slurm-operator represents each Slurm component (slurmctld for scheduling, slurmdbd for accounting, slurmd for compute workers, slurmrestd for API access) as a Kubernetes Custom Resource Definition (CRD). A Slurm cluster is defined using Custom Resources, and Slinky creates containerized Slurm daemons running in their own pods, configured to belong to their respective cluster. Slinky ensures high availability (HA) of the Slurm control plane (slurmctld) through pod regeneration, with no need for the Slurm native HA mechanism. Configuration changes propagate automatically: Kubernetes synch

Running Large-Scale GPU Workloads on Kubernetes with Slurm | NVIDIA Technical Blog

What is the GPU Usage Monitor?

The GPU Usage Monitor is an open-source project that deploys a fully integrated GPU observability stack for Kubernetes. Rather than requiring SRE and platform teams to assemble and configure individual components, the GPU Usage Monitor packages DCGM Exporter, kube-state-metrics, Prometheus, and Grafana into a single deployment, complete with pre-built dashboards designed specifically for GPU-accelerated workloads. The design principle is operational simplicity. A single helm install command results in actionable GPU visibility within minutes, with no custom dashboard authoring or scrape configurat

Get Real-Time Visibility into GPU Usage Across Kubernetes Clusters | NVIDIA Technical Blog

Validate Kubernetes for GPU Infrastructure with Layered, Reproducible Recipes | NVIDIA Technical Blog

By Mark Chmarny and Nathan Taber. AI Cluster Runtime is an open-source project from NVIDIA that simplifies and standardizes K… …

Mar 12, 2026 · Mark Chmarny

Automate Kubernetes AI Cluster Health with NVSentinel | NVIDIA Technical Blog

… A health system for Kubernetes GPU clusters: NVSentinel is an intelligent monitoring and self-healing system for Kubernetes clusters that run GPU workloads. …

Dec 8, 2025 · Lalit Adithya

Running Large-Scale GPU Workloads on Kubernetes with Slurm | NVIDIA Technical Blog

… Slinky, an open source project developed by SchedMD (now part of NVIDIA), takes two approaches to this integration: slurm-bridge brings Slurm scheduling to native Kubernetes workloads, allowing Slurm to act as a Kubernetes scheduler for pods; slurm-operator runs full Slurm clusters on Kubernetes inf… …

Apr 9, 2026 · Anton Polyakov

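Validating Kubernetes for GPU Infrastructure with Layered, Reproducible Recipes (Korean edition)

… At AWS, he was a founding member of the Amazon EKS team and played a key role in defining Kubernetes-based services through EKS, Karpenter, and the open source ecosystem. At NVIDIA, he designs health automation patterns for GPU-accelerated Kubernetes environments and large-scale AI infrastructure, guiding cloud providers and customers so they can run GPU workloads reliably in production. …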

Mar 20, 2026 · Mark Chmarny

Get Real-Time Visibility into GPU Usage Across Kubernetes Clusters | NVIDIA Technical Blog

… The observability gap in GPU-accelerated Kubernetes clusters: For site reliability engineers (SREs) and platform teams managing GPU-accelerated Kubernetes clusters, two failure modes are common and costly. …

May 21, 2026 · Guy Saltoun

Deploying Disaggregated LLM Inference Workloads on Kubernetes | NVIDIA Technical Blog

… Rather than scaling deployments directly, WVA emits target replica counts as Prometheus metrics that standard HPA or Kubernetes-based event-driven autoscaling (KEDA) can act on, keeping the scaling actuation within Kubernetes-native primitives. …

Mar 23, 2026 · Anish Maddipoti

Ensuring Balanced GPU Allocation in Kubernetes Clusters with Time-Based Fairshare | NVIDIA Technical Blog

… Before that, he managed large-scale, multi-tenant AI Kubernetes clusters, making sure research teams get access to the resources they need, and helping researchers navigate Kubernetes for training and inference. …

Jan 28, 2026 · Ekin Karabulut

OSMO Platform

… Do I need Kubernetes or infrastructure expertise to use OSMO? No. Workflows are defined in simple YAML files, and OSMO abstracts the underlying infrastructure. Users don’t need to write Kubernetes manifests or manage cluster configuration to run physical AI workloads at scale. …

Running AI Workloads on Rack-Scale Supercomputers: From Hardware to Topology-Aware Scheduling | NVIDIA Technical Blog

… Kubernetes: From flat schedulers to NVLink-aware placement. In Kubernetes, the core challenge is similar to Slurm. Kubernetes pods need to be placed on nodes that share high-bandwidth connectivity. …

Apr 7, 2026 · Ryan Prout

NVIDIA Nsight Cloud

… NVIDIA Cloud Native Stack is based on Ubuntu/RHEL, Kubernetes, Helm, and the NVIDIA GPU and Network Operator. …
