Showing top 2 results for "Operational/config outage"

Running Large-Scale GPU Workloads on Kubernetes with Slurm | NVIDIA Technical Blog

… In the longer term, the team is working on graceful Slurm cluster upgrades, planned outage workflows, configuration rollback, and structured daemon logging. …

Apr 9, 2026 · Anton Polyakov

Building Telco Reasoning Models for Autonomous Networks with NVIDIA NeMo | NVIDIA Technical Blog

… Key features include: AI reasoning plus tool-calling: Replaces manual alarm triage by invoking NOC tools for validation, root‑cause analysis, and remediation across existing systems. End-to-end automation: Handles alarm validation, RCA, and healing for various incident types such as outages, flaps… …

Mar 1, 2026 · Aiden Chang