NVIDIA GTC 2026: Live Updates on What’s Next in AI
… A single multiply-add operation may cost mere femtojoules (tiny units of energy), but fetching the data from external memory can cost thousands of times more. …
In healthcare, tedious, time-consuming tasks like medical coding, documentation and managing insurance forms cut into the time doctors can spend with patients. Sully.ai helps solve this problem by developing “AI employees” that can handle routine tasks like medical coding and note-taking. As the company’s platform scaled, its proprietary, closed-source models created three bottlenecks: unpredictable latency in real-time clinical workflows, inference costs that scaled faster than revenue, and insufficient control over model quality and updates. To overcome these bottlenecks, Sully.ai uses Basete
Leading Inference Providers Achieve Lowest Token Cost With Open Source Models on NVIDIA Blackwell
Latitude is building the future of AI-native gaming with its AI Dungeon adventure-story game and upcoming AI-powered role-playing gaming platform, Voyage, where players can create or play worlds with the freedom to choose any action and make their own story. The company’s platform uses large language models to respond to players’ actions, but this comes with scaling challenges, as every player action triggers an inference request. Costs scale with engagement, and response times must stay fast enough to keep the experience seamless. Latitude runs large open source models on DeepInfra’s infere
Sentient Labs is focused on bringing AI developers together to build powerful reasoning AI systems that are all open source. The goal is to accelerate AI toward solving harder reasoning problems through research in secure autonomy, agentic architecture and continual learning. Its first app, Sentient Chat, orchestrates complex multi-agent workflows and integrates more than a dozen specialized AI agents from the community. As a result, Sentient Chat has massive compute demands: a single user query can trigger a cascade of autonomous interactions that typically lead to costly infrastruct
Customer service calls with voice AI often end in frustration because even a slight delay can lead users to talk over the agent, hang up or lose trust. Decagon builds AI agents for enterprise customer support, with AI-powered voice being its most demanding channel. Decagon needed infrastructure that could deliver sub-second responses under unpredictable traffic loads with tokenomics that supported 24/7 voice deployments. Together AI runs production inference for Decagon’s multimodel voice stack on NVIDIA Blackwell GPUs. The companies collaborated on several key optimizations: speculative decod
… Moving to Blackwell’s native NVFP4 format, an ultralow-precision floating-point data format that reduces memory bandwidth and model size while maintaining inference accuracy, further cut that cost to just 5 cents, for a total 4x improvement in cost per token, while maintaining the accuracy that custo… …
… In January, its GitHub star count crossed 100,000 as developer interest surged. …
… By applying the default NVIDIA TensorRT settings during the compilation process, the Screenshop team immediately saw a 3x surge in throughput, estimated to deliver a staggering 66% cost reduction. …
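The link between the 3x throughput figure and the roughly 66% cost estimate can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the hourly hardware cost stays the same before and after compilation, so cost per unit of work scales as one over throughput. The function name `cost_reduction_from_speedup` is hypothetical and does not come from TensorRT or from the Screenshop team.

```python
def cost_reduction_from_speedup(speedup: float) -> float:
    """Fractional drop in cost per unit of work for a given throughput gain.

    Assumption (not stated in the source): the hourly hardware cost is
    unchanged, so cost per unit of work is proportional to 1 / throughput.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# A 3x throughput gain at constant hourly cost.
reduction = cost_reduction_from_speedup(3.0)
print(f"{reduction:.1%}")  # prints 66.7%
```

Under that assumption, tripling throughput cuts the cost per unit of work by two thirds, which matches the 66% estimate quoted above once it is rounded down.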
… This includes techniques like chain-of-thought prompting, external memory and multistep decomposition. …