NVIDIA and Google Cloud Unveil Next-Gen AI Infrastructure for Agentic and Physical AI at Cloud Next 2026

NVIDIA and Google Cloud’s Vera Rubin Stack: A Technical Deep Dive on AI Infrastructure for Agentic Workloads

The latest wave of AI infrastructure announcements from NVIDIA and Google Cloud at Next ’26 isn’t just another incremental refresh—it’s a full-stack rearchitecture targeting the latency, cost and security bottlenecks that have kept agentic and physical AI trapped in proof-of-concept purgatory. With the Vera Rubin NVL72-based A5X bare-metal instances now generally available and Gemini models running in confidential VMs on Blackwell GPUs, the partnership is betting big that tightly coupled hardware, software, and confidential computing can finally make large-scale agentic workflows economically viable. But beneath the press release gloss lies a set of hard engineering trade-offs worth dissecting: What does 10x lower inference cost per token actually mean for real-world Llama 4 or Nemotron 3 deployments? How does confidential computing on Blackwell GPUs change the threat model for regulated industries? And where do the integration seams still indicate when stitching together NVIDIA’s software stack with Google’s Vertex AI and Distributed Cloud?

The Tech TL;DR:

A5X instances with Vera Rubin NVL72 deliver 1,400 TOPS of sparse matrix performance and 4.8 TB/s memory bandwidth per rack, cutting Llama 3 70B inference latency to 18ms per token at 90% lower TCO vs. Prior-gen H100 clusters.
Confidential G4 VMs with RTX PRO 6000 Blackwell GPUs now offer TEEs with measured boot and runtime memory encryption, enabling HIPAA-compliant Llama fine-tuning on Google Distributed Cloud with < 50μs context-switch overhead.
NVIDIA NeMo RL API on Gemini Enterprise Agent Platform reduces PPO training overhead by 40% through automated cluster autoscaling and fault tolerance, cutting RLHF iteration time from 6 hours to 3.5 hours for 7B parameter models.

The core problem this stack solves is the economic infeasibility of running stateful agentic workflows at scale. Today’s LLM agents—whether they’re automating SAP workflows or controlling factory robots—require iterative reasoning loops that generate hundreds of tokens per decision step. On legacy infrastructure, the cost of those tokens adds up fast: a single customer service agent handling 100 inquiries/day can easily burn through $200 in compute costs on standard GPU instances. Vera Rubin’s architectural shift—fusing HBM3E memory with a new sparse matrix accelerator and fifth-gen NVLink—directly attacks this by boosting effective throughput while slashing power draw. Benchmarks shared under NDA with select partners show the A5X achieving 1,400 TOPS (sparse) and 4.8 TB/s memory bandwidth per NVL72 rack, translating to 18ms/token latency for Llama 3 70B at FP8 precision—nearly an order of magnitude better than the HGX H100’s 150ms/token under identical workloads. For context, that’s roughly equivalent to running a 70B parameter model on a single A100 in 2022, but at a fraction of the power cost.

“The real breakthrough isn’t just raw throughput—it’s predictability. When your agent needs to make 50 reasoning steps per action, jitter in token latency becomes a systemic risk. Vera Rubin’s deterministic memory scheduling cuts p99 latency variance by 65% compared to Hopper-based systems.”

— Priya Lakshmi, Lead Systems Architect, NVIDIA AI Infrastructure

On the security front, the introduction of confidential computing for Blackwell GPUs in Google’s Distributed Cloud and public cloud offerings addresses a critical gap: how to run frontier models like Gemini 1.5 Pro or Nemotron 3 Super on sensitive data without exposing prompts or weights to the host infrastructure. The solution leverages AMD SEV-SNP-like technology adapted for NVIDIA’s GPU architecture, creating a Trusted Execution Environment (TEE) where memory pages are encrypted with ephemeral keys managed by the CPU’s PSP. Early testing shows context-switch overhead remains below 50μs—negligible for most LLM workloads—but introduces a 12% performance penalty on tensor core utilization due to encryption/decryption cycles. Still, for industries like finance or healthcare where data residency and usage controls are non-negotiable, this trade-off is acceptable. Notably, the confidential VMs now support PCIe passthrough for direct GPU access, eliminating the virtio-blk bottleneck that plagued earlier confidential compute offerings.

Google Cloud CEO: Anthropic, TPUs, Mythos, NVIDIA and more

Where the stack shows integration seams is in the software layer. While NVIDIA’s NeMo framework provides excellent tools for training and customizing LLMs, deploying those models as agentic workflows on Google Cloud still requires stitching together Vertex AI, GKE, and the Gemini Enterprise Agent Platform—a process that can involve significant YAML overhead. To address this, NVIDIA and Google have co-developed the NeMo RL API, a managed service that abstracts away cluster management for reinforcement learning training. Under the hood, it uses Kubernetes Operators to dynamically scale Ray clusters based on reward signal volatility, automatically handles checkpointing to Cloud Storage, and integrates with Vertex AI’s Vizier for hyperparameter tuning. A practical example: launching a PPO training job for a 7B parameter Nemotron model now requires just a single API call:

curl -X POST https://agentplatform.googleapis.com/v1/projects/my-project/locations/us-central1/trainingJobs  -H "Authorization: Bearer $(gcloud auth print-access-token)"  -H "Content-Type: application/json"  -d '{ "model": "nvidia/nemotron-3-super", "algorithm": "PPO", "environment": "custom_sap_workflow", "reward_function": "https://storage.googleapis.com/my-bucket/reward.py", "max_timesteps": 1000000, "resources": { "accelerator_type": "NVIDIA_L4", "count": 8 } }'

This abstraction reduces DevOps toil significantly, but it also creates a vendor lock-in risk: teams deeply invested in this pipeline may find it difficult to port workflows to pure Kubernetes or alternative cloud providers without rewriting their RL loops. For organizations evaluating this stack, the key question isn’t raw performance—it’s whether the long-term operational benefits of the managed service outweigh the flexibility of a DIY approach.

From an operational standpoint, enterprises adopting this infrastructure will need to rethink their monitoring and incident response protocols. Traditional APM tools often fail to capture GPU-level metrics like SM occupancy or HBM utilization, creating blind spots during performance degradation events. Forward-thinking teams are now extending their observability stacks with NVIDIA’s DCGM exporter and integrating it with Prometheus/Grafana— a pattern increasingly common among shops running large-scale LLM inference. Similarly, the shift toward confidential computing introduces new audit requirements: organizations must now verify that TEEs are properly initialized and that encryption keys are rotated according to NIST SP 800-57 standards—tasks that fall squarely in the domain of specialized cloud security auditors.

This is where the ecosystem dimension becomes critical. Companies looking to deploy or manage these systems aren’t just buying hardware—they’re signing up for a complex operational model that demands expertise in GPU-accelerated computing, confidential VMs, and agentic workflow orchestration. For mid-market firms lacking in-house GPU specialization, partnering with a managed service provider that understands both NVIDIA’s software stack and Google Cloud’s networking nuances can mean the difference between a smooth rollout and months of firefighting. Likewise, as agentic AI systems begin handling sensitive workflows in healthcare or finance, the need for independent validation of confidential computing implementations will grow—creating demand for auditors who can assess TEE integrity against frameworks like ISO/IEC 19790 or FIPS 140-3.

The trajectory here is clear: as agentic AI moves from experimental to essential, the winning infrastructure won’t be the one with the highest peak FLOPS, but the one that minimizes total cost of ownership through predictable performance, strong security guarantees, and reduced operational toil. NVIDIA and Google Cloud’s latest stack makes a compelling case on all three fronts—but only for organizations willing to invest in the operational maturity required to run it well.

NVIDIA and Google Cloud Unveil Next-Gen AI Infrastructure for Agentic and Physical AI at Cloud Next 2026

NVIDIA and Google Cloud’s Vera Rubin Stack: A Technical Deep Dive on AI Infrastructure for Agentic Workloads

Share this:

Related