A full Petaflop in the Palm of Your Hand – The Dell Pro Max with GB10


Descriptions:

Dave’s Garage host Dave puts Dell’s GB10-based system through three practical workloads to assess whether Nvidia’s compact Blackwell superchip is ready for serious edge AI development. At the hardware level, the GB10 pairs a 20-core Arm CPU (10 Cortex-X925 performance cores + 10 Cortex-A725 efficiency cores) with 6,144 CUDA cores. The CPU and GPU are connected over NVLink C2C and share a single 128GB pool of LPDDR5 memory, eliminating the host-to-device memory shuffling that complicates discrete GPU setups. Nvidia rates the chip at roughly 1 petaflop of FP4 AI compute; the unit draws about 230W through an external power brick.
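
To get a feel for what a 128GB shared pool can hold, the rough arithmetic below estimates the weights-only footprint of a model at different precisions. It is a back-of-the-envelope sketch, not a figure from the video: the 120-billion-parameter size is used purely as an example, the function name is hypothetical, and the estimate ignores KV cache, activations, quantization scale factors, and the operating system's own memory use.

```python
POOL_GB = 128  # shared memory pool quoted in the description (decimal GB assumed)


def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only size in decimal GB.

    params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB.
    Ignores KV cache, activations, and per-block scale factors.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
        gb = weight_footprint_gb(120, bits)
        headroom = POOL_GB - gb
        print(f"{name}: {gb:.0f} GB of weights, {headroom:+.0f} GB of headroom")
```

Under these assumptions, only the 4-bit case leaves substantial room in the pool for everything else a running model needs.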

The first workload runs large language models through Ollama compiled with CUDA and linked against TensorRT-LLM. It takes advantage of Blackwell’s FP4 quantization path to keep 120B-parameter models resident in memory; according to Nvidia’s calibration documentation, FP4 stays within about one percentage point of FP8 accuracy. The second builds a reinforcement learning system that trains a game-playing agent for the arcade title Tempest, taking over from two Threadripper towers with dual RTX 6000 cards. The third deploys a fully local vehicle detection pipeline using YOLO and DeepStream over RTSP. It generates SMS alerts only for unfamiliar cars by comparing embeddings against a household vehicle gallery, with no cloud upload required.
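
The "unfamiliar car" decision in the third workload can be illustrated with a minimal sketch. The description does not say which similarity metric or threshold the pipeline uses, so cosine similarity, the 0.8 threshold, the three-dimensional toy vectors, and all names below are assumptions made for illustration only; real embeddings would come from the detection model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_unfamiliar(embedding, gallery, threshold=0.8):
    """True when no gallery entry is at least `threshold` similar to the embedding.

    `gallery` maps a label to a stored embedding; an empty gallery treats
    every vehicle as unfamiliar. The threshold is a hypothetical value.
    """
    best = max(
        (cosine_similarity(embedding, known) for known in gallery.values()),
        default=0.0,
    )
    return best < threshold


# Toy household gallery (hypothetical labels and vectors).
gallery = {
    "family_sedan": [0.9, 0.1, 0.0],
    "pickup": [0.1, 0.9, 0.2],
}

if __name__ == "__main__":
    for label, vec in [("close to sedan", [0.88, 0.12, 0.01]),
                       ("unknown car", [0.0, 0.1, 0.95])]:
        action = "send alert" if is_unfamiliar(vec, gallery) else "ignore"
        print(f"{label}: {action}")
```

Keeping the gallery and the comparison on the device is what lets a pipeline like this raise alerts without sending any footage off the local network.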

Dave is candid about the tradeoffs: a discrete RTX 4090 outperforms the GB10 on raw FP8 throughput, but the unified memory architecture makes the GB10 more practical for multi-model pipelines that would otherwise require constant model swapping.


📺 Source: Dave’s Garage · Published January 11, 2026
🏷️ Format: Deep Dive
