My M5 Max, Gemma 4, MLX LOCAL Stack. (This KILLS MODEL PROVIDERS)


Description:

IndyDevDan runs a structured head-to-head benchmark between a fully specced Apple M5 Max MacBook Pro and its M4 Max predecessor, testing local AI inference across five model variants: Qwen 3.5 (35B) in GGUF and MLX formats, Qwen 3.5 quantized in NVIDIA's NVFP4 format, and Google's Gemma 4 in standard and MLX formats. All inference runs on Apple silicon using the MLX machine learning framework, with no cloud API dependency.
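For context on what "local inference via MLX, no cloud API" looks like in practice, here is a minimal sketch using the mlx-lm Python package. The checkpoint name is a placeholder, not the exact model used in the video.

```python
# Minimal local-inference sketch with mlx-lm (assumes `pip install mlx-lm`).
# The repo id below is a placeholder; substitute an MLX-format checkpoint
# (e.g. a 4-bit quantized build from the mlx-community hub).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/your-model-4bit")  # placeholder repo id
text = generate(
    model,
    tokenizer,
    prompt="Explain breadth-first search in two sentences.",
    max_tokens=128,
)
print(text)
```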

The benchmark methodology tracks four metrics: prefill speed (prompt ingestion), decode tokens per second, total wall-clock time, and peak RAM usage. Tests run five prompts of increasing complexity, culminating in breadth-first graph traversal tasks at 4K and 32K context windows. Key findings: MLX-optimized model variants consistently outperform their GGUF counterparts on Apple hardware; the M5 Max delivers faster prefill and decode speeds than the M4 Max across all tested models; and 32K context is a practical ceiling for sub-35B-parameter models before accuracy degrades noticeably.
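The sketch below shows one way those four metrics could be collected with mlx-lm. It approximates the methodology described in the video rather than reproducing its exact harness: the single-token run as a prefill proxy, the process-RSS figure standing in for peak RAM, and the placeholder repo id are all assumptions.

```python
# Rough benchmark-harness sketch, not the video's actual code.
import resource
import time

from mlx_lm import load, generate


def run_benchmark(repo_id: str, prompt: str, max_tokens: int = 512) -> dict:
    model, tokenizer = load(repo_id)

    # Prefill proxy: time to emit a single token, dominated by prompt ingestion.
    t0 = time.perf_counter()
    generate(model, tokenizer, prompt=prompt, max_tokens=1)
    prefill_s = time.perf_counter() - t0

    # Full run: total wall-clock time and decode tokens per second.
    t0 = time.perf_counter()
    text = generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
    wall_s = time.perf_counter() - t0
    decode_tps = len(tokenizer.encode(text)) / max(wall_s - prefill_s, 1e-6)

    # Peak RSS of this process (bytes on macOS, kilobytes on Linux):
    # a coarse stand-in for the peak RAM reading in the video.
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    return {
        "prefill_s": prefill_s,
        "wall_clock_s": wall_s,
        "decode_tok_per_s": decode_tps,
        "peak_rss": peak_rss,
    }


if __name__ == "__main__":
    # Placeholder repo id; swap in the MLX-community checkpoint you want to test.
    print(run_benchmark(
        "mlx-community/your-model-4bit",
        "Traverse this graph breadth-first and list the visit order: ...",
    ))
```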

The broader argument is quantitative: local inference on Apple silicon has matured to the point where cloud providers like Anthropic and OpenAI are no longer necessary for many workloads. GPU utilization approaching 100% and RAM usage around 55GB on the M5 Max during heavy reasoning tasks illustrate both the capability ceiling and the impressive progress in consumer-grade local AI.


📺 Source: IndyDevDan · Published April 20, 2026
🏷️ Format: Benchmark Test
