Descriptions:
At the AI Engineer conference, Ziv Ilan, a researcher in Nvidia’s AI labs team based in Paris, presents a practical framework for closing the latency gap in diffusion-based image and video generation models. Models like Flux 2, LTX 2.1, and Google’s latest video generation systems have reached high quality, but the 20–50 denoising steps they require make them too slow for real-time developer and enterprise use cases. Ilan covers three optimization layers in order of implementation complexity: quantization, caching, and step distillation.
On quantization, Ilan describes work with Black Forest Labs on Flux 2, using dynamic post-training quantization on Nvidia’s Blackwell architecture. Pre-quantized checkpoints are available on Hugging Face, and detailed examples are published in Nvidia’s open-source TensorRT-LLM (TRT-LLM) visual generation repository. For caching, he explains T-Cache and more advanced chunk-based caching, where regions of an image or video frame that are unchanged between denoising steps are identified and skipped. This offers meaningful speedups, provided the skip threshold is tuned carefully to avoid quality degradation. The technique is already integrated into vLLM, OmniGen, and other serving libraries.
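To make the chunk-based caching idea concrete, here is a minimal sketch in plain Python. It is not the T-Cache algorithm or any serving library's implementation. The names `cached_step` and `toy_denoiser`, the max-absolute-difference change metric, and the threshold value are all assumptions chosen for illustration. Each chunk's output is reused while its input has moved by less than the threshold since the last time it was recomputed.

```python
def cached_step(chunks, cache, expensive_fn, threshold):
    """Run one denoising step chunk by chunk, skipping nearly unchanged chunks.

    cache maps chunk index -> (input at last recompute, output at last recompute).
    Returns the per-chunk outputs and how many chunks were actually recomputed.
    """
    outputs, recomputed = [], 0
    for i, chunk in enumerate(chunks):
        entry = cache.get(i)
        if entry is not None:
            last_in, last_out = entry
            # Hypothetical change metric: largest absolute element difference.
            change = max(abs(a - b) for a, b in zip(chunk, last_in))
            if change < threshold:
                outputs.append(last_out)  # reuse the cached (slightly stale) output
                continue
        out = expensive_fn(chunk)
        cache[i] = (list(chunk), out)
        outputs.append(out)
        recomputed += 1
    return outputs, recomputed


def toy_denoiser(chunk):
    """Stand-in for the expensive model call (assumption, not a real denoiser)."""
    return [0.5 * v for v in chunk]


cache = {}
step1 = [[1.0, 2.0], [3.0, 4.0]]
step2 = [[1.0, 2.001], [3.5, 4.0]]  # first chunk barely moved, second changed

out1, n1 = cached_step(step1, cache, toy_denoiser, threshold=0.01)
out2, n2 = cached_step(step2, cache, toy_denoiser, threshold=0.01)
print(n1, n2)
```

In this toy run the second step recomputes only the chunk that changed. Raising the threshold skips more work but reuses staler outputs, which is the quality trade-off the talk warns about.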
The most impactful technique is step distillation: training a student diffusion model to match the teacher model’s output quality in as few as 1 to 8 steps instead of 50, enabling potential 10x to 200x throughput improvements and making real-time generation achievable. Ilan draws an analogy to DeepSeek’s model distillation work, but notes that for diffusion models the goal is step reduction rather than parameter reduction. A live demo from Nvidia’s recent GTC conference in San Jose illustrates the practical results at 1080p resolution.
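The core idea of step distillation can be shown with a toy sketch. This is not the training recipe from the talk. The "teacher" below is a hypothetical 50-step Euler loop on a scalar, and the "student" is a single multiplication whose weight is fit by least squares to reproduce the teacher's outputs. All names (`teacher_sample`, `fit_one_step_student`, `student_sample`) are made up for illustration.

```python
import random


def teacher_sample(x, steps=50):
    """Toy teacher: apply `steps` small Euler updates of dx/dt = -x."""
    dt = 1.0 / steps
    for _ in range(steps):
        x = x - dt * x  # in a real model, each step is one denoiser call
    return x


def fit_one_step_student(noise_samples):
    """Least-squares fit of a one-step student x0 = w * x_T to teacher outputs."""
    targets = [teacher_sample(x) for x in noise_samples]
    num = sum(x * y for x, y in zip(noise_samples, targets))
    den = sum(x * x for x in noise_samples)
    return num / den


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(256)]
w = fit_one_step_student(noise)


def student_sample(x):
    """One-step student: a single call replaces the teacher's 50 steps."""
    return w * x


teacher_steps, student_steps = 50, 1
step_speedup = teacher_steps / student_steps  # counts denoiser calls only
print(student_sample(2.0), teacher_sample(2.0), step_speedup)
```

The ratio computed here counts only denoiser calls per sample. It does not model any other savings, so it should not be read as a reproduction of the throughput figures quoted above.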
📺 Source: AI Engineer · Published June 16, 2026
🏷️ Format: Deep Dive







