The Ultimate LTX-Video Guide for ComfyUI: T2V, I2V, and ControlNet (Depth/Canny/Pose) Explained


Description:

This Veteran AI video serves as a comprehensive introduction to LTX-Video 2 (LTX Two) inside ComfyUI, covering five complete workflows: Text-to-Video, Image-to-Video, and ControlNet-driven generation using Depth, Canny, and Pose maps. What sets LTX Two apart from other open-source video models is its native support for synchronized audio—generated videos come with background music, sound effects, and spoken dialogue that match the prompt content.

A large portion of the video is dedicated to explaining LTX Two’s unusual dual-track architecture. Unlike traditional video models, LTX Two maintains separate Video Latent and Audio Latent streams throughout the pipeline. This means loading a standalone Audio VAE alongside the main checkpoint (which contains the Video VAE), running two-stage sampling with a split/merge pattern between stages, and decoding both tracks separately before final output. The host explains why this matters for node selection and why a specific LTXV scheduler node is required for Stage 1 Sigma generation.
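The dual-track pipeline described above can be sketched in plain Python. This is a minimal illustration of the data flow only; all class, function, and field names here are hypothetical stand-ins, not the real ComfyUI node API or LTX Two internals.

```python
# Hypothetical sketch of LTX Two's dual-track (video + audio) pipeline shape.
# Names are illustrative; the real graph uses ComfyUI nodes and tensors.

from dataclasses import dataclass

@dataclass
class Latents:
    video: list  # stand-in for the Video Latent tensor
    audio: list  # stand-in for the Audio Latent tensor

def sample_stage(latents: Latents, cfg: float) -> Latents:
    # Placeholder for a sampling pass; a real pass denoises both tracks.
    return Latents(video=[v * cfg for v in latents.video],
                   audio=[a * cfg for a in latents.audio])

def split(latents: Latents):
    # Between stages the two tracks are handled separately...
    return latents.video, latents.audio

def merge(video: list, audio: list) -> Latents:
    # ...and re-joined before the Stage 2 pass.
    return Latents(video=video, audio=audio)

def decode(latents: Latents):
    # Two decoders: the checkpoint's Video VAE for frames and the
    # standalone Audio VAE for the waveform.
    frames = latents.video    # video_vae.decode(...) in the real graph
    waveform = latents.audio  # audio_vae.decode(...) in the real graph
    return frames, waveform
```

The point of the sketch is the shape: one latent object carrying two tracks, a split/merge boundary between the two sampling stages, and two separate decode calls at the end.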

Practical guidance covers the two-stage resolution strategy—generating at half resolution in Stage 1 before a 2x upscale in Stage 2—as well as CFG value differences between stages (4.0 distilled vs. 1.0 for the upscale pass). The Image-to-Video section highlights the “Image to Video In Place” node that injects reference image information directly into the latent. All workflows are available on the official ComfyUI blog and on RunningHub.
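The two-stage schedule can be summarized as a small table of settings. The CFG values (4.0 distilled for Stage 1, 1.0 for the Stage 2 upscale pass) come from the video; the helper name and the example target resolution are assumptions for illustration, not ComfyUI nodes.

```python
# Illustrative two-stage resolution/CFG plan. CFG values are from the
# tutorial; the function and the example target size are hypothetical.

def stage_plan(target_w: int, target_h: int):
    # Stage 1 generates at half resolution with the distilled CFG of 4.0;
    # Stage 2 runs the 2x upscale pass at CFG 1.0.
    return [
        {"stage": 1, "width": target_w // 2, "height": target_h // 2, "cfg": 4.0},
        {"stage": 2, "width": target_w,      "height": target_h,      "cfg": 1.0},
    ]

# Example: for an assumed 1280x704 target, Stage 1 runs at 640x352.
plan = stage_plan(1280, 704)
```

Generating at half resolution first keeps the expensive denoising pass cheap, and the low-CFG Stage 2 pass mainly refines detail rather than re-steering the content.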


📺 Source: Veteran AI · Published January 07, 2026
🏷️ Format: Tutorial Demo
