Description:
Veteran AI provides a comprehensive technical walkthrough of LongCat-Video-Avatar 1.5 inside ComfyUI, covering three purpose-built generation workflows for different audio lengths and production contexts. The video targets practitioners who are ready to move beyond basic demo outputs and build reliable, longer-form talking avatar videos with the upgraded model.
The tutorial begins with what changed in version 1.5: the audio encoder was upgraded from Wav2Vec2 to Whisper-Large v3, giving the model a finer-grained understanding of pronunciation rhythm, multilingual cadence, and per-phoneme mouth shapes. At the same time, DMD distillation compresses generation down to eight steps, improving both output quality and inference throughput. Using Kijai’s WanVideo ComfyUI extension, the presenter walks through model loading (BF16 main model plus acceleration LoRA at weight 1.0), reference image scaling to 480×832, Whisper audio embedding via the LongCat Avatar Whisper Embeds node, and a critical frame count rule: values must have the form 4n+1 for a whole number n (e.g., 93, 149, 173), or sampling will error out.
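The 4n+1 frame count rule is easy to check before queueing a job. The following is a minimal sketch in Python, not part of the workflow itself; the helper names `is_valid_frame_count` and `snap_frame_count` are hypothetical, and rounding up (rather than down) is an assumption made so that a snapped value never falls short of the requested length.

```python
def is_valid_frame_count(frames: int) -> bool:
    """Return True if frames has the form 4n+1 for some whole number n >= 0."""
    return frames >= 1 and (frames - 1) % 4 == 0


def snap_frame_count(frames: int) -> int:
    """Round frames up to the nearest value of the form 4n+1.

    Hypothetical helper: rounding up is an assumption, chosen so the
    result is never shorter than what was asked for.
    """
    if frames < 1:
        raise ValueError("frames must be at least 1")
    n = (frames + 2) // 4  # integer ceiling of (frames - 1) / 4
    return 4 * n + 1


if __name__ == "__main__":
    # The examples quoted in the description all satisfy the rule.
    for value in (93, 149, 173):
        print(value, is_valid_frame_count(value))
    # A value that breaks the rule, and the next valid value above it.
    print(94, is_valid_frame_count(94), snap_frame_count(94))
```

Valid inputs pass through `snap_frame_count` unchanged, so it can be applied unconditionally to whatever frame count is typed into the node.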
The three workflows compared are: single (one sampling pass, recommended for clips under roughly ten seconds), extend (manual segment-by-segment chaining for longer audio), and auto-extend (automatic looping keyed to audio duration). The extend and auto-extend workflows introduce frames and overlap parameters that control temporal continuity between segments. The video also dedicates time to prompt strategy, demonstrating that LongCat Avatar is audio-driven rather than motion-driven: gestures like head turns, waves, or camera movement must be explicitly written into the positive prompt to appear.
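To get a feel for how the frames and overlap parameters interact with audio length, the sketch below estimates how many sampling passes a clip would need. It rests on assumptions the description does not confirm: that the first pass yields `frames` frames, that every later pass contributes `frames - overlap` new frames, and that the output frame rate is passed in as `fps`. The function name `plan_segments` and the values in the example are illustrative only.

```python
import math


def plan_segments(audio_seconds: float, fps: int, frames: int, overlap: int) -> tuple[int, int]:
    """Estimate (passes, frames_covered) for a chained generation.

    Assumed model (not confirmed by the source): the first pass yields
    `frames` frames and each later pass adds `frames - overlap` new ones.
    """
    if audio_seconds <= 0 or fps <= 0:
        raise ValueError("audio_seconds and fps must be positive")
    if not 0 <= overlap < frames:
        raise ValueError("overlap must be non-negative and smaller than frames")

    needed = math.ceil(audio_seconds * fps)  # frames required to cover the audio
    if needed <= frames:
        return 1, frames  # a single pass is enough

    step = frames - overlap                # new frames gained per extra pass
    extra = -(-(needed - frames) // step)  # integer ceiling division
    return 1 + extra, frames + extra * step


if __name__ == "__main__":
    # Illustrative values only: 30 s of audio, 16 fps, 93 frames, 13 overlap.
    passes, covered = plan_segments(30.0, 16, 93, 13)
    print(f"passes={passes}, frames covered={covered}")
```

Under these assumptions, a larger overlap smooths the transition between segments at the cost of more passes for the same audio duration.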
📺 Source: Veteran AI · Published June 03, 2026
🏷️ Format: Tutorial Demo







