Description:
SkyReels V3 introduces three new model variants — Video-to-Video, Reference Generation, and Audio-to-Video — and this Veteran AI video walks through running all three inside ComfyUI with Kijai’s conversion weights from HuggingFace. The tutorial focuses primarily on the Audio-to-Video (A2V) pipeline, which takes a reference image and an audio clip and generates lip-synced video output with high subject consistency.
The A2V workflow loads the SkyReels V3 A2V model alongside the Wan 2.1 VAE, processes the reference image to dimensions divisible by 16, encodes it into an image embedding, and connects the vocal audio encoding as an extra parameter to the sampler. Generation happens in 81-frame chunks using a sliding window mechanism; for audio longer than roughly 3 seconds, a video extension section chains additional segments. The presenter added a Color Match node to the extension pipeline to reduce inter-segment flickering that was present in the original Kijai workflow, though some flicker remains — an honest limitation the video calls out directly.
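The two preprocessing steps described above — snapping the reference image to dimensions divisible by 16 and splitting long generations into 81-frame sliding windows — can be sketched roughly as follows. This is an illustrative Python sketch, not the actual node code; the overlap value is an assumption, as the video does not state the workflow's exact window stride.

```python
def snap_to_multiple(x: int, base: int = 16) -> int:
    """Round a dimension down to the nearest multiple of `base` (16 here,
    per the workflow's divisibility requirement)."""
    return max(base, (x // base) * base)

def sliding_windows(total_frames: int, window: int = 81, overlap: int = 8):
    """Yield (start, end) frame ranges for chunked generation.
    `overlap=8` is a hypothetical value for illustration; the actual
    extension workflow's overlap may differ."""
    step = window - overlap
    start = 0
    while start < total_frames:
        end = min(start + window, total_frames)
        yield (start, end)
        if end == total_frames:
            break
        start += step

# Example: a 1080-pixel dimension snaps to 1072; a clip of 150 frames
# needs two overlapping windows.
print(snap_to_multiple(1080))
print(list(sliding_windows(150)))
```

A clip that fits in a single 81-frame window never enters the extension path, which is consistent with the video's advice to prefer one-shot generation for short audio.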
For shorter clips, one-shot generation is strongly recommended over the extension method. A head-to-head quality comparison with Wan 2.1 Infinite Talk concludes that under similar conditions the two models perform comparably, which may factor into workflow selection. The Reference Generation (R2V) mode is also briefly covered, using a Phantom-based workflow that accepts up to four reference images and pads portrait frames to landscape format for scene generation.
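The portrait-to-landscape padding step in the R2V workflow can be sketched as a simple aspect-ratio calculation. This is a hedged illustration: the target ratio of 16:9 and symmetric padding are assumptions, since the video does not specify the exact padding scheme.

```python
def pad_to_landscape(width: int, height: int, target_ratio: float = 16 / 9):
    """Compute symmetric left/right padding (in pixels) to bring a
    portrait frame up to a landscape aspect ratio.
    `target_ratio=16/9` is an assumed value for illustration."""
    target_width = int(round(height * target_ratio))
    if target_width <= width:
        return (0, 0)  # frame is already landscape enough
    extra = target_width - width
    left = extra // 2
    right = extra - left
    return (left, right)

# Example: a 720x1280 portrait frame gets 778 px of padding per side
# to reach a 16:9 canvas.
print(pad_to_landscape(720, 1280))
```

Padding rather than cropping preserves the full subject from each reference image, which matters when up to four references must coexist in one generated scene.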
📺 Source: Veteran AI · Published February 06, 2026
🏷️ Format: Tutorial Demo
