Description:
Netflix has open-sourced VOID (Video Object and Interaction Deletion), a 5-billion-parameter model that removes objects from videos while simulating the causal consequences of their absence. Rather than leaving a clean hole, VOID reasons about what the removed object was doing to its environment: erase the person who kicked a ball and the ball simply sits still, because the kick never happened. Fahd Mirza runs a live local demonstration on Ubuntu with an NVIDIA RTX A6000 (48 GB VRAM), showing the model consuming just under 13 GB during inference.
Architecturally, VOID works in three stages: a vision-language model identifies everything causally connected to the selected object and encodes it into a guidance mask; a video diffusion model (built on CogVideo) regenerates the scene as if the object had never been present; and a second refinement pass uses motion flow from the first result to smooth shape instability in moving objects. Mirza tests two scenes, removing a glass from a table and erasing a duck from a ball-rolling shot, with inference taking roughly two minutes per clip.
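The three-stage flow described above can be sketched in plain Python. This is purely an illustrative mock of the described pipeline, not the actual VOID API: every function name and data structure below is an assumption, and the "models" are stubbed out with simple logic.

```python
from dataclasses import dataclass

# Illustrative sketch of the three-stage pipeline described in the video.
# All names here are hypothetical; the real model operates on video tensors,
# not the string labels used in this toy example.

@dataclass
class GuidanceMask:
    """Covers the target object plus everything causally tied to it."""
    target: str
    causal_set: list  # objects whose motion depended on the target

def identify_causal_set(scene_objects, target, interactions):
    """Stage 1 (vision-language model): find objects the target acted on."""
    causal = [b for (a, b) in interactions if a == target and b in scene_objects]
    return GuidanceMask(target=target, causal_set=causal)

def regenerate(scene_objects, mask):
    """Stage 2 (video diffusion): re-render the scene without the target,
    resetting causally affected objects to their undisturbed state."""
    scene = {obj: "original motion" for obj in scene_objects if obj != mask.target}
    for obj in mask.causal_set:
        scene[obj] = "at rest (interaction never happened)"
    return scene

def refine(scene):
    """Stage 3: smooth shape instability using motion flow from pass one."""
    return {obj: state + ", flow-smoothed" for obj, state in scene.items()}

# Example mirroring the video: erase the person who kicked the ball.
interactions = [("person", "ball")]
mask = identify_causal_set(["person", "ball", "table"], "person", interactions)
result = refine(regenerate(["person", "ball", "table"], mask))
```

The key design idea the sketch captures is that stage one widens the edit region beyond the object's own pixels to its causal footprint, so stage two regenerates the ball at rest rather than inpainting around a still-rolling ball.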
Results are promising but imperfect: surface textures can appear smoother than in the original, and complex interactions still produce minor artifacts. Mirza built a lightweight Gradio interface on top of the official demo to make local experimentation more accessible. For video editors and researchers exploring open-source generative video tools, VOID represents a meaningful step beyond traditional inpainting toward physically aware, causally consistent object removal.
📺 Source: Fahd Mirza · Published April 15, 2026
🏷️ Format: Hands On Build