LayerT2V: Interactive Multi-Object
Trajectory Layering for Video Generation

1Shanghai Jiao Tong University    2Nanjing University    3Shanghai Innovation Institute   
LayerT2V pipeline diagram

We observe that existing motion control methods in T2V either lack support for multi-object motion scenes or suffer severe performance degradation when object trajectories intersect, primarily due to semantic conflicts in the colliding regions. To address this, we introduce LayerT2V, the first approach that generates video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct “layer” and thus facilitating coherent multi-object synthesis while enhancing control over the generation process.

Comparisons on Multi-Object Motion Control




Left to right: MotionCtrl [1], Peekaboo [2], Direct-a-Video [3], and ours.

Qualitative comparison of colliding-object motion control. We compare our method against MotionCtrl [1], Peekaboo [2], and Direct-a-Video [3]. Our method excels at handling cases involving more than one object with colliding motions. Furthermore, we account for both static and moving trajectories and show that our model outperforms the others.

Extensive Blending

We present a case demonstrating our model's ability to generate multiple objects iteratively. For clarity, we replace bounding boxes with arrows.



Method

We carefully design our layer-customized module to handle multi-object scene generation while preserving harmony and consistency across layers. It comprises three critical components: (1) Guided Cross-Attention, (2) Oriented Attention-Sharing, and (3) Attention-Isolation, along with two key techniques: (4) Keyframe Amplification and (5) Post-Harmonization.

Illustration of Proposed Methods



Spatial cross-attention plays an essential role in T2V generation, as it is the only pathway through which the prompt is embedded into the latent representations. It is therefore crucial to investigate how to steer spatial cross-attention toward the desired outcome. Rather than using a linear additive mask, we employ a Gaussian function to smoothly construct an additive mask over the bounding-box area, scaled by an influence coefficient λ, allowing it to guide attention without disrupting the attention values and thereby preserving the quality of the generated content.
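
To make this concrete, below is a minimal PyTorch sketch of such a Gaussian additive attention bias. The function names, the sigma scaling, the default λ = 2.0, and the tensor shapes are illustrative assumptions rather than the paper's exact implementation.

import torch

def gaussian_bbox_mask(h, w, bbox, sigma_scale=0.5):
    # Smooth additive mask peaking inside bbox = (x0, y0, x1, y1), given in
    # latent-grid coordinates. The Gaussian is centered on the box and its
    # std is proportional to the box size (sigma_scale is an assumption),
    # so the bias fades out smoothly instead of cutting off at the box edge.
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max(x1 - x0, 1.0) * sigma_scale
    sy = max(y1 - y0, 1.0) * sigma_scale
    ys = torch.arange(h, dtype=torch.float32).unsqueeze(1)  # (h, 1)
    xs = torch.arange(w, dtype=torch.float32).unsqueeze(0)  # (1, w)
    return torch.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)

def guided_cross_attention(q, k, v, mask_hw, token_ids, lam=2.0):
    # q: (B, h*w, d) latent queries; k, v: (B, n_text, d) text keys/values.
    # Only the logits of the prompt tokens describing the object (token_ids)
    # receive the spatial bias, scaled by the influence coefficient lam.
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5        # (B, h*w, n_text)
    bias = lam * mask_hw.flatten()                     # (h*w,)
    logits[:, :, token_ids] = logits[:, :, token_ids] + bias[None, :, None]
    return torch.softmax(logits, dim=-1) @ v

Because the bias decays smoothly toward zero outside the box, the softmax still distributes attention naturally near the boundary, avoiding the hard seams a binary (linear) additive mask can introduce.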


Visualization of Alpha Mask


To better illustrate the transparency relationships between multiple layers of objects, we visualize the alpha masks of each layer as well as the blended alpha mask of the foreground objects for one set of results.
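
To make the layer transparency concrete, here is a minimal sketch of back-to-front alpha compositing with the standard “over” operator, which also accumulates the blended foreground alpha visualized above. Tensor shapes and layer ordering are illustrative assumptions.

import torch

def composite_layers(background, layers):
    # background: (3, H, W) RGB; layers: list of (rgb, alpha) pairs with
    # rgb (3, H, W) and alpha (1, H, W) in [0, 1], ordered back to front.
    out = background.clone()
    fg_alpha = torch.zeros_like(layers[0][1])
    for rgb, alpha in layers:
        out = alpha * rgb + (1.0 - alpha) * out       # standard "over" blend
        fg_alpha = alpha + (1.0 - alpha) * fg_alpha   # accumulated coverage
    return out, fg_alpha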

Extension 1: Layer Transplantation

We observe that generated layers, even without specialized attributes such as reflections, offer significant practical value when transplanted into other videos. Moreover, the transparency of these layers allows flexible scaling, repositioning, and seamless overlay onto diverse backgrounds.
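
As a sketch of such a transplant, the snippet below rescales an RGBA layer bilinearly and alpha-blends it onto a new background at a chosen position; the scale and placement values are hypothetical, and the resized layer is assumed to fit inside the background.

import torch
import torch.nn.functional as F

def transplant_layer(rgb, alpha, background, scale=0.75, top_left=(40, 60)):
    # rgb: (3, H, W), alpha: (1, H, W), background: (3, Hb, Wb), all in [0, 1].
    layer = torch.cat([rgb, alpha], dim=0).unsqueeze(0)  # (1, 4, H, W)
    layer = F.interpolate(layer, scale_factor=scale,
                          mode="bilinear", align_corners=False)[0]
    _, h, w = layer.shape
    y, x = top_left
    out = background.clone()
    a = layer[3:4]                                       # resized alpha channel
    out[:, y:y + h, x:x + w] = (a * layer[:3]
                                + (1.0 - a) * out[:, y:y + h, x:x + w])
    return out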

Extension 2: Interaction between Foregrounds

If interactions at the same depth are desired, multiple objects can be combined into a single group, as an extension of our method.

Acknowledgments

We would like to sincerely thank Ming-Hsuan Yang at the University of California, Merced, and Kelvin C.K. Chan at Google DeepMind for their insightful discussions and generous support. We also thank Yangnan Lin for his help in benchmarking our model.


References

[1] Wang Z, Yuan Z, Wang X, et al. MotionCtrl: A unified and flexible motion controller for video generation[C]//ACM SIGGRAPH 2024 Conference Papers. 2024: 1-11.

[2] Jain Y, Nasery A, Vineet V, et al. Peekaboo: Interactive video generation via masked-diffusion[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 8079-8088.

[3] Yang S, Hou L, Huang H, et al. Direct-a-Video: Customized video generation with user-directed camera movement and object motion[C]//ACM SIGGRAPH 2024 Conference Papers. 2024: 1-12.

BibTeX

@misc{cen2025layert2vinteractivemultiobjecttrajectory,
      title={LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation}, 
      author={Kangrui Cen and Baixuan Zhao and Yi Xin and Siqi Luo and Guangtao Zhai and Xiaohong Liu},
      year={2025},
      eprint={2508.04228},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.04228}, 
    }