MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

1University of Science and Technology of China, 2FanqieAI, ByteDance China, 3Hong Kong University of Science and Technology, 4Wuhan University
arXiv: 2510.18692 · GitHub Code (coming soon) · Hugging Face Model (coming soon)

🎯 Abstract

Do you want to generate a short film?

Long video generation with diffusion transformers is bottlenecked by the quadratic scaling of full attention with sequence length. Because attention is highly redundant, outputs are dominated by a small subset of query–key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy–efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention mechanism that uses a lightweight, learnable token router to match tokens precisely, without blockwise estimation. Through semantics-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, it integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that produces ⚡ minute-level, multi-shot, 480p videos at 24 FPS end to end, with a context length of approximately 580K tokens. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
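
Since the code is not yet released, the following is a minimal, hypothetical sketch of what a MoGA-style layer could look like, inferred only from the abstract: a lightweight linear router assigns each token to one group, and full attention runs independently within each group, so cost scales with group size rather than total sequence length. All names (MoGAttention, num_groups) and the hard top-1 routing are our assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoGAttention(nn.Module):
    """Hypothetical mixture-of-groups attention sketch (not the paper's code)."""

    def __init__(self, dim: int, num_heads: int, num_groups: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.num_groups = num_groups
        self.qkv = nn.Linear(dim, 3 * dim)
        self.router = nn.Linear(dim, num_groups)  # lightweight learnable token router
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        H = self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Hard top-1 routing: each token joins exactly one group. (Argmax is
        # non-differentiable; training would need a surrogate such as an
        # auxiliary balancing loss — the paper's actual objective is unknown to us.)
        group_id = self.router(x).argmax(dim=-1)  # (B, N)
        out = torch.zeros_like(q)
        for b in range(B):
            for g in range(self.num_groups):
                idx = (group_id[b] == g).nonzero(as_tuple=True)[0]
                if idx.numel() == 0:
                    continue
                # Full attention within one group only. Since this is just a
                # dense attention call on a gathered subset (kernel-free), any
                # fused SDPA/FlashAttention backend applies unchanged.
                qg = q[b, idx].view(-1, H, D // H).transpose(0, 1).unsqueeze(0)
                kg = k[b, idx].view(-1, H, D // H).transpose(0, 1).unsqueeze(0)
                vg = v[b, idx].view(-1, H, D // H).transpose(0, 1).unsqueeze(0)
                og = F.scaled_dot_product_attention(qg, kg, vg)
                out[b, idx] = og.squeeze(0).transpose(0, 1).reshape(-1, D)
        return self.proj(out)

# Toy usage: route a 1024-token sequence into 8 groups.
layer = MoGAttention(dim=64, num_heads=4, num_groups=8)
y = layer(torch.randn(2, 1024, 64))  # -> (2, 1024, 64)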

Minute-level Multi-shot Long Video Example

✨ Pipeline

🎬 More Results

Gallery

📚 Citation

BibTeX
@article{jia2025moga,
  title         = {MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation},
  author        = {Weinan Jia and Yuning Lu and Mengqi Huang and Hualiang Wang and Binyuan Huang and Nan Chen and Mu Liu and Jidong Jiang and Zhendong Mao},
  year          = {2025},
  eprint        = {2510.18692},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  note          = {arXiv preprint},
  url           = {https://arxiv.org/abs/2510.18692},
}

If you find our work useful in your research, please consider citing our paper.