MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

1University of Science and Technology of China, 2FanqieAI, ByteDance China, 3Hong Kong University of Science and Technology, 4Wuhan University
arXiv: 2510.18692 · GitHub Code (coming soon) · Hugging Face Model (coming soon)

🎯 Abstract

Do you want to generate a short film?

Long video generation with diffusion transformers is bottlenecked by the quadratic scaling of full attention with sequence length. Because attention is highly redundant, outputs are dominated by a small subset of query–key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy–efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention mechanism that uses a lightweight, learnable token router to match tokens precisely, without blockwise estimation. Through semantics-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, it integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that produces ⚡ minute-level, multi-shot, 480p videos at 24 FPS end to end, with a context length of approximately 580K tokens. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
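
Since the code is not yet released, the following is a minimal, hypothetical sketch of what a MoGA-style layer could look like, inferred only from the abstract: a lightweight linear router assigns each token to one group, and full attention runs independently within each group, so cost scales with group size rather than total sequence length. All names (MoGAttention, num_groups) and the hard top-1 routing are our assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoGAttention(nn.Module):
    """Hypothetical mixture-of-groups attention sketch (not the paper's code)."""

    def __init__(self, dim: int, num_heads: int, num_groups: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.num_groups = num_groups
        self.qkv = nn.Linear(dim, 3 * dim)
        self.router = nn.Linear(dim, num_groups)  # lightweight learnable token router
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        H = self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Hard top-1 routing: each token joins exactly one group. (Argmax is
        # non-differentiable; training would need a surrogate such as an
        # auxiliary balancing loss — the paper's actual objective is unknown to us.)
        group_id = self.router(x).argmax(dim=-1)  # (B, N)
        out = torch.zeros_like(q)
        for b in range(B):
            for g in range(self.num_groups):
                idx = (group_id[b] == g).nonzero(as_tuple=True)[0]
                if idx.numel() == 0:
                    continue
                # Full attention within one group only. Since this is just a
                # dense attention call on a gathered subset (kernel-free), any
                # fused SDPA/FlashAttention backend applies unchanged.
                qg = q[b, idx].view(-1, H, D // H).transpose(0, 1).unsqueeze(0)
                kg = k[b, idx].view(-1, H, D // H).transpose(0, 1).unsqueeze(0)
                vg = v[b, idx].view(-1, H, D // H).transpose(0, 1).unsqueeze(0)
                og = F.scaled_dot_product_attention(qg, kg, vg)
                out[b, idx] = og.squeeze(0).transpose(0, 1).reshape(-1, D)
        return self.proj(out)

# Toy usage: route a 1024-token sequence into 8 groups.
layer = MoGAttention(dim=64, num_heads=4, num_groups=8)
y = layer(torch.randn(2, 1024, 64))  # -> (2, 1024, 64)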

Minute-level Multi-shot Long Video Example

✨ Pipeline

🎬 More Results

Gallery

📚 Citation

BibTeX
@article{jia2025moga,
  title         = {MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation},
  author        = {Weinan Jia and Yuning Lu and Mengqi Huang and Hualiang Wang and Binyuan Huang and Nan Chen and Mu Liu and Jidong Jiang and Zhendong Mao},
  year          = {2025},
  eprint        = {2510.18692},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  note          = {arXiv preprint},
  url           = {https://arxiv.org/abs/2510.18692},
}

If you find our work useful in your research, please consider citing our paper.