A Scalable Framework for Text-Driven Mesh Motion Generation
Animating static meshes into realistic motion conditioned on text prompts remains a fundamental yet challenging problem. Existing skeleton-based approaches suffer from limited semantic expressiveness, skinning artifacts, and the manual effort required for data curation, all of which restrict scalability. To address these limitations, we propose AniMuse, a semantic-aware and scalable framework for text-driven mesh motion generation. We introduce AniMuse Gaussian Nodes, a novel rigging representation that unifies geometric control with semantic information by combining learnable Gaussian control nodes with visual semantic features, without requiring predefined skeletons or manual annotation. This representation enables high-fidelity, semantically consistent deformation while aligning naturally with text semantics. Building on it, we develop a semantic-aware diffusion model that generates an SE(3) trajectory for each control node. Because our pipeline assumes no fixed skeleton topology or category-specific prior, it generalizes naturally across mesh categories. We demonstrate its effectiveness on animal motion generation and further curate MuseumAnimal, a large-scale animal mesh motion dataset, to validate the scalability of our skeleton-free paradigm. Comprehensive experiments show that AniMuse achieves superior performance on both in-domain and out-of-domain benchmarks.
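To make the control-node idea concrete, the sketch below shows how a set of Gaussian control nodes can drive a mesh: each vertex receives soft skinning weights from Gaussian kernels centered at the nodes, and a per-node SE(3) transform (rotation plus translation about the node center) is blended across vertices. This is a minimal illustration under assumed simplifications: the function names, isotropic node scales, and the plain Gaussian weighting are hypothetical, and the actual AniMuse representation additionally carries learned visual semantic features and obtains the per-node SE(3) trajectories from a diffusion model rather than by hand.

```python
import numpy as np

def gaussian_node_weights(vertices, node_centers, node_scales):
    """Soft skinning weights from isotropic Gaussian kernels (illustrative).

    vertices:     (V, 3) mesh vertex positions
    node_centers: (N, 3) control-node centers
    node_scales:  (N,)   per-node Gaussian scales (assumed isotropic here)
    """
    # Squared distance from every vertex to every node: (V, N)
    d2 = ((vertices[:, None, :] - node_centers[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * node_scales[None, :] ** 2))
    return w / w.sum(axis=1, keepdims=True)  # normalize weights per vertex

def deform(vertices, node_centers, weights, rotations, translations):
    """Blend per-node rigid SE(3) transforms (R_i, t_i) applied about each node center."""
    # Vertex positions relative to each node center: (V, N, 3)
    local = vertices[:, None, :] - node_centers[None, :, :]
    # Rotate about the node center, then translate: (V, N, 3)
    moved = (np.einsum('nij,vnj->vni', rotations, local)
             + node_centers[None, :, :] + translations[None, :, :])
    # Weighted blend of the per-node predictions for each vertex
    return (weights[:, :, None] * moved).sum(axis=1)
```

A motion clip is then just a sequence of (rotations, translations) per frame; applying `deform` frame by frame animates the mesh without any predefined skeleton. With identity rotations and zero translations the mesh is left unchanged, since the normalized weights blend identical copies of each vertex.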
Demo data is extracted from Planet Zoo using code provided by Animo. Visual design references: Ruinart Unconventional Gallery, UNESCO Virtual Museum, Microsoft × NHM Visions of Nature.