Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy -- perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.
InterMimic is grounded in one key insight: skill perfection and skill integration are best tackled progressively rather than all at once. We implement a curriculum-based teacher-student distillation framework in which multiple teacher policies each imitate, retarget, and refine a small subset of interactions, and a single student policy then integrates the skills learned by the teachers.
Teacher policy
Student policy
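The teacher-student scheme described above can be sketched in a few lines. This is a minimal illustration, not the InterMimic implementation: the linear schedule, the mean-squared-error distillation term, and the names `distill_weight`, `student_loss`, `teachers`, and `rl_loss` are all assumptions made for the example. It shows one plausible way to move the student from direct teacher supervision toward RL fine-tuning over the course of training.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Policy = Callable[[Sequence[float]], Vector]


def distill_weight(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule: 1.0 (pure distillation) down to 0.0 (pure RL)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length action vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("action vectors must be non-empty and equal in length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_loss(
    state: Sequence[float],
    subject_id: str,
    student: Policy,
    teachers: Dict[str, Policy],
    rl_loss: float,
    step: int,
    total_steps: int,
) -> float:
    """Blend teacher supervision with an RL objective (assumed form).

    The teacher for the clip's subject is queried online on the state the
    student actually visits, so it acts as an expert providing direct
    supervision. `rl_loss` stands in for whatever RL objective is used.
    """
    teacher_action = teachers[subject_id](state)
    student_action = student(state)
    w = distill_weight(step, total_steps)
    return w * mse(student_action, teacher_action) + (1.0 - w) * rl_loss


if __name__ == "__main__":
    # Toy policies: one subject-specific teacher and an untrained student.
    teachers = {"subject_1": lambda s: [2.0 * x for x in s]}
    student = lambda s: [1.0 * x for x in s]
    for step in (0, 5, 10):
        loss = student_loss([1.0, 2.0], "subject_1", student, teachers,
                            rl_loss=0.7, step=step, total_steps=10)
        print(step, loss)
```

Early in training the loss is dominated by agreement with the teacher, and later by the RL term, which is one way to let the student go beyond replicating demonstrations.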
InterMimic can serve as a post-processing tool for MoCap by correcting contact artifacts.
InterMimic not only tolerates flaws in the reference data, such as incorrect hand positioning and floating contacts, but also corrects them, whereas the baseline method fails to do so.
Leveraging large-scale training, InterMimic exhibits strong zero-shot generalization.
By tracking sequentially concatenated MoCap sequences, our student policy synthesizes a combined sequence of kicking, lifting, and relocating that was never seen during training, highlighting the policy's compositionality.
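As a rough illustration of what a sequentially concatenated reference could look like, the sketch below joins clips end to end. It is not taken from InterMimic: the `Frame` type (root position only), the function name `concatenate_clips`, and the choice to shift each clip horizontally so that it starts where the previous one ended are all assumptions made for the example. A real reference would also carry body pose and object state.

```python
from typing import List, Tuple

# Hypothetical frame: root position (x, y, z) only, with z as height.
Frame = Tuple[float, float, float]


def concatenate_clips(clips: List[List[Frame]]) -> List[Frame]:
    """Join clips end to end into one reference sequence.

    Each clip after the first is translated in x and y so that its first
    frame lines up horizontally with the last frame already merged.
    Height (z) is left untouched, and no blending is done at the seam.
    """
    merged: List[Frame] = []
    for clip in clips:
        if not clip:
            continue  # ignore empty clips
        dx = dy = 0.0
        if merged:
            dx = merged[-1][0] - clip[0][0]
            dy = merged[-1][1] - clip[0][1]
        merged.extend((x + dx, y + dy, z) for x, y, z in clip)
    return merged


if __name__ == "__main__":
    kick = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    lift = [(5.0, 5.0, 0.9), (6.0, 5.0, 0.9)]
    print(concatenate_clips([kick, lift]))
```

The abrupt seam is left as is on purpose: the resulting sequence is only a tracking target, and a physics-based policy has to produce a feasible transition itself.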
InterMimic operates with novel objects from BEHAVE [Bhatnagar et al. (2022)] and NeuralDome [Zhang et al. (2023)], demonstrating the effectiveness of our object geometry and contact-encoded representation.
@inproceedings{xu2025intermimic,
title={InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions},
author={Xu, Sirui and Ling, Hung Yu and Wang, Yu-Xiong and Gui, Liangyan},
booktitle={CVPR},
year={2025}
}