InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions

Sirui Xu¹, Hung Yu Ling², Yu-Xiong Wang¹^†, Liang-Yan Gui¹^†

¹University of Illinois Urbana-Champaign, ²Electronic Arts

^†Equal Advising

CVPR 2025 Highlight

InterMimic enables simulated humans to perform physical interactions, featuring scalable skill learning covering diverse objects.

Abstract

Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy -- perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.

Results

Interactive Motor Skills by SMPL-X Humanoids

Interactive Motor Skills by Unitree G1 with Inspire Hands

Retargeting by RL + Forward Dynamics: A better solution for Contact-Rich HOIs

Scaling Up Whole-Body Interactive Skills

InterMimic is grounded in the key insight that skill perfection and skill integration should be tackled progressively. We implement a curriculum-based teacher-student distillation framework, where multiple teacher policies each focus on imitating, retargeting, and refining a small subset of interactions, and a single student policy integrates the skills learned by the teachers.
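The distillation step can be illustrated with a deliberately small sketch. It assumes linear policies, random states in place of simulated rollouts, and a plain mean-squared-error objective; the names `features` and `distill` are hypothetical. The RL fine-tuning and the physics simulation described above are omitted, so this shows only the idea of several teachers labeling actions for one student.

```python
import numpy as np

def features(state, subset, n_subsets):
    """Place the state in the block of its subset so one linear student can cover every teacher."""
    return np.kron(np.eye(n_subsets)[subset], state)

def distill(teachers, steps=300, batch=32, lr=0.3, seed=0):
    """Fit one linear student to several linear teachers (assumed toy setup).

    teachers: list of (state_dim, action_dim) matrices, one per data subset.
    Returns the student weights and the per-step imitation loss.
    """
    rng = np.random.default_rng(seed)
    n = len(teachers)
    state_dim, action_dim = teachers[0].shape
    student = np.zeros((n * state_dim, action_dim))
    losses = []
    for _ in range(steps):
        # Stand-in for states the student would visit in its own rollouts.
        subsets = rng.integers(0, n, size=batch)
        states = rng.normal(size=(batch, state_dim))
        x = np.stack([features(s, k, n) for s, k in zip(states, subsets)])
        # Each state is labeled online by the teacher responsible for its subset.
        target = np.stack([s @ teachers[k] for s, k in zip(states, subsets)])
        err = x @ student - target
        losses.append(float(np.mean(err ** 2)))
        # Gradient step on the squared imitation error.
        student -= lr * (x.T @ err) / batch
    return student, losses

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    teachers = [rng.normal(size=(4, 2)) for _ in range(3)]
    student, losses = distill(teachers)
    print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.2e}")
```

In this toy setting the student can represent all teachers exactly, so the loss goes to zero. With real policies the student would only approximate the teachers, which is where further RL fine-tuning becomes useful.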

Teacher policy

Student Policy

Comparison to ground truth interactions from MoCap

InterMimic can serve as a post-processing tool for MoCap by correcting contact artifacts.

Comparison to baselines on MoCap imitation

InterMimic not only tolerates flaws in the reference data but also corrects them, fixing issues such as incorrect hand positioning and floating contacts that the baseline method fails to correct.

Downstream Applications

Leveraging large-scale training, InterMimic achieves strong generalization and seamlessly integrates with kinematic generators. This elevates the framework beyond imitation to generative modeling, including predicting future interactions from past motion and generating interactions from text prompts.
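The coupling between a kinematic generator and a physics-based tracker can be pictured with a one-dimensional toy loop. Everything here is an assumption made for illustration: `kinematic_generator` is a hypothetical stand-in that emits a sine trajectory, and a PD controller on a unit point mass stands in for the learned policy and the simulator. It is not the InterMimic controller, only the shape of the pipeline: the generator proposes targets without physics, and the tracker reproduces them under dynamics.

```python
import numpy as np

def kinematic_generator(n_frames, dt):
    """Hypothetical stand-in for a kinematic model: target positions only, no physics."""
    t = np.arange(n_frames) * dt
    return np.sin(t)

def track(reference, dt, kp=400.0, kd=40.0):
    """Track the reference with a PD controller on a unit point mass.

    The PD law plays the role of the tracking policy; the Euler update plays
    the role of the physics simulator. Both are assumptions for this sketch.
    """
    pos, vel = reference[0], 0.0
    traj = []
    for target in reference:
        force = kp * (target - pos) - kd * vel  # controller output
        vel += force * dt                       # unit mass, semi-implicit Euler
        pos += vel * dt
        traj.append(pos)
    return np.array(traj)

if __name__ == "__main__":
    dt = 1.0 / 60.0
    reference = kinematic_generator(300, dt)
    simulated = track(reference, dt)
    print("max tracking error:", np.max(np.abs(simulated - reference)))
```

The simulated trajectory lags the reference slightly because the tracker must obey dynamics. A learned policy faces the same constraint, which is also why it can reject physically implausible parts of a generated reference.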

Interaction Prediction: InterDiff [Xu et al. (2023)] + InterMimic

Text to Interaction: HOI-Diff [Peng et al. (2023)] + InterMimic

Zero-Shot Cases

Leveraging large-scale training, InterMimic exhibits strong zero-shot generalization.

Generalization on Multi-Object Scenarios

InterMimic can synthesize a sequence of kicking, lifting, and relocating objects. The student policy produces it by tracking sequentially concatenated MoCap sequences, a combination unseen during training, which highlights the policy's compositionality.
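A minimal sketch of how separate reference clips might be chained into one tracking target is shown below. The function `concatenate_clips` and the linear cross-fade between clips are assumptions for illustration; the page does not state how the sequences are joined. Linear blending is only appropriate for positional channels, and rotations would need proper interpolation such as slerp.

```python
import numpy as np

def concatenate_clips(clips, blend_frames=5):
    """Join clips of shape (frames, dofs), inserting a linear bridge between neighbors.

    The bridge interpolates from the last frame of one clip to the first frame
    of the next, so the reference has no jump at the seam.
    """
    out = [clips[0]]
    for nxt in clips[1:]:
        prev_last = out[-1][-1]
        # Interior interpolation weights only; the endpoints are already in the clips.
        w = np.linspace(0.0, 1.0, blend_frames + 2)[1:-1, None]
        bridge = (1.0 - w) * prev_last + w * nxt[0]
        out.extend([bridge, nxt])
    return np.concatenate(out, axis=0)

if __name__ == "__main__":
    kick = np.zeros((10, 3))
    lift = np.ones((8, 3))
    relocate = np.full((6, 3), 3.0)
    reference = concatenate_clips([kick, lift, relocate], blend_frames=4)
    print(reference.shape)
```

The resulting array can be handed to a tracking policy as a single reference, even though the policy never saw this particular ordering of clips.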

Generalization on Novel Objects

InterMimic operates with novel objects from BEHAVE [Bhatnagar et al. (2022)] and NeuralDome [Zhang et al. (2023)], demonstrating the effectiveness of our object geometry and contact-encoded representation.