Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence

Yutong Chen*,1, Yifan Zhan*,1,2, Zhihang Zhong†,1, Wei Wang1, Xiao Sun†,1, Yu Qiao1, Yinqiang Zheng2
1Shanghai AI Laboratory, OpenGVLab, 2The University of Tokyo
*Co-first authors, †Co-corresponding authors
Teaser figure

We emphasize that appearance variations depend not only on different static poses but can also be induced by inertia, such as the graceful settling of a dress drape after a sudden stop in motion. Compared with previous methods that rely solely on static poses, we encode the past pose trajectory as a pose sequence to accurately capture such dynamic effects.
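To make this conditioning concrete, below is a minimal sketch of how a delta pose sequence can be assembled from the past pose trajectory. It assumes per-joint axis-angle rotations; the function name, window length, and padding strategy are illustrative and not taken from the released code.

```python
import numpy as np

def build_delta_pose_sequence(poses, t, window=5):
    """Stack differences between the current pose and its predecessors.

    poses: (T, J, 3) per-joint axis-angle rotations over time (assumed layout).
    Returns a (window, J, 3) array of delta poses; frames before the start
    of the sequence are padded by clamping to frame 0.
    """
    deltas = []
    for k in range(1, window + 1):
        past = poses[max(t - k, 0)]      # clamp at the first frame
        deltas.append(poses[t] - past)   # delta pose relative to the current frame
    return np.stack(deltas, axis=0)

# Example: 100 frames, 24 SMPL joints
poses = np.random.randn(100, 24, 3).astype(np.float32)
cond = build_delta_pose_sequence(poses, t=50)
print(cond.shape)  # (5, 24, 3)
```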


Video demos


Rendering results of our Dyco and HumanNeRF [1] on the I3D-Human dataset.

Novel velocity rendering



We select scaling factors α ranging from 0 to 2, where a larger α implies a higher motion speed. We run inference on the same pre-trained model and show rendering results conditioned on the scaled delta pose sequence. The skirt's swing amplitude varies markedly with α, consistent with everyday experience that a skirt flares out more at higher spinning velocities.
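A minimal sketch of this scaling, assuming the delta pose sequence is a plain array condition; `scale_delta_pose_sequence` and the commented rendering call are hypothetical names, not the actual API.

```python
import numpy as np

def scale_delta_pose_sequence(delta_seq, alpha):
    """Scale the conditional delta pose sequence to emulate a different motion speed.

    alpha = 0 removes the dynamic context (quasi-static rendering);
    alpha > 1 emulates the same motion performed faster.
    """
    return alpha * delta_seq

# delta_seq stands for the (window, J, 3) delta pose condition of one frame
delta_seq = np.random.randn(5, 24, 3).astype(np.float32)
for alpha in (0.0, 0.5, 1.0, 1.5, 2.0):
    cond_alpha = scale_delta_pose_sequence(delta_seq, alpha)
    # rgb = model.render(pose, dynamic_context=cond_alpha)  # hypothetical rendering call
```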


Novel acceleration rendering



Novel acceleration is simulated by designing a pose sequence that mimics an abrupt stop. Specifically, for a spinning test sequence, we truncate the poses midway and hold all subsequent poses fixed. We show the rendering sequences with (left) and without (right) the delta pose sequence condition to illustrate that our method accurately captures the effects of novel acceleration, such as the settling of a skirt hem.
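A minimal sketch of constructing such an abrupt-stop sequence; the function name and array layout are assumptions made for illustration.

```python
import numpy as np

def make_abrupt_stop(poses, stop_frame):
    """Freeze the pose from `stop_frame` onward to simulate a sudden stop.

    poses: (T, J, 3) per-joint rotations over time (assumed layout).
    Frames before `stop_frame` keep their spinning history, so delta pose
    sequences computed around the cut still encode the lost momentum.
    """
    frozen = poses.copy()
    frozen[stop_frame:] = poses[stop_frame]
    return frozen

poses = np.random.randn(120, 24, 3).astype(np.float32)
stopped = make_abrupt_stop(poses, stop_frame=60)
assert np.allclose(stopped[60:], stopped[60])  # pose is constant after the stop
```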


Abstract

Neural rendering techniques have significantly advanced 3D human body modeling. However, previous approaches often overlook dynamics induced by factors such as motion inertia, leading to challenges in scenarios like abrupt stops after rotation, where the pose remains static while the appearance changes. This limitation arises from relying on a single pose as the conditional input, which makes the mapping ambiguous when one pose corresponds to multiple appearances.
In this study, we elucidate that variations in human appearance depend not only on the current frame's pose condition but also on past pose states. We therefore introduce Dyco, a novel method that uses a delta pose sequence representation to condition the non-rigid deformation and the canonical space, effectively modeling temporal appearance variations. To preserve the model's generalization to novel poses, we further propose a low-dimensional global context that reduces unnecessary inter-body-part dependencies and a quantization operation that mitigates overfitting to the delta pose sequence. To validate the effectiveness of our approach, we collected a novel dataset named I3D-Human, focused on capturing temporal changes in clothing appearance under similar poses. Through extensive experiments on both I3D-Human and existing datasets, our approach demonstrates superior qualitative and quantitative performance. In addition, our inertia-aware 3D human modeling method can, for the first time, simulate appearance changes caused by inertia at different velocities.
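The quantization mentioned above can be pictured as rounding the delta poses onto a coarse grid before they condition the network. This is a sketch of one plausible form only; the function name and step size are placeholders, not values from the paper.

```python
import numpy as np

def quantize_delta_poses(delta_seq, step=0.05):
    """Snap delta poses onto a coarse grid so the network cannot latch onto
    tiny per-frame differences; the step size is a placeholder value."""
    return np.round(delta_seq / step) * step

delta_seq = np.random.randn(5, 24, 3).astype(np.float32)
print(quantize_delta_poses(delta_seq).shape)  # (5, 24, 3)
```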

Idea

Velocity ambiguity

The overall pipeline of our method. The rigid and non-rigid transformation modules deform coordinates from the pose space into the canonical space, and the canonical coordinates are then fed into the triplane volume to obtain color and density. To capture variations under similar poses within different dynamic contexts, we adopt a localized dynamic context encoder that embeds pose sequences as additional conditional inputs to the transformation modules and the canonical volume.
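For readers who prefer code, the data flow can be sketched as follows. The module interfaces, names, and the additive non-rigid offset are assumptions made for illustration and do not reproduce the actual implementation.

```python
import numpy as np

def dyco_forward(x_pose, pose, dyn_ctx, rigid_tf, non_rigid_tf, triplane):
    """Schematic forward pass mirroring the pipeline description.

    x_pose:  (N, 3) query points in pose (observation) space
    pose:    current-frame pose condition
    dyn_ctx: encoded delta pose sequence (localized dynamic context)
    rigid_tf / non_rigid_tf / triplane are stand-ins for the actual modules.
    """
    x_rigid = rigid_tf(x_pose, pose)                          # skeletal (rigid) deformation
    x_canon = x_rigid + non_rigid_tf(x_rigid, pose, dyn_ctx)  # non-rigid residual offset
    rgb, sigma = triplane(x_canon, dyn_ctx)                   # canonical color and density
    return rgb, sigma

# Toy stand-ins just to illustrate the data flow
rigid_tf = lambda x, p: x
non_rigid_tf = lambda x, p, c: np.zeros_like(x)
triplane = lambda x, c: (np.ones((len(x), 3)), np.ones(len(x)))
rgb, sigma = dyco_forward(np.zeros((8, 3)), None, None, rigid_tf, non_rigid_tf, triplane)
```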



Conclusion

In this work, we present Dyco, a novel human motion modeling method that incorporates pose-sequence conditioning to mitigate appearance ambiguities induced by dynamic contexts. We posit that human appearance is determined not only by the current pose but also by the cumulative motion states induced by inertia, which can be adequately encapsulated by pose sequences. In addition to introducing pose sequences as conditional inputs, we design a localized dynamic context encoder to address the model overfitting caused by excessive reliance on delta poses. Through these modules, we resolve the appearance ambiguity caused by dynamic context and thereby enhance the rendering quality of human bodies in loose attire. The I3D-Human dataset we have developed aims to rectify the oversight of loose clothing in previous datasets and to advance research on the complex human motion common in real-life scenarios.



Reference

[1] Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, and Ira Kemelmacher-Shlizerman. "HumanNeRF: Free-viewpoint rendering of moving people from monocular video." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16210-16220, 2022.