Introduction

In this study, we introduce a learning-based method for generating high-quality human motion sequences from text descriptions (e.g., "A person walks forward"). Existing techniques struggle with motion diversity and smooth transitions due to limited text-to-motion datasets and reliance on full-body skeletal pose representations. To address this, we develop a network encoder that converts motion sequences into periodic signals, capturing the local periodicity of motions in time and space. We also propose a conditional diffusion model for predicting periodic motion parameters based on text descriptions and the starting pose. Our approach outperforms current methods, generating a broader variety of high-quality motions with natural transitions, especially in longer sequences.

Method Overview

(a) First, we run a preprocessing stage to separate the periodic and non-periodic segments of each motion sequence. (b)-(c) Using the preprocessed motions, we then train a network encoder that maps the motion space into a learned, periodically parameterized phase space. Training minimizes the reconstruction error between the original motions and the motions obtained by decoding the periodic parameters via inverse FFT. (d) Next, we train a conditional diffusion model to predict the periodic parameters, taking a text prompt and a starting pose as inputs. At inference time, given a text prompt and a starting pose, we run the diffusion model to predict the periodic parameters and then decode the motion from the resulting signal.
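
To make the idea of periodic parameters more concrete, the following is a minimal sketch and not the learned encoder described above. It assumes a single 1-D channel that contains exactly one dominant frequency, reads that frequency's amplitude, offset, and phase off a plain FFT, and decodes them back into a signal with an inverse FFT. The function names `extract_periodic_params` and `decode_periodic`, the frame rate, and the toy signal are all hypothetical and chosen only for illustration.

```python
import numpy as np


def extract_periodic_params(signal, fps):
    """Return (amplitude, frequency_hz, offset, phase) of the dominant frequency.

    Toy stand-in for a learned encoder: it just inspects the real FFT.
    """
    n = signal.shape[0]
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / fps)
    offset = spectrum[0].real / n                   # mean of the signal (DC term)
    k = 1 + int(np.argmax(np.abs(spectrum[1:])))    # strongest non-DC bin
    amplitude = 2.0 * np.abs(spectrum[k]) / n       # valid for bins below Nyquist
    phase = float(np.angle(spectrum[k]))
    return amplitude, freqs[k], offset, phase


def decode_periodic(amplitude, frequency_hz, offset, phase, n, fps):
    """Rebuild a length-n signal from periodic parameters via inverse FFT."""
    spectrum = np.zeros(n // 2 + 1, dtype=complex)
    k = int(round(frequency_hz * n / fps))          # frequency -> FFT bin index
    spectrum[0] = offset * n
    spectrum[k] = 0.5 * amplitude * n * np.exp(1j * phase)
    return np.fft.irfft(spectrum, n=n)


# Hypothetical toy channel: 2 seconds at 30 fps with a 2 Hz oscillation.
fps, n = 30, 60
t = np.arange(n) / fps
signal = 1.5 * np.cos(2 * np.pi * 2.0 * t + 0.3) + 0.7

amp, freq, off, ph = extract_periodic_params(signal, fps)
recon = decode_periodic(amp, freq, off, ph, n, fps)
reconstruction_error = float(np.mean((signal - recon) ** 2))
```

In this toy setting the four numbers per channel fully describe the signal, so the reconstruction error is close to zero. Real motion data is only locally periodic, which is why the method learns the mapping and trains it against a reconstruction loss rather than reading parameters directly off a spectrum.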

Video

Citation

@article{wan2023diffusionphase,
  title={DiffusionPhase: Motion Diffusion in Frequency Domain},
  author={Wan, Weilin and Huang, Yiming and Wu, Shutong and Komura, Taku and Wang, Wenping and Jayaraman, Dinesh and Liu, Lingjie},
  journal={arXiv preprint arXiv:2312.04036},
  year={2023}
}

Thanks to Richard Zhang for the website template.