Motion Question Answering via Modular Motion Programs

Stanford University

For the task of human motion question answering (HumanMotionQA), we create a dataset (BABEL-QA) that evaluates models' ability to perform complex, multi-step reasoning for human behavior understanding.

Abstract

In order to build artificial intelligence systems that can perceive and reason about human behavior in the real world, we must first design models that conduct complex spatio-temporal reasoning over motion sequences. Towards this goal, we propose the HumanMotionQA task to evaluate the complex, multi-step reasoning abilities of models on long-form human motion sequences. We generate a dataset of question-answer pairs that require detecting motor cues in small portions of motion sequences, reasoning temporally about when events occur, and querying specific motion attributes.

In addition, we propose NSPose, a neuro-symbolic method for this task that uses symbolic reasoning and a modular design to ground motion through learning motion concepts, attribute neural operators, and temporal relations. We demonstrate the suitability of NSPose for the HumanMotionQA task, where it outperforms all baseline methods.

HumanMotionQA and BABEL-QA

For the HumanMotionQA task, we introduce the BABEL-QA dataset, which consists of human motion sequences paired with questions in natural language and answers from a vocabulary of words. HumanMotionQA requires complex motion understanding and spatio-temporal reasoning, as models must (1) detect subtle and complex motor cues performed only in a small portion of a motion sequence and (2) reason temporally about how different sections in a motion sequence relate to one another without having access to action boundaries.
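To make the data format concrete, the sketch below (in Python) shows what a single BABEL-QA example might look like. The field names, identifiers, and the specific question are illustrative assumptions rather than the dataset's actual schema.

import json

# Hypothetical structure of one BABEL-QA example; field names and values are
# illustrative assumptions, not the dataset's actual schema.
example = {
    # Reference to a long-form motion sequence (T frames of skeleton poses).
    "motion_id": "sequence_0421",
    # Natural-language question that requires temporal reasoning over the motion.
    "question": "What action does the person do after they walk?",
    # Answer drawn from a fixed vocabulary of words.
    "answer": "sit",
    # Type of attribute being queried, e.g. an action, a direction, or a body part.
    "query_type": "query_action",
}

print(json.dumps(example, indent=2))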

Data generation examples

NSPose

We introduce NSPose, a method that leverages a symbolic reasoning process to learn motor cues, modular concepts relating to motion (actions, directions, and body parts), and temporal relations. NSPose executes symbolic programs recursively on the input motion sequence and learns modular motion programs that correspond to different activity classification tasks. Our method jointly learns motion representations and language concept embeddings from motion sequences and question-answer pairs.
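As a rough illustration of this execution process, the sketch below shows two simplified modules, a concept filter and an attribute query, applied to per-frame motion features. The operator names, tensor shapes, and scoring functions are our own assumptions for illustration, not the paper's exact implementation; the temporal relation operators are sketched in the next section.

import torch

# Simplified stand-ins for NSPose-style modules (an illustrative sketch).
# A question is parsed into a nested program, e.g.
#   query_action(relate_after(filter(motion, "walk")))
# and each module is executed recursively over per-frame motion features.

def filter_concept(frame_feats, concept_emb):
    # Soft per-frame attention: how well each frame matches a motion concept.
    return torch.sigmoid(frame_feats @ concept_emb)           # shape (T,)

def query_attribute(frame_feats, frame_attention, attribute_embs):
    # Pool the attended frames, then score the result against each candidate answer.
    pooled = (frame_attention.unsqueeze(-1) * frame_feats).sum(0)
    pooled = pooled / (frame_attention.sum() + 1e-6)
    return torch.softmax(attribute_embs @ pooled, dim=0)      # distribution over answers

# Example run with random stand-ins for the learned encoder and embeddings.
T, D, A = 150, 128, 12                        # frames, embedding dim, answer vocabulary
frame_feats = torch.randn(T, D)               # per-frame motion representations
walk_emb = torch.randn(D)                     # concept embedding for "walk"
action_embs = torch.randn(A, D)               # embeddings of candidate answer actions
attention = filter_concept(frame_feats, walk_emb)
answer_dist = query_attribute(frame_feats, attention, action_embs)

Because every module returns either a soft attention over frames or a distribution over answers, programs of this form can be composed and trained end-to-end from question-answer supervision alone.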

Temporal grounding

In addition to learning motion concepts and the transformations from the motion embedding space to the attribute embedding space, NSPose also learns temporal relation operators for before, after, and in between directly from human motion frames, without the use of annotated action boundaries. NSPose is able to identify boundaries between different actions implicitly, as these transition frames are learned through filtering for concepts in segments related by temporal relations.
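One simple way to realize such relation operators, sketched below under our own assumptions (this is not necessarily NSPose's exact formulation), is to map the soft attention produced by a filter module to new before, after, and in-between attention masks over frames.

import torch

# Illustrative soft temporal relation operators over per-frame attention.
# Each operator maps the attention of a filtered segment to a new frame-level
# mask, so no annotated action boundaries are required.

def relate_after(attn):
    # Frames following the attended segment: cumulative max of the attention,
    # shifted by one frame and down-weighted on the segment itself.
    cum = torch.cummax(attn, dim=0).values
    return torch.cat([attn.new_zeros(1), cum[:-1]]) * (1 - attn)

def relate_before(attn):
    # Frames preceding the attended segment (the time-reversed "after").
    return relate_after(attn.flip(0)).flip(0)

def relate_in_between(attn_a, attn_b):
    # Frames after segment A and before segment B.
    return relate_after(attn_a) * relate_before(attn_b)

In this sketch, action boundaries never appear explicitly; the relation masks emerge from smooth cumulative attention, which mirrors the idea that transition frames are learned implicitly.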

BibTeX

@inproceedings{endo2023humanmotionqa,
  author    = {Endo, Mark and Hsu, Joy and Li, Jiaman and Wu, Jiajun},
  title     = {Motion Question Answering via Modular Motion Programs},
  booktitle = {ICML},
  year      = {2023},
}