Discriminative Hierarchical Modeling of Spatio-Temporally Composable Human Activities

1Pontificia Universidad Catolica de Chile, Santiago, Chile.
2Universidad del Norte, Barranquilla, Colombia

Accepted at CVPR 2014 (PDF)

People perform complex activities that can be characterized as spatial and/or temporal compositions of simpler actions. Top-left: A person simultaneously waves and walks by assigning subsets of body parts to different actions. Top-right: A person sequentially talks on the phone and runs away to attend to an urgent matter. Bottom: A person walks in a room, picks up a book, walks while reading the book, etc. In this paper, we propose a novel formulation that is able to capture these spatio-temporal compositions for complex activity recognition using RGB-D data.

This paper proposes a framework for recognizing complex human activities in videos. Our method describes human activities in a hierarchical discriminative model that operates at three semantic levels. At the lower level, body poses are encoded in a representative but discriminative pose dictionary. At the intermediate level, encoded poses span a space where simple human actions are composed. At the highest level, our model captures temporal and spatial compositions of actions into complex human activities. Our human activity classifier simultaneously models which body parts are relevant to the action of interest as well as their appearance and composition using a discriminative approach. By formulating model learning in a max-margin framework, our approach achieves powerful multiclass discrimination while providing useful annotations at the intermediate semantic level. We show how our hierarchical compositional model provides natural handling of occlusions. To evaluate the effectiveness of our proposed framework, we introduce a new dataset of composed human activities. We provide empirical evidence that our method achieves state-of-the-art activity classification performance on several benchmark datasets.
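As a rough illustration of the max-margin idea mentioned above, the sketch below computes a generic multiclass hinge loss for one example. This is the textbook form, not necessarily the objective optimized in the paper; the function name `multiclass_hinge` and the `margin` parameter are assumptions made for this example.

```python
import numpy as np

def multiclass_hinge(scores, label, margin=1.0):
    """Generic multiclass hinge loss for a single example (illustrative only).

    scores: 1-D array with one score per activity class.
    label:  index of the correct class.
    The loss is zero when the correct class beats every other class
    by at least `margin`, and grows with the largest violation otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    violations = scores - scores[label] + margin  # new array, input untouched
    violations[label] = 0.0                       # the true class never violates
    return float(max(0.0, violations.max()))
```

Training in a max-margin framework amounts to choosing model weights that keep this kind of loss small over the training set, usually together with a regularization term.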

We also introduce a new benchmark dataset, Composable Activities, consisting of 693 videos that contain activities in 16 classes performed by 14 actors. We capture RGB-D data for each sequence using a Microsoft Kinect sensor and estimate the positions of relevant body joints.

Overview of our discriminative hierarchical model for recognition of composable human activities. At the top level, activities are compositions of actions that are inferred at the intermediate level. These actions are in turn compositions of poses at the lower level, where pose dictionaries are learnt from data. Our model also divides each pose into R spatial regions to capture regions that are relevant to each activity. This figure illustrates the case when R = 2.

To recognize human activities and actions we propose a 3-level compositional hierarchical model, consisting of activities, actions and poses.
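The three levels can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the model from the paper: poses are assigned to their nearest dictionary entry, each frame takes the highest-scoring action for its pose, and an activity is scored linearly from per-region action histograms. All function names, the weight matrices `action_w` and `activity_w`, and the nearest-neighbor assignment are hypothetical choices made for this example.

```python
import numpy as np

def nearest_pose(descriptors, dictionary):
    """Pose level: index of the closest dictionary entry for each frame.

    descriptors: (T, D) per-frame pose descriptors for one body region.
    dictionary:  (K, D) pose dictionary for that region.
    """
    diffs = descriptors[:, None, :] - dictionary[None, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)

def histogram(labels, size):
    """Normalized histogram of integer labels in the range [0, size)."""
    counts = np.bincount(labels, minlength=size).astype(float)
    return counts / counts.sum()

def activity_scores(regions, dictionaries, action_w, activity_w):
    """Score every activity class by summing contributions over R regions.

    regions:      list of R arrays, each (T, D).
    dictionaries: list of R arrays, each (K, D).
    action_w:     list of R arrays, each (A, K), action-vs-pose weights.
    activity_w:   list of R arrays, each (C, A), activity-vs-action weights.
    """
    scores = np.zeros(activity_w[0].shape[0])
    for desc, dico, w_act, w_activity in zip(regions, dictionaries,
                                             action_w, activity_w):
        poses = nearest_pose(desc, dico)            # pose level
        actions = w_act[:, poses].argmax(axis=0)    # action level, per frame
        scores += w_activity @ histogram(actions, w_act.shape[0])  # activity level
    return scores
```

The predicted activity is then `activity_scores(...).argmax()`. Keeping separate dictionaries and weights per region is what lets different body parts contribute different actions to the same activity.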

CVPR spotlight video

Example video


This work was funded by FONDECYT grant 1120720, from CONICYT, Government of Chile, and LACCIR grant RFP1212LAC005. I.L. is supported by a PhD studentship from CONICYT. J.C.N. is supported by a Microsoft Research Faculty Fellowship.