Human Parsing using Stochastic And-Or Grammars and Rich Appearances
B. Rothrock and S. C. Zhu

Introduction

One of the key challenges in human parsing and pose recovery is handling the variability in geometry and appearance of humans in natural scenes. This variability is due to the large number of distinct articulated configurations, clothing, and self-occlusion, as well as unknown lighting and viewpoint. In this paper, we present a stochastic grammar model that represents the body as an articulated assembly of compositional and reconfigurable parts. The reconfigurable aspect allows a compatible part to be substituted with an alternative part with different attributes, for example to account for clothing appearance or viewpoint foreshortening. Relations within the grammar enforce consistency between part attributes as well as geometry, allowing a richer set of appearance and geometry constraints than conventional articulated models. Part appearances are modeled by a sparse deformable image template that can still richly describe salient part structures. We describe a dynamic programming parsing algorithm for our model, and show pose recovery results that are competitive with the state of the art on a challenging dataset.

Generative Grammar Model

The appearance of the human body, including the appearance of individual parts, their articulated geometry constraints, and their type constraints, can be represented concisely by an And-Or graph (AOG) grammar. Because the AOG is a generative model, samples can be drawn from it, in the spirit of analysis-by-synthesis, to visually check that the results are human-like.

And-Or Graph Grammar: The grammar defines a hierarchical and reconfigurable decomposition of the body into constituent parts. A derivation of the grammar selects one alternative for each of these reconfigurable components to produce a parse graph. Inference is then defined as finding the optimal parse graph that explains the image.
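
As a rough illustration of how a derivation produces a parse graph, the sketch below enumerates every derivation of a toy And-Or grammar. The part names, the two-level structure, and the nested-tuple encoding of a parse graph are all hypothetical choices made for this example and are not taken from the paper.

```python
from itertools import product

# Hypothetical toy grammar; names and structure are illustrative only.
# An And-node lists children that must all be present.
# An Or-node lists alternative forms, exactly one of which is selected.
GRAMMAR = {
    "body":  ("and", ["torso", "arm"]),
    "torso": ("or",  ["torso_front", "torso_side"]),
    "arm":   ("or",  ["arm_bare", "arm_sleeved"]),
}
# Symbols that do not appear as keys in GRAMMAR are terminals (leaf parts).


def derivations(symbol):
    """Return every parse graph rooted at `symbol` as a nested tuple."""
    if symbol not in GRAMMAR:
        return [(symbol,)]
    kind, children = GRAMMAR[symbol]
    if kind == "or":
        # Select one alternative; the Or-node keeps only the chosen branch.
        return [(symbol, d) for c in children for d in derivations(c)]
    # And-node: combine one derivation of every child.
    child_sets = [derivations(c) for c in children]
    return [(symbol, *combo) for combo in product(*child_sets)]


parse_graphs = derivations("body")
for pg in parse_graphs:
    print(pg)
```

With two alternatives for each of the two Or-nodes, this toy grammar has four parse graphs. Exhaustive enumeration is only feasible at this scale, which is why inference over a realistic grammar needs a more efficient search.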

Synthesized Samples: The statistical model defined on the And-Or graph is generative, meaning that it defines a probability distribution over the poses and appearances it is designed to explain. Random samples can therefore be drawn from the model, and these samples show human-like poses and appearances.
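
A minimal sketch of drawing a sample from a stochastic And-Or grammar by top-down (ancestral) sampling is given below. The part names, the branching probabilities, and the Gaussian over a single articulation angle are assumptions made for illustration; the paper's actual geometry and appearance models are not reproduced here.

```python
import random

# Hypothetical grammar: And-nodes expand into all children,
# Or-nodes pick one alternative according to a branching probability.
AND = {"body": ["upper_arm", "lower_arm"]}
OR = {
    "upper_arm": [("upper_arm_bare", 0.4), ("upper_arm_sleeved", 0.6)],
    "lower_arm": [("lower_arm_straight", 0.7), ("lower_arm_foreshortened", 0.3)],
}


def sample(symbol, rng):
    """Draw one parse graph rooted at `symbol` by ancestral sampling."""
    if symbol in AND:
        return {symbol: [sample(child, rng) for child in AND[symbol]]}
    if symbol in OR:
        names, probs = zip(*OR[symbol])
        choice = rng.choices(names, weights=probs, k=1)[0]
        return {symbol: [sample(choice, rng)]}
    # Terminal part: sample a relative angle (assumed Gaussian, in degrees).
    return {symbol: {"angle_deg": rng.gauss(0.0, 30.0)}}


rng = random.Random(0)  # fixed seed so the draw is repeatable
parse_graph = sample("body", rng)
print(parse_graph)
```

Rendering such a sample as an image would additionally require drawing each terminal's appearance template at the sampled geometry, which this sketch omits.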

Parsing and Detection

Parsing is accomplished by a dynamic programming algorithm that incrementally finds optimal part geometries as well as their reconfigurable forms.

Optimal Score Maps: During the inference process, the dynamic programming algorithm computes optimal score maps for each part recursively. As the algorithm moves up the grammar toward the root part, it accumulates more evidence and updates these score maps accordingly until it arrives at a single, globally optimal solution.
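
The recursion over score maps can be sketched on a toy problem as follows. This is a simplified max-sum illustration under stated assumptions, not the authors' implementation: positions are 1-D indices, the appearance scores are made-up numbers, and the quadratic deformation penalty with weight `DEFORM_WEIGHT` is a common pictorial-structures choice assumed here for concreteness.

```python
from functools import lru_cache

# Hypothetical local appearance scores for terminal parts at positions 0..4.
APPEARANCE = {
    "arm_bare":    [0.1, 0.2, 0.9, 0.1, 0.0],
    "arm_sleeved": [0.0, 0.3, 0.4, 0.2, 0.1],
    "torso":       [0.2, 0.8, 0.3, 0.1, 0.0],
}
OR = {"arm": ["arm_bare", "arm_sleeved"]}
AND = {"body": [("torso", 0), ("arm", 1)]}  # (child, expected offset from parent)
DEFORM_WEIGHT = 0.1  # assumed quadratic deformation cost


@lru_cache(maxsize=None)
def score_map(symbol):
    """Return (scores, backpointers) for `symbol`; each map is computed once."""
    if symbol in APPEARANCE:
        scores = APPEARANCE[symbol]
        return scores, [None] * len(scores)
    if symbol in OR:
        # Or-node: keep the best alternative at every position.
        maps = {alt: score_map(alt)[0] for alt in OR[symbol]}
        n = len(next(iter(maps.values())))
        best = [max(maps, key=lambda a: maps[a][x]) for x in range(n)]
        return [maps[best[x]][x] for x in range(n)], best
    # And-node: sum each child's best score after paying the deformation cost.
    children = [(c, off, score_map(c)[0]) for c, off in AND[symbol]]
    n = len(children[0][2])
    scores, back = [], []
    for x in range(n):
        total, placement = 0.0, {}
        for child, off, cmap in children:
            def cost(y):
                return cmap[y] - DEFORM_WEIGHT * (y - x - off) ** 2
            y_best = max(range(n), key=cost)
            total += cost(y_best)
            placement[child] = y_best
        scores.append(total)
        back.append(placement)
    return scores, back


# Read off the globally optimal solution at the root, then backtrack.
root_scores, root_back = score_map("body")
root_x = max(range(len(root_scores)), key=lambda x: root_scores[x])
placement = root_back[root_x]
arm_type = score_map("arm")[1][placement["arm"]]
print(root_x, root_scores[root_x], placement, arm_type)
```

In this toy case the root is placed at position 1 with the torso at 1 and the arm at 2, and the Or-node for the arm resolves to `arm_bare` there. Memoizing `score_map` reflects the point of dynamic programming: each part's map is computed once and reused by every parent that refers to it.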

References

Human Parsing using Stochastic And-Or Grammars and Rich Appearances
B. Rothrock and S. C. Zhu
SIG-11: Second International Workshop on Stochastic Image Grammars

A Stochastic Grammar of Images
S. C. Zhu and D. Mumford
Foundations and Trends in Computer Graphics and Vision

A Numeric Study of the Bottom-up and Top-down Inference Processes in And-Or Graphs
T. F. Wu and S. C. Zhu
Int'l Journal of Computer Vision (under review)

Learning Active Basis Model for Object Detection and Recognition
Y. N. Wu, Z. Z. Si, H. F. Gong, and S. C. Zhu
Int'l Journal of Computer Vision