Representation, Inference, and Learning

Description

The objective of the program is to explore and develop methods for scalable autonomous systems capable of understanding scenes and events for learning, planning, and execution of complex tasks. The program explores powerful mathematical frameworks for unified knowledge representation applied to shared perception, learning, reasoning, and action, exploiting probabilistic methods such as stochastic grammars to represent and process visual scenes and actions. Data-driven methods for spatial, temporal, and causal parsing of information are being developed for semantic understanding of scenes and events in unstructured environments, along with cognitive processing methods for exploitation and manipulation.

**Models**

This project developed a mathematical foundation for unified representation, inference, and learning. The result is an end-to-end system for scene and event understanding from various imaging sensor inputs. It computes a probabilistic spatial-temporal And-Or Graph (AOG) representation of human activities and human-object interactions, and answers binary (yes/no) queries via Who, What, Where, When storylines. We explore three families of models to capture these relationships, ranging from heavily annotated AOGs to weakly supervised Sum-Product Networks and Hierarchical Random Fields.

*And-Or Graphs (AOG):* We represent spatial relationships using a three-layered AOG to jointly model group activities, individual actions, and participating objects. For temporal relations and context, we use lateral connections between multiple graphs, which enables multi-target tracking. The AOG allows a principled formulation of efficient, cost-sensitive inference via an explore-exploit strategy. Our inference optimally schedules the following computational processes: 1) direct application of activity detectors, called the α process; 2) bottom-up inference based on detecting activity parts, called the β process; and 3) top-down inference based on detecting activity context, called the γ process. The scheduling iteratively maximizes the log-posteriors of the resulting parse graphs. This cost-sensitive inference of And-Or Graphs is formulated as Monte Carlo Tree Search (MCTS): for a queried activity in the video, MCTS optimally schedules which detectors and trackers to run, and where to apply them in the space-time volume. For evaluation, we have compiled and benchmarked a new dataset of high-resolution videos of group and individual activities co-occurring in a courtyard on the UCLA campus. Evaluation on the benchmark datasets demonstrates that MCTS enables two-orders-of-magnitude speed-ups without compromising accuracy relative to standard cost-insensitive inference.

*Sum-Product Networks (SPN):* To explore a different family of models, we represent activities using an SPN. A product node in an SPN represents a particular arrangement of parts, and a sum node represents alternative arrangements. The sums and products are hierarchically organized and grounded onto space-time windows covering the video. Compared to And-Or Graphs, this family requires only weak supervision of the internal parts of the network, which enables the network to learn salient parts and use them for classification.
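As an illustrative sketch of how sum and product nodes combine part scores (the class and variable names below are hypothetical, not from the project's implementation):

```python
# Toy SPN evaluation: product nodes score a particular arrangement of
# parts; sum nodes mix alternative arrangements with learned weights.
# All names here are illustrative, not from the project code.

class Leaf:
    """Evidence score for one grounded part in a space-time window."""
    def __init__(self, score):
        self.score = score
    def eval(self):
        return self.score

class Product:
    """A particular arrangement of parts: multiply child scores."""
    def __init__(self, children):
        self.children = children
    def eval(self):
        p = 1.0
        for c in self.children:
            p *= c.eval()
        return p

class Sum:
    """Alternative arrangements: weighted mixture of child scores."""
    def __init__(self, children, weights):
        self.children = children
        self.weights = weights
    def eval(self):
        return sum(w * c.eval() for c, w in zip(self.children, self.weights))

# Two alternative part arrangements for one activity class.
arrangement_a = Product([Leaf(0.9), Leaf(0.8)])   # 0.72
arrangement_b = Product([Leaf(0.5), Leaf(0.4)])   # 0.20
root = Sum([arrangement_a, arrangement_b], weights=[0.6, 0.4])
print(root.eval())  # ≈ 0.512
```

Because every node is a sum or product, the whole network is evaluated bottom-up in a single pass, which is what makes SPN inference tractable.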

*Hierarchical Random Fields (HiRF):* Similar to SPNs, we formulate a new deep model that encodes only hierarchical dependencies between model variables. This effectively amounts to modeling higher-order temporal dependencies of video features. We specify an efficient inference procedure for HiRF that iterates linear programming steps for estimating latent variables. Learning of HiRF parameters is specified within the max-margin framework.
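A toy sketch of this style of iterative latent-variable estimation (the model's inference uses linear programming; here we substitute a simpler block-coordinate argmax purely for illustration, and all names are our own):

```python
def infer_latent(unary, pairwise, n_iters=10):
    """Illustrative block-coordinate inference over one chain of latent
    variables (a simplification of the LP-based inference in HiRF).

    unary:    K x L list of scores, unary[k][l] = score of variable k
              taking label l (e.g. from video features).
    pairwise: L x L compatibility between labels of neighboring variables.
    Each pass re-estimates every latent variable given its neighbors.
    """
    K, L = len(unary), len(unary[0])
    # Initialize each variable from its unary scores alone.
    h = [max(range(L), key=lambda l: unary[k][l]) for k in range(K)]
    for _ in range(n_iters):
        for k in range(K):
            def score(l):
                s = unary[k][l]
                for j in (k - 1, k + 1):          # chain neighbors
                    if 0 <= j < K:
                        s += pairwise[l][h[j]]
                return s
            h[k] = max(range(L), key=score)
    return h

# A noisy middle frame gets corrected by agreement with its neighbors.
labels = infer_latent(
    unary=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    pairwise=[[0.6, 0.0], [0.0, 0.6]],
)
print(labels)  # [0, 0, 0]
```

The pairwise term rewards neighboring variables that agree, so the middle variable flips to match its context even though its own unary score prefers the other label.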

**Dataset**

We collected a new dataset, the UCLA Courtyard Dataset, with videos captured from a bird's-eye viewpoint of a courtyard on the UCLA campus. The videos show human activities at different semantic levels and have sufficiently high resolution to allow inference of fine details. The dataset consists of 106 minutes of 30 fps, 2560 × 1920-resolution video footage. We provide annotations in terms of bounding boxes around group activities, primitive actions, and objects in each frame.

**Publications**

M. R. Amer and S. Todorovic. Sum Product Networks for Activity Recognition in Videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

M. R. Amer, L. Peng, and S. Todorovic. HiRF: Hierarchical Random Field for Collective Activity Recognition in Videos. European Conference on Computer Vision, 2014.