Continued advances promise to produce autonomous systems that will perceive, learn, decide, and act on their own. However, the effectiveness of these systems is limited by machines' current inability to explain their decisions and actions to human users. Explainable AI, especially explainable machine learning, will be essential if users are to understand, appropriately trust, and effectively manage an emerging generation of artificially intelligent machine partners. The goal of this project is to produce more explainable models that maintain a high level of learning performance, thereby enabling human users to understand, appropriately trust, and effectively manage these artificially intelligent partners.
We present a novel approach for searching and ranking videos of activities using a deep generative model. Ranking is a well-established problem in computer vision. It is usually addressed with discriminative models, but the decisions made by these models tend to be unexplainable. We believe that generative models are more explainable, since they can generate instances of what they have learned.
Our model is based on Generative Adversarial Networks (GANs). We formulate a Dense Validation GAN (DVGAN) that learns human motion, generates realistic visual instances given textual inputs, and then uses the generated instances to search and rank videos in a database under a perceptually sound distance metric in video space. The distance metric can be chosen from a spectrum ranging from handcrafted to learned distance functions, controlling the trade-off between explainability and performance.
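The generate-then-rank step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `rank_videos`, the array shapes, and the default per-frame L2 metric are all assumptions standing in for the handcrafted end of the distance-function spectrum; a learned metric could be passed in its place.

```python
import numpy as np

def rank_videos(generated, database, metric=None):
    """Rank database videos by distance to a generated instance.

    generated: (T, D) array, the instance generated from the text query.
    database: list of (T, D) arrays, the candidate videos.
    metric: optional callable (a, b) -> float; defaults to mean per-frame
            L2 distance, a simple handcrafted choice. A learned distance
            function could be substituted for better performance at some
            cost in explainability.
    """
    if metric is None:
        metric = lambda a, b: float(np.mean(np.linalg.norm(a - b, axis=1)))
    dists = [metric(generated, v) for v in database]
    return np.argsort(dists)  # indices of database videos, nearest first
```

Because ranking reduces to distances from a generated, inspectable instance, a user can look at that instance to see why a given video ranked highly.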
Our model is capable of human motion generation and completion. We formulate the DVGAN discriminator as a Convolutional Neural Network (CNN) with dense validation at each time-scale, and we perturb the discriminator input to make it translation invariant. The DVGAN generator performs motion generation and completion using a Recurrent Neural Network (RNN). To encode the textual query, pretrained language models such as skip-thought vectors are used to improve robustness to unseen query words.
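The idea of dense validation at each time-scale can be illustrated with a toy computation. This is a hypothetical sketch, not the DVGAN architecture: it assumes per-frame realism scores from some critic and simply pools window averages at several scales, so the sequence is judged at every temporal resolution rather than only globally.

```python
import numpy as np

def dense_validation_score(frame_scores, scales=(1, 2, 4)):
    """Toy dense validation across time-scales.

    frame_scores: (T,) array of per-frame realism scores from a critic
                  (assumed given; in DVGAN these come from CNN features).
    For each scale s, a validation is emitted at every length-s window,
    and the final score averages the validations from all scales.
    """
    validations = []
    T = len(frame_scores)
    for s in scales:
        # One validation per window position: dense in time at every scale.
        for t in range(0, T - s + 1):
            validations.append(frame_scores[t:t + s].mean())
    return float(np.mean(validations))
```

Scoring every window at every scale means a locally implausible segment lowers the overall validation even if the sequence looks fine globally, which is the motivation for dense, multi-scale validation.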
We evaluate our approach on the Human 3.6M and CMU motion capture datasets using Inception scores. Our evaluations demonstrate resilience to noise, generalization across actions, and the ability to generate long, diverse sequences.