DARPA - Communicating with Computers: A Visual Storytelling Platform for Composing, Directing and Animating

The goal of this project is to develop methods that enable symmetric communication between people and computers. Machines are not merely receivers of instructions but collaborators, able to harness a full range of natural modes including language, gesture, and facial or other expressions. Communication is understood to be the sharing of complex ideas in collaborative contexts. Complex ideas are assumed to be built from a relatively small set of elementary ideas, and language is thought to specify such complex ideas, but not completely: language is ambiguous and depends in part on context, which can augment language and improve the specification of complex ideas. In the case of collaborative composition, researchers explore the process by which humans and machines might collaborate toward the assembly of a creative product, in this case by contributing sentences to create stories. Success in this program would advance a number of application areas, most notably robotics and semi-autonomous systems.

We present Aesop, a new collaborative visual storytelling platform for direction and animation. The system operates in two main modes: commonsense grounding (annotation) and conversation. Aesop senses the human state and input using a natural language parser and human gesture monitoring, allowing natural interaction. The interface consists of 3D animation software and a web controller for interacting with the internal state of the system. For knowledge representation, we formulate novel composition graphs, which enable spatio-temporal event representation. Aesop thus supports 3D spatial and temporal reasoning, both of which are essential for storytelling. Finally, the system uses a dialog manager to track the conversation state and manage goals. Our key innovation is enabling conversational AI through both verbal and non-verbal communication and grounding language and vision in 3D, which enables research in language, vision, and planning in the context of storytelling.
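To make the idea of a spatio-temporal event representation more concrete, the following is a minimal sketch of a graph that stores entities with 3D positions and events with time intervals, and answers one spatial and one temporal query. It is an illustration under stated assumptions, not Aesop's actual composition graph: the class names, fields, and queries (`Entity`, `Event`, `CompositionGraph`, `left_of`, `happens_before`) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A character or prop placed in the 3D scene (hypothetical schema)."""
    name: str
    position: tuple[float, float, float]  # (x, y, z) in scene units


@dataclass
class Event:
    """An action performed by an entity over a time interval, in seconds."""
    actor: str
    action: str
    start: float
    end: float


@dataclass
class CompositionGraph:
    """Illustrative container; not the representation used by Aesop."""
    entities: dict[str, Entity] = field(default_factory=dict)
    events: list[Event] = field(default_factory=list)

    def add_entity(self, name, position):
        self.entities[name] = Entity(name, position)

    def add_event(self, actor, action, start, end):
        # Events may only refer to entities that exist in the scene.
        if actor not in self.entities:
            raise KeyError(f"unknown actor: {actor}")
        if end < start:
            raise ValueError("event must not end before it starts")
        self.events.append(Event(actor, action, start, end))

    def left_of(self, a, b):
        """Spatial query: is entity a at a smaller x coordinate than b?"""
        return self.entities[a].position[0] < self.entities[b].position[0]

    def happens_before(self, first, second):
        """Temporal query: does the first event end before the second starts?"""
        return first.end <= second.start

    def timeline(self):
        """Return events ordered by start time."""
        return sorted(self.events, key=lambda e: e.start)


if __name__ == "__main__":
    graph = CompositionGraph()
    graph.add_entity("fox", (0.0, 0.0, 0.0))
    graph.add_entity("crow", (2.0, 0.0, 3.0))
    graph.add_event("crow", "drop cheese", 4.0, 5.0)
    graph.add_event("fox", "flatter crow", 0.0, 3.5)
    for event in graph.timeline():
        print(event.actor, event.action, event.start, event.end)
```

Keeping positions and time intervals in one structure is what lets a single story state answer both "where" and "when" questions, which is the property the paragraph above attributes to composition graphs.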

The MovieGraphs dataset is a collection of annotated video clips from 50 movies. Movies are a rich source of human interactions. Clips are annotated with the characters who appear in the scene, their attributes (both physical and emotional), relationships and interactions between characters, timestamps at which actions begin and end, a label for the situation depicted in the scene, and a natural language description of the scene’s narrative content. Each situation instance includes a video, subtitles, a brief description of the scene, and a corresponding situation graph. Currently, MovieGraphs annotations are human-centric and lack objects, spatial relationships, and commonsense grounding. We use the MovieGraphs dataset as a source of knowledge and commonsense. We augment its graphs with spatial relationships and objects, extract visual features directly from the frames, and ground the graphs and their augmentations in Muvizu, in order to teach the AI agents how to translate visual and textual concepts into grounded composition graphs in Aesop.
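The annotation fields listed above, and the augmentation with objects and spatial relationships, can be sketched as a small data structure. This is a hedged illustration only: the field names, the `augment` helper, and the sample clip are invented for this example and do not reflect the actual MovieGraphs file format or the Aesop pipeline.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Interaction:
    """An interaction between two characters with start/end timestamps."""
    source: str
    target: str
    action: str
    start: float  # seconds from the start of the clip
    end: float


@dataclass
class SpatialRelation:
    """Augmentation that the human-centric annotations do not include."""
    subject: str
    relation: str  # e.g. "next_to", "on" (hypothetical vocabulary)
    obj: str


@dataclass
class ClipAnnotation:
    """Hypothetical container for one annotated clip."""
    situation: str
    description: str
    characters: dict[str, list[str]]           # name -> attributes
    relationships: list[tuple[str, str, str]]  # (a, relationship, b)
    interactions: list[Interaction]
    objects: list[str] = field(default_factory=list)
    spatial: list[SpatialRelation] = field(default_factory=list)


def augment(clip, objects, spatial):
    """Return a copy of clip with objects and spatial relations added.

    A relation is rejected if it mentions anything that is neither an
    annotated character nor a known or newly added object.
    """
    known = set(clip.characters) | set(clip.objects) | set(objects)
    for rel in spatial:
        for name in (rel.subject, rel.obj):
            if name not in known:
                raise ValueError(f"unknown entity in spatial relation: {name}")
    return replace(
        clip,
        objects=clip.objects + list(objects),
        spatial=clip.spatial + list(spatial),
    )


if __name__ == "__main__":
    # Made-up sample data, not taken from the dataset.
    clip = ClipAnnotation(
        situation="dinner argument",
        description="Two friends argue over dinner.",
        characters={"Anna": ["adult", "angry"], "Ben": ["adult", "calm"]},
        relationships=[("Anna", "friend", "Ben")],
        interactions=[Interaction("Anna", "Ben", "yells at", 2.0, 4.5)],
    )
    augmented = augment(
        clip, ["table"], [SpatialRelation("Anna", "next_to", "table")]
    )
    print(augmented.objects, augmented.spatial)
```

Returning a new annotation instead of mutating the original keeps the human-centric annotations and their augmentations separable, which is convenient when the augmentations come from a different source than the dataset.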


T. J. Meo, C. Kim, A. Raghavan, A. Tozzo, D. A. Salter, A. Tamrakar, M. R. Amer. Aesop: A Visual Storytelling Platform for Conversational AI and Commonsense Grounding. AI Communications, 2019. Pre-Print.

T. Meo, A. Raghavan, D. Salter, A. Tozzo, A. Tamrakar, M. R. Amer. Aesop: A Visual Storytelling Platform for Conversational AI. International Joint Conference on Artificial Intelligence, 2018. Best demo award.

A. Tozzo, D. Jovanovic, M. R. Amer. Neural Event Extraction in Movies. North American Chapter of the Association for Computational Linguistics, 2018.