Communicating With Computers (CwC) is a DARPA-funded program that explores communication between humans and machines. CwC aims to build intelligent systems that are not just servants but collaborators, able to help humans perform a variety of tasks, from construction to composition to data exploration. In the context of CwC, I was technical lead on a large collaboration, including the Brandeis LLC and the CSU Vision Lab, that used the VoxWorld Platform to develop Diana, a multimodal interactive agent who can hear, see, and understand her human partner, and who communicates using spoken language and gesture to collaborate on construction tasks in a virtual environment. Multimodal agents like Diana represent a step forward for intelligent systems: toward agents that understand not only language, but also the situation and context they inhabit, whether in the real world or in a mixed-reality environment shared with humans.
Our research from CwC has been presented at a number of top-tier conferences, including *ACL, NeurIPS, and AAAI. Most recently, we presented a live demo of Diana at AAAI 2020 in New York City. The video below was presented alongside the demo:
As intelligent agents become more integrated into everyday life, interacting with them will not be limited to speech. Interactions will be inherently multimodal, drawing on spoken and typed language; haptic input from pads, keys, and sensors; head and hand gestures captured from RGBD image and video; and contextual and situational awareness of the local environment, including awareness of objects and actions. When communication is multimodal, each modality provides an orthogonal angle through which to probe the computational model of the others, including the behaviors and communicative capabilities each affords. Multimodal interactions therefore require a unified framework and control language through which to interpret inputs and behaviors and to generate informative outputs. We study and model the semantics of this communication to generate representations of the common ground between human and computer.
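As a concrete illustration of how two modalities can constrain one another, here is a minimal sketch of resolving a referring expression by intersecting a linguistic constraint (a spoken attribute) with a gestural one (a pointing direction). All names and structures below are hypothetical and simplified for illustration; they are not the actual CwC or Diana API.

```python
from dataclasses import dataclass

# Hypothetical sketch: fusing a spoken referring expression ("that red block")
# with a deictic (pointing) gesture to ground the reference to an object in a
# shared scene. Names here are illustrative, not the real system's interface.

@dataclass
class SceneObject:
    name: str
    color: str
    position: tuple  # (x, y) location in the shared workspace

def resolve_reference(spoken_color, point_target, scene):
    """Intersect the linguistic constraint (the spoken color) with the
    gestural constraint (proximity to the pointed-at location)."""
    # Linguistic filter: keep only objects matching the spoken attribute.
    candidates = [o for o in scene if o.color == spoken_color]
    # Gestural filter: of those, pick the one closest to the pointed location.
    def sq_dist(o):
        dx = o.position[0] - point_target[0]
        dy = o.position[1] - point_target[1]
        return dx * dx + dy * dy
    return min(candidates, key=sq_dist) if candidates else None

scene = [
    SceneObject("block1", "red", (0.0, 0.0)),
    SceneObject("block2", "red", (2.0, 2.0)),
    SceneObject("block3", "blue", (2.1, 1.9)),
]

# "That red block" plus pointing near (2, 2) grounds to block2: the gesture
# rules out block1, and the spoken color rules out the nearer blue block3.
target = resolve_reference("red", (2.0, 2.0), scene)
```

Neither modality alone disambiguates here: two objects are red, and two objects are near the pointed location. The intersection of the two constraints does.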
Below is a screenshot from "Kirby," a mobile robot that generates a simulated representation of the real world it navigates. Kirby's human partner communicates through a speech and gesture interface to that virtual environment, which generates grounded representations and instructions that are communicated back to the robot to execute in the real world.
VoxWorld is a multimodal simulation platform used to build interactive intelligent systems capable of situational understanding. VoxWorld grew out of VoxML, a modeling language for visually grounding language through the properties of objects, events, and relations, which I developed with my Ph.D. advisor James Pustejovsky, and VoxSim, a Unity-based visual event simulator that I wrote for my dissertation. VoxWorld is the platform underlying intelligent agents like Diana, and with it we have conducted research on event semantics, spatial reasoning, and referring expressions, and have applied the technology to the medical and robotics domains, among others.
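To give a flavor of the kind of grounding VoxML expresses, here is a simplified sketch of a voxeme-style entry rendered as a Python dictionary. The field names (lex, type, habitat, afford_str) echo the general categories VoxML uses to encode an object's geometry, placements, and affordances, but this is an illustrative rendering only, not the platform's actual file format or API.

```python
# Illustrative sketch of a VoxML-style object encoding for a cup, as a plain
# Python dict. Field names and values are simplified approximations of the
# kinds of information VoxML captures; they are not the real markup.

cup = {
    "entity": "object",
    "lex": {"pred": "cup"},
    # Geometric typing: overall shape, concavity, axes of symmetry.
    "type": {
        "head": "cylindroid",
        "concavity": "concave",
        "rotational_symmetry": ["Y"],
    },
    # Habitat: orientations/placements under which interactions are possible,
    # e.g. the cup's intrinsic "up" axis aligned with world-up.
    "habitat": {"intrinsic": {"up": "align(Y, up)"}},
    # Affordance structure: behaviors the object's form makes available.
    "afford_str": ["grasp(agent, this)", "contain(this, liquid)"],
}

def affords(voxeme, behavior):
    """Check whether a voxeme-style entry lists an affordance whose
    predicate name matches the given behavior."""
    return any(a.startswith(behavior + "(") for a in voxeme["afford_str"])
```

An agent reasoning over such an entry could, for example, check `affords(cup, "grasp")` before attempting a pick-up action, or use the habitat to decide which orientation of the cup supports containment.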