Communicating With Computers (CwC) is a DARPA-funded program that explores communication between humans and machines. CwC aims to build intelligent systems that are not just servants but collaborators, able to help humans perform a variety of tasks, from construction to composition and data exploration. In the context of CwC, I was technical lead on a large collaboration, including Brandeis LLC and the CSU Vision lab, that used the VoxWorld Platform to develop Diana, a multimodal interactive agent that can hear, see, and understand her human partner, and that communicates using spoken language and gesture to collaborate on construction tasks in a virtual environment. Multimodal agents like Diana represent a step forward for intelligent systems, toward agents that can understand not only language but also the situation and context they inhabit, whether in the real world or in a mixed-reality environment shared with humans.
Our research from CwC has won awards and been presented at a number of top-tier conferences, including *ACL, NeurIPS, and AAAI. At the AAAI conference in February 2020, we presented a live demo of Diana in New York City. The video below was presented alongside the demo:
As intelligent agents become more integrated with everyday life, interacting with them will not be limited to speech. Interactions will be inherently multimodal, drawing on spoken and typed language; haptic input from pads, keys, and sensors; head and hand gestures from image and video RGBD captures; and contextualized, situational awareness of the local environment, including object and action awareness. When communication becomes multimodal, each modality in operation provides an orthogonal angle through which to probe the computational model of the other modalities, including the behaviors and communicative capabilities afforded by each. Multimodal interactions therefore require a unified framework and control language through which systems interpret inputs and behaviors and generate informative outputs. We study and model the semantics of the communication to generate representations for common ground between human and computer.
Below is a screenshot from "Kirby," a mobile robot that generates a simulated representation of the real world it navigates through. Kirby's human partner communicates through a speech and gestural interface to that virtual environment and generates grounded representations and instructions, which are communicated back to the robot to execute in the real world.
BabyBAW (Best of All Worlds) is an NSF-funded EAGER program, part of the NSF 2026 Idea Machine. We draw direct inspiration from developmental psychology to build architectures and platforms that combine neural networks, symbolic reasoning, and embodied simulation for their respective strengths (the "Best of All Worlds" approach). BabyBAW uses the VoxWorld Platform as its simulation engine, and we prototype and develop a variety of tasks on which to test the BAW architecture(s), allowing BAW agents to explore and learn in a manner similar to a human infant.
In the image below, BabyBAW learns to stack two blocks in just a few minutes by attending to the height of the structure it builds. Veridical knowledge of parameters like height is one of the advantages provided by embodied simulation methods.
VoxWorld is a multimodal simulation platform used to build interactive intelligent systems capable of situational understanding. VoxWorld developed out of VoxML, a modeling language for visually grounding language through the properties of objects, events, and relations, which I developed with my Ph.D. advisor James Pustejovsky, and VoxSim, a Unity-based visual event simulator that I wrote for my dissertation. VoxWorld is the platform underlying intelligent agents like Diana. With it, we have conducted research on event semantics, spatial reasoning, and referring expressions, and have applied the technology to the medical and robotics domains, among others.