iSAT
iSAT (the Institute for Student AI Teaming) is one of the inaugural NSF-funded AI Institutes, led by the University of Colorado Boulder.
The goal of iSAT is to build AI agents that can assist small groups engaged in STEM learning by observing student interactions and group dynamics. AI in education can play the role of an expert, a near-peer, or a co-learner. We are interested not in the traditional intelligent tutoring system paradigm, but in the use of AI to help teams of students learn better by helping them engage in the social practice of science.
My group's work in iSAT is primarily concerned with the modeling of multimodal interaction during collaborative tasks. This includes recognizing gestures, grounding objects in context and reasoning about their relationship to the task at hand, and detecting collaborative problem-solving skills from multimodal inputs. In the image below, we see some students in the lab engaged in a collaborative task, where we automatically track the people, their hands, and their skeletons.
BabyBAW
BabyBAW (Best of All Worlds) is an NSF-funded EAGER program, part of the NSF 2026 Idea Machine. We draw direct inspiration from developmental psychology to build architectures and platforms that combine neural networks, symbolic reasoning, and embodied simulation for their respective strengths (the "Best of All Worlds" approach). BabyBAW uses the VoxWorld Platform as its simulation engine, and we prototype and develop a variety of tasks on which to test the BAW architecture(s), allowing BAW agents to explore and learn in a manner similar to a human infant.
In the image below, BabyBAW learns to stack two blocks in just a few minutes by attending to the height of the structure it builds. Veridical knowledge of parameters like height is one of the advantages provided by embodied simulation methods.
AIDA
AIDA (Active Interpretation of Disparate Alternatives) is a program funded by DARPA that seeks to develop automated understanding of events and situations from different perspectives and diverse data sources.
My group's work in the context of AIDA follows up on findings from CSU's vision group about the interchangeability of feature spaces in CNN models.
We have explored these interchangeability findings in Transformer-based language models and found that, using simple affine transformation techniques, we can transfer information between different language models in tasks like coreference resolution and cognate detection. We can even transfer information on a novel language into a target language model when the target model was not trained on that language! We believe this has potentially profound implications for the properties of language embedding spaces.
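As a rough illustration of what an affine mapping between two embedding spaces looks like, the sketch below fits a matrix and a bias by ordinary least squares on paired vectors and then applies the map to a held-out vector. It is not our actual pipeline: the data are synthetic stand-ins rather than real language-model embeddings, the least-squares fit is an assumed choice of fitting method, and the names `solve`, `fit_affine`, and `apply_affine` are hypothetical helpers written for this example.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.

    a is an n x n list of lists, b is n x m. Assumes a is non-singular.
    """
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_affine(src, tgt):
    """Least-squares fit of tgt ~= src @ W + b.

    Returns a (d_src + 1) x d_tgt matrix whose last row is the bias b.
    """
    a = [list(row) + [1.0] for row in src]  # append 1 to absorb the bias
    k, d = len(a[0]), len(tgt[0])
    ata = [[sum(r[i] * r[j] for r in a) for j in range(k)] for i in range(k)]
    aty = [[sum(r[i] * t[j] for r, t in zip(a, tgt)) for j in range(d)]
           for i in range(k)]
    return solve(ata, aty)  # normal equations: (A^T A) M = A^T Y


def apply_affine(m, vec):
    """Map one source vector into the target space using the fitted matrix."""
    aug = list(vec) + [1.0]
    return [sum(aug[i] * m[i][j] for i in range(len(aug)))
            for j in range(len(m[0]))]


# Synthetic stand-in data (an assumption): 3-d "source" vectors and 2-d
# "target" vectors related by a known affine map.
random.seed(0)
true_m = [[0.5, -1.0], [2.0, 0.3], [-0.7, 1.5], [0.1, -0.2]]  # last row = bias
src = [[random.gauss(0, 1) for _ in range(3)] for _ in range(20)]
tgt = [apply_affine(true_m, x) for x in src]

m = fit_affine(src, tgt)

# Compare the fitted map with the true one on a vector not used for fitting.
held_out = [0.3, -1.2, 0.8]
pred = apply_affine(m, held_out)
expected = apply_affine(true_m, held_out)
err = max(abs(p - q) for p, q in zip(pred, expected))
print("max abs error on held-out vector:", err)
```

With real models, `src` and `tgt` would be embeddings of the same inputs drawn from two different models, and the quality of the fitted map would be judged on a downstream task rather than on reconstruction error.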
Communicating With Computers
Communicating With Computers (CwC) is a program funded by DARPA that explores communication between humans and machines. CwC aims to build intelligent systems that are not just servants, but collaborators, and can help humans perform a variety of tasks, from construction to composition and data exploration. In the context of CwC, I was technical lead on a large collaboration, including Brandeis LLC and the CSU Vision lab, that used the VoxWorld Platform to develop Diana, a multimodal interactive agent that can hear, see, and understand her human partner, and communicates using spoken language and gesture to collaborate on construction tasks in a virtual environment. Multimodal agents like Diana represent a step forward for intelligent systems, toward agents that can not only understand language, but understand the situation and context that they inhabit, whether in the real world or in a mixed-reality environment shared with humans.
Our research from CwC has won awards and been presented at a number of top-tier conferences, including *ACL, NeurIPS, and AAAI. At the AAAI conference in February 2020, we presented a live demo of Diana in New York City. The video below was presented alongside the demo:
Multimodal Semantic Grounding for HCI/HRI
As intelligent agents become more integrated with everyday life, interacting with them will not be limited to speech. Interactions will be inherently multimodal, drawing on spoken and typed language; haptic input from pads, keys, and sensors; head and hand gestures from image and video RGBD captures; and contextualized, situational awareness of the local environment, including object and action awareness. When communication becomes multimodal in nature, each modality in operation provides an orthogonal angle through which to probe the computational model of the other modalities, including the behaviors and communicative capabilities afforded by each. Multimodal interactions therefore require a unified framework and control language through which systems interpret inputs and behaviors and generate informative outputs. We study and model the semantics of the communication to generate representations for common ground between human and computer.
Below is a screenshot from "Kirby," a mobile robot that generates a simulated representation of the real world it navigates through. Kirby's human partner communicates through a speech and gesture interface to that virtual environment, generating grounded representations and instructions that are communicated back to the robot to execute in the real world.
VoxWorld Platform
VoxWorld is a multimodal simulation platform used to build interactive intelligent systems capable of situational understanding. VoxWorld developed out of VoxML, a modeling language for visually grounding language through the properties of objects, events, and relations, that I developed with my Ph.D. advisor James Pustejovsky, and VoxSim, a Unity-based visual event simulator that I wrote for my dissertation. VoxWorld is the platform underlying intelligent agents like Diana and BabyBAW. With it, we have conducted research on event semantics, spatial reasoning, and referring expressions and have applied the technology to the medical and robotics domains, among others.