Many questions in biology, from development to neuroscience and medicine, require the identification of fine-grained behaviors. Researchers will develop novel computer vision and natural language processing technology to improve behavioral analysis in biology and medicine. In this project, they will provide rich knowledge graph representations to language and vision models to learn grounded representations of complex animal behavior. They will develop new open information extraction methods for unstructured text that extract descriptions of the physical behavior of animals and humans. Unifying these extracted descriptions by species, they will construct fine-grained knowledge graph representations of animal behavior that serve as a symbolic world model to ground representation learning of video content.
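The consolidation step above can be sketched in miniature: open information extraction emits (subject, relation, object) triples from text, which are then merged into a per-entity graph. The triples, relation names, and the `build_behavior_graph` helper below are all illustrative assumptions, not the project's actual extraction output or schema.

```python
from collections import defaultdict

# Hypothetical triples, as an open information extraction system
# might emit them from unstructured descriptions of behavior.
TRIPLES = [
    ("mouse", "performs", "rearing"),
    ("mouse", "performs", "grooming"),
    ("grooming", "involves", "forepaw movement"),
    ("human", "performs", "downward dog"),
    ("downward dog", "involves", "hip flexion"),
]

def build_behavior_graph(triples):
    """Consolidate extracted triples into a simple adjacency-style
    knowledge graph keyed by subject entity, then by relation."""
    graph = defaultdict(lambda: defaultdict(set))
    for subj, rel, obj in triples:
        graph[subj][rel].add(obj)
    # Freeze into plain dicts with sorted object lists for stable output.
    return {s: {r: sorted(objs) for r, objs in rels.items()}
            for s, rels in graph.items()}

graph = build_behavior_graph(TRIPLES)
```

A real system would additionally resolve synonymous entities and relations across documents; here duplicates are merged only by exact string match.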
While these descriptions may be aligned with only certain images, they can be consolidated into a more expressive structured representation of a dynamics model for animals. Using these structures, aligned with language descriptions and video content, the researchers will apply self-supervised training objectives to learn from video sources such as nature documentaries and yoga pose videos, which are more semantically related to the animal-tracking videos recorded in laboratory settings. Thus, whereas current action recognition systems in biology [Datta et al. 2019, von Ziegler et al. 2021, Hausmann et al. 2021] do not take priors, language, or long-range temporal reasoning into account, this project proposes to develop tri-modal systems that integrate language, video, and structured knowledge to advance action recognition. The researchers anticipate that these models will generalize more robustly and efficiently to various applications in biology.
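One common family of self-supervised objectives for aligning modalities is contrastive (InfoNCE-style) training, in which paired video, language, and graph embeddings are pulled together and mismatched pairs pushed apart. The abstract does not specify the objective, so the sketch below is an assumption for illustration only; the encoders are stood in for by toy NumPy arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss: each anchor embedding should
    score highest against its own paired positive embedding."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature          # cosine similarities, scaled
    # Log-softmax over each row; matched pairs sit on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return -log_probs[idx, idx].mean()

# Toy embeddings standing in for video, language, and knowledge-graph
# encoders; text/graph are near-copies of video to mimic aligned pairs.
video = rng.normal(size=(4, 16))
text = video + 0.05 * rng.normal(size=(4, 16))
graph_emb = video + 0.05 * rng.normal(size=(4, 16))

# A tri-modal objective could sum pairwise alignment terms.
loss = info_nce(video, text) + info_nce(video, graph_emb)
```

Aligned pairs yield a lower loss than deliberately shuffled ones, which is the signal such an objective would exploit during training.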