Many questions in biology, from development to neuroscience and medicine, require the identification of fine-grained behaviors. We will develop novel computer vision and natural language processing technology to improve behavioral analysis in biology and medicine. Specifically, we will build deep learning models that can efficiently learn joint representations from video and heterogeneous data sources (e.g., textual descriptions and knowledge graphs). To do so, we will mine the written literature as well as video-sharing platforms to extract a knowledge graph of behavior, and then train tri-modal models based on vision, language, and this knowledge graph. We believe that these models will generalize more robustly and efficiently to various applications in biology.