We have been working on two tasks that go beyond English-language image description: Multimodal Translation is the task of translating the description of an image, given a corpus of parallel text aligned with images; Crosslingual Description is the task of generating descriptions, given a corpus of independently collected source- and target-language descriptions aligned with images.
We proposed the first models for crosslingual image description (Elliott et al., 2015), released the Multi30K corpus of bilingually described images (Elliott et al., 2016), devised a doubly-attentive multimodal neural translation model (Calixto et al., 2016), and organised the first and second shared tasks on both of these problems (Specia et al., 2016).
More recently, we proposed a model that learns visually grounded representations as an auxiliary task for multimodal translation (Elliott and Kádár, 2017); our approach (UvA-TiCC) performed well against the state of the art for English-German translation in the 2017 Multimodal Translation shared task.
Visual Dependency Representations capture the prominent spatial relationships between the objects in an image (Elliott, 2015). These representations have proven useful for automatic image description, using either detected objects (Elliott and de Vries, 2015) or gold-standard objects (Elliott and Keller, 2013), and for query-by-example image retrieval (Elliott, Lavrenko, and Keller, 2014). A treebank of annotated images and an off-the-shelf parsing model for inducing the representations are available.