The Visual and Linguistic Treebank is a data set of images annotated with human-written descriptions, object boundaries, and Visual Dependency Representations. The images are freely available from the Action Recognition Task in the PASCAL VOC 2010 data set; our annotations are available only for the trainval data. Descriptions are available for all 2,424 images in the trainval data, and object annotations and Visual Dependency Representations are available for a subset of 341 images.
The data has been organised to follow the format of the Flickr8K data set to make it easier to use your existing experimental infrastructure to run experiments with this new data set.
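Because the data follows the Flickr8K conventions, a loader written for that data set should work unchanged. As a minimal sketch, assuming the descriptions file uses the Flickr8K token layout of `image.jpg#N<TAB>description` (the actual file name and layout in this data set may differ):

```python
from collections import defaultdict

def load_descriptions(path):
    """Parse a Flickr8K-style descriptions file into a dict.

    Assumes each line has the form 'image.jpg#N<TAB>description',
    as in the Flickr8K token files. Returns a mapping from image
    file name to the list of its descriptions.
    """
    captions = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_id, caption = line.split("\t", 1)
            image_name = image_id.split("#")[0]  # drop the '#N' suffix
            captions[image_name].append(caption)
    return captions
```

With three descriptions per image, the returned lists should each have length three for images in the trainval data.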
The images in this data set are drawn from the trainval portion of the PASCAL VOC 2010 Action Recognition Taster. The images depict at least one of ten actions: running, walking, riding a bike, reading a book, riding a horse, talking on the phone, jumping, taking a photo, using a computer, and playing an instrument. Each image has a ground-truth bounding box around the person performing the action.
The images should be downloaded directly from PASCAL.
Each image in the data set is associated with three image descriptions, as demonstrated in the Overview. The descriptions were collected from untrained annotators on Amazon Mechanical Turk and are provided in their original form.
341 images are annotated with the objects referred to in the image descriptions. The annotations take the form of labelled polygons, created using a modified version of the LabelMe tool (Russell et al., 2008). The data is presented in the raw LabelMe XML format, and we provide a tool that extracts the polygon data from the XML.
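The polygon data can be pulled out of the XML with a few lines of standard-library Python. This sketch assumes the standard LabelMe layout of `<object>` elements containing a `<name>` label and a `<polygon>` of `<pt>` points with `<x>`/`<y>` children; element names produced by the modified tool may differ slightly:

```python
import xml.etree.ElementTree as ET

def extract_polygons(xml_path):
    """Extract labelled polygons from a LabelMe XML annotation file.

    Returns a list of (label, points) pairs, where points is a list
    of (x, y) integer coordinates tracing the object boundary.
    Assumes the standard LabelMe element names; the modified tool
    used for this data set may vary slightly.
    """
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.iter("object"):
        name = obj.findtext("name", default="").strip()
        points = [
            (int(float(pt.findtext("x"))), int(float(pt.findtext("y"))))
            for pt in obj.iter("pt")
        ]
        objects.append((name, points))
    return objects
```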
A Visual Dependency Representation is a structured representation that captures the prominent spatial relationships between the objects in an image. This representation of an image has proven useful for both automatic image description (Elliott and Keller, 2013) and query-by-example image retrieval (Elliott, Lavrenko, and Keller, 2014). One Visual Dependency Representation is available for each image-description pair in the data set, for a total of 1,023 Visual Dependency Representations.
Visual Dependency Representations are available in their raw dotty format, and in CoNLL-X format. Additionally, we provide a tool that automatically converts the dotty formatted files into a CoNLL-X style dependency representation.
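The CoNLL-X files can be read with any standard dependency-parsing toolkit, or directly with a short loader. This sketch assumes the usual CoNLL-X layout of ten tab-separated columns per token (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with blank lines separating trees; in a Visual Dependency Representation the FORM column would hold an object label and DEPREL a spatial relation (the relation names below are illustrative):

```python
def read_conll(path):
    """Read CoNLL-X formatted dependency trees.

    Returns a list of trees; each tree is a list of token dicts with
    the id, form, head index (0 = root), and dependency relation.
    """
    trees, tokens = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if tokens:
                    trees.append(tokens)
                    tokens = []
                continue
            cols = line.split("\t")
            tokens.append({"id": int(cols[0]), "form": cols[1],
                           "head": int(cols[6]), "deprel": cols[7]})
    if tokens:  # handle a file without a trailing blank line
        trees.append(tokens)
    return trees
```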
D. Elliott, V. Lavrenko, F. Keller. 2014. Query-by-Example Image Retrieval using Visual Dependency Representations. To appear in Proceedings of the 25th International Conference on Computational Linguistics (COLING '14), Dublin, Ireland.
D. Elliott and F. Keller. 2014. Comparing Automatic Evaluation Measures for Image Description. In Proceedings of the 52nd Annual Meeting of the Association of Computational Linguistics (ACL '14), Baltimore, Maryland, U.S.A., pages 452--457.
D. Elliott and F. Keller. 2013. Image Description using Visual Dependency Representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP '13), Seattle, Washington, U.S.A., pages 1292--1302.
B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. 2008. LabelMe: A Database and Web-Based Tool for Image Annotation. International Journal of Computer Vision, 77(1-3):157--173.