The images in this data set are drawn from the trainval portion of the PASCAL VOC 2010 Action Recognition Taster. Each image depicts at least one of ten actions: running, walking, riding a bike, reading a book, riding a horse, talking on the phone, jumping, taking a photo, using a computer, and playing an instrument. Each image has a ground-truth bounding box around the person performing the action.

The images should be downloaded directly from PASCAL.
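The bounding boxes come in the standard PASCAL VOC annotation XML. As a minimal sketch (assuming the usual VOC `<object>`/`<bndbox>` layout; the function name is ours), the person boxes can be read like this:

```python
import xml.etree.ElementTree as ET

def person_boxes(xml_path):
    """Return (xmin, ymin, xmax, ymax) for every 'person' object
    in a PASCAL VOC annotation file."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        if obj.findtext("name") == "person":
            bb = obj.find("bndbox")
            boxes.append(tuple(int(float(bb.findtext(tag)))
                               for tag in ("xmin", "ymin", "xmax", "ymax")))
    return boxes
```

This ignores all non-person objects, so it recovers exactly the ground-truth boxes described above.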


  1. A man is using his laptop computer. He is sitting a chair in a room which also has a couch in it.
  2. There is a man browsing on his laptop. There is a couch next to the love seat he is sitting in.
  3. A man is using a laptop. He is sitting in a chair in his lounge.

Each image in the data set is associated with three image descriptions, as demonstrated in the Overview. The descriptions were collected from untrained annotators on Amazon Mechanical Turk, and they are provided in their original form.1

Object Annotations

341 images are annotated with the objects referred to in the image descriptions. The annotations are labelled polygons, collected using a modified version of the LabelMe tool (Russell et al. 2008). The data is provided in the raw LabelMe XML format, together with a tool for extracting the polygon data from the XML.
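For readers who prefer not to use the provided tool, the polygons can also be pulled out directly. This is a sketch, not the distributed tool, and it assumes the standard LabelMe layout (`<object><name>…</name><polygon><pt><x>…</x><y>…</y></pt>…</polygon></object>`):

```python
import xml.etree.ElementTree as ET

def labelme_polygons(xml_path):
    """Return a list of (label, [(x, y), ...]) pairs, one per
    annotated object in a LabelMe XML file."""
    root = ET.parse(xml_path).getroot()
    result = []
    for obj in root.iter("object"):
        label = (obj.findtext("name") or "").strip()
        pts = [(int(float(pt.findtext("x"))), int(float(pt.findtext("y"))))
               for pt in obj.iter("pt")]
        if pts:
            result.append((label, pts))
    return result
```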

Visual Dependency Representations

A Visual Dependency Representation is a structured representation that captures the prominent spatial relationships between the objects in an image. This representation has proven useful for both automatic image description (Elliott and Keller, 2013) and query-by-example image retrieval (Elliott, Lavrenko, and Keller, 2014). One Visual Dependency Representation is available for each image-description pair in the data set, resulting in a total of 1,023 Visual Dependency Representations.

Visual Dependency Representations are available in their raw dotty format and in CoNLL-X format. Additionally, we provide a tool that automatically converts the dotty-formatted files into CoNLL-X style dependency representations.
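The CoNLL-X files follow the CoNLL-X shared-task column layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), so each VDR node is one tab-separated line. A minimal reader, keeping only the columns a VDR needs (the example relation label below is illustrative, not taken from the data):

```python
def read_conllx(lines):
    """Parse CoNLL-X lines into (id, form, head, deprel) tuples.
    HEAD == 0 marks the root of the dependency structure."""
    tokens = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines separate structures
        cols = line.split("\t")
        tokens.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens

sample = ["1\tman\t_\t_\t_\t_\t0\tROOT\t_\t_",
          "2\tlaptop\t_\t_\t_\t_\t1\tbeside\t_\t_"]
```

Here `read_conllx(sample)` yields a root node `man` with a `laptop` node attached to it by a spatial relation.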


D. Elliott, V. Lavrenko, F. Keller. 2014. Query-by-Example Image Retrieval using Visual Dependency Representations. To appear in Proceedings of the 25th International Conference on Computational Linguistics (COLING '14), Dublin, Ireland.

D. Elliott and F. Keller. 2014. Comparing Automatic Evaluation Measures for Image Description. In Proceedings of the 52nd Annual Meeting of the Association of Computational Linguistics (ACL '14), Baltimore, Maryland, U.S.A., pages 452--457.

D. Elliott and F. Keller. 2013. Image Description using Visual Dependency Representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP '13), Seattle, Washington, U.S.A., pages 1292--1302.


B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. 2008. LabelMe: A Database and Web-Based Tool for Image Annotation. International Journal of Computer Vision, 77(1-3):157-173.


  1. In Elliott and Keller (2013), the descriptions were POS tagged with the Stanford POS Tagger v.3.1.0 (pre-trained english-bidirectional-distsim model) and dependency parsed with MaltParser v.1.7.2 (pre-trained engmalt.poly-1.7 model).