The CAVIAR team collected ground truth video sequences for evaluating project performance and for use by the wider computer vision community. The ground truth is available on the web, encoded in an XML variant that we designed for computer vision data.
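As a hedged illustration of how such XML annotations might be consumed, the sketch below parses a hypothetical annotation file with Python's standard library. The element and attribute names used here (frame, object, box, xc, yc, w, h) are assumptions for illustration only, not the actual schema of the CAVIAR XML variant.

```python
# Hypothetical sketch only: the tag/attribute names below are NOT
# the real CAVIAR schema; they stand in for whatever the actual
# XML variant defines.
import xml.etree.ElementTree as ET

def load_boxes(xml_text):
    """Return (frame_number, object_id, (xc, yc, w, h)) tuples."""
    root = ET.fromstring(xml_text)
    records = []
    for frame in root.iter("frame"):
        fnum = int(frame.get("number"))
        for obj in frame.iter("object"):
            box = obj.find("box")
            records.append((fnum, int(obj.get("id")),
                            tuple(int(box.get(k)) for k in ("xc", "yc", "w", "h"))))
    return records

# Minimal made-up example document in the assumed schema.
sample = """<dataset>
  <frame number="0">
    <object id="1"><box xc="120" yc="80" w="30" h="90"/></object>
  </frame>
</dataset>"""
print(load_boxes(sample))
```

The real files would be loaded with `ET.parse(path)` instead of `ET.fromstring`; the point is only that per-frame, per-target annotations map naturally onto nested XML elements.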
Altogether there are 52 sequences comprising about 90,000 frames. Roughly one third were recorded in an indoor office lobby; the other two thirds were recorded in a shopping centre, with two views of the same activity.
The scripted (and unscripted) activities include: walk, browse, slump, left object, meet, fight, window shop, shop enter, shop exit.
The ground truth for the moving targets was compiled by hand, and includes geometric position information, short-term activity descriptions and long-term activity descriptions. The ground truth labelling for some of the people (108) in some of the video sequences (19) was extended to also mark up the head, hands, feet and shoulders; in total, over 52,000 target frames were annotated. This additional labelling was not used in the project, but has been announced and made available for use by the international community.
One sequence with three independent labellings was used for labelling-consistency experiments. The results showed good geometric and timing consistency, but the semantic labelling was less consistent: roughly 1% of discrepancies were outright labelling errors, while over 40% were differences in semantic ontology or event timing.
A sample image from the ground truth, with the annotated people, heads, etc., is shown here.
More details are here.
Papers that describe some of the ground truth are: