The goal of this task was to compare the performance of several state-of-the-art detectors, evaluated on a large database against hand-labelled ground truth.
The assumption is that an ideal target detector would produce the same output as a human. One of the conclusions of this work is that the output of a pixel-based detector is quite different from the manual labelling. The target detectors do not have access to higher-level information, such as how to segment a group into its individuals. The ground truth labelling describes groups and also the individuals that compose each group. The target detector, by contrast, sees a connected component of foreground pixels and determines its bounding box, so a group of people is detected as a single target. When the statistics are computed, this reduces the absolute target detection rate. On the other hand, all detectors are penalised in the same way, which means that the comparison between the detectors remains valid.
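The group effect described above can be illustrated with a minimal sketch. It is not the code used in the evaluation: the function name `bounding_boxes`, the 4-connectivity and the toy masks are assumptions made only for this illustration.

```python
from collections import deque


def bounding_boxes(mask):
    """Return one (x_min, y_min, x_max, y_max) box per 4-connected foreground blob.

    `mask` is a list of rows; a truthy value marks a foreground pixel.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over one connected component.
            queue = deque([(r, c)])
            seen[r][c] = True
            x_min = x_max = c
            y_min = y_max = r
            while queue:
                y, x = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Two people standing apart: two blobs, hence two boxes.
apart = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
]
# Two people side by side: their pixels touch, hence a single box.
group = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
print(bounding_boxes(apart))
print(bounding_boxes(group))
```

In the second mask the two people merge into one connected component, so only one box is reported even though the ground truth would list two individuals.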
We compared the following detectors: Basic Background Subtraction (BBS), W4, single Gaussian model (SGM), multiple Gaussian model (MGM) and LOTS. All these detectors operate on single images; there is no temporal link between the images of a sequence. To demonstrate the increase in performance (faster processing and lower false alarm rate) obtained by adding such a link, we enhanced the BBS with a Kalman filter. This method is referred to as "Track" in the graphs.
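The page does not say which state model the Kalman filter uses. The sketch below therefore assumes a constant-velocity model on a single coordinate of the box centre, with hypothetical noise parameters `q` and `r`. It only illustrates how a temporal link between frames can be added; it does not reconstruct the "Track" method.

```python
def kalman_cv_step(state, cov, z, dt=1.0, q=1e-3, r=1.0):
    """One predict/update cycle of a 1-D constant-velocity Kalman filter.

    state: (position, velocity); cov: 2x2 covariance as nested tuples;
    z: measured position (e.g. the x coordinate of a detected box centre).
    q and r are assumed process and measurement noise variances.
    """
    p, v = state
    (a, b), (c, d) = cov

    # Predict: x' = F x, P' = F P F^T + q*I, with F = [[1, dt], [0, 1]].
    p_pred = p + dt * v
    v_pred = v
    p00 = a + dt * (b + c) + dt * dt * d + q
    p01 = b + dt * d
    p10 = c + dt * d
    p11 = d + q

    # Update with a position-only measurement, H = [1, 0].
    s = p00 + r                  # innovation variance
    k0, k1 = p00 / s, p10 / s    # Kalman gain
    y = z - p_pred               # innovation
    new_state = (p_pred + k0 * y, v_pred + k1 * y)
    new_cov = (
        ((1 - k0) * p00, (1 - k0) * p01),
        (p10 - k1 * p00, p11 - k1 * p01),
    )
    return new_state, new_cov


# A target moving 2 pixels per frame, observed without noise.
state = (0.0, 0.0)
cov = ((100.0, 0.0), (0.0, 100.0))
for k in range(1, 51):
    state, cov = kalman_cv_step(state, cov, z=2.0 * k)
print(state)
```

With a predicted position available for each frame, detections far from any prediction can be treated as suspect. That is one plausible way a tracker lowers the false alarm rate, though the page does not confirm the mechanism.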
To evaluate the performance of the detectors, we chose the measures tracking detection rate (TRDR) and false alarm rate (FAR) as proposed by Black.
This figure shows the TRDR with respect to overlap:
This figure shows the FAR with respect to overlap:
We see that the TRDR decreases as the overlap requirement increases. The absolute detection rate stays below 60%, which may seem surprising. There are several reasons. First, an error is introduced by detecting a group as a single target. In the ground truth, only active targets are labelled; the others are considered background. Second, the detectors need to initialise a reference image. Any target that differs from this reference image is systematically detected.
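The dependence of both measures on the overlap requirement can be sketched as follows. The page does not spell out the formulas, so this assumes the definitions usually given for these measures: TRDR = TP / (number of ground truth targets) and FAR = FP / (TP + FP). The overlap measure (intersection over union) and the greedy one-to-one matching are further assumptions of this sketch, and all function names are hypothetical.

```python
def area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap(a, b):
    """Intersection over union of two boxes (assumed overlap measure)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (area(a) + area(b) - inter)


def trdr_far(frames, threshold):
    """Compute (TRDR, FAR) over frames of (ground_truth_boxes, detected_boxes)."""
    tp = fp = gt_total = 0
    for gt_boxes, det_boxes in frames:
        gt_total += len(gt_boxes)
        unmatched = list(gt_boxes)
        for det in det_boxes:
            # Greedy matching: pair each detection with its best unmatched target.
            best = max(unmatched, key=lambda g: overlap(det, g), default=None)
            if best is not None and overlap(det, best) >= threshold:
                tp += 1
                unmatched.remove(best)
            else:
                fp += 1
    trdr = tp / gt_total if gt_total else 0.0
    far = fp / (tp + fp) if (tp + fp) else 0.0
    return trdr, far


frames = [
    # Frame 1: one target, detected exactly.
    ([(0, 0, 10, 10)], [(0, 0, 10, 10)]),
    # Frame 2: two targets; one half-covered detection and one spurious box.
    ([(0, 0, 10, 10), (20, 0, 30, 10)], [(0, 0, 10, 5), (50, 50, 60, 60)]),
]
for t in (0.4, 0.6):
    print(t, trdr_far(frames, t))
```

Raising the threshold from 0.4 to 0.6 turns the half-covered detection from a true positive into a false alarm, so the TRDR falls and the FAR rises, which is the trend described above.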
An important result is that the TRDR values of all methods (except W4) are very close for T=60%. To decide which approach is best, we need to look at the FAR, where we see significant differences: LOTS has the lowest FAR, followed by SGM and MGM. In terms of computation time, LOTS is the fastest (130 Hz) and MGM the slowest (2.8 Hz), with all others in between. Combining the BBS with a Kalman filter produces the same TRDR and reduces the FAR by about 35 percentage points (from ~70% for BBS to 35% for Track), roughly halving it.
LOTS and SGM outperform the more complex background model, MGM, but this may be an artifact of the database, which contains few periodic changes. In a more complex scenario with a periodically changing background, LOTS and SGM may fail and only MGM may produce good results.
A paper that describes the comparison is: