Sequential behaviour recognition commonly uses a Hidden Markov Model (HMM), but an HMM's state transition model implicitly imposes an exponential (geometric) time-in-state distribution. We replaced this distribution with an empirical time-in-state distribution, giving a Hidden Semi-Markov Model (HSMM). The commonly used HSMM algorithms are O(T^2) in the sequence length T, which makes continuous video computationally infeasible. We located an O(T) algorithm from gene-sequence analysis and adapted it for video-sequence use.
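The difference between the two duration models can be sketched briefly. This is a hypothetical illustration: the self-transition probability and the empirical histogram below are invented values, not CAVIAR estimates.

```python
import numpy as np

# An HMM that stays in a state with self-transition probability a has an
# implicit geometric (discrete exponential) duration distribution:
#   P(d) = a**(d-1) * (1 - a)
# An HSMM replaces this with an arbitrary empirical time-in-state histogram.

a = 0.9                                 # HMM self-transition probability
d = np.arange(1, 11)                    # durations 1..10 frames
hmm_duration = a ** (d - 1) * (1 - a)   # implicit geometric distribution

# An HSMM can instead use any normalised histogram estimated from
# training video, e.g. a peaked time-in-state distribution:
empirical = np.array([0.00, 0.05, 0.10, 0.20, 0.30,
                      0.20, 0.10, 0.05, 0.00, 0.00])
```

The geometric distribution is forced to be monotonically decreasing, whereas real time-in-state distributions for behaviours are usually peaked around a typical duration, which is what the empirical histogram captures.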
We represented behaviour in a four-level scheme: movement and roles are inferred from image evidence, and instantaneous 'situation' and long-term 'context' descriptions are represented as graphs. We developed both a rule-based symbolic 'parsing' of the video sequence and an HSMM recognizer. The former is simpler, but the latter copes better with soft (probabilistic) evidence. We then compared the two algorithms, recognizing behaviour using the ground-truth tracking, IST feature descriptions and UEDIN role hypothesizing, over 7 context models, 80 sequences and 417 tracked persons. The rule-based recognizer achieved 57% and the HSMM recognizer 65% correct recognition of the contexts.
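How per-frame situation labels are formed from role and short-term activity labels can be sketched as follows. The level structure matches the description above, but the label strings and the combination function are illustrative assumptions, not the actual CAVIAR vocabulary.

```python
# Hypothetical sketch: an instantaneous 'situation' is formed by
# combining a role label with a short-term activity label; a 'context'
# model is then matched against the resulting per-frame sequence.

def situation(role, activity):
    """Combine a role and a short-term activity into one situation label."""
    return f"{role}:{activity}"

roles = ["browser", "browser", "walker"]        # one role per frame
activities = ["inactive", "active", "walking"]  # one activity per frame

situations = [situation(r, a) for r, a in zip(roles, activities)]
```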
We investigated whether an algorithm based on hard categorical decisions and hand-crafted decision rules would give better or worse recognition results than the probabilistic HSMM recognizer. We compared the rule-based 'parser' (which tolerates some single-frame movement and role classification errors) with the HSMM algorithm, which can exploit marginal evidence at a lower probability. In the following, the data come from all ground-truth sequences, using the ground-truth short-term activity and role classifications. True class labels are at the left edge of each table. The context label abbreviations are those used in the column headings of the confusion matrices below.
This classifier used hand-tuned rule-based and procedural matching algorithms (like a parser that tolerates erroneous states) to match the different context model graphs to the sequence of situations for each video. The situation sequence was derived from the combination of role and short-term activity labels. Overall, 70% of the situations (individual frames) were correctly classified and 57% of the behaviours (context models) were correctly recognized.
True\Pred | CW | CB | CI | CEn | CEx | CR | CWi | CErr | Tot | % |
CW | 63470 | 1103 | 3366 | 1149 | 58 | . | . | 5525 | 74671 | 85 |
CB | 656 | 15934 | . | 1188 | . | . | . | 8780 | 26558 | 60 |
CI | 5512 | 2575 | 18768 | . | . | 232 | . | 2704 | 29791 | 63 |
CEn | 1048 | . | . | 16785 | . | 6011 | . | 2384 | 26228 | 64 |
CEx | 371 | . | . | . | 10603 | . | . | 26895 | 37869 | 28 |
CR | . | . | . | . | . | . | . | 2488 | 2488 | 0 |
CWi | 67 | 10139 | . | . | 3766 | . | . | 8601 | 22573 | 0 |
Total | . | . | . | . | . | . | . | . | 220178 | 57 |
This classifier used the HSMM matching algorithm to match the different context model graphs to the sequence of situations for each video. The situation sequence was derived from the combination of role and short-term activity labels. Overall, 74% of the situations (individual frames) were correctly classified and 65% of the behaviours (context models) were correctly recognized, slightly better overall than the rule-based approach.
True\Pred | CW | CB | CI | CEn | CEx | CR | CWi | CErr | Tot | % |
CW | 65710 | 1103 | 4099 | 368 | . | . | . | 3391 | 74671 | 88 |
CB | 656 | 14872 | . | . | . | . | . | 11030 | 26558 | 56 |
CI | 191 | 224 | 21747 | . | . | . | . | 7629 | 29791 | 73 |
CEn | 1049 | . | . | 16261 | . | . | 17 | 8918 | 26228 | 62 |
CEx | 371 | . | . | . | 21206 | . | . | 16292 | 37869 | 56 |
CR | . | . | . | . | . | 528 | . | 1891 | 2488 | 20 |
CWi | . | 9565 | . | . | . | . | 2934 | 10074 | 22573 | 13 |
Total | . | . | . | . | . | . | . | . | 220178 | 65 |
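The per-class percentages in both tables are simply the diagonal count divided by the row total. A minimal check, using the published counts from the CW row of the HSMM table above ('.' entries are zero):

```python
# Per-class recognition rate from one confusion-matrix row:
# diagonal count / row total, expressed as a percentage.
# Counts are the CW row of the HSMM confusion matrix.

row = {"CW": 65710, "CB": 1103, "CI": 4099, "CEn": 368, "CErr": 3391}
total = sum(row.values())              # row total (the 'Tot' column)
rate = round(100 * row["CW"] / total)  # correct-recognition percentage
```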
A paper that describes the algorithm is: