Computer-assisted prescreening of video streams for unusual activities

Aims and Objectives

The project will investigate two novel computer-based image analysis processes to prescreen video sequences for abnormal or crime-oriented behaviour. The ultimate goal is to filter out image sequences where uninteresting normal activity is occurring, as well as the much easier case of sequences where nothing is occurring. The first process is for detecting, understanding and discriminating between similar types of interactions, such as two people fighting versus meeting and greeting. We propose to investigate and possibly extend the dynamic Hidden Markov Model technique as applied to tracked individuals to solve this problem. The second process is for analysing crowd scenes, where tracking of individuals is only possible over short time periods, and where the overall flow of the crowd is more salient. The goal is to discriminate between normal behaviour, such as people exiting from a football match in the usual way, and abnormal behaviour, such as when people have to divert around an obstacle (fallen person, fight, etc.). We propose to adapt global probabilistic models to flow data obtainable from short-time image tracking.

The objectives are:

To investigate and extend methods for classifying the interaction between multiple persons, capable of discriminating between subtly different behaviours.
To develop methods for flow-based analysis of the behaviour of many interacting individuals.
To apply the results of these two approaches to detection of criminal or dangerous situations in interactions between small groups and crowd situations.

Excitement and Novelty

With the recent installation of video surveillance systems in many city centres and other urban areas, there has been a massive increase in the ability to collect data. Traditionally, this data was monitored by human observers; more recently, it has been captured by recording equipment and examined only after undesirable events have occurred. What is more desirable is to be able to automatically detect potentially significant events as they happen. This full capability is beyond current computer technology and still requires human observers. Unfortunately, there are not enough operators, and boredom sets in quickly.

The research proposed here is aimed at collaborative working between human observers and computer-assisted prescreening: the computer would make initial assessments of video streams to select interesting sequences, which are then switched through to human operators for their more subtle assessment. This also allows a single human to manage more cameras simultaneously, as only the significant data will be relayed. Moreover, the portions of the video stream where the interesting events are occurring can be highlighted.

From a pragmatic viewpoint, the results could extend the capabilities of surveillance "blank-screen" technology, wherein operators are only presented with information when activity is occurring in the scene. The approaches presented here could also eliminate routine normal activity, such as two people walking together, so as to allow human operators to focus on unusual behaviours.

One key focus of this proposal is the understanding of interacting discrete agents over time, based on video data. The state of the art in recognising social interactions between people is the work of Oliver et al., who used coupled Hidden Markov Models (HMMs) for detecting and classifying interactions. Of particular interest were the meeting and following behaviours. Discriminating between different types of meeting behaviours is central to our project, so we will probably adapt many ideas from this work. More recent results by Gong et al. used multi-linked Hidden Markov Models to get improved performance. Their application was recognising interacting airport service vehicles. A particular strength of their approach is the automatic construction of links between agents when appropriate (as will be necessary in the work proposed here, because interactions can occur spontaneously between formerly non-interacting individuals).

The approaches described above used coupled probabilistic models representing relations between segmented individual tracks. An alternative is to use a joint probabilistic model covering the whole scene, as in Galeta et al., where viewpoints and trajectory models can be learned from observation. The representation encodes both spatial and temporal qualitative relationships between interacting moving entities (automobiles). A previous significant result on this problem was by Brand and Kettnaker, who used a human-constructed scene representation, using fully connected HMMs, without much intermediate symbolic interpretation of the data. This was a less effective model, as it also allows physically or temporally invalid couplings.

In contrast to the largely pure probabilistic modelling approaches summarised above, research has also been undertaken into representations with a distinct symbolic as well as a probabilistic content. Ivanov and Bobick used a grammatical representation for describing the temporal sequencing of events, and then developed a probabilistic parsing technique. The main application was to two-agent car-person interactions (e.g. pick up/drop off). Intille and Bobick tackled a much more extreme version of this problem, applying the mixed symbolic and probabilistic representation to recognising plays in American football, involving 22 agents (players). A library of play models was used to encode both spatial and temporal relationships. Most other sports-based video analysis work has tended to follow similar approaches.

The advantage of the symbolic representations is the ability to more easily encode a priori knowledge of the domain, particularly in situations where the number of observations is limited (e.g. when detecting unusual behaviours). The disadvantage compared to the more purely probabilistic representations is that the human-built models may fail to represent significant types of interactions.

The other key focus of this proposal is the understanding of interacting crowds over time, where individuals can be tracked for only a few frames, and where there might be many hundreds of people simultaneously in view. We are not aware of any previous research considering this problem. Of the research on interacting agents, that by Brand and Kettnaker might be a good foundation as it builds a probabilistic model of the entire scene simultaneously.

As well as interacting people and interacting automobiles, research has investigated people interacting with automobiles (Ivanov), including early work by Remagnino et al. A good overview of issues and recent progress in the broader area of dynamic scene understanding, including the topic of interest here, is given by Buxton.

In summary, there are a few good results on modelling discrete interactions using pure probabilistic and mixed probabilistic and symbolic representations. What is less well understood is how to model subtle distinctions between slightly different types of interactions, such as greeting behaviour versus fighting. In the area of recognising behaviour in contexts with many actors, such as crowd behaviour at sporting events, we have not found any good prior work. Thus, we claim that there are two good open research questions here, both of which have direct relevance to the crime detection and prevention programme.

If successful, the research proposed here will:

allow a variety of interacting behaviours to be specified,
classify the behaviours as rare/abnormal or uninteresting,
identify activity in unexpected places, and
do this at a near video rate but with a low false alarm rate, so as to quickly alert a human operator for further analysis.

Outline of the proposed methodology

The project will investigate two sub-problems based on interacting groups of humans.

Subtle Human Interaction Understanding

The goal of this subproject is to extend existing methods of behaviour recognition to be able to distinguish between subtly different interactions between a small group of individuals, in particular between greeting and fighting, or preparing for fighting.
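
As a rough illustration of likelihood-based discrimination between similar interactions, the following minimal sketch scores a sequence of quantised observations under two small discrete HMMs and selects the more likely label. It uses a plain single-chain HMM rather than the dynamic or coupled variants discussed in this proposal. The observation symbols, the two-state layout, and every probability below are invented for illustration and are not learned from data.

```python
import math

# Hypothetical observation symbols for a tracked pair of people:
# 0 = approaching, 1 = close and calm, 2 = close with rapid erratic motion.

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    n_states = len(start)
    # Initial step: start probability times emission of the first symbol.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence impossible under this model
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob

# Each model is (start, transition, emission); state 0 = approach, state 1 = contact.
# All numbers are illustrative assumptions.
MODELS = {
    "greeting": (
        [0.9, 0.1],
        [[0.7, 0.3], [0.1, 0.9]],
        [[0.8, 0.15, 0.05], [0.1, 0.8, 0.1]],
    ),
    "fighting": (
        [0.9, 0.1],
        [[0.6, 0.4], [0.05, 0.95]],
        [[0.7, 0.1, 0.2], [0.05, 0.15, 0.8]],
    ),
}

def classify(obs):
    """Return the label of the model that gives obs the highest log-likelihood."""
    return max(MODELS, key=lambda name: log_likelihood(obs, *MODELS[name]))
```

Under these invented parameters, classify([0, 0, 1, 1, 1, 1]) selects "greeting" and classify([0, 0, 2, 2, 2, 2]) selects "fighting". In practice the model parameters would be estimated from labelled tracking data, and the difference between the two log-likelihoods could serve as a confidence measure before alerting an operator.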

Statistical Flow Analysis of Bulk Human Motion

When the number of interacting people increases beyond a threshold level, the performance of tracking individuals will decrease, and thus symbolic interpretation of the behaviour becomes impossible. Therefore, we propose investigating a novel flow-based approach. In this approach, short-term correlation-based tracking can produce flow patterns in the image data. From these patterns, statistical classification techniques can probably be developed that distinguish between normal and abnormal flow patterns. For example, fans leaving a football ground normally follow standard movement patterns, leading to standard flow patterns. If the flow is disrupted, e.g. by a fight, then crowd density may make it impossible to track individuals or to identify the fighters; however, the disruption to the flow caused by obstacles and by other people attempting to avoid the fighters may become detectable.
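
One simple way such a statistical test on flow patterns might look is sketched below. It assumes that an upstream short-term tracker already supplies per-patch displacement vectors (dx, dy). It summarises them as a histogram of motion directions and flags a frame as abnormal when that histogram diverges from a reference histogram of normal flow. The choice of eight direction bins, the Kullback-Leibler divergence as the comparison measure, the threshold of 0.5, and all sample vectors are assumptions made for illustration, not part of the proposed method.

```python
import math

N_BINS = 8  # assumed number of direction bins (45 degrees each)

def direction_histogram(flow_vectors, n_bins=N_BINS):
    """Smoothed, normalised histogram of motion directions for (dx, dy) vectors."""
    counts = [0.0] * n_bins
    for dx, dy in flow_vectors:
        if dx == 0 and dy == 0:
            continue  # stationary patches carry no direction information
        angle = math.atan2(dy, dx) % (2 * math.pi)
        counts[int(angle / (2 * math.pi) * n_bins) % n_bins] += 1
    total = sum(counts)
    # Add-one smoothing keeps every bin non-zero so the divergence stays finite.
    return [(c + 1.0) / (total + n_bins) for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def is_abnormal(flow_vectors, normal_hist, threshold=0.5):
    """Flag the flow as abnormal if it diverges strongly from the normal pattern."""
    return kl_divergence(direction_histogram(flow_vectors), normal_hist) > threshold

# Illustrative data: a crowd streaming steadily to the right ...
normal_hist = direction_histogram([(1.0, 0.2), (1.0, -0.2)] * 20)
calm_flow = [(1.0, 0.1), (0.9, -0.1)] * 10
# ... versus people scattering sideways and backwards around an obstruction.
disrupted_flow = [(1.0, 0.2), (0.2, 1.0), (0.2, -1.0), (-1.0, 0.3)] * 5
```

With this data, is_abnormal(calm_flow, normal_hist) is False and is_abnormal(disrupted_flow, normal_hist) is True. A global direction histogram discards spatial layout, so a fuller treatment would probably compute histograms per image region, or use a probabilistic model over the whole flow field.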