Object-based Visual Attention for Computer Vision

Yaoru Sun and Robert Fisher
School of Informatics, The University of Edinburgh, 5 Forrest Hill, Edinburgh EH1 2QL, UK

Abstract

In this paper, a novel model of object-based visual attention extending Duncan's Integrated Competition Hypothesis [24] is presented. In contrast to the attention mechanisms used in most previous machine vision systems, which drive attention based on the spatial location hypothesis, the mechanisms which direct visual attention in our system are object-driven as well as feature-driven. The competition to gain visual attention occurs not only within an object but also between objects. For this purpose, two new mechanisms in the proposed model are described and analyzed in detail. The first mechanism computes the visual salience of objects and groupings; the second implements the hierarchical selectivity of attentional shifts. The results of the new approach on synthetic and natural images are reported.

Key words: Visual attention, object-based visual attention, integrated competition, grouping salience, hierarchical selectivity.

1 Introduction

It is well known that the primate visual system employs an attention mechanism to limit processing to the important information that is currently relevant to behaviours or visual tasks. It can efficiently deal with the balance between computing resources, time cost and the performance of different visual tasks in a normal, cluttered and dynamic environment [69]. Visual attention selectivity can be either overt, driving and guiding eye movements to pick up useful information over time [32,53,62], or covert, internally shifting the focus of attention from one locus to another without eye movements [45,68, p. 519-570].

1.1 General problems of modelling visual attention

Modelling visual attention is a challenging problem for machine vision. Three closely-related basic questions are immediately identifiable:

(1) How can the visual system know what information is important enough to capture attention? Modern research on visual attention from psychophysical and neurophysiological experiments has found that there exist two ways by which information can be used to direct attention (see [94,96] for reviews). One approach uses bottom-up information including basic features such as colour, orientation, motion and depth, conjunctions of features such as objects in 2D or 3D space, and even learned features. In this case, visually salient features (a feature or stimulus that differs from its immediate surround in some dimensions, where the surround is reasonably homogeneous in those dimensions [20]) are mostly used to attract visual attention. A great number of models make use of "saliency" to direct attention [1,2,9,54,88]. However, saliency cannot always capture attention in a purely bottom-up fashion if attention is focused or directed elsewhere in advance [94,96]. Thus it is necessary to recognize the importance of how attention is also controlled by top-down information relevant to current visual behaviours. The deployment of attention is determined by an interaction between bottom-up input and top-down attentional priming or setting [96].

(2) How does the visual system know when and how to direct attention and choose important information, rather than doing so at random times and by random selection? This is the paradox of intelligent selection of attention in visual systems.
We would like to know whether selection happens earlier or later, to what extent visual processing is serial or parallel, and what interplay exists between these factors. A number of researchers have proposed two-stage models in which the preattentive stage performs independent detection or extraction of primary visual features automatically in parallel (without attention), and the second, attentive stage processes the combination of primitive features by serially shifting the focus of attention to scan subsets of the incoming information available from the previous stage (see [94] for a review). This proposed strategy, however, conflicts with many modern psychophysical experiments confirming that attention can arise from very early visual processing stages (e.g. feature detection) or from relatively late processing stages (e.g. object representation or recognition) in different circumstances, in which parallel and serial processing reciprocally intertwine for efficient performance of visual tasks [44,58,61]. Thus, this problem is far from well understood and requires further investigation.

(3) Where is (are) the next potential target(s) of visual attention shifts? That is, how does attention know where to go and what to do next? There are two traditional assumptions in the literature attempting to account for this. The space-based attention theory holds that attention is allocated to a region of space, with processing of everything within this spatial window of attention, like a spotlight, internal eye, or zoom-lens [29,72,85,86]. Object-based attention theory argues that attention is actually directed to an object or a group of objects, to process any properties of the selected object(s) rather than regions of space [17,19,49,78]. Some recent findings support the view that the two accounts are not mutually exclusive [23,31,44] and may actually share common neural mechanisms in the parietal lobes [33]. Until now, few researchers have proposed attentional models that integrate the space-based and object-based views (but see [60]). As suggested by S. E. Palmer in [68, p. 547-549], both hypotheses may be true, accounting for different processing levels in the visual system respectively, and may need to cooperate and interact at multiple processing levels for coherent behaviour.

These three problems lead to a general question: how does visual attention work to perform efficient selectivity? The dominant theory of visual attention is based on the hypothesis that attention works in space like a "spotlight" or "zoom lens", scanning the scene by shifting attention from one location to the next to limit processing to a variable-sized region of the visual field. There have been a number of attentional models that use this hypothesis. Most of them are derived from Treisman's Feature Integration Theory [85], which consists of separate low-level feature maps that are combined together by a spatial attention window operating on a master map or saliency map. We briefly review the most influential accounts of visual attention in psychophysics and the correspondingly inspired computable models below.

1.2 Psychophysical models of attention

There are two divisions of theories in the vast literature developed to understand visual attention. One is the very influential space-based attention theory; the other is the developing theory of object-based attention.
So far Treisman's model is the most successful model of space-based attention and does provide a general framework for understanding visual attention. Following her theory, a number of computational models of attention in the psychophysics and computer vision fields have been developed. The main difference between them is that they use different methods to construct and combine the low-level feature maps and to model the control mechanisms of attentional movements. There are also many other well-known models of spatial attention, such as the guided search model of Wolfe [93], the spotlight or zoom-lens model of Eriksen et al. [28,29], the saliency map model of Koch and Ullman [54], and the dynamic routing model of Olshausen et al. [67].

The essential bifurcation between object-based attention and space-based attention lies in the question of what the underlying units of attentional selection are. In contrast to the traditional models of space-based attention, object-based attention holds that visual attention can directly select discrete objects rather than only and always selecting continuous spatial areas of the visual field. The research on object-based attention is still quite new, but some fundamental theories have been developed in recent years. In the "Biased Competition Model" of Desimone and Duncan [15] and the "Integrated Competition" hypothesis of Duncan [24], visual attention is taken as an emergent effect of competition between neural representations in multiple systems which work together to serve the same selected object. Other pioneering research can be seen in the work of Humphreys and his colleagues [20,42,43], Grossberg [40], Behrmann [4], and a convergent review [78].

1.3 Computable models of space-based attention

Koch and Itti have built the most sophisticated saliency-based spatial attention model [54,46]. The saliency map is used to encode and combine information about each salient or conspicuous point (or location) in an image or a scene, to evaluate how different a given location is from its surround. A Winner-Take-All (WTA) neural network implements the selection process based on the saliency map to govern the shifts of visual attention. This model performs well on many natural scenes and has received some support from recent electrophysiological evidence [35,76]. Tsotsos et al. [88] presented a selective tuning model of visual attention that uses inhibition of irrelevant connections in a visual pyramid to realize spatial selection, and a top-down WTA operation to perform attentional selection. In the model proposed by Clark et al. [9,10], each task-specific feature detector is associated with a weight signifying the relative importance of that feature to the task, and a WTA operates on the saliency map to drive spatial attention (as well as the triggering of saccades). In [37,75], colour and stereo are used to filter images for attention focus candidates and to perform figure/ground separation. Grossberg proposed a new ART model addressing the attention-preattention (attention-perceptual grouping) interface and the stability-plasticity dilemma [38,39]. He also suggested that both bottom-up and top-down pathways contain adaptive weights that may be modified by experience. This approach has been used in a sequence of models created by Grossberg and his colleagues (see [8,39] for an overview).
In fact, the ART Matching Rules suggested in his model tend to produce later selection of attention; they are partly similar to Duncan's integrated competition hypothesis [23], which is an object-based attention theory, and differ from the above models. Some researchers have exploited neural network approaches to model selective attention. In [2,3], saliency maps derived from the residual error between the actual input and the expected input are used to create task-specific expectations for guiding the focus of attention. Kazanovich and Borisyuk proposed a neural network of phase oscillators, with a central oscillator (CO) as a global source of synchronization and a group of peripheral oscillators (POs), for modelling visual attention [52]. Similar ideas can be found in other work [12,13,56,64,65] and are supported by many biological investigations [56,81,89]. There are also some models of selective attention based on mechanisms of gating or dynamically routing information flow by dynamically modifying the connection strengths of neural networks [38,43,67,73]. In some models, mechanisms for reducing the high computational burden of selective attention have been proposed based on space-variant data structures or multiresolution pyramid representations, and have been embedded within foveation systems for robot vision [7,77,11,30,82,84,92]. It should be noted, however, that these models developed overt attention systems to guide the fixations of saccadic eye movements, and partly or completely ignored covert attention mechanisms. Fisher and Grove [41] have also developed an attention model for a foveated iconic machine vision system based on an interest map. The low-level features are extracted from the currently foveated region, and top-down priming information is derived from previous matching results to compute the salience of the candidate foveation points. A suppression mechanism is then employed to prevent constant re-foveation of the same region.

1.4 Inducements and innovations of the proposed model

The computable models of space-based attention reviewed above, however, have some intrinsic disadvantages. They have concentrated only on mechanisms of visual attention based on selection by spatial location, and thus inherently lack mechanisms accounting for object-based selection (see [25,26] and [68, p. 547-549] for reviews). A normal scene is usually cluttered: objects may overlap or share some common properties, in which case attention may need to work in several discontinuous spatial regions at the same time. Different visual features constituting the same object may come from the same region of space, in which case perhaps no attention shift is required. The structure of one object may be complex and hierarchical, in which case the interaction or cooperation of object-based selection, location-based selection and selection by visual features is required. Object-based attention has advantages that space-based attention does not have:

- more efficient visual search, in speed and accuracy;
- less chance of selecting a meaningless or empty location;
- naturally hierarchical selectivity.

Thus it is important to properly integrate the two accounts of space-based and object-based attention. The above problems led us to propose another machine vision approach to modelling visual (covert) attention. The model described here is an alternative computational model of visual attention which is object-based.
It absorbs several ideas and many findings from the modern literature in psychophysics and computer vision, including recent research on: 1) object-based visual attention, such as Duncan's Integrated Competition theory [22-24] and [15,16]; 2) visual saliency, such as Koch and Itti's model of saliency-based visual attention [54,46]; 3) the bottom-up and top-down interaction of visual attention [94,96]; 4) the integration of object-based and location-based attention [60]; 5) within-object and between-object visual representations [44]; and 6) other investigations [5,39].

One of the novel mechanisms in our model is the grouping-based salience computation for attentional competition between features, objects, and groupings of features and objects, and for competition within objects and groupings of features and objects. The early visual features of the scene (colours, intensity, and orientations) are extracted by multiresolution pyramids. The visual salience of points, objects and regions is calculated for different groupings on the feature pyramids, which builds the basis of the purely bottom-up attentional competition among the various visual inputs. The competition for visual attention is modulated by the interaction between bottom-up visual saliency and the top-down attentional setting, which is decomposed into positive priming, negative priming, free, and occupied cases (introduced later). The main goal of this paper is to present our model for the visual saliency of groupings and the mechanism of covert attentional movements.

Another novel mechanism used in the proposed model is hierarchical selectivity for guiding covert attentional movements, which can be regarded as a kind of multiple selectivity [68, p. 547-554] integrating attentional selection by spatial locations, visual features and their complex conjunctions (e.g. objects or groupings). The competition for attention takes place first at the coarsest level of the multiresolution pyramids and then gradually moves to finer levels, as well as from coarser groupings to finer groupings, within and between groupings and resolutions. The finest grouping is set to a pixel or point in our model. This coarse-to-fine strategy, operating over the multiple architecture of visual resolutions and groupings (objects, features and locations at the relevant resolution), is biologically plausible. At each pyramid level, the winner of selective attention in each competition is generated by a Winner-Take-All (WTA) strategy.

The presented model is the first machine-vision implementation of a hierarchical object-based visual attention system. The paper shows that it produces plausible attention shifts on real imagery, and that its performance on synthetic displays is similar to human psychophysical results. To simplify the research, we have assumed that a perceptual organisation of the image into a hierarchical set of groupings has already been done. (We assume that other research from elsewhere will eventually supply this input; see Section 2.6 for further discussion.) Further, our approach has a mechanism to respond to top-down behavioural inputs, but we have not completely investigated the actual top-down selection process (as this is a complex process involving both visual and non-visual reasoning). Lastly, the presented model considers only covert attention (where the fovea does not move) rather than overt eye movements, which might lead to significant changes in visual salience.
2 Model

2.1 Overview of the model

Our work is concerned with the development of efficient mechanisms of visual attention for a machine vision system. The model developed here shows that object-based and location-based attention can work in a uniform framework, depending on both the current scene and the observer's goals, to deal with complex visual tasks (see [78] for a comprehensive review of object-based visual attention). For this purpose, the model brings together several issues found in the modern literature. The critical aspects of our theory are:

1. Integrated competition for visual attention

Our approach extends Duncan's Integrated Competition hypothesis [16,23,24]. The main adjustment is that we believe his model of object-based attention can be extended to work in both the object-based and space-based fields by replacing "object-centered" with "grouping-centered" (see [5], one of the few psychophysical attentional models integrating object-based with location-based evidence, and [39]). A grouping is a unit involving object(s) and related features and locations (see [18,44,78] for detailed discussion of these issues). In this way, a grouping in our model can be a point (here, a pixel), an object or a feature, a group of objects or features, or a region. At any given moment, enhanced responses to one grouping decrease the responses to other competitors. Once one grouping gains the dominance of selective attention, all other processing relevant to this grouping, and all components belonging to this grouping, share the same dominance. This is why it is termed "integrated competition".

2. Bottom-up and top-down interaction

The nature of attentional competition comes from a dynamic interaction, based on visual saliency [54,46], between bottom-up visual grouping and top-down attentional biasing or setting [94,96]. That is, purely bottom-up or purely top-down information can only partly bias the competition for selection. Salient visual groupings can capture attention quickly and automatically only if the current attention is not deliberately directed to other groupings or properties in advance.

3. Hierarchical selectivity of visual attention

Hierarchical selectivity is proposed to guide attentional movements shifting from one locus of attention to another under multiscale transformation, and builds directly upon the above two issues. It implies that visual attention can directly select a continuous area of space, discrete object(s), feature(s), point(s), or a grouping of them. Space-based and object-based attentional selectivity are either cooperative or independent of each other for efficient selective acts, according to the current visual situation and task. This strategy is especially useful for machine vision: for example, space-based selection can be applied to region segmentation, whereas object-based selection can be used for object recognition or fine analysis.

Keeping the above issues in mind, our model of visual (covert) attention is depicted schematically in Figure 1. The model first extracts primary features (colours, intensity, and orientations) from one fixated image sampled from a given scene, by multiscale pyramid filters. After perceptual grouping preprocessing, the bottom-up saliency mappings of various groupings are created via the grouping-based salience computation.
These saliency mappings are dynamically varied according to the competition conditions among the groupings at different resolutions and their surroundings during the attentional movements. The results produced from this stage are fed to the attention competition pool, where all the coarsest groupings compete against each other to preferentially obtain selective attention. The competition procedure is a dynamic interaction between bottom-up salience and the top-down attentional setting. The rules of winner-take-all and inhibition of return are applied here to ensure that the winner benefits and to prevent attention from returning to previously attended groupings. The attentional movements among the winning competitors are guided by hierarchical selectivity. The detailed description of each module in the model is given in the following sections.

[Fig. 1. The schematic description of the model: world image → eye/(one) fixation image → primary feature extraction (colour pyramids, intensity pyramids, orientation pyramids) → perceptual grouping → grouping saliency mapping → competition pool of attention, modulated by the top-down attentional setting → locus of attention]

2.2 Eye/fixation image

Our model is built for covert visual attention rather than overt eye movements such as the gaze control of active vision research. At any moment, a fixed image, which is a transformation of the world image into a retinal image at each fixation point, is obtained by simulating the functional decrease of resolution from the fovea to the periphery of the retina. The following modules involved in visual attention operate on this one given fixed image, the sample taken by the gaze at that moment. Future research will consider overt eye movements such as saccades, and how saliency is integrated over multiple fixations.

2.3 Primary feature extraction

The colour image input is decomposed into sets of multiscale feature maps via overcomplete steerable pyramid filters [36], to generate four colour, one intensity, and four orientation pyramids [46]. Suppose that $F$ is the input image, with $r$, $g$, $b$ being the red, green, and blue colour components of $F$. An intensity image $I(p_{ij})$ is created by:

$$I(p_{ij}) = [r(p_{ij}) + g(p_{ij}) + b(p_{ij})]/3 \qquad (1)$$

where $p_{ij}$ is a point of $F$, $i \in [1 \ldots n]$, $j \in [1 \ldots m]$, and $n \times m$ is the size of the image. Then, four colour channels $R$ (red), $G$ (green), $B$ (blue), and $Y$ (yellow) are obtained as [46] (negative values are set to zero):

$$R(p_{ij}) = r(p_{ij}) - [g(p_{ij}) + b(p_{ij})]/2$$
$$G(p_{ij}) = g(p_{ij}) - [r(p_{ij}) + b(p_{ij})]/2$$
$$B(p_{ij}) = b(p_{ij}) - [r(p_{ij}) + g(p_{ij})]/2$$
$$Y(p_{ij}) = [r(p_{ij}) + g(p_{ij})]/2 - |r(p_{ij}) - g(p_{ij})|/2 - b(p_{ij}) \qquad (2)$$

Let $W_{lpf}$ and $W_{bpf}(\lambda; \theta)$ be Gaussian and oriented Gabor steerable filters respectively. With these filters acting on the five $I$, $R$, $G$, $B$, and $Y$ channels (see [36,46] for more details), we can construct the intensity, colour (red, green, blue, and yellow) and orientation pyramids:

$$I_{\lambda+1} = W_{lpf} \cdot I_{\lambda} \cdot W_{lpf}^{T}; \quad I_0 = I \qquad (3)$$

$$R_{\lambda+1} = W_{lpf} \cdot R_{\lambda} \cdot W_{lpf}^{T}; \quad R_0 = R$$
$$G_{\lambda+1} = W_{lpf} \cdot G_{\lambda} \cdot W_{lpf}^{T}; \quad G_0 = G$$
$$B_{\lambda+1} = W_{lpf} \cdot B_{\lambda} \cdot W_{lpf}^{T}; \quad B_0 = B$$
$$Y_{\lambda+1} = W_{lpf} \cdot Y_{\lambda} \cdot W_{lpf}^{T}; \quad Y_0 = Y \qquad (4)$$

$$O_{\lambda}(\theta) = W_{bpf}(\lambda; \theta) \cdot I \qquad (5)$$

where $\lambda \in [1 \ldots l]$ is the pyramid scale and $\theta \in \{0°, 45°, 90°, 135°\}$ or $\{0°, 22.5°, 45°, 67.5°, 90°, 112.5°, 135°, 157.5°\}$ is the preferred orientation (in this paper we used both orientation sets for different experimental environments, but the first is the general one). The Anderson kernel used for $W_{lpf}$ is $(1/16, 1/4, 3/8, 1/4, 1/16)$.
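As an illustration of equations (1)-(4), the following sketch (in Python, assuming NumPy and SciPy; all function names are our own) computes the opponent channels and a low-pass pyramid with the separable Anderson kernel. It is a minimal sketch of the channel and pyramid construction, not the full steerable-filter implementation of [36]:

```python
import numpy as np
from scipy.ndimage import convolve1d

ANDERSON = np.array([1/16, 1/4, 3/8, 1/4, 1/16])

def opponent_channels(img):
    """Split an RGB image (H x W x 3, float in [0, 255]) into the
    I, R, G, B, Y maps of Eqs. (1)-(2)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    I = (r + g + b) / 3.0
    R = np.maximum(r - (g + b) / 2.0, 0)   # negative values set to zero
    G = np.maximum(g - (r + b) / 2.0, 0)
    B = np.maximum(b - (r + g) / 2.0, 0)
    Y = np.maximum((r + g) / 2.0 - np.abs(r - g) / 2.0 - b, 0)
    return I, R, G, B, Y

def gaussian_pyramid(channel, levels):
    """Low-pass filter with the separable Anderson kernel and downsample
    by 2, per level (the role of W_lpf in Eqs. (3)-(4))."""
    pyr = [channel]
    for _ in range(levels):
        smoothed = convolve1d(pyr[-1], ANDERSON, axis=0, mode='reflect')
        smoothed = convolve1d(smoothed, ANDERSON, axis=1, mode='reflect')
        pyr.append(smoothed[::2, ::2])
    return pyr
```

The orientation pyramids of equation (5) would additionally require the oriented Gabor steerable filters $W_{bpf}(\lambda; \theta)$ of [36], which are not reproduced here.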
The Gabor filter comes from modulating the related Laplacian pyramids with a set of oriented sine waves, followed by a low-pass operation, and finally taking the modulus (see [36] for these two filters in detail).

2.4 Grouping-based saliency mapping

Salience evaluation based on groupings is the bridge used in this paper to achieve object-based attention and to integrate space-based attention. In our approach, groupings are the primary perceptual units upon which attentional processes operate. The term "grouping" (or "segmentation") is a common concept in the long research history of perceptual grouping by the Gestaltists (see [68, p. 257-266] for a review). We evaluate salience based on groupings here because "grouping" itself already embeds both "object" and "space". This usage constitutes a fundamental difference from most of the previous computable models of space-based attention. A grouping is a hierarchical structure of objects and space; in this sense, a grouping may be a point, an object, a region, or a hierarchical structure of groupings. However, we are not implementing the grouping process in this paper but assume in the work below that it exists. That is, we assume that a given scene at each scale has already been segmented into groupings according to the Gestalt principles (or other grouping approaches). Some further discussions on grouping are given in Section 2.6. The theory proposed here for salience computation is independent of the approach used for perceptual grouping.

[Fig. 2. Diagram of grouping salience: grouping salience is composed of spatial salience, object(s) salience, and feature(s) salience]

[Fig. 3. An example of grouping salience: (a) a grouping competes with its surroundings while its sub-groupings cooperate and compete with each other; (b) and (c) two simple displays discussed in the text]

The salience of a grouping is a function of all the saliency contributions coming from the components within the grouping, which work together to compete with their common competitors while also competing with each other. This notion covers two issues. One is the relationship of spatial location, objects, and features to the grouping they belong to, as shown in Figure 2: grouping salience is computed from its components of spatial location, feature(s), and/or object(s). The other is the competition between a grouping and its surroundings through the cooperation and competition of its components. The effect of a competition between two competitors may either enhance or suppress their salience, according to their contrast properties (Figure 3 (a)). Two simple examples are given in Figure 3 (b) and (c). Suppose the red circle (grouping A) is the target and we want to calculate its salience. Its surroundings consist of four groupings (B, C, D, E) and other background points (green pixels). In Figure 3 (b), all red pixels within grouping A work together, enhancing the grouping salience by feature contrast against the surroundings. Along with this global competition, local competitions among the pixels within grouping A also produce a negative effect on the grouping salience, because these pixels share the same features. In Figure 3 (c), the green star sub-grouping in grouping A brings a suppressive effect on the total salience of grouping A when it competes with the green pixels in the background, but an enhancing effect when it competes with the non-green groupings and pixels within A and elsewhere.
The final salience of grouping A depends on the competitive effects brought by all of the components within A (including the red pixels, the white star and the green star). Based on the above considerations, the contrast between any two points is the primitive operation in the computation of grouping salience. However, we are not claiming that the salience computation theory introduced below is complete. The paper is in fact concerned with salience deriving from the colour, intensity, orientation, and distance factors only. Many other factors affecting salience are not included here, such as motion, shape, size, depth and the like (see [94] for the related issues in visual search). One unconsidered factor, the relative size difference between groupings, will be discussed later (see Section 2.4.3).

The salience of a grouping is calculated by combining the colour, intensity, and orientation salience of the components of the grouping. Due to the close relationship between the chromatic opponent-colour channels and the achromatic (white-black) channel in visual perception and contrast processing [90,91], we calculate colour and intensity salience together. Suppose $\mathcal{G}$ is any given grouping at the current resolution scale $\lambda$ at time $t$, and $\Theta$ is the surroundings of $\mathcal{G}$. If $\forall x_i \in \mathcal{G}$, $\forall x_j \in (\mathcal{G} \cup \Theta)$ and $x_i \neq x_j$, we calculate the colour-intensity salience $S_{CI}$ and orientation salience $S_O$ of $x_i$ by:

$$S_{CI}(x_i; \lambda; t) = f_{CI}(x_i; \{x_j\}; \lambda; t), \qquad S_O(x_i; \lambda; t) = f_O(x_i; \{x_j\}; \lambda; t) \qquad (6)$$

where $f_{CI}$ and $f_O$ are the functions calculating the colour-intensity and orientation salience between $x_i$ and $x_j$ respectively. The salience $S$ of grouping $\mathcal{G}$ is given as:

$$S(x_i; \lambda; t) = \Gamma[S_{CI}(x_i; \lambda; t); S_O(x_i; \lambda; t)], \qquad S(\mathcal{G}; \lambda; t) = \Psi[S(x_i; \lambda; t)] \qquad (7)$$

where $\Gamma$ and $\Psi$ are normalization and integration functions respectively. These functions are defined in detail below.

This computational model of saliency is built upon the principles of localization, relativity and the dynamics of visual input in the scene as covert attention occurs. As pointed out in [35], most (stable) objects in a normal environment are not intrinsically salient, but can become salient if they are behaviourally significant. The normal scene has a hierarchical structure, so features may not always have the same salience when viewed in extended regions or larger contexts. In other words, the salience differences among objects or features may change over time, or as the background or context of the scene changes. The saliency computation is a complex and difficult problem, and until now few research studies in the field of attention in machine vision have dealt with it (however, see [46,47,79] for some discussion related to the spatial saliency map). From our point of view, visual saliency arises from the competition between different groupings and between a grouping and its surroundings.

For simplicity in the formulas, all computations below are defined for a given current time and resolution scale. The salience computation at other times and spatial scales is similar, because the salience of a grouping is decided only by the current constitution of the grouping and its surroundings. Thus the change of a grouping's salience over time (salience dynamics) depends upon the variation of the grouping's constitution and surroundings over time. That is, the same computation rules are used for any time and scale once the segmentation of groupings at that time and scale is given. The full details of the computable approach are given below.
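Before turning to the concrete formulae, it may help to fix a data structure for groupings. The sketch below is our own illustration (hypothetical names, not the authors' implementation): a grouping is a recursive structure over pixel coordinates and sub-groupings, carrying the salience $S(\mathcal{G})$ of equation (7) and an "attended" flag used later for inhibition of return (Section 2.5):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Grouping:
    """A hierarchical perceptual unit: a point, feature, object, region,
    or a structured group of sub-groupings (Section 2.4)."""
    pixels: List[Tuple[int, int]]                    # (i, j) coordinates at scale lambda
    children: List["Grouping"] = field(default_factory=list)
    salience: float = 0.0                            # S(G; lambda; t) of Eq. (7)
    attended: bool = False                           # inhibition-of-return marker
```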
2.4.1 Colour and intensity salience

Assume $x$ and $y$ are two arbitrary pixels in a grouping on level $\lambda$ of the pyramids of colours, intensity, and orientations. The properties of $x$ and $y$ can then be denoted by a tensor composed of a 4-dimensional colour vector, a 1-dimensional achromatic intensity vector, and a 4-dimensional orientation vector; for example, pixel $x = (\{R_{x,\lambda,\mathcal{G}}, G_{x,\lambda,\mathcal{G}}, B_{x,\lambda,\mathcal{G}}, Y_{x,\lambda,\mathcal{G}}\}, \{I_{x,\lambda,\mathcal{G}}\}, \{O_{x,\lambda,\mathcal{G}}(\theta_1), O_{x,\lambda,\mathcal{G}}(\theta_2), O_{x,\lambda,\mathcal{G}}(\theta_3), O_{x,\lambda,\mathcal{G}}(\theta_4)\})$. In the following we suppose all calculations are within a given grouping on a given pyramid level, so the subscripts $\mathcal{G}$ and $\lambda$ will generally be omitted.

We first compute the property contrast between pixels $x$ and $y$. Let $RG$ and $BY$ be the two colour "double-opponent channels" of red-green/green-red and blue-yellow/yellow-blue [27,48], so that:

$$RG(x, y) = |(R_x - G_x) - (R_y - G_y)|/2, \qquad BY(x, y) = |(B_x - Y_x) - (B_y - Y_y)|/2 \qquad (8)$$

The colour chromatic contrast $\Delta C$ between $x$ and $y$ is calculated as:

$$\Delta C(x, y) = \sqrt{\eta_{RG}^2\, RG^2(x, y) + \eta_{BY}^2\, BY^2(x, y)} \qquad (9)$$

where $\eta_{RG}$ and $\eta_{BY}$ are weighting parameters. In this paper, we set them as:

$$\eta_{RG} = \frac{R_x + R_y + G_x + G_y}{R_x + R_y + G_x + G_y + B_x + B_y + Y_x + Y_y}, \qquad \eta_{BY} = \frac{\sqrt{B_x^2 + B_y^2 + Y_x^2 + Y_y^2}}{3 \times 255} \qquad (10)$$

where the 255 parameter is used because the representations of colour and intensity in this paper have the maximum value 255. The weights $\eta_{RG}$ and $\eta_{BY}$ can be optimized further according to more colour discrimination experiments or references in the colour research literature. The results produced by setting $\eta_{RG}$ and $\eta_{BY}$ as in formulae (10) are very close to L*u*v* (see [63,74] for related issues). We obtain equal maximal contrasts between opponent colours such as red and green, blue and yellow, or white and black. The contrasts between other colours are also reasonable; for example, it is acceptable that the colour contrast between yellow and black is greater than that between yellow and white, etc. (see [50,63,95] for more discussion). All values of the colour-intensity contrasts between $x$ and $y$ fall into the range $[0 \ldots 255]$.

The intensity contrast between the two pixels $x$ and $y$ is:

$$\Delta I(x, y) = |I(x) - I(y)| \qquad (11)$$

So the formula for calculating the colour-intensity salience $S_{CI}(x, y)$ between $x$ and $y$ is:

$$S_{CI}(x, y) = \sqrt{\alpha\, \Delta C(x, y)^2 + \beta\, \Delta I(x, y)^2} \qquad (12)$$

where $\alpha$ and $\beta$ are weighting coefficients, here both set to 1. Let $d_{gauss}$ be the Gaussian distance function between $x$ and $y$, defined as:

$$d_{gauss}(x, y) = \left(1 - \frac{\|x - y\|}{\hat{n} - 1}\right) e^{-\frac{\|x - y\|^2}{2\sigma^2}} \qquad (13)$$

with scale $\sigma$ and distance $\|x - y\|$. In the experiments in this paper, the Gaussian scale $\sigma$ is set to $\hat{n}/\rho$, where $\hat{n}$ is the maximum of the width and length of the feature maps on the current pyramid level $\lambda$, and $\rho$ is a positive integer; generally $1/\rho$ may be set to a percentage such as 2%, 4%, 5%, or 20%, 25%, 50%, etc. The greater $\rho$ is, the smaller the radius of the neighbourhood around its center. In this way, the Gaussian distance guarantees competition throughout the attention window, but with a strength that varies with distance: strong local competition between short-range neighbours and weak competition between long-range neighbours. Similar effects of attention competition have been found in visual cortex [16], and research on cortico-cortical connections shows that inhibition from a surround with the same stimulus properties as the center is strongest [80].
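A sketch of the pairwise computations (8)-(13) follows (Python, assuming NumPy; the form of the weights in formula (10) follows our reading of the text above, and the chessboard distance used inside `d_gauss` is the one introduced just below):

```python
import numpy as np

def colour_intensity_contrast(px, py, alpha=1.0, beta=1.0):
    """Pairwise colour-intensity salience S_CI(x, y) of Eqs. (8)-(12).
    px, py are dicts with 'R', 'G', 'B', 'Y', 'I' values in [0, 255]."""
    rg = abs((px['R'] - px['G']) - (py['R'] - py['G'])) / 2.0   # Eq. (8)
    by = abs((px['B'] - px['Y']) - (py['B'] - py['Y'])) / 2.0
    denom = sum(px[c] + py[c] for c in 'RGBY')
    eta_rg = (px['R'] + py['R'] + px['G'] + py['G']) / denom if denom else 0.0
    eta_by = np.sqrt(px['B']**2 + py['B']**2 + px['Y']**2 + py['Y']**2) / (3 * 255)
    dC = np.sqrt((eta_rg * rg)**2 + (eta_by * by)**2)           # Eq. (9)
    dI = abs(px['I'] - py['I'])                                  # Eq. (11)
    return np.sqrt(alpha * dC**2 + beta * dI**2)                 # Eq. (12)

def d_gauss(x, y, n_hat, rho=25):
    """Distance weighting of Eq. (13), using the chessboard distance."""
    d = max(abs(x[0] - y[0]), abs(x[1] - y[1]))                  # ||x - y||
    sigma = n_hat / rho
    return (1.0 - d / (n_hat - 1)) * np.exp(-d**2 / (2 * sigma**2))
```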
The distance $\|x - y\|$ can be the Euclidean distance, but we prefer the chessboard distance $\|x - y\| = MAX(|i - h|, |j - k|)$, where $(i, j)$ and $(h, k)$ are the coordinates of $x$ and $y$ on the current pyramid level and $MAX$ denotes the maximizing operator. The reason for selecting the chessboard distance is that, with this operator, the neighbours within the same 8-adjacency neighbourhood have equal distance effects on their common center, and the "center-surround" function can be easily simulated. Let $NH_{CI}$ be the neighbourhood surrounding $x$, and $y_i \subset NH_{CI}$ ($i = 1 \ldots n \times m - 1$) be a neighbour. We use the following formula to calculate the colour-intensity salience of $x$:

$$S_{CI}(x) = \frac{\sum_{i=1}^{n \times m - 1} S_{CI}(x, y_i) \cdot d_{gauss}(x, y_i)}{\sum_{i=1}^{n \times m - 1} d_{gauss}(x, y_i)} \qquad (14)$$

2.4.2 Orientation salience

We define $\overline{\theta}_{x,y}$ as the orientation difference between pixels $x$ and $y$. Let $u_x(\theta)$ and $u_y(\phi)$ be the orientation vectors of $x$ and $y$ in the current orientation pyramid respectively. Note that $u$, $\theta$, and $\phi$ themselves all consist of multiple components; for example, $u_x(\theta) = [u_x(0), u_x(\frac{\pi}{4}), u_x(\frac{\pi}{2}), u_x(\frac{3\pi}{4})]$ if we have four preferred orientations. We define the orientation contrast $C_O(x, y)$ of $x$ to $y$ as:

$$C_O(x, y) = d_{gauss}(x, y) \sin(\overline{\theta}_{x,y}) \qquad (15)$$

where $d_{gauss}$ has already been defined in equation (13). A major reason for selecting a sinusoid function for orientation contrast is that this function is nonlinear and monotonically increasing from 0 to 1 over the range $[0, \frac{\pi}{2}]$, and symmetric in $[0, \pi]$. Nothdurft has suggested that the salience of pop-out targets has a nonlinear (enhanced) character, with threshold and saturation effects, as orientation contrast increases from 0 to $\frac{\pi}{2}$ [66]. If $u_x$ and $u_y$ have orientation strengths at all orientations, then the general calculation for $\overline{\theta}_{x,y}$ is:

$$\overline{\theta}_{x,y} = \frac{\int_0^{\pi} \phi \int_0^{\pi} u_x(\theta)\, u_y((\theta + \phi) \bmod \pi)\, d\theta\, d\phi}{\int_0^{\pi} \int_0^{\pi} u_x(\theta)\, u_y((\theta + \phi) \bmod \pi)\, d\theta\, d\phi} \qquad (16)$$

For practical computation in this paper, we give the following discrete form for $\overline{\theta}_{x,y}$:

$$\overline{\theta}_{x,y} = \frac{\sum_{j=0}^{\zeta-1} j\varphi \sum_{i=0}^{\zeta-1} u_x(i\varphi)\, u_y((i\varphi + j\varphi) \bmod \pi)}{\sum_{j=0}^{\zeta-1} \sum_{i=0}^{\zeta-1} u_x(i\varphi)\, u_y((i\varphi + j\varphi) \bmod \pi)} \qquad (17)$$

where $\bmod$ is the standard modulus operator, $\zeta$ is the number of orientation pyramids or preferred orientations, and $\varphi = \pi/\zeta$. When $\zeta$ is 4 or 8, $\varphi$ is $\pi/4$ or $\pi/8$.

The salience computation for orientation is more complicated than for colour-intensity. It is most important to take into account the homogeneity/heterogeneity of the neighbourhood of each point currently taken as a center for the center-surround calculation. Psychophysical findings show that "pop-out" is closely related to the distribution of orientations in the local neighbourhood [57,69,85,87,94]. Aiming at a practical computation of orientation salience, further considerations of "center-surround" operations are provided as follows. Let $y_i$ ($i = 1 \ldots n_k$, where $n_k$ is the number of neighbours in the $k$-th neighbourhood) be a neighbour in the distance-$k$ (or $k$-th) neighbourhood $NH_O(k)$ surrounding $x$. The distance-1 (first) neighbourhood of $x$ consists of the 8 closest neighbours surrounding $x$, and the distance-$k$ neighbourhood has $8k$ neighbours. A boundary check must be applied to ensure all data comes from within the current image layer. Then the average orientation contrast of $x$ to its $k$-th neighbourhood is:

$$\overline{C}_O(x, NH_O(k)) = \frac{1}{n_k} \sum_{y_i \in NH_O(k)} C_O(x, y_i) \qquad (18)$$
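The discrete orientation difference (17) and the pairwise contrast (15) can be sketched as follows (our own illustration), assuming $u_x$ is stored as a length-$\zeta$ vector so that the $(i\varphi + j\varphi) \bmod \pi$ term reduces to index arithmetic:

```python
import numpy as np

def mean_orientation_difference(ux, uy):
    """Discrete orientation difference theta_bar(x, y) of Eq. (17).
    ux, uy are length-zeta vectors of orientation strengths."""
    zeta = len(ux)
    phi = np.pi / zeta
    num = den = 0.0
    for j in range(zeta):
        for i in range(zeta):
            w = ux[i] * uy[(i + j) % zeta]   # (i*phi + j*phi) mod pi wraps the index
            num += j * phi * w
            den += w
    return num / den if den > 0 else 0.0

def orientation_contrast(ux, uy, dg):
    """Pairwise orientation contrast C_O(x, y) of Eq. (15);
    dg is the d_gauss(x, y) weight of Eq. (13)."""
    return dg * np.sin(mean_orientation_difference(ux, uy))
```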
Suppose $n_0$ is the number of different directions within $NH_O(k)$; then we set $\omega_k = n_0 - 1$. This is used for checking and evaluating how heterogeneous the orientations are in the neighbourhood of $x$. $n_0$ can be obtained by a simple method: set $n_0 = 0$; then increment $n_0$ whenever the orientation on which $y_i$ has its maximum value over all orientation maps (i.e. the maximum sub-orientation vector of $y_i$) differs from the maximum sub-orientation vector of $y_{i+1}$. We use the same set of histograms to evaluate the orientation homogeneity of the whole surround of $x$. Let $w_{ijk}$ be $y_i$'s value on the orientation ($\theta_j$) feature maps on the $k$-th neighbour "ring", and $n_r$ be the number of "rings" in the whole neighbourhood of $x$; the method for calculating the homogeneity weight $\omega$ for the whole surround is given in formulae (20). Under these considerations, the orientation contrast of $x$ to its $k$-th neighbourhood is:

$$\hat{C}_O(x, NH_O(k)) = \frac{\overline{C}_O(x, NH_O(k))}{\xi + \omega_k} \qquad (19)$$

where $\xi$ is a parameter used to prevent a zero denominator, usually set to 1. Let $m_r$ be the number of "rings" in a neighbourhood, and $d_{gauss}(k)$ (defined in equation (13)) be the Gaussian distance of the $k$-th neighbourhood to $x$. Because of the chessboard distance, $d_{gauss}(k)$ is the same for each point within $x$'s $k$-th neighbourhood. Finally, the orientation salience of $x$ to all of its neighbours is:

$$S_O(x) = \frac{\sum_k \hat{C}_O(x, NH_O(k)) \cdot d_{gauss}(k)}{(\xi + \omega) \cdot m_r \cdot \sum_k d_{gauss}(k)}$$

where $m_r = \sum_k 1$ over the rings $k$ with $|\hat{C}_O(x, NH_O(k))| > 0$, and $\omega$ is given by:

$$\omega = \sum_j H(\theta_j); \quad H(\theta_j) = \frac{1}{n_r} \sum_k \frac{\left|H_k(\theta_j) - \overline{H}(\theta_j)\right|}{MAX\left(H_k(\theta_j), \overline{H}(\theta_j)\right)}; \quad H_k(\theta_j) = \sum_{y_i \in NH_O(k)} w_{ijk}(\theta_j, y_i); \quad \theta_j \in [\theta_1 \ldots \theta_\zeta]; \quad \overline{H}(\theta) = \frac{1}{n_r} \sum_k H_k(\theta) \qquad (20)$$

2.4.3 The salience of a grouping

Suppose $x_i$ is an arbitrary component within a grouping $\mathcal{G}$; $x_i$ may be either a point or a sub-grouping within $\mathcal{G}$. Then the visual salience $S$ of the grouping $\mathcal{G}$ is obtained from the following formula:

$$S(\mathcal{G}) = \gamma_{CI} \sum_i S_{CI}(x_i) + \gamma_O \sum_i S_O(x_i) \qquad (21)$$

where $\gamma_{CI}$ and $\gamma_O$ are the weighting coefficients of the colour-intensity and orientation salience contributions to the grouping salience. $\sum_i S_O(x_i)$ is computed from the primary oriented components of the grouping $\mathcal{G}$, not from the shape of $\mathcal{G}$ itself. The shape distribution or boundary of a grouping may be arbitrary and may conflict with the orientations of the components in the grouping, which causes some uncertainty about how to evaluate the direction of a grouping. Here we employ a simple statistical method to deal with this problem (see [14] for other, more complex statistical methods in this field). Suppose that $x_{i_0}, \ldots, x_{i_j}, \ldots, x_{i_{n_0}} \in \mathcal{G}$ are components of a given grouping $\mathcal{G}$ with orientation components $\theta_0, \ldots, \theta_j, \ldots, \theta_{n_0}$ respectively. $C_O(x_{i_j}; \theta_j)$ is the orientation salience of $x_{i_j}$ with orientation $\theta_j$, and $\hat{O}$ denotes the primary orientation, i.e. the orientation map on which the grouping has the maximum summed value at the current layer of the orientation pyramids. A simple method to compute $\hat{O}$ is: calculate the value sum on each $\theta_j$ orientation map over all components within $\mathcal{G}$ to obtain a distribution histogram of the different oriented vectors (as the horizontal ordinates); then take the orientation which has the maximum value in the histogram. The formula for calculating $\sum_i S_O(x_i)$ is then:

$$\sum_i S_O(x_i) = \sum_i C_O(x_i) \quad \text{when } \theta_j = \hat{O} \qquad (22)$$

The above formulae for the salience computation of a grouping are a practical implementation of the theory discussed in equation (7).
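The following sketch illustrates equations (21) and (22): the primary orientation $\hat{O}$ is taken from a histogram of summed orientation-map responses over the grouping, and the grouping salience combines the colour-intensity and orientation terms. The map layout (one 2D array per orientation) is our own assumption:

```python
import numpy as np

def primary_orientation(c_o_maps, pixels):
    """Primary orientation O_hat of a grouping: the orientation map with
    the largest summed response over the grouping's components."""
    sums = [sum(m[i, j] for (i, j) in pixels) for m in c_o_maps]
    return int(np.argmax(sums))                       # index j of theta_j

def grouping_salience(pixels, s_ci, c_o_maps, gamma_ci=1.0, gamma_o=1.0):
    """Grouping salience of Eqs. (21)-(22): per-pixel colour-intensity
    salience (Eq. 14) summed over components, plus orientation contrast
    taken on the primary orientation only (Eq. 22)."""
    j_hat = primary_orientation(c_o_maps, pixels)
    s = gamma_ci * sum(s_ci[i, j] for (i, j) in pixels)
    s += gamma_o * sum(c_o_maps[j_hat][i, j] for (i, j) in pixels)
    return s
```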
As mentioned before, some other factors influencing salience are not considered at the moment, for example the relative size factor between a grouping and its surrounding groupings. When the size of a target differs from the surrounding distractors but the target shares all other properties with them, the target will "pop out". The current computation method is inapplicable in this special case. This factor looks very simple and seems easy to implement, but in practice it is not: there are many associated problems, and some are difficult to resolve. One problem is how to evaluate the homogeneity of the target's surround, especially with respect to surrounding objects or regions. The homogeneity of a surround is affected by many factors such as shape, orientation, or colour. The shape of an object or a region may be arbitrary, so "pop-out" by the relative size factor would depend on the shape factor as well, even setting aside other questions such as how to quantify the relationship between salience and relative size. Another problem is how to evaluate the degree of homogeneity and heterogeneity of the surround of a grouping, especially when orientation is considered. The method (formulae (18), (19), and (20)) used in this paper is simple and may work in many homogeneous or heterogeneous environments. For example, one homogeneous surround in which all neighbours have the same orientation should be distinguished from another in which different neighbour rings have (some) different orientations but the neighbours on each ring share the same orientation. But this method is not complete, especially when the surround consists of arbitrary objects. As mentioned above, an object has a shape and the shape may be arbitrary. Even ignoring other factors such as colour, calculating an object's orientation is not easy, and this directly affects the homogeneity of its surround. The difficulty is that there is no reference that can be used to establish an exact ordering of the different homogeneity distributions of orientations. Solutions to the above problems need more evidence from other research fields such as psychophysics and neuroscience.

[Fig. 4. An example of salience varying with the relative size of the center target and the surrounding distractors: four displays (a)-(d)]

Figure 4 shows an example of the relative size factor. In Figures 4 (a) and (b), the red target "pops out" in (b) when it becomes smaller. But in Figures 4 (c) and (d), which green target is more salient? Although (c) and (d) are the same as (a) and (b) except for the target's colour, it may be that the target in (c) is more salient than the target in (d).

2.5 Competition pool of attention

In this module, different groupings are dynamically formed on different layers of the pyramids and compete for attentional selection from the coarsest level to the finest level, through visual saliency interacting with top-down attentional biasing. The output is the dominant signal of the competitive winner(s), which is used to control the preferential processing or selectivities of visual attention. According to [15,16,24], the competition for visual attention can occur at multiple processing levels, from low-level feature detection and representation to high-level object recognition, in multiple neural systems. Also, "attention is an emergent property of many neural mechanisms working to resolve competition for visual processing and control of behaviour" [15].
The above studies provide direct support for integrated competition for visual attention, binding object-selection, feature-selection and space-selection. The grouping-based saliency computation and hierarchical selectivity process proposed here are therefore a possible approach to achieving this purpose. Hierarchical selectivity operates on the bottom-up visual salience of the various groupings on each pyramid layer in the space-time context, together with the top-down attentional setting. The outline of the top-down attentional setting logic is shown in Figure 5. It is implemented as a control set of four attentional states for the current bottom-up visual input at any competitive moment:

(1) Positive priming, by which consistent bottom-up input gains a competitive advantage;
(2) Negative priming, which is the contrary of positive priming;
(3) The aimless or free state, in which visual attention presents a neutral state to any visual input, and thus the competition for attention is completely decided by bottom-up visual saliency;
(4) The unavailability state, in which visual attention is occupied at the moment, meaning no visual attention is available.

[Fig. 5. Top-down attention setting: an expectance case (positive priming, negative priming) and a default case (aimless, unavailability)]

As pointed out in [34,51], top-down priming and bottom-up visual saliency both play important biasing roles in attention capture. Top-down biasing signals affect the competition for selective attention by increasing or decreasing the baseline of neural activity. Until sufficient psychophysical findings show how top-down influence directly amplifies or reduces the intrinsic salience of targets, it is feasible to fold the top-down setting into the threshold of attentional competition, as proposed below. If a competitive neural network such as a WTA (winner-take-all) network were employed, a top-down setting could be implemented by installing a dynamic threshold for neuron firing, but the overall computational cost of dynamic attention competition would be expensive: a complex structure with an enormous number of neurons engaged in population competition would be needed. The solution presented here is to implement the attentional setting via a threshold at the decision points of the hierarchical selectivity process. The top-down attention setting plays two roles here: one is top-down biasing of the global and local attentional competition; the other is an intention request of whether to "view details" of a grouping (e.g. its sub-groupings) when attention is deployed at that grouping. However, top-down priming for specific objects or groupings is very complicated, since it requires at least intricate object recognition from higher-level processing. At present, top-down biasing acts only at the level of basic features, which here are the colour, intensity, and orientation feature pyramids. The top-down signals (Table 1) include two flags for colour (which also covers intensity) and orientation top-down biasing, and one flag for "view details".

[Table 1. Top-down attention setting to the basic features: colour flag, colour input, orientation flag, orientation input, "view details" flag]

Each flag encodes the states of its corresponding top-down signal. For the colour and orientation flags, "00" is the default case, in which all groupings compete for visual attention in the purely bottom-up way;
"01" encodes positive priming, in which all groupings with the positively primed feature preferentially compete for attention while the other competitors are suppressed; "10" encodes negative priming, the inverse of positive priming; and "11" is the unavailable state, in which all groupings having these features are prevented from attracting visual attention. For the "view details" flag, "0" signals "continue", i.e. explore the details of a grouping (its sub-groupings, if they exist at the current or a finer resolution), and "1" means "shift" attention from the current winner to the next potential winning grouping. The next winner is generated from the unattended groupings at the same resolution as the current winner if such groupings exist, otherwise from the unattended groupings at the same coarser resolution as the parent grouping of the current winner (see hierarchical selectivity below). This process links the "lineal chain" to the "collateral chain".

Hierarchical selectivity operates on the interaction between grouping salience and the top-down attentional setting at any competitive moment. The competition for visual attention occurs first among the coarsest groupings (those existing at the coarsest resolution) by global competition. Through a WTA (Winner-Take-All) mechanism, visual attention is first deployed to the winning competitor. Then a top-down or goal-driven (request) control decides whether to "continue" viewing the details within the current grouping or to "shift" attention out of this grouping. If attention is switched, the next winning competitor gains visual attention with the aid of an "inhibition of return" mechanism, which prohibits attention from instantly returning to a previously attended winner. The priority order for generating the next potential winner is:

(1) The most salient unattended grouping that is a sibling of the currently attended grouping. This winning grouping has the same parent as the currently attended grouping, and both lie at the same resolution.
(2) The most salient unattended grouping that is a sibling of the parent of the currently attended grouping, if the above winner cannot be obtained.
(3) The backtracking continues if the above is not satisfied.

Temporary inhibition of the attended groupings can be used to implement inhibition of return. More elaborate implementations may introduce dynamic time control over the different winners, so that some previously attended groupings can be visited again; here, however, we are only concerned that each winner is attended once. If the currently attended grouping continues to be examined, the competition for attention, based on bidirectional bottom-up and top-down interaction by local competition, is triggered first among the sub-groupings that exist at the current resolution and then among the sub-groupings that exist at the finer resolution. This means that the sub-groupings at the finer resolution do not gain attention until their siblings at the coarser resolution have been attended. By the force of WTA, the most salient sub-grouping wins visual attention.

[Fig. 6. Diagram of hierarchical selectivity: see text for detailed explanation.]
After attention has been directed to the winning grouping/sub-grouping, the same (top-down) goal-driven method is used to determine whether or not to "continue" looking into the details of this grouping/sub-grouping. If not, another attention "shift" takes place; if the particulars of this grouping/sub-grouping continue to be examined, another local competition is triggered. When "continuing" to check an attended grouping/sub-grouping is requested but no sub-grouping exists at the current or a finer resolution, hierarchical selectivity goes back to the parent of the currently attended grouping, where the same "continue/shift" decision occurs. This recursive "continue/shift" procedure continues until the desired goal is reached or all groupings in the scene have been attended.

As mentioned before, the grouping salience computation is independent of how the groupings in a scene are segmented (Section 2.4). The mechanism of hierarchical selectivity is likewise independent of which segmentation is used, and of whether it operates at multiple resolutions or a single resolution; the choice of segmentation or grouping method is not part of these two mechanisms. Hierarchical selectivity runs on a given segmented scene and is driven by both the top-down attentional setting and the current distribution of the given segmentation and its corresponding salience. Switching attention between groupings/sub-groupings (and between the coarse and fine resolutions, if multiple resolutions are used) is then controlled. A diagram summarizing the recursive procedure of hierarchical selectivity is given in Figure 6; its algorithmic description is given in Table 2.

Table 2. The algorithmic description of hierarchical selectivity:

1. competition begins among the coarsest groupings at the coarsest resolution;
2. if (no unattended grouping exists at the current resolution) goto step 10;
3. unattended groupings at the current resolution are initialised to compete for attention based on their salience and the top-down attentional setting;
4. check the colour-flag and orientation-flag and apply the corresponding top-down processing to modify the active states of the groupings (details are not implemented in this paper);
5. all (modified) groupings compete for attention;
6. attention is directed to the winner (the most salient grouping) by the WTA rule; set "inhibition of return" on the currently attended winner;
7. if (the desired goal is reached) goto step 12;
8. if ("view details" flag = 1) (i.e. do not view details; shift the current attention) { set "inhibition" on all sub-groupings of the currently attended winner; } if (the currently attended winner has unattended siblings at the current resolution) { competition starts between these siblings; goto step 2 and replace the grouping(s) by these siblings; } else goto step 11;
9. if ("view details" flag = 0) (i.e. continue to view the details of the currently attended winner) if (the currently attended winner has no sub-grouping at the current resolution) goto step 10; else { competition starts between the winner's sub-groupings at the current resolution; goto step 2 and replace the grouping(s) by the winner's sub-groupings; }
10. if ((a finer resolution exists) and (unattended groupings/sub-groupings exist at the finer resolution)) { competition starts on the groupings/sub-groupings at the finer resolution; goto step 2; }
11. if (the current resolution is not the coarsest resolution) { go back to the parent of the currently attended winner and goto step 2; }
12. stop.
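The control flow of Table 2 can be sketched at a single resolution as a recursive WTA procedure over the hypothetical `Grouping` structure introduced in Section 2.4. This is a simplified illustration under stated assumptions, not the authors' implementation: the top-down colour/orientation biasing of step 4 is omitted (as in the paper), multiple resolutions are not modelled, and `view_details` stands in for the top-down "view details" flag:

```python
def hierarchical_selectivity(siblings, view_details, trace=None):
    """Single-resolution sketch of Table 2: WTA competition among the
    unattended groupings at one level (steps 2-6), optional descent into
    the winner's sub-groupings (step 9), and implicit backtracking to the
    remaining siblings of the parent via the recursion stack (steps 8, 11)."""
    if trace is None:
        trace = []
    while True:
        pool = [g for g in siblings if not g.attended]
        if not pool:
            return trace                                 # level exhausted: backtrack
        winner = max(pool, key=lambda g: g.salience)     # WTA rule (step 6)
        winner.attended = True                           # inhibition of return
        trace.append(winner)
        if view_details(winner) and winner.children:     # "view details" flag = 0
            hierarchical_selectivity(winner.children, view_details, trace)
        # otherwise "shift": loop on to the next most salient unattended sibling
```

For example, `hierarchical_selectivity(coarsest_groupings, lambda g: True)` would attend every grouping depth-first in order of decreasing salience, which is the exhaustive "all groupings in the scene are attended" case of the algorithm.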
Two goals are achieved by hierarchical selectivity. One is that attention shifts from one grouping to another, and between groupings and sub-groupings, can be easily carried out. The other is that the model can simulate the behaviour of humans observing something from far to near and from coarse to fine, while still operating easily at a single resolution level. In addition, we note that the top-down attentional setting in hierarchical selectivity is not completely implemented in this paper, although a possible approach is given in the algorithm. Except for "colour-flag = 00", "orientation-flag = 00" and the "view details" flag, the other cases will be realized in the future.

Support for this approach to hierarchical selectivity has been found in recent psychophysical research on object-based visual attention. It has been shown that features or parts of a single object or grouping can gain an object-based attentional advantage over features or parts separated across different objects or groupings. Also, visual attention can occur at different levels of a structured hierarchy of objects at multiple spatial scales. At each level, all elements or features coded as properties of the same part, or of the whole object, are facilitated in tandem (see [4] for a review of these viewpoints and detailed findings).

2.6 Perceptual grouping

It has been suggested [4] that grouping processes and perceptual organization play an integral role in object-based attention. Features that are grouped together compete against other feature groupings and obtain faster processing than features that do not belong together. Perceptual grouping is a complex combinatorial problem which involves many influencing factors, including top-down interference in many conditions. These factors work together to affect how groupings are segmented: spatial proximity, similarity, common fate, shared properties, and even experience and learning [68, p. 257-309]. In many cases, the rules for segmentation and the interpretations of groupings are associated with visual tasks and experience. Nevertheless, the study of this field is outside the current scope of our research. The groupings used in this paper are produced by manual preprocessing based on Gestalt principles and heuristic knowledge, to provide the basis for experiments with our attentional model. The grouping principles used are common rules: proximity, closure, continuity, common fate, familiarity, and shared properties. A visual grouping is defined as an effective hierarchical structure formed by all components according to these principles. For example, objects which share a common colour or orientation, and are separated from a surround which does not share this colour or orientation, may be organized as one grouping. Objects belonging to a large group or sharing the same spatial location may be segmented into a multi-level structured grouping. In Figure 14, the white stripes on the road are grouped into three groupings by their familiarity, the four cars are organized as one grouping by their common fate, and the two people are grouped together by their proximity. In fact, the "groupings" we are addressing here are "perceptual units" which serve as the potential units of attention. For object-based attention, these are the "proto-objects" produced by various segmentation processes, rather than the conceptual or recognizable "objects" we commonly experience in the real world.
"Evidence suggests that 'object-based' attention and 'group-based' attention may reflect the operation of the same underlying attentional circuits" [78]. One general criticism of object-based attention concerns whether objects are recognized before or after attentional selection, that is, whether visual segmentation occurs with or without attention; this is again the traditional debate over "early selection" versus "late selection", or the degree of preattentive processing in the visual system. The issues we stress here may help to clear up this misunderstanding; a detailed discussion can be found in [18,78].

3 Results and Discussion

For the evaluation of our object-based attention model, we ran a number of experiments on synthetic and natural images.

3.1 Performance in synthetic images

The goal of the experiments in this section is to compare the performance of our model with human behaviour in visual attention experiments, and the experiments are designed for this purpose. Additional experiments can be found in [83].

3.1.1 Neighbourhood influence on a grouping

Many psychophysical studies of visual attention (especially of object-based attention) have suggested that visual search is greatly affected by the attribute distribution of, and the interaction between, the target and its surroundings (see [4,70,94] for detailed explanations). These effects are clearly observed in experiments testing the similarity or shared feature dimensions between target and non-targets, and the homogeneity or heterogeneity of the non-targets themselves. When the distractors surrounding the target are more homogeneous with each other and share fewer features with the target, search becomes more efficient. Perceptual grouping also plays an important role: distractors are grouped by type, so stronger grouping strength leads to easier pop-out [20,55,59]. We designed three kinds of experiments to test the model's performance. The experiments probe the variation of the target's salience in response to changes in its surround, without top-down attentional priming. It is also not necessary to calculate the target's salience at all resolution levels; for a demonstration it is sufficient to compute the target's salience at the coarsest resolution and to leave the top-down attentional setting in the free state by default. (These defaults apply to all of the following synthetic experiments.)

Experiment 1: The scaling effect of uniform neighbours

The target is located at one place and kept fixed, and we then add more and more homogeneous neighbours, each differing from the target in at least one feature. The goal of this test is to show that as the number of such homogeneous neighbours increases (i.e., as the facilitating strength of the neighbourhood grows), the target's salience increases, so that the target pops out more easily. We produced two series of sub-experiments to examine the model's performance. In experiments A and B, the created images are all 256×256 and the target is always a red bar located at the center of the display. Green horizontal bars are gradually added in the neighbourhood of the target and kept homogeneous. In contrast to experiment A, the target in experiment B is vertical; the target therefore differs from its neighbours by only one feature (colour) in experiment A, and by two features (colour and orientation) in experiment B.

Fig. 7. The performance of the model in experiments varying the scaling effect of uniform neighbours. Left and middle columns: two of the displays used in each sub-experiment. The corresponding results are shown in the graphs in the right column.
The features considered in the computation of the target's salience are colour and orientation. Both the distractors and the background take part in the salience computation of the target; that is, the target's salience is derived from the contrast not only between the target and the distractors but also between the target and the background. Figure 7 shows several of the created images and the resulting target salience in these two experiments.

Discussion: The results from experiments A and B clearly show increasing target salience with an increasing number of homogeneous neighbours (greater neighbourhood strength). This is consistent with the findings of psychophysical experiments. Furthermore, the curve in experiment B rises faster than that in experiment A (note the different scales of the Y-axes in experiments A and B), suggesting that uniform neighbours sharing fewer features with the target make the target more salient and hence attract visual attention more strongly. We also ran a further experiment based on experiment A (not presented) in which we adjusted the target's size: when the target became smaller, its salience decreased. But when the target became smaller and shared the same colour as the distractors, the results became unpredictable because of the relative size factor; as discussed in Section 2.4.3, the model fails in this special case.

Experiment 2: The effect of coherence in the target's neighbourhood

This experiment investigates the salience of a target in an initially homogeneous surround when one attribute (colour or orientation) of more and more neighbours is gradually changed to another value, keeping the neighbours homogeneous. We produced two series of 256×256 test images for two sub-experiments. In the first sub-experiment, more and more of the items surrounding the target change colour to match that of the target; here the target's salience comes from its contrast with all the other circles and the green background. In the second sub-experiment, the neighbouring items become orthogonal to the target one by one; here the target's salience is derived from its contrast with all the other red bars and the black background. To remove the effect of varying distance when a horizontal bar is rotated, the distance factor is designed so that all red bars within the same neighbourhood have the same distance whatever their orientations; that is, when a horizontal red bar is rotated to vertical, its distance remains as before. Several of the images and the results of these experiments are given in Figure 8.

Discussion: The results show that the target's salience becomes weaker as more neighbours share the target's colour in experiment 2-A, but stronger as more neighbours turn orthogonal to the target in experiment 2-B. The reason is that in experiment 2-A the strength of the grouping based on the colour green within the target's homogeneous neighbourhood becomes weaker while the strength of the grouping based on the colour red becomes stronger. In experiment 2-B, although the neighbours form two types of groupings, the continuously growing new grouping does not disturb the neighbourhood homogeneity but enhances the contrast with the target.
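The qualitative trends of experiments 1 and 2 can be mimicked with a toy contrast measure. This is not the grouping salience of Section 2.4, only an assumed stand-in (the formula and the feature encodings below are invented for illustration) showing why more numerous, homogeneous neighbours that differ from the target raise its salience, while neighbours that share its colour lower it.

import numpy as np

def toy_salience(target: np.ndarray, neighbours: np.ndarray) -> float:
    """Toy stand-in for target salience: mean feature contrast to the
    neighbours, weighted by the neighbours' homogeneity and their number."""
    contrast = np.linalg.norm(neighbours - target, axis=1).mean()
    homogeneity = 1.0 / (1.0 + neighbours.std(axis=0).mean())
    return contrast * homogeneity * np.sqrt(len(neighbours))

# Experiment-1 style: more homogeneous green horizontal bars around a red
# vertical bar -> salience grows with the number of neighbours.
red_bar = np.array([1.0, 0.0, 1.0])       # toy (red, green, vertical) features
green_horiz = np.array([0.0, 1.0, 0.0])
for n in (2, 8, 16):
    print(n, round(toy_salience(red_bar, np.tile(green_horiz, (n, 1))), 2))

# Experiment-2-A style: k of 12 neighbours progressively share the target's
# features -> contrast, and hence salience, falls.
for k in (0, 4, 8):
    nb = np.tile(green_horiz, (12, 1)); nb[:k] = red_bar
    print(k, round(toy_salience(red_bar, nb), 2))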
In fact, both experiments probe the same underlying effect and merely reflect different aspects of the target's neighbourhood. The results of experiments 1 and 2, as pointed out in [21,43] and other research on object-based visual attention, show that stronger grouping among the distractors and greater differences between the target and the distractors allow the target to be found more efficiently. In other words, stronger contrast between the target and its neighbourhood makes the target more salient and better able to capture visual attention in the bottom-up competition.

Experiment 3: The effect of target neighbourhood heterogeneity

This experiment examines the performance of the model in heterogeneous circumstances. In theory, the target should be less salient the more disorderly the distribution of its neighbourhood. The method is similar to that of the two previous experiments. The red vertical target was initially located at the center of a homogeneous surround in which bars of the same colour are orthogonal to the target. We then gradually varied the neighbours' orientations to create a series of increasingly heterogeneous displays. One experiment is shown in Figure 9; all displays had 30% random colour noise added. The target's salience is computed from the colour and orientation of the target contrasted with both the distractors and the background. Although we do not give the results of all the experiments, the overall results are similar to those in Figure 9.

Fig. 8. Model performance when varying attributes of the target's neighbours in a homogeneous environment. 2-A: the target is a red circle located at the center of the display and the neighbours change to the colour of the target. 2-B: the target is the vertical red bar located at the center of the display and its neighbours change to the orientation orthogonal to the target. Left column: first test display. Middle column: 8th test display.

Fig. 9. Model performance in an oriented heterogeneous environment. The target is the red vertical bar located at the center of the display. Its neighbours become more and more heterogeneous as their orientations are gradually varied away from the target and from each other. Two of the sequential displays are shown here.

Discussion: The results shown in the bottom diagram of Figure 9 indicate that the target's salience decreases with the growing heterogeneity of its surround; the efficiency of visual search therefore becomes worse and worse. Notice that the downward trend of the salience is much steeper over the first four steps and flattens to a mild decline afterwards. This saturation effect is not surprising but expected: the Gabor filter used here for orientation extraction is sensitive to the four orientations 0°, 45°, 90°, and 135°. Once the number of disorderly orientations exceeds four, the weight ω in equation (19) (see Section 2.4.2) is almost saturated, because ω is limited by the maximum number of distinct orientations (four here); this ω is used to evaluate the orientation disorder within an object's neighbourhood. Another phenomenon observed in the graphs of Figure 9 is that the main contributor to the steady reduction of the target's salience is the orientation disorder factor rather than the colour feature.
The explanation for this last effect is that the distractors always shared the target's colour, and the varying position of each pixel within each distractor grouping produced only a tiny effect on the colour contrast between the target and the distractors, so the overall trend of the target's salience is hardly affected by the colour of the surrounding features. We have also examined the behaviour of the model when varying the target's intensity against a random-noise background, when varying the target's orientation from 0 to 360 degrees [66], and when varying the eccentricity of the target's location [94]. The results of these experiments are compatible with the corresponding findings in the human psychophysics literature.

3.1.2 Grouping effect and related hierarchical selection

Figure 10 shows a display in which the target is the only vertical red bar and no two bars have exactly the same colour. Three bars share the same orientation; all the others have different orientations. If we use no grouping rule, each bar forms a single grouping by itself, giving 36 single groupings. If instead the display is segmented by shared orientation, the only structured grouping is formed by the 3 vertical bars: the red target forms one sub-grouping, and the two vertical green bars form another, two-level sub-grouping. In this way, 34 top groupings (38 groupings in total) are obtained: one structured three-level grouping (containing 4 sub-groupings) and 33 single groupings formed by the other distractors. The resulting salience maps and attention sequences for these two segmentations are given in Figure 10. The background, colours, and orientations are all considered in the salience computation, and the top-down attentional setting is set to the free state, so this is pure bottom-up attentional competition. The results show different orders of attending to the target. The target belonging to a grouping (see Figure 10 (C1)-(C5)) attracts attention much more quickly than the ungrouped target. The competition starts among the different groupings in the display; the structured grouping of 3 vertical bars is the most salient and obtains attention first. Competition then occurs within this grouping, between the target and the other sub-grouping formed by the two vertical bars of different colours. Since top-down attentional priming is not considered, attention is directed to the sub-groupings in order of their salience, and the target is attended after the two-level sub-grouping.

Fig. 10. An example of structured groups and hierarchical selection. In the display the target is the vertical red bar in the third row and second column. B1: salience map (in shades of grey) in the case of no grouping. B2: partial attention sequence of the most salient bars for B1. C1: salience map in the case of grouping. C2, C3, C4: salience maps of the grouped bars. C5: partial attention sequence of the most salient bars for C1. B, C: salience histograms for B1 and C1 respectively. Note that the target is attended after 7 shifts in B2 but only 3 in C5.

This grouping advantage in the competition for attention has been confirmed by psychophysical research on object-based attention [4,78].
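The grouping advantage of Figure 10 can be reproduced in miniature with the hierarchical selectivity sketch from Section 2.5; the salience values below are invented stand-ins for the computed maps in Figure 10.

# Ungrouped display: every bar competes alone and the target is rather weak.
flat = ([Grouping(f"bar {i}", s) for i, s in
         enumerate([0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60])]
        + [Grouping("target", 0.55)])
order_flat: List[str] = []
hierarchical_selectivity(flat, lambda g: True, order_flat)
print(order_flat.index("target"))  # 7: the target is the 8th deployment (cf. B2)

# Grouped display: the 3 vertical bars form the most salient structured
# grouping; the target competes only with one sibling sub-grouping inside it.
grouped = ([Grouping("vertical bars", 0.95,
                     [Grouping("green pair", 0.60), Grouping("target", 0.55)])]
           + [Grouping(f"bar {i}", s) for i, s in enumerate([0.90, 0.85])])
order_grouped: List[str] = []
hierarchical_selectivity(grouped, lambda g: True, order_grouped)
print(order_grouped.index("target"))  # 2: the target is the 3rd deployment (cf. C5)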
3.2 Performance in natural scenes

In the previous section we examined several aspects of our attentional model using artificial images and successfully compared the results with related findings in psychophysical research. To investigate the model in complex natural scenes, we used colour outdoor photographs taken with a digital camera. The implementations of both the "coarse to fine" and the "far to near" simulations of human viewing on this real imagery are described in detail below.

Fig. 11. An outdoor scene photographed from far and from near respectively. The images shown here are of the same scene at different resolutions. The salience maps are also shown; the grey scales indicate the different saliences of the groupings.

3.2.1 Hierarchical Selectivity

As suggested in [78], "there may be a hierarchy of units of attention, ranging from intra-object surfaces and parts to multi-object surfaces and perceptual groups". Hierarchical selectivity is a novel mechanism designed to shift attention from one grouping to another, or from a parent grouping to its sub-groupings, as well as to implement attentional focusing from far to near or from coarse to fine. It works in both multiple- (or variable-) resolution and single-resolution environments, and the resolutions can be scaled either by a pyramid decomposition scheme or by the digital camera. Here an outdoor scene is used to demonstrate the behaviour of hierarchical selectivity. In Figure 11, the same outdoor scene is photographed from far and near distances respectively, giving a coarse (64×64) and a fine (512×512) resolution photograph. The scene contains two groupings: a simple shack on the hill, and, on the lake, a small boat carrying five people and a red box. The people, the red box, and the boat itself constitute seven sub-groupings of this structured grouping. The 1/ρ parameter of the Gaussian distance is set to 25% and the Gabor filter is sensitive to the 4 orientations 0°, 45°, 90°, and 135°. The model works with these two images, using the coarse and fine photographs as different resolution levels; for this purpose, only the feature (colour, intensity, and orientation) maps at the lowest level of the pyramids are created for each image. (Multi-level pyramids used to simulate attending to a complicated natural photograph from far to near and from coarse to fine are also implemented in this paper; see the following sections for details.) Competition for attention starts in the coarse, or far, image (Figure 12). Using hierarchical selectivity, attention is first deployed to the winner (here the shack), suppressing the other competitors. If the "view details" flag is answered "yes", attention then shifts to the fine image to check this winner further; if the answer is "no", the model checks whether any other grouping exists in this image. When attention shifts, "inhibition of return" is set on the attended grouping. Because the shack has no sub-groupings, attention switches back to the coarse image and checks whether there is a next winner; thus the boat grouping obtains attention. In the same way as for the shack, answering "yes" to the "view details" flag shifts attention to the boat's sub-groupings in the fine image. At this point the competition for attention is triggered among the seven sub-groupings, and attention is deployed to them by hierarchical selectivity. The salience maps computed for these groupings are shown in Figure 11 and the sequence of attention deployments is shown in Figure 12.

Fig. 12. The attention movements implemented for the outdoor scene: blue arrows indicate attentional movements between resolutions and red arrows denote attention shifts at the fine resolution.
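Run on this two-grouping scene, the selectivity sketch from Section 2.5 reproduces the deployment order just described; the saliences are again placeholders, and the coarse/fine switching is left implicit in the descent into sub-groupings.

# The far/coarse image holds two top groupings; the boat's seven
# sub-groupings (five people, the red box, the hull) only have detail in
# the near/fine 512x512 image. Salience values are illustrative.
boat_parts = ([Grouping("red box", 0.50)]
              + [Grouping(f"person {i}", 0.45 - 0.05 * i) for i in range(5)]
              + [Grouping("hull", 0.15)])
coarse_scene = [Grouping("shack", 0.90), Grouping("boat", 0.80, boat_parts)]

trace: List[str] = []
hierarchical_selectivity(coarse_scene, lambda g: True, trace)
print(trace)
# The shack wins first and, having no sub-groupings, returns control to the
# coarse image; the boat then wins, and a "yes" on its "view details" flag
# moves the competition to its seven sub-groupings in the fine image.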
The attention deployment trajectory shown in Figure 12 reveals reasonable movements for this natural scene.

3.2.2 Hierarchical Selectivity From Coarse To Fine

The image presented in Figure 13 has 512×512 pixels and contains many structured objects and groupings. The pyramids in the model used here have three layers, ranging from a resolution of 128×128 to 512×512. The Gabor filter was set to be sensitive to the 4 orientations 0°, 45°, 90°, and 135°, and the 1/ρ parameter of the Gaussian distance was set to 50%. The model first extracted colours, intensity, and orientations from the photograph and constructed altogether 9 three-layer pyramids: one intensity pyramid (Figure 15), four colour pyramids (Figure 16), and four orientation pyramids (Figure 17).

Fig. 13. An outdoor photograph.

Eleven meaningful groupings of objects were created manually by preprocessing according to the Gestalt grouping rules (see Section 2.6). Figure 14 shows the identifiers of the different groupings in this image. The numerals pointed to by the white arrows denote the identifiers of the groupings at multiple resolutions. Groupings that share the same prefix belong to the same parent grouping, and the depth of a grouping is given by the length of its identifier. For example, identifier 1-1 indicates the first sub-grouping of grouping No. 1, and identifier 1-1-2 denotes the second sub-grouping of grouping No. 1-1; groupings No. 1-1-1 and No. 1-1-2 have the same parent grouping No. 1-1. The black circles and ellipses are used only to distinguish the different groupings conveniently (the object(s) inside each circle); they are not grouping boundaries. When these groupings are viewed at different resolutions, some groupings/sub-groupings disappear at the lower resolutions. The hierarchy of groupings is shown in Figure 14 and in Figure 26, which is discussed later. The top-down attentional setting was always set to the free state in this test, and the decision points during hierarchical selectivity that drive whether or not to view the details within a grouping were always answered "yes". Although this may make hierarchical selection look like an exhaustive exploration, it allows the general performance of the model to be inspected completely and in detail (see the next section for an alternative implementation in this respect). As we discussed in Section 2.5, the control for recognizing which object is significant is very intricate and needs higher visual processing related to the current visual tasks (see also the discussion below about the small white stripes in this scene); future work will refine this complicated control. In more typical scenes, the top-down priming proposed for the "view details" flag will control the choice and produce more interesting behaviour.

Fig. 14. The identifiers of groupings in the given photograph.

Fig. 15. The intensity pyramid built from the photograph given in Figure 13.

Fig. 16. The four colour pyramids (red, green, blue, and yellow) built from the photograph given in Figure 13. (The graphs are black-white inverted to improve visibility.)

Fig. 17. The four orientation pyramids (0°, 45°, 90°, 135°) built from the photograph given in Figure 13. (Graphs in the second and third rows are black-white inverted to improve visibility.)
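The construction of these 9 three-layer pyramids might be sketched as follows, assuming OpenCV. The paper's own filters are defined in Section 2, so the kernel size, Gabor parameters, and the broadly-tuned colour channels below (in the style of saliency models such as [46]) are all assumptions for illustration.

import cv2
import numpy as np

def gaussian_pyramid(m: np.ndarray, levels: int = 3) -> list:
    """Three-layer pyramid: e.g. 512x512 -> 256x256 -> 128x128."""
    pyr = [m]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def feature_pyramids(bgr: np.ndarray, levels: int = 3) -> dict:
    b, g, r = [c.astype(np.float32) for c in cv2.split(bgr)]
    maps = {"intensity": (b + g + r) / 3.0,
            # broadly tuned colour channels, an assumed Itti/Koch-style choice
            "R": r - (g + b) / 2, "G": g - (r + b) / 2,
            "B": b - (r + g) / 2, "Y": (r + g) / 2 - abs(r - g) / 2 - b}
    intensity = maps["intensity"]
    for theta in (0, 45, 90, 135):          # the 4 Gabor orientations
        kernel = cv2.getGaborKernel((9, 9), sigma=2.0,
                                    theta=np.deg2rad(theta),
                                    lambd=5.0, gamma=0.5)
        maps[f"orientation {theta}"] = cv2.filter2D(intensity, cv2.CV_32F, kernel)
    # 1 intensity + 4 colour + 4 orientation = 9 pyramids
    return {name: gaussian_pyramid(m, levels) for name, m in maps.items()}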
Here, the competition for visual attention was first triggered at the coarsest resolution, namely the highest layer of the pyramids. During the attentional movements, shifts into a higher resolution (lower layers of the pyramids) or switches back to a lower resolution (higher layers of the pyramids) changed dynamically, depending on the natural structure of the grouping currently being attended and of its surroundings. When a structured parent grouping is attended at a high resolution, some or all of its sub-groupings are attended next at the current resolution if they appear there, or at a lower resolution if they do not appear at the current resolution. In this procedure, some sub-groupings within a parent grouping, such as the small white stripes in the road, may have little significance and need not all be attended. Such further top-down control of attention shifts will need additional theory incorporating measures of object similarity, the subject's experience, the current visual task, and so on, and is not implemented here.

The results of the model running at all resolution levels are shown in Figures 18, 19, and 20. At each attentional deployment we show the entire, unitary salience of the grouping that is currently being attended. When the related groupings are ready to compete for visual attention, we show their individual saliences (in shades of grey) in comparison with all the other competitors: the brighter a grouping, the more salient it is.

Fig. 18. Salience of the attended grouping and of the competing groupings, as well as the sequence of attentional movements. The red, green, and blue arrows denote that attention is at, or has switched to, the coarsest, middle, and finest resolution respectively. The small red panel at the top left corner of each slide shows a zoomed view of the objects. The red circle/semi-circle indicates the focus of attention. The grouping identifiers are also given in each panel.

Fig. 19. Continuation of the slides of Figure 18.

Fig. 20. Continuation of the slides of Figure 19.

It is worth noting that no mosaic appearance is seen in the results, because the model is based on object attention, in which a grouping competes for attention using its entire salience, integrating the strength of all its components; the salience shown is therefore the grouping salience rather than that of each pixel within the grouping. However, our grouping-based computation can also be applied to spatial attention if each pixel is regarded as an individual grouping. Figure 21 gives the salience maps obtained from the same outdoor scene for individual pixels at the coarsest resolution (graph C), the middle resolution (graph B), and the finest resolution (graph A). The 1/ρ parameter of the Gaussian distance for this experiment is set to 2%.

Fig. 21. Applying the model to space-based attention. Each pixel is an individual grouping. Only the raw salience maps of the pixels at three resolutions are shown here, in shades of grey.

According to the obtained results, the order of attention shifts is shown in Figure 22. We can see that the attention movements basically coincide with the salience differences between the objects in the scene. Some groupings, such as grouping 6, which consist of several very small sub-groupings, do not exist at the coarser resolutions: they either cannot take part in the competition at all, or lose much of the support from their smaller members and components, or from their surroundings, that might help them compete for attention at the finer resolutions. So, generally, they lose some of their potential advantages at the finer resolutions.

Fig. 22. The overall trajectory of attentional movements of the model at multiple resolutions. Red arrows show attentional shifts from one grouping to another. Yellow and purple arrows show attention switches within groupings. The circles denote the locus of attention.
3.2.3 Hierarchical Selectivity From Far to Near

The three colour images shown in Figure 23 were taken at different resolutions from far to near (64×64, 128×128, and 512×512) for the same outdoor scene. The scene is segmented (by hand) into 6 top groupings (identified by the black numbers: one object grouping, No. 6, and five regions), and all of them except grouping 4 are hierarchically structured. In the coarsest image, only grouping 6 (a boat carrying two people) can be seen. In the finer image, sub-groupings 5-1 and 5-3 within top grouping 5 appear, but their details are lost at this resolution; the smallest boat (sub-grouping 5-2 of grouping 5) can only be seen at the finest resolution. The salience maps of the groupings during the attention competition are also shown briefly in Figure 23, where darker grey shades denote lower salience.

Fig. 23. An outdoor scene photographed from different distances. The dotted circles are used to identify groupings, not their boundaries. The sequence of salience maps used for each selection of the next attended grouping is shown in the middle. The attention movements driven by hierarchical selectivity are shown at the bottom using a tree-like structure.

Fig. 24. The attention movements implemented for the outdoor scene: blue and red arrows indicate attention shifts between resolutions and at the same resolution respectively. Arrows with solid red circles denote attention on the top groupings.

The competition first occurs among the top groupings in the coarsest scene, and the most salient grouping, 6, gains attention. When a "yes" is given to the top-down attentional setting (the "view details" flag), attention shifts to the sub-groupings of 6, and the two people and the boat begin to compete for attention. If a "no" is given, or after grouping 6 has been attended, attention shifts to the next winner, grouping 2. If a "yes" is also given to the "view details" flag of grouping 2, attention first selects sub-grouping 2-1 and then shifts to sub-grouping 2-2. After 2-2 has been attended, if the remainder of grouping 2 is still to be viewed, attention shifts to the finer resolution to visit 2-3. When grouping 5 is attended, the lake (excluding grouping 6) is visited first, and attention then shifts to the finer-resolution scene, where 5-1 and 5-3 start to compete for attention. If a "yes" is given to the top-down flag of the winner 5-3, attention shifts to the finest-resolution scene to check its details; attention then goes back to the previous finer-resolution scene, shifts to 5-1, and afterwards shifts again to the finest-resolution scene, so that the smallest boat, 5-2, is attended at the finest resolution. Figure 23 shows the overall behaviour of the model on this scene. Using this same scene, when stronger and stronger Gaussian noise was added, the order of the attention movements changed once σ exceeded 17. The above results clearly show hierarchical attention selectivity and believable performance in a complicated natural scene. In addition, although this model is aimed at computer vision applications, the results are very similar to what we might expect of human observers. The attention movements shown in Figure 24 reveal reasonable shifts of visual attention for this natural scene.
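The noise robustness probe mentioned above can be sketched as follows; compute_attention_order is a hypothetical callable wrapping the model's salience computation and hierarchical selectivity (not an API of the paper), and the tested sigma range is illustrative.

import numpy as np

def first_reordering_sigma(image: np.ndarray, sigmas, compute_attention_order):
    """Return the first noise level at which the attention order changes.

    For the Figure 23 scene, the order changed once sigma exceeded 17.
    """
    baseline = compute_attention_order(image)
    rng = np.random.default_rng(0)
    for sigma in sigmas:                              # e.g. range(5, 31)
        noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
        noisy = np.clip(noisy, 0, 255).astype(np.uint8)
        if compute_attention_order(noisy) != baseline:
            return sigma      # first sigma that reorders the attention shifts
    return None               # attention order stable over the tested range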
3.3 Improved behaviour of hierarchical selectivity in natural scenes

We have shown the model's performance in a complex natural scene. For a complete examination, we gave a positive response to each "view details" flag. However, some small stripes (on the road) may be irrelevant to the current visual task and thus need not be attended in turn, and some tiny, unreadable characters are probably not worth the observer's notice. One possible way to improve the performance on such targets is to incorporate a top-down recognition component or a learning process that produces a control function with reasonable salience thresholds for different environments and visual tasks. Our current model does not yet implement this complicated top-down control. Instead, we propose an alternative demonstration of the model's abilities, using a simple human-computer interaction to give a positive or negative response to the "view details" top-down attentional setting (see Section 2.5 for more details). Figure 25 shows a logical diagram of the attentional movements produced by hierarchical selectivity on a hierarchical scene containing three structured groupings. In this diagram, groupings A, B, and C have decreasing salience, and left sub-groupings have greater salience than their right siblings; that is, the saliences of A1, A111, B1, and C1 are greater than those of A2, A112, B2, and C2 respectively. Suppose that attention is currently deployed at grouping A111 and a negative answer is given to the "view details" check of the top-down attentional setting. There are then multiple (here four) possible destinations for the next attentional movement: A112, A12, A2, or B (as shown in the diagram). Under our previous strategy, the most salient sibling of A111 (i.e., A112) would win the next attention if a positive answer had been checked from the "view details" flag of A11.

Fig. 25. Diagram of attentional movements in hierarchical selectivity operating on multi-level structured groupings. Red arrows: attentional movements. Blue arrows: feed-back check of the "view details" flag. Green arrows: possible winners competing for the next attention.

That strategy has the advantages of simplicity and of following the closest previous top-down setting at the higher-level grouping (the parent A11 of A111). Here we present an improved strategy for such hierarchical attention shifts. Let S(X) denote the salience of any grouping X. Assume A and B are the most salient of the competing groupings, with S(A) > S(B), and that grouping A (or B) has a multi-level hierarchical structure, so that a tree-like data structure can be used to illustrate these structured groupings. Let the salience of the sub-groupings sharing the same immediate parent decrease from left to right, and let A_{i1,i2,...,ij} be the currently attended sub-grouping at level j of A (when i = 0 or j = 0, A_{i1,i2,...,ij} = A). Thus the first-level sub-groupings of A are ..., A_{i1}, A_{i1+1}, ...; the first-level sub-groupings of A_{i1} are ..., A_{i1,i2}, A_{i1,i2+1}, ...; and so on. Clearly, all sub-groupings to the left of A_{i1,i2,...,ij} have already been attended or ignored. (A short sketch of the resulting selection rule, formalized as equation (23) below, follows.)
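A minimal sketch of the improved rule, stated formally as equation (23) just below: when a shift is requested at the attended node, the candidates are the unattended (right-hand) siblings along the whole chain of ancestors, not only those at the current level. Grouping reuses the class from Section 2.5; the path and attended arguments are illustrative bookkeeping, not the paper's data structures.

from typing import List, Optional, Set

def next_winner(path: List[Grouping], attended: Set[str]) -> Optional[Grouping]:
    """path = [A, A_{i1}, A_{i1,i2}, ..., current node]; attended holds the
    names of all already-visited groupings, including the current node.

    Implements rule (2) / equation (23): take the most salient unattended
    sibling over every level of the current node's ancestor chain. Returning
    None corresponds to rule (1): grouping A is exhausted, so shift to B.
    """
    candidates = []
    for depth in range(1, len(path)):
        parent = path[depth - 1]
        candidates += [c for c in parent.children if c.name not in attended]
    if not candidates:
        return None
    return max(candidates, key=lambda g: g.salience)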
A_{i1,i2,...,ij+1} is the most salient unattended sibling of the currently attended grouping, and A_{i1,i2,...,i(j-1)+1} is the most salient unattended sibling of its parent. When attending A_{i1,i2,...,ij}, if a negative answer is given to the "view details" flag of the top-down attentional setting, or if this sub-grouping has no child, the next potential winner of attention is produced by the following rules:

(1) if A_{i1,i2,...,ij+1} = A, then attention shifts to grouping B;
(2) otherwise, attention shifts to the sub-grouping X with salience

S(X) = max{ S(A_{i1+1}), S(A_{i1,i2+1}), ..., S(A_{i1,i2,...,i(j-1)+1}), S(A_{i1,i2,...,ij+1}) }    (23)

We applied this improved hierarchical selectivity to the natural scene shown in Figure 13. Here the entire scene is re-segmented into seven top groupings, shown by lines of different colours in Figure 26 (graph B); the identifiers of the groupings and their sub-groupings are also given in graph B. Certain sub-groupings segmented within each top grouping are identified, and the remainder (such as the green grass in grouping 7 or the trees in grouping 3) are denoted "others" in graph A of Figure 26. The "view details" flags of the parent groupings of the small white stripes in the road, the trees in the lawns, and the tiny words and symbols below the "30" speed-limit sign were answered "0" (positive) for the first attended item (the first stripe, word, or symbol) and "1" (negative) thereafter. Thus most sub-sub-groupings, such as those within sub-groupings 6-1, 6-2, and 6-3 of top grouping 6, are also abbreviated as "others" in graph A, except for the first-attended sub-sub-groupings (for example, grouping 6-1-1). The 1/ρ parameter of the Gaussian distance is set to 25% for the global competition between the seven regions and to 4% for the local competitions within these regions. With the improved hierarchical selectivity, more natural attentional movements are clearly seen (graph C in Figure 26; note that attention is assumed to shift to the center of mass of the attended grouping). The complete hierarchical selectivity procedure for this scene is shown in graph A, in which the representations have the same meanings as those in Figure 25.

Fig. 26. The overall attentional movements on the natural scene produced by the improved strategy for hierarchical selectivity. Red arrows with a hollow circle indicate that attention goes to a top grouping and then shifts to its sub-groupings respectively. The dotted ellipses are not sub-grouping boundaries and are only used to show the attention movements conveniently.

4 Conclusion

The mechanisms of object-based and space-based visual attention have been widely investigated in psychophysics and neuroscience research; modelling visual attention in computer vision, however, is a quickly growing field, especially the building of computable models of covert attention. Until now, to our knowledge, although computable models of space-based covert attention, such as Koch and Itti's saliency-based attention model [54,46], have been successfully built, no computational model of object-based attention has been developed. We have presented a computable model of hierarchical object-based attention for computer vision. It suggests that object-based and space-based attention can be integrated by using grouping-based salience to deal with dynamic visual tasks. Through the integrated competition of proto-objects based on groupings, selection by objects, locations, and features can work together cooperatively.
We demonstrated the behaviour of the model on a number of synthetic and real images. The experimental results show that its performance concurs with the main findings in the psychophysical literature on object-based and space-based visual attention. The model also exhibits good selectivity by objects, by features, by spatial regions, and by their groupings in complex natural scenes. These successful performances depend on three factors proposed here:

• grouping-based saliency evaluation
• integrated competition between groupings
• hierarchical selectivity

With the grouping-based saliency mechanism, the pop-out of objects and their groupings can be evaluated in a uniform computational framework. By using hierarchical selectivity to drive the attentional movements, the multiple selectivities of objects, features, regions, and their groupings at multiple resolutions can be performed in an integrated selection architecture. To our knowledge, the model proposed in this paper is the first implemented model of object-based visual attention in computer vision, and the first to integrate object-based with space-based visual attention. However, the current model still has several limitations besides the above strengths. One limitation is that we have not yet built a satisfactory method for the grouping processing; this is a great challenge not only for visual attention but for computer vision in general. Another limitation is that we have not presented a complete theory of the goal-driven effects on visual attention, which is necessary for understanding visual attention. Lastly, if a resolution-varying or retina-like operator were applied at each attentional movement, the model would simulate the attentional behaviour of human eyes better, because the resolution of the human eye decreases from the fovea to the periphery of the retina. We are currently investigating these points.

ACKNOWLEDGMENTS

The authors are grateful to Dr Fang Wang, Dr Laurent Itti, Dr Petko Faber, Dr Neil McCormick, and Dr Craig Robertson for their constructive suggestions.

References

[1] I. Ahrns, and H. Neumann, "Space-variant dynamic neural fields for visual attention," Proc. IEEE Computer Vision and Pattern Recognition, Fort Collins, CO, pp. 313-318, 1999.
[2] S. Baluja, and D. Pomerleau, "Dynamic relevance: Vision-based focus of attention using artificial neural networks," Artificial Intelligence, 97, pp. 381-395, 1997.
[3] S. Baluja, and D. Pomerleau, "Expectation-based selective attention for visual monitoring and control of a robot vehicle," Robotics and Autonomous Systems, 22, pp. 329-344, 1997.
[4] M. Behrmann, R. S. Zemel, and M. C. Mozer, "Occlusion, symmetry, and object-based attention: reply to Saiki (2000)," Journal of Experimental Psychology: Human Perception and Performance, 26(4), pp. 1497-1505, 2000.
[5] C. Bundesen, "Cognitive psychology," in A. Kramer, G. H. Cole and G. D. Logan (Eds), Converging Operations in the Study of Visual Selective Attention, pp. 1-44, Washington, DC: American Psychological Association, 1996.
[6] C. Bundesen, "A computational theory of visual attention," Phil. Trans. R. Soc. Lond. B, 353, pp. 1271-1281, 1998.
[7] P. Burt, "Attention mechanisms for vision in a dynamic world," in: Proceedings Ninth International Conference on Pattern Recognition, Beijing, China, pp. 977-987, 1988.
[8] G. Carpenter, S. Grossberg, and G. Lesher, "The representation of visual salience in monkey parietal cortex," Nature, 391, pp. 481-484, 1998.
[9] J. J. Clark, and N. Ferrier, "Modal control of an attentive vision system," Proc. IEEE Inter. Conf. Computer Vision, Tarpon Springs, FL, pp. 514-523, 1988.
[10] J. J. Clark, "Spatial attention and latencies of saccadic eye movements," Vision Research, 39(3), pp. 583-600, 1998.
[11] V. Concepcion, and H. Wechsler, "Detection and localization of objects in time-varying imagery using attention, representation and memory pyramids," Pattern Recognition, 29(9), pp. 1543-1557, 1996.
[12] W. Cowan, "Evolving conceptions of memory storage, selective attention and their mutual constraints within the human information-processing system," Psychol. Bull., 104, pp. 163-191, 1988.
[13] F. Crick, and C. Koch, "Towards a neurobiological theory of consciousness," Seminars in the Neurosciences, 2, pp. 263-275, 1990.
[14] http://www.dai.ed.ac.uk/CVonline.
[15] R. Desimone, and J. Duncan, "Neural mechanisms of selective visual attention," Ann. Rev. Neurosci., 18, pp. 193-222, 1995.
[16] R. Desimone, "Visual attention mediated by biased competition in extrastriate visual cortex," Phil. Trans. R. Soc. Lond. B, 353, pp. 1245-1255, 1998.
[17] J. Driver, and G. C. Baylis, "Attention and visual object segmentation," in R. Parasuraman (Ed.), The Attentive Brain, pp. 299-25, Cambridge, MA: MIT Press, 1998.
[18] J. Driver, G. Davis, C. Russell, M. Turatto, and E. Freeman, "Segmentation, attention and phenomenal visual objects," Cognition, 80, pp. 61-95, 2001.
[19] J. Duncan, "Selective attention and the organization of visual information," J. Exp. Psychol., 113, pp. 501-517, 1984.
[20] J. Duncan, and G. W. Humphreys, "Visual search and stimulus similarity," Psychological Review, 96, pp. 433-458, 1989.
[21] J. Duncan, "Target and non-target grouping in visual search," Perception and Psychophysics, 57(1), pp. 117-120, 1995.
[22] J. Duncan, "Coordinated brain systems in selective perception and action," in: T. Inui and J. L. McClelland (eds.), Attention and Performance XVI, Cambridge, MA: MIT Press, pp. 549-578, 1996.
[23] J. Duncan, et al., "Integrated mechanisms of selective attention," Curr. Opin. Neurobiol., 7, pp. 255-261, 1997.
[24] J. Duncan, "Converging levels of analysis in the cognitive neuroscience of visual attention," Phil. Trans. R. Soc. Lond. B, 353, pp. 1307-1317, 1998.
[25] H. E. Egeth, and S. Yantis, "Visual attention: control, representation, and time course," Annu. Rev. Psychol., 48, pp. 269-297, 1997.
[26] R. Egly, et al., "Shifting visual attention between objects and locations: Evidence from normal and parietal lesion subjects," J. Exp. Psychol. Hum. Percept., 123, pp. 161-177, 1994.
[27] S. Engel, X. Zhang, and B. A. Wandell, "Colour tuning in human visual cortex measured with functional magnetic resonance imaging," Nature, 388(6637), pp. 68-71, 1997.
[28] C. W. Eriksen, and Y. Y. Yeh, "Allocation of attention in the visual field," J. Experimental Psychology: Human Perception and Performance, 11(5), pp. 583-597, 1985.
[29] C. W. Eriksen, and J. D. St. James, "Visual attention within and around the field of focal attention: a zoom lens model," Perception and Psychophysics, 40(4), pp. 225-240, 1986.
[30] S. Exel, and L. Pessoa, "Attentive visual recognition," International Conference on Pattern Recognition, Brisbane, Australia, 1998.
[31] M. J. Farah, et al., ""What" and "where" in visual attention: evidence from the neglect syndrome," in: Unilateral Neglect: Clinical and Experimental, pp. 123-138, 1993.
[32] V. Ferrara, and S. Lisberger, "Attention and target selection for smooth pursuit eye movements," J. Neurosci., 15(11), pp. 7472-7484, 1995.
[33] G. R. Fink, et al., "Space-based and object-based visual attention: shared and specific neural domains," Brain, 120, pp. 2013-2028, 1997.
[34] C. H. Folk, W. R. Remington, and J. H. Wright, "The structure of attentional control: contingent attentional capture by apparent motion, abrupt onset, and color," Journal of Experimental Psychology: Human Perception and Performance, 20(2), pp. 317-329, 1994.
[35] J. P. Gottlieb, et al., "The representation of visual salience in monkey parietal cortex," Nature, 391(6666), pp. 481-484, 1998.
[36] H. Greenspan, S. Belongie, R. Goodman, P. Perona, S. Rakshit, and C. H. Anderson, "Overcomplete steerable pyramid filters and rotation invariance," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 222-228, Seattle, Washington, 1994.
[37] W. E. L. Grimson, et al., "An active visual attention system to play "Where's Waldo"," Proceedings Conference on Computer Vision and Pattern Recognition, Seattle, WA, pp. 85-90, 1994.
[38] S. Grossberg, et al., "A neural theory of attentive visual search: interactions of boundary, surface, spatial and object representations," Psychological Review, 10(3), pp. 470-489, 1994.
[39] S. Grossberg, "How does the cerebral cortex work? Learning, attention, and grouping by the laminar circuits of visual cortex," Spatial Vision, 12(2), pp. 13-185, 1999.
[40] S. Grossberg, and R. Raizada, "Contrast-sensitive perceptual grouping and object-based attention in the laminar circuits of primary visual cortex," Vision Research, 40, pp. 1413-1432, 2000.
[41] T. D. Grove, and R. B. Fisher, "Attention in iconic object matching," Proc. BMVC96, Edinburgh, pp. 293-302, 1996.
[42] D. Heinke, and G. W. Humphreys, "SAIM: A model of visual attention and neglect," in Proc. International Conference on Artificial Neural Networks, pp. 913-918, New York, NY, 1997.
[43] G. W. Humphreys, "Search via recursive rejection (SERR): A connectionist model of visual search," Cognitive Psychology, 25, pp. 43-110, 1993.
[44] G. W. Humphreys, "Neural representation of objects in space: a dual coding account," Phil. Trans. R. Soc. Lond. B, 353, pp. 1341-1351, 1998.
[45] J. E. Hoffman, "Visual attention and eye movements," in Attention, edited by H. Pashler, Psychology Press, pp. 119-154, 1998.
[46] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), pp. 1254-1259, 1998.
[47] L. Itti, and C. Koch, "A saliency-based search mechanism for overt and covert shifts of visual attention," Vision Research, 40(10-12), pp. 1489-1506, 2000.
[48] E. N. Johnson, M. J. Hawken, and R. Shapley, "The spatial transformation of color in the primary visual cortex of the macaque monkey," Nature Neuroscience, 4(4), pp. 409-416, 2001.
[49] D. Kahneman, and A. Henik, "Perceptual organization and attention," in: M. Kubovy and J. R. Pomerantz (eds.), Perceptual Organization, pp. 181-211, Hillsdale, NJ: Erlbaum, 1984.
[50] P. Kaiser, and R. M. Boynton, Human Color Vision, second edition, published by the Optical Society of America, 1996.
[51] S. Kastner, and L. G. Ungerleider, "Mechanisms of visual attention in the human cortex," Annual Review of Neuroscience, 23, pp. 315-341, 2000.
[52] Y. B. Kazanovich, and R. M. Borisyuk, "Dynamics of neural networks with a central element," Neural Networks, 12, pp. 441-454, 1999.
[53] E. Kowler, et al., "The role of attention in the programming of saccades," Vision Research, 35(13), pp. 1897-1916, 1995.
[54] C. Koch, and S. Ullman, "Shifts in selective visual attention: towards the underlying neural circuitry," Human Neurobiology, 4, pp. 219-227, 1985.
[55] A. F. Kramer, and A. Jacobson, "Perceptual organization and focused attention: The role of objects and proximity in visual processing," Perception and Psychophysics, 50, pp. 267-284, 1991.
[56] V. I. Kryukov, "An attention model based on the principle of dominanta," in: A. V. Holden and V. I. Kryukov (eds.), Neurocomputers and Attention I: Neurobiology, Synchronization and Chaos, Manchester: Manchester University Press, pp. 319, 1991.
[57] D. LaBerge, Attentional Processing: The Brain's Art of Mindfulness, Harvard University Press, 1995.
[58] N. Lavie, "Perceptual load as a necessary condition for selective attention," J. Exp. Psychol.: Hum. Percept. Perf., 21, pp. 451-468, 1995.
[59] N. Lavie, and J. Driver, "On the spatial extent of attention in object-based selection," Perception and Psychophysics, 58, pp. 1238-1251, 1996.
[60] G. D. Logan, "The CODE theory of visual attention: an integration of space-based and object-based attention," Psychological Review, 103(4), pp. 603-649, 1996.
[61] S. J. Luck, "Neurophysiology of selective attention," in Attention, edited by H. Pashler, Psychology Press Ltd., pp. 257-295, 1998.
[62] R. M. McPeek, et al., "Saccades require focal attention and are facilitated by a short-term memory system," Vision Research, 39, pp. 1555-1566, 1999.
[63] A. Nemcsics, Color Dynamics, Akadémiai Kiadó, Budapest, 1993.
[64] E. Niebur, et al., "An oscillation based model for the neuronal basis of attention," Vision Research, 33, pp. 2789-2802, 1993.
[65] E. Niebur, and C. Koch, "A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons," J. Comput. Neurosci., 1, pp. 141-158, 1994.
[66] H. C. Nothdurft, "The conspicuousness of orientation and motion contrast," Spatial Vision, 7(4), pp. 341-363, 1993.
[67] B. A. Olshausen, et al., "A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information," J. Neuroscience, 13(11), pp. 4700-4719, 1993.
[68] S. E. Palmer, Vision Science: Photons to Phenomenology, Cambridge, MA: MIT Press, 1999.
[69] H. Pashler, The Psychology of Attention, Cambridge, MA: MIT Press, 1998.
[70] G. A. Patel, and K. Sathian, "Visual search: bottom-up or top-down?" Frontiers in Bioscience, 5, pp. d169-193, January 1, 2001.
[71] P. Kaiser, and R. M. Boynton, Human Color Vision, 2nd edition, published by the Optical Society of America, 1996.
[72] M. I. Posner, "Orienting of attention," Q. J. Exp. Psychol., 32, pp. 3-25, 1980.
[73] E. O. Postma, et al., "SCAN: A scalable model of attentional selection," Neural Networks, 10, pp. 993-1015, 1997.
[74] C. Poynton, "Frequently asked questions about color," http://www.inforamp.net/Poynton/.
[75] A. L. Ratan, "The role of fixation and visual attention in object recognition," MIT AI-TR-1529, July 1995.
[76] D. J. Robinson, and S. E. Peterson, "The pulvinar and visual salience," Trends in Neuroscience, 15(4), pp. 127-132, 1992.
[77] I. A. Rybak, et al., "A model of attention-guided visual perception and recognition," Vision Research, 38, pp. 2387-2400, 1998.
[78] B. J. Scholl, "Objects and attention: the state of the art," Cognition, 80, pp. 1-46, 2001.
[79] A. Shokoufandeh, et al., "View-based object recognition using saliency maps," Image and Vision Computing, 17, pp. 445-460, 1999.
[80] A. M. Sillito, et al., "Visual cortex mechanisms detecting focal orientation discontinuities," Nature, 378, pp. 492-496, 1995.
[81] W. Singer, and C. W. Gray, "Visual feature integration and the temporal correlation hypothesis," Annu. Rev. Neurosci., 18, pp. 555-586, 1995.
[82] G. Sela, and M. D. Levine, "Real-time attention for robotic vision," Real-Time Imaging, 3, pp. 173-194, 1997.
[83] Y. Sun, "Object-based visual attention and attention-driven saccadic eye movements for machine vision," PhD thesis, The University of Edinburgh, expected 2003.
[84] B. Takacs, and H. Wechsler, "A dynamic and multiresolution model of visual attention and its application to facial landmark detection," Computer Vision and Image Understanding, 70(1), pp. 63-73, 1998.
[85] A. Treisman, and G. Gelade, "A feature integration theory of attention," Cognitive Psychology, 12, pp. 97-136, 1980.
[86] A. Treisman, "Features and objects: the fourteenth Bartlett Memorial lecture," Q. J. Experimental Psychology, 40A, pp. 201-237, 1988.
[87] A. Treisman, "The perception of features and objects," in A. Baddeley and L. Weiskrantz (Eds.), Attention: Selection, Awareness, and Control, Oxford: Clarendon Press, pp. 5-35, 1993.
[88] J. K. Tsotsos, et al., "Modelling visual attention via selective tuning," Artificial Intelligence, 78, pp. 507-545, 1995.
[89] M. Usher, and N. Donnelly, "Visual synchrony affects binding and segmentation in perception," Nature, 394, pp. 179-182, 1998.
[90] E. J. Chichilnisky, and B. A. Wandell, "Trichromatic opponent color classification," Vision Research, 39(20), pp. 3444-3458, 1999.
[91] B. A. Wandell, "Computational neuroimaging: color representations and processing," in New Cognitive Neuroscience, M. S. Gazzaniga (Ed.), MIT Press, 1999.
[92] C. F. Westin, et al., "Attention control for robot vision," Proceedings of IEEE Computer Vision and Pattern Recognition, San Francisco, CA, pp. 18-20, 1996.
[93] J. W. Wolfe, "Guided Search 2.0: A revised model of visual search," Psychonomic Bulletin and Review, 1, pp. 202-238, 1994.
[94] J. W. Wolfe, "Visual search," in Attention, edited by H. Pashler, Psychology Press Ltd., pp. 13-73, 1998.
[95] G. Wyszecki, and W. S. Stiles, Color Science: Concepts and Methods, Quantitative Data and Formulae, 2nd edition, published by John Wiley & Sons, 2000.
[96] S. Yantis, "Control of visual attention," in Attention, H. Pashler (ed.), Psychology Press Ltd., pp. 223-256, 1998.