Skin Colour Analysis

Jamie Sherrah and Shaogang Gong

The detection of skin colour in images is a useful and increasingly popular technique in computer vision for detecting and tracking humans. As a visual cue, skin colour is robust and inexpensive to compute, which makes it useful as an attention-focusing mechanism for more expensive computations. It has been found that skin colour from all ethnicities clusters tightly in hue-saturation (HS)-space [5]. Ignoring intensity immediately introduces some invariance to lighting conditions.

In the literature, physical models have been introduced for modelling colour [3,2], and in particular for human skin [6,4]. For example, one can model light sources as black-body radiators, the reflectance and melanin content of skin, and the linear and non-linear characteristics of a colour-calibrated camera. However, such a model would still be incomplete, because the blood content of the skin affects its colour distribution and the viewing geometry would not be known in general. What is usually required is a simple, general system that can operate with: (1) un-calibrated cameras, (2) arbitrary viewing geometry and (3) unknown but commonly-encountered illuminants.

Let us now use an example to illustrate how a skin model can be built through sampling. To produce an empirical, camera-independent model for skin colour in HS-space, we assumed that the illuminant is one of the commonly-encountered "white" illuminants, namely daylight and fluorescent or incandescent light sources. Thirty-two skin colour image samples were collected from our own cameras and the internet under reasonable lighting conditions. Images with blue and orange light sources were discarded. The pixels from these images were converted to Hue, Saturation and Intensity (HSI)-space. The HS components are plotted in polar coordinates in Figure 1. Hue is the angle $\theta$ and saturation is the radius $\rho$, with red at $\theta = 0$.

Figure 1: Skin pixels plotted in HS-space. Hue is the angle $\theta$ and saturation is the radius $\rho$, with red at $\theta = 0$. Rays are plotted at $\theta_1 = 6^\circ$ and $\theta_2 = 38^\circ$.
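To make the HS representation concrete, the following is a minimal sketch of how one RGB pixel could be mapped to hue and saturation and then tested against the two rays drawn in Figure 1. It assumes the standard textbook HSI conversion, which may differ in detail from the one used to produce the figure. The function names and the idea of using the rays as a hue band are illustrative assumptions, not the authors' classifier.

```python
import math

# Hue angles of the two rays drawn in Figure 1 (degrees).
HUE_LOW_DEG = 6.0
HUE_HIGH_DEG = 38.0


def rgb_to_hs(r, g, b):
    """Map RGB values in [0, 1] to (hue in degrees, saturation in [0, 1]).

    Uses the standard HSI formulas, with red at hue 0. Intensity is
    deliberately dropped. Hue is undefined for black and grey pixels;
    (0.0, 0.0) is returned for them by convention (an assumption).
    """
    total = r + g + b
    if total == 0:
        return 0.0, 0.0
    saturation = 1.0 - 3.0 * min(r, g, b) / total
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    if den == 0:
        return 0.0, 0.0
    # Clamp against rounding error before taking the arc cosine.
    theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
    hue = 360.0 - theta if b > g else theta
    return hue, saturation


def in_hue_band(hue_deg, low=HUE_LOW_DEG, high=HUE_HIGH_DEG):
    """Return True if the hue lies between the two rays (hypothetical test)."""
    return low <= hue_deg <= high


def hs_to_xy(hue_deg, saturation):
    """Polar (hue angle, saturation radius) to Cartesian plot coordinates."""
    t = math.radians(hue_deg)
    return saturation * math.cos(t), saturation * math.sin(t)


# Example with a made-up skin-like pixel.
hue, sat = rgb_to_hs(0.8, 0.6, 0.5)
print(in_hue_band(hue), hs_to_xy(hue, sat))
```

A hue band alone ignores saturation, so it is at best a coarse first filter and not a replacement for a fitted model.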

One easily notices that the skin pixels occupy only a subset of all colours. What is not directly evident from Figure 1 is that in fact there is no single image in which all pixels have saturation below $\rho=0.1$.

Skin colour modelling is usually treated as a binary classification problem: each pixel is classified as skin or non-skin based on its colour components. Given Figure 1, it would appear that the problem could be easily solved for all commonly-encountered environments. In practice, however, this is unlikely. It is inevitable that there will be overlap in colour space between skin pixels and background pixels. For example, the face is likely to contain specularities that will be indistinguishable from white-ish regions in the background. The accuracy of the classification will therefore depend on the colour distribution of the scene background, and better results are obtained by tightly modelling the skin colour distribution in the image set at hand rather than trying to cope with all possible skin hues. A classifier can be trained off-line by having a human identify skin and non-skin pixels. An example would be to model the skin pixels in HS-space using a single 2D Gaussian.

The classification task is complicated by changing illumination conditions, which alter the distribution of skin colour in the image over time. There have been successful applications of adaptive skin colour models that track the colour distribution over time. The two main approaches use histograms [1] and mixtures of Gaussians adapted using Expectation-Maximisation [5]. The difficulty with these models is identifying when colour tracking has failed, as the adaptation can occasionally conform to background colours.

To conclude, in a general setting skin colour alone will not be sufficiently reliable to identify human subjects in a scene that is likely to contain skin-look-alike backgrounds. Therefore this useful and computationally inexpensive visual cue ought to be combined with other sources of information such as shape, appearance and motion in order to be truly effective.
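As an illustration of the single 2D Gaussian model mentioned above, the sketch below fits a mean and covariance to hand-labelled skin samples in HS-space and classifies a pixel by its squared Mahalanobis distance. Several choices here are assumptions made for the example and are not taken from the text: the Cartesian embedding of the polar HS coordinates, the threshold of 5.99 (the 95% point of a chi-square distribution with two degrees of freedom), all function names, and the training samples, which are invented.

```python
import math


def hs_to_xy(hue_deg, saturation):
    """Embed polar HS coordinates in the plane so that hue wraps correctly."""
    t = math.radians(hue_deg)
    return saturation * math.cos(t), saturation * math.sin(t)


def fit_gaussian(points):
    """Return the sample mean and 2x2 covariance (sxx, sxy, syy) of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def is_skin(hue_deg, saturation, mean, cov, threshold=5.99):
    """Classify one pixel as skin if it lies inside the Gaussian's ellipse."""
    return mahalanobis_sq(hs_to_xy(hue_deg, saturation), mean, cov) <= threshold


# Invented (hue in degrees, saturation) samples standing in for pixels that
# a human labelled as skin; real training data would come from images.
training_hs = [(12, 0.30), (18, 0.35), (22, 0.28), (15, 0.40),
               (25, 0.33), (20, 0.25), (10, 0.36), (28, 0.31)]
mean, cov = fit_gaussian([hs_to_xy(h, s) for h, s in training_hs])

print(is_skin(18, 0.32, mean, cov))   # a pixel near the training cluster
print(is_skin(240, 0.80, mean, cov))  # a saturated blue pixel
```

The threshold trades missed skin pixels against background pixels accepted as skin, so it would need tuning for a given scene. A fixed model of this kind does not address the changing illumination discussed above.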

Shaogang Gong 2001-05-18