Gaussian mixture models of texture and colour for image database retrieval

We introduce Gaussian mixture models of 'structure' and colour features in order to classify coloured textures in images, with a view to the retrieval of textured colour images from databases. Classifications are performed separately using structure and colour and then combined using a confidence criterion. We apply the models to the VisTex database and to the classification of man-made and natural areas in aerial images. We compare these models with others in the literature, and show an overall improvement in performance.


INTRODUCTION
In many domains of image processing, there is a strong correspondence between entities in the scene and textures (by texture, we mean both what we will later call 'structure' information, and colour information) in the image. This implies that the ability to classify these textures can furnish important semantic information about the scene. Consequently, the problems of texture description and classification, and the closely related problem of segmentation, have received considerable attention, with numerous approaches being proposed (see, e.g., [1] and references therein). In particular, in the field of content-based image retrieval, the ability to answer the question "Is there a significant amount of such-and-such texture in this image?" can be the basis for many types of query.
One approach to characterizing textures is to use statistical models. Many kinds of statistical model have been applied to texture classification, but the closest to the models proposed in this paper are those based on various Markov models. Motivated by the desire to incorporate contextual information, Li and Gray [2] proposed a 2D HMM for image classification. A somewhat different model is the noncausal HMM described in [3]. Another recently popular class of models uses hidden Markov trees to model the joint statistics of wavelet coefficients [4, 5, 6]. In [4, 5, 7], an independent mixture model (IMM) in the wavelet domain is introduced. This model bears some similarities to the GMM proposed in this paper, since it employs a mixture of Gaussians, but in the IMM each wavelet feature is modelled separately by its own mixture, and the features are assumed to be independent. In the GMM framework adopted in this paper, no independence assumption about the features is made. The performance of IMM-based classification was evaluated in [4, 5, 7], using the wavelet coefficients at each pixel as features.

GMMS FOR TEXTURE CLASSIFICATION
We introduce our classification model, describe Gaussian mixture models, and motivate their use. We assume that we are dealing with N texture classes, labelled by n ∈ N = {1, . . . , N}, corresponding to different entities.

Classification model
Since texture is not a local phenomenon, in order to classify a pixel one must take into account a neighbourhood of that pixel. We will compute features from, and assign classes to, S × S subimages called 'blocks'. The set of blocks is denoted B. We define the neighbourhood P(b) of a block b, called a 'patch', to be the set of blocks in a larger T × T subimage with b at its centre. We denote by D_b the data associated to block b, and by ν_b ∈ N the classification of b. Given the likelihood of the data in a block given its class, Pr(D_b | ν_b), we use the following classification rule:

ν̂_b = argmax_{n ∈ N} ∏_{b′ ∈ P(b)} Pr(D_{b′} | ν_{b′} = n).

This says: "Assign to a block b that class n which, if all the blocks in P(b) had class n, would maximize the probability of the data in P(b)" (we assume conditional independence of the data in the blocks in a patch given the classification).
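As an illustration, the rule can be implemented by summing block log-likelihoods over the patch and maximizing over classes (a minimal sketch; the function and variable names are ours, not the paper's):

```python
import numpy as np

def classify_block(patch_loglik):
    """Patch-based classification rule.

    patch_loglik[n, i] = log Pr(D_{b_i} | class n) for the i-th block
    in the patch P(b). Under the conditional-independence assumption,
    the patch log-likelihood for class n is the sum over its blocks;
    we return the class that maximizes it.
    """
    return int(np.argmax(patch_loglik.sum(axis=1)))
```

Per-block log-likelihoods can be cached, so each patch decision reduces to a single sum and argmax.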
The effect of this classification rule is similar to that of a Potts prior, in that it encourages spatial homogeneity of the classification. Its advantage is that it is not necessary to consider the classifications of neighbouring blocks in making a classification decision. This reduces computation time considerably.

Gaussian mixture models
The data D_b associated to each block will be a vector of features, denoted x. We must choose, for each texture class, a probability distribution that represents the feature statistics of a block of that class. We will use Gaussian mixture models for this purpose. Thus, for a given texture class, the probability that x be observed is a convex combination of M Gaussian densities:

p(x | Θ) = ∑_{m=1}^{M} α_m b(x, μ_m, Σ_m),   with α_m ≥ 0 and ∑_{m=1}^{M} α_m = 1,

where b(x, μ, Σ) is a Gaussian of mean μ and covariance Σ. The parameters for a given class are thus Θ = {α_m, μ_m, Σ_m : m = 1, . . . , M}. It is clear that modelling a texture class with a GMM rather than a single Gaussian gives a great deal of added flexibility to the model. Indeed, if one is allowed an arbitrary number of components, any continuous density function can be approximated to any desired accuracy. A GMM is also the natural model to use if a texture class contains a number of distinct subclasses, as is often the case (for example, forest texture in an aerial image).
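In code, the mixture density is a weighted sum of Gaussian densities (a sketch under the definitions above; the function names are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density b(x, mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

def gmm_pdf(x, alphas, mus, Sigmas):
    """Convex combination of M Gaussians: sum_m alpha_m b(x, mu_m, Sigma_m),
    with the mixture weights alpha_m nonnegative and summing to one."""
    return sum(a * gaussian_pdf(x, mu, S)
               for a, mu, S in zip(alphas, mus, Sigmas))
```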

Parameter estimation for GMMs
To apply the above classification procedure, we must learn the parameters of the GMM models. Given a training set consisting of the data X = {x_t | t ∈ T} from the blocks of a particular texture class, we would like to estimate the parameters of the Gaussian mixture density using a maximum likelihood estimator. Fortunately, maximum likelihood parameter estimation for a GMM can be performed using the EM algorithm [8], or approximately using k-means. Lack of space prevents an exposition of these algorithms here, but note that the update steps for GMMs are expressible in closed form.
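A minimal EM fit, showing the closed-form update steps mentioned above (illustrative code, not the authors' implementation; the small ridge term added to the covariances to keep them invertible is our choice):

```python
import numpy as np

def em_gmm(X, M, n_iter=30, seed=0):
    """Maximum-likelihood GMM fit by EM, a minimal sketch.

    X: (T, d) training feature vectors; M: number of components.
    Each iteration: the E-step computes responsibilities, and the
    M-step updates weights, means and covariances in closed form.
    """
    rng = np.random.default_rng(seed)
    T, d = X.shape
    mus = X[rng.choice(T, M, replace=False)]
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * M)
    alphas = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: responsibilities r[t, m] proportional to
        # alpha_m * b(x_t, mu_m, Sigma_m)
        r = np.empty((T, M))
        for m in range(M):
            diff = X - mus[m]
            inv = np.linalg.inv(Sigmas[m])
            expo = -0.5 * np.einsum('td,de,te->t', diff, inv, diff)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigmas[m]))
            r[:, m] = alphas[m] * np.exp(expo) / norm
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form weight, mean and covariance updates
        Nm = r.sum(axis=0)
        alphas = Nm / T
        mus = (r.T @ X) / Nm[:, None]
        for m in range(M):
            diff = X - mus[m]
            Sigmas[m] = ((r[:, m, None] * diff).T @ diff / Nm[m]
                         + 1e-6 * np.eye(d))
    return alphas, mus, Sigmas
```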

The BKG model
In addition to the texture classes that we wish to classify, we introduce also the background ('BKG') class. Its parameters are learned from the blocks in the union of the training sets of each class, the k-means algorithm being used because of its faster convergence properties on large amounts of data. The BKG model has two roles. First, it is used to initialise the training of the individual texture models, thus ensuring that the initialisation is the same for all classes and not biased towards any one. Second, the BKG model is used as a 'no decision' class. If the BKG model is more likely than any of the individual classes, then no decision is made.
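The two roles of the BKG model can be sketched as follows: Lloyd's k-means fits the background model on the pooled training data, and a rejection rule withholds a decision when the BKG model is the most likely (the function names are ours):

```python
import numpy as np

def kmeans(X, M, n_iter=20, seed=0):
    """Lloyd's k-means, used here as the fast fit for the BKG model
    on the pooled training data of all classes (a sketch)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), M, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centre
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for m in range(M):
            pts = X[labels == m]
            if len(pts):
                centres[m] = pts.mean(axis=0)
    return centres, labels

def classify_with_rejection(class_logliks, bkg_loglik):
    """'No decision' rule: return None if the BKG model is more
    likely than every individual texture class."""
    best = int(np.argmax(class_logliks))
    return None if bkg_loglik > class_logliks[best] else best
```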

FEATURE EXTRACTION
We must choose sizes for a block and a patch. For segmentation, there is a trade-off between our ability to discriminate classes and the accuracy of boundary estimation. However, for retrieval purposes, the accuracy of texture boundaries is not such a big issue. We choose a block size of S = 16, since this seems large enough to capture a reasonable sample of the largest structures in the textures in the images with which we are dealing. Choosing patch size is equivalent to choosing a degree of smoothing for the classifications: there is a tendency for blocks near the centre of a given patch to be assigned the same class, since their corresponding patches have many blocks in common. We choose square patches containing nine blocks.
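Concretely, with S = 16 and nine-block patches, block and patch extraction might look like this (a sketch; clipping patches at the image border is our assumption, as the border handling is not specified above):

```python
import numpy as np

S = 16            # block side, in pixels
PATCH_BLOCKS = 3  # patch = 3 x 3 blocks, i.e. nine blocks per patch

def blocks_of(image):
    """Split an image into non-overlapping S x S blocks, returned as
    an array indexed by block coordinates (i, j)."""
    H, W = image.shape[:2]
    nh, nw = H // S, W // S
    cropped = image[:nh * S, :nw * S]
    return cropped.reshape(nh, S, nw, S, *image.shape[2:]).swapaxes(1, 2)

def patch_indices(i, j, nh, nw):
    """Block coordinates of the patch P(b) centred on block (i, j),
    clipped at the image border."""
    half = PATCH_BLOCKS // 2
    return [(a, b)
            for a in range(max(0, i - half), min(nh, i + half + 1))
            for b in range(max(0, j - half), min(nw, j + half + 1))]
```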

Structure Features
Structure features are designed to capture spatial regularity of the texture over the block. We extract structure information from the intensity images alone. We compared several sets of features for this purpose: the energies in different wavelet subbands for both Haar and Daubechies wavelets; AR models of different orders; and the energies of DCT coefficients in regions of frequency space corresponding to a wavelet decomposition. We found in practice that the wavelet-like DCT and the Haar wavelet features performed best, with only the AR models doing significantly worse. The DCT is, however, the most computationally efficient, and we chose these energies as our structure features.
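A sketch of such features: a 2-D orthonormal DCT of the block, with coefficient energies pooled over dyadic frequency regions that mimic the subbands of a wavelet decomposition (the exact region layout is our illustrative choice, not necessarily the one used in the experiments):

```python
import numpy as np

def dct_matrix(S):
    """Orthonormal DCT-II matrix of size S x S."""
    k = np.arange(S)[:, None]
    n = np.arange(S)[None, :]
    D = np.sqrt(2.0 / S) * np.cos(np.pi * (2 * n + 1) * k / (2 * S))
    D[0] /= np.sqrt(2.0)
    return D

def dct_structure_features(block, levels=3):
    """Energies of 2-D DCT coefficients in dyadic frequency regions
    analogous to the LH, HL and HH subbands of a wavelet decomposition.

    block: S x S intensity block. At each level, three regions of the
    coefficient plane each contribute one mean-squared-energy feature.
    """
    D = dct_matrix(block.shape[0])
    C = D @ block.astype(float) @ D.T
    S = block.shape[0]
    feats = []
    for l in range(levels):
        lo, hi = S >> (l + 1), S >> l   # e.g. 8..16, 4..8, 2..4 for S = 16
        feats.append((C[:lo, lo:hi] ** 2).mean())    # horizontal detail
        feats.append((C[lo:hi, :lo] ** 2).mean())    # vertical detail
        feats.append((C[lo:hi, lo:hi] ** 2).mean())  # diagonal detail
    return np.array(feats)
```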

Colour features
Colour provides an extremely powerful cue for distinguishing different entities in the scene. As colour features, we used the mean RGB values over a block and the data covariance of the RGB values over a block. Since the covariance matrix is symmetric, only half of it, including the diagonal, is included in the feature vector. The colour feature vector is thus a 9-dimensional vector, 3 components coming from the mean and 6 from the covariance.
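The 9-dimensional colour feature can be computed directly (a straightforward sketch):

```python
import numpy as np

def colour_features(block_rgb):
    """9-dimensional colour feature: the per-channel RGB means
    (3 values) plus the upper triangle, diagonal included, of the
    3 x 3 data covariance of the RGB values over the block (6 values)."""
    pixels = block_rgb.reshape(-1, 3).astype(float)
    mean = pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)
    iu = np.triu_indices(3)
    return np.concatenate([mean, cov[iu]])
```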
Combining the classifications
The structure features and the colour features yield two separate classifications for each block, each with a confidence attached to the classification decision. If the classifications resulting from using the structure features and the colour features conflict, we choose the decision with the higher confidence.

EXPERIMENTAL RESULTS
The experiments in this section were conducted on the MIT Vision Texture (VisTex) database, and on the aerial images of the San Francisco Bay area that were used in [9,2,10,11].

VisTex texture database
We randomly chose 24 textured colour images of size 512 × 512 from the VisTex database. The textures are displayed in figure 1. Each image was divided into subimages of size 32 × 32 pixels. All blocks extracted from the first 96 subimages of each texture were used for training, while the remaining 160 subimages were used for testing.
For each class we trained a GMM with five components using 30 iterations of the EM algorithm. We chose five components because increasing the number of components did not improve the results significantly. We used 30 iterations for a similar reason: the EM algorithm appeared to have converged after this number of iterations. We used the same initialization for each texture class: the BKG model. The results of the classification using the colour features, the structure features and the combined decision are shown in table 1.

Aerial images
This database includes six 512 × 512 grey-scale images. There exist also manual segmentations of the images into man-made and natural areas. We use these segmentations as ground truth. The images are displayed in figure 2. We used this database for evaluation exactly as it was used in [9, 2]. In each iteration, one image was used as test data, and the other five were used as training data. Performance is evaluated by averaging over all iterations. Each class ('man-made' and 'natural') was modelled by a five-component GMM of the structure features. For initialization we used the BKG model.
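This evaluation protocol amounts to leave-one-image-out cross-validation, which can be sketched generically (`train_fn` and `test_fn` are placeholders for model fitting and error measurement, not names from the paper):

```python
def leave_one_out(images, train_fn, test_fn):
    """Six-fold leave-one-image-out evaluation: each image in turn is
    the test set, the remaining images train the models, and the error
    rates are averaged over the folds (a sketch)."""
    errors = []
    for i, test_img in enumerate(images):
        train_imgs = images[:i] + images[i + 1:]
        model = train_fn(train_imgs)
        errors.append(test_fn(model, test_img))
    return sum(errors) / len(errors)
```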
The results from the GMM algorithm were compared to the results from other statistical models reported in [2, 9]: the 2D HMM (two-dimensional hidden Markov model) [2]; the 2D MHMM (two-dimensional multi-resolution hidden Markov model) [9]; CART (a decision tree algorithm) [12]; and LVQ1 (version 1 of Kohonen's learning vector quantization) [13]. The classification error rates for each test image in the six-fold cross-validation, and the average error rates, are listed in table 2.

CONCLUSION
We have described Gaussian mixture models of texture and colour features, and used them for the classification of textures in the VisTex database and for classifying 'man-made' and 'natural' areas in aerial images. We have compared these models with others in the literature, and shown an overall improvement in performance.

Fig. 2. Aerial images. On the left of each pair, the original images; on the right, the manual segmentations. The dark areas are natural, the lighter areas man-made.

Table 2. Classification error rates (percentage) by algorithm.