To illustrate its application to genomics: clustering applied to the genes in a set of microarray data groups together those genes whose expression levels behave similarly across the samples, and when applied to the samples it offers the potential to discriminate pathologies based on their differential patterns of gene expression.
Factors to consider when choosing an algorithm include the nature of the application, the characteristics of the objects to be analyzed, the expected number and shape of the clusters, and the complexity of the problem relative to the computational power available.
Although we will not go over the mathematical details of [19, 20], in this section we summarize some essential points regarding clustering error, error estimation, and inference. Within a probabilistic framework, the objects to be clustered are assumed to be described by vectors of numerical values. These vectors are realizations of a random labeled point process, which produces random sets in a multi-dimensional space, with an unknown random label associated with each vector. Two vectors properly belong to the same cluster if and only if they have the same label produced by the random process.
Although used for many years in the context of gene expression microarray data, clustering has remained highly problematic [2, 12, 17]. Some criticisms raise the question of whether clustering can be used for scientific knowledge: how may one judge the relative worth of clustering algorithms unless the assessment is based on their inference capabilities?
The K-means algorithm has been used to identify transcriptional regulatory sub-networks. Another graph-based algorithm, called CLICK, was introduced in 2000 by Sharan and Shamir. Model-based clustering, in which the clusters are modeled as mixtures of Gaussian distributions, has also been presented, together with the use of the BIC criterion for selecting the number of clusters.
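As a rough illustration of model-based clustering with BIC, the following sketch fits one-dimensional Gaussian mixtures by EM and scores each candidate number of clusters. It is a minimal example under stated assumptions and not the procedure of the cited work: the function names, the quantile initialization, the variance floor, and the synthetic data are all hypothetical, and BIC is taken in the form -2 log L + p ln n, for which lower values are preferred.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, k, n_iter=100):
    """Fit a k-component 1-D Gaussian mixture by EM; return its log-likelihood."""
    n = len(xs)
    grand_mean = sum(xs) / n
    pooled_var = sum((x - grand_mean) ** 2 for x in xs) / n
    # Assumption: a variance floor guards against components collapsing onto a point.
    var_floor = 1e-3 * pooled_var + 1e-12
    ordered = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, equal weights.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [max(pooled_var, var_floor)] * k
    weights = [1.0 / k] * k

    def mixture_terms(x):
        # Weighted component densities at x under the current parameters.
        return [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            terms = mixture_terms(x)
            total = max(sum(terms), 1e-300)
            resp.append([t / total for t in terms])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, var_floor)

    return sum(math.log(max(sum(mixture_terms(x)), 1e-300)) for x in xs)


def bic(log_likelihood, k, n):
    """BIC = -2 log L + p ln n, with p = 3k - 1 free parameters in one dimension."""
    n_params = 3 * k - 1  # (k - 1) weights + k means + k variances
    return -2.0 * log_likelihood + n_params * math.log(n)


def select_k_by_bic(xs, k_max=3):
    """Return the number of components with the lowest BIC, plus all scores."""
    scores = {k: bic(fit_gmm_1d(xs, k), k, len(xs)) for k in range(1, k_max + 1)}
    return min(scores, key=scores.get), scores


# Synthetic data: two well-separated groups (an assumption for illustration only).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(10.0, 1.0) for _ in range(100)]
best_k, scores = select_k_by_bic(data)
```

Expression data are multivariate, so a practical implementation would use full or constrained covariance matrices, and the parameter count entering the BIC penalty would change accordingly.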
An algorithm to select the best clustering rule for a dataset, based on noise injection, replication, and cluster accuracy, was presented in 2002.
In comparison, clustering has historically been approached heuristically; there has been almost no consideration of learning or optimization, and error estimation has been handled indirectly via validation indices. Only recently has a rigorous clustering theory been developed in the context of random sets.
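To make concrete the idea that two vectors belong together exactly when the random process gives them the same label, the sketch below computes an empirical clustering error on a single labeled realization: the fraction of points whose assigned cluster disagrees with the true label, minimized over all ways of matching cluster names to labels. This is one plausible reading offered for illustration, not necessarily the definition used in [19, 20]; the function name and the toy labels are hypothetical.

```python
from itertools import permutations


def empirical_clustering_error(true_labels, cluster_labels):
    """Smallest fraction of mislabeled points over all matchings of clusters to labels."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one point is required")

    n = len(true_labels)
    # Distinct identifiers in order of first appearance (no ordering required).
    true_ids = list(dict.fromkeys(true_labels))
    cluster_ids = list(dict.fromkeys(cluster_labels))

    # Pad with placeholders so that every cluster can be mapped to something,
    # even when the algorithm produced more clusters than there are true labels.
    unmatched = object()
    targets = true_ids + [unmatched] * max(0, len(cluster_ids) - len(true_ids))

    best = n
    for assignment in permutations(targets, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, assignment))
        errors = sum(
            1 for t, c in zip(true_labels, cluster_labels) if mapping[c] != t
        )
        best = min(best, errors)
    return best / n


# Cluster names are arbitrary, so a pure relabeling gives zero error.
relabeled_error = empirical_clustering_error([0, 0, 1, 1], [1, 1, 0, 0])
```

The search over permutations grows factorially with the number of clusters, so this form is only suitable for a handful of clusters. The error of a clustering algorithm in the probabilistic sense would be the expectation of such a quantity over realizations of the point process, which this single-sample sketch does not estimate.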