User:Manudouz/sandbox/single-linkage clustering

From Wikipedia, the free encyclopedia

Working example[edit]

First step[edit]

  • First clustering

Let us assume that we have five elements and the following matrix of pairwise distances between them:

a b c d e
a 0 17 21 31 23
b 17 0 30 34 21
c 21 30 0 28 39
d 31 34 28 0 43
e 23 21 39 43 0

In this example, is the lowest value of , so we join elements and .

  • First branch length estimation

Let denote the node to which and are now connected. Setting ensures that elements and are equidistant from . This corresponds to the expectation of the ultrametricity hypothesis. The branches joining and to then have lengths (see the final dendrogram)

  • First distance matrix update

We then proceed to update the initial proximity matrix into a new proximity matrix (see below), reduced in size by one row and one column because of the clustering of with . Bold values in correspond to the new distances, calculated by retaining the minimum distance between each element of the first cluster and each of the remaining elements:

Italicized values in are not affected by the matrix update as they correspond to distances between elements not involved in the first cluster.

Second step[edit]

  • Second clustering

We now reiterate the three previous steps, starting from the new distance matrix  :

(a,b) c d e
(a,b) 0 21 31 21
c 21 0 28 39
d 31 28 0 43
e 21 39 43 0

Here, and are the lowest values of , so we join cluster with element and with element .

  • Second branch length estimation

Let denote the node to which , and are now connected. Because of the ultrametricity constraint, the branches joining or to , and to , and also to are equal and have the following total length:

We deduce the missing branch length: (see the final dendrogram)

  • Second distance matrix update

We then proceed to update the matrix into a new distance matrix (see below), reduced in size by two rows and two columns because of the clustering of with and with  :

Final step[edit]

The final matrix is:

((a,b),c,e) d
((a,b),c,e) 0 28
d 28 0

So we join clusters and .

Let denote the (root) node to which and are now connected. The branches joining and to then have lengths:

We deduce the remaining branch length:

The single-linkage dendrogram[edit]

Single Linkage Dendrogram 5S data
Single Linkage Dendrogram 5S data

The dendrogram is now complete. It is ultrametric because all tips (, , , , and ) are equidistant from  :

The dendrogram is therefore rooted by , its deepest node.