Monday, January 27, 2014

This Week in R: Cluster Analyses

Here I'm going to play around with cluster-analytic techniques to see which is best able to correctly categorize R.A. Fisher's famous iris data. For a more thorough overview of R's capabilities in this domain, see the CRAN Task View on cluster analysis.
Here's a glimpse at the data frame.
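A quick way to get that glimpse (the iris data ships with base R):

head(iris)   # first six rows: four measurements plus the Species factor
str(iris)    # 150 observations of 5 variables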

I'm going to remove "species" because I want only numeric data measured in the same units (otherwise I would need to normalize/scale the data). Also, the first step in most clustering techniques is to generate a distance matrix: for data with n observations, the distance matrix will have n rows and n columns, and its (i,j)th element will be the distance between observation i and observation j. I'm going to use the "dist" function to calculate these distances in R, sticking with its default of Euclidean distance over the four measurement dimensions.
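In code, those two steps look something like this (the object names are arbitrary):

iris.data <- iris[, 1:4]   # drop the Species factor, keep the 4 measurements
d <- dist(iris.data)       # all pairwise Euclidean distances, stored
                           # compactly as a "dist" object (conceptually 150 x 150)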


The first technique is hierarchical agglomerative clustering with hclust. Initially, each flower specimen is assigned to its own cluster; the algorithm then proceeds iteratively, at each stage joining the two most similar clusters, until only a single cluster remains. At each iteration, distances between clusters are recomputed via the Lance–Williams dissimilarity update formula appropriate to the clustering method being used.
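A minimal sketch of the fit, assuming average linkage (note that hclust actually defaults to complete linkage):

hc <- hclust(d, method = "average")   # agglomerative hierarchical clustering
plot(hc, labels = FALSE)              # dendrogram of all 150 specimens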
This gives me a big dendrogram that's a bit hard to read; the groupings at the left and right ends look clean, but there's mixing in the middle.




Now, I want to pit this clustering technique against two others to see which can get closest to finding clusters that correspond to "species" based on "sepal length", "sepal width", "petal length", and "petal width" only.

The first requires the "cluster" package and is based on partitioning around medoids (PAM); it is similar to k-means in that you have to specify the number of clusters beforehand. But whereas k-means attempts to minimize the within-cluster sum of squares, the PAM algorithm randomly selects k of the n data points (here, k = 3) as medoids and assigns each remaining data point to its closest medoid. Then, it swaps each medoid for each non-medoid point, computes the cost of each resulting configuration, keeps the one with the lowest cost, and repeats the process until the configuration stops improving.

The other technique, agglomerative nesting (AGNES), is very similar to hclust. At first, each observation is its own cluster, and at each stage the two nearest clusters (i.e., the pair with the smallest average dissimilarity between points in one cluster and points in the other) are combined. Additionally, it yields an agglomerative coefficient, which measures the amount of clustering structure found.
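Both fits, and the species cross-tabulations that the comparison below rests on, can be sketched roughly as follows (object names are arbitrary, and cutting each tree at 3 clusters is my assumption):

library(cluster)

pam.fit <- pam(d, k = 3)   # partitioning around medoids, k = 3
ag.fit  <- agnes(d)        # agglomerative nesting; average linkage by default
ag.fit$ac                  # agglomerative coefficient

# Cut each hierarchy into 3 clusters and compare with the true labels
table(iris$Species, cutree(hc, k = 3))
table(iris$Species, pam.fit$clustering)
table(iris$Species, cutree(as.hclust(ag.fit), k = 3))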

As you can see, the agglomerative nesting technique comes out on top, completely nailing setosa and versicolor, just beating out PAM and hclust. Interestingly, PAM and AGNES both had difficulty with virginica (while hclust was troubled by versicolor) even though hclust and AGNES are both hierarchical techniques.
Here's the AGNES plot. Average silhouette width is reported as a measure of clustering strength (0.5 to 0.7 indicates a "reasonable structure").
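One way to produce the plot and the silhouette summary (again a sketch, assuming the 3-cluster cut):

plot(ag.fit, which.plots = 2)   # AGNES dendrogram

si <- silhouette(cutree(as.hclust(ag.fit), k = 3), d)
summary(si)$avg.width           # average silhouette width
plot(si)                        # silhouette plot, one bar per observation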




Below is a scatterplot matrix, color-coded by species, to give some intuition for the natural fracturing of the data set along the measured variables.
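A sketch of one way to draw it:

pairs(iris[, 1:4],
      col = iris$Species,   # factor levels map to plot colors 1-3
      main = "Iris measurements by species")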