# Evaluating Clustering

## Preliminaries

    import numpy as np
    from sklearn.metrics import silhouette_score
    from sklearn import datasets
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

## Create Feature Data

    # Generate feature matrix
    X, _ = make_blobs(n_samples = 1000,
                      n_features = 10,
                      centers = 2,
                      cluster_std = 0.5,
                      shuffle = True,
                      random_state = 1)

## Cluster Observations

    # Cluster data using k-means to predict classes
    model = KMeans(n_clusters=2, random_state=1).fit(X)

    # Get predicted classes
    y_hat = model.labels_

## Calculate Silhouette Coefficient

Formally, the \(i\)th observation's silhouette coefficient is:

$$s_{i} = \frac{b_{i} - a_{i}}{\text{max}(a_{i}, b_{i})}$$

where \(s_{i}\) is the silhouette coefficient for observation \(i\), \(a_{i}\) is the mean distance between \(i\) and all other observations in the same cluster, and \(b_{i}\) is the mean distance between \(i\) and all observations in the nearest cluster that \(i\) does not belong to. The value returned by `silhouette_score` is the mean silhouette coefficient over all observations. Silhouette coefficients range between -1 and 1, with values near 1 indicating dense, well-separated clusters.
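To make the formula concrete, here is a minimal NumPy sketch that computes \(a_{i}\), \(b_{i}\), and \(s_{i}\) by hand on a four-point toy dataset. The function name `silhouette_coefficients` and the variables `X_toy` and `labels_toy` are made up for illustration. The sketch assumes Euclidean distance and assigns 0 to observations in single-member clusters; it is not scikit-learn's implementation.

```python
import numpy as np

def silhouette_coefficients(X, labels):
    """Per-observation silhouette coefficients (illustrative sketch, Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)

    # Pairwise Euclidean distance matrix, shape (n, n)
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))

    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            # Assumption: a single-member cluster gets a coefficient of 0
            continue
        # a_i: mean distance to the other members of i's own cluster
        # (the self-distance is 0, so dividing by n_own - 1 excludes it)
        a = dist[i, own].sum() / (n_own - 1)
        # b_i: smallest mean distance from i to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Toy data: two tight pairs of points, five units apart
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])

s_toy = silhouette_coefficients(X_toy, labels_toy)
print(s_toy)         # one coefficient per observation
print(s_toy.mean())  # the mean is the quantity the overall score summarizes
```

By symmetry every point in the toy data has \(a_{i} = 1\) and \(b_{i} = (5 + \sqrt{26})/2 \approx 5.05\), so each coefficient is about 0.80. For real data, scikit-learn's `silhouette_samples` returns the per-observation values directly.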

    # Evaluate model
    silhouette_score(X, y_hat)

0.89162655640721422