Evaluating Clustering

Preliminaries

import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

Create Feature Data

# Generate feature matrix
X, _ = make_blobs(n_samples = 1000,
                  n_features = 10,
                  centers = 2,
                  cluster_std = 0.5,
                  shuffle = True,
                  random_state = 1)

Cluster Observations

# Cluster data using k-means to assign each observation to a cluster
model = KMeans(n_clusters=2, random_state=1).fit(X)

# Get predicted cluster labels
y_hat = model.labels_

Calculate Silhouette Coefficient

Formally, the \(i\)th observation's silhouette coefficient is:

$$s_{i} = \frac{b_{i} - a_{i}}{\text{max}(a_{i}, b_{i})}$$

where \(s_{i}\) is the silhouette coefficient for observation \(i\), \(a_{i}\) is the mean distance between \(i\) and all other observations in the same cluster, and \(b_{i}\) is the mean distance between \(i\) and all observations in the nearest other cluster. The value returned by silhouette_score is the mean silhouette coefficient over all observations. Silhouette coefficients range from -1 to 1: values near 1 indicate dense, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest an observation may have been assigned to the wrong cluster.
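The definition above can be checked by hand. The following is a minimal sketch, not scikit-learn's actual implementation: the helper name manual_silhouette, the variables X_demo and labels_demo, and the smaller sample size are assumptions made for illustration. It computes \(a_{i}\) and \(b_{i}\) from a Euclidean distance matrix and compares the result with silhouette_samples and silhouette_score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

# Small data set in the same style as above (size chosen for illustration only)
X_demo, _ = make_blobs(n_samples=200, n_features=10, centers=2,
                       cluster_std=0.5, random_state=1)
labels_demo = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X_demo)

def manual_silhouette(X, labels):
    """Hypothetical helper: per-observation silhouette coefficients from the definition."""
    D = pairwise_distances(X)  # Euclidean distances between all pairs
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            # Singleton cluster: leave the coefficient at 0
            continue
        # a_i: mean distance to the other members of i's own cluster
        # (D[i, i] is 0, so divide by n_same - 1 to exclude i itself)
        a = D[i, same].sum() / (n_same - 1)
        # b_i: smallest mean distance to the members of any other cluster
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

s_manual = manual_silhouette(X_demo, labels_demo)

# Both comparisons should agree up to floating-point error
print(np.allclose(s_manual, silhouette_samples(X_demo, labels_demo)))
print(np.isclose(s_manual.mean(), silhouette_score(X_demo, labels_demo)))
```

The explicit loop is quadratic in the number of observations and is meant only to mirror the formula; for real data, silhouette_samples and silhouette_score are the better choice.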

# Evaluate model
silhouette_score(X, y_hat)
0.89162655640721422
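Because a higher mean silhouette coefficient indicates denser, better-separated clusters, one common use of the score is comparing candidate numbers of clusters. The sketch below is an illustration under assumptions: the variable names X_k, scores, and best_k, the reduced sample size, and the candidate values of k are not part of the example above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same kind of data as above: two tight blobs (sample size reduced for speed)
X_k, _ = make_blobs(n_samples=300, n_features=10, centers=2,
                    cluster_std=0.5, random_state=1)

# Mean silhouette coefficient for several candidate cluster counts
scores = {}
for k in (2, 3, 4):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_k)
    scores[k] = silhouette_score(X_k, labels_k)

# The candidate with the highest mean silhouette coefficient
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print("best k:", best_k)
```

Since the data were generated from two centers, the score would be expected to peak at two clusters; forcing extra clusters splits a dense blob and lowers the coefficients of the observations in it.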