
Lecture 16 - (23/04/2026)

Today’s Topics:

  • Taxonomy of Machine Learning

  • K-Means Clustering

  • Hierarchical Agglomerative Clustering

  • Choosing a Clustering Algorithm

Last time, we began our journey into unsupervised learning by discussing Principal Component Analysis (PCA).

In this lecture, we will explore another very popular unsupervised learning concept: clustering. Clustering allows us to “group” similar datapoints together without being given labels telling us what “class” each point belongs to or where it comes from.

Taxonomy of Machine Learning

Supervised Learning

In supervised learning, our goal is to create a function that maps inputs to outputs. Each model is learned from example input/output pairs (training set), validated using input/output pairs, and eventually tested on more input/output pairs. Each pair consists of:

  • Input vector (features)

  • Output value (label)

In regression, our output value is quantitative, and in classification, our output value is categorical.

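As a minimal sketch of this setup (the toy data below is made up purely for illustration), both models are fit on input/output pairs; one predicts a quantity (regression) and one predicts a category (classification):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((100, 2))                              # input vectors (features)
y_quantity = 3 * X[:, 0] + rng.normal(0, 0.1, 100)    # quantitative output -> regression
y_category = (X[:, 0] + X[:, 1] > 1).astype(int)      # categorical output -> classification

reg = LinearRegression().fit(X, y_quantity)           # learned from input/output pairs
clf = LogisticRegression().fit(X, y_category)

print(reg.predict(X[:3]), clf.predict(X[:3]))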

Unsupervised Learning

In unsupervised learning, our goal is to identify patterns in unlabeled data. In this type of learning, we do not have input/output pairs. Sometimes we may have labels but choose to ignore them (e.g. PCA on labeled data). Instead, we are interested in the inherent structure of the data itself rather than in predicting a label from that structure. For example, if we are interested in dimensionality reduction, we can use PCA to reduce our data to a lower dimension.
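For instance, here is a minimal sketch of PCA used for dimensionality reduction (the data is random and only for illustration); note that no labels are involved anywhere:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 unlabeled points in 10 dimensions

# Reduce to 2 dimensions; PCA only looks at the structure of X, never at labels
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (200, 2)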

Now, let’s consider a new problem: clustering.

Clustering Examples

First, let’s look at some use cases for clustering.

Example 1 - Image Compression

Clustering can be used for image compression. Why? Digital images consume significant storage and bandwidth, so reducing the number of colors simplifies images while retaining their visual appeal.

  • Clustering can group similar colors (pixels) in an image into clusters.

  • Each cluster represents a color centroid (mean color of the group).

  • Replace each pixel in the image with the color of its cluster centroid, reducing the number of unique colors.

In the example below, notice that using more clusters makes the compressed image look more similar to the original, while using fewer clusters makes it look more like a silhouette. Finding a good balance lets the compressed image resemble the original while storing less information.

[Figure: the same image compressed with different numbers of clusters]
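A minimal sketch of this color-quantization idea (the image array here is random; in practice you would load a real image, e.g. with PIL or matplotlib):

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for a real image: height x width x 3 (RGB) array
image = np.random.rand(64, 64, 3)
pixels = image.reshape(-1, 3)                      # one row per pixel

k = 8                                              # number of colors to keep
kmeans = KMeans(n_clusters=k, random_state=0).fit(pixels)

# Replace every pixel with the mean color (centroid) of its cluster
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)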

Example 2 - Clustering in Social Networks

In social network clustering, we identify groups (clusters or communities) of individuals who interact more frequently with each other. This can help analyze network behavior, predict relationships, and recommend connections. Some applications include:

  • Detecting communities on social network platforms.

  • Understanding collaboration in organizations.

  • Optimizing targeted marketing strategies.

[Figure: synthetic interaction graph for a photography club]

Shown above is a synthetic graph representing interactions in a photography club, where each node represents a member of the club and each edge represents an interaction between two members. Members can be clustered based on the people they interact with most.
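One way this could be done is sketched below, clustering nodes directly from a made-up adjacency matrix with spectral clustering (community-detection libraries such as networkx offer alternatives):

import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical adjacency matrix for 6 members: entry [i, j] = 1 if members i and j interact
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])

# 'precomputed' tells SpectralClustering to treat A as the affinity matrix
communities = SpectralClustering(n_clusters=2, affinity='precomputed', random_state=0).fit_predict(A)
print(communities)  # e.g. [0 0 0 1 1 1]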

Example 3 - Clustering in Climate Sciences

Clustering in climate science helps to identify patterns in large, complex datasets and provide insights into global and regional climate trends. Some examples include:

  • Climate zone identification

  • Weather pattern analysis

  • Environmental monitoring

In this example, we grouped regions based on average temperatures to understand climate zones.

[Figure: regions grouped by average temperature]

There are many types of clustering algorithms, and they all have strengths, inherent weaknesses, and different use cases. There are two main groups of clustering algorithms we will focus on: Agglomerative approaches to clustering, and Partitional approaches to clustering.

[Figure: taxonomy of clustering algorithms (agglomerative vs. partitional)]

We will first focus on a partitional approach: K-Means clustering.

K-Means Clustering

The most popular clustering approach is K-Means. The algorithm itself entails the following:

  1. Pick an arbitrary $k$, and randomly place $k$ “centers”, each a different color.

  2. Repeat until convergence:

    a. Color each point according to the closest center (centers are also called centroids).

    b. Move the center for each color to the mean of the points with that color.

import numpy as np
import plotly.graph_objects as go
from sklearn.cluster import KMeans

np.random.seed(42)

blob1 = np.random.normal(loc=[2, 2], scale=1.0, size=(600, 2))
blob2 = np.random.normal(loc=[8, 8], scale=1.2, size=(50, 2))
X = np.vstack((blob1, blob2))

k = 2
max_iter = 10

centroids = X[np.random.choice(len(X), k, replace=False)]

frames = []

for i in range(max_iter):
    # Assignment step: color each point according to its nearest centroid
    distances = np.linalg.norm(X[:, None] - centroids, axis=2)
    labels = np.argmin(distances, axis=1)

    frames.append(go.Frame(
        data=[
            go.Scatter(
                x=X[:, 0],
                y=X[:, 1],
                mode='markers',
                marker=dict(color=labels, colorscale='Viridis', size=5),
            ),
            go.Scatter(
                x=centroids[:, 0],
                y=centroids[:, 1],
                mode='markers',
                marker=dict(color='red', size=14, symbol='x'),
            )
        ],
        name=f"Iteration {i}"
    ))

    # Update step: move each centroid to the mean of the points assigned to it
    new_centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])

    if np.allclose(centroids, new_centroids):
        break

    centroids = new_centroids

fig = go.Figure(
    data=frames[0].data,
    frames=frames
)

fig.update_layout(
    title="K-Means Iterations",
    xaxis_title="Feature 1",
    yaxis_title="Feature 2",
    template="plotly_white",
    updatemenus=[{
        "type": "buttons",
        "buttons": [
            {"label": "Play", "method": "animate", "args": [None]},
            {"label": "Pause", "method": "animate", "args": [[None], {"frame": {"duration": 0}}]}
        ]
    }],
    sliders=[{
        "steps": [
            {
                "method": "animate",
                "label": f"{i}",
                "args": [[f"Iteration {i}"], {"mode": "immediate"}]
            }
            for i in range(len(frames))
        ]
    }]
)

fig.show()

Note: K-Means is a completely different algorithm from K-Nearest Neighbors. K-Means is used for clustering, where each point is assigned to one of $K$ clusters. K-Nearest Neighbors, on the other hand, is used for classification (or, less often, regression), and the predicted value is typically the most common class among the $K$ nearest data points in the training set.

One major difference between the two is that $K$-Nearest Neighbors is used for supervised learning, where we have labeled input/output pairs, while $K$-Means is used for unsupervised learning, where we only have features and no label to predict.

The names may be similar, but the algorithms have very little in common.
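As a quick illustration of the difference (the toy data and parameter choices below are only for demonstration): K-Means sees only the features X, while K-Nearest Neighbors also needs the labels y.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([5, 5], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)   # labels exist, but K-Means never sees them

# Unsupervised: K-Means groups points using only the features X
cluster_ids = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

# Supervised: K-Nearest Neighbors needs the labels y to make predictions
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
predicted_labels = knn.predict(X)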


Due to the randomness of where the $K$ centers are initialized, you will get a different clustering every time you run $K$-Means. Consider the four possible $K$-Means outputs below, one for each random seed; in each, the algorithm has converged, and the colors denote the final cluster assignments.

import numpy as np
import plotly.graph_objects as go
from sklearn.cluster import KMeans
from plotly.subplots import make_subplots

np.random.seed(42)

blob1 = np.random.normal(loc=[2, 2], scale=1.0, size=(600, 2))
blob2 = np.random.normal(loc=[4, 4], scale=1.2, size=(50, 2))
X = np.vstack((blob1, blob2))

k = 4
seeds = [0, 1, 2, 3]

fig = make_subplots(rows=2, cols=2, subplot_titles=[f"Seed {s}" for s in seeds])

for idx, seed in enumerate(seeds):
    kmeans = KMeans(n_clusters=k, random_state=seed)
    labels = kmeans.fit_predict(X)
    centers = kmeans.cluster_centers_

    row = idx // 2 + 1
    col = idx % 2 + 1

    fig.add_trace(
        go.Scatter(
            x=X[:, 0],
            y=X[:, 1],
            mode='markers',
            marker=dict(color=labels, colorscale='Viridis', size=5, opacity=0.7),
            showlegend=False
        ),
        row=row, col=col
    )

    fig.add_trace(
        go.Scatter(
            x=centers[:, 0],
            y=centers[:, 1],
            mode='markers',
            marker=dict(color='red', size=12, symbol='x'),
            showlegend=False
        ),
        row=row, col=col
    )

fig.update_layout(
    title="K-Means with k=4",
    template="plotly_white",
    height=800
)

fig.show()

Which clustering output is the best? To evaluate different clustering results, we need a loss function.

The two common loss functions are:

  • Inertia: Sum of squared distances from each data point to its center.

  • Distortion: Weighted sum of squared distances from each data point to its center; each squared distance is weighted by the inverse of its cluster’s size, so each cluster contributes its average squared distance.

[Figure: two clusters showing each point’s distance to its cluster center]

In the example above:

  • Calculated Inertia: $0.47^2 + 0.19^2 + 0.34^2 + 0.25^2 + 0.58^2 + 0.36^2 + 0.44^2$

  • Calculated Distortion: $\frac{0.47^2 + 0.19^2 + 0.34^2}{3} + \frac{0.25^2 + 0.58^2 + 0.36^2 + 0.44^2}{4}$
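As a concrete sketch, inertia and distortion could be computed from points, labels, and centroids like this (the function names are just for illustration):

import numpy as np

def inertia(X, labels, centroids):
    # Sum of squared distances from each point to its assigned center
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))

def distortion(X, labels, centroids):
    # For each cluster, the average squared distance to its center, summed over clusters
    total = 0.0
    for j, c in enumerate(centroids):
        members = X[labels == j]
        if len(members) > 0:
            total += np.sum((members - c) ** 2) / len(members)
    return total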

It turns out that the function K-Means is trying to minimize is inertia, but it often fails to find the global optimum. Why does this happen? We can think of K-Means as a pair of optimizers that take turns: the first optimizer holds the center positions constant and optimizes the point colors, while the second holds the point colors constant and optimizes the center positions.
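A compact sketch of this alternation, separate from the plotting code above, with each optimizer written as its own function (a minimal implementation for illustration, not the scikit-learn one):

import numpy as np

def assign_colors(X, centers):
    # Hold centers fixed; give each point the color of its nearest center
    return np.argmin(np.linalg.norm(X[:, None] - centers, axis=2), axis=1)

def move_centers(X, labels, centers):
    # Hold colors fixed; move each center to the mean of its points
    return np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                     for j in range(len(centers))])

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = assign_colors(X, centers)
        new_centers = move_centers(X, labels, centers)
        if np.allclose(centers, new_centers):
            break
        centers = new_centers
    return labels, centers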

This is a hard problem: give an algorithm that optimizes inertia for a given $K$ ($K$ is picked in advance). Your algorithm should return the exact best centers and colors, but you don’t need to worry about runtime.

We won’t dwell too much on this problem as it delves deep into material from CSCI 350/353/761.

  • For all possible $k^n$ colorings:

    • Compute the $k$ centers for that coloring

    • Compute the inertia for the $k$ centers

      • If the current inertia is better than the best known, write down the current centers and coloring and call that the new best known

No better algorithm has been found for solving the problem of minimizing inertia exactly.
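A brute-force sketch of the enumeration above, written as a hypothetical exact_kmeans function; it is only feasible for very small n and k, since it checks all $k^n$ colorings:

import numpy as np
from itertools import product

def exact_kmeans(X, k):
    # Enumerate every possible coloring of the n points with k colors
    n = len(X)
    best_inertia, best_labels, best_centers = np.inf, None, None
    for coloring in product(range(k), repeat=n):
        labels = np.array(coloring)
        if len(np.unique(labels)) < k:
            continue  # skip colorings that leave a color unused
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        inertia = sum(np.sum((X[labels == j] - centers[j]) ** 2) for j in range(k))
        if inertia < best_inertia:
            best_inertia, best_labels, best_centers = inertia, labels, centers
    return best_labels, best_centers, best_inertia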

Hierarchical Agglomerative Clustering


Now, let us introduce Hierarchical Agglomerative Clustering! We start with every data point in a separate cluster, and we’ll keep merging the most similar pairs of data points/clusters until we have one big cluster left. This is called a bottom-up or agglomerative method.

There are various ways to decide the order in which clusters are combined, called linkage criteria:

  • Single linkage (similarity of the most similar): the distance between two clusters is the minimum distance between a point in the first cluster and a point in the second.

  • Average linkage: the distance between two clusters is the average of all pairwise distances between points in the first cluster and points in the second.

  • Complete linkage (similarity of the least similar): the distance between two clusters is the maximum distance between a point in the first cluster and a point in the second.

The linkage criterion decides how we measure the “distance” between two clusters.

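As a small sketch, scikit-learn’s AgglomerativeClustering exposes the linkage criterion directly (the data and number of clusters below are arbitrary choices for illustration):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)
X = rng.random((12, 2)) * 10

# Same data, same number of clusters; only the linkage criterion differs
for linkage in ["single", "average", "complete"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, labels)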

When the algorithm starts, every data point is in its own cluster. In the plot below, there are 12 data points, so the algorithm starts with 12 clusters. As the clustering begins, it assesses which clusters are the closest together.

import numpy as np
import plotly.graph_objects as go
from scipy.cluster.hierarchy import linkage, fcluster

np.random.seed(42)
X = np.random.rand(12, 2) * 10

Z = linkage(X, method='ward')

max_clusters = len(X)

frames = []

for t in range(max_clusters, 0, -1):
    labels = fcluster(Z, t, criterion='maxclust')
    
    cluster_traces = []
    for cluster_id in np.unique(labels):
        cluster_points = X[labels == cluster_id]
        if len(cluster_points) > 1:
            cluster_traces.append(go.Scatter(
                x=cluster_points[:, 0],
                y=cluster_points[:, 1],
                mode='lines+markers',
                line=dict(width=2),
                marker=dict(size=12),
                name=f"Cluster {cluster_id}",
                showlegend=False
            ))
        else:
            cluster_traces.append(go.Scatter(
                x=cluster_points[:, 0],
                y=cluster_points[:, 1],
                mode='markers',
                marker=dict(size=12),
                name=f"Cluster {cluster_id}",
                showlegend=False
            ))
    
    frames.append(go.Frame(data=cluster_traces, name=f"{t} clusters"))

fig = go.Figure(
    data=frames[0].data,
    frames=frames
)

fig.update_layout(
    title="Hierarchical Clustering",
    xaxis_title="X",
    yaxis_title="Y",
    template="plotly_white",
    updatemenus=[{
        "type": "buttons",
        "buttons": [
            {"label": "Play", "method": "animate", "args": [None]},
            {"label": "Pause", "method": "animate", "args": [[None], {"frame": {"duration": 0}}]}
        ]
    }],
    sliders=[{
        "steps": [
            {"method": "animate",
             "label": frame.name,
             "args": [[frame.name], {"mode": "immediate"}]}
            for frame in frames
        ],
        "currentvalue": {"prefix": "Clusters: "}
    }]
)

fig.show()

Clustering, Dendrograms, and Intuition

Agglomerative clustering is one form of “hierarchical clustering.” It is interpretable because we can keep track of when any two clusters got merged (each cluster is a tree), and we can visualize the merging hierarchy, resulting in a “dendrogram.” We won’t discuss this in detail in this course, but you might see these in the wild. Here are some examples:

import numpy as np
import plotly.graph_objects as go
from scipy.cluster.hierarchy import linkage, dendrogram

# 12 points in 2D
np.random.seed(42)
X = np.random.rand(12, 2) * 10

# Hierarchical clustering
Z = linkage(X, method='ward')

# Dendrogram info
dendro = dendrogram(Z, no_plot=True)

# Coordinates for dendrogram lines
icoord = np.array(dendro['icoord'])
dcoord = np.array(dendro['dcoord'])

# Create figure with 1x2 subplots
from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=["Dendrogram", "Cluster Tree (2D)"])

# 1. Dendrogram (left panel)
for i in range(len(icoord)):
    fig.add_trace(
        go.Scatter(x=icoord[i], y=dcoord[i], mode='lines', line=dict(color='blue', width=2), showlegend=False),
        row=1, col=1
    )

# 2. 2D cluster tree (right panel)
# Map dendrogram leaf order to X coordinates
leaf_order = dendro['leaves']
ordered_X = X[leaf_order]

# Plot points
fig.add_trace(
    go.Scatter(x=ordered_X[:,0], y=ordered_X[:,1], mode='markers+text', marker=dict(size=12, color='red'),
               text=[str(i) for i in leaf_order], textposition='top center'),
    row=1, col=2
)

# Plot cluster merging lines in 2D tree
# Track an approximate 2D position for each cluster: leaves are the original
# points; a merged cluster is placed at the midpoint of the two clusters it joins.
positions = {i: X[i] for i in range(len(X))}
for i in range(len(Z)):
    c1, c2 = int(Z[i, 0]), int(Z[i, 1])
    x1, y1 = positions[c1]
    x2, y2 = positions[c2]
    positions[len(X) + i] = (positions[c1] + positions[c2]) / 2

    fig.add_trace(
        go.Scatter(x=[x1, x2], y=[y1, y2], mode='lines', line=dict(color='blue', width=2), showlegend=False),
        row=1, col=2
    )

fig.update_layout(height=600, width=1000, template="plotly_white", title_text="Hierarchical Clustering: Dendrogram & 2D Tree")
fig.show()

Applying Clustering


Population Data

The algorithms we’ve discussed require us to pick a $K$ before we start. But how do we pick $K$? Often, the best $K$ is subjective. For example, consider the state population plots below.

import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans

url = "https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv"
df = pd.read_csv(url)
# ------------------------
# Example 1: Cluster by total population in 2012
# ------------------------
df_tot = df[(df['ages'] == 'total') & (df['year'] == 2012)].copy()
df_tot = df_tot[df_tot['population'].notna()]  # drop missing
df_tot = df_tot[['state/region', 'population']].rename(columns={'state/region':'State', 'population':'Population'})

kmeans_pop = KMeans(n_clusters=3, random_state=42)
df_tot['Cluster'] = kmeans_pop.fit_predict(df_tot[['Population']]).astype(str)

fig1 = px.bar(df_tot.sort_values('Population'), x='State', y='Population', color='Cluster',
              title="US States Clustered by Total Population (2012)")
fig1.update_layout(xaxis_tickangle=-45)
fig1.show()
# ------------------------
# Example 2: Cluster by population growth 2000 -> 2012
# ------------------------
df_growth = df[(df['ages'] == 'total')].pivot(index='state/region', columns='year', values='population')
df_growth = df_growth.dropna(subset=[2000, 2012])  # keep only regions with data in both years so the growth ratio is well-defined

df_growth['Growth'] = df_growth[2012] / df_growth[2000]

kmeans_growth = KMeans(n_clusters=3, random_state=42)
df_growth['Cluster'] = kmeans_growth.fit_predict(df_growth[['Growth']]).astype(str)
df_growth['State'] = df_growth.index

fig2 = px.bar(df_growth.sort_values('Growth'), x='State', y='Growth', color='Cluster',
              title="US States Clustered by Population Growth (2000 to 2012)")
fig2.update_layout(xaxis_tickangle=-45)
fig2.show()
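One practical, though still subjective, way to compare choices of $K$ is to run K-Means for several values of $K$ and look at how the inertia decreases. Here is a sketch reusing the 2012 population data (df_tot) from the first example above:

import plotly.graph_objects as go
from sklearn.cluster import KMeans

# Fit K-Means for several values of K and record the inertia of each fit
ks = list(range(1, 9))
inertias = [KMeans(n_clusters=k, random_state=42).fit(df_tot[['Population']]).inertia_ for k in ks]

fig3 = go.Figure(go.Scatter(x=ks, y=inertias, mode='lines+markers'))
fig3.update_layout(title="Inertia vs. Number of Clusters (2012 Population)",
                   xaxis_title="K", yaxis_title="Inertia", template="plotly_white")
fig3.show()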