Wangsheng's World
Hyperparameter Choices in Unsupervised Learning
Last Update: October 3, 2024
Clustering Overview
Set up: \( \{ X_i \}_1^n \), some data \(\in \mathbb{R}^d\)
Goal: grouping "similar" data points.
  • Distance based: K-means, EM for Gaussian Mixture Models
  • Spectral based: in-between approach
  • Density based: DBSCAN, MeanShift, etc.
K-means: group similar points together.
\(\{X_i\}_1^n\), points \(\in \mathbb{R}^d\)
Number of clusters: \(K\) (perhaps stemming from the application).
Example:
  • document clustering
    • \(X \in \mathbb{R}^d \equiv document\)
    • \(X_{(i)}\)'s could be word frequencies
  • Image/Speech Segmentation
  • Recommendation System (e.g., grouping Netflix users by their watching history)
K-means Objectives:
\(\{X_i\}_1^n \in \mathbb{R}^d\)
Find a partition \(\{\xi_l\}_1^K\) of the \(\{X_i\}_1^n\), together with a set of "centers" (centroids) \(\{\mu_l\}_{l=1}^K\), one per "cluster", so that the total squared distance between data points and their assigned cluster center is minimized.
\[ \begin{equation*} \begin{split} \mathcal{R}( \{\xi_l\}, \{\mu_l\} ) & \equiv \sum_{l=1}^K \sum_{X_i \in \xi_l} \|X_i - \mu_l\|^2 \\ & \propto \sum_{l=1}^K \frac{1}{n} \sum_{X_i \in \xi_l} \|X_i - \mu_l\|^2 \\ & = \sum_{l=1}^K \frac{n_l}{n} \cdot \frac{1}{n_l} \sum_{X_i \in \xi_l} \| X_i - \mu_l\|^2 \end{split} \end{equation*} \]
where \( \mathcal{R}( \{\xi_l\}, \{\mu_l\} ) \) is the K-means objective, \( n_l \equiv \) # of points in \( \xi_l \), and \(\frac{1}{n_l} \sum_{X_i \in \xi_l} \| X_i - \mu_l\|^2\) is the average squared distance to the center within cluster \(\xi_l\).
Lloyd's Algorithm
  • Start with some choice of \( \{ \mu_l \} \)
  • At each time step:
    • \( \{\xi_l\} \leftarrow \) Assign all \( X_i\)'s to closest \(\mu_l\)
    • \( \{\mu_l\} \leftarrow \) Average of \( X_i\)'s in \(\xi_l\)
  • Repeat until "convergence"
(approximate optimization)
As the figure on the right suggests, the choice of starting points may influence the result; "K-means++" addresses this by considering more careful ways to choose the starting centers.
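Below is a minimal numpy sketch of Lloyd's algorithm (the function and variable names are mine, not from these notes); it also returns the K-means objective \(\mathcal{R}\) for the final assignment.

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Approximately minimize the K-means objective R({xi_l}, {mu_l})."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # start with K centers chosen uniformly at random among the data points
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: send each X_i to its closest center mu_l
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the average of the X_i's assigned to it
        new_mu = np.array([X[labels == l].mean(axis=0) if np.any(labels == l) else mu[l]
                           for l in range(K)])
        if np.allclose(new_mu, mu):
            break                                     # "converged"
        mu = new_mu
    # final assignment and objective value
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    objective = dists[np.arange(n), labels].sum()     # R({xi_l}, {mu_l})
    return labels, mu, objective
```

Since the result depends on the initialization, one typically reruns with several seeds (or uses a K-means++-style initialization) and keeps the run with the smallest objective.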
EM (Expectation Maximization) for Gaussian Mixture Models
Idea: model the clusters as a GMM \[ f(x) = \sum_{l=1}^K \Pi_l\, \mathcal{N}( x;\, \mu_l, \Sigma_l ) \] where \( \Sigma_l \) is the covariance matrix of component \(l\) and \( \sum_l \Pi_l = 1\).
\[ \mathcal{N}(x;\, \mu_l, \Sigma_l) \propto \exp\left(- \frac{1}{2} (x - \mu_l)^T \Sigma_l^{-1} (x - \mu_l)\right) \] \[ \Sigma_l = \mathbb{E}\left[(X - \mu_l)(X - \mu_l)^T\right] \quad \text{for } X \sim \mathcal{N}(\mu_l, \Sigma_l) \]
Lloyd's Algorithm for K-means can be viewed as an approximate EM for a GMM in which every component has covariance \( \Sigma_l = \sigma^{2} I_d \), i.e. \[ \mathcal{N}(\mu_l, \Sigma_l) = \mathcal{N}(\mu_l, \sigma^{2} I_d). \] EM for a GMM and K-means thus implicitly assume "ellipsoidal" (spherical, for K-means) cluster shapes.
Questions:
  • Which of K-means or EM is more general?
    EM is more general.
  • Why NOT always use EM?
    EM is computationally expensive: each iteration involves the very "expensive" step of inverting the covariance matrices, \( \Sigma_l^{-1} \).
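For contrast, here is a compact numpy sketch of EM for a GMM with full covariances (the names, initialization, and the small ridge added for numerical stability are my choices, not the notes'). Note the solve against \(\Sigma_l\) in the E-step, which is the expensive \(\Sigma_l^{-1}\) computation mentioned above.

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """EM for a Gaussian mixture with full covariance matrices (minimal sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                                      # mixing proportions Pi_l
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)    # component means mu_l
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities r_{il} proportional to Pi_l * N(X_i; mu_l, Sigma_l)
        logR = np.empty((n, K))
        for l in range(K):
            diff = X - mu[l]
            sol = np.linalg.solve(Sigma[l], diff.T).T       # the costly Sigma_l^{-1} step
            quad = (diff * sol).sum(axis=1)
            logdet = np.linalg.slogdet(Sigma[l])[1]
            logR[:, l] = np.log(pi[l]) - 0.5 * (quad + logdet + d * np.log(2 * np.pi))
        R = np.exp(logR - logR.max(axis=1, keepdims=True))
        R /= R.sum(axis=1, keepdims=True)
        # M-step: update Pi_l, mu_l, Sigma_l from the responsibilities
        Nl = R.sum(axis=0)
        pi = Nl / n
        mu = (R.T @ X) / Nl[:, None]
        for l in range(K):
            diff = X - mu[l]
            Sigma[l] = (R[:, l, None] * diff).T @ diff / Nl[l] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, R
```

Forcing \(\Sigma_l = \sigma^2 I_d\) and replacing the soft responsibilities with hard assignments recovers the K-means picture above.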
Idea: Reduce Dimensions?
  • PCA or SVD down to dimension \( \mathcal{O}(K) \), \( K \ll d\).
Example of PCA:
Projecting the data onto the top \(s\) "principal components"/eigenvectors of \(\hat{\Sigma} \equiv\) the sample covariance matrix. \[ \hat{\Sigma} = \frac{1}{n} \sum_i(X_i - \hat{\mu})(X_i - \hat{\mu})^T, \qquad \hat{\mu} = \frac{1}{n} \sum_i X_i \] Suppose we apply PCA with \(s = 1\) (\(K = 2\)). Recall the "variance" in direction \( v \in \mathbb{R}^d \), \( \| v \| = 1 \), is given by \( Var(X^Tv) = v^T \Sigma v \).
We hope that the clusters are preserved and that the "directions" of highest variance are aligned with the separation between the cluster means.
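A small sketch of this projection step (my own helper, assuming the data are stacked as an \(n \times d\) matrix): center the data, take the top \(s\) eigenvectors of \(\hat{\Sigma}\) (equivalently, the top right singular vectors of the centered data matrix), and project.

```python
import numpy as np

def pca_embed(X, s):
    """Project X (n x d) onto the top-s eigenvectors of the sample covariance."""
    mu_hat = X.mean(axis=0)
    Xc = X - mu_hat                        # center: X_i - mu_hat
    # eigenvectors of Sigma_hat = (1/n) Xc^T Xc = right singular vectors of Xc
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_s = Vt[:s].T                         # top-s principal directions (d x s)
    return Xc @ V_s                        # embedded points X_tilde in R^s
```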
Note:
Could dimensionality reduction also make clustering easier in terms of the "separation between clusters"?
Example (continued)
Mathematical Intuition:
Suppose: \( \Sigma_1 = \Sigma_2 = \sigma^2 I_d \), \( \mu_1 = - \mu_2 \), and each component comes in proportion \( \frac{1}{2} \).
\[ \begin{equation*} \begin{split} \Sigma = \mathbb{E}XX^T & = \frac{1}{2} \mathbb{E}_1XX^T + \frac{1}{2} \mathbb{E}_2XX^T \\ & = \frac{1}{2} ((\mathbb{E}_1XX^T - \mu_1\mu_1^T ) + \mu_1\mu_1^T + \\ & (\mathbb{E}_2XX^T - \mu_2\mu_2^T ) + \mu_2\mu_2^T) \\ & = \frac{1}{2} (\Sigma_1 + \mu_1\mu_1^T + \Sigma_2 + \mu_2\mu_2^T) \\ & = \frac{1}{2} (\sigma^2 I_d + \mu_1\mu_1^T + \sigma^2 I_d + \mu_1\mu_1^T) \\ & = \sigma^2 I_d + \mu_1\mu_1^T \end{split} \end{equation*} \] where \(X\) denotes a draw from the mixture (so \(\mathbb{E}X = 0\) and \(\Sigma = \mathbb{E}XX^T\)), \(\mathbb{E}_l\) is the expectation conditional on component \(l\), and the next-to-last step uses \(\mu_2\mu_2^T = \mu_1\mu_1^T\).
Highest Variance Direction:
\( \underset{\|v\|=1}{\max} v^T \Sigma v = \underset{\|v\|=1}{\max} \sigma^2v^Tv + v^T\mu_1\mu_1^Tv \) \( \Rightarrow v\) in the direction of \( (\mu_2 - \mu_1) \).
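A quick numerical check of this conclusion on synthetic data (the dimensions, separation, and noise level below are my own choices): for a balanced spherical mixture with means \(\pm\mu_1\), the top eigenvector of the sample covariance should be nearly parallel to \(\mu_1\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 20, 5000, 1.0
mu1 = np.zeros(d); mu1[0] = 3.0                  # mu_1 = -mu_2, separation along coordinate 1
labels = rng.integers(0, 2, size=n)              # each component with proportion 1/2
means = np.where(labels[:, None] == 0, mu1, -mu1)
X = means + sigma * rng.standard_normal((n, d))

Sigma_hat = np.cov(X.T, bias=True)               # ~ sigma^2 I_d + mu_1 mu_1^T
top_vec = np.linalg.eigh(Sigma_hat)[1][:, -1]    # eigenvector of the largest eigenvalue
print(abs(top_vec @ mu1) / np.linalg.norm(mu1))  # ~ 1, i.e. aligned with mu_1
```

The printed cosine should be close to 1, i.e. the highest-variance direction essentially coincides with the direction separating the two cluster means.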
[ Week 1 Ends HERE ]
[ Week 2 Starts HERE ]
Note: Embed the \( X_i \)'s \(\rightarrow \tilde{X}_i\), here via PCA.
Spectral Clustering
Requires the number of clusters \(K\).
Data: \(\{X_i\}_1^n \in \mathbb{R}^d \)
Def: neighborhood or "proximity" graph.
  • K-NN Graph: \(X_i \) - \(X_j\) if \(X_j\) is in the K-nearest neighbors of \(X_i\).
  • \(\epsilon\)-NN Graph: \(X_i \) - \(X_j\) if \( \|X_i - X_j \| \le \epsilon \)
  • Weighted Graph: Put weight \(w_{ij}\) between \(X_i\) & \(X_j\),
    \(w_{ij} \searrow 0\) for "far" \(X_i, X_j\)...
    e.g., \(w_{ij} \propto \exp(-\| X_i - X_j \|^2 / (2\sigma^2)) \) (see the sketch below)
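A minimal sketch of how these three proximity graphs could be built from pairwise distances (helper names are mine; for large \(n\) one would use a nearest-neighbor library instead of the full distance matrix).

```python
import numpy as np

def pairwise_dists(X):
    """Euclidean distance matrix for the rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(sq)

def knn_graph(X, k):
    """Symmetric K-NN adjacency: X_i ~ X_j if j is among i's k nearest neighbors."""
    D = pairwise_dists(X)
    np.fill_diagonal(D, np.inf)
    A = np.zeros_like(D)
    nbrs = np.argsort(D, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)              # symmetrize ("or" rule)

def eps_graph(X, eps):
    """epsilon-NN adjacency: X_i ~ X_j if ||X_i - X_j|| <= eps."""
    D = pairwise_dists(X)
    A = (D <= eps).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

def gaussian_graph(X, sigma):
    """Weighted graph: w_ij = exp(-||X_i - X_j||^2 / (2 sigma^2))."""
    D = pairwise_dists(X)
    W = np.exp(-D ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W
```

As noted just below, \(\epsilon\) (or \(\sigma\)) is often set to a small percentile of the interpoint distances computed by pairwise_dists.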
Neighborhood size is critical
e.g., \(\epsilon \sim \) percentile of interpoint distances
  • pick \(\epsilon \approx 0 \) (Wrong: the graph breaks into isolated points)
  • pick \(\epsilon \rightarrow \infty \) (Wrong: the graph becomes fully connected)
  • pick \(\epsilon\) "somewhere" in between
Important Graphical Objects:
  • Adjacency Matrix: \( A \in \mathbb{R}^{n \times n} \), \( A_{ij} = \mathbb{I}\{ X_i \sim X_j \} \) (i.e., 1 if \(X_i\) and \(X_j\) are connected, 0 otherwise.)
  • Degree Matrix: diagonal \( D \in \mathbb{R}^{n \times n} \), \(D_{ii} =\) "degree" of node \(i\) (# of edges), \(D_{ii} = \sum_j A_{ij}\)
  • Graph Laplacian: \( \mathcal{L} = D - A = \sum_i \lambda_i v_i v_i^T \)
Spectral Clustering Algorithm: build a neighborhood graph on the \(X_i\)'s, form \(\mathcal{L} = D - A\), take the bottom \(K\) eigenvectors \(v_1, \dotsc, v_K\) (smallest eigenvalues) as the columns of \(V \in \mathbb{R}^{n \times K}\), embed \(X_i \rightarrow \tilde{X}_i \equiv\) the \(i\)-th row of \(V\), and run K-means on the \(\tilde{X}_i\)'s.
Proposition:
\(\mathcal{L}\) is positive semi-definite. Moreover, if the graph has \(K\) connected components (CCs) \(\{\xi_l\}_1^K\) (the "ideal" case):
  • The eigenvalue 0 has multiplicity \(K\)
    (\( \iff \) Null space \( Null(\mathcal{L}) \) has dimension \(K\) )
  • every \(v \in Null(\mathcal{L}) \) is constant on each CC \(\xi_l, l = 1, \dotsm, K\).
    if \(X_i, X_j\) connected (in fact are in the same \(\xi_l\)), then \(v_{(i)} = v_{(j)} \).
Corollary:
  • If \(X_i\), \(X_j\) belong to same \(\xi_l\), then \( \tilde{X_i} = \tilde{X_j} \).
  • If \(X_i\), \(X_j\) belong to different \(\xi\)'s (CC's), \( \tilde{X_i} \ne \tilde{X_j} \).
    (There are \(K\) distinct rows in \(V\); row rank \(=\) column rank \(=\) \(K\))
Proof of Proposition:
\[ \begin{align*} v^T \mathcal{L} v & \equiv v^T D v - v^T A v\\ & = \sum_{ij} A_{ij} v_{i}^2 - \sum_{ij} A_{ij} v_{i} v_{j} \end{align*} \]
Notice that \( \sum_{ij} A_{ij} v_{i}^2 = \sum_{ij} A_{ij} v_{j}^2 \).
Also \(A_{ij} \ge 0\), then
\[ \begin{align*} 2 v^T \mathcal{L} v & = \sum_{ij} A_{ij} (v_{i}^2 + v_{j}^2) - 2 \sum_{ij} A_{ij} v_{i} v_{j}\\ & = \sum_{ij} A_{ij} (v_i - v_j)^2 \\ & \ge 0 \end{align*} \]
\( v^T \mathcal{L} v = 0 \) iff \(v_i = v_j\) whenever \(A_{ij} \ne 0\) (i.e., whenever \(X_i\) & \(X_j\) are connected).
\(\Rightarrow v \) is constant on each Connected Component.
Finally, verify that the vectors \( v_l \equiv \mathbb{I}\{\text{on } \xi_l\} \), \(l = 1, \dotsc, K\), are orthogonal, and that every \(v \in Null(\mathcal{L}) \) can be written as \( \sum_l \alpha_l v_l \).
\(\blacksquare\)
Non-Ideal Case:
[Graph to be inserted.]
We build \(G' \approx \text{ideal } G\).
\( \mathcal{L}' \approx \mathcal{L} \Rightarrow\) Bottom \(K\) eigenvectors of \(\mathcal{L}'\) would be "approximately" constant on clusters.
"Matrix Perturbation Theory"
Density Based Clustering
Intuition:
View "clusters" as high-density regions.
A Usual Procedure:
\(\{X_i\}_{i=1}^n\)
  • Estimate the density \(f\) by some \(\hat{f}\) (here \(f\) is the density of the distribution \(P\) on \(\mathbb{R}^d\) from which the data are drawn).
  • Estimate "cluster cores" as regions of high density.
  • Assign every \(X_i\) to a cluster core. (usually the closest)
[Graph to be inserted.]
2-Major Families:
  • DB-Scan: the "cluster core" is an upper-level set of \(\hat{f}\) (see the sketch after this list).
    So pick a \(\lambda\), \( \{X_i: \hat{f}(X_i) \ge \lambda \} \)
    "connected components" of \( \{X_i: \hat{f}(X_i) \ge \lambda \} \)
    & assign every \(X_i\)'s \(\rightarrow \) "closest" connected component of \( \{X_i: \hat{f}(X_i) \ge \lambda \} \).
    One-way: Neighborhood graphs.
  • Mean-Shift: "Cluster-core" are "mode" of \(\hat{f}\).
    "Local maxima" of \(\hat{f}\).
    Assign every \(X_i\) to the mode it reaches by "gradient ascent."
[ Graph to be inserted. ]
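Here is a rough sketch of the DB-Scan-style procedure above, written as a level-set method on a box-kernel density estimate (the parameterization, the quantile rule for \(\lambda\), and the helper name are mine; this illustrates the idea rather than the textbook DBSCAN algorithm).

```python
import numpy as np

def level_set_clusters(X, h, density_quantile=0.5):
    """Cluster cores = connected components of {X_i : f_hat(X_i) >= lambda}."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    f_hat = (D <= h).sum(axis=1) / n              # box-kernel density (up to constants)
    lam = np.quantile(f_hat, density_quantile)    # the choice of lambda is arbitrary
    core = f_hat >= lam
    # connected components of the h-neighborhood graph restricted to core points
    labels = np.full(n, -1)
    current = 0
    for i in np.where(core)[0]:
        if labels[i] != -1:
            continue
        stack = [i]
        labels[i] = current
        while stack:                              # depth-first search over core points
            j = stack.pop()
            for k in np.where(core & (D[j] <= h) & (labels == -1))[0]:
                labels[k] = current
                stack.append(k)
        current += 1
    # assign every remaining point to the component of its closest core point
    core_idx = np.where(core)[0]
    for i in np.where(~core)[0]:
        labels[i] = labels[core_idx[np.argmin(D[i, core_idx])]]
    return labels
```

A mean-shift sketch appears at the end of these notes, after the kernel density estimation material.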
[ Week 2 Ends HERE ]
Remarks
  • about DB-Scan
    • Choice of \(\lambda\) is somewhat arbitrary
    • Does not care about smoothness, but the "height" of each "hill"
    • # of cluster cores depends on choice of \(\lambda\) 🙃
  • about Mean-Shift
    • "Distribution" NOT that smooth
    • Expensive
    • Mean-Shift can also be unstable due to fluctuations in \(\hat{f}\)
    [ Graph to be inserted. ]
Density Estimation
(Intuition:)
For \(x \in \mathbb{R}\), suppose \(X\) has CDF \(F\). Then, for almost all \(x\), \(f(x) = F'(x)\).
\[ (f(x) \approx \frac{F(x+h) - F(x-h)}{2h} \text{ for small }h) \] \[ \begin{align*} f(x) & \approx \frac{P(X \in [x-h, x+h])}{2h} \\ & \approx \frac{1}{n} \cdot \frac{ \text{\# of points in } [x-h, x+h] }{ \text{length of } [x-h, x+h] } \end{align*} \]
For \(x \in \mathbb{R}^d\),
\[ f(x) \approx \frac{P(X \in B(x,h))}{\sigma(B(x,h))} \] where \(B(x,h) = \{ x': \| x' - x \| \le h \} \) and \(\sigma(B(x,h)) = h^d \cdot \sigma_d \) is the volume of \(B(x,h)\), with \(\sigma_d\) the volume of the unit ball.
A simple density estimate over data \( \{ X_i \}_1^n \)
\[ \begin{align*} \hat{f}(x) & = \frac{1}{n} \cdot \frac{\text{\# of }X_i \text{'s in }B(x,h)}{\sigma(B(x,h))} \\ & = \frac{1}{n \cdot h^d} \frac{1}{\sigma_d} \sum_{i=1}^{n} \mathbb{I}\{ x_i \in B(x,h) \} \\ & = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} \frac{1}{\sigma_d} \mathbb{I}\{ \frac{\| x - x_i \|}{h} \le 1 \} \end{align*} \]
To summarize,
\[ \hat{f}(x) = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} K(\frac{x-x_i}{h}), \]
where \(K(u) = \frac{1}{\sigma_d} \mathbb{I} \{ \| u \| \le 1 \} \).
\(\hat{f}\) is called the "kernel density estimate," where \(K\) is a "kernel" that usually is itself a density; namely, \(K \ge 0\) and \( \int K(u)\, du = 1 \).
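A minimal numpy sketch of evaluating this kernel density estimate at a set of query points (function and argument names are mine; kernel can be any of the kernels listed next).

```python
import numpy as np

def kde(x_query, X, h, kernel):
    """f_hat(x) = 1/(n h^d) * sum_i K((x - X_i)/h), evaluated at each row of x_query."""
    n, d = X.shape
    # u[q, i, :] = (x_query[q] - X[i]) / h
    u = (x_query[:, None, :] - X[None, :, :]) / h
    return kernel(u).sum(axis=1) / (n * h ** d)
```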
Examples of Kernels
  • Box Kernel
    \( K(u) = \frac{1}{\sigma_d} \mathbb{I} \{ \| u \| \le 1 \} \), where \( \{\| u \| \le 1\} \) is the "unit ball" and \(\sigma_d\) its volume.
  • Gaussian Kernel
    \(K(u) \propto \exp( - \frac{\|u\|^2}{2} ) \)
  • Triangle Kernel
    \( K(u) \propto (1 - \|u\|)_{+} \)
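For concreteness, these kernels could be written as follows (a sketch; unit_ball_volume is my own helper, and the triangle kernel's normalizing constant is omitted).

```python
import numpy as np
from math import gamma, pi

def unit_ball_volume(d):
    """sigma_d: volume of the unit ball in R^d."""
    return pi ** (d / 2) / gamma(d / 2 + 1)

def box_kernel(u):
    # K(u) = (1/sigma_d) * 1{||u|| <= 1}
    d = u.shape[-1]
    return (np.linalg.norm(u, axis=-1) <= 1) / unit_ball_volume(d)

def gaussian_kernel(u):
    # K(u) ∝ exp(-||u||^2 / 2), normalized so that it integrates to 1 over R^d
    d = u.shape[-1]
    return np.exp(-0.5 * (u ** 2).sum(axis=-1)) / (2 * pi) ** (d / 2)

def triangle_kernel(u):
    # K(u) ∝ (1 - ||u||)_+  (normalizing constant omitted in this sketch)
    return np.clip(1 - np.linalg.norm(u, axis=-1), 0, None)
```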
Question: Is \(\hat{f}\) a density?
\( \hat{f}(x) = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} K(\frac{x-x_i}{h}) \) for some \( \int K(u) du = 1, K \ge 0 \)
  • \( \frac{x-x_i}{h} = u \Rightarrow x = h u + x_i \Rightarrow dx = h^d du \)
  • Thus, \( \int \hat{f}(x) dx = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} \int K(\frac{x-x_i}{h}) dx = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} h^d \int K(u) du = \frac{1}{n} \sum_{i=1}^{n} 1 = 1 \)
  • Hence, \(\hat{f}\) is a density as long as \(\int K(u) du = 1, K \ge 0\)
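A quick one-dimensional numerical check of this fact, reusing the kde and gaussian_kernel sketches above (the toy data and grid are my own choices; the grid-based integral is only approximate).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))             # toy 1-d data, n = 200
grid = np.linspace(-8, 8, 4001)[:, None]      # dense grid covering the data
vals = kde(grid, X, h=0.3, kernel=gaussian_kernel)
print(np.trapz(vals, grid[:, 0]))             # integral of f_hat over the grid: ~ 1.0
```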
Choice of \(h\)
\(h\) is the bandwidth of the kernel density estimate \(\hat{f}_h(x)\).
\(h\) plays a similar role as "neighborhood" size in the "neighborhood" graphs.
  • If \(h\) is too large, \(\hat{f}_h\) over-smooths.
  • If \(h\) is too small, \(\hat{f}_h\) under-smooths.
[ Graph to be inserted. ]
Note:
Expand \( \|\hat{f}_h - f\|^2 = \int (\hat{f}_h(x) - f(x))^2 dx = \|\hat{f}_h\|^2 + \|f\|^2 - 2 \int \hat{f}_h(x) f(x) dx \).
We want to minimize this over \(h\). The \(\|f\|^2\) term does not depend on \(h\) and can be dropped, but notice that computing \(\|\hat{f}_h\|^2\) is an expensive step.
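One standard way to act on this, which the notes do not spell out (so treat the details as my assumption): estimate the cross term \(\int \hat{f}_h f\) by leave-one-out, giving the least-squares cross-validation criterion \( \mathrm{CV}(h) = \|\hat{f}_h\|^2 - \frac{2}{n}\sum_i \hat{f}_{h,-i}(X_i) \), minimized over a grid of candidate bandwidths. The sketch below (1-d, reusing kde and gaussian_kernel from above) approximates the expensive \(\|\hat{f}_h\|^2\) term numerically on a grid.

```python
import numpy as np

def lscv_score(X, h, grid):
    """CV(h) = ||f_hat_h||^2 - (2/n) * sum_i f_hat_{h,-i}(X_i)   (1-d sketch)."""
    n = len(X)
    vals = kde(grid, X, h, gaussian_kernel)
    norm_sq = np.trapz(vals ** 2, grid[:, 0])        # the expensive ||f_hat_h||^2 term
    # leave-one-out estimates f_hat_{h,-i}(X_i)
    loo = np.array([
        kde(X[i:i + 1], np.delete(X, i, axis=0), h, gaussian_kernel)[0]
        for i in range(n)
    ])
    return norm_sq - 2.0 * loo.mean()

# pick h by minimizing CV(h) over a grid of candidate bandwidths
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
grid = np.linspace(-8, 8, 2001)[:, None]
hs = np.linspace(0.05, 1.5, 30)
h_best = hs[np.argmin([lscv_score(X, h, grid) for h in hs])]
```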
Mean-Shift
\( K(u) \propto \exp(- \frac{\|u\|^2}{2}) \) (differentiable)
Recall:
  • Start an ascent from each \(X_i\).
[ Graph to be inserted. ]
  • gradient ascent: start at \(t=0\) from some \(x_0\).
    \( x_{t+1} = x_t + \eta_t \nabla \hat{f}(x_t) \) , where \(\eta_t\) is positive "step size."
\[ \hat{f}_h(x) = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} K(\frac{x-x_i}{h}), \] where \( K(\frac{x-x_i}{h}) = C_h \cdot \exp(- \frac{\|x-x_i\|^2}{2h^2}) \)
Then, \[ \begin{align*} \nabla \hat{f}_h(x) & = C_{h,n} \sum_{i=1}^{n} \nabla K(\frac{x-x_i}{h})\\ & = C_{h,n}' \sum_{i=1}^{n} (x_i - x) \exp( - \frac{\|x-x_i\|^2}{2h^2} ) \end{align*} \]
Ascent: \[ \begin{align*} x_{t+1} & = x_t + \eta_t \nabla \hat{f}(x_t)\\ & = x_t + \eta_t C_{h,n}' \sum_{i=1}^n (x_i - x_t) K(\tfrac{x_t - x_i}{h})\\ & = x_t - x_t \Big( \eta_t \cdot C_{h,n}' \sum_i K(\tfrac{x_t - x_i}{h}) \Big) \\ & \quad + \eta_t \cdot C_{h,n}' \sum_i x_i K(\tfrac{x_t - x_i}{h}) \end{align*} \] We set \( \eta_t \cdot C_{h,n}' \sum_i K(\tfrac{x_t - x_i}{h}) \) to 1, i.e., \( \eta_t = \frac{1}{C_{h,n}' \sum_i K(\frac{x_t - x_i}{h})} \).
We then have: \[ x_{t+1} = \sum_i x_i\, w_h (\|x_i - x_t\|), \qquad \text{with } \sum_i w_h(\|x_i - x_t\|) = 1, \] where \(\|x_i - x_t\|\) is the distance from \(x_i\) to \(x_t\).
So \( x_{t+1} \) is a weighted average (a "mean") of the data points, with weights depending on the distance to \(x_t\); that is why the procedure is called "mean-shift."
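A minimal sketch of these mean-shift iterates with a Gaussian kernel, run from every data point at once (the function name, vectorization, and stopping rule are mine).

```python
import numpy as np

def mean_shift(X, h, n_iter=200, tol=1e-6):
    """Iterate x_{t+1} = sum_i X_i w_h(||X_i - x_t||), starting from each X_i."""
    modes = X.astype(float).copy()
    for _ in range(n_iter):
        # Gaussian weights w_h, one row per current point, normalized to sum to 1
        sq = ((modes[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        W = np.exp(-sq / (2 * h ** 2))
        W /= W.sum(axis=1, keepdims=True)
        new_modes = W @ X                  # the "mean shift": weighted average of the data
        if np.max(np.abs(new_modes - modes)) < tol:
            break
        modes = new_modes
    return modes    # points in the same basin of attraction end up near the same mode
```

Points whose final iterates (nearly) coincide are then grouped into the same cluster.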
[ Graph to be inserted. ]
[ Part 1 ends here. ]