Clustering Overview
Setup: data \( \{ X_i \}_1^n \in \mathbb{R}^d \)
Goal: group "similar" data points.
- Distance based: K-means, EM for Gaussian Mixture Models
- Spectral based: in-between approach
- Density based: DBSCAN, MeanShift, etc.
K-means: group similar points together.
\(\{X_i\}_1^n\), points \(\in \mathbb{R}^d\)
Number of clusters: \(K\), often suggested by the application.
Examples:
- Document clustering:
  - \(X \in \mathbb{R}^d \equiv\) a document
  - the coordinates \(X_{(i)}\) could be word frequencies
- Image/Speech Segmentation
- Recommendation System (e.g., grouping Netflix users by their watching history)
K-means Objective:
\(\{X_i\}_1^n \in \mathbb{R}^d\)
Find a partition \(\{\xi_l\}_1^K\) of \(\{X_i\}_1^n\), together with
a set of "centers" \(\{\mu_l\}_{l=1}^K\), one for each "cluster", so that
the total squared distance between data points and their assigned cluster
"center" (centroid) is minimized.
\[
\begin{equation*}
\begin{split}
\mathcal{R}( \{\xi_l\}, \{\mu_l\} ) & \equiv \sum_{l=1}^K \sum_{X_i \in \xi_l} \|X_i - \mu_l\|^2 \\
& \propto \frac{1}{n} \sum_{l=1}^K \sum_{X_i \in \xi_l} \|X_i - \mu_l\|^2 \\
& = \sum_{l=1}^K \frac{n_l}{n} \cdot \frac{1}{n_l} \sum_{X_i \in \xi_l} \| X_i - \mu_l\|^2
\end{split}
\end{equation*}
\]
where \( \mathcal{R}( \{\xi_l\}, \{\mu_l\} ) \) is the K-means objective,
\( n_l \equiv \) # of points in \( \xi_l \),
and \(\frac{1}{n_l} \sum_{X_i \in \xi_l} \| X_i - \mu_l\|^2\) is the average squared distance to the center \(\mu_l\).
Lloyd's Algorithm
- Start with some choice of \( \{ \mu_l \} \)
- At each time step:
- \( \{\xi_l\} \leftarrow \) Assign all \( X_i\)'s to closest \(\mu_l\)
- \( \{\mu_l\} \leftarrow \) Average of \( X_i\)'s in \(\xi_l\)
- Repeat till "convergence"
(approximate optimization)
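A minimal numpy sketch of Lloyd's algorithm as described above; the function name `lloyd`, the random initialization, and the iteration cap are illustrative choices, not prescribed by the notes.

```python
# A minimal sketch of Lloyd's algorithm with numpy (not an optimized implementation).
# Assumes X is an (n, d) array and K is the desired number of clusters.
import numpy as np

def lloyd(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start with some choice of centers: here, K random data points.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: send each X_i to its closest center.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (n, K)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the average of its assigned points.
        # (Empty clusters keep their old center -- one common convention.)
        new_mu = np.array([X[labels == l].mean(axis=0) if np.any(labels == l) else mu[l]
                           for l in range(K)])
        if np.allclose(new_mu, mu):   # "converged"
            break
        mu = new_mu
    return labels, mu
```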

As the figure suggests, the choice of starting points can influence the result.
"K-means++" addresses this by choosing the initial centers more carefully (spread out over the data).
EM (Expectation Maximization) for Gaussian Mixture Models
Idea: model clusters as a GMM
\[ f(x) = \sum_{l=1}^K \Pi_l \, \mathcal{N}( x; \mu_l, \Sigma_l ) \]
where \( \Sigma_l \) is the covariance matrix of component \(l\) and \( \sum_l \Pi_l = 1\),
\[ \mathcal{N}(x; \mu_l, \Sigma_l) \propto \exp(- \frac{1}{2} (x - \mu_l)^T \Sigma_l^{-1} (x - \mu_l)) \]
\[ \Sigma_l = \mathbb{E}[(X - \mu_l)(X - \mu_l)^T] \quad \text{for } X \sim \mathcal{N}(\mu_l, \Sigma_l) \]
Lloyd's Algorithm for K-means can be viewed as an approximate EM for a GMM
in which every component is spherical:
\[ \mathcal{N}(\mu_l, \Sigma_l) = \mathcal{N}(\mu_l, \sigma^{2} I_d). \]
EM for GMM & K-means are implicitly assuming "ellipsoidal" (spherical for K-means) cluster shapes.
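To make the comparison concrete, here is a hedged scikit-learn sketch (the library, the toy data, and all parameter values are assumptions) contrasting K-means with EM for a spherical-covariance GMM:

```python
# Illustration of the remark above: K-means vs. EM for a GMM with spherical
# covariances, using scikit-learn (library choice is an assumption).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two spherical Gaussian clusters in R^2.
X = np.vstack([rng.normal(loc=[-2, 0], scale=0.5, size=(200, 2)),
               rng.normal(loc=[+2, 0], scale=0.5, size=(200, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, covariance_type="spherical",
                      random_state=0).fit(X)

print(km.cluster_centers_)   # centroids from Lloyd's algorithm
print(gmm.means_)            # GMM means; close to the K-means centroids on data like this
```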
Questions:
- Which of K-means or EM is more general?
  EM is more general.
- Why NOT always use EM?
  EM is expensive in time: each iteration involves the very "expensive" step of inverting the matrices \( \Sigma_l \) (computing \( \Sigma_l^{-1} \)).
Idea: Reduce Dimensions?
- PCA or SVD down to dimension \( \mathcal{O}(K) \), \( K \ll d\).
Example of PCA:
Projecting the data to the top s "principal components"/eigenvectors of
\(\hat{\Sigma} \equiv\) learned covariance matrix.
\[ \hat{\Sigma} = \frac{1}{n} \sum_i(X_i - \hat{\mu})(X_i - \hat{\mu})^T \]
\[ \hat{\mu} = \frac{1}{n} \sum_i X_i \]
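A minimal sketch of the PCA projection just described, via the SVD of the centered data matrix; `pca_project` and its arguments are illustrative names.

```python
# A minimal PCA sketch: project to the top s principal components of the
# sample covariance, computed through the SVD of the centered data (numpy only).
import numpy as np

def pca_project(X, s):
    mu_hat = X.mean(axis=0)                  # \hat{\mu}
    Xc = X - mu_hat                          # center the data
    # Rows of Vt are eigenvectors of the sample covariance (1/n) Xc^T Xc.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:s].T                     # (n, s) embedded data \tilde{X}_i
```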

Suppose we apply PCA down to \(s = 1\) (for \(K = 2\) clusters).
Recall the "variance" in direction \( v \in \mathbb{R}^d \),
\( \| v \| = 1 \) is given by \( Var(X^Tv) = v^T \Sigma v \).
We hope that clusters are preserved, and that the "directions" of highest variance
are aligned with the separation between cluster means.
Note:
Could this also make clustering easier, in terms of the "separation between clusters"?
Example Continues

Mathematical Intuition:
Suppose \( \Sigma_1 = \Sigma_2 = \sigma^2 I_d \) and \( \mu_1 = - \mu_2 \),
with each component appearing in proportion \( \frac{1}{2} \).
\[
\begin{equation*}
\begin{split}
\Sigma = \mathbb{E}XX^T & = \frac{1}{2} \mathbb{E}_1XX^T + \frac{1}{2} \mathbb{E}_2XX^T \\
& = \frac{1}{2} ((\mathbb{E}_1XX^T - \mu_1\mu_1^T ) + \mu_1\mu_1^T + \\
& (\mathbb{E}_2XX^T - \mu_2\mu_2^T ) + \mu_2\mu_2^T) \\
& = \frac{1}{2} (\Sigma_1 + \mu_1\mu_1^T + \Sigma_2 + \mu_2\mu_2^T) \\
& = \frac{1}{2} (\sigma^2 I_d + \mu_1\mu_1^T + \sigma^2 I_d + \mu_1\mu_1^T) \\
& = \sigma^2 I_d + \mu_1\mu_1^T
\end{split}
\end{equation*}
\]
where \(X\) is drawn from the mixture (note \(\mathbb{E}X = 0\) since \(\mu_1 = -\mu_2\), so \(\Sigma = \mathbb{E}XX^T\)).
Highest Variance Direction:
\( \underset{\|v\|=1}{\max}\, v^T \Sigma v = \underset{\|v\|=1}{\max}\, \left( \sigma^2 v^Tv + (v^T\mu_1)^2 \right) \)
\( \Rightarrow \) the maximizing \(v\) is in the direction of \( \mu_1 \), i.e., along the line through \(\mu_1\) and \(\mu_2 = -\mu_1\).
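A quick numeric illustration of this conclusion (the dimensions, means, and noise level are arbitrary choices): the top eigenvector of the sample covariance should align with \(\mu_1\).

```python
# Numeric check of the derivation above (an illustration, not a proof):
# for a symmetric two-component mixture, the top eigenvector of the covariance
# aligns with the line through the cluster means.
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 10, 0.5
mu1 = np.zeros(d); mu1[0] = 3.0            # mu_2 = -mu_1
X = np.vstack([rng.normal(mu1, sigma, size=(500, d)),
               rng.normal(-mu1, sigma, size=(500, d))])

Sigma_hat = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
v_top = eigvecs[:, -1]                      # direction of highest variance
print(abs(v_top @ mu1) / np.linalg.norm(mu1))  # close to 1: aligned with mu_1
```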
[ Week 1 Ends HERE ]
[ Week 2 Starts HERE ]

Note: Embedding \( X_i \)'s \(\rightarrow \tilde{X_i}\).
Here by PCA.
Spectral Clustering

Requires the number of clusters \(K\).
Data: \(\{X_i\}_1^n \subset \mathbb{R}^d \)
Def: neighborhood or "proximity" graph.
- K-NN Graph: \(X_i \) - \(X_j\) if \(X_j\) is in the K-nearest neighbors of \(X_i\).
- \(\epsilon\)-NN Graph: \(X_i \) - \(X_j\) if \( \|X_i - X_j \| \le \epsilon \)
- Weighted Graph: put weight \(w_{ij}\) between \(X_i\) & \(X_j\),
  with \(w_{ij} \searrow 0\) for "far" \(X_i, X_j\),
  e.g., \(w_{ij} \propto \exp(-\| X_i - X_j \|^2 / (2\sigma^2)) \)
Neighborhood size is critical.

e.g., choose \(\epsilon \sim \) a percentile of the interpoint distances.
- pick \(\epsilon \approx 0 \): wrong (the graph falls apart into isolated points)
- pick \(\epsilon \rightarrow \infty \): wrong (the graph becomes fully connected and uninformative)
- pick \(\epsilon\) "somewhere" in between (see the sketch below)
Important Graphical Objects:
- Adjacency Matrix: \( A \in \mathbb{R}^{n \times n} \),
  \( A_{ij} = \mathbb{I}\{ X_i \sim X_j \} \) (1 if \(X_i\) and \(X_j\) are connected).
- Degree Matrix: \( D \in \mathbb{R}^{n \times n} \), diagonal with
  \(D_{ii} =\) "degree" of node \(i\) (# of edges at \(i\)), i.e., \(D_{ii} = \sum_j A_{ij}\).
- Graph Laplacian: \( \mathcal{L} = D - A = \sum_i \lambda_i v_i v_i^T \)
Spectral Clustering Algorithm:
Embed each \(X_i \rightarrow \tilde{X_i} \equiv\) the \(i\)-th row of \(V \equiv [v_1, \dots, v_K] \in \mathbb{R}^{n \times K}\),
the bottom \(K\) eigenvectors of \(\mathcal{L}\), then cluster the \(\tilde{X_i}\)'s (e.g., by K-means).
Proposition:
Suppose the graph has \(K\) connected components (CC's) \(\xi_1, \dots, \xi_K\). Then:
- \(\mathcal{L}\) is positive (semi-)definite.
- The eigenvalue \(0\) has multiplicity \(K\)
  (\( \iff \) the null space \( Null(\mathcal{L}) \) has dimension \(K\)).
- Every \(v \in Null(\mathcal{L}) \) is constant on each CC \(\xi_l\), \(l = 1, \dots, K\):
  if \(X_i, X_j\) are in the same \(\xi_l\), then \(v_{(i)} = v_{(j)} \).
Corollary:
- If \(X_i\), \(X_j\) belong to the same \(\xi_l\), then \( \tilde{X_i} = \tilde{X_j} \).
- If \(X_i\), \(X_j\) belong to different \(\xi_l\)'s (CC's), then \( \tilde{X_i} \ne \tilde{X_j} \).
  (There are \(K\) distinct rows in \(V\); row rank \(=\) column rank \(=\) \(K\).)
Proof of Proposition:
\[
\begin{align*}
v^T \mathcal{L} v & \equiv v^T D v - v^T A v\\
& = \sum_{ij} A_{ij} v_{i}^2 - \sum_{ij} A_{ij} v_{i} v_{j}
\end{align*}
\]
Notice that \( \sum_{ij} A_{ij} v_{i}^2 = \sum_{ij} A_{ij} v_{j}^2 \) (by symmetry of \(A\)).
Also \(A_{ij} \ge 0\), so
\[
\begin{align*}
2 v^T \mathcal{L} v & = \sum_{ij} A_{ij} (v_{i}^2 + v_{j}^2) - 2 \sum_{ij} A_{ij} v_{i} v_{j}\\
& = \sum_{ij} A_{ij} (v_i - v_j)^2 \\
& \ge 0
\end{align*}
\]
\( v^T \mathcal{L} v = 0 \) iff \(v_i = v_j\) whenever \(A_{ij} \ne 0\)
(i.e., whenever \(X_i\) & \(X_j\) are connected)
\(\Rightarrow v \) is constant on each connected component.
Finally, verify that the vectors \( v_l \equiv \mathbb{I}\{\text{on } \xi_l\} \), \(l = 1, \dots, K\), are orthogonal,
and that every \(v \in Null(\mathcal{L}) \) can be written as \( \sum_l \alpha_l v_l \).
\(\blacksquare\)
Non-Ideal Case:
[Graph to be inserted.]
We build \(G' \approx \text{ideal } G\).
\( \mathcal{L}' \approx \mathcal{L} \Rightarrow\) Bottom \(K\) eigenvectors of \(\mathcal{L}'\)
would be "approximately" constant on clusters.
"Matrix Perturbation Theory"
Density Based Clustering
Intuition:
View "clusters" as high-density regions.
A Usual Procedure:
Given \(\{X_i\}_{i=1}^n\):
- Estimate the density \(f\) by some \(\hat{f}\)
  (\(f\) being the density of some distribution \(P\) on \(\mathbb{R}^d\)).
- Estimate "cluster cores" as regions of high density.
- Assign every \(X_i\) to a cluster core (usually the closest).
[Graph to be inserted.]
2 Major Families:
- DB-Scan: the "cluster cores" form an upper-level set of \(\hat{f}\).
  Pick a \(\lambda\) and take \( \{X_i: \hat{f}(X_i) \ge \lambda \} \);
  the cluster cores are the "connected components" of this set,
  and every remaining \(X_i\) is assigned to the "closest" connected component
  of \( \{X_i: \hat{f}(X_i) \ge \lambda \} \).
  One way to get the connected components: neighborhood graphs.
  (A sketch follows below.)
- Mean-Shift: the "cluster cores" are the "modes" of \(\hat{f}\),
  i.e., the "local maxima" of \(\hat{f}\).
  Assign every \(X_i\) to the mode it reaches by "gradient ascent."
[ Graph to be inserted. ]
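As referenced in the DB-Scan bullet, a rough sketch of the level-set procedure described above; the density estimator, \(\lambda\), \(\epsilon\), and the bandwidth are all illustrative choices.

```python
# Level-set ("DB-Scan"-style) clustering sketch: KDE, upper-level set,
# connected components via an epsilon-graph, then nearest-core assignment.
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import KernelDensity, radius_neighbors_graph

def level_set_clusters(X, lam, eps, bandwidth=0.5):
    # 1. Estimate the density f with a kernel density estimate \hat{f}.
    kde = KernelDensity(bandwidth=bandwidth).fit(X)
    f_hat = np.exp(kde.score_samples(X))
    # 2. Cluster cores = connected components of {X_i : \hat{f}(X_i) >= lambda},
    #    where connectivity comes from an epsilon-neighborhood graph.
    core = f_hat >= lam
    A = radius_neighbors_graph(X[core], radius=eps, mode="connectivity")
    _, core_labels = connected_components(A, directed=False)
    labels = np.full(len(X), -1)
    labels[core] = core_labels
    # 3. Assign every remaining X_i to the closest cluster-core point.
    core_idx = np.where(core)[0]
    for i in np.where(~core)[0]:
        nearest = core_idx[np.argmin(np.linalg.norm(X[core] - X[i], axis=1))]
        labels[i] = labels[nearest]
    return labels
```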
[ Week 2 Ends HERE ]
Remarks
- About DB-Scan:
  - The choice of \(\lambda\) is somewhat arbitrary.
  - It does not care about the smoothness of \(\hat{f}\), only the "height" of each "hill."
  - The # of cluster cores depends on the choice of \(\lambda\). 🙃
- About Mean-Shift:
  - It needs \(\hat{f}\) to be smooth enough for gradient ascent; the estimated "distribution" may not be that smooth.
  - It is computationally expensive.
  - Mean-Shift can also be unstable due to fluctuations in \(\hat{f}\).
[ Graph to be inserted. ]
Density Estimation
(Intuition:)
For \(x \in \mathbb{R}\), suppose the underlying distribution has CDF \(F\).
Then, for almost all \(x\), \(f(x) = F'(x)\).
\[
(f(x) \approx \frac{F(x+h) - F(x-h)}{2h} \text{ for small }h)
\]
\[
\begin{align*}
f(x) & \approx \frac{P(X \in [x-h, x+h])}{2h} \\
& \approx \frac{ \frac{1}{n} \cdot \text{\# of points in } [x-h, x+h] }{ \text{length of } [x-h, x+h] }
\end{align*}
\]
For \(x \in \mathbb{R}^d\),
\[
f(x) \approx \frac{P(X \in B(x,h))}{\sigma(B(x,h))}
\]
where \(B(x,h) = \{ x': \| x' - x \| \le h \} \)
and \(\sigma(B(x,h)) = h^d \cdot \sigma_d \) is the volume of \(B(x,h)\) (\(\sigma_d\) = volume of the unit ball in \(\mathbb{R}^d\)).
A simple density estimate over data \( \{ X_i \}_1^n \)
\[
\begin{align*}
\hat{f}(x) & = \frac{1}{n} \cdot \frac{\text{\# of }X_i \text{'s in }B(x,h)}{\sigma(B(x,h))} \\
& = \frac{1}{n \cdot h^d} \frac{1}{\sigma_d} \sum_{i=1}^{n} \mathbb{I}\{ x_i \in B(x,h) \} \\
& = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} \frac{1}{\sigma_d} \mathbb{I}\{ \frac{\| x - x_i \|}{h} \le 1 \}
\end{align*}
\]
To summarize,
\[
\hat{f}(x) = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} K(\frac{x-x_i}{h}),
\]
where \(K(u) = \frac{1}{\sigma_d} \mathbb{I} \{ \| u \| \le 1 \} \).
\(\hat{f}\) is called a "kernel density estimate," where \(K\) is a "kernel" that
is usually itself a density, namely \(K \ge 0\) and \( \int K(u)\, du = 1 \).
Examples of Kernels
- Box Kernel:
  \( K(u) = \frac{1}{\sigma_d} \mathbb{I} \{ \| u \| \le 1 \} \), where \( \{ \| u \| \le 1 \} \) is the "unit ball."
- Gaussian Kernel:
  \(K(u) \propto \exp( - \frac{\|u\|^2}{2} ) \)
- Triangle Kernel:
  \( K(u) \propto (1 - \|u\|)_{+} \)
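For concreteness, a direct implementation of \(\hat{f}_h\) with either the box or the Gaussian kernel; the normalizing constants follow the definitions above, and the function name is illustrative.

```python
# Direct implementation of the kernel density estimate \hat{f}_h above.
import numpy as np
from scipy.special import gamma

def kde(x, X, h, kernel="gaussian"):
    """Evaluate \hat{f}_h at a single point x in R^d, given data X of shape (n, d)."""
    n, d = X.shape
    u = np.linalg.norm((x - X) / h, axis=1)            # ||(x - X_i)/h|| for each i
    if kernel == "box":
        sigma_d = np.pi ** (d / 2) / gamma(d / 2 + 1)  # volume of the unit ball in R^d
        K = (u <= 1) / sigma_d
    else:  # Gaussian kernel, normalized to integrate to 1
        K = np.exp(-u ** 2 / 2) / (2 * np.pi) ** (d / 2)
    return K.sum() / (n * h ** d)
```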
Question: Is \(\hat{f}\) a density?
\(
\hat{f}(x) = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} K(\frac{x-x_i}{h})
\)
for some kernel \(K\) with
\(
\int K(u)\, du = 1, \quad K \ge 0.
\)
- Substitute
  \(
  u = \frac{x-x_i}{h} \Rightarrow x = h u + x_i \Rightarrow dx = h^d du
  \)
- Thus,
\(
\int \hat{f}(x) dx
= \frac{1}{n \cdot h^d} \sum_{i=1}^{n} \int K(\frac{x-x_i}{h}) dx
= \frac{1}{n \cdot h^d} \sum_{i=1}^{n} h^d \int K(u) du
= \frac{1}{n} \sum_{i=1}^{n} 1 = 1
\)
- Hence, \(\hat{f}\) is a density as long as \(\int K(u)\, du = 1\) and \(K \ge 0\).
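A quick 1-d numeric sanity check of this fact (the data, bandwidth, and grid are arbitrary choices):

```python
# Numerically check (in 1-d) that \hat{f} integrates to 1, up to grid error.
import numpy as np

X = np.random.default_rng(4).normal(size=200)   # toy 1-d data
h = 0.3
xs = np.linspace(-6, 6, 2000)
# Gaussian-kernel KDE evaluated on the grid: K(u) = exp(-u^2/2) / sqrt(2*pi).
f_hat = np.exp(-((xs[:, None] - X[None, :]) / h) ** 2 / 2).sum(axis=1) \
        / (len(X) * h * np.sqrt(2 * np.pi))
print((f_hat * (xs[1] - xs[0])).sum())          # Riemann sum: approximately 1
```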
Choice of \(h\)
\(h\) is the bandwidth of the kernel estimate \(\hat{f}_h(x)\).
\(h\) plays a similar role to the "neighborhood" size in the neighborhood graphs.
- If \(h\) is too large, \(\hat{f}_h\) over-smooths.
- If \(h\) is too small, \(\hat{f}_h\) under-smooths.
[ Graph to be inserted. ]
Note:
Expand \( \|\hat{f}_h - f\|^2 = \int (\hat{f}_h(x) - f(x))^2\, dx = \|\hat{f}_h\|^2 + \|f\|^2 - 2 \int \hat{f}_h(x) f(x)\, dx \).
We want to minimize this over \(h\), but notice that computing \(\|\hat{f}_h\|^2\) is an expensive step.
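Since minimizing the \(L_2\) error directly is costly, a common practical alternative (not the criterion discussed above) is to pick \(h\) by cross-validated log-likelihood; a scikit-learn sketch, with the library choice and bandwidth grid as assumptions:

```python
# Bandwidth selection by cross-validated log-likelihood (a substitute criterion,
# not the L2 error above), using scikit-learn's KernelDensity and GridSearchCV.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

X = np.random.default_rng(3).normal(size=(500, 2))   # toy data

grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.logspace(-1.5, 0.5, 20)},
                    cv=5)                      # maximizes held-out log-likelihood
grid.fit(X)
print(grid.best_params_["bandwidth"])          # chosen h
```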
Mean-Shift
\( K(u) \propto \exp(- \frac{\|u\|^2}{2}) \)
(differentiable)
Recall:
- Start an ascent from each \(X_i\).
  [ Graph to be inserted. ]
- Gradient ascent: start at \(t=0\) from some \(x_0\) and iterate
  \(
  x_{t+1} = x_t + \eta_t \nabla \hat{f}(x_t),
  \)
  where \(\eta_t\) is a positive "step size."
\[
\hat{f}_h(x) = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} K(\frac{x-x_i}{h}),
\]
where \( K(\frac{x-x_i}{h}) = C_h \cdot \exp(- \frac{\|x-x_i\|^2}{2h^2}) \) for a normalizing constant \(C_h\).
Then,
\[
\begin{align*}
\nabla \hat{f}_h(x) & = C_{h,n} \sum_{i=1}^{n} \nabla K(\frac{x-x_i}{h})\\
& = C_{h,n}' \sum_{i=1}^{n} (x_i - x) \exp( - \frac{\|x-x_i\|^2}{2h^2} )
\end{align*}
\]
Ascent:
\[
\begin{align*}
x_{t+1} & = x_t + \eta_t \nabla \hat{f}(x_t)\\
x_{t+1} & = x_t + \eta_t C_{h,n}' \sum_{i=1}^n (x_i - x_t) K(\frac{x_t - x_i}{h})\\
& = x_t - x_t ( \eta_t \cdot C_{h,n}' \sum K(\frac{x_t - x_i}{h}) ) \\
& + \eta_t \cdot C_{h,n}' \sum x_i K(\frac{x_t - x_i}{h})
\end{align*}
\]
We set \( \eta_t \cdot C_{h,n}' \sum K(\frac{x_t - x_i}{h}) \) to 1, i.e., \( \eta_t = \frac{1}{C_{h,n}' \sum K(\frac{x_t - x_i}{h})} \).
We then have:
\[
x_{t+1} = \sum_i x_i\, w_h (\|x_i - x_t\|), \quad \text{with } \sum_i w_h(\|x_i - x_t\|) = 1,
\]
where \(\|x_i - x_t\|\) is the distance from \(x_i\) to \(x_t\).
We find that \( x_{t+1} = \sum_i x_i w_h (\|x_i - x_t\|) \) is a weighted average (a "mean") of the data points,
which is why the procedure is called "mean-shift."
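A minimal sketch of the resulting mean-shift iteration for a single starting point, using the Gaussian-kernel weights derived above; the bandwidth and stopping rule are illustrative choices.

```python
# Mean-shift update: each iterate is a kernel-weighted average of the data.
import numpy as np

def mean_shift_point(x0, X, h, n_iters=100, tol=1e-6):
    x = x0.copy()
    for _ in range(n_iters):
        # Gaussian-kernel weights w_h(||X_i - x||), normalized to sum to 1.
        w = np.exp(-np.linalg.norm(X - x, axis=1) ** 2 / (2 * h ** 2))
        w = w / w.sum()
        x_new = w @ X                   # x_{t+1}: weighted average of the X_i's
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x   # approximate mode reached by the ascent started at x0
```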
[ Graph to be inserted. ]