Wangsheng's World
Hyperparameter Choices in Unsupervised Learning
Last Update: October 3, 2024
Clustering Overview
Set up: \( \{ X_i \}_1^n \), some data \(\in \mathbb{R}^d\)
Goal: grouping "similar" data points.
  • Distance based: K-means, EM for Gaussian Mixture Models
  • Spectral based: in-between approach
  • Density based: DBSCAN, MeanShift, etc.
K-means: group similar points together.
\(\{X_i\}_1^n\), points \(\in \mathbb{R}^d\)
Number of clusters: \(K\) (perhaps stemming from the application).
Example:
  • document clustering
    • \(X \in \mathbb{R}^d \equiv document\)
    • \(X_{(i)}\)'s could be word frequencies
  • Image/Speech Segmentation
  • Recommendation System (e.g., grouping Netflix users by their watching history)
K-means Objectives:
\(\{X_i\}_1^n \in \mathbb{R}^d\)
Find a partition \(\{\xi_l\}_1^K\) of the \(\{X_i\}_1^n\), together with a set of "centers" (centroids) \(\{\mu_l\}_{l=1}^K\), one per "cluster", so that the total squared distance between data points and their assigned cluster center is minimized.
\[ \begin{equation*} \begin{split} \mathcal{R}( \{\xi_l\}, \{\mu_l\} ) & \equiv \sum_{l=1}^K \sum_{X_i \in \xi_l} \|X_i - \mu_l\|^2 \\ & \propto \sum_{l=1}^K \frac{1}{n} \sum_{X_i \in \xi_l} \|X_i - \mu_l\|^2 \\ & = \sum_{l=1}^K \frac{n_l}{n} \cdot \frac{1}{n_l} \sum_{X_i \in \xi_l} \| X_i - \mu_l\|^2 \end{split} \end{equation*} \]
where \( \mathcal{R}( \{\xi_l\}, \{\mu_l\} ) \) is the K-means objective, \( n_l \equiv \) # of points in \( \xi_l \), and \(\frac{1}{n_l} \sum_{X_i \in \xi_l} \| X_i - \mu_l\|^2\) is the average squared distance to the center within cluster \(\xi_l\).
Lloyd's Algorithm
  • Start with some choice of \( \{ \mu_l \} \)
  • At each time step:
    • \( \{\xi_l\} \leftarrow \) Assign all \( X_i\)'s to closest \(\mu_l\)
    • \( \{\mu_l\} \leftarrow \) Average of \( X_i\)'s in \(\xi_l\)
  • Repeat until "convergence"
(approximate optimization)
As the figure on the right suggests, the choice of starting points may influence the result; "K-means++" addresses this by considering more careful ways to choose the starting centers.
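Below is a minimal numpy sketch of Lloyd's algorithm (the function and variable names are mine, not from these notes); it also returns the K-means objective \(\mathcal{R}\) for the final assignment.

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Approximately minimize the K-means objective R({xi_l}, {mu_l})."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # start with K centers chosen uniformly at random among the data points
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: send each X_i to its closest center mu_l
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the average of the X_i's assigned to it
        new_mu = np.array([X[labels == l].mean(axis=0) if np.any(labels == l) else mu[l]
                           for l in range(K)])
        if np.allclose(new_mu, mu):
            break                                     # "converged"
        mu = new_mu
    # final assignment and objective value
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    objective = dists[np.arange(n), labels].sum()     # R({xi_l}, {mu_l})
    return labels, mu, objective
```

Since the result depends on the initialization, one typically reruns with several seeds (or uses a K-means++-style initialization) and keeps the run with the smallest objective.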
EM (Expectation Maximization) for Gaussian Mixture Models
Idea: model the clusters as a GMM \[ f(x) = \sum_{l=1}^K \Pi_l\, \mathcal{N}( x;\, \mu_l, \Sigma_l ) \] where \( \Sigma_l \) is the covariance matrix of component \(l\) and \( \sum_l \Pi_l = 1\).
\[ \mathcal{N}(x;\, \mu_l, \Sigma_l) \propto \exp\left(- \frac{1}{2} (x - \mu_l)^T \Sigma_l^{-1} (x - \mu_l)\right) \] \[ \Sigma_l = \mathbb{E}\left[(X - \mu_l)(X - \mu_l)^T\right] \quad \text{for } X \sim \mathcal{N}(\mu_l, \Sigma_l) \]
Lloyd's Algorithm for K-means can be viewed as an approximate EM for a GMM in which every component has covariance \( \Sigma_l = \sigma^{2} I_d \), i.e. \[ \mathcal{N}(\mu_l, \Sigma_l) = \mathcal{N}(\mu_l, \sigma^{2} I_d). \] EM for a GMM and K-means thus implicitly assume "ellipsoidal" (spherical, for K-means) cluster shapes.
Questions:
  • Which of K-means or EM is more general?
    EM is more general.
  • Why NOT always use EM?
    EM is computationally expensive: each iteration involves the very "expensive" step of inverting the covariance matrices, \( \Sigma_l^{-1} \).
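For contrast, here is a compact numpy sketch of EM for a GMM with full covariances (the names, initialization, and the small ridge added for numerical stability are my choices, not the notes'). Note the solve against \(\Sigma_l\) in the E-step, which is the expensive \(\Sigma_l^{-1}\) computation mentioned above.

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """EM for a Gaussian mixture with full covariance matrices (minimal sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                                      # mixing proportions Pi_l
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)    # component means mu_l
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities r_{il} proportional to Pi_l * N(X_i; mu_l, Sigma_l)
        logR = np.empty((n, K))
        for l in range(K):
            diff = X - mu[l]
            sol = np.linalg.solve(Sigma[l], diff.T).T       # the costly Sigma_l^{-1} step
            quad = (diff * sol).sum(axis=1)
            logdet = np.linalg.slogdet(Sigma[l])[1]
            logR[:, l] = np.log(pi[l]) - 0.5 * (quad + logdet + d * np.log(2 * np.pi))
        R = np.exp(logR - logR.max(axis=1, keepdims=True))
        R /= R.sum(axis=1, keepdims=True)
        # M-step: update Pi_l, mu_l, Sigma_l from the responsibilities
        Nl = R.sum(axis=0)
        pi = Nl / n
        mu = (R.T @ X) / Nl[:, None]
        for l in range(K):
            diff = X - mu[l]
            Sigma[l] = (R[:, l, None] * diff).T @ diff / Nl[l] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, R
```

Forcing \(\Sigma_l = \sigma^2 I_d\) and replacing the soft responsibilities with hard assignments recovers the K-means picture above.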
Idea: Reduce Dimensions?
  • PCA or SVD down to dimension \( \mathcal{O}(K) \), \( K \ll d\).
Example of PCA:
Projecting the data onto the top \(s\) "principal components"/eigenvectors of \(\hat{\Sigma} \equiv\) the sample covariance matrix. \[ \hat{\Sigma} = \frac{1}{n} \sum_i(X_i - \hat{\mu})(X_i - \hat{\mu})^T, \qquad \hat{\mu} = \frac{1}{n} \sum_i X_i \] Suppose we apply PCA with \(s = 1\) (\(K = 2\)). Recall the "variance" in direction \( v \in \mathbb{R}^d \), \( \| v \| = 1 \), is given by \( Var(X^Tv) = v^T \Sigma v \).
We hope that the clusters are preserved and that the "directions" of highest variance are aligned with the separation between the cluster means.
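A small sketch of this projection step (my own helper, assuming the data are stacked as an \(n \times d\) matrix): center the data, take the top \(s\) eigenvectors of \(\hat{\Sigma}\) (equivalently, the top right singular vectors of the centered data matrix), and project.

```python
import numpy as np

def pca_embed(X, s):
    """Project X (n x d) onto the top-s eigenvectors of the sample covariance."""
    mu_hat = X.mean(axis=0)
    Xc = X - mu_hat                        # center: X_i - mu_hat
    # eigenvectors of Sigma_hat = (1/n) Xc^T Xc = right singular vectors of Xc
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_s = Vt[:s].T                         # top-s principal directions (d x s)
    return Xc @ V_s                        # embedded points X_tilde in R^s
```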
Note:
Could dimensionality reduction also make clustering easier in terms of the "separation between clusters"?
Example (continued)
Mathematical Intuition:
Suppose: \( \Sigma_1 = \Sigma_2 = \sigma^2 I_d \), \( \mu_1 = - \mu_2 \), and each component comes in proportion \( \frac{1}{2} \).
\[ \begin{equation*} \begin{split} \Sigma = \mathbb{E}XX^T & = \frac{1}{2} \mathbb{E}_1XX^T + \frac{1}{2} \mathbb{E}_2XX^T \\ & = \frac{1}{2} ((\mathbb{E}_1XX^T - \mu_1\mu_1^T ) + \mu_1\mu_1^T + \\ & (\mathbb{E}_2XX^T - \mu_2\mu_2^T ) + \mu_2\mu_2^T) \\ & = \frac{1}{2} (\Sigma_1 + \mu_1\mu_1^T + \Sigma_2 + \mu_2\mu_2^T) \\ & = \frac{1}{2} (\sigma^2 I_d + \mu_1\mu_1^T + \sigma^2 I_d + \mu_1\mu_1^T) \\ & = \sigma^2 I_d + \mu_1\mu_1^T \end{split} \end{equation*} \] where \(X\) denotes a draw from the mixture (so \(\mathbb{E}X = 0\) and \(\Sigma = \mathbb{E}XX^T\)), \(\mathbb{E}_l\) is the expectation conditional on component \(l\), and the next-to-last step uses \(\mu_2\mu_2^T = \mu_1\mu_1^T\).
Highest Variance Direction:
\( \underset{\|v\|=1}{\max} v^T \Sigma v = \underset{\|v\|=1}{\max} \sigma^2v^Tv + v^T\mu_1\mu_1^Tv \) \( \Rightarrow v\) in the direction of \( (\mu_2 - \mu_1) \).
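A quick numerical check of this conclusion on synthetic data (the dimensions, separation, and noise level below are my own choices): for a balanced spherical mixture with means \(\pm\mu_1\), the top eigenvector of the sample covariance should be nearly parallel to \(\mu_1\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 20, 5000, 1.0
mu1 = np.zeros(d); mu1[0] = 3.0                  # mu_1 = -mu_2, separation along coordinate 1
labels = rng.integers(0, 2, size=n)              # each component with proportion 1/2
means = np.where(labels[:, None] == 0, mu1, -mu1)
X = means + sigma * rng.standard_normal((n, d))

Sigma_hat = np.cov(X.T, bias=True)               # ~ sigma^2 I_d + mu_1 mu_1^T
top_vec = np.linalg.eigh(Sigma_hat)[1][:, -1]    # eigenvector of the largest eigenvalue
print(abs(top_vec @ mu1) / np.linalg.norm(mu1))  # ~ 1, i.e. aligned with mu_1
```

The printed cosine should be close to 1, i.e. the highest-variance direction essentially coincides with the direction separating the two cluster means.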
[ Week 1 Ends HERE ]
[ Week 2 Starts HERE ]
Note: Embed the \( X_i \)'s \(\rightarrow \tilde{X}_i\), here via PCA.
Spectral Clustering
Requires the number of clusters \(K\).
Data: \(\{X_i\}_1^n \in \mathbb{R}^d \)
Def: neighborhood or "proximity" graph.
  • K-NN Graph: \(X_i \) - \(X_j\) if \(X_j\) is in the K-nearest neighbors of \(X_i\).
  • \(\epsilon\)-NN Graph: \(X_i \) - \(X_j\) if \( \|X_i - X_j \| \le \epsilon \)
  • Weighted Graph: Put weight \(w_{ij}\) between \(X_i\) & \(X_j\),
    \(w_{ij} \searrow 0\) for "far" \(X_i, X_j\)...
    e.g., \(w_{ij} \propto \exp(-\| X_i - X_j \|^2 / (2\sigma^2)) \) (see the sketch below)
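A minimal sketch of how these three proximity graphs could be built from pairwise distances (helper names are mine; for large \(n\) one would use a nearest-neighbor library instead of the full distance matrix).

```python
import numpy as np

def pairwise_dists(X):
    """Euclidean distance matrix for the rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(sq)

def knn_graph(X, k):
    """Symmetric K-NN adjacency: X_i ~ X_j if j is among i's k nearest neighbors."""
    D = pairwise_dists(X)
    np.fill_diagonal(D, np.inf)
    A = np.zeros_like(D)
    nbrs = np.argsort(D, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)              # symmetrize ("or" rule)

def eps_graph(X, eps):
    """epsilon-NN adjacency: X_i ~ X_j if ||X_i - X_j|| <= eps."""
    D = pairwise_dists(X)
    A = (D <= eps).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

def gaussian_graph(X, sigma):
    """Weighted graph: w_ij = exp(-||X_i - X_j||^2 / (2 sigma^2))."""
    D = pairwise_dists(X)
    W = np.exp(-D ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W
```

As noted just below, \(\epsilon\) (or \(\sigma\)) is often set to a small percentile of the interpoint distances computed by pairwise_dists.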
Neighborhood size is critical
e.g., \(\epsilon \sim \) percentile of interpoint distances
  • pick \(\epsilon \approx 0 \) (Wrong: the graph breaks into isolated points)
  • pick \(\epsilon \rightarrow \infty \) (Wrong: the graph becomes fully connected)
  • pick \(\epsilon\) "somewhere" in between
Important Graphical Objects:
  • Adjacency Matrix: \( A \in \mathbb{R}^{n \times n} \), \( A_{ij} = \mathbb{I}\{ X_i \sim X_j \} \) (i.e., 1 if \(X_i\) and \(X_j\) are connected, 0 otherwise.)
  • Degree Matrix: diagonal \( D \in \mathbb{R}^{n \times n} \), \(D_{ii} =\) "degree" of node \(i\) (# of edges), \(D_{ii} = \sum_j A_{ij}\)
  • Graph Laplacian: \( \mathcal{L} = D - A = \sum_i \lambda_i v_i v_i^T \)
Spectral Clustering Algorithm: build a neighborhood graph on the \(X_i\)'s, form \(\mathcal{L} = D - A\), take the bottom \(K\) eigenvectors \(v_1, \dotsc, v_K\) (smallest eigenvalues) as the columns of \(V \in \mathbb{R}^{n \times K}\), embed \(X_i \rightarrow \tilde{X}_i \equiv\) the \(i\)-th row of \(V\), and run K-means on the \(\tilde{X}_i\)'s.
Proposition:
\(\mathcal{L}\) is positive semi-definite. Moreover, if the graph has \(K\) connected components (CCs) \(\{\xi_l\}_1^K\) (the "ideal" case):
  • The eigenvalue 0 has multiplicity \(K\)
    (\( \iff \) Null space \( Null(\mathcal{L}) \) has dimension \(K\) )
  • every \(v \in Null(\mathcal{L}) \) is constant on each CC \(\xi_l, l = 1, \dotsm, K\).
    if \(X_i, X_j\) connected (in fact are in the same \(\xi_l\)), then \(v_{(i)} = v_{(j)} \).
Corollary:
  • If \(X_i\), \(X_j\) belong to same \(\xi_l\), then \( \tilde{X_i} = \tilde{X_j} \).
  • If \(X_i\), \(X_j\) belong to different \(\xi\)'s (CC's), \( \tilde{X_i} \ne \tilde{X_j} \).
    (There are \(K\) distinct rows in \(V\); row rank \(=\) column rank \(=\) \(K\))
Proof of Proposition:
\[ \begin{align*} v^T \mathcal{L} v & \equiv v^T D v - v^T A v\\ & = \sum_{ij} A_{ij} v_{i}^2 - \sum_{ij} A_{ij} v_{i} v_{j} \end{align*} \]
Notice that \( \sum_{ij} A_{ij} v_{i}^2 = \sum_{ij} A_{ij} v_{j}^2 \).
Also \(A_{ij} \ge 0\), then
\[ \begin{align*} 2 v^T \mathcal{L} v & = \sum_{ij} A_{ij} (v_{i}^2 + v_{j}^2) - 2 \sum_{ij} A_{ij} v_{i} v_{j}\\ & = \sum_{ij} A_{ij} (v_i - v_j)^2 \\ & \ge 0 \end{align*} \]
\( v^T \mathcal{L} v = 0 \) iff \(v_i = v_j\) whenever \(A_{ij} \ne 0\) (i.e., whenever \(X_i\) & \(X_j\) are connected).
\(\Rightarrow v \) is constant on each Connected Component.
Finally, verify that the vectors \( v_l \equiv \mathbb{I}\{\text{on } \xi_l\} \), \(l = 1, \dotsc, K\), are orthogonal, and that every \(v \in Null(\mathcal{L}) \) can be written as \( \sum_l \alpha_l v_l \).
\(\blacksquare\)
Non-Ideal Case:
[Graph to be inserted.]
We build \(G' \approx \text{ideal } G\).
\( \mathcal{L}' \approx \mathcal{L} \Rightarrow\) Bottom \(K\) eigenvectors of \(\mathcal{L}'\) would be "approximately" constant on clusters.
"Matrix Perturbation Theory"
Density Based Clustering
Intuition:
View "clusters" as high-density regions.
A Usual Procedure:
\(\{X_i\}_{i=1}^n\)
  • Estimate the density \(f\) by some \(\hat{f}\) (here \(f\) is the density of the distribution \(P\) on \(\mathbb{R}^d\) from which the data are drawn).
  • Estimate "cluster cores" as regions of high density.
  • Assign every \(X_i\) to a cluster core. (usually the closest)
[Graph to be inserted.]
2-Major Families:
  • DB-Scan: the "cluster core" is an upper-level set of \(\hat{f}\) (see the sketch after this list).
    So pick a \(\lambda\), \( \{X_i: \hat{f}(X_i) \ge \lambda \} \)
    "connected components" of \( \{X_i: \hat{f}(X_i) \ge \lambda \} \)
    & assign every \(X_i\)'s \(\rightarrow \) "closest" connected component of \( \{X_i: \hat{f}(X_i) \ge \lambda \} \).
    One-way: Neighborhood graphs.
  • Mean-Shift: "Cluster-core" are "mode" of \(\hat{f}\).
    "Local maxima" of \(\hat{f}\).
    Assign every \(X_i\) to the mode it reaches by "gradient ascent."
[ Graph to be inserted. ]
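Here is a rough sketch of the DB-Scan-style procedure above, written as a level-set method on a box-kernel density estimate (the parameterization, the quantile rule for \(\lambda\), and the helper name are mine; this illustrates the idea rather than the textbook DBSCAN algorithm).

```python
import numpy as np

def level_set_clusters(X, h, density_quantile=0.5):
    """Cluster cores = connected components of {X_i : f_hat(X_i) >= lambda}."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    f_hat = (D <= h).sum(axis=1) / n              # box-kernel density (up to constants)
    lam = np.quantile(f_hat, density_quantile)    # the choice of lambda is arbitrary
    core = f_hat >= lam
    # connected components of the h-neighborhood graph restricted to core points
    labels = np.full(n, -1)
    current = 0
    for i in np.where(core)[0]:
        if labels[i] != -1:
            continue
        stack = [i]
        labels[i] = current
        while stack:                              # depth-first search over core points
            j = stack.pop()
            for k in np.where(core & (D[j] <= h) & (labels == -1))[0]:
                labels[k] = current
                stack.append(k)
        current += 1
    # assign every remaining point to the component of its closest core point
    core_idx = np.where(core)[0]
    for i in np.where(~core)[0]:
        labels[i] = labels[core_idx[np.argmin(D[i, core_idx])]]
    return labels
```

A mean-shift sketch appears at the end of these notes, after the kernel density estimation material.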
[ Week 2 Ends HERE ]
Remarks
  • about DB-Scan
    • Choice of \(\lambda\) is somewhat arbitrary
    • Does not care about smoothness, but the "height" of each "hill"
    • # of cluster cores depends on choice of \(\lambda\) 🙃
  • about Mean-Shift
    • "Distribution" NOT that smooth
    • Expensive
    • Mean-Shift can also be unstable due to fluctuations in \(\hat{f}\)
    [ Graph to be inserted. ]
Density Estimation
(Intuition:)
For \(x \in \mathbb{R}\), suppose \(X\) has CDF \(F\). Then, for almost all \(x\), \(f(x) = F'(x)\).
\[ (f(x) \approx \frac{F(x+h) - F(x-h)}{2h} \text{ for small }h) \] \[ \begin{align*} f(x) & \approx \frac{P(X \in [x-h, x+h])}{2h} \\ & \approx \frac{1}{n} \cdot \frac{ \text{\# of points in } [x-h, x+h] }{ \text{length of } [x-h, x+h] } \end{align*} \]
For \(x \in \mathbb{R}^d\),
\[ f(x) \approx \frac{P(X \in B(x,h))}{\sigma(B(x,h))} \] where \(B(x,h) = \{ x': \| x' - x \| \le h \} \) and \(\sigma(B(x,h)) = h^d \cdot \sigma_d \) is the volume of \(B(x,h)\), with \(\sigma_d\) the volume of the unit ball.
A simple density estimate over data \( \{ X_i \}_1^n \)
\[ \begin{align*} \hat{f}(x) & = \frac{1}{n} \cdot \frac{\text{\# of }X_i \text{'s in }B(x,h)}{\sigma(B(x,h))} \\ & = \frac{1}{n \cdot h^d} \frac{1}{\sigma_d} \sum_{i=1}^{n} \mathbb{I}\{ x_i \in B(x,h) \} \\ & = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} \frac{1}{\sigma_d} \mathbb{I}\{ \frac{\| x - x_i \|}{h} \le 1 \} \end{align*} \]
To summarize,
\[ \hat{f}(x) = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} K(\frac{x-x_i}{h}), \]
where \(K(u) = \frac{1}{\sigma_d} \mathbb{I} \{ \| u \| \le 1 \} \).
\(\hat{f}\) is called the "kernel density estimate," where \(K\) is a "kernel" that usually is itself a density; namely, \(K \ge 0\) and \( \int K(u)\, du = 1 \).
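A minimal numpy sketch of evaluating this kernel density estimate at a set of query points (function and argument names are mine; kernel can be any of the kernels listed next).

```python
import numpy as np

def kde(x_query, X, h, kernel):
    """f_hat(x) = 1/(n h^d) * sum_i K((x - X_i)/h), evaluated at each row of x_query."""
    n, d = X.shape
    # u[q, i, :] = (x_query[q] - X[i]) / h
    u = (x_query[:, None, :] - X[None, :, :]) / h
    return kernel(u).sum(axis=1) / (n * h ** d)
```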
Examples of Kernels
  • Box Kernel
    \( K(u) = \frac{1}{\sigma_d} \mathbb{I} \{ \| u \| \le 1 \} \), where \( \{\| u \| \le 1\} \) is the "unit ball" and \(\sigma_d\) its volume.
  • Gaussian Kernel
    \(K(u) \propto \exp( - \frac{\|u\|^2}{2} ) \)
  • Triangle Kernel
    \( K(u) \propto (1 - \|u\|)_{+} \)
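For concreteness, these kernels could be written as follows (a sketch; unit_ball_volume is my own helper, and the triangle kernel's normalizing constant is omitted).

```python
import numpy as np
from math import gamma, pi

def unit_ball_volume(d):
    """sigma_d: volume of the unit ball in R^d."""
    return pi ** (d / 2) / gamma(d / 2 + 1)

def box_kernel(u):
    # K(u) = (1/sigma_d) * 1{||u|| <= 1}
    d = u.shape[-1]
    return (np.linalg.norm(u, axis=-1) <= 1) / unit_ball_volume(d)

def gaussian_kernel(u):
    # K(u) ∝ exp(-||u||^2 / 2), normalized so that it integrates to 1 over R^d
    d = u.shape[-1]
    return np.exp(-0.5 * (u ** 2).sum(axis=-1)) / (2 * pi) ** (d / 2)

def triangle_kernel(u):
    # K(u) ∝ (1 - ||u||)_+  (normalizing constant omitted in this sketch)
    return np.clip(1 - np.linalg.norm(u, axis=-1), 0, None)
```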
Question: Is \(\hat{f}\) a density?
\( \hat{f}(x) = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} K(\frac{x-x_i}{h}) \) for some \( \int K(u) du = 1, K \ge 0 \)
  • \( \frac{x-x_i}{h} = u \Rightarrow x = h u + x_i \Rightarrow dx = h^d du \)
  • Thus, \( \int \hat{f}(x) dx = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} \int K(\frac{x-x_i}{h}) dx = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} h^d \int K(u) du = \frac{1}{n} \sum_{i=1}^{n} 1 = 1 \)
  • Hence, \(\hat{f}\) is a density as long as \(\int K(u) du = 1, K \ge 0\)
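A quick one-dimensional numerical check of this fact, reusing the kde and gaussian_kernel sketches above (the toy data and grid are my own choices; the grid-based integral is only approximate).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))             # toy 1-d data, n = 200
grid = np.linspace(-8, 8, 4001)[:, None]      # dense grid covering the data
vals = kde(grid, X, h=0.3, kernel=gaussian_kernel)
print(np.trapz(vals, grid[:, 0]))             # integral of f_hat over the grid: ~ 1.0
```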
Choice of \(h\)
\(h\) is the bandwidth of the kernel density estimate \(\hat{f}_h(x)\).
\(h\) plays a similar role as "neighborhood" size in the "neighborhood" graphs.
  • If \(h\) is too large, \(\hat{f}_h\) over-smooths.
  • If \(h\) is too small, \(\hat{f}_h\) under-smooths.
[ Graph to be inserted. ]
Note:
Expand \( \|\hat{f}_h - f\|^2 = \int (\hat{f}_h(x) - f(x))^2 dx = \|\hat{f}_h\|^2 + \|f\|^2 - 2 \int \hat{f}_h(x) f(x) dx \).
We want to minimize this over \(h\). The \(\|f\|^2\) term does not depend on \(h\) and can be dropped, but notice that computing \(\|\hat{f}_h\|^2\) is an expensive step.
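One standard way to act on this, which the notes do not spell out (so treat the details as my assumption): estimate the cross term \(\int \hat{f}_h f\) by leave-one-out, giving the least-squares cross-validation criterion \( \mathrm{CV}(h) = \|\hat{f}_h\|^2 - \frac{2}{n}\sum_i \hat{f}_{h,-i}(X_i) \), minimized over a grid of candidate bandwidths. The sketch below (1-d, reusing kde and gaussian_kernel from above) approximates the expensive \(\|\hat{f}_h\|^2\) term numerically on a grid.

```python
import numpy as np

def lscv_score(X, h, grid):
    """CV(h) = ||f_hat_h||^2 - (2/n) * sum_i f_hat_{h,-i}(X_i)   (1-d sketch)."""
    n = len(X)
    vals = kde(grid, X, h, gaussian_kernel)
    norm_sq = np.trapz(vals ** 2, grid[:, 0])        # the expensive ||f_hat_h||^2 term
    # leave-one-out estimates f_hat_{h,-i}(X_i)
    loo = np.array([
        kde(X[i:i + 1], np.delete(X, i, axis=0), h, gaussian_kernel)[0]
        for i in range(n)
    ])
    return norm_sq - 2.0 * loo.mean()

# pick h by minimizing CV(h) over a grid of candidate bandwidths
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
grid = np.linspace(-8, 8, 2001)[:, None]
hs = np.linspace(0.05, 1.5, 30)
h_best = hs[np.argmin([lscv_score(X, h, grid) for h in hs])]
```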
Mean-Shift
\( K(u) \propto \exp(- \frac{\|u\|^2}{2}) \) (differentiable)
Recall:
  • Start an ascent from each \(X_i\).
[ Graph to be inserted. ]
  • gradient ascent: start at \(t=0\) from some \(x_0\).
    \( x_{t+1} = x_t + \eta_t \nabla \hat{f}(x_t) \) , where \(\eta_t\) is positive "step size."
\[ \hat{f}_h(x) = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} K(\frac{x-x_i}{h}), \] where \( K(\frac{x-x_i}{h}) = C_h \cdot \exp(- \frac{\|x-x_i\|^2}{2h^2}) \)
Then, \[ \begin{align*} \nabla \hat{f}_h(x) & = C_{h,n} \sum_{i=1}^{n} \nabla K(\frac{x-x_i}{h})\\ & = C_{h,n}' \sum_{i=1}^{n} (x_i - x) \exp( - \frac{\|x-x_i\|^2}{2h^2} ) \end{align*} \]
Ascent: \[ \begin{align*} x_{t+1} & = x_t + \eta_t \nabla \hat{f}(x_t)\\ & = x_t + \eta_t C_{h,n}' \sum_{i=1}^n (x_i - x_t) K(\tfrac{x_t - x_i}{h})\\ & = x_t - x_t \Big( \eta_t \cdot C_{h,n}' \sum_i K(\tfrac{x_t - x_i}{h}) \Big) \\ & \quad + \eta_t \cdot C_{h,n}' \sum_i x_i K(\tfrac{x_t - x_i}{h}) \end{align*} \] We set \( \eta_t \cdot C_{h,n}' \sum_i K(\tfrac{x_t - x_i}{h}) \) to 1, i.e., \( \eta_t = \frac{1}{C_{h,n}' \sum_i K(\frac{x_t - x_i}{h})} \).
We then have: \[ x_{t+1} = \sum_i x_i\, w_h (\|x_i - x_t\|), \qquad \text{with } \sum_i w_h(\|x_i - x_t\|) = 1, \] where \(\|x_i - x_t\|\) is the distance from \(x_i\) to \(x_t\).
So \( x_{t+1} \) is a weighted average (a "mean") of the data points, with weights depending on the distance to \(x_t\); that is why the procedure is called "mean-shift."
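A minimal sketch of these mean-shift iterates with a Gaussian kernel, run from every data point at once (the function name, vectorization, and stopping rule are mine).

```python
import numpy as np

def mean_shift(X, h, n_iter=200, tol=1e-6):
    """Iterate x_{t+1} = sum_i X_i w_h(||X_i - x_t||), starting from each X_i."""
    modes = X.astype(float).copy()
    for _ in range(n_iter):
        # Gaussian weights w_h, one row per current point, normalized to sum to 1
        sq = ((modes[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        W = np.exp(-sq / (2 * h ** 2))
        W /= W.sum(axis=1, keepdims=True)
        new_modes = W @ X                  # the "mean shift": weighted average of the data
        if np.max(np.abs(new_modes - modes)) < tol:
            break
        modes = new_modes
    return modes    # points in the same basin of attraction end up near the same mode
```

Points whose final iterates (nearly) coincide are then grouped into the same cluster.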
[ Graph to be inserted. ]
[ Part 1 ends here. ]