Wangsheng's World
Optimization in Machine Learning
Last Update: October 3, 2024
Overview
Classification or regression
  • Recall: data \( \{ (x_i, y_i) \}_{i=1}^n \) (often assumed i.i.d. from some distribution) satisfy \(y_i \approx f^*(x_i)\), for \(f^*\) unknown.
  • Risk: \( R(f) = \mathbb{E}_{x,y}\, l(f(x), y) \) for some loss \(l\) (e.g., \( \|f(x) - y\|^2 \))
  • Empirical Risk: \( \hat{R}(f) = \frac{1}{n} \sum_{i=1}^n l(f(x_i), y_i) \)
  • Posit some function class: Assume \(f^* \in \mathcal{F} \) (e.g., linear functions, polynomials, neural networks, ...)
  • Goal: \( \hat{f} = \argmin_{f \in \mathcal{F}} \hat{R}(f) \). (or even more sophisticated: minimize \( \hat{R}(f) + \mathrm{pen}(f) \) for some penalty \(\mathrm{pen}\))
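To make the definitions concrete, here is a minimal Python sketch of the empirical risk under squared loss; the synthetic data and the two candidate predictors are illustrative assumptions, not part of the notes above.

```python
import numpy as np

# Minimal sketch of the empirical risk under squared loss.
# The synthetic data and candidate predictors are illustrative assumptions.

def empirical_risk(f, X, y):
    """Average squared loss of predictor f over the sample {(x_i, y_i)}_{i=1}^n."""
    return np.mean([(f(x) - yi) ** 2 for x, yi in zip(X, y)])

rng = np.random.default_rng(0)
X = rng.normal(size=100)                  # x_i, i = 1, ..., n
y = 2.0 * X + 0.1 * rng.normal(size=100)  # y_i ~ f*(x_i) with f*(x) = 2x, plus noise

print(empirical_risk(lambda x: 2.0 * x, X, y))  # near the noise level
print(empirical_risk(lambda x: 0.0, X, y))      # much larger: a poor predictor
```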
Optimization
Goal:
For \(x \in D\), minimize \(F(x)\),
where \(D \equiv \text{domain of } x\), usually \(\mathbb{R}^d\) or a subset thereof.
Back to ML: over \(\theta \in \Theta\), minimize \(\hat{R}(\theta)\),
where \(\hat{R}(\theta) \equiv \hat{R}(f_\theta) \) for a parametrization \(\{f_\theta\}_{\theta \in \Theta}\) of the functions in \(\mathcal{F}\).
Example
e.g., \(\mathcal{F} \equiv \text{ (all linear functions) } = \{ f(x) = \theta^\top x, \theta \in \mathbb{R}^d \} \)
Then, we want to minimize \(\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^n \|y_i-\theta^\top x_i\|^2 \)
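A small numerical sketch of this least-squares objective follows; the synthetic data, the dimensions, and the use of a closed-form least-squares solve are illustrative choices, not prescribed above.

```python
import numpy as np

# Sketch: empirical risk over the linear class F = {f(x) = theta^T x}
# on synthetic data (n, d, theta_star, and the noise level are arbitrary choices).
rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + 0.1 * rng.normal(size=n)

def R_hat(theta):
    """Empirical risk: (1/n) * sum_i (y_i - theta^T x_i)^2."""
    return np.mean((y - X @ theta) ** 2)

# For squared loss the minimizer has a closed form (ordinary least squares);
# the procedure below minimizes such objectives iteratively instead.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(R_hat(theta_hat), R_hat(np.zeros(d)))  # fitted theta vs. theta = 0
```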
Optimization Procedure - Main Idea
  • Start at some \(x_0\) at \(t=0\).
  • At time \(t = 0, 1, 2, \dots\),
    • Approximate \(F\) at \(x_t\) by some polynomial of degree \(l\), \(P_l(x)\)
    • Find a descent direction \(v\) for \(P_l\)
    • Update: \(x_{t+1} = x_t + \eta \cdot v\), where \(\eta\) is the step size.
Example
Suppose: \(\tilde{x} = \argmin_x P_l(x)\) and \(v = \tilde{x} - x_t\).
\(\Rightarrow x_{t+1} = x_t + \eta(\tilde{x} - x_t)\).
  • If \(P_l = \) degree 1 polynomial \(\rightarrow\) Gradient Descent.
  • If \(P_l = \) degree 2 polynomial \(\rightarrow\) Newton's Method.
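Concretely, minimizing the degree-1 (linear) Taylor surrogate points along \(v = -\nabla F(x_t)\), while minimizing the degree-2 (quadratic) surrogate gives \(v = -[\nabla^2 F(x_t)]^{-1} \nabla F(x_t)\). The Python sketch below contrasts the two updates on an arbitrary positive-definite quadratic; the matrix, vector, step size, and iteration count are illustrative choices.

```python
import numpy as np

# Contrast the two surrogates on an illustrative quadratic
# F(x) = 0.5 * x^T A x - b^T x, with A positive definite (arbitrary choice).
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def F(x):
    return 0.5 * x @ A @ x - b @ x

def grad_F(x):
    return A @ x - b

def hess_F(x):
    return A  # the Hessian of a quadratic is constant

# Gradient descent: repeatedly step along the negative gradient (degree-1 surrogate).
x_gd = np.zeros(2)
eta = 0.2  # step size (illustrative)
for _ in range(50):
    x_gd = x_gd - eta * grad_F(x_gd)

# Newton's method: jump to the minimizer of the local quadratic (degree-2 surrogate,
# full step, i.e. eta = 1); on a quadratic F this reaches the minimum in one step.
x_nt = np.zeros(2)
x_nt = x_nt - np.linalg.solve(hess_F(x_nt), grad_F(x_nt))

x_star = np.linalg.solve(A, b)  # exact minimizer, for comparison
print(F(x_gd), F(x_nt), F(x_star))
```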
Newton's Method
Gradient Descent