Wangsheng's World
Optimization in Machine Learning
Last Update: October 3, 2024
Overview
Classification or regression
  • Recall: data \( \{ (x_i, y_i) \}_{i=1}^n \) (often assumed i.i.d. from some distribution) satisfy \(y_i \approx f^*(x_i)\), for \(f^*\) unknown.
  • Risk: \( R(f) = \mathbb{E}_{x,y}\, l(f(x), y) \) for some loss \(l\) (e.g., \( \|f(x) - y\|^2 \))
  • Empirical Risk: \( \hat{R}(f) = \frac{1}{n} \sum_{i=1}^n l(f(x_i), y_i) \)
  • Posit some function class: Assume \(f^* \in \mathcal{F} \) (e.g., linear functions, polynomials, neural networks, ...)
  • Goal: \( \hat{f} = \argmin_{f \in \mathcal{F}} \hat{R}(f) \). (or even more sophisticated: minimize \( \hat{R}(f) + \mathrm{pen}(f) \) for some penalty \(\mathrm{pen}\))
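To make the definitions concrete, here is a minimal Python sketch of the empirical risk under squared loss; the synthetic data and the two candidate predictors are illustrative assumptions, not part of the notes above.

```python
import numpy as np

# Minimal sketch of the empirical risk under squared loss.
# The synthetic data and candidate predictors are illustrative assumptions.

def empirical_risk(f, X, y):
    """Average squared loss of predictor f over the sample {(x_i, y_i)}_{i=1}^n."""
    return np.mean([(f(x) - yi) ** 2 for x, yi in zip(X, y)])

rng = np.random.default_rng(0)
X = rng.normal(size=100)                  # x_i, i = 1, ..., n
y = 2.0 * X + 0.1 * rng.normal(size=100)  # y_i ~ f*(x_i) with f*(x) = 2x, plus noise

print(empirical_risk(lambda x: 2.0 * x, X, y))  # near the noise level
print(empirical_risk(lambda x: 0.0, X, y))      # much larger: a poor predictor
```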
Optimization
Goal:
For \(x \in D\), minimize \(F(x)\),
where \(D \equiv \text{domain of } x\), usually \(\mathbb{R}^d\) or a subset thereof.
Back to ML: over \(\theta \in \Theta\), minimize \(\hat{R}(\theta)\),
where \(\hat{R}(\theta) \equiv \hat{R}(f_\theta) \) for a parametrization \(\{f_\theta\}_{\theta \in \Theta}\) of the functions in \(\mathcal{F}\).
Example
e.g., \(\mathcal{F} \equiv \text{ (all linear functions) } = \{ f(x) = \theta^\top x, \theta \in \mathbb{R}^d \} \)
Then, we want to minimize \(\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^n \|y_i-\theta^\top x_i\|^2 \)
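A small numerical sketch of this least-squares objective follows; the synthetic data, the dimensions, and the use of a closed-form least-squares solve are illustrative choices, not prescribed above.

```python
import numpy as np

# Sketch: empirical risk over the linear class F = {f(x) = theta^T x}
# on synthetic data (n, d, theta_star, and the noise level are arbitrary choices).
rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + 0.1 * rng.normal(size=n)

def R_hat(theta):
    """Empirical risk: (1/n) * sum_i (y_i - theta^T x_i)^2."""
    return np.mean((y - X @ theta) ** 2)

# For squared loss the minimizer has a closed form (ordinary least squares);
# the procedure below minimizes such objectives iteratively instead.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(R_hat(theta_hat), R_hat(np.zeros(d)))  # fitted theta vs. theta = 0
```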
Optimization Procedure - Main Idea
  • Start at some \(x_0\) at \(t=0\).
  • At time \(t = 0, 1, 2, \dots\),
    • Approximate \(F\) at \(x_t\) by some polynomial of degree \(l\), \(P_l(x)\)
    • Find a descent direction \(v\) for \(P_l\)
    • Update: \(x_{t+1} = x_t + \eta \cdot v\), where \(\eta\) is the step size.
Example
Suppose: \(\tilde{x} = \argmin_x P_l(x)\) and \(v = \tilde{x} - x_t\).
\(\Rightarrow x_{t+1} = x_t + \eta(\tilde{x} - x_t)\).
  • If \(P_l = \) degree 1 polynomial \(\rightarrow\) Gradient Descent.
  • If \(P_l = \) degree 2 polynomial \(\rightarrow\) Newton's Method.
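Concretely, minimizing the degree-1 (linear) Taylor surrogate points along \(v = -\nabla F(x_t)\), while minimizing the degree-2 (quadratic) surrogate gives \(v = -[\nabla^2 F(x_t)]^{-1} \nabla F(x_t)\). The Python sketch below contrasts the two updates on an arbitrary positive-definite quadratic; the matrix, vector, step size, and iteration count are illustrative choices.

```python
import numpy as np

# Contrast the two surrogates on an illustrative quadratic
# F(x) = 0.5 * x^T A x - b^T x, with A positive definite (arbitrary choice).
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def F(x):
    return 0.5 * x @ A @ x - b @ x

def grad_F(x):
    return A @ x - b

def hess_F(x):
    return A  # the Hessian of a quadratic is constant

# Gradient descent: repeatedly step along the negative gradient (degree-1 surrogate).
x_gd = np.zeros(2)
eta = 0.2  # step size (illustrative)
for _ in range(50):
    x_gd = x_gd - eta * grad_F(x_gd)

# Newton's method: jump to the minimizer of the local quadratic (degree-2 surrogate,
# full step, i.e. eta = 1); on a quadratic F this reaches the minimum in one step.
x_nt = np.zeros(2)
x_nt = x_nt - np.linalg.solve(hess_F(x_nt), grad_F(x_nt))

x_star = np.linalg.solve(A, b)  # exact minimizer, for comparison
print(F(x_gd), F(x_nt), F(x_star))
```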
Newton's Method
Gradient Descent