NTU Machine Learning - Lec5 · Stay hungry. Stay foolish

30 Dec 2017

Machine Learning / NTU

上一講提到了在 hypothesis set $\mathcal{H}$ 為有限，且sample size $|\mathcal{D}|$ 足夠大的情況下，是 PAC-learnable 的。但當 $|\mathcal{H}| \, \rightarrow \, \infty$的時候呢？ (e.g 2D linear perceptron)

Recap of Feasibility of Learning

$E_o(g) \approx E_i(g)$ ?
$E_i(g) \rightarrow 0$ ?

Hoeffding Ineq. 對於再複雜的 ground truth function $f$ 及 hypothesis function $h$ 均是成立的，1. 據此必是正確，但這僅能 verify 我們得到的結果，還是沒有告訴我們在複雜的 $f$ (usually 不能只依靠一個 hypothesis candidate $h$，而是必須從 size 無限大的 hypothesis set $\mathcal{H}$ 去挑選)下，學習是否可行？

Some observation: Does the Union Bound too loose?

From 前一講最後，

\[ \Pr_D[\text{BAD} \, \text{for} \, \textcolor{orange}{\mathcal{H}}] \leq 2 \, M \, \exp(-2 \epsilon^2 N) \]

其中，$M$ 是來自 Union Bound 將各個 $\underset{D}{\Pr}\,[\text{BAD} \, \text{for} \, \textcolor{orange}{h_n}]$ 做加總，等號成立於事件彼此互斥。

事件彼此互斥？也就是說會讓 $h_n$ 壞掉的 $\mathcal{D}$ 各自不同，但想想這其實不合理。就拿 perceptron 來說好了，若 $h_1 \approx h_2$ ，幾何上來說，這兩個 hyperplane 其實是非常靠近的，對大部分的數據點，均會給出一樣的 perdiction ，所以會讓 $h_1$ 壞掉的 $\mathcal{D}$ ，也 probably 會讓 $h_2$ 壞掉。

Effective number of Hypotheses

首先，我們先將 $h$ 們做適當的分群，從數據點的角度來看，會 predict 出相同結果的 hypothesis set 視為相同的一群 (詳細範例可以看講義👀)，對 binary classification 來說，每一群又稱為一個二元分類器 (dichotomy)。
給定數據點 $\mathbf{x}_1 \sim \mathbf{x}_N$，我們可以定義不同的 dichotomy ， $\mathcal{H}(\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_N)$ 為這些 dichotomies 的集合(跟原先的 $\mathcal{H}$ 有何不同呢？可以想成若 $h_i, h_j$ 在 training data 上結果相同，則只會對應到當中的一個 dichotomy)。而我們想以這個集合的大小 (也就是分群數)，去取代原先的 $M$ 。

Growth Function $m_H$

可以發現，隨著 sample 的 $\mathcal{D}$ 不同，分群的結果也會不同， in order to generalization ，我們定義 growth function $m_H$ 為當中的 upper bound ，也是我們之後要拿來取代 $M$ 的量。

$$ m_H(N) = \underset{\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_N \in \mathcal{X}}{\max} |\mathcal{H}(\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_N)| $$

其中 $m_H(N) \leq 2^N$ (which is easy to see)

$m_H$ is POLY?

回到最上方的式子，隨著 sample size $N$ 的增長，其壓低壞事發生機率的能力是 EXP ，所以我們只要能證明 $m_H$ 為 POLY ，便可以 claim 說，在 $N$ 足夠大的情況下， PAC-learnable (even $\mathcal{H}$ is infinite)!

可以在講義看到幾個簡單的例子(👀 1D perceptron, 1D interval, 2D contour)，並求出其 growth function 的 analytical form，但很多時候要寫下這個 form 並不那麼簡單，然而，其實我們關心的只是它的 order，或者說他是否為 POLY 罷了。為了方便討論，以下做幾個定義。

Shatter

\[ \text{If} \: \mathcal{H} \: \text{can \textbf{shatter} N points} \Leftrightarrow m_H(N) = 2^N \]

解釋是說，所擁有的 dichotomies ，可以應付training data中所有可能的分群情況。
($\exists \, h $ s.t $h(\mathbf{x}_i) = f(\mathbf{x}_i) \; \forall i=1 \sim N$)，直覺來想，這代表了這個 Hypothesis set 很強，但同時也表示了需要搜索的數量很多，呈現在 growth function 為 EXP 。

※注意到 $m_H(N)$ 是指最大的 dichotomies set size，所以 $m_H(N)$只能說明存在至少一組 data 可被 shatter ，但不是所有的 data 都可以被 shatter。
那些退化的情況反而是不能被 shatter 的特例，像是 2 維平面上，點為 colinear 的話， 3 個點 linear perceptron 便無法處理了，又或者像是 convex set 中，點如果沒有落在圓上，也是無法被 shatter 的

Break Point

然而，很多時候我們對資料有些洞見，知道他滿足某種性質，不一定要找那種可以應付所有情況的 Hypothesis set，拿 Linear perceptron 所處理的對象為例，我們假設資料是線性可分(i.e 對線性不可分的資料無法 shatter)，那麼便不需要用到 convex set 這麼強的 Hypothesis set 了 (所謂殺雞焉用牛刀)

\[ \text{If no data set of size } \, k \, \text{can be \textbf{shattered} by} \, \mathcal{H}, \; k \, \text{is the \textbf{break point} for } \, \mathcal{H} \]

簡單來說就是， $\exists \; k \in \mathbb{N} \, s.t. \, m_H(k) < 2^{k}$ ,which is POLY
(當 $N \geq k$ 時，開始出現 $\mathcal{H}$ 無法處理，或者說不考慮的情況)

※承 shatter 的討論，這裡必須要確保 對所有 size 為 $k$ 的 data，都無法被 shatter (不存在只有特例才有的那些特性 )，才能說 $k$ 為 hypothesis set 的 break point 。
所以如果今天只知道某一組 data ，無法被 $\mathcal{H}$ 給 shatter ，是無法判定 break point 為多少的 (有可能是 $k \geq N$中，被造出的特例；亦有可能如同我們直覺想的，因為 $k < N$，所以任一組當然找不到)。

Reference

林軒田老師的講義

View times

NTU Machine Learning - Lec5

Training vs Testing