ML basics - Decision Theory & Linear Regression
Deep Learning Model's Outcome is the Probability of the Variable X
Decision Theory
Given a new value \(\mathbf x\), making an optimal decision (e.g., classification) based on the probability model \(p(\mathbf x,\mathbf t)\).
Inference step: determining the joint distribution \(p(\mathbf x, C_{k})\) (equivalently \(p(C_{k}\mid\mathbf x)\)). With this alone, everything else can be done.
Decision step: given the probabilities of each situation, how do we make the optimal decision? Once the inference step is done, the decision step is very easy.
Example: cancer diagnosis from X-ray images
- \(\mathbf x\): an X-ray image
- \(C_{1}\): the case of cancer
- \(C_{2}\): the case of no cancer
We want to know the value of \(p(C_{k}\mid\mathbf x)\).
Intuitively, choosing the \(k\) that maximizes \(p(C_{k}\mid\mathbf x)\) is a good decision.
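As a minimal sketch of this intuition (the posterior values below are made up for illustration):

```python
# Hypothetical posteriors p(C_k | x) for one X-ray image (made-up numbers).
posteriors = {"cancer": 0.7, "no_cancer": 0.3}

# The intuitive decision: pick the class k with the largest posterior.
decision = max(posteriors, key=posteriors.get)
print(decision)  # cancer
```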
Binary Classification
Decision Region
\(\mathcal{R}_{i} = \{x:pred(x) = C_{i}\}\)
If \(x\) is assigned to (i.e., classified as) class \(C_{i}\), then \(x\) belongs to \(\mathcal{R}_{i}\).
Each \(\mathcal{R}_{i}\) can be viewed as a set of \(x\) values:
the set of all \(x\) assigned to class \(i\).
Prob of Misclassification
\(\begin{align} p(mis) &= p(x \in \mathcal{R}_{1}, C_{2}) + p(x \in \mathcal{R}_{2}, C_{1}) \\\\ &= \int_{\mathcal{R}_{1}}\ p(x, C_{2})dx + \int_{\mathcal{R}_{2}}\ p(x, C_{1})dx \\\\ \end{align}\)
\(p(x \in \mathcal{R}_{1}, C_{2})\): the probability that \(x\) is classified as class 1 but actually belongs to class 2
\(p(x \in \mathcal{R}_{2}, C_{1})\): the probability that \(x\) is classified as class 2 but actually belongs to class 1
Written as integrals, this is \(\int_{\mathcal{R}_{1}}\ p(x, C_{2})dx + \int_{\mathcal{R}_{2}}\ p(x, C_{1})dx\).
The area \(\int_{\mathcal{R}_{1}}\ p(x, C_{2})dx\) corresponds to the regions shaded in red and green in the graph.
The area \(\int_{\mathcal{R}_{2}}\ p(x, C_{1})dx\) corresponds to the region shaded in purple in the graph.
In the end, the misclassification probability is the total shaded area.
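The shaded-area picture can be checked numerically; the following is a minimal sketch assuming two 1-D Gaussian class densities with equal priors (all values made up):

```python
import numpy as np

def joint(x, mean, std=1.0, prior=0.5):
    """p(x, C_k) = p(x | C_k) p(C_k) with a Gaussian class-conditional."""
    return prior * np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
p1 = joint(xs, mean=-1.0)  # p(x, C_1)
p2 = joint(xs, mean=2.0)   # p(x, C_2)

def p_mistake(x_hat):
    """Total shaded area for boundary x_hat: R_1 = (-inf, x_hat], R_2 = (x_hat, inf)."""
    in_r1 = xs <= x_hat
    return np.sum(p2[in_r1]) * dx + np.sum(p1[~in_r1]) * dx

# The error is smallest at the crossing point p(x, C_1) = p(x, C_2),
# which here is x_0 = 0.5 by symmetry.
print(p_mistake(0.5) < p_mistake(-0.5), p_mistake(0.5) < p_mistake(1.5))  # True True
```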
Minimize Misclassification
If \(\hat x\) moves to the left,
some of the error-producing regions change while others do not:
the red region shrinks as \(\hat x\) moves left, and the remaining regions stay the same.
Minimizing the red region therefore minimizes the total error area.
When \(\hat x\) takes the value \(x_{0}\), the red region disappears entirely and the error is minimized.
To minimize the error, \(x\) should be assigned to \(\mathcal{R}_{1}\) whenever \(p(x, C_{1}) > p(x, C_{2})\).
Conversely, when \(p(x, C_{1}) < p(x, C_{2})\), assigning \(x\) to \(C_{2}\) rather than \(C_{1}\) minimizes the error.
\[\begin{align} p(x, C_{1}) > p(x, C_{2}) &\Leftrightarrow p(C_{1}\mid x)p(x) > p(C_{2}\mid x)p(x) \\\\ &\Leftrightarrow p(C_{1}\mid x) > p(C_{2}\mid x) \\\\ \end{align}\]
Multiclass
In the multiclass case, it is better to focus on the probability of being correct rather than on the error.
\[\begin{align} p(correct) &= \sum_{k=1}^{K}p(\mathbf x \in \mathcal{R}_{k}, C_{k}) \\\\ &= \sum_{k=1}^{K}\int_{\mathcal{R}_{k}}p(\mathbf x, C_{k})dx \\\\ \end{align}\] \[pred(x) = \arg\max_{k}p(C_{k}\mid x)\]
Objective of Decision Theory (Classification)
Given the joint distribution \(p(\mathbf x, C_{k})\), find the optimal decision regions \(\mathcal{R}_{1},...,\mathcal{R}_{K}\).
Let \(\hat C(\mathbf x)\) be the function that, given \(\mathbf x\), returns the predicted class (one of \(1,...,K\)).
Equivalently: given the joint distribution \(p(\mathbf x, C_{k})\), find the optimal function \(\hat C(\mathbf x)\).
By what criterion is a function "optimal"?
Minimizing the Expected Loss
We minimized the misclassification error above, but extending this a little, we can work with an expectation instead.
Not every decision carries the same risk:
- diagnosing cancer when there is none
- diagnosing no cancer when there is cancer (risky)
Loss Matrix
- \(L_{kj}\): the loss incurred when an \(\mathbf x\) belonging to \(C_{k}\) is classified as \(C_{j}\)
Rows are the true classes; columns are the predicted classes.
Remember that all information about the data is expressed through the probability distribution: the samples we observe are regarded as generated from that distribution.
Therefore, given a loss matrix \(L\), we can set as our goal minimizing the following expected loss.
\[\mathbb E[L] = \sum_{k}\sum_{j}\int_{\mathcal{R}_{j}}\ L_{kj}p(\mathbf x,C_{k})d\mathbf x\]
Minimizing the expected loss
\(\mathbb E[L] = \sum_{k}\sum_{j}\int_{\mathcal{R}_{j}}\ L_{kj}p(\mathbf x,C_{k})d\mathbf x\)
Let \(\hat C(\mathbf x)\) be the function that, given \(\mathbf x\), returns the predicted class (one of \(1,...,K\)):
\[\mathbf x \in \mathcal{R}_{j} \Leftrightarrow \hat C(\mathbf x) = j\]
Then the expected loss \(\mathbb E[L]\) above can be rewritten as follows.
- In the expected loss above, the fixed index \(j\) is replaced by \(\hat C(\mathbf x)\), so \(L_{kj}\) becomes \(L_{k\hat C(\mathbf x)}\), and
- using the product rule, the joint \(p(\mathbf x, C_{k})\) is rewritten as the conditional \(p(C_{k}\mid \mathbf x)\) times the marginal \(p(\mathbf x)\):
\[\mathbb E[L] = \int_{\mathcal X}\left( \sum_{k=1}^{K} L_{k\hat C(\mathbf x)}p(C_{k}\mid \mathbf x) \right)p(\mathbf x)d\mathbf x\]
Expressed this way, \(\mathbb E[L]\) is a functional of \(\hat C(\mathbf x)\), and we want the function \(\hat C(\mathbf x)\) that minimizes this functional.
Thinking about minimizing only the inner sum:
since \(p(\mathbf x) > 0\),
minimizing \(\sum_{k=1}^{K}L_{k\hat C(\mathbf x)}p(C_{k}\mid \mathbf x)\) separately for each \(\mathbf x\)
minimizes the total as a whole.
\[\hat C(\mathbf x) = \arg\min_{j}\sum_{k=1}^{K}\ L_{kj}p(C_{k}\mid\mathbf x)\]
Functional: we usually picture a function as an imaginary box that outputs a number when a number is put in. Similarly, if we picture a functional as a box, it is a box into which a function is put instead of a number, and which outputs a number as the result.
- number -> function f -> number
- function -> functional F -> number
- It can be thought of as a function of a function.
- That is, the value of \(\mathbb E[L]\) changes depending on \(\hat C(\mathbf x)\).
: the \(j\) for which \(\sum_{k=1}^{K}\ L_{kj}p(C_{k}\mid\mathbf x)\) is smallest when every possible \(j\) is tried
If the loss matrix is the 0-1 loss (diagonal entries 0, everything else 1),
then using the fact that \(L_{kj}=1\) for \(k \neq j\) and \(L_{jj}=0\),
\[\sum_{k=1}^{K}\ L_{kj}p(C_{k}\mid\mathbf x) = \sum_{k\neq j}\ p(C_{k}\mid\mathbf x) = \sum_{k=1}^{K}\ p(C_{k}\mid\mathbf x) - p(C_{j}\mid \mathbf x) = 1 - p(C_{j}\mid \mathbf x)\]where the last step uses \(\sum_{k=1}^{K}p(C_{k}\mid\mathbf x) = 1\). Therefore minimizing \(1 - p(C_{j}\mid \mathbf x)\) over \(j\) amounts to maximizing \(p(C_{j}\mid \mathbf x)\).
Conclusion:
\[\begin{align} \hat C(\mathbf x) &= \arg\min_{j} 1 - p(C_{j}\mid\mathbf x) \\\\ &= \arg\max_{j} p(C_{j}\mid\mathbf x) \end{align}\]- Note on the step \(\sum_{k=1}^{K}\ L_{kj}p(C_{k}\mid\mathbf x) = \sum_{k=1}^{K}\ p(C_{k}\mid\mathbf x) - p(C_{j}\mid \mathbf x)\): since the diagonal entry \(L_{jj}\) is 0 and every off-diagonal entry is 1, the sum effectively runs over all \(k \neq j\), which equals the full sum of posteriors minus the \(j\)-th term; the posteriors then sum to 1.
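A quick numerical check of this equivalence, with a randomly generated (purely illustrative) posterior:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
L = np.ones((K, K)) - np.eye(K)   # 0-1 loss: L[k, j] = 1 if k != j, else 0

posterior = rng.random(K)
posterior /= posterior.sum()      # a made-up posterior p(C_k | x)

# Expected loss of predicting class j: sum_k L[k, j] p(C_k | x)
expected_loss = L.T @ posterior

assert np.allclose(expected_loss, 1 - posterior)         # the reduction above
assert np.argmin(expected_loss) == np.argmax(posterior)  # argmin-loss == argmax-posterior
```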
Example: medical diagnosis
\(C_{k} \in \{1,2\} \Leftrightarrow \{sick, healthy\}\)
\[L = \begin{bmatrix}0 & 100\\1 & 0\end{bmatrix}\]
L[1,1] = 0: sick & diagnosed as sick
L[1,2] = 100: sick & diagnosed as healthy
In this case, the expected loss is:
\[\begin{align} \mathbb E[L] &= \int_{\mathcal{R}_{2}}L_{1,2}p(\mathbf x, C_{1})d\mathbf x + \int_{\mathcal{R}_{1}}L_{2,1}p(\mathbf x, C_{2})d\mathbf x \\\\ &= \int_{\mathcal{R}_{2}}100\times p(\mathbf x, C_{1})d\mathbf x + \int_{\mathcal{R}_{1}}p(\mathbf x, C_{2})d\mathbf x \\\\ \end{align}\]
Rows = ground truth
Columns = prediction (diagnosis)
\(\int_{\mathcal{R_{2}}}L_{1,2}p(\mathbf x, C_{1})d\mathbf x\): predicted as healthy(\(\mathcal{R_{2}}\)), but groundtruth is sick(\(C_{1}\))
since \(L_{1,2}\) = 100, \(\ \int_{\mathcal{R_{2}}}L_{1,2}p(\mathbf x, C_{1})d\mathbf x = \int_{\mathcal{R_{2}}}100\times\ p(\mathbf x, C_{1})d\mathbf x\).
Applying \(\hat C(\mathbf x) = \arg\min_{j}\sum_{k=1}^{K}\ L_{kj}p(C_{k}\mid\mathbf x)\):
j = 1:
\[\begin{align} \sum_{k=1}^{K}\ L_{k,1}p(C_{k}\mid\mathbf x) & = L_{11}p(C_{1}\mid \mathbf x) + L_{21}p(C_{2}\mid \mathbf x)\\ &= p(C_{2}\mid \mathbf x) \\\\ \end{align}\]
j = 2:
\[\begin{align} \sum_{k=1}^{K}\ L_{k,2}p(C_{k}\mid\mathbf x) & = L_{12}p(C_{1}\mid \mathbf x) + L_{22}p(C_{2}\mid \mathbf x)\\ &= 100\times\ p(C_{1}\mid \mathbf x) \\\\ \end{align}\]thus:
we compare \(p(C_{2}\mid \mathbf x)\) with \(100\times\ p(C_{1}\mid \mathbf x)\).
The condition for diagnosing healthy (\(C_{2}\)) is:
\(p(C_{2}\mid \mathbf x)> 100\times\ p(C_{1}\mid \mathbf x)\)
It must come out more than 100 times larger than the probability of being sick.
To diagnose safely (i.e., to reduce the risk of misdiagnosis), it is better to include the loss matrix in the model when making decisions.
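A small sketch of this decision rule with the loss matrix above (the posterior values passed in are made up):

```python
import numpy as np

# Rows = ground truth (sick, healthy), columns = diagnosis (sick, healthy).
L = np.array([[0, 100],
              [1,   0]])

def diagnose(p_sick):
    """Pick the column j minimizing sum_k L[k, j] * p(C_k | x)."""
    posterior = np.array([p_sick, 1.0 - p_sick])
    return ("sick", "healthy")[int(np.argmin(L.T @ posterior))]

# 'healthy' requires p(C_2 | x) > 100 * p(C_1 | x), i.e. p_sick < 1/101:
print(diagnose(0.05))   # sick
print(diagnose(0.005))  # healthy
```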
Regression
Target \(t \in \mathbb{R}\)
Loss function: \(L(t,y(\mathbf x)) = \{y(\mathbf x)-t\}^{2}\)
The goal is to find the function \(y(\mathbf x)\) that minimizes \(\mathbb E[L]\), the expectation of the loss.
\[\begin{align} F[y] = E[L] &= \int_{\mathbb{R}}\int_{\mathcal{X}}\{ y(\mathbf x) - t \}^{2}p(\mathbf x, t)d\mathbf x dt \\ &= \int_{\mathcal{X}}\left( \int_{\mathbb{R}}\{ y(\mathbf x) - t \}^{2}p(\mathbf x, t)dt \right)d\mathbf x \\ &= \int_{\mathcal{X}}\left( \int_{\mathbb{R}}\{ y(\mathbf x) - t \}^{2}p(t\mid \mathbf x)dt \right)p(\mathbf x)d\mathbf x \\ \end{align}\]
Conclusion:
It can be shown that the optimal prediction for \(\mathbf x\) is \(y(\mathbf x) = \mathbb E_{t}[t\mid \mathbf x]\).
\(\mathbb E_{t}[t\mid \mathbf x]\): the expected value of \(t\) given \(\mathbf x\).
In the figure above, what we know is the conditional distribution \(p(t\mid x_{0})\) of \(t\) given \(x_{0}\), and its expectation is \(y(x_{0})\).
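This can be illustrated numerically: for samples drawn from a made-up conditional \(p(t\mid x_{0})\), the sample mean beats any other constant prediction under squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples of t from a made-up conditional p(t | x_0).
t = rng.normal(loc=2.0, scale=1.0, size=100_000)

def expected_sq_loss(y):
    """Monte Carlo estimate of E[(y - t)^2 | x_0]."""
    return np.mean((y - t) ** 2)

y_star = t.mean()  # estimate of E[t | x_0]
assert expected_sq_loss(y_star) < expected_sq_loss(y_star + 0.5)
assert expected_sq_loss(y_star) < expected_sq_loss(y_star - 0.5)
```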
Methods for Decision Problems
Classification
Approaches that rely on a probability model
- Generative model: first find the class-conditional densities \(p(\mathbf x\mid C_{k})\) and the priors \(p(C_{k})\) for each class \(C_{k}\), then use Bayes' theorem to obtain the posterior \(p(C_{k}\mid \mathbf x)\).
\(p(\mathbf x)\) can be obtained as follows:
\[p(\mathbf x) = \sum_{k}p(\mathbf x\mid C_{k})p(C_{k})\]
Since the posterior is available, the classification decision follows easily. Because data can be "generated" by sampling from the joint distribution, this approach is called a generative model.
- Discriminative model: instead of computing all the distributions, estimate only the posterior \(p(C_{k}\mid \mathbf x)\). Decision theory is then applied in the same way as above.
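A minimal 1-D sketch of the generative route (the Gaussian class-conditionals and priors are made up):

```python
import numpy as np

def gaussian(x, mean, std=1.0):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

priors = np.array([0.3, 0.7])  # p(C_k), made up

def posterior(x):
    """Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x)."""
    likelihoods = np.array([gaussian(x, -1.0), gaussian(x, 2.0)])  # p(x | C_k)
    joint = likelihoods * priors
    return joint / joint.sum()  # p(x) = sum_k p(x | C_k) p(C_k)

p = posterior(0.0)
assert np.isclose(p.sum(), 1.0)  # posteriors are normalized
print(np.argmax(p))  # 0: class C_1 wins at x = 0 despite its smaller prior
```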
Approaches that rely on a discriminant function
Models that do not rely on a probability model
- Discriminant function: find a discriminant function that assigns each input \(\mathbf x\) directly to a class. No probabilities are computed.
Regression
- First solve the inference problem of finding the joint density \(p(\mathbf x, t)\), then obtain the conditional density \(p(t\mid \mathbf x)\), and finally marginalize to find \(\mathbb E_{t}[t\mid \mathbf x]\).
- Solve the inference problem of finding the conditional density \(p(t\mid \mathbf x)\), then marginalize to find \(\mathbb E_{t}[t\mid \mathbf x]\).
- Find \(y(\mathbf x)\) directly.
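A minimal sketch of the last option, fitting \(y(\mathbf x)\) directly by least squares on made-up data (no distributions are modeled):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data: t = 3x + 1 plus noise.
x = rng.uniform(-1.0, 1.0, size=200)
t = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

# Fit y(x) = w*x + b directly via linear least squares.
A = np.stack([x, np.ones_like(x)], axis=1)
(w, b), *_ = np.linalg.lstsq(A, t, rcond=None)

assert abs(w - 3.0) < 0.1 and abs(b - 1.0) < 0.1  # recovers the made-up trend
```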
Optional
Euler-Lagrange Equation
Decomposition of the loss function
Appendix
MathJax
\(\mathbb E\):
$$\mathbb E$$
\(\mathcal{R}\):
$$\mathcal{R}$$
\(\arg\min_{j}\):
$$\arg\min_{j}$$
matrix with bracket: \(L = \begin{bmatrix}a & b\\c & d\end{bmatrix}\)
$$L = \begin{bmatrix}a & b\\c & d\end{bmatrix}$$
matrix with curly braces:
\(\begin{Bmatrix}aaa & b\cr c & ddd \end{Bmatrix}\)
$$\begin{Bmatrix}aaa & b\cr c & ddd \end{Bmatrix}$$
Variable-size brackets with escaped curly braces
\(\left\{-\frac{1}{2\sigma^{2}} \sum_{n=1}^{N}(x_{n}-\mu)^{2} \right\}\):
$$\left\{-\frac{1}{2\sigma^{2}} \sum_{n=1}^{N}(x_{n}-\mu)^{2} \right\}$$
References
Drawing Graph with PPT: https://www.youtube.com/watch?v=MQEBu9NnCuI
Decision Theory: http://norman3.github.io/prml/docs/chapter01/5.html
Pattern Recognition and Machine Learning: https://tensorflowkorea.files.wordpress.com/2018/11/bishop-pattern-recognition-and-machine-learning-2006.pdf