NN basics - ML & Math

10 minute read

To do

Partial derivative

Jacobian Matrix

https://angeloyeo.github.io/2020/07/24/Jacobian.html
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/jacobian/v/jacobian-prerequisite-knowledge

Hessian Matrix

2์ฐจ ํŽธ๋„ ํ•จ์ˆ˜

Linear Algebra

Vector & Matrix

Matrix Multiplication & Vector Transformation (function or mapping)

\[A\boldsymbol{x} = \boldsymbol{b}\] \[\begin{bmatrix} 4 & -3 & 1 & 3 \\ 2 & 0 & 5 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ \end{bmatrix} = \begin{bmatrix} 5 \\ 8 \\ \end{bmatrix}\] \[\boldsymbol{x} \in \mathbb{R}^{4} \rightarrow \boldsymbol{b} \in \mathbb{R}^2\]

x๋ผ๋Š” vector์— A๋ผ๋Š” ํ–‰๋ ฌ์„ ๊ณฑํ–ˆ์„ ๋•Œ ์ƒˆ๋กœ์šด ๊ณต๊ฐ„์— b๋ผ๋Š” vector๋กœ ํˆฌ์˜ ๋จ. (x๋Š” 4์ฐจ์› ์‹ค์ˆ˜ ๊ณต๊ฐ„์—์„œ 2์ฐจ์› ์‹ค์ˆ˜ ๊ณต๊ฐ„์œผ๋กœ)

matrix-multiplication-transformation

Examples:
matrix-multiplication-transformation-2

ํ–‰๋ ฌ์˜ ๊ณฑ์…‰์€ ๊ทธ ๋Œ€์ƒ์ด ๋ฒกํ„ฐ๋“  ํ–‰๋ ฌ์ด๋“  ๊ณต๊ฐ„์˜ ์„ ํ˜•์  ๋ณ€ํ™˜.

๊ฒฐ๊ตญ ์‹ ๊ฒฝ๋ง์„ ํ†ตํ•ด representation learning์ด ๊ฐ€๋Šฅํ•œ ๊ฒƒ.

Representation Learning

Neural networks and representation learning

Data that could not be separated linearly in its original form has been transformed so that it is now linearly separable. In other words, during training the neural network changed the representation of the data from (\(x_1,x_2\)) to (\(z_1,z_2\)).

representation-learning-vectors

This article used a simple neural network for ease of explanation, but deep, large neural networks are said to be very good at simplifying even fairly complex representations of the training data to the point where they become linearly separable. This is why some people call neural networks representation learners.


Representation learning means learning how to transform the representation of the data so that it suits a given task, i.e., producing a representation that makes the task easier to perform. It learns the structure of the data without heavy feature engineering on the raw data, and it is a core element of deep learning architectures. Determining the optimal representation of the input data and finding this latent representation is called representation learning or feature learning.

Size of Vector & Matrix (Distance)

์œ ์‚ฌ๋„๋Š” ๋‚ด์ ์„ ํ†ตํ•ด ์–ด๋Š ์ •๋„ ๊ตฌํ• ์ˆ˜ ์žˆ์Œ.

๊ฑฐ๋ฆฌ(size)๋Š” Norm ์œผ๋กœ ๊ตฌํ•จ.

A norm is a function from a real or complex vector space to the nonnegative real numbers that behaves in certain ways like the distance from the origin.

Vector P Norm

A norm is a way of measuring the length of a vector.

P Norm

\(\Vert \boldsymbol{x}\Vert_{p} = \left(\sum_{i=1,d} \vert x_{i}\vert^{p} \right)^{\frac{1}{p}}\)

Absolute-value Norm (1-norm)

\(\Vert \boldsymbol{x}\Vert_{1} = \sum_{i=1,d} \vert x_{i}\vert\)

Euclidean Norm (2-norm)

\(\Vert \boldsymbol{x}\Vert_{2} = \left(\sum_{i=1,d} x_{i}^{2} \right)^{\frac{1}{2}}\)

Max Norm

\(\Vert \boldsymbol{x}\Vert_{\infty} = \max(\vert x_{1}\vert,\ldots,\vert x_{d}\vert)\)

For example, when x = (3, -4, 1), the 2-norm is \(\Vert \boldsymbol{x}\Vert_{2} = (3^2 +(-4)^2 + 1^2)^{1/2} \approx 5.099\),
which is the distance from the origin.

The distance between two vectors works the same way:
\(\Vert \boldsymbol{x} - \boldsymbol{a}\Vert_{2} = ((3-a_{1})^2 +(-4-a_{2})^2 + (1-a_{3})^2)^{1/2}\) for \(\boldsymbol{a} = (a_1, a_2, a_3)\)
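
A quick sketch of these norms with numpy.linalg.norm, using the example vector (the second vector a is arbitrary, only to show a between-vector distance):

```python
import numpy as np

x = np.array([3, -4, 1])

print(np.linalg.norm(x, 1))       # 1-norm: |3| + |-4| + |1| = 8
print(np.linalg.norm(x, 2))       # 2-norm: sqrt(26) ~= 5.099
print(np.linalg.norm(x, np.inf))  # max norm: 4

# Distance between two vectors = norm of their difference.
a = np.array([1.0, 2.0, 3.0])     # arbitrary second vector, for illustration
print(np.linalg.norm(x - a, 2))
```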

type-of-norms

Matrix Frobenius Norm

Measures the size of a matrix.

\[\Vert \boldsymbol{A}\Vert_{F} = \left( \sum_{i=1,n}\sum_{j=1,m}a_{ij}^2 \right)^{\frac{1}{2}}\]

For example, when \(\boldsymbol{A} = \begin{bmatrix} 2 & 1 \\ 6 & 4 \end{bmatrix}\),

\[\Vert \boldsymbol{A}\Vert_{F} = \sqrt{2^2 + 1^2 + 6^2 + 4^2} \approx 7.550\]
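
The same computation in numpy, as a sanity check:

```python
import numpy as np

A = np.array([[2, 1],
              [6, 4]])

# Frobenius norm: sqrt of the sum of squared entries = sqrt(57) ~= 7.550
print(np.linalg.norm(A, 'fro'))
print(np.sqrt((A ** 2).sum()))    # same value, written out
```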

Uses of the norm

  • As a distance (size)
  • For regularization
    • norm-regularization

    • To deal with the overfitting that can arise at the optimal point, a norm term is added to the loss (and hence to the gradient) so that the weights do not reach the unregularized optimum but instead settle at the point inside the L2-norm boundary that is closest to that optimum. A minimal sketch follows this list.
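
A penalty-form sketch of how an L2 norm enters a loss and its gradient (equivalent to the boundary picture above). The squared-error loss, the coefficient lam, and the toy data are illustrative choices, not from the original post:

```python
import numpy as np

def loss_and_grad(w, X, y, lam=0.1):
    """Squared-error loss with an L2 penalty lam * ||w||^2."""
    err = X @ w - y
    loss = (err ** 2).mean() + lam * np.sum(w ** 2)
    grad = 2 * X.T @ err / len(y) + 2 * lam * w   # penalty shifts the gradient
    return loss, grad

# Toy data, just to show the call.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
w = np.zeros(3)
for _ in range(100):
    loss, grad = loss_and_grad(w, X, y)
    w -= 0.05 * grad                              # gradient step on the penalized loss
print(loss, w)
```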

Perceptron

The perceptron takes the inner product of the input \(\boldsymbol{x}\) and the weights \(\boldsymbol{W}\), applies an activation function to that result, and outputs + or - depending on a threshold.

The inner product matters because the scalar it produces is what goes into the activation function.

Viewed geometrically,
the inner product of the input with the learned \(\boldsymbol{W}\) corresponds to projecting the input onto \(\boldsymbol{W}\); the question is where that projection lands relative to the decision boundary.
If this projected value is greater than or equal to the threshold (T) of the perceptron's activation function, the output is 1, otherwise -1, and that is what separates the two classes.

The decision boundary, which is perpendicular to \(\boldsymbol{W}\), is a decision line in 2 dimensions, a decision plane in 3 dimensions, and a decision hyperplane in 4 or more dimensions.

The decision boundary is the set of \(\boldsymbol{x}\) for which the linear discriminant function \(y(\boldsymbol{x})=\boldsymbol{w}^T\boldsymbol{x}+w_0\) equals 0.
If a bias is used, the bias (\(w_0\)) also shapes the decision boundary (same slope, different position).
See here for details.
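
A minimal perceptron forward pass matching the description above; the weight vector, bias, and threshold are illustrative values:

```python
import numpy as np

def perceptron(x, w, w0=0.0, threshold=0.0):
    """Inner product plus bias, then a step activation: +1 or -1."""
    score = np.dot(w, x) + w0          # scalar that feeds the activation
    return 1 if score >= threshold else -1

w = np.array([2.0, -1.0])              # learned weight vector (illustrative)
print(perceptron(np.array([1.0, 0.5]), w))   # +1 side of the boundary
print(perceptron(np.array([-1.0, 1.0]), w))  # -1 side of the boundary
```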

Linear Classifier (cs231n)

linear-classifier-1
linear-classifier-2

The image X (a tensor) of shape 32 x 32 x 3 is flattened into a vector of shape 3072 x 1 and fed in as the input.
W is given shape 10 x 3072, so the product of this matrix with the input vector has shape 10 x 1 (you can think of this as 10 perceptrons).
Choosing 10 target classes is arbitrary; if you wanted 3 target classes, the output would have shape 3 x 1.
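
A shape-level sketch of this classifier; the random numbers only stand in for a real image and learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.integers(0, 256, size=(32, 32, 3))  # CIFAR-10-sized image
x = image.reshape(3072, 1)                      # flatten to 3072 x 1

W = rng.normal(size=(10, 3072))                 # one row per class ("10 perceptrons")
b = rng.normal(size=(10, 1))

scores = W @ x + b                              # 10 x 1, one score per class
print(scores.shape)                             # (10, 1)
```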

In the slide below,
W has shape 3 x 4 because the example classifies into 3 classes: cat, dog, and ship.

linear-classifier-3

You can think of each perceptron as one linear classifier.

cs231n:
The problem is that the linear classifier is only learning one template for each class.
So, if thereโ€™s sort of variations in how that class might appear, itโ€™s trying to average out all those different variations, all those different appearances, and use just one single template to recognize each of those categories.

linear-classifier-4

Images as points in high dimensional space. The linear classifier is putting in these linear decision boundaries to try to draw linear separation between one category and the rest of the categories.

Inverse Matrix

Used to transform a vector and then bring it back to the original space.

For machine learning models or hypotheses that rely on a change of space (such as PCA), having an invertible matrix makes some computations convenient.

In linear algebra, inverse matrices make solving systems of equations more convenient.

  • Important conditions for an invertible matrix (checked numerically in the sketch below):
    • All rows and columns of \(\boldsymbol{A}\) are linearly independent.
    • \(\det(\boldsymbol{A}) \neq 0\).
    • \(\boldsymbol{A}^{T}\boldsymbol{A}\) is a positive definite symmetric matrix.
    • All eigenvalues of \(\boldsymbol{A}\) are nonzero.
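
These conditions can be checked numerically; a small sketch with an arbitrary 2 x 2 matrix:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

print(np.linalg.matrix_rank(A))        # 2: rows/columns are linearly independent
print(np.linalg.det(A))                # 10.0, not 0
print(np.linalg.eigvals(A))            # no zero eigenvalue
print(np.linalg.eigvalsh(A.T @ A))     # eigenvalues of A^T A are all positive

A_inv = np.linalg.inv(A)
print(A_inv @ A)                       # identity (up to floating point error)
```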

Determinant

Geometric meaning: the determinant describes how multiplication by the matrix expands or contracts space.

  • \(\det(\boldsymbol{A}) = 0\): space is collapsed along at least one dimension and loses its volume
  • \(\det(\boldsymbol{A}) = 1\): volume-preserving transformation, orientation preserved
  • \(\det(\boldsymbol{A}) = -1\): volume-preserving transformation, orientation reversed
  • \(\det(\boldsymbol{A}) = 5\): volume expanded by a factor of 5, orientation preserved

It measures how the transformation changes volume relative to the original space.
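
A tiny illustration of the volume interpretation with two 2 x 2 matrices:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])   # scales x by 2 and y by 3

# The unit square spanned by e1, e2 is mapped to a 2 x 3 rectangle:
print(np.linalg.det(A))      # 6.0 -> areas are scaled by 6

R = np.array([[0.0, -1.0],
              [1.0,  0.0]])  # 90-degree rotation
print(np.linalg.det(R))      # 1.0 -> area and orientation preserved
```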

Positive Definite Matrices

ํ–‰๋ ฌ์˜ ๊ณต๊ฐ„์˜ ๋ชจ์Šต์„ ํŒ๋‹จํ•˜๊ธฐ ์œ„ํ•ด?
์–‘์˜ ์ •๋ถ€ํ˜ธ ํ–‰๋ ฌ: 0์ด ์•„๋‹Œ ๋ชจ๋“  ๋ฒกํ„ฐ \(\boldsymbol{x}\)์— ๋Œ€ํ•ด, \(\boldsymbol{x}^{T}\boldsymbol{A}\boldsymbol{x} > 0\)
์„ฑ์งˆ

  • All eigenvalues are positive
  • The inverse is also a positive definite matrix
  • \(\det(\boldsymbol{A}) > 0\).
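
A small numerical check of these properties on an arbitrary symmetric matrix:

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])                  # symmetric

print(np.linalg.eigvalsh(A))                 # [1. 3.]: all positive -> positive definite
print(np.linalg.det(A))                      # 3.0 > 0
print(np.linalg.eigvalsh(np.linalg.inv(A)))  # inverse is positive definite too

rng = np.random.default_rng(0)
x = rng.normal(size=2)
print(x @ A @ x > 0)                         # quadratic form is positive for this x
```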

Decomposition

Eigen-decomposition

See ML-Basics-Linear-Algebra.

Given a square matrix \(A\in \mathbb{R}^{n\times n}\), a \(\lambda \in \mathbb{C}\) satisfying \(Ax = \lambda x, x\neq 0\) is called an eigenvalue of \(A\), and \(x\in \mathbb{C}^n\) is called an associated eigenvector.

Eigenvectors: nonzero vectors whose direction does not change after the linear transformation (T).

Eigenvalues: the scale by which an eigenvector's length changes; an eigenvector may be reversed or scaled, but its direction does not change.
They make for interesting basis vectors: basis vectors whose transformation matrices are computationally simpler, or that make for better coordinate systems.

The eig function in the numpy.linalg module can be used to compute eigenvalues and eigenvectors, as below.
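
A minimal example of that call:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)        # [2. 3.]
print(eigenvectors)       # columns are the eigenvectors

# A v = lambda v for each eigenpair
v0 = eigenvectors[:, 0]
print(A @ v0, eigenvalues[0] * v0)
```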

the-effect-of-eigenvectors-and-eigenvalues

Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix A with two orthonormal eigenvectors, v(1) with eigenvalue ฮป1 and v(2) with eigenvalue ฮป2. (Left) We plot the set of all unit vectors u โˆˆ R2 as a unit circle. (Right) We plot the set of all points Au. By observing the way that A distorts the unit circle, we can see that it scales space in direction v(i) by ฮปi. Deep Learning. Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

Eigen-decomposition can be used to compute the inverse of a matrix, and it is also used in PCA.

Eigen-decomposition applies only to square matrices.
However, since ML offers no guarantee that a matrix is square, SVD is used.

Singular Value Decomposition (SVD)

Used to compute the inverse (more precisely, the pseudo-inverse) of matrices that are not square.

ํŠน์ด๊ฐ’๋ถ„ํ•ด(SVD)์˜ ๊ธฐํ•˜ํ•™์  ์˜๋ฏธ

ํ–‰๋ ฌ์„ \(x' = Ax\)์™€ ๊ฐ™์ด ์ขŒํ‘œ๊ณต๊ฐ„์—์„œ์˜ ์„ ํ˜•๋ณ€ํ™˜์œผ๋กœ ๋ดค์„ ๋•Œ ์ง๊ตํ–‰๋ ฌ(orthogonal matrix)์˜ ๊ธฐํ•˜ํ•™์  ์˜๋ฏธ๋Š” ํšŒ์ „๋ณ€ํ™˜(rotation transformation) ๋˜๋Š” ๋ฐ˜์ „๋œ(reflected) ํšŒ์ „๋ณ€ํ™˜, ๋Œ€๊ฐํ–‰๋ ฌ(diagonal maxtrix)์˜ ๊ธฐํ•˜ํ•™์  ์˜๋ฏธ๋Š” ๊ฐ ์ขŒํ‘œ์„ฑ๋ถ„์œผ๋กœ์˜ ์Šค์ผ€์ผ๋ณ€ํ™˜(scale transformation)์ด๋‹ค.

ํ–‰๋ ฌ \(R\)์ด ์ง๊ตํ–‰๋ ฌ(orthogonal matrix)์ด๋ผ๋ฉด \(RR^T = E\)์ด๋‹ค. ๋”ฐ๋ผ์„œ \(\det(RR^T) = \det(R)\det(R^T) = \det(R^2) = 1\)์ด๋ฏ€๋กœ \(\det(R)\)๋Š” ํ•ญ์ƒ +1, ๋˜๋Š” -1์ด๋‹ค. ๋งŒ์ผ \(\det(R)=1\)๋ผ๋ฉด ์ด ์ง๊ตํ–‰๋ ฌ์€ ํšŒ์ „๋ณ€ํ™˜์„ ๋‚˜ํƒ€๋‚ด๊ณ  \(\det(R)=-1\)๋ผ๋ฉด ๋’ค์ง‘ํ˜€์ง„(reflected) ํšŒ์ „๋ณ€ํ™˜์„ ๋‚˜ํƒ€๋‚ธ๋‹ค.

๐Ÿ‘‰ ๋”ฐ๋ผ์„œ ์‹ \((1), A = U\Sigma V^T\)์—์„œ \(U, V\)๋Š” ์ง๊ตํ–‰๋ ฌ, \(\Sigma\)๋Š” ๋Œ€๊ฐํ–‰๋ ฌ์ด๋ฏ€๋กœ \(Ax\)๋Š” \(x\)๋ฅผ ๋จผ์ € \(V^T\)์— ์˜ํ•ด ํšŒ์ „์‹œํ‚จ ํ›„ \(\Sigma\)๋กœ ์Šค์ผ€์ผ์„ ๋ณ€ํ™”์‹œํ‚ค๊ณ  ๋‹ค์‹œ \(U\)๋กœ ํšŒ์ „์‹œํ‚ค๋Š” ๊ฒƒ์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.
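
A numpy sketch of this rotate-scale-rotate view; the matrix is arbitrary:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])           # 3 x 2: maps R^2 -> R^3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s)                              # singular values (scale factors)

# Reconstruct A = U * Sigma * V^T
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True

x = np.array([1.0, -1.0])
print(np.allclose(A @ x, U @ np.diag(s) @ (Vt @ x)))  # rotate, scale, rotate
```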

svd

<Figure 1> Source: Wikipedia

👉 In other words, a singular value of a matrix can be interpreted as a scale factor of the linear transformation represented by that matrix.

👉 Compared with the eigenvalues from eigendecomposition, an eigenvalue is the scale factor along a direction that the transformation leaves unchanged (the eigenvector), whereas a singular value is a scale factor of the transformation itself.

👉 Stretching this idea a bit further, an \(m \times n\) matrix \(A\) is a linear transformation from \(n\)-dimensional space to \(m\)-dimensional space. If we take a round shape in \(n\)-dimensional space (a circle, a sphere, and so on) and transform it by \(A\), then \(V^T\) only rotates it, so its shape does not change. \(\Sigma\), however, deforms it according to the magnitudes of the singular values, turning a circle into an ellipse or a sphere into a rugby ball (when \(n\) is 2 and the shape is a circle, the first singular value \(\sigma_1\) corresponds to the length of the major axis of the transformed ellipse, and the second singular value \(\sigma_2\) to the length of the minor axis). The subsequent transformation by \(U\) is again a rotation, so it does not affect the shape either. If \(m > n\), the dimension is extended by padding with zeros before rotating by \(U\); if \(m < n\), some dimensions are discarded (a kind of projection) before rotating. In the end, the shape of a figure transformed by the linear map \(A\) is determined solely by the singular values of A.

Information Theory & Optimization (chapter 3 in Deep Learning book)

Quantifying the similarity between probability distributions.

Basic principle of information theory 👉 the lower the probability, the more information.
An unlikely event carries a large amount of information.

์ž๊ธฐ ์ •๋ณด (self information)

The amount of information of an event (message, \(e_i\)); the unit is bits when the log base is 2 and nats when the base is the natural constant e.
\(h(e_i) = -log_{2}P(e_{i})\) or \(h(e_i) = -log_{e}P(e_{i})\)

For example, the information of the event that a coin lands heads:
\(-log_{2}(\frac{1}{2}) = 1\)
The information of the event that a die (1-6) rolls a 1:
\(-log_{2}(\frac{1}{6}) \approx 2.58\)

The latter event carries relatively more information.
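
The two examples, computed directly (a minimal sketch):

```python
import numpy as np

def self_information(p, base=2):
    """Information content of an event with probability p."""
    return -np.log(p) / np.log(base)

print(self_information(1 / 2))   # 1.0 bit    (coin lands heads)
print(self_information(1 / 6))   # ~2.58 bits (rolling a 1 on a die)
```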

Entropy

Entropy expresses the uncertainty of a random variable \(x\),
as the expected value of the information of all events.

Discrete probability distribution: \(H(x) = - \sum_{i=1, k}P(e_i)log_{2}P(e_i)\) or \(H(x) = - \sum_{i=1, k}P(e_i)log_{e}P(e_i)\)
Continuous probability distribution: \(H(x) = - \int_{\mathbb{R}}P(x)log_{2}P(x)dx\) or \(H(x) = - \int_{\mathbb{R}}P(x)log_{e}P(x)dx\)

์˜ˆ, ๋™์ „์˜ ์•ž๋’ค์˜ ๋ฐœ์ƒ ํ™•๋ฅ ์ด ๋™์ผํ•œ ๊ฒฝ์šฐ์˜ ์—”ํŠธ๋กœํ”ผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Œ

\[\begin{align}H(x) &= - \sum_{i=1, k}P(e_i)logP(e_i) \\ &= - (0.5\times log_{2}0.5 + 0.5\times log_{2}0.5) \\ &= -log_{2}0.5 \\ &= -(-1) \end{align}\]

๋™์ „์˜ ๋ฐœ์ƒ ํ™•๋ฅ ์— ๋”ฐ๋ฅธ ์—”ํŠธ๋กœํ”ผ ๋ณ€ํ™” (binary entrophy)
entropy-plot

  • ๊ณตํ‰ํ•œ ๋™์ „์ผ ๊ฒฝ์šฐ ๊ฐ€์žฅ ํฐ ์—”ํŠธ๋กœํ”ผ๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ์Œ
  • ๋™์ „ ๋˜์ง€๊ธฐ ๊ฒฐ๊ณผ ์ „์†ก์—๋Š” ์ตœ๋Œ€ 1๋น„ํŠธ๊ฐ€ ํ•„์š”ํ•จ์„ ์˜๋ฏธ

๋ชจ๋“  ์‚ฌ๊ฑด์ด ๋™์ผํ•œ ํ™•๋ฅ ์„ ๊ฐ€์งˆ ๋•Œ, ์ฆ‰, ๋ถˆํ™•์‹ค์„ฑ์ด ๊ฐ€์žฅ ๋†’์€ ๊ฒฝ์šฐ, ์—”ํŠธ๋กœํ”ผ๊ฐ€ ์ตœ๋Œ€ ๊ฐ’์„ ๊ฐ–๋Š”๋‹ค.

For example, compare the entropy of yut (the Korean four-stick game) with that of a fair die (1-6).

Yut: \(H(x) = - (\frac{4}{16}log_{2}\frac{4}{16} + \frac{6}{16}log_{2}\frac{6}{16} + \frac{4}{16}log_{2}\frac{4}{16} + \frac{1}{16}log_{2}\frac{1}{16} + \frac{1}{16}log_{2}\frac{1}{16}) \approx 2.0306 \text{ bits}\)

Die: \(H(x) = - (\frac{1}{6}log_{2}\frac{1}{6} + \frac{1}{6}log_{2}\frac{1}{6} + \frac{1}{6}log_{2}\frac{1}{6} + \frac{1}{6}log_{2}\frac{1}{6} + \frac{1}{6}log_{2}\frac{1}{6} + \frac{1}{6}log_{2}\frac{1}{6}) \approx 2.585 \text{ bits}\)
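
The coin, yut, and die entropies above, computed directly:

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))                           # 1.0 (fair coin)
print(entropy([4/16, 6/16, 4/16, 1/16, 1/16]))       # ~2.0306 (yut)
print(entropy([1/6] * 6))                            # ~2.585 (fair die)
```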

๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ (Cross Entropy)

How much information two probability distributions share.

\[H(P,Q) = - \sum_{x}P(x)log_{2}Q(x) = - \sum_{i=1, k} P(e_{i})log_{2}Q(e_{i})\]

P๋ผ๋Š” ํ™•๋ฅ ๋ถ„ํฌ์— ๋Œ€ํ•ด์„œ Q์˜ ๋ถ„ํฌ์˜ cross entropy

๋”ฅ๋Ÿฌ๋‹์—์„œ output์€ ํ™•๋ฅ ๊ฐ’์ž„.
์†์‹คํ•จ์ˆ˜๋Š” ์ •๋‹ต(label or target)๊ณผ ์˜ˆ์ธก๊ฐ’(prediction)์„ ๋น„๊ตํ•˜๊ธฐ ๋•Œ๋ฌธ์—
์ด๋ฅผ ํ™•๋ฅ ๊ฐ’์œผ๋กœ ๋น„๊ตํ•˜๋Š” ๊ฒƒ์ž„.

label ๊ฐ™์€ ๊ฒฝ์šฐ๋„ OHE๋กœ ํ•˜์ง€๋งŒ ์ด๊ฒƒ๋„ 1๋กœ ๋˜์–ด์žˆ๋Š” ํ™•๋ฅ  ๋ถ„ํฌ๊ณ  output๋„ ํ™•๋ฅ  ๋ถ„ํฌ์ด๊ธฐ ๋•Œ๋ฌธ์—
์ด ์ฒ™๋„๋กœ ๋น„๊ต๊ฐ€๋Šฅํ•œ ๊ฒƒ์ด ๋ฐ”๋กœ CE์ž„.

์œ„์˜ ์‹์„ ์ „๊ฐœํ•˜๋ฉด,

\[\begin{align}H(P,Q) &= - \sum_{x}P(x)log_{2}Q(x) \\ &= - \sum_{x}P(x)log_{2}P(x) + \sum_{x}P(x)log_{2}P(x) - \sum_{x}P(x)log_{2}Q(x) \\ &= H(P) + \sum_{x}P(x)log_{2}\frac{P(x)}{Q(x)} \\ \end{align}\]

The identity is obtained by inserting \(- \sum_{x}P(x)log_{2}P(x) + \sum_{x}P(x)log_{2}P(x)\) (which equals zero) into the expression.
Combining the last two terms,

\[\begin{align}\sum_{x}P(x)log_{2}P(x) - \sum_{x}P(x)log_{2}Q(x) \\ \Rightarrow \sum_{x}P(x)log_{2}\frac{P(x)}{Q(x)}\end{align}\]

and this expression is called the KL divergence.

If we take \(P\) to be the distribution of the data, it does not change during training.
Since \(P\) is fixed, we minimize the cross-entropy value by adjusting \(Q\).

When cross entropy is used as the loss function,

\[H(P,Q) = H(P) + \sum_{x}P(x)log_{2}\frac{P(x)}{Q(x)}\]

in this expression \(P\) is fixed, so minimizing cross entropy is equivalent to minimizing the KLD.

In other words, cross entropy is used to minimize the difference between the data distribution P(x) that we have and the estimated distribution Q(x).

KLD (Kullbackโ€“Leibler divergence)

  • The KLD between P and Q
  • Commonly used to measure the distance between two probability distributions.
\[KL(P\Vert Q) = \sum_{x}P(x)log_{2}\frac{P(x)}{Q(x)}\]

P์™€ Q์˜ cross entrophy๋Š” p์˜ ์—”ํŠธ๋กœํ”ผ + P์™€ Q๊ฐ„์˜ KL ๋‹ค์ด๋ฒ„์ „์Šค์ž„.

Logit

The output of a deep learning model should be a probability,
but the raw values that come out of the network lie in \(\left(-\infty, \infty \right)\),

so an activation function (sigmoid) is applied to turn them into probabilities. In that case each value is a probability for its own class (the probabilities of all classes do not sum to 1).
That is fine for multilabel classification,
but for multiclass classification we want a probability distribution over all classes (the class probabilities sum to 1).
That is the role of the softmax function.

In probability, the logit function is \(log_{e}(\frac{p}{1-p})\).
When p is close to 0% the logit is \(-\infty\), and when p is close to 100% the logit is \(\infty\).

logit

In other words, if we have a logit in \(\left(-\infty, \infty \right)\), we can map that range back to \(\left[0, 1\right]\).

Since the logit grows as the probability grows, in deep learning the logit can also be used as a score in place of the probability.

The network outputs logits, and applying a sigmoid turns each into a per-class probability (not a distribution over all classes; the sum can exceed 1).
Softmax uses the logits to produce a probability distribution over all classes.
It uses \(e^{logit}\).
\(e^{logit}\) also has the property that it grows as the probability grows.
Using \(e^{logit}\) pushes high scores much higher and low scores much lower, as the sketch below shows.
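
A minimal sketch contrasting sigmoid and softmax on the same logits (the logit values are arbitrary):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])         # raw network outputs

sigmoid = 1 / (1 + np.exp(-logits))        # per-class probabilities
print(sigmoid, sigmoid.sum())              # sum can exceed 1

softmax = np.exp(logits) / np.exp(logits).sum()   # e^logit, normalized
print(softmax, softmax.sum())              # a distribution: sums to 1
```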

์ด๊ฒƒ์ด ๋”ฅ๋Ÿฌ๋‹์— ๋„์›€์ด ๋˜๋Š” ์ด์œ ๋Š” label์„ OHE ํ•˜๊ธฐ ๋•Œ๋ฌธ์ž„.
ํ™•๋ฅ ๊ฐ’์ด ๋†’์€ ๊ฒƒ๋“ค์„ ์•„์ฃผ ๋†’๊ฒŒ, ๋‚ฎ์€ ๊ฐ’๋“ค์„ ์•„์ฃผ ๋‚ฎ๊ฒŒ ํ•˜๋ฉด OHE์™€ ๋น„์Šทํ•ด์ง€๊ธฐ ๋•Œ๋ฌธ.
๋”ฐ๋ผ์„œ CrossEntropy ํ•™์Šต์— ๋„์›€์ด ๋˜๋Š” ๊ฒƒ์ž„.

why e in log?
0๋ณด๋‹ค ํฐ ์•„๋ฌด ๊ฐ’์ด๋‚˜ ์จ๋„ ์ƒ๊ด€์€ ์—†์ง€๋งŒ(์ˆ˜ํ•™์ ์œผ๋กœ๋Š”),
๋”ฅ๋Ÿฌ๋‹์˜ output์„ ๋กœ์ง“์ด๋ผ๊ณ  ๊ฐ€์ •ํ–ˆ๊ธฐ ๋•Œ๋ฌธ์—, ๋กœ์ง“์€ log์— e base๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์Œ.
๋”ฐ๋ผ์„œ, \(e^{logit} = e^{log_{e}(\frac{p}{1-p})} = \frac{p}{1-p}\) ๊ณ„์‚ฐ์ด ์ƒ๋‹นํžˆ ๊ฐ„ํŽธํ•ด์ง.
๋‹ค๋ฅธ ์ˆซ์ž๋ฅผ ์‚ฌ์šฉํ•ด๋„ softmax์— ๋น„์Šทํ•œ ๊ฒฐ๊ณผ๋Š” ๋‚˜์˜ค์ง€๋งŒ, ๋”ฅ๋Ÿฌ๋‹ ์•„์ด๋””์–ด์˜ ๊ทผ๊ฑฐ๊ฐ€ ๋กœ์ง“์ด๊ธฐ ๋•Œ๋ฌธ์— e๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ž„.
์˜ค์ผ๋Ÿฌ ๋„˜๋ฒ„๋ผ๊ณ  ํ•จ.

e-logit

logit-softmax

Summary

  • Softmax gives probability distribution over predicted output classes.
  • The final layer in deep learning has logit values which are raw values for prediction by softmax.
  • Logit is the input to softmax

Partial derivative

  • The derivative of a function of several variables
  • The vector formed by these partial derivatives is called the gradient.

Jacobian Matrix

ํ–‰๋ ฌ์„ ๋ฏธ๋ถ„ํ•œ ๊ฒƒ.
1์ฐจ ํŽธ๋„ ํ•จ์ˆ˜

์‹ ๊ฒฝ๋ง์—์„œ ์—ฐ์‚ฐ์€ ํ–‰๋ ฌ๋กœ ์ด๋ฃจ์–ด์ง€๊ณ , ๋ฏธ๋ถ„์ด ํ•„์š”ํ•œ๋ฐ jacobian์„ ํ†ตํ•ด ๋ฏธ๋ถ„ํ•จ.

For example,

\[\boldsymbol{f}: \mathbb{R}^2 \mapsto \mathbb{R}^3,\ \boldsymbol{f(x)} = (2x_1+x_2^2,\ -x_1^2+3x_2,\ 4x_1x_2)^T\] \[\boldsymbol{J} = \begin{pmatrix} 2 & 2x_2 \\ -2x_1 & 3 \\ 4x_2 & 4x_1 \end{pmatrix}\ \ \ \boldsymbol{J}\big\vert_{(2,1)} = \begin{pmatrix} 2 & 2 \\ -4 & 3 \\ 4 & 8 \end{pmatrix}\]

์ž์ฝ”๋น„์•ˆ์ด ๋งํ•˜๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์€ ๋ฏธ์†Œ ์˜์—ญ์—์„œ โ€˜๋น„์„ ํ˜• ๋ณ€ํ™˜โ€™์„ โ€˜์„ ํ˜• ๋ณ€ํ™˜์œผ๋กœ ๊ทผ์‚ฌโ€™ ์‹œํ‚จ ๊ฒƒ.

Appendix

Positive Definite Matrices

Geometric meaning of Determinant

Reference

Manifold Learning: https://deepinsight.tistory.com/124
Representation Learning: https://ratsgo.github.io/deep%20learning/2017/04/25/representationlearning/
Representation Learning: https://velog.io/@tobigs-gnn1213/7.-Graph-Representation-Learning
cs231n slides: http://cs231n.stanford.edu/slides/2021/
SVD: https://darkpgmr.tistory.com/106
logit: https://youtu.be/K7HTd_Zgr3w
partial derivative: https://youtu.be/ly4S0oi3Yz8, https://youtu.be/GkB4vW16QHI, https://youtu.be/AXqhWeUEtQU
jacobian matrix: https://angeloyeo.github.io/2020/07/24/Jacobian.html
