NN basics - ML & Math


To do

Partial Derivative

Jacobian Matrix

https://angeloyeo.github.io/2020/07/24/Jacobian.html
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/jacobian/v/jacobian-prerequisite-knowledge

Hessian Matrix

Second-order partial derivatives

Linear Algebra

Vector & Matrix

Matrix Multiplication & Vector Transformation (function or mapping)

$Ax = b:\quad \begin{bmatrix} 4 & -3 & 1 & 3 \\ 2 & 0 & 5 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 5 \\ 8 \end{bmatrix},\quad x \in \mathbb{R}^4 \to b \in \mathbb{R}^2$

x๋ผ๋Š” vector์— A๋ผ๋Š” ํ–‰๋ ฌ์„ ๊ณฑํ–ˆ์„ ๋•Œ ์ƒˆ๋กœ์šด ๊ณต๊ฐ„์— b๋ผ๋Š” vector๋กœ ํˆฌ์˜ ๋จ. (x๋Š” 4์ฐจ์› ์‹ค์ˆ˜ ๊ณต๊ฐ„์—์„œ 2์ฐจ์› ์‹ค์ˆ˜ ๊ณต๊ฐ„์œผ๋กœ)

matrix-multiplication-transformation

Examples:
matrix-multiplication-transformation-2

ํ–‰๋ ฌ์˜ ๊ณฑ์…‰์€ ๊ทธ ๋Œ€์ƒ์ด ๋ฒกํ„ฐ๋“  ํ–‰๋ ฌ์ด๋“  ๊ณต๊ฐ„์˜ ์„ ํ˜•์  ๋ณ€ํ™˜.

๊ฒฐ๊ตญ ์‹ ๊ฒฝ๋ง์„ ํ†ตํ•ด representation learning์ด ๊ฐ€๋Šฅํ•œ ๊ฒƒ.

Representation Learning

Neural networks and representation learning

It means that data which originally could not be linearly separated has been transformed so that linear separation becomes possible. In other words, during training the neural network changed the data's representation from (x1, x2) to (z1, z2).

representation-learning-vectors

This article used a simple neural network for ease of explanation, but deep and large neural networks are reportedly good at simplifying even fairly complex representations of the training data to the point where they become linearly separable. For this reason, some people call neural networks representation learners.


Representation learning means learning how to transform the representation of data so that it suits a given task — that is, producing a representation that makes the task easier to perform. It learns the structure of the data without heavy feature engineering on the raw data, and can be considered a core element of deep learning architectures. Determining the optimal representation of the input data and finding this latent representation is called representation learning or feature learning.

Size of Vector & Matrix (Distance)

์œ ์‚ฌ๋„๋Š” ๋‚ด์ ์„ ํ†ตํ•ด ์–ด๋Š ์ •๋„ ๊ตฌํ• ์ˆ˜ ์žˆ์Œ.

๊ฑฐ๋ฆฌ(size)๋Š” Norm ์œผ๋กœ ๊ตฌํ•จ.

A norm is a function from a real or complex vector space to the nonnegative real numbers that behaves in certain ways like the distance from the origin.

Vector p-Norm

Norm์€ ๋ฒกํ„ฐ์˜ ๊ฑฐ๋ฆฌ๋ฅผ ์ธก์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•

p-Norm

$\|x\|_p = \left(\sum_{i=1}^{d} |x_i|^p\right)^{1/p}$

Absolute-value Norm (1-norm)

$\|x\|_1 = \sum_{i=1}^{d} |x_i|$

Euclidean Norm (2-norm)

$\|x\|_2 = \left(\sum_{i=1}^{d} x_i^2\right)^{1/2}$

Max Norm

$\|x\|_\infty = \max(|x_1|, \dots, |x_d|)$

For example, when x = (3, −4, 1), the 2-norm is $\|x\|_2 = \left(3^2 + (-4)^2 + 1^2\right)^{1/2} = \sqrt{26} \approx 5.099$.
This is the distance from the origin.

Distances between vectors work the same way. For z = (a, b, c):
$\|x - z\|_2 = \left((3-a)^2 + (-4-b)^2 + (1-c)^2\right)^{1/2}$
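These norms can be computed with numpy.linalg.norm; a small sketch (the concrete z below is an illustrative choice, not from the text):

```python
import numpy as np

x = np.array([3, -4, 1])

l1 = np.linalg.norm(x, ord=1)         # |3| + |-4| + |1| = 8
l2 = np.linalg.norm(x)                # sqrt(3^2 + (-4)^2 + 1^2) = sqrt(26) ~ 5.099
linf = np.linalg.norm(x, ord=np.inf)  # max(|3|, |-4|, |1|) = 4

# distance between two vectors: the 2-norm of their difference
z = np.array([1, 1, 1])               # illustrative z = (a, b, c)
dist = np.linalg.norm(x - z)          # sqrt((3-1)^2 + (-4-1)^2 + (1-1)^2)
```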

type-of-norms

Matrix Frobenius Norm

Measures the size of a matrix.

$\|A\|_F = \left(\sum_{i=1}^{n}\sum_{j=1}^{m} a_{ij}^2\right)^{1/2}$

For example, when $A = \begin{bmatrix} 2 & 1 \\ 6 & 4 \end{bmatrix}$,

$\|A\|_F = \sqrt{2^2 + 1^2 + 6^2 + 4^2} = \sqrt{57} \approx 7.550$
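The same NumPy function handles the Frobenius norm; a sketch with the matrix from the example:

```python
import numpy as np

A = np.array([[2, 1],
              [6, 4]])

# Frobenius norm: square root of the sum of squared entries
fro = np.linalg.norm(A, ord='fro')   # sqrt(4 + 1 + 36 + 16) = sqrt(57) ~ 7.550
```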

Norm์˜ ํ™œ์šฉPermalink

  • ๊ฑฐ๋ฆฌ(ํฌ๊ธฐ)์˜ ๊ฒฝ์šฐ
  • Regularization์˜ ๊ฒฝ์šฐ
    • norm-regularization

    • optimal point์—์„œ ์ƒ๊ธธ์ˆ˜ ์žˆ๋Š” overfitting์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด optimal point์— ๋„๋‹ฌํ•˜์ง€ ๋ชปํ•˜๋„๋ก l2 norm์˜ boundary์•ˆ์—์„œ optimal point์™€์˜ ์ตœ์†Œ๊ฐ’์„ ๊ฐ€์ง€๋„๋ก Gradient์— Norm์„ ์ถ”๊ฐ€ํ•˜๋Š” ํ˜•ํƒœ๋กœ ์‚ฌ์šฉํ•จ.

Perceptron

The dot product of the input x with W gives the raw result; an activation function is applied to this result to produce a + or − output relative to a threshold.

The dot product matters because the scalar it produces is what feeds into the activation function.

Geometrically:
the dot product of the input with the learned W corresponds to projecting the input onto W, and the question is where that projection lies relative to the decision boundary.
If this projected value is greater than or equal to the threshold (T) set by the perceptron's activation function, the output is 1; otherwise it is −1, and this performs the split.

The decision boundary, perpendicular to W, is a decision line in 2-D, a decision plane in 3-D, and a decision hyperplane in 4-D and above.

The decision boundary is the set of x satisfying the linear discriminant function $y(x) = w^T x + w_0 = 0$.
When a bias is used, the bias ($w_0$) helps determine the boundary (same slope, different position).
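A minimal sketch of this forward pass, assuming a sign activation with threshold 0 and a hand-picked (not learned) weight vector:

```python
import numpy as np

def perceptron(x, w, w0=0.0, threshold=0.0):
    # The dot product gives a scalar score (the projection-like value along w);
    # the activation compares it against the threshold.
    score = np.dot(w, x) + w0
    return 1 if score >= threshold else -1

w = np.array([1.0, -1.0])  # decision boundary x1 - x2 = 0, perpendicular to w
p_pos = perceptron(np.array([2.0, 0.5]), w)   # score  1.5 -> +1
p_neg = perceptron(np.array([0.5, 2.0]), w)   # score -1.5 -> -1
```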

Linear Classifier (cs231n)

linear-classifier-1
linear-classifier-2

The image X (a tensor) is flattened, so the 32 × 32 × 3 data becomes a 3072 × 1 vector and is fed in as input.
W is given shape 10 × 3072, so the product of this matrix with the input vector has shape 10 × 1 (think of it as 10 perceptrons).
The choice of 10 targets is arbitrary; for 3 target classes, use a 3 × 1 output.

์•„๋ž˜์˜ slide๋ฅผ ๋ณด๋ฉด,
cat, dog and ship์ธ 3๊ฐœ์˜ ํด๋ž˜์Šค๋กœ ๊ตฌ๋ถ„ํ•˜๋Š” ์˜ˆ์ œ๋ฅผ ๋ณด์—ฌ์ฃผ๊ธฐ ์œ„ํ•ด์„œ W์˜ shape์ด 3 x 4์ž„์„ ์•Œ ์ˆ˜ ์žˆ์Œ.

linear-classifier-3

Think of one perceptron as one linear classifier.

cs231n:
The problem is that the linear classifier is only learning one template for each class.
So, if thereโ€™s sort of variations in how that class might appear, itโ€™s trying to average out all those different variations, all those different appearances, and use just one single template to recognize each of those categories.

linear-classifier-4

Images as points in high dimensional space. The linear classifier is putting in these linear decision boundaries to try to draw linear separation between one category and the rest of the categories.

Inverse Matrix

Used to transform a vector and then bring it back to the original space.

For ML models and methods that rely on transformations of space (such as PCA), having an invertible matrix makes parts of the work more convenient.

In linear algebra, inverse matrices make solving systems of equations more convenient.

  • Important conditions for an invertible matrix:
    • All rows and columns of A are linearly independent.
    • $\det(A) \neq 0$.
    • $A^T A$ is a positive definite symmetric matrix.
    • All eigenvalues of A are nonzero.

Determinant

Geometric meaning: how multiplication by the matrix expands or contracts space.

  • $\det(A) = 0$: space is collapsed along at least one dimension and loses its volume
  • $\det(A) = 1$: volume-preserving transformation, orientation preserved
  • $\det(A) = -1$: volume-preserving transformation, orientation flipped
  • $\det(A) = 5$: volume expands 5×, orientation preserved

It measures how the transformation changes volume relative to the original space.
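A sketch of these cases with 2 × 2 matrices (the particular matrices are illustrative):

```python
import numpy as np

# pure scaling: area grows by 2 * 2.5 = 5, orientation preserved
S = np.array([[2.0, 0.0],
              [0.0, 2.5]])
det_s = np.linalg.det(S)    # 5.0

# reflection (swap the axes): area preserved, orientation flipped
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
det_r = np.linalg.det(R)    # -1.0
```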

Positive Definite Matrices

Used to judge the shape a matrix gives to the space it transforms.
Positive definite matrix: for every nonzero vector x, $x^T A x > 0$
Properties:

  • All eigenvalues are positive
  • The inverse is also positive definite
  • $\det(A) > 0$

Decomposition

Eigendecomposition

See ML-Basics-Linear-Algebra.

Given a square matrix $A \in \mathbb{R}^{n \times n}$, a $\lambda \in \mathbb{C}$ satisfying $Ax = \lambda x,\ x \neq 0$ is called an eigenvalue of A, and the corresponding $x \in \mathbb{C}^n$ an eigenvector.

Eigenvectors: nonzero vectors whose direction does not change after the linear transformation (T).

Eigenvalues: the scale by which an eigenvector's length changes; the vector may be reversed or scaled, but its direction does not change.
They make for interesting basis vectors: basis vectors whose transformation matrices are computationally simpler, or that make for better coordinate systems.

The eig function in the numpy.linalg module computes eigenvalues and eigenvectors.
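A minimal sketch (the matrix is an arbitrary example) checking that each pair satisfies $Av = \lambda v$:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are the eigenvectors

# A only scales each eigenvector by its eigenvalue; the direction is unchanged
ok = np.allclose(A @ eigvecs, eigvecs * eigvals)
```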

the-effect-of-eigenvectors-and-eigenvalues

Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix A with two orthonormal eigenvectors, v(1) with eigenvalue ฮป1 and v(2) with eigenvalue ฮป2. (Left) We plot the set of all unit vectors u โˆˆ R2 as a unit circle. (Right) We plot the set of all points Au. By observing the way that A distorts the unit circle, we can see that it scales space in direction v(i) by ฮปi. Deep Learning. Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

Eigendecomposition can also be used to compute a matrix's inverse, and it is used in PCA.

Eigendecomposition applies only to square matrices.
But in ML there is no guarantee of always having square matrices, so SVD is used.

Singular Value Decomposition (SVD)

Used to compute the (pseudo-)inverse of matrices that are not square.
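A sketch with np.linalg.svd on a non-square matrix (an arbitrary 2 × 3 example), including the pseudo-inverse that the SVD makes possible:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # 2 x 3: eigendecomposition does not apply

U, s, Vt = np.linalg.svd(A)       # A = U @ S @ Vt

# rebuild A: rotate by Vt, scale by the singular values, rotate by U
S = np.zeros_like(A)
S[:len(s), :len(s)] = np.diag(s)
A_rebuilt = U @ S @ Vt

# pseudo-inverse (computed via the SVD) stands in for the inverse
A_pinv = np.linalg.pinv(A)
```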

ํŠน์ด๊ฐ’๋ถ„ํ•ด(SVD)์˜ ๊ธฐํ•˜ํ•™์  ์˜๋ฏธPermalink

ํ–‰๋ ฌ์„ xโ€ฒ=Ax์™€ ๊ฐ™์ด ์ขŒํ‘œ๊ณต๊ฐ„์—์„œ์˜ ์„ ํ˜•๋ณ€ํ™˜์œผ๋กœ ๋ดค์„ ๋•Œ ์ง๊ตํ–‰๋ ฌ(orthogonal matrix)์˜ ๊ธฐํ•˜ํ•™์  ์˜๋ฏธ๋Š” ํšŒ์ „๋ณ€ํ™˜(rotation transformation) ๋˜๋Š” ๋ฐ˜์ „๋œ(reflected) ํšŒ์ „๋ณ€ํ™˜, ๋Œ€๊ฐํ–‰๋ ฌ(diagonal maxtrix)์˜ ๊ธฐํ•˜ํ•™์  ์˜๋ฏธ๋Š” ๊ฐ ์ขŒํ‘œ์„ฑ๋ถ„์œผ๋กœ์˜ ์Šค์ผ€์ผ๋ณ€ํ™˜(scale transformation)์ด๋‹ค.

ํ–‰๋ ฌ R์ด ์ง๊ตํ–‰๋ ฌ(orthogonal matrix)์ด๋ผ๋ฉด RRT=E์ด๋‹ค. ๋”ฐ๋ผ์„œ det(RRT)=det(R)det(RT)=det(R2)=1์ด๋ฏ€๋กœ det(R)๋Š” ํ•ญ์ƒ +1, ๋˜๋Š” -1์ด๋‹ค. ๋งŒ์ผ det(R)=1๋ผ๋ฉด ์ด ์ง๊ตํ–‰๋ ฌ์€ ํšŒ์ „๋ณ€ํ™˜์„ ๋‚˜ํƒ€๋‚ด๊ณ  det(R)=โˆ’1๋ผ๋ฉด ๋’ค์ง‘ํ˜€์ง„(reflected) ํšŒ์ „๋ณ€ํ™˜์„ ๋‚˜ํƒ€๋‚ธ๋‹ค.

๐Ÿ‘‰ ๋”ฐ๋ผ์„œ ์‹ (1),A=UฮฃVT์—์„œ U,V๋Š” ์ง๊ตํ–‰๋ ฌ, ฮฃ๋Š” ๋Œ€๊ฐํ–‰๋ ฌ์ด๋ฏ€๋กœ Ax๋Š” x๋ฅผ ๋จผ์ € VT์— ์˜ํ•ด ํšŒ์ „์‹œํ‚จ ํ›„ ฮฃ๋กœ ์Šค์ผ€์ผ์„ ๋ณ€ํ™”์‹œํ‚ค๊ณ  ๋‹ค์‹œ U๋กœ ํšŒ์ „์‹œํ‚ค๋Š” ๊ฒƒ์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

svd

<Figure 1> Source: Wikipedia

👉 In other words, a matrix's singular values can be interpreted as the scaling factors of the linear transformation that the matrix represents.

👉 Compared with the eigenvalues from eigendecomposition: an eigenvalue is the scale factor along a direction left invariant by the transformation (an eigenvector), whereas a singular value is a scale factor of the transformation itself.

👉 Taking this a bit further: an m×n matrix A is a linear transformation from n-dimensional space to m-dimensional space. If a round shape in n-dimensional space — a circle, a sphere, and so on — is transformed by A, then $V^T$ only rotates it, so the shape is unchanged. Under Σ, however, the shape deforms according to the sizes of the singular values, the way a circle becomes an ellipse or a sphere a rugby ball (for a circle in 2-D, the first singular value σ1 corresponds to the length of the major axis of the resulting ellipse and the second singular value σ2 to the minor axis). The subsequent transformation by U is again a rotation, so it does not affect the shape. If m > n, zeros are appended to extend the dimension before rotating with U; if m < n, some dimensions are dropped (a kind of projection) before rotating. In the end, the shape of a figure transformed by A is determined solely by A's singular values.

Information Theory & Optimization (chapter 3 in Deep Learning book)

Quantifies the similarity between probability distributions.

Basic principle of information theory 👉 the smaller the probability, the more information.
An unlikely event carries a large amount of information.

์ž๊ธฐ ์ •๋ณด (self information)Permalink

์‚ฌ๊ฑด(๋ฉ”์‹œ์ง€, ei)์˜ ์ •๋ณด๋Ÿ‰ (๋กœ๊ทธ ๋ฐ‘์ด 2์ธ ๊ฒฝ์šฐ bit, ์ž์—ฐ์ƒ์ˆ˜์ธ ๊ฒฝ์šฐ nat)
h(ei)=โˆ’log2P(ei) or h(ei)=โˆ’logeP(ei)

์˜ˆ, ๋™์ „ ์•ž๋ฉด์ด ๋‚˜์˜ค๋Š” ์‚ฌ๊ฑด์˜ ์ •๋ณด๋Ÿ‰:
log2(12)=1
1~6์ธ ์ฃผ์‚ฌ์œ„์—์„œ 1์ด ๋‚˜์˜ค๋Š” ์‚ฌ๊ฑด์˜ ์ •๋ณด๋Ÿ‰:
โˆ’log2(16)โ‰ˆ2.58

ํ›„์ž์˜ ์‚ฌ๊ฑด์ด ์ƒ๋Œ€์ ์œผ๋กœ ๋†’์€ ์ •๋ณด๋Ÿ‰์„ ๊ฐ–๋Š”๋‹ค๊ณ  ๋งํ•  ์ˆ˜ ์žˆ์Œ.

Entropy

Entropy expresses the uncertainty of a random variable x,
written as the expected value of the information content over all events.

Discrete distribution: $H(x) = -\sum_{i=1}^{k} P(e_i)\log_2 P(e_i)$ (or with natural log)
Continuous distribution: $H(x) = -\int_{\mathbb{R}} P(x)\log_2 P(x)\,dx$ (or with natural log)

์˜ˆ, ๋™์ „์˜ ์•ž๋’ค์˜ ๋ฐœ์ƒ ํ™•๋ฅ ์ด ๋™์ผํ•œ ๊ฒฝ์šฐ์˜ ์—”ํŠธ๋กœํ”ผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Œ

H(x)=โˆ’โˆ‘i=1,kP(ei)logP(ei)=โˆ’(0.5ร—log20.5+0.5ร—log20.5)=โˆ’log20.5=โˆ’(โˆ’1)

๋™์ „์˜ ๋ฐœ์ƒ ํ™•๋ฅ ์— ๋”ฐ๋ฅธ ์—”ํŠธ๋กœํ”ผ ๋ณ€ํ™” (binary entrophy)
entropy-plot

  • ๊ณตํ‰ํ•œ ๋™์ „์ผ ๊ฒฝ์šฐ ๊ฐ€์žฅ ํฐ ์—”ํŠธ๋กœํ”ผ๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ์Œ
  • ๋™์ „ ๋˜์ง€๊ธฐ ๊ฒฐ๊ณผ ์ „์†ก์—๋Š” ์ตœ๋Œ€ 1๋น„ํŠธ๊ฐ€ ํ•„์š”ํ•จ์„ ์˜๋ฏธ

๋ชจ๋“  ์‚ฌ๊ฑด์ด ๋™์ผํ•œ ํ™•๋ฅ ์„ ๊ฐ€์งˆ ๋•Œ, ์ฆ‰, ๋ถˆํ™•์‹ค์„ฑ์ด ๊ฐ€์žฅ ๋†’์€ ๊ฒฝ์šฐ, ์—”ํŠธ๋กœํ”ผ๊ฐ€ ์ตœ๋Œ€ ๊ฐ’์„ ๊ฐ–๋Š”๋‹ค.

For example, compare the entropy of yut (a traditional Korean four-stick game) with that of a die (1–6).

Yut: $H(x) = -\left(\frac{4}{16}\log_2\frac{4}{16} + \frac{6}{16}\log_2\frac{6}{16} + \frac{4}{16}\log_2\frac{4}{16} + \frac{1}{16}\log_2\frac{1}{16} + \frac{1}{16}\log_2\frac{1}{16}\right) \approx 2.0306$ bits

Die: $H(x) = -\left(\frac{1}{6}\log_2\frac{1}{6} \times 6\right) \approx 2.585$ bits
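A sketch of the comparison, using the probabilities from the example above:

```python
import numpy as np

def entropy(p):
    # expected self-information, in bits
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

fair_coin = entropy([0.5, 0.5])                  # 1 bit: maximum for two outcomes
yut = entropy([4/16, 6/16, 4/16, 1/16, 1/16])    # ~ 2.0306 bits
die = entropy([1/6] * 6)                         # ~ 2.585 bits: uniform -> maximal
```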

๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ (Cross Entropy)Permalink

๋‘ ๊ฐœ์˜ ํ™•๋ฅ  ๋ถ„ํฌ๊ฐ€ ์–ผ๋งˆ ๋งŒํผ์˜ ์ •๋ณด๋ฅผ ๊ณต์œ ํ•˜๋Š” ๊ฐ€.

H(P,Q)=โˆ’โˆ‘xP(x)log2Q(x)=โˆ’โˆ‘i=1,kP(ei)log2Q(ei)

P๋ผ๋Š” ํ™•๋ฅ ๋ถ„ํฌ์— ๋Œ€ํ•ด์„œ Q์˜ ๋ถ„ํฌ์˜ cross entropy

๋”ฅ๋Ÿฌ๋‹์—์„œ output์€ ํ™•๋ฅ ๊ฐ’์ž„.
์†์‹คํ•จ์ˆ˜๋Š” ์ •๋‹ต(label or target)๊ณผ ์˜ˆ์ธก๊ฐ’(prediction)์„ ๋น„๊ตํ•˜๊ธฐ ๋•Œ๋ฌธ์—
์ด๋ฅผ ํ™•๋ฅ ๊ฐ’์œผ๋กœ ๋น„๊ตํ•˜๋Š” ๊ฒƒ์ž„.

label ๊ฐ™์€ ๊ฒฝ์šฐ๋„ OHE๋กœ ํ•˜์ง€๋งŒ ์ด๊ฒƒ๋„ 1๋กœ ๋˜์–ด์žˆ๋Š” ํ™•๋ฅ  ๋ถ„ํฌ๊ณ  output๋„ ํ™•๋ฅ  ๋ถ„ํฌ์ด๊ธฐ ๋•Œ๋ฌธ์—
์ด ์ฒ™๋„๋กœ ๋น„๊ต๊ฐ€๋Šฅํ•œ ๊ฒƒ์ด ๋ฐ”๋กœ CE์ž„.

์œ„์˜ ์‹์„ ์ „๊ฐœํ•˜๋ฉด,

H(P,Q)=โˆ’โˆ‘xP(x)log2Q(x)=โˆ’โˆ‘xP(x)log2P(x)+โˆ‘xP(x)log2P(x)โˆ’โˆ‘xP(x)log2Q(x)=H(P)+โˆ‘xP(x)log2P(x)Q(x)

โˆ’โˆ‘xP(x)log2P(x)+โˆ‘xP(x)log2P(x) ์ด ์‹์„ ์ถ”๊ฐ€ํ•ด์„œ ๋ณ€ํ˜•ํ•œ ๊ฒƒ์ž„.
์ด ์‹์„ ํ•ฉ์น˜๋ฉด

โˆ‘xP(x)log2P(x)โˆ’โˆ‘xP(x)log2Q(x)โ‡’โˆ‘xP(x)log2P(x)Q(x)

์ด๋ ‡๊ฒŒ ๋˜๋Š” ๋ฐ, ์ด ์‹์„ KL Divergence ๋ผ๊ณ  ํ•จ.

Here, if P is the data distribution, it does not change during training.
Since P is fixed, minimizing the cross entropy means adjusting Q.

When cross entropy is used as the loss function,

$H(P,Q) = H(P) + \sum_x P(x)\log_2\frac{P(x)}{Q(x)}$

and since P is fixed in this expression, minimizing cross entropy is equivalent to minimizing the KLD.

In other words, cross entropy is used to minimize the gap between the data distribution P(x) we have and the estimated distribution Q(x).

KLD (Kullback–Leibler divergence)

  • The KLD between P and Q
  • Commonly used to measure the distance between two probability distributions.
$KL(P\|Q) = \sum_x P(x)\log_2\frac{P(x)}{Q(x)}$

The cross entropy of P and Q is the entropy of P plus the KL divergence between P and Q.

Logit

The output of a deep network should be a probability,
but the values coming out of the network range over $(-\infty, \infty)$,

so an activation function (sigmoid) is applied to turn them into probabilities. In that case each value is a probability for its own class (the probabilities over all classes do not sum to 1).
That works for multilabel classification,
but for multiclass classification we want a probability over all classes (summing to 1),
and the softmax function is what plays this role.

In probability, the logit function is $\log_e\left(\frac{p}{1-p}\right)$.
As p approaches 0% the logit goes to $-\infty$, and as p approaches 100% it goes to $\infty$.

logit

That is, if you have a logit (in $(-\infty, \infty)$), you can map that range back to [0, 1].

Since the logit grows as the probability grows, deep learning can use the logit as a score in place of the probability.

The output consists of logits; passing them through a sigmoid gives a probability for each class (not over all classes — the sum can exceed 1).
Softmax uses the logits to obtain a probability over all classes.
It uses $e^{\text{logit}}$.
$e^{\text{logit}}$ also grows as the probability grows.
Using $e^{\text{logit}}$ makes high scores much higher and low scores much lower.

์ด๊ฒƒ์ด ๋”ฅ๋Ÿฌ๋‹์— ๋„์›€์ด ๋˜๋Š” ์ด์œ ๋Š” label์„ OHE ํ•˜๊ธฐ ๋•Œ๋ฌธ์ž„.
ํ™•๋ฅ ๊ฐ’์ด ๋†’์€ ๊ฒƒ๋“ค์„ ์•„์ฃผ ๋†’๊ฒŒ, ๋‚ฎ์€ ๊ฐ’๋“ค์„ ์•„์ฃผ ๋‚ฎ๊ฒŒ ํ•˜๋ฉด OHE์™€ ๋น„์Šทํ•ด์ง€๊ธฐ ๋•Œ๋ฌธ.
๋”ฐ๋ผ์„œ CrossEntropy ํ•™์Šต์— ๋„์›€์ด ๋˜๋Š” ๊ฒƒ์ž„.

Why e in the log?
Mathematically, any valid base would do,
but since the deep network's output is assumed to be a logit, and the logit uses base e,
$e^{\text{logit}} = e^{\log_e\left(\frac{p}{1-p}\right)} = \frac{p}{1-p}$, which makes the computation considerably simpler.
Other bases would give results similar to softmax, but e is used because the logit is the basis of the idea.
e is called Euler's number.

e-logit

logit-softmax

Summary

  • Softmax gives probability distribution over predicted output classes.
  • The final layer in deep learning has logit values which are raw values for prediction by softmax.
  • Logit is the input to softmax
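A sketch of the three functions and how they relate (the particular logit values are arbitrary):

```python
import numpy as np

def logit(p):
    # log-odds: log_e(p / (1 - p)), the inverse of the sigmoid
    return np.log(p / (1 - p))

def sigmoid(z):
    # maps each value in (-inf, inf) to (0, 1); per class, so sums need not be 1
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # e^logit scores normalized so the class probabilities sum to 1
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])   # raw scores from the final layer
per_class = sigmoid(logits)           # independent probabilities (multilabel view)
probs = softmax(logits)               # one distribution over classes (multiclass view)
```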

Partial Derivative

  • The derivative of a function of several variables
  • The vector formed by the partial derivatives is called the gradient.

Jacobian Matrix

The matrix of first-order partial derivatives of a vector-valued function.

Neural-network computation is built from matrix operations and requires differentiation; the Jacobian is what differentiates them.

For example,

$f: \mathbb{R}^2 \mapsto \mathbb{R}^3,\quad f(x) = (2x_1 + x_2^2,\ -x_1^2 + 3x_2,\ 4x_1x_2)^T$

$J = \begin{bmatrix} 2 & 2x_2 \\ -2x_1 & 3 \\ 4x_2 & 4x_1 \end{bmatrix},\qquad J\Big|_{(2,1)} = \begin{bmatrix} 2 & 2 \\ -4 & 3 \\ 4 & 8 \end{bmatrix}$

์ž์ฝ”๋น„์•ˆ์ด ๋งํ•˜๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์€ ๋ฏธ์†Œ ์˜์—ญ์—์„œ โ€˜๋น„์„ ํ˜• ๋ณ€ํ™˜โ€™์„ โ€˜์„ ํ˜• ๋ณ€ํ™˜์œผ๋กœ ๊ทผ์‚ฌโ€™ ์‹œํ‚จ ๊ฒƒ.

Appendix

Positive Definite Matrices

Geometric meaning of Determinant

Reference

Manifold Learning: https://deepinsight.tistory.com/124
Representation Learning: https://ratsgo.github.io/deep%20learning/2017/04/25/representationlearning/
Representation Learning: https://velog.io/@tobigs-gnn1213/7.-Graph-Representation-Learning
cs231n slides: http://cs231n.stanford.edu/slides/2021/
SVD: https://darkpgmr.tistory.com/106
logit: https://youtu.be/K7HTd_Zgr3w
partial derivative: https://youtu.be/ly4S0oi3Yz8, https://youtu.be/GkB4vW16QHI, https://youtu.be/AXqhWeUEtQU
jacobian matrix: https://angeloyeo.github.io/2020/07/24/Jacobian.html
