ML basics - Probability Part 2

10 minute read

A Deep Learning Model's Output Is the Probability of the Variable X

Rules of Probability

Product Rule

\[ p(s,y) = p(y|s)p(s) \]
\[ p(y,s) = p(s|y)p(y) \]

Even though we don't know the intersection probability of R and O,
if we have \(p(y=O \mid s=R)\) and \(p(s=R)\), we can obtain the joint probability \(p(s=R, y=O)\).

Sum Rule

\[ p(X) = \sum_{Y}\ p(X,Y) \]
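As a quick sanity check, here is a small Python sketch (the joint-table values are made up for illustration; s is a box in {R, B}, y is a fruit in {O, A}) that verifies the product and sum rules on a discrete joint distribution:

```python
# A minimal sketch with an assumed joint probability table p(s, y).
import numpy as np

joint = np.array([[0.10, 0.30],   # p(s=R, y=O), p(s=R, y=A)
                  [0.45, 0.15]])  # p(s=B, y=O), p(s=B, y=A)

# Sum rule: p(s) = sum_y p(s, y),  p(y) = sum_s p(s, y)
p_s = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# Product rule: p(s, y) = p(y|s) p(s)
p_y_given_s = joint / p_s[:, None]
print(np.allclose(p_y_given_s * p_s[:, None], joint))  # True
print(p_y_given_s[0, 0] * p_s[0])                      # p(s=R, y=O) recovered: 0.1
```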

Bayes Theorem

  • posterior
  • likelihood
  • prior
  • normalization
  • marginal probability distribution
  • conditional probability distribution

Why is Bayes' theorem useful?

It's all about dependency.
What we learn about intersections in school is that \(P(A \cap B) = P(A)\ P(B)\).
This is only true when the two events A and B are independent.
In reality, however, independence rarely holds; this is where Bayes' theorem comes in.

\[ P(H\mid E) = \frac{P(H)P(E\mid H)}{P(E)} \]

3B1B example (Librarian or Farmer)

Prior state:
The ratio is 20:1 (farmer:librarian).

Is Steve more likely to be a librarian or a farmer?
"Steve is very shy and withdrawn, invariably helpful but with very little interest in people or in the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail."
Before being given the "meek and tidy" description, people describe Steve as more likely a farmer.
After hearing it, however, people view him as more likely a librarian.
According to Kahneman and Tversky, this is irrational.
What's important is that almost no one thinks to incorporate information about the ratio of farmers to librarians into their judgement.

  • Sample 10 librarians and 200 farmers.
  • Hear the "meek and tidy" description.
  • Your instinct: 40% of librarians and 10% of farmers fit the description.
  • Estimate: 4 librarians and 20 farmers fit it.

The probability that a random person who fits this description is a librarian is 4/24, or about 16.7%.
\[ P(Librarian\ given\ Description) = \frac{4}{4+20} \thickapprox 16.7\% \]

Even if you think a librarian is 4 times as likely as a farmer to fit this description (40% vs. 10%), that's not enough to overcome the fact that there are far more farmers (the prior ratio).
The key idea underlying Bayes' theorem is that new evidence does not completely determine your beliefs; it should update your prior beliefs.
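Here is a minimal Python sketch of the calculation above (the 10/200 counts and the 40%/10% rates are the figures quoted in the example):

```python
# Prior ratio 1:20 (librarian:farmer), plus how likely each group fits the description.
n_librarians, n_farmers = 10, 200
p_desc_given_lib, p_desc_given_farm = 0.40, 0.10

fitting_libs = n_librarians * p_desc_given_lib    # ~4 librarians fit
fitting_farms = n_farmers * p_desc_given_farm     # ~20 farmers fit

p_lib_given_desc = fitting_libs / (fitting_libs + fitting_farms)
print(round(p_lib_given_desc, 3))  # 0.167, i.e. ~16.7%
```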

Heart of Bayes' Theorem


When to use Bayes' Theorem


Prior and likelihood


Posterior


Normalization

In probability theory, a normalizing constant is a constant by which an everywhere non-negative function must be multiplied so the area under its graph is 1, e.g., to make it a probability density function or a probability mass function.
In Bayes' theorem, \(P(E)\) plays the role of the normalizing constant.

Marginal Probability Distribution & Conditional Probability Distribution

The term "marginal" comes from the margins of a probability table: marginal probabilities are the row and column totals.
A marginal probability doesn't tell you about a specific joint probability (i.e., an individual cell).


Functions of Random Variables

Joint PDF

Given a k-dimensional random variable vector \(\mathbf x = (x_{1}, \dots, x_{k})\),
define a random variable vector \(\mathbf y = (y_{1}, \dots, y_{k})\).
This can be written as \(\mathbf y = \mathbf g(\mathbf x)\),
e.g. \(y_{1} = g_{1}(x_{1}, \dots, x_{k})\).
If \(\mathbf y = \mathbf g(\mathbf x)\) is a one-to-one transformation,
the joint PDF of \(\mathbf y\) is
\[ P_{\mathbf y}\ (y_{1}, \dots, y_{k}) = P_{\mathbf x}\ (x_{1}, \dots, x_{k})\ |J| \]
where \(J\) is the Jacobian of the inverse transformation \(\mathbf x = \mathbf g^{-1}(\mathbf y)\), i.e. the matrix of all its first-order partial derivatives (Wikipedia), and \(|J|\) is the absolute value of its determinant.
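A minimal one-dimensional numerical check of this change-of-variables formula (the choice \(X \sim Exp(1)\) and \(g(x) = e^{x}\) is an assumed toy example, not from the lecture):

```python
# Change of variables: p_Y(y) = p_X(g^{-1}(y)) * |J|, with |J| the (1-D) Jacobian
# of the inverse map x = g^{-1}(y).
# Assume X ~ Exp(1), p_X(x) = exp(-x), and Y = exp(X), so y >= 1,
# x = ln(y), |dx/dy| = 1/y, and p_Y(y) = exp(-ln y) * (1/y) = 1/y**2.
import numpy as np

def p_y(y):
    x = np.log(y)              # inverse transform g^{-1}(y)
    jac = 1.0 / y              # |dx/dy|
    return np.exp(-x) * jac    # p_X(x) * |J|

y = np.linspace(1.0, 1000.0, 1_000_000)
dy = y[1] - y[0]
print(np.sum(p_y(y)) * dy)     # ~1.0: a valid density (up to the truncated tail)

# Cross-check one probability against direct sampling of X.
x_samples = np.random.default_rng(0).exponential(size=200_000)
print(np.mean(np.exp(x_samples) <= 2.0),   # ~0.5 from samples
      np.sum(p_y(y[y <= 2.0])) * dy)       # ~0.5 from the derived density
```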

Jacobian Matrix

Jacobian Matrix: the matrix of all first-order partial derivatives of a vector-valued function of several variables. When the matrix is square, both the matrix and its determinant are referred to as the Jacobian in the literature. The Jacobian is thus the generalization of the gradient to vector-valued functions of several variables.

First-order partial derivatives: the first-order derivative of a function represents the rate of change of one variable with respect to another. For example, in physics we define the velocity of a body as the rate of change of its location with respect to time; here location is the dependent variable and time is the independent variable, and to find the velocity we compute the first-order derivative of the location. The same idea applies to the rate of dependency of any one variable on another. https://www.toppr.com/guides/fundamentals-of-business-mathematics-and-statistics/calculus/derivative-first-order/

Note that when m=1 the Jacobian is the same as the gradient, because it is a generalization of the gradient. https://math.stackexchange.com/questions/1519367/difference-between-gradient-and-jacobian

The real power here is that no matter how complex the function, as long as we can compute the Jacobian matrix, we can obtain the density function of the newly defined random variable.

The transformation may be nonlinear, but within a small region it behaves like a linear one,
so locally it can be approximated by a linear transformation.

Inverse CDF Technique

In probability and statistics, the quantile function, associated with a probability distribution of a random variable, specifies the value of the random variable such that the probability of the variable being less than or equal to that value equals the given probability. It is also called the percent-point function or inverse cumulative distribution function.

By cumulative distribution function we denote the function that returns the probability of \(X\) being smaller than or equal to some value \(x\),
\[ Pr(X \le x) = F(x) \]
This function takes as input \(x\) and returns values from the \([0,1]\) interval (probabilities); let's denote them as \(p\). The inverse of the cumulative distribution function (or quantile function) tells you what \(x\) would make \(F(x)\) return some value \(p\), \[ F^{-1}(p) = x \]

In short:
CDF: x -> prob
ICDF: prob -> x
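A short Python sketch of the inverse-CDF (inverse transform) technique, assuming an Exponential(\(\lambda\)) target whose inverse CDF is known in closed form:

```python
# For Exponential(lam): F(x) = 1 - exp(-lam*x), so F^{-1}(p) = -ln(1 - p) / lam.
import numpy as np

def sample_exponential(n, lam, rng):
    p = rng.uniform(size=n)          # p ~ Uniform(0, 1), i.e. a probability
    return -np.log(1.0 - p) / lam    # x = F^{-1}(p)

rng = np.random.default_rng(0)
x = sample_exponential(100_000, lam=2.0, rng=rng)
print(x.mean())   # ~0.5, the mean 1/lam of Exponential(2)
```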

Expectations

Discrete Distribution

The expectation under a discrete distribution can be interpreted as a sum of the values \(x_{i}\) weighted by \(p(x_{i})\).
\[ \mathbb E[X] = \sum_{x_{i}\in \omega}x_{i}\ p(x_{i}) \]
or
\[ \mathbb E[f] = \sum_{x}p(x)\ f(x) \]
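For example, a quick Python check of the weighted-sum interpretation on a fair die (an assumed toy example):

```python
# E[X] and E[f(X)] as weighted sums over a discrete distribution.
import numpy as np

x = np.arange(1, 7)        # outcomes x_i of a fair die
p = np.full(6, 1/6)        # p(x_i)

print(np.sum(x * p))       # E[X] = 3.5
print(np.sum(p * x**2))    # E[X^2] = 91/6 ≈ 15.17
```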

Continuous Distribution

\[ \mathbb E[X] = \int\ x\ f(x)\ dx \]
or
\[ \mathbb E[f] = \int\ p(x)\ f(x)\ dx \]

Multiple Random Variables

For two variables x and y:
\[ \mathbb E_{x}{[}f(x,y){]} = \sum_{x}f(x,y)p(x) \]
Here we sum (or integrate) over x, so only the dependence on y remains.
The result is therefore a function of y.

When taking the expectation with respect to both variables, the joint pdf is given, so we sum over both variables. \[ \mathbb E_{x,y}{[}f(x,y){]} = \sum_{y}\sum_{x}f(x,y)p(x,y)\]

Conditional Expectation

\[ \mathbb E_{x}{[}f|y{]} = \sum_{x}f(x)p(x|y) \]

Variance

Variance is the expectation of the squared deviation of a random variable from its mean. The variance is the square of the standard deviation.
\[ var[f] = \mathbb E{[}(f(x)-\mathbb E{[}f(x){]})^{2}{]} = \mathbb E{[}f(x)^{2}{]} - \mathbb E{[}f(x){]} ^{2} \]
\[ var[x] = \mathbb E{[}x^{2}{]} - \mathbb E{[}x{]} ^{2} \]
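A quick numerical check of the identity var[x] = E[x^2] - E[x]^2 on simulated samples (the Normal(3, 2^2) choice is arbitrary):

```python
# Both forms of the variance agree (up to Monte Carlo error).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1_000_000)

lhs = np.mean((x - x.mean())**2)        # E[(x - E[x])^2]
rhs = np.mean(x**2) - np.mean(x)**2     # E[x^2] - E[x]^2
print(lhs, rhs)                         # both ~4.0 (= sigma^2)
```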

Covariance

Covariance is a measure of the joint variability of two random variables. The sign of the covariance shows the tendency in the linear relationship between the variables.

For random variables x and y:

\[\begin{align} cov {[}x,y{]} &= \mathbb E_{x,y}{[}\{x-\mathbb E{[}x{]}\}\{y-\mathbb E{[}y{]}\}{]} \\ &= \mathbb E_{x,y}{[}xy{]} - \mathbb E{[}x{]} \mathbb E{[}y{]} \\ \end{align}\]

For random vectors \(\mathbf x\) and \(\mathbf y\):

\[\begin{align} cov {[}\mathbf x,\mathbf y{]} &= \mathbb E_{\mathbf x,\mathbf y}{[}\{\mathbf x-\mathbb E{[}\mathbf x{]}\}\{\mathbf y^{T} -\mathbb E{[}\mathbf y^{T}{]}\}{]} \\ &= \mathbb E_{\mathbf x,\mathbf y}{[}\mathbf x\mathbf y^{T}{]} - \mathbb E{[}\mathbf x{]} \mathbb E{[}\mathbf y^{T}{]} \\ cov {[}\mathbf x{]} &= cov {[}\mathbf x,\mathbf x{]} \end{align}\]
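A quick numerical check of cov[x, y] = E[xy] - E[x]E[y] (the linear relationship y = 2x + noise is an assumed example):

```python
# The sign and size of the covariance reflect the linear relationship between x and y.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500_000)
y = 2.0 * x + rng.normal(scale=0.5, size=500_000)

cov_direct = np.mean((x - x.mean()) * (y - y.mean()))   # E[(x-E[x])(y-E[y])]
cov_identity = np.mean(x * y) - x.mean() * y.mean()     # E[xy] - E[x]E[y]
print(cov_direct, cov_identity, np.cov(x, y, bias=True)[0, 1])  # all ~2.0
```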

Gaussian Distribution

The normal distribution, or Gaussian distribution, is a continuous probability distribution. It is frequently used to approximate the distribution of collected data, because by the central limit theorem the average of independent random variables tends toward a normal distribution.
Its shape is determined by two parameters, the mean \(\mu\) and the standard deviation \(\sigma\), and the distribution is written \(N(\mu ,\sigma ^{2})\). In particular, the normal distribution \(N(0,1)\) with mean 0 and standard deviation 1 is called the standard normal distribution.

The probability density function (PDF) of the Gaussian distribution:
\[ \mathcal{N}(x|\mu, \sigma ^{2}) = \frac{1}{(2\pi\sigma^{2})^{1/2}} \exp \left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right) \]

Given a probability density function, if integrating it from \(-\infty\) to \(\infty\) gives 1, the density is normalized:
\[ \int_{-\infty}^{\infty}\ \mathcal{N}(x|\mu, \sigma^{2})dx = 1 \]

Expectation

\[ \mathbb E{[}x{]} = \mu \]

Variance

\[ var{[}x{]} = \sigma^{2} \]
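A small numerical sketch (with arbitrary \(\mu\) and \(\sigma\)) checking that the Gaussian PDF above is normalized and has mean \(\mu\) and variance \(\sigma^{2}\):

```python
# Evaluate the Gaussian PDF on a grid and approximate the three integrals.
import numpy as np

mu, sigma = 1.5, 0.8
x = np.linspace(mu - 10*sigma, mu + 10*sigma, 200_001)
dx = x[1] - x[0]
pdf = np.exp(-(x - mu)**2 / (2*sigma**2)) / np.sqrt(2*np.pi*sigma**2)

print(np.sum(pdf) * dx)                 # ~1.0  (normalized)
print(np.sum(x * pdf) * dx)             # ~1.5  (E[x] = mu)
print(np.sum((x - mu)**2 * pdf) * dx)   # ~0.64 (var[x] = sigma^2)
```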

Maximum Likelihood Solution

Let \(\mathbf X = (x_{1},...,x_{N})^{T}\) be N samples drawn independently from the same Gaussian distribution. Then

\[\begin{align} p(\mathbf X|\mu,\sigma^{2}) & = p(x_{1},...,x_{N}|\mu, \sigma^{2}) \\\\ & = \prod_{n=1}^{N}\ \mathcal{N}(x_{n}|\mu, \sigma^{2}) \\\\ & = \mathcal{N}(x_{1}|\mu, \sigma^{2})\times\ \mathcal{N}(x_{2}|\mu, \sigma^{2})\times\ ...\ \times\ \mathcal{N}(x_{N}|\mu, \sigma^{2}) \end{align}\]
We want to find the \(\mu\) and \(\sigma\) that maximize \(p(\mathbf X\mid \mu,\sigma^{2})\).

Maximum Likelihood Solution for \(\mu\)

Take the natural logarithm \(\ln\) and work with the log-likelihood:
\[ \ln\ p(\mathbf X|\mu,\sigma^{2}) = -\frac{1}{2\sigma^{2}} \sum_{n=1}^{N}(x_{n}-\mu)^{2} -\ \frac{N}{2}\ln\sigma^{2} -\ \frac{N}{2}\ln(2\pi) \]

To find the maximum likelihood solution, differentiate the log-likelihood with respect to \(\mu\):
\(\begin{align} \frac{\partial}{\partial\mu}\ln\ p(\mathbf X|\mu,\sigma^{2}) &= \frac{\partial}{\partial\mu} \left\{-\frac{1}{2\sigma^{2}} \sum_{n=1}^{N}(x_{n}-\mu)^{2} - \frac{N}{2}\ln\ \sigma^{2} - \frac{N}{2}\ln(2\pi) \right\} \\\\ &= -\frac{1}{2\sigma^{2}} \sum_{n=1}^{N}2(x_{n}-\mu)\cdot(-1) \\\\ &= \frac{1}{\sigma^{2}}\left\{ \left(\sum_{n=1}^{N}x_{n}\right) - N\mu \right\} \end{align}\)

Setting this derivative to zero gives
\[ \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N}x_{n} \]
*ML: Maximum Likelihood
Intuitively, if we want to recover the Gaussian's mean from data, averaging the N observed values is the natural estimate, and that is exactly what the maximum likelihood solution gives.

Maximum Likelihood Solution for \(\sigma^{2}\)

Let \(y=\sigma^{2}\).
To find the maximum likelihood solution, differentiate the log-likelihood with respect to \(\sigma^{2}\) (i.e., with respect to \(y\)):

\[\begin{align} \frac{\partial}{\partial y}\ln\ p(\mathbf X|\mu_{ML},y) &= \frac{\partial}{\partial y} \left\{-\frac{1}{2}y^{-1} \sum_{n=1}^{N}(x_{n}-\mu_{ML})^{2} - \frac{N}{2}\ln y - \frac{N}{2}\ln(2\pi) \right\} \\\\ &= \frac{1}{2}y^{-2} \sum_{n=1}^{N}(x_{n}-\mu_{ML})^{2} - \frac{N}{2}y^{-1} \\\\ \end{align}\]

Setting this to zero gives
\[ y_{ML} = \sigma_{ML}^{2} = \frac{1}{N} \sum_{n=1}^{N}(x_{n}-\mu_{ML})^{2} \]
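A small sketch confirming the two estimators on simulated data (the true values \(\mu = 2\), \(\sigma^{2} = 9\) are arbitrary):

```python
# The ML estimates of a Gaussian are the sample mean and the (biased) sample variance.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)   # true mu = 2, sigma^2 = 9

mu_ml = x.mean()                       # (1/N) * sum(x_n)
var_ml = np.mean((x - mu_ml)**2)       # (1/N) * sum((x_n - mu_ml)^2)
print(mu_ml, var_ml)                   # ~2.0, ~9.0
```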

Curve Fitting: A Probabilistic Perspective

Training data:
\(\mathbf x = (x_{1}, \dots, x_{N})^{T}, \mathbf t = (t_{1}, \dots, t_{N})\)

The uncertainty about the target value \(\mathbf t\) is expressed as a probability distribution:
\[ p(t|x, \mathbf w, \beta) = \mathcal{N}(t|y(x,\mathbf w),\beta^{-1})\]

For a given input \(x_{0}\),
the function \(y\), parameterized by \(\mathbf w\), gives a single value,
but to express the uncertainty about that value we place a probability distribution on it.
Given \(x_{0}\), the target \(t\) is assumed to follow a Gaussian distribution (shown as the blue curve in the original figure).

\(\mathcal{N}(t\mid y(x,\mathbf w),\beta^{-1})\):
a Gaussian distribution over \(t\) whose mean is the model prediction \(y(x,\mathbf w)\) and whose variance is \(\beta^{-1}\).
Its probability distribution is therefore \(p(t\mid x, \mathbf w, \beta)\).

Maximum Likelihood of \(\mathbf w\)

The parameters are \(\mathbf w\) and \(\beta\); let us find their maximum likelihood solutions.

  • Likelihood Function:
    \[ p(\mathbf t|\mathbf X, \mathbf w, \beta) = \prod_{n=1}^{N}\ \mathcal{N}(t_{n}|y(x_{n}, \mathbf w), \beta^{-1}) \]
  • Log Likelihood Function:
    \[ \ln\ p(\mathbf t|\mathbf X,\mathbf w,\beta) = -\frac{\beta}{2} \sum_{n=1}^{N} (y(x_{n}, \mathbf w)-t_{n})^{2} -\ \frac{N}{2}\ln\beta -\ \frac{N}{2}\ln(2\pi) \]

Maximizing the likelihood function with respect to \(\mathbf w\)
is equivalent to minimizing the sum-of-squares error function:
\[ \sum_{n=1}^{N} (y(x_{n}, \mathbf w)-t_{n})^{2} \]

Maximum Likelihood of \(\beta\)

\[ \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N}(y(x_{n},\mathbf w_{ML})-t_{n})^{2} \]
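A minimal sketch of both results on synthetic data (the \(\sin(2\pi x)\) target, the degree-9 polynomial, and \(\beta = 25\) are assumed for illustration): the least-squares fit gives \(\mathbf w_{ML}\), and the mean squared residual gives \(1/\beta_{ML}\).

```python
# Maximizing the likelihood over w for a polynomial model = ordinary least squares;
# 1/beta_ML is then the mean squared residual of that fit.
import numpy as np

rng = np.random.default_rng(0)
N, true_beta = 500, 25.0                     # noise precision 25 -> noise std 0.2
x = np.linspace(0, 1, N)
t = np.sin(2*np.pi*x) + rng.normal(scale=1/np.sqrt(true_beta), size=N)

M = 9                                        # polynomial degree
Phi = np.vander(x, M + 1, increasing=True)   # design matrix [1, x, ..., x^M]
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # minimizes the sum-of-squares error

residuals = Phi @ w_ml - t
inv_beta_ml = np.mean(residuals**2)          # 1/beta_ML
print(1.0 / inv_beta_ml)                     # ~25, roughly recovering the noise precision
```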

Predictive Distribution

Assumption:
\[ p(t|x, \mathbf w, \beta) = \mathcal{N}(t|y(x,\mathbf w),\beta^{-1})\]

Maximum Likelihood Solution:
\[ p(t|x, \mathbf w_{ML}, \beta_{ML}) = \mathcal{N}(t|y(x,\mathbf w_{ML}),\beta_{ML}^{-1})\]

Bayesian Curve Fitting

Prior

Assume a prior distribution over the parameters \(\mathbf w\):

\[p(\mathbf w\mid \alpha) = \mathcal{N}(\mathbf w\mid 0, \alpha^{-1}I) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2}\ exp \left\{ -\frac{\alpha}{2}\mathbf w^{T}\mathbf w \right\}\]

The posterior of \(\mathbf w\) is proportional to the product of the likelihood and the prior.
\[ p(\mathbf w\mid \mathbf X,\ \mathbf t,\ \alpha,\ \beta) \propto p(\mathbf t\mid \mathbf X, \mathbf w,\beta)p(\mathbf w\mid \alpha)\]

  • \(A \propto B\): A is directly proportional to B

Maximizing this posterior is equivalent to minimizing the function below.

\[\frac{\beta}{2}\sum_{n=1}^{N}\{ y(x_{n}, \mathbf w)-t_{n} \}^{2} + \frac{\alpha}{2}\mathbf w^{T}\mathbf w\]

This is the same as minimizing the regularized sum-of-squares error function.
Regularization:

\(\begin{align}\tilde{E}(w) = \frac{1}{2}(\ \sum_{n=1}^{N} \{y(x_{n},w)-t_{n}\} ^{2} + \lambda\sum_{n=1}^{N}w_{n}^{2}\ )\end{align}\)
Here \(\lambda\) pushes the higher-order parameters toward values close to 0, regularizing the fit toward a reasonably low-order function.
The larger \(\lambda\) is, the smaller \(\lVert \mathbf{w} \rVert^{2}\) becomes. https://marquis08.github.io/devcourse2/probability/mathjax/ML-basics-Probability-1/
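A minimal sketch of this equivalence (the toy data and the \(\alpha\), \(\beta\) values are assumed): the MAP solution is ridge-regularized least squares with \(\lambda = \alpha/\beta\), and its weights stay much smaller than the unregularized maximum likelihood weights.

```python
# w_MAP = (lambda*I + Phi^T Phi)^{-1} Phi^T t  with  lambda = alpha / beta.
import numpy as np

rng = np.random.default_rng(0)
N, alpha, beta = 10, 5e-3, 11.1              # assumed toy values
x = np.linspace(0, 1, N)
t = np.sin(2*np.pi*x) + rng.normal(scale=1/np.sqrt(beta), size=N)

M = 9                                        # high-degree polynomial, prone to overfitting
Phi = np.vander(x, M + 1, increasing=True)
lam = alpha / beta

w_map = np.linalg.solve(lam*np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)
w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]
print(np.linalg.norm(w_ml), np.linalg.norm(w_map))  # MAP weights are far smaller
```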

Insight:
When a Gaussian distribution over t is assumed, the \(\mathbf w\) that maximizes the likelihood is exactly the solution of the sum-of-squares error problem.
Furthermore, when a Gaussian distribution is also assumed over \(\mathbf w\), maximizing the posterior is the same as finding \(\mathbf w\) with the regularization term included.

Bayesian Curve Fitting

So far, to obtain the predictive distribution of \(t\), we have still relied on a point estimate of \(\mathbf w\). A fully Bayesian approach uses only the basic rules of probability, integrating over the distribution of \(\mathbf w\), to derive the predictive distribution of \(t\):
\[ p(t\mid x, \mathbf X, \mathbf t) = \int\ p(t\mid x,\mathbf w)p(\mathbf w\mid \mathbf X, \mathbf t)d\mathbf w \]

This predictive distribution is also Gaussian, and its mean and covariance can be computed in closed form.
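A minimal sketch of the closed-form predictive mean and variance, following the standard result for this linear-Gaussian setting (the polynomial basis and the \(\alpha\), \(\beta\) values are assumed toy choices): \(S^{-1} = \alpha I + \beta\sum_{n}\phi(x_{n})\phi(x_{n})^{T}\), predictive mean \(\beta\,\phi(x)^{T}S\sum_{n}\phi(x_{n})t_{n}\), predictive variance \(\beta^{-1} + \phi(x)^{T}S\phi(x)\).

```python
# Bayesian curve fitting: predictive mean and variance at a new input, in closed form.
import numpy as np

rng = np.random.default_rng(0)
N, M, alpha, beta = 10, 9, 5e-3, 11.1
x_train = np.linspace(0, 1, N)
t_train = np.sin(2*np.pi*x_train) + rng.normal(scale=1/np.sqrt(beta), size=N)

def phi(x):                                   # polynomial basis functions phi_j(x) = x^j
    return np.array([x**j for j in range(M + 1)])

Phi = np.stack([phi(xn) for xn in x_train])   # N x (M+1) design matrix
S = np.linalg.inv(alpha*np.eye(M + 1) + beta * Phi.T @ Phi)

def predict(x_new):
    mean = beta * phi(x_new) @ S @ Phi.T @ t_train
    var = 1.0/beta + phi(x_new) @ S @ phi(x_new)
    return mean, var

print(predict(0.25))   # predictive mean roughly near sin(2*pi*0.25) = 1, plus its variance
```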

Appendix

MathJax

\(\approx\):

\approx

\(\thickapprox\):

\thickapprox

\({[}escape-bracket{]}\):

$${[}escape-bracket{]}$$:

\(\mathbf y\):

$$\mathbf y$$

\(\left[\frac ab,c\right]\):

$$\left[\frac ab,c\right]$$

Variable-size delimiters with escaped curly braces
\(\left\{-\frac{1}{2\sigma^{2}} \sum_{n=1}^{N}(x_{n}-\mu)^{2} \right\}\):

$$\left\{-\frac{1}{2\sigma^{2}} \sum_{n=1}^{N}(x_{n}-\mu)^{2} \right\}$$ 

Variable-size delimiters with brackets
\(\left[-\frac{1}{2\sigma^{2}} \sum_{n=1}^{N}(x_{n}-\mu)^{2} \right]\):

$$\left[-\frac{1}{2\sigma^{2}} \sum_{n=1}^{N}(x_{n}-\mu)^{2} \right]$$

\(\cdot\):

$$\cdot$$

Mutually Exclusive Vs. Independent

Mutually exclusive events cannot happen at the same time. For example: when tossing a coin, the result can either be heads or tails but cannot be both.
Events are independent if the occurrence of one event does not influence (and is not influenced by) the occurrence of the other(s). For example: when tossing two coins, the result of one flip does not affect the result of the other.

Mutually Exclusive:
\(\begin{align} P(A \cap B) & = 0 \\ P(A \cup B) & = P(A) + P(B) \\ P(A|B) & = 0 \\ p(A|\neg B) & = \frac{P(A)}{1-P(B)} \\ \end{align}\)

Independent:
\(\begin{align} P(A \cap B) & = P(A)\ P(B) \\ P(A \cup B) & = P(A) + P(B) - P(A)\ P(B) \\ P(A|B) & = P(A) \\ p(A|\neg B) & = P(A) \\ \end{align}\)
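A tiny Python check of the two sets of identities above (one die roll for the mutually exclusive case, two coin flips for the independent case):

```python
# Exact arithmetic with fractions to mirror the formulas above.
from fractions import Fraction as F

# Mutually exclusive: A = "die shows 1", B = "die shows 2" on a single roll.
p_a, p_b = F(1, 6), F(1, 6)
p_a_and_b = F(0)                       # P(A ∩ B) = 0: they cannot both happen
p_a_or_b = p_a + p_b                   # P(A ∪ B) = P(A) + P(B) = 1/3

# Independent: A = "first coin is heads", B = "second coin is heads".
q_a, q_b = F(1, 2), F(1, 2)
q_a_and_b = q_a * q_b                  # P(A ∩ B) = P(A) P(B) = 1/4
q_a_or_b = q_a + q_b - q_a * q_b       # 3/4
q_a_given_b = q_a_and_b / q_b          # P(A|B) = P(A) = 1/2

print(p_a_and_b, p_a_or_b, q_a_and_b, q_a_or_b, q_a_given_b)
```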

Gradient, Jacobian, Hessian

Gradient: Is a multi-variable generalization of the derivative. While a derivative can be defined on functions of a single variable, for scalar functions of several variables, the gradient takes its place. The gradient is a vector-valued function, as opposed to a derivative, which is scalar-valued.
Jacobian Matrix: the matrix of all first-order partial derivatives of a vector-valued function of several variables. When the matrix is square, both the matrix and its determinant are referred to as the Jacobian in the literature. The Jacobian is thus the generalization of the gradient to vector-valued functions of several variables.
Hessian Matrix: is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field. It describes the local curvature of a function of many variables. It is the Jacobian of the gradient of a scalar function of several variables.

References

Product Rule & Sum Rule: https://www.youtube.com/watch?v=u7P9hg1dVDU
MathJax Tex command: https://www.onemathematicalcat.org/MathJaxDocumentation/MathJaxKorean/TeXSyntax_ko.html

mutually exclusive: https://math.stackexchange.com/questions/941150/what-is-the-difference-between-independent-and-mutually-exclusive-events
Probability Theory: https://blog.naver.com/rokpilot/220603888519
Gradient, Jacobian, Hessian: https://www.quora.com/What-is-the-difference-between-the-Jacobian-Hessian-and-the-gradient-in-machine-learning
Jacobian and gradient: https://math.stackexchange.com/questions/1519367/difference-between-gradient-and-jacobian
Jacobian Matrix: https://angeloyeo.github.io/2020/07/24/Jacobian.html
inverse CDF: https://www.youtube.com/watch?v=-Fg6KEXIlVU
inverse CDF: https://stats.stackexchange.com/questions/212813/help-me-understand-the-quantile-inverse-cdf-function
Expectations: https://datascienceschool.net/02%20mathematics/07.02%20%EA%B8%B0%EB%8C%93%EA%B0%92%EA%B3%BC%20%ED%99%95%EB%A5%A0%EB%B3%80%EC%88%98%EC%9D%98%20%EB%B3%80%ED%99%98.html#id9
