NN basics - DL Optimization
Objective Function
Objective Function vs. Loss Function (alias: Cost Function)
"The function we want to minimize or maximize is called the objective function, or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms." ("Deep Learning" - Ian Goodfellow, Yoshua Bengio, Aaron Courville)
A loss function is usually defined on a single data point, while the cost function usually refers to the loss averaged over the whole training set.
MSE
\(e = \frac{1}{2}\Vert \boldsymbol{y} - \boldsymbol{o}\Vert_{2}^{2}\)
The larger the error, the larger \(e\) becomes, so it acts as a penalty.
MSE bad for NN (with sigmoid)
- During training, the network corrects its weights and biases in the direction that reduces the error, but even when a large correction is needed, the gradient update can be small.
- Reason: even as the error grows, the gradient of the logistic sigmoid vanishes as its input gets larger (saturation). - MSE remains a good fit for linear regression.
What if the NN uses an activation function other than sigmoid or tanh? Is MSE still considered useful or not?
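A minimal numpy sketch (my addition, not from the original notes) of the saturation argument: even when the prediction is badly wrong, the sigmoid derivative, and therefore the MSE gradient with respect to the pre-activation, is tiny.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# the target is 0, but the pre-activation z is large, so the output is confidently wrong
y, z = 0.0, 5.0
o = sigmoid(z)                    # ~0.993
grad_z = (o - y) * o * (1 - o)    # d/dz of 0.5*(o - y)^2, ~0.0066
print(o, grad_z)                  # large error, near-zero gradient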
CrossEntropy
A function that compares the difference between two probability distributions.
Softmax & Negative Log Likelihood (NLL)
Softmax:
The outputs sum to 1, so they mimic a probability distribution.
NLL uses only the predicted probability at the target class index.
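A minimal numpy sketch (assumed, not from the notes) of softmax followed by the negative log likelihood for a single example:

import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
target = 0                        # index of the true class
probs = softmax(logits)           # sums to 1, so it mimics a probability distribution
nll = -np.log(probs[target])      # only the probability at the target index is used
print(probs, nll)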
Softmax Vs. Hinge Loss
Pre-Processing
Normalization
- If the feature values are all positive (or all negative), or if the features have very different scales, convergence can slow down.
- When every input x is positive, all the weights feeding a node are updated in the same direction: positive when the back-propagated error term is positive, negative when it is negative. Because whole groups of weights increase or decrease together, the path toward the optimum zigzags, which slows convergence.
- Also, when features have different scales (for example, a height feature with values like 1.75 m and a weight feature with values like 60 kg), each feature effectively learns at a different speed, so the overall training speed drops.
Normalization is applied to each feature independently.
- Standardization: use the normal distribution, i.e. subtract the mean and divide by the standard deviation (zero mean, unit variance).
- Min-Max scaling: rescale each feature into a fixed range such as [0, 1].
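A minimal numpy sketch (my own example values) of both options, applied independently per feature, i.e. per column:

import numpy as np

X = np.array([[1.75, 60.0],
              [1.62, 52.0],
              [1.80, 75.0]])      # e.g. height in meters, weight in kg

# standardization: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# min-max scaling: each feature rescaled into [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_std)
print(X_minmax)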
One Hot Encoding
Weight Initialization
Symmetric Weight
When some machine learning models have weights all initialized to the same value, it can be difficult or impossible for the weights to differ as the model is trained. This is the โsymmetryโ. Initializing the model to small random values breaks the symmetry and allows different weights to learn independently of each other.
If the weights are initialized to the same value, two nodes end up doing exactly the same work, a redundancy problem. Symmetry breaking is therefore needed to avoid this.
Random Initialization
The random values can be drawn from a Gaussian distribution or a uniform distribution; experiments reportedly show little difference between the two.
Biases are usually initialized to 0.
What matters is the range of the random values: if the weights are extremely close to 0, the gradients become very small and learning slows down; if they are too large, there is a risk of overfitting.
Visualization
- If the initialization is too small, all activations collapse to 0; the gradients are also 0, so no learning happens.
- If the initialization is too large, the activations saturate, the gradients become 0, and again no learning happens.
- With a suitable initialization, the activations stay well distributed across all layers.
Additionally,
- conclusion on Weight Initialization
- sigmoid, tanh : Xavier
- ReLU : He
Weight initialization visualization code:
import numpy as np
import matplotlib.pyplot as plt

def weight_init(types, activation):
    # unit gaussian input data: 1000 examples, 500-D
    D = np.random.randn(1000, 500)
    hidden_layer_sizes = [500] * 10
    if activation == "relu":
        nonlinearities = ['relu'] * len(hidden_layer_sizes)
    else:
        nonlinearities = ['tanh'] * len(hidden_layer_sizes)
    act = {'relu': lambda x: np.maximum(0, x), 'tanh': lambda x: np.tanh(x)}
    Hs = {}
    for i in range(len(hidden_layer_sizes)):
        X = D if i == 0 else Hs[i - 1]  # input at this layer
        fan_in = X.shape[1]
        fan_out = hidden_layer_sizes[i]
        if types == "small":
            W = np.random.randn(fan_in, fan_out) * 0.01                 # small random values
        elif types == "large":
            W = np.random.randn(fan_in, fan_out) * 1.0                  # large random values
        elif types == "xavier":
            W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)      # Xavier initialization
        elif types == "he":
            W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)  # He initialization
        H = np.dot(X, W)               # matrix multiply
        H = act[nonlinearities[i]](H)  # nonlinearity
        Hs[i] = H                      # cache result on this layer

    print('input layer had mean', np.mean(D), 'and std', np.std(D))
    # look at the distribution of activations at each layer
    layer_means = [np.mean(H) for i, H in Hs.items()]
    layer_stds = [np.std(H) for i, H in Hs.items()]
    for i, H in Hs.items():
        print('hidden layer', i + 1, 'had mean', layer_means[i], 'and std', layer_stds[i])

    # plot the layer-wise means and stds
    plt.figure()
    plt.subplot(1, 2, 1)
    plt.title("layer mean")
    plt.plot(range(10), layer_means, 'ob-')
    plt.subplot(1, 2, 2)
    plt.title("layer std")
    plt.plot(range(10), layer_stds, 'or-')
    plt.show()

    # plot the activation histogram of each layer
    plt.figure(figsize=(30, 10))
    for i, H in Hs.items():
        plt.subplot(1, len(Hs), i + 1)
        plt.hist(H.ravel(), 30, range=(-1, 1))
        plt.title("{}_{}".format(types, activation), fontsize=15)
    save_path = "images/{}_{}.png".format(types, activation)  # assumes an images/ directory exists
    plt.savefig(save_path)
    print("saved at : {}".format(save_path))
    plt.show()

weight_init('xavier', 'tanh')  # types: small, large, xavier, he / activation: tanh, relu
Momentum (Problem with SGD)
Momentum remembers how the parameters have been moving and keeps moving some extra amount in that direction.
It helps smooth out gradient noise.
- \(\rho\) determines the momentum; \(\rho = 0\) means no momentum at all. It acts as a decay on the accumulated velocity (cs231n describes it as friction).
Jitter along steep direction
local minima or saddle point
- Saddle points much more common in high dimension.
Minibatch gradients can be noisy
Nesterov Momentum
Vanilla SGD with momentum:
compute the gradient at the current position, then add the velocity and move
Nesterov Momentum:
add the velocity first, compute the gradient at that look-ahead point, then move
If your velocity was actually a little bit wrong, it lets you incorporate gradient information from a little bit larger parts of the objective landscape. (i.e., even when the velocity from momentum points slightly the wrong way, the gradient taken at the look-ahead point still provides fairly useful information about where the loss landscape heads toward the optimum.)
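A minimal sketch of both update rules (variable names and the toy objective are mine, written in the velocity form used by cs231n):

import numpy as np

def sgd_momentum(w, grad_fn, lr=1e-2, rho=0.9, steps=100):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = rho * v - lr * grad_fn(w)            # gradient at the current point
        w = w + v
    return w

def sgd_nesterov(w, grad_fn, lr=1e-2, rho=0.9, steps=100):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = rho * v - lr * grad_fn(w + rho * v)  # gradient at the look-ahead point
        w = w + v
    return w

# toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is w
w0 = np.array([5.0, -3.0])
print(sgd_momentum(w0, lambda w: w))
print(sgd_nesterov(w0, lambda w: w))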
Adaptive Learning Rates
A strategy that increases the step size when the previous gradient and the current gradient have the same sign, and decreases it when the signs differ.
AdaGrad (Adaptive Gradient)
Instead of a velocity term, it accumulates the squared gradients.
Suppose there are two coordinates: one with a high gradient and one with a small gradient. Because each step divides by the square root of the accumulated sum of squared gradients, the dimension with the small gradient is divided by a small number, which accelerates the movement along that long, flat dimension. Along the other dimension, where the gradients tend to be very large, we divide by a large number, so progress slows down.
- Past gradients and recent gradients get equal weight.
- The accumulator \(r\) keeps growing, so the effective step size shrinks and can eventually block convergence.
- In the convex case, this slow-down is fine: it converges.
- In the non-convex case, it can get stuck at a saddle point and make no further progress.
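A minimal AdaGrad sketch (cs231n-style pseudocode made runnable; the names and the toy gradient are mine):

import numpy as np

def adagrad(w, grad_fn, lr=1e-1, eps=1e-7, steps=100):
    r = np.zeros_like(w)                      # running sum of squared gradients
    for _ in range(steps):
        dw = grad_fn(w)
        r += dw * dw                          # past and recent gradients weighted equally
        w = w - lr * dw / (np.sqrt(r) + eps)  # per-dimension step size
    return w

# toy usage: elongated quadratic, steep along x, flat along y
grad = lambda w: np.array([10.0 * w[0], 0.1 * w[1]])
print(adagrad(np.array([1.0, 1.0]), grad))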
RMSProp (variation of AdaGrad)
Uses a weighted moving average (WMA) of the squared gradients: recent gradients get more weight, so the accumulator does not grow without bound.
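A matching RMSProp sketch; relative to AdaGrad only the accumulator line changes (decay_rate = 0.9 is a typical value, assumed here):

import numpy as np

def rmsprop(w, grad_fn, lr=1e-2, decay_rate=0.9, eps=1e-7, steps=100):
    r = np.zeros_like(w)
    for _ in range(steps):
        dw = grad_fn(w)
        r = decay_rate * r + (1 - decay_rate) * dw * dw  # leaky average, favors recent gradients
        w = w - lr * dw / (np.sqrt(r) + eps)
    return w

grad = lambda w: np.array([10.0 * w[0], 0.1 * w[1]])
print(rmsprop(np.array([1.0, 1.0]), grad))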
Adam (Adaptive Moment Estimation: RMSProp with momentum)
Q: What happens at the first timestep?
At the very first timestep, second_moment is initialized to zero. After one update of second_moment, and with beta2 (the decay rate of second_moment) typically something like 0.9 or 0.99, i.e. very close to 1, second_moment is still very close to zero. When we then make the update step and divide by second_moment, we are dividing by a very small number, so we take a very large step at the beginning. This very large step is not really due to the geometry of the problem; it is an artifact of the fact that the second_moment estimate was initialized to zero.
Q: If your first moment is also very small, then youโre multiplying by small and dividing by square root of small squared, whatโs going to happen?
They might cancel each other out and this might be okay. Sometimes these cancel each other out. But, sometimes this ends up in taking very large steps right at the beginning. That can be quite bad. Maybe you initialize a little bit poorly, and you take a very large step. Now your initialization is completely messed up, and youโre in a very bad part of the objective landscape and you just canโt converge there.
Bias correction term: Adam rescales the moment estimates as \(\hat{m}_t = m_t / (1 - \beta_1^t)\) and \(\hat{v}_t = v_t / (1 - \beta_2^t)\), which compensates for the zero initialization during the first timesteps.
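A minimal full Adam sketch with the bias correction included (the hyperparameter values are the commonly used defaults, assumed here):

import numpy as np

def adam(w, grad_fn, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    m = np.zeros_like(w)                       # first moment (momentum-like)
    v = np.zeros_like(w)                       # second moment (RMSProp-like)
    for t in range(1, steps + 1):
        dw = grad_fn(w)
        m = beta1 * m + (1 - beta1) * dw
        v = beta2 * v + (1 - beta2) * dw * dw
        m_hat = m / (1 - beta1 ** t)           # bias correction for the first moment
        v_hat = v / (1 - beta2 ** t)           # bias correction for the second moment
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

grad = lambda w: np.array([10.0 * w[0], 0.1 * w[1]])
print(adam(np.array([1.0, 1.0]), grad))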
Activation Function
Batch Normalization
Covariate Shift: the training set and the test set have different distributions.
Internal Covariate Shift: as training proceeds, the input distribution seen by each subsequent layer keeps changing, which can hinder learning.
Batch normalization applies normalization layer by layer.
A layer consists of a linear operation followed by a nonlinear one; where should the normalization go?
- Apply it right after the linear operation (i.e., before the activation).
- In general it is applied after an fc or conv layer, before the activation.
It is practical to apply it per minibatch.
BN Process
- Compute the mean and variance over the minibatch (\(\mu, \sigma\)).
- Normalize using \(\mu, \sigma\).
- Fine-tune with a scale and a shift (\(\gamma, \beta\); both are learned during training).
- The learned scale and shift let each part of the network adjust, or even undo, the normalization, which adds flexibility.
BN inference
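At inference time BN does not use the statistics of the current batch; it uses running averages of the training-time minibatch statistics. A minimal sketch of both modes (the variable and function names are mine):

import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    if training:
        mu = x.mean(axis=0)                        # minibatch mean, per feature
        var = x.var(axis=0)                        # minibatch variance, per feature
        x_hat = (x - mu) / np.sqrt(var + eps)      # normalize
        # keep running statistics for inference
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    out = gamma * x_hat + beta                     # learned scale and shift
    return out, running_mean, running_var

# toy usage: a minibatch of 4 examples with 3 features
x = np.random.randn(4, 3) * 2.0 + 1.0
gamma, beta = np.ones(3), np.zeros(3)
out, rm, rv = batchnorm_forward(x, gamma, beta, np.zeros(3), np.ones(3))
print(out.mean(axis=0), out.std(axis=0))           # roughly 0 and 1 per feature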
Advantages
- Improves gradient flow
- Allows larger learning rates
- Reduces dependence on weight initialization
- Acts as a regularizer, reducing the need for dropout
Types of Normalizations
Regularization
Methods
Add extra Term to Loss
- The regularization term is independent of the training set; it corresponds to prior knowledge about the data-generating process.
- Because it keeps the parameters small, it also limits the capacity of the model.
- The L2 norm and L1 norm are the usual choices for keeping the weights small.
L2 Norm
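A minimal sketch of L2 regularization (weight decay) added to a loss and its gradient; lambda_reg and the toy numbers are assumptions:

import numpy as np

def l2_regularized(w, data_loss, data_grad, lambda_reg=1e-4):
    # total loss = data loss + (lambda/2) * ||w||^2
    loss = data_loss + 0.5 * lambda_reg * np.sum(w * w)
    grad = data_grad + lambda_reg * w   # the penalty pulls every weight toward 0
    return loss, grad

w = np.array([0.5, -2.0, 3.0])
loss, grad = l2_regularized(w, data_loss=1.25, data_grad=np.array([0.1, -0.2, 0.3]))
print(loss, grad)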
Early Stopping
Data Augmentation
Drop Out
Used to reduce the overfitting that occurs in fully connected (fc) layers.
Co-Adaptation
The weights of the network co-adapt to one another on the training data (a property inherent to the neural network structure).
Randomly dropping units during training regularizes this co-adaptation.
In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing. Therefore, units may change in a way that they fix up the mistakes of the other units. This may lead to complex co-adaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit, dropout prevents co-adaptation by making the presence of other hidden units unreliable. Therefore, a hidden unit cannot rely on other specific units to correct its mistakes. It must perform well in a wide variety of different contexts provided by the other hidden units. (paper)
Figure 7a shows features learned by an autoencoder on MNIST with a single hidden layer of 256 rectified linear units without dropout. Figure 7b shows the features learned by an identical autoencoder which used dropout in the hidden layer with p = 0.5. Both autoencoders had similar test reconstruction errors. However, it is apparent that the features shown in Figure 7a have co-adapted in order to produce good reconstructions. Each hidden unit on its own does not seem to be detecting a meaningful feature. On the other hand, in Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. This shows that dropout does break up co-adaptations, which is probably the main reason why it leads to lower generalization errors.
Sparsity
The activations of the hidden neurons become sparser.
As the activations become sparse, only the meaningful parts remain prominent (if the activations were spread out evenly rather than sparse, nothing would stand out).
We found that as a side-effect of doing dropout, the activations of the hidden units become sparse, even when no sparsity inducing regularizers are present. Thus, dropout automatically leads to sparse representations. In a good sparse model, there should only be a few highly activated units for any data case.
Comparing the histograms of activations we can see that fewer hidden units have high activations in Figure 8b compared to Figure 8a, as seen by the significant mass away from zero for the net that does not use dropout. (The 8b histogram peaks at around 10k activations near zero, compared to about 1.2k in 8a.)
Ensemble Effects
Dropout is training a large ensemble of models (that share parameters).
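A minimal inverted-dropout sketch (the common formulation that rescales at train time; p and the mask variable are mine):

import numpy as np

def dropout_forward(x, p=0.5, training=True):
    # inverted dropout: scale at train time so nothing changes at test time
    if not training:
        return x
    mask = (np.random.rand(*x.shape) < p) / p   # keep each unit with probability p
    return x * mask

h = np.random.randn(4, 8)
print(dropout_forward(h, p=0.5, training=True))   # roughly half the units zeroed, the rest scaled by 1/p
print(dropout_forward(h, training=False))         # identity at test time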
Ensemble Method
Combine several different models to reduce the generalization error.
Bagging (Bootstrap Aggregating)
Build different training sets by repeatedly sampling (with replacement) from the original training set.
Boosting
The \(i+1\)-th predictor is trained to recognize the samples that the \(i\)-th predictor got wrong, so the predictors are built sequentially.
Hyper Parameter Tuning
There are two kinds of parameters:
- Internal parameters, or weights
- For a neural network, the weights themselves
- Hyperparameters
- e.g. filter size, step size, learning rate
Search strategies
- Grid Search
- Random Search
- Sampling in log space
Curse of dimensionality
- Random search tends to work better.
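A minimal sketch of random search with the learning rate and regularization strength sampled in log space (the ranges and the train_and_eval stub are assumptions, not from the notes):

import numpy as np

def train_and_eval(lr, reg):
    # placeholder for a real training run; should return a validation score
    return np.random.rand()

best_params, best_acc = None, -1.0
for _ in range(20):
    lr = 10 ** np.random.uniform(-5, -1)    # sample the learning rate in log space
    reg = 10 ** np.random.uniform(-5, -1)   # same for the regularization strength
    acc = train_and_eval(lr, reg)
    if acc > best_acc:
        best_params, best_acc = (lr, reg), acc
print("best (lr, reg):", best_params, "val acc:", best_acc)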
Methods using second derivatives
Newton's method
- Uses second-derivative information
- First-order optimization
- Uses the gradient to build a linear approximation
- Moves toward the minimum of that approximation
- Second-order optimization
- Uses the gradient and the Hessian to build a quadratic approximation
- Jumps toward the minimum of that quadratic approximation (both update rules are written out after this list)
- Drawbacks of second-order methods
- Very unstable behavior near inflection points (f''=0)
- Near an inflection point f'' is close to 0, so the step size blows up and the iterate can diverge to a completely wrong place
- When choosing the direction, it does not distinguish between maxima and minima
- It may converge toward a local maximum
- Taylor series
- Expresses a function as the limit of polynomials (a power series) whose coefficients are the function's derivatives at a particular point of the domain.
- The objective functions used in machine learning are more complex than a quadratic, so the optimum cannot be reached in a single step
- The Hessian matrix has to be computed
- Which requires an excessive amount of computation
- The conjugate gradient method was proposed as an alternative
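To make the contrast concrete, these are the two update rules the bullets above refer to, written in standard form (not quoted from the notes):

\[
\text{1st order (gradient descent):}\quad \theta_{t+1} = \theta_t - \eta\, \nabla f(\theta_t)
\]

\[
\text{2nd order (Newton's method):}\quad \theta_{t+1} = \theta_t - H^{-1}\, \nabla f(\theta_t), \qquad H = \nabla^2 f(\theta_t)
\]

Quasi-Newton methods such as BFGS/L-BFGS replace \(H^{-1}\) with an approximation \(M\) built up incrementally, which avoids computing the full Hessian.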
Conjugate gradient method
Reaches the solution faster by reusing information from the previous search direction.
Quasi-Newton methods
- Gradient descent: poor convergence efficiency
- Newton's method: computing the Hessian is too expensive
- Use a matrix M that approximates the inverse of the Hessian H
- BFGS, which builds up the Hessian approximation incrementally, is widely used
- In machine learning, L-BFGS, which needs little memory to store M, is used most often
- If full-batch updates are feasible, L-BFGS is worth considering
Appendix
Reference
CNN cs231n lecture 8, Optimization: http://cs231n.stanford.edu/slides/2021/lecture_8.pdf
CNN cs231n lecture 3, Loss Functions and Optimization (SGD): http://cs231n.stanford.edu/slides/2021/lecture_3.pdf
CNN cs231n lecture 7, Activation Functions, Data Preprocessing, Weight Initialization, Batch Normalization, Transfer Learning: http://cs231n.stanford.edu/slides/2021/lecture_7.pdf
Weight initialization: https://silver-g-0114.tistory.com/79
SGD momentum: https://youtu.be/_JB0AO7QxSA
ML/DL performance tips: https://wordbe.tistory.com/entry/MLDL-%EC%84%B1%EB%8A%A5-%ED%96%A5%EC%83%81%EC%9D%84-%EC%9C%84%ED%95%9C-%EC%9A%94%EB%A0%B9
Dropout: https://blog.naver.com/PostView.nhn?blogId=laonple&logNo=220818841217&parentCategoryNo=&categoryNo=16&viewDate=&isShowPopularPosts=true&from=search
Dropout paper: https://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf
Second-order optimization: https://darkpgmr.tistory.com/149