NN basics - DL Optimization

9 minute read

Objective Function

Objective Function Vs. Loss Function (alias: Cost Function)

"The function we want to minimize or maximize is called the objective function, or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms." ("Deep Learning" - Ian Goodfellow, Yoshua Bengio, Aaron Courville)
A loss function is usually a function defined on a single data point.

MSE

\(e = \frac{1}{2}\Vert \boldsymbol{y} - \boldsymbol{o}\Vert_{2}^{2}\)

์˜ค์ฐจ๊ฐ€ ํด ์ˆ˜๋ก \(e\) ๊ฐ’์ด ํฌ๋ฏ€๋กœ ๋ฒŒ์ ์œผ๋กœ ํ™œ์šฉ๋จ

MSE bad for NN (with sigmoid)

  • During training, the network adjusts its weights and biases in the direction that reduces the error, yet the gradient update stays small even when a large correction is needed.
    - Reason: even as the error grows, the gradient of the logistic sigmoid vanishes as its input grows in magnitude.
  • Good for linear regression.

What if the NN uses an activation function other than sigmoid or tanh? Is MSE still considered useful or not?
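
A small numpy sketch of this saturation effect (my own illustration, not taken from the post's references): for a single sigmoid output unit with target \(y = 1\) and MSE loss, the gradient with respect to the pre-activation \(z\) is \((o - y)\,o\,(1 - o)\), which collapses as the unit saturates even though the error keeps growing.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                              # target
for z in [0.0, -2.0, -5.0, -10.0]:   # pre-activation drifting into the saturated region
    o = sigmoid(z)                   # output moves toward the wrong answer (0)
    grad_z = (o - y) * o * (1 - o)   # d/dz of 0.5 * (o - y)^2
    print(f"z={z:6.1f}  error={y - o:.4f}  d(MSE)/dz={grad_z:.6f}")
# the error approaches 1 while the gradient shrinks toward 0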

CrossEntropy

ํ™•๋ฅ ๋ถ„ํฌ์˜ ์ฐจ์ด๋ฅผ ๋น„๊ตํ•˜๋Š” ํ•จ์ˆ˜

Softmax & Negative Log Likelihood (NLL)

Softmax:
Because the outputs sum to 1, they mimic a probability distribution.

softmax-crossentrophy
softmax-nll

NLL์˜ ๊ฒฝ์šฐ target๊ฐ’์˜ index์™€ ์˜ˆ์ธก๊ฐ’์˜ index๋งŒ์„ ์‚ฌ์šฉ

Softmax Vs. Hinge Loss

softmax-hingeloss
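
For comparison, a quick sketch computing both the softmax (cross-entropy) loss and the multiclass hinge (SVM) loss for a single example with made-up class scores:

import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # hypothetical class scores
y = 0                                 # index of the true class

# softmax / cross-entropy loss
exp = np.exp(scores - scores.max())
softmax_loss = -np.log(exp[y] / exp.sum())

# multiclass hinge loss with margin 1, skipping the true class
margins = np.maximum(0, scores - scores[y] + 1)
margins[y] = 0
hinge_loss = margins.sum()

print(softmax_loss, hinge_loss)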

Pre-Processing

Normalization

  • ํŠน์ง•๊ฐ’์ด ๋ชจ๋‘ ์–‘์ˆ˜(๋˜๋Š” ๋ชจ๋‘ ์Œ์ˆ˜)์ด๊ฑฐ๋‚˜ ํŠน์ง•๋งˆ๋‹ค ๊ฐ’์˜ ๊ทœ๋ชจ๊ฐ€ ๋‹ค๋ฅด๋ฉด, ์ˆ˜๋ ด ์†๋„๊ฐ€ ๋Š๋ ค์งˆ ์ˆ˜ ์žˆ๋‹ค.
    • ๋ชจ๋“  x๊ฐ€ ์–‘์ธ ์ƒํ™ฉ์—์„œ ์–ด๋–ค ์˜ค์ฐจ์—ญ์ „ํŒŒ ฯƒ๊ฐ’์€ ์Œ์ˆ˜์ธ ์ƒํ™ฉ์—๋Š” ์–‘์˜ ๋ฐฉํ–ฅ์œผ๋กœ ์—…๋ฐ์ดํŠธ ๋˜์ง€๋งŒ, ฯƒ๊ฐ’์ด ์–‘์ˆ˜์ธ ์ƒํ™ฉ์—์„œ๋Š” ์Œ์˜ ๋ฐฉํ–ฅ์œผ๋กœ ์—…๋ฐ์ดํŠธ ๋œ๋‹ค. ์ด์ฒ˜๋Ÿผ ์—ฌ๋Ÿฌ ๊ฐ€์ค‘์น˜๊ฐ€ ๋ญ‰์น˜๋กœ ๊ฐ™์ด ์ฆ๊ฐ€ํ•˜๊ฑฐ๋‚˜, ๊ฐ์†Œํ•˜๋ฉด ์ตœ์ €์ ์„ ์ฐพ์•„๊ฐ€๋Š” ๊ฒฝ๋กœ๋ฅผ ๊ฐˆํŒก์งˆํŒกํ•˜์—ฌ ์ˆ˜๋ ด ์†๋„ ์ €ํ•˜๋กœ ์ด์–ด์ง„๋‹ค.
  • ๋˜ํ•œ ํŠน์ง•์˜ ๊ทœ๋ชจ๊ฐ€ ๋‹ฌ๋ผ (์˜ˆ๋ฅผ ๋“ค์–ด, 1.75m ๋“ฑ์˜ ๊ฐ’์„ ๊ฐ€์ง€๋Š” ํ‚ค ํŠน์ง•๊ณผ, 60kg ๋“ฑ์˜ ๊ฐ’์„ ๊ฐ€์ง€๋Š” ๋ชธ๋ฌด๊ฒŒ ํŠน์ง•์€ ์Šค์ผ€์ผ์ด ๋‹ค๋ฅด๋‹ค) ํŠน์ง•๋งˆ๋‹ค ํ•™์Šต์†๋„๊ฐ€ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ, ์ „์ฒด์ ์ธ ํ•™์Šต ์†๋„๋Š” ์ €ํ•˜๋  ๊ฒƒ์ด๋‹ค.

Normalization is applied to each feature independently.

  • ์ •๊ทœ๋ถ„ํฌ๋ฅผ ํ™œ์šฉํ•ด์„œ Normalization.
  • Max & Min Scaler
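
A minimal numpy sketch of the two options, applied independently to each feature (column); the data matrix is made up:

import numpy as np

X = np.array([[1.75, 60.0],
              [1.60, 55.0],
              [1.80, 80.0]])          # rows: samples, columns: features (height, weight)

# standardization: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# min-max scaling: squash each feature into [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std)
print(X_minmax)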

One Hot Encoding
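
One-hot encoding turns a categorical label into a vector that is 1 at the class index and 0 elsewhere; a tiny numpy sketch with made-up labels:

import numpy as np

labels = np.array([0, 2, 1])           # hypothetical class indices
num_classes = 3
one_hot = np.eye(num_classes)[labels]  # each row is 1 at its class index, 0 elsewhere
print(one_hot)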

Weight Initialization

Symmetric Weight

When some machine learning models have weights all initialized to the same value, it can be difficult or impossible for the weights to differ as the model is trained. This is the โ€œsymmetryโ€. Initializing the model to small random values breaks the symmetry and allows different weights to learn independently of each other.
๊ฐ™์€ ๊ฐ’์œผ๋กœ ๊ฐ€์ค‘์น˜๋ฅผ ์ดˆ๊ธฐํ™” ์‹œํ‚ค๋ฉด, ๋‘ ๋…ธ๋“œ๊ฐ€ ๊ฐ™์€ ์ผ์„ ํ•˜๊ฒŒ ๋˜๋Š” ์ค‘๋ณต์„ฑ ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธธ ์ˆ˜ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ด๋Ÿฐ ์ผ์„ ํ”ผํ•˜๋Š” ๋Œ€์นญ ํŒŒ๊ดด(Symmetry break)๊ฐ€ ํ•„์š”ํ•˜๋‹ค.

Random Initialization

The random values can be drawn from a Gaussian distribution or a uniform distribution; empirically, the two reportedly make little difference.

Biases are usually initialized to 0.

What matters is the range of the random values: if the weights are extremely close to 0, the gradients become very small and learning slows down; if they are too large, there is a risk of overfitting.

Visualization

small_tanh

  • If the initialization is too small, all activations become 0, the gradients are 0 as well, and no learning happens.

large_tanh

  • ์ดˆ๊ธฐํ™”๊ฐ€ ๋„ˆ๋ฌด ํฌ๋ฉด, ํ™œ์„ฑ๊ฐ’ ํฌํ™”, ๊ฒฝ์‚ฌ๋„ 0, ํ•™์Šต์•ˆ๋จ.

xavier_tanh

  • ์ดˆ๊ธฐํ™”๊ฐ€ ์ ๋‹นํ•˜๋‹ค๋ฉด, ๋ชจ๋“  ์ธต์—์„œ ํ™œ์„ฑ ๊ฐ’์˜ ๋ถ„ํฌ๊ฐ€ ์ข‹์Œ.

์ถ”๊ฐ€์ ์œผ๋กœ,
xavier_relu
he_relu

  • conclusion on Weight Initialization
    • sigmoid, tanh : Xavier
    • ReLU : He

Weight initialization visualization code:

import numpy as np
import matplotlib.pyplot as plt

# assume unit-gaussian input data: 1000 samples, 500-D
def weight_init(types, activation):
    D = np.random.randn(1000, 500)
    hidden_layer_sizes = [500]*10
    if activation == "relu":
        nonlinearities = ['relu']*len(hidden_layer_sizes)
    else:
        nonlinearities = ['tanh']*len(hidden_layer_sizes)

    act = {'relu': lambda x:np.maximum(0,x), 'tanh':lambda x: np.tanh(x)}
    Hs = {}

    for i in range(len(hidden_layer_sizes)):
        X = D if i==0 else Hs[i-1] # input at this layer
        fan_in = X.shape[1]
        fan_out = hidden_layer_sizes[i]
        if types == "small":
            W = np.random.randn(fan_in, fan_out) * 0.01 # layer initialization
        elif types == "large":
            W = np.random.randn(fan_in, fan_out) * 1 # layer initialization
        elif types == "xavier":
            W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in) # layer initialization
        elif types == "he":
            W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in/2) # layer initialization
            

        H = np.dot(X, W) # matrix multiply
        H = act[nonlinearities[i]](H) # nonlinearity
        Hs[i] = H # cache result on this layer

    print('input layer had mean', np.mean(D), 'and std', np.std(D))

    # look at distributions at each layer
    layer_means = [np.mean(H) for i,H in Hs.items()]
    layer_stds = [np.std(H) for i,H in Hs.items()]

    # print
    for i,H in Hs.items() :
        print('hidden layer', i+1, 'had mean', layer_means[i], 'and std', layer_stds[i])

    plt.figure();
    plt.subplot(1,2,1);
    plt.title("layer mean");
    plt.plot(range(10), layer_means, 'ob-');
    plt.subplot(1,2,2);
    plt.title("layer std");
    plt.plot(range(10), layer_stds, 'or-');

    plt.show();

    plt.figure(figsize=(30,10))

    for i,H in Hs.items():
        plt.subplot(1, len(Hs), i+1)
        plt.hist(H.ravel(), 30, range=(-1,1))
    plt.suptitle("{}_{}".format(types, activation), fontsize=15)  # title for the whole figure
    save_path = "images/{}_{}.png".format(types, activation)
    plt.savefig(save_path)  # save before show(), which clears the current figure
    print("saved at : {}".format(save_path))
    plt.show()

weight_init('xavier', 'tanh') # small, large, xavier, he; tanh, relu

Momentum (Problem with SGD)

The update remembers how it moved in the past and keeps moving a certain amount further in that direction.
This helps smooth out gradient noise (the update rule is sketched at the end of this section).
sgd-momentum-4

  • \(\rho\) determines the momentum: with \(\rho = 0\) there is no momentum. It adds velocity to the update (cs231n describes \(\rho\) as a friction coefficient that damps the velocity).

    Jitter along steep direction

    sgd-momentum-1

    local minima or saddle point

    sgd-momentum-2

  • Saddle points are much more common in high dimensions.

    Minibatch gradients can also be noisy

    sgd-momentum-3
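
A sketch of the SGD+momentum update in the form shown in the cs231n slides, applied to a toy quadratic bowl; the quadratic and the hyperparameters are my own choices:

import numpy as np

def grad(x):
    return 2 * x                 # gradient of a simple quadratic bowl (stand-in for dL/dx)

x = np.array([5.0, -3.0])        # parameters
vx = np.zeros_like(x)            # velocity
rho, learning_rate = 0.9, 0.1

for _ in range(100):
    dx = grad(x)
    vx = rho * vx + dx           # keep a decaying accumulation of past gradients
    x -= learning_rate * vx      # step along the velocity instead of the raw gradient
print(x)                         # approaches the minimum at [0, 0]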

Nesterov Momentum

Vanilla SGD with momentum:
compute the gradient, then add the momentum and move.

Nesterov momentum:
add the momentum first, then compute the gradient at the look-ahead point and move.

Nesterov-Momentum

If your velocity was actually a little bit wrong, it lets you incorporate gradient information from a little bit larger parts of the objective landscape. (That is, when the velocity contributed by momentum is somewhat off, looking ahead still provides useful information about where the loss heads toward the optimal minimum.)
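
A sketch of the Nesterov update in its look-ahead form, on the same toy quadratic as above:

import numpy as np

def grad(x):
    return 2 * x                 # gradient of the same toy quadratic

x = np.array([5.0, -3.0])
vx = np.zeros_like(x)
rho, learning_rate = 0.9, 0.1

for _ in range(100):
    vx = rho * vx - learning_rate * grad(x + rho * vx)  # gradient at the look-ahead point
    x += vx
print(x)                         # very close to [0, 0]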

Adaptive Learning Rates

์ด์ „ ๊ฒฝ์‚ฌ๋„์™€ ํ˜„์žฌ ๊ฒฝ์‚ฌ๋„์˜ ๋ถ€ํ˜ธ๊ฐ€ ๊ฐ™์€ ๊ฒฝ์šฐ ๊ฐ’์„ ํ‚ค์šฐ๊ณ  ๋‹ค๋ฅธ ๊ฒฝ์šฐ ๊ฐ’์„ ์ค„์ด๋Š” ์ „๋žต

AdaGrad(Adaptive Gradient)

Uses a sum of squared gradients instead of a velocity term.

Suppose there are two coordinates, one with a high gradient and one with a small gradient. Dividing by the accumulated sum of squares of the small gradients accelerates the movement along the slowly changing dimension; along the other dimension, where the gradients tend to be very large, we divide by a large number, so progress slows down (see the sketch after the list below).

  • ์ง€๋‚œ ๊ฒฝ์‚ฌ๋„์™€ ์ตœ๊ทผ ๊ฒฝ์‚ฌ๋„๋Š” ๊ฐ™์€ ๋น„์ค‘
    • \(r\)์ด ์ ์  ์ปค์ ธ ์ˆ˜๋ ด์„ ๋ฐฉํ•ดํ•  ๊ฐ€๋Šฅ์„ฑ.
    • convex case, slow down and converge
    • non-convex case, can be stuck in saddle point making no longer progress
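
A sketch of the AdaGrad update on the same toy quadratic; eps is the usual small constant that avoids division by zero:

import numpy as np

def grad(x):
    return 2 * x

x = np.array([5.0, -3.0])
grad_squared = np.zeros_like(x)  # r: running sum of squared gradients, never decays
learning_rate, eps = 1.0, 1e-7

for _ in range(100):
    dx = grad(x)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + eps)
print(x)                         # progress slows as grad_squared keeps growing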

RMSProp (variation of AdaGrad)

WMA (weighted moving average): puts more weight on recent gradients.

rmsprop
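
A sketch of the RMSProp update, which replaces AdaGrad's ever-growing sum with a decaying average; the decay rate of 0.9 is a typical choice, not prescribed in the post:

import numpy as np

def grad(x):
    return 2 * x

x = np.array([5.0, -3.0])
grad_squared = np.zeros_like(x)
learning_rate, decay_rate, eps = 0.1, 0.9, 1e-7

for _ in range(200):
    dx = grad(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx  # leaky average
    x -= learning_rate * dx / (np.sqrt(grad_squared) + eps)
print(x)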

Adam (Adaptive Moment Estimation: RMSProp with momentum)

adam-1
Q: What happens at the first timestep?

At the very first timestep, second_moment is initialized to zero. Its decay rate beta2 is typically something like 0.9 or 0.99, very close to 1, so after one update second_moment is still very close to zero. When we then make the update step and divide by second_moment, we are dividing by a very small number, so we take a very large step at the beginning. This very large step is not due to the geometry of the problem; it is an artifact of initializing the second_moment estimate to zero.


Q: If your first moment is also very small, then you are multiplying by a small number and dividing by the square root of another small number squared; what happens?

They might cancel each other out and this might be okay. Sometimes they do cancel. But sometimes this still ends up taking very large steps right at the beginning, which can be quite bad: if the initialization is a bit poor and you take a very large step, the parameters land in a very bad part of the objective landscape and may never converge from there.


Bias correction term: adam-2
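
A sketch of the full Adam update including the bias-correction terms, again on the toy quadratic; beta1, beta2, and the learning rate are typical values rather than values taken from the post:

import numpy as np

def grad(x):
    return 2 * x

x = np.array([5.0, -3.0])
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
beta1, beta2, learning_rate, eps = 0.9, 0.999, 0.1, 1e-8

for t in range(1, 201):
    dx = grad(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx          # momentum-like term
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # RMSProp-like term
    first_unbias = first_moment / (1 - beta1 ** t)    # bias correction compensates for the
    second_unbias = second_moment / (1 - beta2 ** t)  # zero initialization at early steps
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + eps)
print(x)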

Activation Function

activation-functions

Batch Normalization

Covariate shift: the training set and the test set have different distributions.
Internal covariate shift: as training proceeds, the input distribution seen by each subsequent layer keeps changing, which can hinder learning.

A technique that applies normalization layer by layer.

์ธต์€ ์„ ํ˜•์  ์—ฐ์‚ฐ๊ณผ ๋น„์„ ํ˜•์  ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌ์„ฑ๋˜๋Š” ๋ฐ,
์—ฌ๊ธฐ์„œ์—์„œ ์–ด๋””์— ์ •๊ทœํ™”๋ฅผ ์ ์šฉํ•  ๊ฒƒ์ธ๊ฐ€?

  • ์„ ํ˜• ์—ฐ์‚ฐ ํ›„์— ๋ฐ”๋กœ ์ •๊ทœํ™”๋ฅผ ์ ์šฉ(์ฆ‰, activation ์ „์—)
  • ์ผ๋ฐ˜์ ์œผ๋กœ fc, conv ํ›„ ๋˜๋Š” activation ์ „์— ์ ์šฉ

bn-insertion-1 bn-insertion-2

Applying it per mini-batch is advantageous.

BN Process

  1. Compute the mean and variance over the mini-batch (\(\mu, \sigma\)).
  2. Normalize using \(\mu, \sigma\).
  3. Fine-tune with a scale (\(\gamma\)) and a shift (\(\beta\)); these two are also learned during training.
    • This gives particular parts of the network flexibility in how strongly they are normalized.

bn-gamma-beta

BN inference

bn-gamma-beta-infer
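
A minimal numpy sketch of the BN forward pass for a fully connected layer, covering the training path (batch statistics plus running averages) and the inference path (running statistics); the momentum value for the running averages is my own choice:

import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    if train:
        mu = x.mean(axis=0)                    # per-feature mini-batch mean
        var = x.var(axis=0)                    # per-feature mini-batch variance
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var    # statistics collected during training
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalize
    out = gamma * x_hat + beta                 # scale and shift (both learned)
    return out, running_mean, running_var

x = np.random.randn(32, 4) * 3 + 5             # a mini-batch with shifted, scaled features
gamma, beta = np.ones(4), np.zeros(4)
out, rm, rv = batchnorm_forward(x, gamma, beta, np.zeros(4), np.ones(4))
print(out.mean(axis=0), out.std(axis=0))       # roughly 0 and 1 per feature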

Advantages

  • Improves gradient flow
  • Allows large learning rates
  • Reduces dependence on initialization
  • Acts as a regularizer, reducing the need for dropout

Types of Normalizations

types-of-normalizations

Regularization

Methods

Add extra Term to Loss

  • The regularization term is independent of the training set; it corresponds to prior knowledge about the data-generating process.
  • Because it keeps the parameters small, it also limits the capacity of the model.
  • The L2 norm and the L1 norm are the usual choices for keeping the weights small (the L2 case is sketched below).

    L2 Norm

l2-norm-term
l2-norm-term-2
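
A minimal sketch of how the L2 term enters the objective (data loss plus \(\lambda \Vert W \Vert_2^2\)) and how it shows up in the gradient step as weight decay; the lambda value and the shapes are arbitrary:

import numpy as np

def l2_regularized_loss(W, data_loss, lambda_reg=1e-4):
    # total objective = data loss + lambda * ||W||_2^2
    return data_loss + lambda_reg * np.sum(W * W)

# in the gradient step the L2 term contributes 2 * lambda * W,
# which shrinks every weight a little at each update ("weight decay")
W = np.random.randn(3, 3)
dW_data = np.random.randn(3, 3)       # stand-in for the gradient of the data loss
lambda_reg, learning_rate = 1e-4, 0.1
dW = dW_data + 2 * lambda_reg * W
W -= learning_rate * dW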

Early Stopping

Data Augmentation

Drop Out

fc์—์„œ ๋ฐœ์ƒํ•˜๋Š” overfitting์„ ์ค„์ด๊ธฐ ์œ„ํ•ด์„œ ์”€.

Co-Adaptation

๋ฐ์ดํ„ฐ์˜ ๊ฐ๊ฐ์˜ network์˜ weight๋“ค์ด ์„œ๋กœ ๋™์กฐํ™” ๋˜๋Š” ํ˜„์ƒ์ด ๋ฐœ์ƒ(์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ์˜ inherentํ•œ ํŠน์„ฑ).
randomํ•˜๊ฒŒ dropout์„ ์ฃผ๋ฉด์„œ ์ด๋Ÿฌํ•œ ํ˜„์ƒ์„ ๊ทœ์ œํ•˜๋Š” ํšจ๊ณผ.

dropout-figure-1

In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing. Therefore, units may change in a way that they fix up the mistakes of the other units. This may lead to complex co-adaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit, dropout prevents co-adaptation by making the presence of other hidden units unreliable. Therefore, a hidden unit cannot rely on other specific units to correct its mistakes. It must perform well in a wide variety of different contexts provided by the other hidden units. paper

Figure 7a shows features learned by an autoencoder on MNIST with a single hidden layer of 256 rectified linear units without dropout. Figure 7b shows the features learned by an identical autoencoder which used dropout in the hidden layer with p = 0.5. Both autoencoders had similar test reconstruction errors. However, it is apparent that the features shown in Figure 7a have co-adapted in order to produce good reconstructions. Each hidden unit on its own does not seem to be detecting a meaningful feature. On the other hand, in Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the image. This shows that dropout does break up co-adaptations, which is probably the main reason why it leads to lower generalization errors.

Sparsity

hidden ๋‰ด๋Ÿฐ๋“ค์˜ ํ™œ์„ฑ๋„๊ฐ€ ์ข€ ๋” sparse ๋จ.
sparseํ•˜๊ฒŒ ๋˜๋ฉด์„œ ์˜๋ฏธ๊ฐ€ ์žˆ๋Š” ๋ถ€๋ถ„๋งŒ ๋” ๋‚จ๊ฒŒ ๋˜๋Š” ํšจ๊ณผ๋ฅผ ๊ฐ€์ง€๊ฒŒ ๋˜๋Š” ๊ฒƒ์ž„(sparseํ•˜์ง€ ์•Š๊ณ  ๊ณจ๊ณ ๋ฃจ ๋ถ„ํฌ๋˜์–ด์žˆ๋‹ค๋ฉด ์˜๋ฏธ๊ฐ€ ๋‘๋“œ๋Ÿฌ์ง€์ง€ ์•Š์Œ)

dropout-figure-2

We found that as a side-effect of doing dropout, the activations of the hidden units become sparse, even when no sparsity inducing regularizers are present. Thus, dropout automatically leads to sparse representations. In a good sparse model, there should only be a few highly activated units for any data case.

Comparing the histograms of activations we can see that fewer hidden units have high activations in Figure 8b compared to Figure 8a, as seen by the significant mass away from zero for the net that does not use dropout. (8-b activation shows about 10k activation compared to 1.2k in 8-a.)

Ensemble Effects

Dropout is training a large ensemble of models (that share parameters).

dropout-ensemble
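
A sketch of (inverted) dropout in the forward pass: each unit is kept with probability p during training and the surviving activations are rescaled by 1/p, so nothing needs to change at test time; p = 0.5 follows the paper's example:

import numpy as np

p = 0.5                                   # probability of keeping a unit

def dropout_forward(h, train=True):
    if train:
        mask = (np.random.rand(*h.shape) < p) / p   # drop units at random, rescale by 1/p
        return h * mask
    return h                              # at test time the full network is used as-is

h = np.random.randn(4, 10)                # hypothetical hidden-layer activations
out = dropout_forward(h, train=True)
print((out == 0).mean())                  # roughly half of the activations are dropped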

Ensemble Method

Combining several different models to reduce the generalization error.

Bagging (Bootstrap Aggregating)

ํ›ˆ๋ จ์ง‘ํ•ฉ์„ ์—ฌ๋Ÿฌ ๋ฒˆ ์ถ”์ถœํ•˜์—ฌ ์„œ๋กœ ๋‹ค๋ฅธ ํ›ˆ๋ จ์ง‘ํ•ฉ ๊ตฌ์„ฑ

Boosting

Designed so that the \((i+1)\)-th predictor learns to handle the samples that the \(i\)-th predictor got wrong; successive predictors are therefore linked.

Hyper Parameter Tuning

๋‘ ์ข…๋ฅ˜์˜ parameter๊ฐ€ ์žˆ์Œ

  • ๋‚ด๋ถ€ ๋งค๊ฐœ๋ณ€์ˆ˜ ํ˜น์€ ๊ฐ€์ค‘์น˜
    • ์‹ ๊ฒฝ๋ง์˜ ๊ฒฝ์šฐ ๊ฐ€์ค‘์น˜
  • hyper parameter
    • filter size, stepsize, LR ๋“ฑ

ํƒ์ƒ‰ ๋ฐฉ์‹

  • Grid Search
  • Random Search
  • Log space

์ฐจ์›์˜ ์ €์ฃผ ๋ฌธ์ œ

  • random search๊ฐ€ ๋” ์œ ๋ฆฌ.
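
A sketch of random search with the learning rate and regularization strength sampled in log space; the ranges and the evaluate() stub are placeholders, not from the post:

import numpy as np

def evaluate(lr, reg):
    # placeholder for training a model and returning a validation score
    return -((np.log10(lr) + 3) ** 2 + (np.log10(reg) + 4) ** 2)

best = None
for _ in range(20):
    lr = 10 ** np.random.uniform(-6, -1)    # sample the learning rate in log space
    reg = 10 ** np.random.uniform(-6, -1)   # same for the regularization strength
    score = evaluate(lr, reg)
    if best is None or score > best[0]:
        best = (score, lr, reg)
print(best)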

2์ฐจ ๋ฏธ๋ถ„์„ ์ด์šฉํ•œ ๋ฐฉ๋ฒ•

Newton's Method

  • Uses second-derivative information.
  • First-order optimization:
    • Builds a linear approximation using the gradient
    • Minimizes that approximation
  • Second-order optimization:
    • Builds a quadratic approximation using the gradient and the Hessian
    • Moves to the minimum of that approximation
  • Drawbacks of second-order methods:
    • Very unstable behavior near inflection points (f'' = 0)
      • Near an inflection point, f'' is close to 0, so the step size can blow up and the update may diverge to a completely wrong place
    • The update direction does not distinguish maxima from minima
      • It can converge toward a maximum
  • Taylor series:
    • Expresses a function as the limit of polynomials (a power series) whose coefficients are the derivatives at a particular point of the domain.
  • The objective functions used in machine learning are more complex than quadratics, so the optimum cannot be reached in a single step.
    • The Hessian matrix must be computed
      • The required computation is excessive
      • The conjugate gradient method was proposed as an alternative

์ผค๋ ˆ ๊ฒฝ์‚ฌ๋„ ๋ฐฉ๋ฒ• (conjugate gradient method)

์ง์ „ ์ •๋ณด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•ด์— ๋นจ๋ฆฌ ์ ‘๊ทผ

Quasi-Newton Methods

  • Gradient descent: low convergence efficiency
  • Newton's method: heavy cost of handling the Hessian
    • Use a matrix M that approximates the inverse of the Hessian H
  • BFGS, which builds up the approximation of the Hessian incrementally, is widely used.
  • In machine learning, L-BFGS, which stores M using little memory, is mainly used.
    • If updates over the full batch are feasible, L-BFGS is worth considering (see the sketch below).
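
As an illustration of the full-batch case, a sketch using SciPy's L-BFGS-B optimizer on a small least-squares objective; the data and objective are made up, and scipy is assumed to be available:

import numpy as np
from scipy.optimize import minimize

# toy full-batch least-squares problem: find w minimizing 0.5 * ||Xw - y||^2
X = np.random.randn(100, 5)
w_true = np.arange(1.0, 6.0)
y = X @ w_true

def loss_and_grad(w):
    r = X @ w - y
    return 0.5 * np.dot(r, r), X.T @ r    # full-batch loss and gradient

res = minimize(loss_and_grad, x0=np.zeros(5), jac=True, method="L-BFGS-B")
print(res.x)                              # recovers w_true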

Appendix

Reference

CNN cs231n lecture_8 Optimization: http://cs231n.stanford.edu/slides/2021/lecture_8.pdf
CNN cs231n lecture_3 Loss Functions and Optimization(SGD): http://cs231n.stanford.edu/slides/2021/lecture_3.pdf
CNN cs231n lecture_7 Activation Functions-Data Preprocessing-Weight Initialization-Batch Normalization-Transfer learning: http://cs231n.stanford.edu/slides/2021/lecture_7.pdf
Weight initialization: https://silver-g-0114.tistory.com/79
SGD momentum: https://youtu.be/_JB0AO7QxSA
ML: https://wordbe.tistory.com/entry/MLDL-%EC%84%B1%EB%8A%A5-%ED%96%A5%EC%83%81%EC%9D%84-%EC%9C%84%ED%95%9C-%EC%9A%94%EB%A0%B9
Dropout: https://blog.naver.com/PostView.nhn?blogId=laonple&logNo=220818841217&parentCategoryNo=&categoryNo=16&viewDate=&isShowPopularPosts=true&from=search
Dropout paper: https://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf
2์ฐจ๋ฏธ๋ถ„ ์ตœ์ ํ™”: https://darkpgmr.tistory.com/149
