NLP - Language Model

July 27, 2021

Language Model

Which word comes next after the following sentence?
- Please turn your homework
  - in or the ?
Which of the following two sentences is more likely to occur?
- all of a sudden I notice three guys standing on the sidewalk.
- on guys all I of notice sidewalk three a sudden standing the.
Goal: compute the probability of a sentence
Why is this needed?
- Machine translation
  - P(high winds tonight) > P(large winds tonight)
- Spell correction
  - The office is about fifteen minuets from my house
    - P(about fifteen minutes from) > P(about fifteen minuets from)
- Speech recognition
  - P(I saw a van) ≫ P(eyes awe of an)
Language model: a model that assigns a probability to a sequence of words, $P(W) = P(w_{1}, w_{2}, \dots, w_{n})$
Related task: given a sequence of words, compute the probability of the next word, $P(w_{n} | w_{1}, w_{2}, \dots, w_{n-1})$

Computing P(W)

Compute the joint probability $P(\text{its}, \text{water}, \text{is}, \text{so}, \text{transparent}, \text{that})$
Let's use the chain rule

Chain rule

Conditional probability
- $P(B | A) = \frac{P(A, B)}{P(A)}$
- $P(A, B) = P(A) P(B | A)$
With more than two random variables
- $P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)$
- Derivation
  - $\begin{aligned} P(A, B, C, D) & = P(A, B, C) P(D | A, B, C) \\ & = P(A, B) P(C | A, B) P(D | A, B, C) \\ & = \dots \end{aligned}$
General case
- $P(x_{1}, x_{2}, \dots, x_{n}) = P(x_{1}) P(x_{2} | x_{1}) P(x_{3} | x_{1}, x_{2}) \dots P(x_{n} | x_{1}, \dots, x_{n-1})$
$P(w_{1}, \dots, w_{n}) = \prod_{i} P(w_{i} | w_{1} w_{2} \dots w_{i-1})$
- $P(\text{"its water is so transparent"}) = P(\text{its}) \times P(\text{water} | \text{its}) \times P(\text{is} | \text{its water}) \times P(\text{so} | \text{its water is}) \times P(\text{transparent} | \text{its water is so})$
Conditional probability $P(w | h)$
- $P(\text{the} | \text{its water is so transparent that}) = \frac{Count(\text{its water is so transparent that the})}{Count(\text{its water is so transparent that})}$
- What is the problem?
  - There are far too many possible sentences
  - We will never have enough data to estimate these counts reliably.
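The chain-rule decomposition above can be sketched in a few lines of Python. This is only an illustration: the helper name `chain_rule_factors` is made up here, and the code lists the conditional terms without estimating any probabilities.

```python
def chain_rule_factors(sentence):
    """Return the (word, history) pairs whose conditional probabilities
    multiply to the joint probability of the sentence."""
    words = sentence.split()
    # The history of word i is every word that comes before it.
    return [(w, tuple(words[:i])) for i, w in enumerate(words)]


factors = chain_rule_factors("its water is so transparent")
for word, history in factors:
    if history:
        print(f"P({word} | {' '.join(history)})")
    else:
        print(f"P({word})")
```

Each history grows by one word, which is why counting full histories in a corpus quickly runs out of data.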
Markov Assumption
- The assumption that "the probability of a word depends only on the few words that precede it"
  - $P(\text{the} | \text{its water is so transparent that}) \approx P(\text{the} | \text{that})$
  - $P(\text{the} | \text{its water is so transparent that}) \approx P(\text{the} | \text{transparent that})$
- $P(w_{i} | w_{1} \dots w_{i-1}) \approx P(w_{i} | w_{i-k} \dots w_{i-1})$
  - Approximate each term with a conditional probability that depends only on the $k$ immediately preceding words.
- $P(w_{1}, \dots, w_{n}) \approx \prod_{i} P(w_{i} | w_{i-k} \dots w_{i-1})$
  - With this simplification, the joint probability becomes a product of short-history terms.
Unigram model
- $P(w_{1}, \dots, w_{n}) \approx \prod_{i} P(w_{i})$
- Example of text generated by this model
  - fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass
Bigram model
- $P(w_{i} | w_{1} \dots w_{i-1}) \approx P(w_{i} | w_{i-1})$
- Conditions only on the immediately preceding word
N-gram model
- The same idea extends to trigrams, 4-grams, and 5-grams
- It cannot fully model relationships between words that are far apart.
- In many cases, however, n-grams alone give good results
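To make the Markov assumption concrete, the sketch below truncates a word's history to the $k$ preceding words. The function name `markov_context` is an assumption for illustration; $k = 1$ corresponds to the bigram model and $k = 2$ to the trigram model.

```python
def markov_context(words, i, k):
    """History used for the word at position i under a k-th order Markov
    assumption: the k words immediately before it (fewer near the start)."""
    return tuple(words[max(0, i - k):i])


words = "its water is so transparent that the".split()
i = len(words) - 1  # position of "the"

full_history = tuple(words[:i])                # what the chain rule conditions on
bigram_history = markov_context(words, i, 1)   # ("that",)
trigram_history = markov_context(words, i, 2)  # ("transparent", "that")
```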

Computing bigram probabilities

Maximum likelihood estimation

$P (w_{i} | w_{i - 1}) = \frac{Count (w_{i - 1}, w_{i})}{Count (w_{i - 1})}$
$P (w_{i} | w_{i - 1}) = \frac{c (w_{i - 1}, w_{i})}{c (w_{i - 1})}$

Why is this the MLE?

Data consisting of several sentences: $D$
Parameters: $θ$
Probability of the data given the parameters:

$P(D | θ)$

Deriving the MLE

$P(D | θ)$

First, we need to define $θ$.

Given two words, a and b:

$P(a | b) = α_{ab}$ : the probability that word a appears right after a given word b

$P(a) = β_{a}$ : the probability of word a (used below for the first word of a sentence)

$D = \{W_{1}, \dots, W_{N}\}$ : $D$ is the set of sentences, and each $W_{i}$ is a sentence

Uppercase letters denote sentences and lowercase letters denote words.

$P(D | θ) = \prod_{i=1}^{N} P(W_{i})$

Here $θ$ contains all of $\{α_{ab}, β_{a}\}$.

Take two sentences as an example:

$W_{1} = \text{"b a b"}$
$W_{2} = \text{"a b a c"}$

$\begin{aligned} P(D | θ) & = P(W_{1}) P(W_{2}) \\ & = P(\text{"b a b"}) P(\text{"a b a c"}) \\ & = β_{b} α_{ab} α_{ba} β_{a} α_{ba} α_{ab} α_{ca} \end{aligned}$

$P(\text{"b a b"}) = β_{b} α_{ab} α_{ba}$
$P(\text{"a b a c"}) = β_{a} α_{ba} α_{ab} α_{ca}$

Taking the logarithm,
$\ln P(D | θ) = 2 \ln α_{ab} + 2 \ln α_{ba} + \ln α_{ca} + \ln β_{a} + \ln β_{b}$
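As a quick check of the coefficients in the log-likelihood, the sketch below counts how often each parameter appears for the two example sentences. The tuple labels `("alpha", a, b)` for $α_{ab} = P(a | b)$ and `("beta", a)` for $β_{a}$ are a naming choice made for this illustration.

```python
from collections import Counter


def parameter_counts(sentences):
    """Count how many times each parameter appears in P(D | theta).

    ("beta", a) stands for beta_a, the probability of a sentence-initial word a.
    ("alpha", a, b) stands for alpha_{ab} = P(a | b), word a right after word b.
    """
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts[("beta", words[0])] += 1
        for prev, cur in zip(words, words[1:]):
            counts[("alpha", cur, prev)] += 1
    return counts


counts = parameter_counts(["b a b", "a b a c"])
for param, n in sorted(counts.items()):
    print(param, n)
```

Each count is the coefficient of the corresponding $\ln$ term, for example 2 for $\ln α_{ab}$ and 1 for $\ln α_{ca}$.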

The constraint that makes these valid probabilities is

$\sum_{a \in V} α_{ab} = 1$

Here b is fixed, and the sum runs over a for every word in the vocabulary $V$.

$\begin{aligned} \sum_{a \in V} P(a | b) & = \sum_{a \in V} \frac{P(a, b)}{P(b)} \\ & = \frac{1}{P(b)} \sum_{a \in V} P(a, b) \\ & = \frac{1}{P(b)} P(b) \\ & = 1 \end{aligned}$

The third line follows because summing the joint probability $P(a, b)$ over a gives the marginal probability of b.

This constraint must therefore be imposed on the parameters ($θ$).

Under this constraint, we need to maximize $P(D | θ) = \prod_{i=1}^{N} P(W_{i})$.

This is easily solved with the method of Lagrange multipliers.

$\mathcal{L}_{ab} = C(b, a) \ln α_{ab} + λ (\sum_{a' \in V} α_{a'b} - 1)$

We look for the $α_{ab}$ that maximizes this expression. Only the log-likelihood term that depends on $α_{ab}$ is kept.
$C(b, a)$ : the number of times "b a" occurs

$\frac{\partial \mathcal{L}_{ab}}{\partial α_{ab}} = \frac{C(b, a)}{α_{ab}} + λ = 0$

Setting the derivative to 0 and solving gives

$λ α_{ab} = -C(b, a)$

Summing both sides over all a,

$\sum_{a \in V} λ α_{ab} = \sum_{a \in V} -C(b, a)$

$λ \sum_{a \in V} α_{ab} = -\sum_{a \in V} C(b, a)$

Since $\sum_{a \in V} α_{ab} = 1$, only $λ$ remains on the left-hand side.

$λ = -\sum_{a \in V} C(b, a)$

Substituting this value back into the equation above ($\frac{C(b, a)}{α_{ab}} + λ = 0$) gives

$α_{ab} = \frac{C(b, a)}{\sum_{a' \in V} C(b, a')}$

The denominator is the number of times b is followed by any word, that is, $c(b)$. So this is the same formula stated at the start of the MLE section, $P(w_{i} | w_{i-1}) = \frac{c(w_{i-1}, w_{i})}{c(w_{i-1})}$.
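The result can be sanity-checked numerically. The sketch below uses made-up counts $C(b, a')$ for one fixed word b and compares the log-likelihood of the normalized counts against a few other valid distributions. The counts and the alternative distributions are assumptions chosen for illustration, not data from the original text.

```python
import math

# Hypothetical counts C(b, a') of each word a' following one fixed word b.
follow_counts = {"a": 3, "c": 1, "d": 1}


def log_likelihood(alpha):
    """The part of ln P(D | theta) that depends on the distribution alpha_{. b}."""
    return sum(c * math.log(alpha[w]) for w, c in follow_counts.items())


# MLE from the derivation: each count divided by the total count.
total = sum(follow_counts.values())
mle = {w: c / total for w, c in follow_counts.items()}

# Other distributions that also sum to 1, for comparison.
alternatives = [
    {"a": 1 / 3, "c": 1 / 3, "d": 1 / 3},
    {"a": 0.5, "c": 0.25, "d": 0.25},
    {"a": 0.7, "c": 0.2, "d": 0.1},
]

for alt in alternatives:
    print(log_likelihood(mle) > log_likelihood(alt))
```

A handful of comparisons is not a proof, but it agrees with the closed-form solution obtained from the Lagrangian.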

$P(a) = β_{a}$ can also be expressed with a sentence-start symbol.
If $\text{<s>}$ denotes the sentence-start symbol, $β_{a}$ can be written as $α_{a \text{<s>}}$.

Example

$P(w_{i} | w_{i-1}) = \frac{c(w_{i-1}, w_{i})}{c(w_{i-1})}$

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

$\begin{array}{lll} P(\text{I} | \text{<s>}) = \frac{2}{3} = 0.67 & P(\text{Sam} | \text{<s>}) = \frac{1}{3} = 0.33 & P(\text{am} | \text{I}) = \frac{2}{3} = 0.67 \\ P(\text{</s>} | \text{Sam}) = \frac{1}{2} = 0.5 & P(\text{Sam} | \text{am}) = \frac{1}{2} = 0.5 & P(\text{do} | \text{I}) = \frac{1}{3} = 0.33 \end{array}$
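The estimates for this three-sentence corpus can be reproduced with a short script. This is a minimal sketch that assumes whitespace tokenization; the names `unigram_counts`, `bigram_counts`, and `bigram_prob` are chosen here for illustration.

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    # Adjacent token pairs (w_{i-1}, w_i) within one sentence.
    bigram_counts.update(zip(tokens, tokens[1:]))


def bigram_prob(prev, word):
    """MLE estimate P(word | prev) = c(prev, word) / c(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]


print(bigram_prob("<s>", "I"))     # 2/3
print(bigram_prob("Sam", "</s>"))  # 1/2
print(bigram_prob("I", "do"))      # 1/3
```

A bigram that never occurs gets probability 0 under this estimate, because `Counter` returns 0 for missing keys.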

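[Table: bigram counts]
[Table: bigram probabilities]

$\begin{aligned} P(\text{<s> i want english food </s>}) & = P(\text{i} | \text{<s>}) P(\text{want} | \text{i}) P(\text{english} | \text{want}) P(\text{food} | \text{english}) P(\text{</s>} | \text{food}) \\ & = 0.25 \times 0.33 \times 0.0011 \times 0.5 \times 0.68 \\ & = 0.000031 \end{aligned}$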
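The product above can be reproduced in code. The sketch below hard-codes the five bigram probabilities quoted in the calculation and sums their logarithms. Working in log space is a choice made here to avoid underflow on longer sentences and is not part of the original calculation.

```python
import math

# Bigram probabilities quoted in the worked example above.
bigram_probs = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}


def sentence_log_prob(tokens):
    """Sum of ln P(w_i | w_{i-1}) over adjacent token pairs."""
    return sum(math.log(bigram_probs[pair]) for pair in zip(tokens, tokens[1:]))


tokens = "<s> i want english food </s>".split()
log_p = sentence_log_prob(tokens)
p = math.exp(log_p)
print(f"{p:.6f}")
```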

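Model evaluation

Extrinsic evaluation
- A language model is usually not a goal in itself but a component of a specific task
- So the evaluation metric of that task is often used to judge whether a language model is good
- For example, suppose two language models A and B are candidates for spell correction
  - Measure how accurately spelling errors are corrected with each model
  - Use the language model with the higher accuracy
Intrinsic evaluation
- Extrinsic evaluation has the drawback of being time-consuming
- It also does not directly evaluate the probabilities the language model learns; the intrinsic metric for that is perplexity (Section 3.2.1, "Perplexity", in the SLP book)
- The language model that is best by this criterion may not be the best for the final task
- It is still useful for quickly checking whether the training process had a bug.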

Perplexity

What makes a good language model?
- A model that predicts the test data with high probability
- Perplexity: the inverse of the probability, normalized by the number of words
- Minimizing perplexity is the same as maximizing probability
$\begin{aligned} PP(W) & = P(w_{1} w_{2} \dots w_{N})^{-\frac{1}{N}} \\ & = \sqrt[N]{\frac{1}{P(w_{1} \dots w_{N})}} \end{aligned}$
Using the chain rule
- $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_{i} | w_{1} \dots w_{i-1})}}$
Bigram perplexity
- $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_{i} | w_{i-1})}}$
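As a rough sketch, perplexity can be computed from the per-word conditional probabilities of a test sequence. The function name `perplexity` and the log-space computation are choices made for this illustration. The N-th root of the product of inverse probabilities is evaluated as the exponential of the negative mean log probability.

```python
import math


def perplexity(word_probs):
    """Perplexity of a test sequence given each word's conditional
    probability P(w_i | history) under the model."""
    n = len(word_probs)
    log_sum = sum(math.log(p) for p in word_probs)
    # (prod 1/p_i)^(1/N) = exp(-(1/N) * sum ln p_i)
    return math.exp(-log_sum / n)


# A model that gives every word probability 1/10 has perplexity 10.
uniform_pp = perplexity([0.1] * 5)

# Lower probabilities on the same number of words give higher perplexity.
worse_pp = perplexity([0.05] * 5)
```

With the bigram model, `word_probs` would hold $P(w_{i} | w_{i-1})$ for each word of the test set.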

Appendix

Reference

Speech and Language Processing : https://web.stanford.edu/~jurafsky/slp3/ed3book_dec302020.pdf


Yilguk Seo 🚀