This is an old revision of the document!

Gradient Descent

task: y hat = a + b*x 의 regression 에서 a 와 b를 구하기

regressin line으로 이루어지는 에측치로 결과되는 오차의 제곱의 합은 (sum of square residuals, ssr 혹은 sse) 최소값이 되어야 한다.
방법 1. (버릴 방법)
- 우선 b를 안다고 하고 정해두면 (가령 -2.4와 같이 – 비현실적이지만)
- 예측할 수 있는 a값의 range를 두고
- 하나씩 대입하여 그 때의 sum of square residual 을 (ssr 혹은 sse) 구해서
- 기록해 두면서 어디서 최소값인지 본다.
- 아래 r script는 이런 방법으로 구해 본것인데 아래처럼 두가지 단점이 있다.
  - 우선 정확한 b값을 알 수는 없다.
  - a 값을 장님 문고리 잡듯이 -50에서 50 사이의 점수를 0.1씩 증가시켜 보는 것이 너무 비효율적이다.
  - 더불어 이렇게 구한 a값은 1/10 단위의 숫자를 구하게 되지 정확한 a값을 구하는 것은 아니다 (예를 들면 1.4로 a값을 구했는데 정확한 값은 1.38976 일 수도 있다.

rs01

library(tidyverse)
library(data.table)
library(ggplot2)
library(ggpmisc)


rm(list=ls())

# data 만들기
set.seed(191)
nx <- 200
mx <- 4.5
sdx <- mx * 0.56
x <- rnorm(nx, mx, sdx)
slp <- 12
y <-  slp * x + rnorm(nx, 0, slp*sdx*3)
data <- data.frame(x, y)
# data 변인 완성

# regression summary shows 
# a and b
mo <- lm(y ~ x, data = data)
summary(mo)

ggplot(data = data, aes(x = x, y = y)) + 
  geom_point() +
  stat_poly_line() +
  stat_poly_eq(use_label(c("eq", "R2"))) +
  theme_classic() 

# 위에서 확인한 b값을 b로 고정하고 a만 변화시켜서 이해
b <- summary(mo)$coefficients[2]
a <- 0

b
a

# Predict function: y hat 값
predict <- function(x, a, b){
  return (a + b * x)
}

# And loss function is: residual 혹은 error 값
residuals <- function(predictions, y) {
  return(y - predictions)
}

# we use sum of square of error 
ssrloss <- function(predictions, y) {
  residuals <- (y - predictions)
  return(sum(residuals^2))
}

ssrs <- c() # for sum  of square residuals
srs <- c() # sum of residuals 
as <- c() # for as (intercepts)

# x 값을 -50 에서 50을 범위로 0.01씩 증가시켜서 
# for 문에 대입, i로 사용
for (i in seq(from = -50, to = 50, by = 0.01)) {
  pred <- predict(x, i, b)
  res <- residuals(pred, y)
  ssr <- ssrloss(pred, y)
  ssrs <- append(ssrs, ssr)
  srs <- append(srs, sum(res))
  as <- append(as, i) # i 값을 a로 사용했기에 as 변인에 기록
}
# 1에는 0.01이 100개 있고, -50 ~ 50 = 1이 101 개 있으니, 10100
length(ssrs)
length(srs)
length(as)

min(ssrs) # sum of square error 값 중 최소값
min.pos.ssrs <- which(ssrs == min(ssrs)) # 그 값이 몇 번째에 있는지 구함
min.pos.ssrs
print(as[min.pos.ssrs]) # 그 몇번째에 해당하는 a값을 구함
min.a <- as[min.pos.ssrs]
# 이 a값이 최소 ssr값을 갖도록 하는 a (소수점 2자리에서 구함)
summary(mo) # a값 확인

plot(as, ssrs, type='l', lwd=2)
text(min.a, mean(ssrs) ,paste("ssrs = ", min(ssrs), "\n" , "is minimum value when a = ", min.a), col='red')

ro01

> library(tidyverse)
> library(data.table)
> library(ggplot2)
> library(ggpmisc)
> 
> 
> rm(list=ls())
> 
> # data 만들기
> set.seed(191)
> nx <- 200
> mx <- 4.5
> sdx <- mx * 0.56
> x <- rnorm(nx, mx, sdx)
> slp <- 12
> y <-  slp * x + rnorm(nx, 0, slp*sdx*3)
> data <- data.frame(x, y)
> # data 변인 완성
> 
> # regression summary shows 
> # a and b
> mo <- lm(y ~ x, data = data)
> summary(mo)

Call:
lm(formula = y ~ x, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-245.291  -67.967   -3.722   63.440  242.174 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.047     14.289  -0.073    0.942    
x             12.900      2.870   4.495 1.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 93.93 on 198 degrees of freedom
Multiple R-squared:  0.09258,	Adjusted R-squared:  0.088 
F-statistic:  20.2 on 1 and 198 DF,  p-value: 1.185e-05

> 
> ggplot(data = data, aes(x = x, y = y)) + 
+     geom_point() +
+     stat_poly_line() +
+     stat_poly_eq(use_label(c("eq", "R2"))) +
+     theme_classic() 
> 
> # 위에서 확인한 b값을 b로 고정하고 a만 변화시켜서 이해
> b <- summary(mo)$coefficients[2]
> a <- 0
> 
> b
[1] 12.90017
> a
[1] 0
> 
> # Predict function: y hat 값
> predict <- function(x, a, b){
+     return (a + b * x)
+ }
> 
> # And loss function is: residual 혹은 error 값
> residuals <- function(predictions, y) {
+     return(y - predictions)
+ }
> 
> # we use sum of square of error 
> ssrloss <- function(predictions, y) {
+     residuals <- (y - predictions)
+     return(sum(residuals^2))
+ }
> 
> ssrs <- c() # for sum  of square residuals
> srs <- c() # sum of residuals 
> as <- c() # for as (intercepts)
> 
> # x 값을 -50 에서 50을 범위로 0.01씩 증가시켜서 
> # for 문에 대입, i로 사용
> for (i in seq(from = -50, to = 50, by = 0.01)) {
+     pred <- predict(x, i, b)
+     res <- residuals(pred, y)
+     ssr <- ssrloss(pred, y)
+     ssrs <- append(ssrs, ssr)
+     srs <- append(srs, sum(res))
+     as <- append(as, i) # i 값을 a로 사용했기에 as 변인에 기록
+ }
> # 1에는 0.01이 100개 있고, -50 ~ 50 = 1이 101 개 있으니, 10100
> length(ssrs)
[1] 10001
> length(srs)
[1] 10001
> length(as)
[1] 10001
> 
> min(ssrs) # sum of square error 값 중 최소값
[1] 1747011
> min.pos.ssrs <- which(ssrs == min(ssrs)) # 그 값이 몇 번째에 있는지 구함
> min.pos.ssrs
[1] 4896
> print(as[min.pos.ssrs]) # 그 몇번째에 해당하는 a값을 구함
[1] -1.05
> min.a <- as[min.pos.ssrs]
> # 이 a값이 최소 ssr값을 갖도록 하는 a (소수점 2자리에서 구함)
> summary(mo) # a값 확인

Call:
lm(formula = y ~ x, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-245.291  -67.967   -3.722   63.440  242.174 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.047     14.289  -0.073    0.942    
x             12.900      2.870   4.495 1.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 93.93 on 198 degrees of freedom
Multiple R-squared:  0.09258,	Adjusted R-squared:  0.088 
F-statistic:  20.2 on 1 and 198 DF,  p-value: 1.185e-05

> 
> plot(as, ssrs, type='l', lwd=2)
> text(min.a, mean(ssrs) ,paste("ssrs = ", min(ssrs), "\n" , "is minimum value when a = ", min.a), col='red')
>

위 방법은 dumb . . . . .
우선 a의 범위가 어디가 될지 몰라서 -50 에서 50까지로 한것
이 두 지점 사이를 0.01 단위로 증가시킨 값을 a값이라고 할 때의 sse값을 구하여 저장한 것
이렇게 구한 sse 값들 중 최소값일 때의 a값을 regression의 a값으로 추정한 것.

이렇게 해서 구한 값은 summary(mo)에서의 a와 값과 같다.

SSE 대신에 MSE를 쓰기

rs.01

#####
# with mean square error (mse) instead of sse 

b <- summary(mo)$coefficients[2]
a <- 0

# we use sum of square of error which oftentimes become big
msrloss <- function(predictions, y) {
  residuals <- (y - predictions)
  return(mean(residuals^2))
}

msrs <- c() # for sum  of square residuals
srs <- c() # sum of residuals 
as <- c() # for as (intercepts)

for (i in seq(from = -50, to = 50, by = 0.01)) {
  pred <- predict(x, i, b)
  res <- residuals(pred, y)
  msr <- msrloss(pred, y)
  msrs <- append(msrs, msr)
  srs <- append(srs, mean(res))
  as <- append(as, i)
}
length(msrs)
length(srs)
length(as)

min(msrs)
min.pos.msrs <- which(msrs == min(msrs))
min.pos.msrs
print(as[min.pos.msrs])
summary(mo)

min(msrs) # sum of square error 값 중 최소값
min.pos.msrs <- which(msrs == min(msrs)) # 그 값이 몇 번째에 있는지 구함
min.pos.msrs
print(as[min.pos.msrs]) # 그 몇번째에 해당하는 a값을 구함
min.a <- as[min.pos.msrs]
# 이 a값이 최소 ssr값을 갖도록 하는 a (소수점 2자리에서 구함)
summary(mo) # a값 확인

plot(as, msrs, type='l', lwd=2)
text(min.a, mean(msrs) ,paste("msrs = ", min(msrs), "\n" , "is minimum value when a = ", min.a), col='red')

ro.02

> 
> #####
> # with mean square error (mse) instead of sse 
> 
> b <- summary(mo)$coefficients[2]
> a <- 0
> 
> # we use sum of square of error which oftentimes become big
> msrloss <- function(predictions, y) {
+     residuals <- (y - predictions)
+     return(mean(residuals^2))
+ }
> 
> msrs <- c() # for sum  of square residuals
> srs <- c() # sum of residuals 
> as <- c() # for as (intercepts)
> 
> for (i in seq(from = -50, to = 50, by = 0.01)) {
+     pred <- predict(x, i, b)
+     res <- residuals(pred, y)
+     msr <- msrloss(pred, y)
+     msrs <- append(msrs, msr)
+     srs <- append(srs, mean(res))
+     as <- append(as, i)
+ }
> length(msrs)
[1] 10001
> length(srs)
[1] 10001
> length(as)
[1] 10001
> 
> min(msrs)
[1] 8735.055
> min.pos.msrs <- which(msrs == min(msrs))
> min.pos.msrs
[1] 4896
> print(as[min.pos.msrs])
[1] -1.05
> summary(mo)

Call:
lm(formula = y ~ x, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-245.291  -67.967   -3.722   63.440  242.174 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.047     14.289  -0.073    0.942    
x             12.900      2.870   4.495 1.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 93.93 on 198 degrees of freedom
Multiple R-squared:  0.09258,	Adjusted R-squared:  0.088 
F-statistic:  20.2 on 1 and 198 DF,  p-value: 1.185e-05

> 
> min(msrs) # sum of square error 값 중 최소값
[1] 8735.055
> min.pos.msrs <- which(msrs == min(msrs)) # 그 값이 몇 번째에 있는지 구함
> min.pos.msrs
[1] 4896
> print(as[min.pos.msrs]) # 그 몇번째에 해당하는 a값을 구함
[1] -1.05
> min.a <- as[min.pos.msrs]
> # 이 a값이 최소 ssr값을 갖도록 하는 a (소수점 2자리에서 구함)
> summary(mo) # a값 확인

Call:
lm(formula = y ~ x, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-245.291  -67.967   -3.722   63.440  242.174 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.047     14.289  -0.073    0.942    
x             12.900      2.870   4.495 1.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 93.93 on 198 degrees of freedom
Multiple R-squared:  0.09258,	Adjusted R-squared:  0.088 
F-statistic:  20.2 on 1 and 198 DF,  p-value: 1.185e-05

> 
> plot(as, msrs, type='l', lwd=2)
> text(min.a, mean(msrs) ,paste("msrs = ", min(msrs), "\n" , "is minimum value when a = ", min.a), col='red')
>

b값 구하기

이제는 a값을 고정하고 b값도 같은 방식으로 구해볼 수 있다

rs01

##############################################
# b값도 범위를 추측한 후에 0.01씩 증가시키면서
# 각 b값에서 mse값을 구해본후 가장 작은 값을 
# 가질 때의 b값을 구하면 된다. 
# 그러나 b값의 적절한 구간을 예측하는 것이
# 불가능하다 (그냥 추측뿐)
# 위의 y 데이터에서 y = 314*x + rnorm(. . .)
# 이라면 -30-30 구간은 적절하지 않은 구간이 된다.
# 더우기 a값을 정확히 알아야 b값을 추출할 수 있다. 
# 이는 적절한 방법이 아니다.

b <- 1
a <- summary(mo)$coefficients[1]

# Predict function:
predict <- function(x, a, b){
  return (a + b * x)
}

# And loss function is:
residuals <- function(predictions, y) {
  return(y - predictions)
}

# we use sum of square of error which oftentimes become big
msrloss <- function(predictions, y) {
  residuals <- (y - predictions)
  return(mean(residuals^2))
}

msrs <- c()
mrs <- c()
bs <- c()

for (i in seq(from = -50, to = 50, by = 0.01)) {
  pred <- predict(x, a, i) # i <- b 값
  res <- residuals(pred, y)
  msr <- msrloss(pred, y)
  msrs <- append(msrs, msr)
  mrs <- append(mrs, mean(res))
  bs <- append(bs,i)
}

min(msrs)
min.pos.msrs <- which(msrs == min(msrs))
print(bs[min.pos.msrs])
min.b <- bs[min.pos.msrs]
summary(mo)

plot(bs, msrs, type='l', lwd=2)
text(min.b, mean(msrs) ,paste("msrs = ", min(msrs), "\n" , "is minimum value when b = ", min.b), col='red')

ro01

> summary(mo)

Call:
lm(formula = y ~ x, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-245.291  -67.967   -3.722   63.440  242.174 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.047     14.289  -0.073    0.942    
x             12.900      2.870   4.495 1.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 93.93 on 198 degrees of freedom
Multiple R-squared:  0.09258,	Adjusted R-squared:  0.088 
F-statistic:  20.2 on 1 and 198 DF,  p-value: 1.185e-05

> # b값도 범위를 추측한 후에 0.01씩 증가시키면서
> # 각 b값에서 mse값을 구해본후 가장 작은 값을 
> # 가질 때의 b값을 구하면 된다. 
> # 그러나 b값의 적절한 구간을 예측하는 것이
> # 불가능하다 (그냥 추측뿐)
> # 위의 y 데이터에서 y = 314*x + rnorm(. . .)
> # 이라면 -30-30 구간은 적절하지 않은 구간이 된다.
> # 더우기 a값을 정확히 알아야 b값을 추출할 수 있다. 
> # 이는 적절한 방법이 아니다.
> 
> b <- 1
> a <- summary(mo)$coefficients[1]
> 
> # Predict function:
> predict <- function(x, a, b){
+     return (a + b * x)
+ }
> 
> # And loss function is:
> residuals <- function(predictions, y) {
+     return(y - predictions)
+ }
> 
> # we use sum of square of error which oftentimes become big
> msrloss <- function(predictions, y) {
+     residuals <- (y - predictions)
+     return(mean(residuals^2))
+ }
> 
> msrs <- c()
> mrs <- c()
> bs <- c()
> 
> for (i in seq(from = -50, to = 50, by = 0.01)) {
+     pred <- predict(x, a, i)
+     res <- residuals(pred, y)
+     msr <- msrloss(pred, y)
+     msrs <- append(msrs, msr)
+     mrs <- append(mrs, mean(res))
+     bs <- append(bs,i)
+ }
> 
> min(msrs)
[1] 8735.055
> min.pos.msrs <- which(msrs == min(msrs))
> print(bs[min.pos.msrs])
[1] 12.9
> min.b <- bs[min.pos.msrs]
> summary(mo)

Call:
lm(formula = y ~ x, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-245.291  -67.967   -3.722   63.440  242.174 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.047     14.289  -0.073    0.942    
x             12.900      2.870   4.495 1.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 93.93 on 198 degrees of freedom
Multiple R-squared:  0.09258,	Adjusted R-squared:  0.088 
F-statistic:  20.2 on 1 and 198 DF,  p-value: 1.185e-05

> 
> plot(bs, msrs, type='l', lwd=2)
> text(min.b, mean(msrs) ,paste("msrs = ", min(msrs), "\n" , "is minimum value when b = ", min.b), col='red')
>

a와 b를 동시에 구할 수 있는 방법은 없을까? 위의 방법으로는 어렵다. 일반적으로 우리는 a와 b값이 무엇이되는가를 미분을 이용해서 구할 수 있었다. R에서 미분의 해를 구하기 보다는 해에 접근하도록 하는 프로그래밍을 써서 a와 b의 근사값을 구한다. 이것을 gradient descent라고 부른다.

Gradient descend

위에서 a값이 무엇일 때 mse값이 최소가 될까를 봐서 이 때의 a값을 실제 $y = a + bx$ 의 a 값으로 삼았다. 이 때 a값의 추정을 위해서
seq(-10, 10, 0.01) 의 범위와 증가값을 가지고 일일이 대입하여 mse를 구아였다.

위에서의 0.01씩 증가시켜 대입하는 것이 아니라 처음 한 숫자에서 시작한 후 그 다음 숫자를 정한 후에 점진적으로 그 숫자 간격을 줄여가면서 보면 정율적으로 0.01씩 증가시키는 것 보다 효율적일 것 같다. 이 증가분을 구하기 위해서 미분을 사용한다.

점차하강 = 조금씩 깍아서 원하는 기울기 (미분값) 찾기
prerequisite:
표준편차 추론에서 평균을 사용하는 이유: 실험적_수학적_이해
deriviation of a and b in a simple regression

위의 문서는 a, b에 대한 값을 미분법을 이용해서 직접 구하였다. 컴퓨터로는 이렇게 하기가 쉽지 않다. 그렇다면 이 값을 반복계산을 이용해서 추출하는 방법은 없을까? gradient descent

\begin{eqnarray*} \text{for a (constant)} \\ \\ \text{SSE} & = & \text{Sum of Square Residuals} \\ \text{Residual} & = & (Y_i - (a + bX_i)) \\ \\ \frac{\text{dSSE}}{\text{da}} & = & \frac{\text{dResidual^2}}{\text{dResidual}} * \frac{\text{dResidual}}{\text{da}} \\ & = & 2 * \text{Residual} * \dfrac{\text{d}}{\text{da}} (Y_i - (a+bX_i)) \\ & \because & \dfrac{\text{d}}{\text{da}} (Y_i - (a+bX_i)) = -1 \\ & = & 2 * \sum{(Y_i - (a + bX_i))} * -1 \\ & = & -2 *\sum{\text{residual}} \\ & .. & -2 \frac{\sum{\text{residual}}}{n} \\ & = & -2 * \overline{\text{residual}} \\ \end{eqnarray*}
아래 R code에서 gradient function을 참조.

\begin{eqnarray*} \text{for b, (coefficient)} \\ \\ \dfrac{\text{d}}{\text{db}} \sum{(Y_i - (a + bX_i))^2} & = & \sum{\dfrac{\text{dResidual}^2}{\text{db}}} \\ & = & \sum{\dfrac{\text{dResidual}^2}{\text{dResidual}}*\dfrac{\text{dResidual}}{\text{db}} } \\ & = & \sum{2*\text{Residual} * \dfrac{\text{dResidual}}{\text{db}} } \\ & = & \sum{2*\text{Residual} * (-X_i) } \;\;\;\; \\ & \because & \dfrac{\text{dResidual}}{\text{db}} = (Y_i - (a+bX_i)) = -X_i \\ & = & -2 X_i \sum{(Y_i - (a + bX_i))} \\ & = & -2 * X_i * \sum{\text{residual}} \\ & .. & -2 * X_i * \frac{\sum{\text{residual}}}{n} \\ & = & -2 * \overline{X_i * \text{residual}} \\ \end{eqnarray*}

위의 설명은 Sum of Square값을 미분하는 것을 전제로 하였지만, Mean Square 값을 (Sum of Square값을 N으로 나눈 것) 대용해서 이해할 수도 있다. 아래의 code는 (미분을 이해한다는 것을 전제로) b값과 a값이 변할 때 msr (mean square residual) 값이 어떻게 변하는가를 알려주는 것이다.

gradient <- function(x, y, predictions){
  error = y - predictions
  db = -2 * mean(x * error)
  da = -2 * mean(error)
  return(list("b" = db, "a" = da))
}

위 펑션으로 얻은 da와 db값을 초기에 설정한 a, b 값에 더해 준 값을 다시 a, b값으로 하여 gradient 펑션을 통해서 다시 db, da값을 구하고 이를 다시 이전 단계에서 구한 a, b값에 더하여 그 값을 다시 a, b값을 하여 . . . .

위를 반복한다. 단, db값과 da값을 그냥 대입하기 보다는 초기에 설정한 learning.rate값을 (0.01 예를 들면) 곱하여 구한 값을 더하게 된다. 이것이 아래의 code이다.

gradient <- function(x, y, predictions){
  error = y - predictions
  db = -2 * mean(x * error)
  da = -2 * mean(error)
  return(list("b" = db, "a" = da))
}

mseloss <- function(predictions, y) {
  residuals <- (y - predictions)
  return(mean(residuals^2))
}

# Train the model with scaled features
learning_rate = 1e-2
lr = 1e-1

# Record Loss for each epoch:
logs = list()
as = c()
bs = c()
mse = c()
sse = c()
x.ori <- x
zx <- (x-mean(x))/sd(x)

nlen <- 50
for (epoch in 1:nlen) {
  predictions <- predict(zx, a, b)
  loss <- mseloss(predictions, y)
  mse <- append(mse, loss)
  
  grad <- gradient(zx, y, predictions)
  
  step.b <- grad$b * lr 
  step.a <- grad$a * lr
  b <- b-step.b
  a <- a-step.a
  
  as <- append(as, a)
  bs <- append(bs, b)
}

a 와 b 값을 gradient descent 방법을 이용하여 한꺼번에 구하기

rs01

# the above no gradient
# mse 값으로 계산 rather than sse 
# 후자는 값이 너무 커짐

a <- rnorm(1)
b <- rnorm(1)
cat(a, b)
a.ini <- a
b.ini <- b

gradient <- function(x, y, predictions){
  error = y - predictions
  db = -2 * mean(x * error)
  da = -2 * mean(error)
  return(list("b" = db, "a" = da))
}

mseloss <- function(predictions, y) {
  residuals <- (y - predictions)
  return(mean(residuals^2))
}

# Train the model with scaled features
learning.rate = 1e-1

# Record Loss for each epoch:
as = c()
bs = c()
mses = c()
sses = c()
mres = c()
zx <- (x-mean(x))/sd(x)

nlen <- 50
for (epoch in 1:nlen) {
  predictions <- predict(zx, a, b)
  residual <- residuals(predictions, y)
  loss <- mseloss(predictions, y)
  mres <- append(mres, mean(residual))
  mses <- append(mses, loss)
  
  grad <- gradient(zx, y, predictions)
  
  step.b <- grad$b * learning.rate 
  step.a <- grad$a * learning.rate
  b <- b-step.b
  a <- a-step.a
  
  as <- append(as, a)
  bs <- append(bs, b)
}
abmses <- data.frame(as, bs, mses)
abmses
min(abmses$mses)

# 위의 a와 b값은 
# x 값을 표준점수화 해서 대입하고 구한 기울기와 절편 값

# unscale coefficients to make them comprehensible
# see http://commres.net/wiki/gradient_descent#why_normalize_scale_or_make_z-score_xi
# and 
# http://commres.net/wiki/gradient_descent#how_to_unnormalize_unscale_a_and_b
#  
a =  a - (mean(x) / sd(x)) * b
b =  b / sd(x)

cat(a, b, "\n")
summary(mo)

# changes of estimators
as <- as - (mean(x) /sd(x)) * bs
bs <- bs / sd(x)

abmses <- data.frame(as, bs, mses)
abmses

cat(paste0("Intercept: ", a, "\n", "Slope: ", b, "\n"))
summary(lm(y ~ x))$coefficients

ch <- data.frame(mres, mses)
ch
max(y)
ggplot(data, aes(x = x, y = y)) + 
  geom_point(size = 2) + 
  geom_abline(aes(intercept = as, slope = bs),
              data = abmses, linewidth = 0.5, 
              color = 'green') + 
  stat_poly_line() +
  stat_poly_eq(use_label(c("eq", "R2"))) +
  theme_classic() +
  geom_abline(aes(intercept = as, slope = bs), 
              data = abmses %>% slice_head(), 
              linewidth = 1, color = 'blue') + 
  geom_abline(aes(intercept = as, slope = bs), 
              data = abmses %>% slice_tail(), 
              linewidth = 1, color = 'red') +
  labs(title = 'Gradient descent. blue: start, red: end, green: gradients')
summary(lm(y~x))
a.ini
b.ini
a
b

ro01

> # the above no gradient
> # mse 값으로 계산 rather than sse 
> # 후자는 값이 너무 커짐
> 
> a <- rnorm(1)
> b <- rnorm(1)
> cat(a, b)
2.242837 -0.7047126> a.ini <- a
> b.ini <- b
> 
> gradient <- function(x, y, predictions){
+     error = y - predictions
+     db = -2 * mean(x * error)
+     da = -2 * mean(error)
+     return(list("b" = db, "a" = da))
+ }
> 
> mseloss <- function(predictions, y) {
+     residuals <- (y - predictions)
+     return(mean(residuals^2))
+ }
> 
> # Train the model with scaled features
> learning.rate = 1e-1
> 
> # Record Loss for each epoch:
> as = c()
> bs = c()
> mses = c()
> sses = c()
> mres = c()
> zx <- (x-mean(x))/sd(x)
> 
> nlen <- 50
> for (epoch in 1:nlen) {
+     predictions <- predict(zx, a, b)
+     residual <- residuals(predictions, y)
+     loss <- mseloss(predictions, y)
+     mres <- append(mres, mean(residual))
+     mses <- append(mses, loss)
+     
+     grad <- gradient(zx, y, predictions)
+     
+     step.b <- grad$b * learning.rate 
+     step.a <- grad$a * learning.rate
+     b <- b-step.b
+     a <- a-step.a
+     
+     as <- append(as, a)
+     bs <- append(bs, b)
+ }
> abmses <- data.frame(as, bs, mses)
> abmses
         as        bs      mses
1  12.95735  5.391121 12538.721
2  21.52896 10.273884 11170.896
3  28.38624 14.184977 10294.952
4  33.87207 17.317763  9734.004
5  38.26074 19.827124  9374.777
6  41.77167 21.837123  9144.730
7  44.58041 23.447131  8997.410
8  46.82741 24.736748  8903.066
9  48.62501 25.769731  8842.649
10 50.06308 26.597151  8803.958
11 51.21354 27.259914  8779.180
12 52.13391 27.790787  8763.313
13 52.87021 28.216017  8753.151
14 53.45925 28.556625  8746.644
15 53.93048 28.829453  8742.476
16 54.30746 29.047988  8739.807
17 54.60905 29.223035  8738.098
18 54.85031 29.363247  8737.004
19 55.04333 29.475557  8736.303
20 55.19774 29.565517  8735.854
21 55.32127 29.637575  8735.567
22 55.42010 29.695294  8735.382
23 55.49916 29.741526  8735.265
24 55.56240 29.778559  8735.189
25 55.61300 29.808222  8735.141
26 55.65348 29.831982  8735.110
27 55.68586 29.851013  8735.090
28 55.71177 29.866258  8735.077
29 55.73249 29.878469  8735.069
30 55.74907 29.888249  8735.064
31 55.76234 29.896084  8735.061
32 55.77295 29.902359  8735.058
33 55.78144 29.907386  8735.057
34 55.78823 29.911412  8735.056
35 55.79366 29.914637  8735.056
36 55.79801 29.917221  8735.055
37 55.80148 29.919290  8735.055
38 55.80427 29.920947  8735.055
39 55.80649 29.922275  8735.055
40 55.80827 29.923338  8735.055
41 55.80970 29.924190  8735.055
42 55.81083 29.924872  8735.055
43 55.81175 29.925419  8735.055
44 55.81248 29.925857  8735.055
45 55.81306 29.926207  8735.055
46 55.81353 29.926488  8735.055
47 55.81390 29.926713  8735.055
48 55.81420 29.926893  8735.055
49 55.81444 29.927038  8735.055
50 55.81463 29.927153  8735.055
> min(abmses$mses)
[1] 8735.055
> 
> # 위의 a와 b값은 
> # x 값을 표준점수화 해서 대입하고 구한 기울기와 절편 값
> 
> # unscale coefficients to make them comprehensible
> # see http://commres.net/wiki/gradient_descent#why_normalize_scale_or_make_z-score_xi
> # and 
> # http://commres.net/wiki/gradient_descent#how_to_unnormalize_unscale_a_and_b
> #  
> a =  a - (mean(x) / sd(x)) * b
> b =  b / sd(x)
> 
> cat(a, b, "\n")
-1.046932 12.89997 
> summary(mo)

Call:
lm(formula = y ~ x, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-245.291  -67.967   -3.722   63.440  242.174 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.047     14.289  -0.073    0.942    
x             12.900      2.870   4.495 1.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 93.93 on 198 degrees of freedom
Multiple R-squared:  0.09258,	Adjusted R-squared:  0.088 
F-statistic:  20.2 on 1 and 198 DF,  p-value: 1.185e-05

> 
> # changes of estimators
> as <- as - (mean(x) /sd(x)) * bs
> bs <- bs / sd(x)
> 
> abmses <- data.frame(as, bs, mses)
> abmses
            as        bs      mses
1   2.71422317  2.323819 12538.721
2   2.00858757  4.428512 11170.896
3   1.43480185  6.114371 10294.952
4   0.96834220  7.464745  9734.004
5   0.58922219  8.546394  9374.777
6   0.28115840  9.412795  9144.730
7   0.03088837 10.106782  8997.410
8  -0.17238667 10.662665  8903.066
9  -0.33745697 11.107928  8842.649
10 -0.47147588 11.464584  8803.958
11 -0.58026310 11.750265  8779.180
12 -0.66855213 11.979095  8763.313
13 -0.74019201 12.162389  8753.151
14 -0.79831185 12.309206  8746.644
15 -0.84545488 12.426808  8742.476
16 -0.88368767 12.521006  8739.807
17 -0.91468912 12.596459  8738.098
18 -0.93982287 12.656897  8737.004
19 -0.96019627 12.705308  8736.303
20 -0.97670839 12.744085  8735.854
21 -0.99008900 12.775145  8735.567
22 -1.00093040 12.800024  8735.382
23 -1.00971319 12.819953  8735.265
24 -1.01682726 12.835915  8735.189
25 -1.02258888 12.848701  8735.141
26 -1.02725453 12.858943  8735.110
27 -1.03103220 12.867147  8735.090
28 -1.03409049 12.873718  8735.077
29 -1.03656609 12.878981  8735.069
30 -1.03856977 12.883197  8735.064
31 -1.04019130 12.886574  8735.061
32 -1.04150341 12.889279  8735.058
33 -1.04256502 12.891446  8735.057
34 -1.04342385 12.893181  8735.056
35 -1.04411857 12.894571  8735.056
36 -1.04468048 12.895685  8735.055
37 -1.04513491 12.896577  8735.055
38 -1.04550239 12.897291  8735.055
39 -1.04579952 12.897863  8735.055
40 -1.04603974 12.898322  8735.055
41 -1.04623394 12.898689  8735.055
42 -1.04639092 12.898983  8735.055
43 -1.04651780 12.899219  8735.055
44 -1.04662035 12.899407  8735.055
45 -1.04670321 12.899559  8735.055
46 -1.04677017 12.899680  8735.055
47 -1.04682427 12.899777  8735.055
48 -1.04686798 12.899854  8735.055
49 -1.04690329 12.899916  8735.055
50 -1.04693181 12.899966  8735.055
> 
> cat(paste0("Intercept: ", a, "\n", "Slope: ", b, "\n"))
Intercept: -1.04693181048123
Slope: 12.8999662843508
> summary(lm(y ~ x))$coefficients
             Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -1.047051  14.289048 -0.0732765 9.416601e-01
x           12.900167   2.870198  4.4945217 1.185018e-05
> 
> ch <- data.frame(mres, mses)
> ch
           mres      mses
1  5.357256e+01 12538.721
2  4.285804e+01 11170.896
3  3.428644e+01 10294.952
4  2.742915e+01  9734.004
5  2.194332e+01  9374.777
6  1.755465e+01  9144.730
7  1.404372e+01  8997.410
8  1.123498e+01  8903.066
9  8.987983e+00  8842.649
10 7.190387e+00  8803.958
11 5.752309e+00  8779.180
12 4.601847e+00  8763.313
13 3.681478e+00  8753.151
14 2.945182e+00  8746.644
15 2.356146e+00  8742.476
16 1.884917e+00  8739.807
17 1.507933e+00  8738.098
18 1.206347e+00  8737.004
19 9.650774e-01  8736.303
20 7.720619e-01  8735.854
21 6.176495e-01  8735.567
22 4.941196e-01  8735.382
23 3.952957e-01  8735.265
24 3.162365e-01  8735.189
25 2.529892e-01  8735.141
26 2.023914e-01  8735.110
27 1.619131e-01  8735.090
28 1.295305e-01  8735.077
29 1.036244e-01  8735.069
30 8.289951e-02  8735.064
31 6.631961e-02  8735.061
32 5.305569e-02  8735.058
33 4.244455e-02  8735.057
34 3.395564e-02  8735.056
35 2.716451e-02  8735.056
36 2.173161e-02  8735.055
37 1.738529e-02  8735.055
38 1.390823e-02  8735.055
39 1.112658e-02  8735.055
40 8.901268e-03  8735.055
41 7.121014e-03  8735.055
42 5.696811e-03  8735.055
43 4.557449e-03  8735.055
44 3.645959e-03  8735.055
45 2.916767e-03  8735.055
46 2.333414e-03  8735.055
47 1.866731e-03  8735.055
48 1.493385e-03  8735.055
49 1.194708e-03  8735.055
50 9.557663e-04  8735.055
> max(y)
[1] 310.9823
> ggplot(data, aes(x = x, y = y)) + 
+     geom_point(size = 2) + 
+     geom_abline(aes(intercept = as, slope = bs),
+                 data = abmses, linewidth = 0.5, 
+                 color = 'green') + 
+     stat_poly_line() +
+     stat_poly_eq(use_label(c("eq", "R2"))) +
+     theme_classic() +
+     geom_abline(aes(intercept = as, slope = bs), 
+                 data = abmses %>% slice_head(), 
+                 linewidth = 1, color = 'blue') + 
+     geom_abline(aes(intercept = as, slope = bs), 
+                 data = abmses %>% slice_tail(), 
+                 linewidth = 1, color = 'red') +
+     labs(title = 'Gradient descent. blue: start, red: end, green: gradients')
> summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-245.291  -67.967   -3.722   63.440  242.174 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.047     14.289  -0.073    0.942    
x             12.900      2.870   4.495 1.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 93.93 on 198 degrees of freedom
Multiple R-squared:  0.09258,	Adjusted R-squared:  0.088 
F-statistic:  20.2 on 1 and 198 DF,  p-value: 1.185e-05

> a.ini
[1] 2.242837
> b.ini
[1] -0.7047126
> a
[1] -1.046932
> b
[1] 12.89997
>

Why normalize (scale or make z-score) xi

x 변인의 측정단위로 인해서 b 값이 결정되게 되는데 이 때의 b값은 상당하고 다양한 범위를 가질 수 있다. 가령 월 수입이 (인컴) X 라고 한다면 우리가 추정해야 (추적해야) 할 b값은 수백만이 될 수도 있다.이 값을 gradient로 추적하게 된다면 너무도 많은 iteration을 거쳐야 할 수 있다. 변인이 바뀌면 이 b의 추적범위도 드라마틱하게 바뀌게 된다. 이를 표준화한 x 점수를 사용하게 된다면 일정한 learning rate와 iteration만으로도 정확한 a와 b를 추적할 수 있게 된다.

How to unnormalize (unscale) a and b

\begin{eqnarray*} y & = & a + b * x \\ & & \text{we use z instead of x} \\ & & \text{and } \\ & & z = \frac{(x - \mu)}{\sigma} \\ & & \text{suppose that the result after calculation be } \\ y & = & k + m * z \\ & = & k + m * \frac{(x - \mu)}{\sigma} \\ & = & k + \frac{m * x}{\sigma} - \frac{m * \mu}{\sigma} \\ & = & k - \frac{m * \mu}{\sigma} + \frac{m * x}{\sigma} \\ & = & \underbrace{k - \frac{\mu}{\sigma} * m}_\text{ 1 } + \underbrace{\frac{m}{\sigma}}_\text{ 2 } * x \\ & & \text{therefore, a and be that we try to get are } \\ a & = & k - \frac{\mu}{\sigma} * m \\ b & = & \frac{m}{\sigma} \\ \end{eqnarray*}

COMMunication
RESearch.NET

Table of Contents

Gradient Descent

rs01

ro01

SSE 대신에 MSE를 쓰기

rs.01

ro.02

b값 구하기

rs01

ro01

Gradient descend

a 와 b 값을 gradient descent 방법을 이용하여 한꺼번에 구하기

rs01

ro01

Why normalize (scale or make z-score) xi

How to unnormalize (unscale) a and b