Statistical Regression Methods

Statistical regression methods are one family of selection methods in multiple regression. In short:

Multiple Regression

  1. Enter method
  2. Selection method
    1. Statistical regression method
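      1. forward selection: among the predictors, the one with the highest correlation with the dependent variable Y is entered first and the regression is computed. Because of its high correlation, the variable entered first comes to be regarded as a theoretically important factor in explaining the dependent variable. Each subsequent variable is entered with the previously entered variables taken into account.
      2. backward deletion: all independent variables are entered at once and the regression is computed. X variables judged not to contribute statistically to the regression equation are then removed one at a time, and the regression is recomputed after each removal.
      3. stepwise selection: the regression is computed as in forward selection, but the explanatory power of each entered variable is evaluated to decide whether to keep or drop it. That is, a t-test on each IV is used to judge whether that IV makes a significant contribution.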
    2. Sequential regression method

See also

Stepwise regression on the NCSS site


The following is from http://www.statisticssolutions.com/selection-process-for-multiple-regression/

Forward selection begins with an empty equation. Predictors are added one at a time beginning with the predictor with the highest correlation with the dependent variable. Variables of greater theoretical importance are entered first. Once in the equation, the variable remains there.
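As a rough illustration of the first forward step, the candidates can be ranked by their absolute correlation with the outcome. This is only a sketch: it assumes the birthwt data from R's MASS package as example data and treats every predictor as numeric, neither of which comes from the quoted text.

```r
library(MASS)  # assumption: birthwt is used only as convenient example data

# Candidate predictors, all numeric or 0/1 coded in birthwt
preds <- c("age", "lwt", "smoke", "ptl", "ht", "ui", "ftv")

# Pearson correlation of each candidate with birth weight (bwt)
r <- sapply(birthwt[preds], function(x) cor(x, birthwt$bwt))

# Forward selection would try the strongest correlate first
first <- names(which.max(abs(r)))
round(sort(abs(r), decreasing = TRUE), 3)
first
```

For single-df predictors, ranking by |r| gives the same order as ranking by the F-test p-value of each one-predictor model.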

Backward elimination (or backward deletion) is the reverse process. All the independent variables are entered into the equation first and each one is deleted one at a time if they do not contribute to the regression equation.

Stepwise regression is a combination of the forward and backward selection techniques. . . . Stepwise regression is a modification of the forward selection so that after each step in which a variable was added, all candidate variables in the model are checked to see if their significance has been reduced below the specified tolerance level. If a nonsignificant variable is found, it is removed from the model. Stepwise regression requires two significance levels: one for adding variables and one for removing variables. The cutoff probability for adding variables should be less than the cutoff probability for removing variables so that the procedure does not get into an infinite loop.
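The two-cutoff procedure can be sketched with add1() and drop1(). The helper name stepwise_p, its arguments, the default cutoffs (0.05 to enter, 0.10 to remove) and the max_steps guard are all assumptions made for illustration, and MASS::birthwt is used only as example data.

```r
library(MASS)  # assumption: birthwt serves as example data

# Hypothetical helper: stepwise selection driven by partial F-test p-values.
stepwise_p <- function(data, response, candidates,
                       alpha_enter = 0.05, alpha_remove = 0.10,
                       max_steps = 100) {
  # The entry cutoff must be stricter than the removal cutoff to avoid cycling
  stopifnot(alpha_enter < alpha_remove)
  model <- lm(as.formula(paste(response, "~ 1")), data = data)
  scope <- reformulate(candidates)
  # Extract named p-values from an add1()/drop1() table, skipping the <none> row
  p_values <- function(tab) setNames(tab[["Pr(>F)"]][-1], rownames(tab)[-1])

  for (i in seq_len(max_steps)) {
    changed <- FALSE
    current <- attr(terms(model), "term.labels")

    # Forward step: add the most significant remaining candidate, if any
    if (length(setdiff(candidates, current)) > 0) {
      p <- p_values(add1(model, scope = scope, test = "F"))
      if (min(p) < alpha_enter) {
        model <- update(model, paste(". ~ . +", names(which.min(p))))
        changed <- TRUE
      }
    }

    # Backward step: re-check every term now in the model
    if (length(attr(terms(model), "term.labels")) > 0) {
      p <- p_values(drop1(model, test = "F"))
      if (max(p) > alpha_remove) {
        model <- update(model, paste(". ~ . -", names(which.max(p))))
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  model
}

bw <- transform(birthwt, race = factor(race))
cands <- c("age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv")
fit <- stepwise_p(bw, "bwt", cands)
formula(fit)
```

Calling the helper with alpha_enter >= alpha_remove stops immediately, which is the safeguard against the infinite loop mentioned above.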

Sequential Regression Method of Entry:

Block-wise selection is a version of forward selection that is achieved in blocks or sets. The predictors are grouped into blocks based on psychometric consideration or theoretical reasons and a stepwise selection is applied. Each block is applied separately while the other predictor variables are ignored. Variables can be removed when they do not contribute to the prediction. In general, the predictors included in the blocks will be inter-correlated. Also, the order of entry has an impact on which variables will be selected; those that are entered in the earlier stages have a better chance of being retained than those entered at later stages.
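A minimal sketch of block-wise entry: blocks are added in a fixed order and each block's incremental contribution is tested with anova(). The grouping of predictors into block1 and block2 is hypothetical, MASS::birthwt is assumed as example data, and the sketch shows only the ordered entry of blocks, not a stepwise search within each block.

```r
library(MASS)  # assumption: birthwt serves as example data

bw <- transform(birthwt,
                race = factor(race, labels = c("White", "Black", "Other")))

# Hypothetical blocks, ordered by assumed theoretical priority
block1 <- c("age", "lwt", "race")   # maternal background
block2 <- c("smoke", "ht", "ui")    # risk factors

m0 <- lm(bwt ~ 1, data = bw)
m1 <- lm(reformulate(block1, response = "bwt"), data = bw)
m2 <- lm(reformulate(c(block1, block2), response = "bwt"), data = bw)

# Row 2 tests block1 against the null model; row 3 tests block2 given block1
tab <- anova(m0, m1, m2)
tab
```

Because block2 is tested only after block1 is already in the model, swapping the order of the blocks can change which block appears to contribute.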

Essentially, the multiple regression selection process enables the researcher to obtain a reduced set of variables from a larger set of predictors, eliminating unnecessary predictors, simplifying data, and enhancing predictive accuracy.

Two criteria are used to achieve the best set of predictors; these include meaningfulness to the situation and statistical significance. By entering variables into the equation in a given order, confounding variables can be investigated and variables that are highly correlated can be combined into blocks.

Example 1

Backward elimination

Read the lowbwt dataset (lowbwt.csv), or see https://notendur.hi.is/birgirhr/lowbwt.txt

lbw <- read.csv("http://commres.net/wiki/_media/r/lowbwt.csv", sep=",")
names(lbw) <- tolower(names(lbw))
## Recoding
lbw <- within(lbw, {
    ## race relabeling
    race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

    ## ftv (frequency of visit) relabeling
    ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
    ftv.cat <- relevel(ftv.cat, ref = "Normal")

    ## ptl (number of previous premature labours): none vs. one or more
    preterm <- factor(ptl >= 1, levels = c(FALSE, TRUE), labels = c("0","1+"))
})
## Full model (all candidate predictors) and null model (intercept only)
lm.full <- lm(bwt ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, data = lbw)
lm.null <- lm(bwt ~ 1, data = lbw)
summary(lm.full)
Call:
lm(formula = bwt ~ age + lwt + race.cat + smoke + preterm + ht + 
    ui + ftv.cat, data = lbw)

Residuals:
     Min       1Q   Median       3Q      Max 
-1896.38  -445.54    53.58   466.07  1654.74 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2949.808    320.517   9.203  < 2e-16 ***
age             -2.928      9.674  -0.303 0.762483    
lwt              4.205      1.717   2.448 0.015316 *  
race.catBlack -467.043    149.797  -3.118 0.002125 ** 
race.catOther -323.144    117.411  -2.752 0.006532 ** 
smoke         -307.880    109.148  -2.821 0.005335 ** 
preterm1+     -207.757    136.364  -1.524 0.129394    
ht            -568.111    200.905  -2.828 0.005225 ** 
ui            -494.168    137.246  -3.601 0.000412 ***
ftv.catNone    -55.975    105.373  -0.531 0.595934    
ftv.catMany   -185.275    203.215  -0.912 0.363151    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 646.9 on 178 degrees of freedom
Multiple R-squared:  0.2544,	Adjusted R-squared:  0.2125 
F-statistic: 6.074 on 10 and 178 DF,  p-value: 6.27e-08
## age is the least significant at p = 0.76
drop1(lm.full, test = "F")
Single term deletions

Model:
bwt ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat
         Df Sum of Sq      RSS    AIC F value    Pr(>F)    
<none>                74494960 2457.2                      
age       1     38343 74533303 2455.3  0.0916 0.7624834    
lwt       1   2508944 77003904 2461.4  5.9949 0.0153165 *  
race.cat  2   5560980 80055939 2466.8  6.6438 0.0016492 ** 
smoke     1   3329939 77824899 2463.4  7.9566 0.0053352 ** 
preterm   1    971457 75466416 2457.6  2.3212 0.1293944    
ht        1   3346518 77841478 2463.5  7.9962 0.0052247 ** 
ui        1   5425727 79920686 2468.5 12.9644 0.0004115 ***
ftv.cat   2    380072 74875032 2454.1  0.4541 0.6357678    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## Now ftv.cat is the least significant at p = 0.64
drop1(update(lm.full, ~ . -age), test = "F")
Single term deletions

Model:
bwt ~ lwt + race.cat + smoke + preterm + ht + ui + ftv.cat
         Df Sum of Sq      RSS  AIC F value  Pr(>F)    
<none>                74533303 2455                    
lwt       1   2483344 77016647 2459    5.96 0.01557 *  
race.cat  2   5607620 80140923 2465    6.73 0.00151 ** 
smoke     1   3295772 77829075 2461    7.92 0.00545 ** 
preterm   1   1052971 75586274 2456    2.53 0.11355    
ht        1   3323302 77856605 2462    7.98 0.00526 ** 
ui        1   5390566 79923869 2466   12.95 0.00041 ***
ftv.cat   2    369667 74902970 2452    0.44 0.64224    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## Now preterm is the least significant at p = 0.12.
drop1(update(lm.full, ~ . -age -ftv.cat), test = "F")
Single term deletions

Model:
bwt ~ lwt + race.cat + smoke + preterm + ht + ui
         Df Sum of Sq      RSS  AIC F value  Pr(>F)    
<none>                74902970 2452                    
lwt       1   2413556 77316526 2456    5.83 0.01673 *  
race.cat  2   6248590 81151560 2463    7.55 0.00071 ***
smoke     1   3933172 78836142 2460    9.50 0.00237 ** 
preterm   1   1008759 75911729 2453    2.44 0.12020    
ht        1   3440574 78343544 2459    8.31 0.00441 ** 
ui        1   5376658 80279628 2463   12.99 0.00040 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## Now all variables are significant at p < 0.1
drop1(update(lm.full, ~ . -age -ftv.cat -preterm), test = "F")
Single term deletions

Model:
bwt ~ lwt + race.cat + smoke + ht + ui
         Df Sum of Sq      RSS  AIC F value  Pr(>F)    
<none>                75911729 2453                    
lwt       1   2671613 78583342 2457    6.41 0.01223 *  
race.cat  2   6674129 82585858 2465    8.00 0.00047 ***
smoke     1   4911219 80822948 2463   11.77 0.00074 ***
ht        1   3583850 79495579 2459    8.59 0.00381 ** 
ui        1   6327025 82238754 2466   15.17 0.00014 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## Show summary for final model
summary(update(lm.full, ~ . -age -ftv.cat -preterm))
Call:
lm(formula = bwt ~ lwt + race.cat + smoke + ht + ui, data = lbw)

Residuals:
   Min     1Q Median     3Q    Max 
 -1843   -433     67    461   1631 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2837.64     243.63   11.65  < 2e-16 ***
lwt               4.24       1.68    2.53  0.01223 *  
race.catBlack  -475.81     145.58   -3.27  0.00129 ** 
race.catOther  -350.00     112.34   -3.12  0.00213 ** 
smoke          -354.90     103.43   -3.43  0.00074 ***
ht             -585.11     199.61   -2.93  0.00381 ** 
ui             -524.44     134.65   -3.89  0.00014 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 646 on 182 degrees of freedom
Multiple R-squared: 0.24,   Adjusted R-squared: 0.215 
F-statistic: 9.59 on 6 and 182 DF,  p-value: 0.00000000366 
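
The manual drop1() steps above can be written as a loop. This is a sketch, not the page's original code: it assumes MASS::birthwt holds the same data as lowbwt.csv, and it takes the 0.10 removal cutoff from the "p < 0.1" comment above. If the data do match, the loop should drop age, ftv.cat and preterm in turn.

```r
library(MASS)  # assumption: birthwt is the same study as the page's lowbwt.csv

# Recode as on this page
lbw <- within(birthwt, {
  race.cat <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))
  ftv.cat  <- relevel(cut(ftv, breaks = c(-Inf, 0, 2, Inf),
                          labels = c("None", "Normal", "Many")), ref = "Normal")
  preterm  <- factor(ptl >= 1, levels = c(FALSE, TRUE), labels = c("0", "1+"))
})

all_terms <- c("age", "lwt", "race.cat", "smoke", "preterm", "ht", "ui", "ftv.cat")
model <- lm(reformulate(all_terms, response = "bwt"), data = lbw)
alpha_remove <- 0.10  # cutoff taken from the "p < 0.1" comment in the manual run

repeat {
  tab <- drop1(model, test = "F")
  p <- setNames(tab[["Pr(>F)"]][-1], rownames(tab)[-1])
  # Stop when every remaining term is significant at the cutoff
  if (length(p) == 0 || max(p) <= alpha_remove) break
  worst <- names(which.max(p))
  cat("dropping", worst, "(p =", signif(p[[worst]], 3), ")\n")
  model <- update(model, paste(". ~ . -", worst))
}
formula(model)
```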

Forward selection

## ui is the most significant variable
add1(lm.null, scope = ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, test = "F")
Single term additions

Model:
bwt ~ 1
         Df Sum of Sq      RSS  AIC F value   Pr(>F)    
<none>                99917053 2493                     
age       1    806927 99110126 2493    1.52   0.2188    
lwt       1   3448881 96468171 2488    6.69   0.0105 *  
race.cat  2   5070608 94846445 2487    4.97   0.0079 ** 
smoke     1   3573406 96343646 2488    6.94   0.0092 ** 
preterm   1   4757523 95159530 2485    9.35   0.0026 ** 
ht        1   2132014 97785038 2491    4.08   0.0449 *  
ui        1   8028747 91888305 2479   16.34 0.000077 ***
ftv.cat   2   2082321 97834732 2493    1.98   0.1410    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## Now race.cat is the most significant
add1(update(lm.null, ~ . +ui), scope = ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, test = "F")
Single term additions

Model:
bwt ~ ui
         Df Sum of Sq      RSS  AIC F value Pr(>F)   
<none>                91888305 2479                  
age       1    472355 91415950 2480    0.96 0.3282   
lwt       1   2076990 89811315 2477    4.30 0.0395 * 
race.cat  2   4767394 87120911 2473    5.06 0.0072 **
smoke     1   2949940 88938365 2475    6.17 0.0139 * 
preterm   1   2837049 89051257 2475    5.93 0.0159 * 
ht        1   3162469 88725836 2474    6.63 0.0108 * 
ftv.cat   2   1847816 90040489 2479    1.90 0.1527   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## Now smoke is the most significant
add1(update(lm.null, ~ . +ui +race.cat), scope = ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, test = "F")
Single term additions

Model:
bwt ~ ui + race.cat
        Df Sum of Sq      RSS  AIC F value  Pr(>F)    
<none>               87120911 2473                    
age      1     57041 87063871 2475    0.12 0.72884    
lwt      1   2234424 84886488 2470    4.84 0.02900 *  
smoke    1   6079888 81041024 2461   13.80 0.00027 ***
preterm  1   2651610 84469302 2469    5.78 0.01724 *  
ht       1   2688781 84432130 2469    5.86 0.01646 *  
ftv.cat  2   1158673 85962238 2474    1.23 0.29373    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## Now ht is the most significant
add1(update(lm.null, ~ . +ui +race.cat +smoke), scope = ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, test = "F")
Single term additions

Model:
bwt ~ ui + race.cat + smoke
        Df Sum of Sq      RSS  AIC F value Pr(>F)  
<none>               81041024 2461                 
age      1       326 81040698 2463    0.00  0.978  
lwt      1   1545445 79495579 2459    3.56  0.061 .
preterm  1   1338799 79702225 2460    3.07  0.081 .
ht       1   2457682 78583342 2457    5.72  0.018 *
ftv.cat  2    331205 80709819 2464    0.37  0.689  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## Now lwt is the most significant
add1(update(lm.null, ~ . +ui +race.cat +smoke +ht), scope = ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, test = "F")
Single term additions

Model:
bwt ~ ui + race.cat + smoke + ht
        Df Sum of Sq      RSS  AIC F value Pr(>F)  
<none>               78583342 2457                 
age      1       882 78582460 2459    0.00  0.964  
lwt      1   2671613 75911729 2453    6.41  0.012 *
preterm  1   1266816 77316526 2456    2.98  0.086 .
ftv.cat  2    244671 78338671 2461    0.28  0.754  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## Now no variable is significant at p < 0.1
add1(update(lm.null, ~ . +ui +race.cat +smoke +ht +lwt), scope = ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, test = "F")
Single term additions

Model:
bwt ~ ui + race.cat + smoke + ht + lwt
        Df Sum of Sq      RSS  AIC F value Pr(>F)
<none>               75911729 2453               
age      1    108807 75802922 2454    0.26   0.61
preterm  1   1008759 74902970 2452    2.44   0.12
ftv.cat  2    325455 75586274 2456    0.39   0.68
## Show summary for final model
summary(update(lm.null, ~ . +ui +race.cat +smoke +ht +lwt))
Call:
lm(formula = bwt ~ ui + race.cat + smoke + ht + lwt, data = lbw)

Residuals:
   Min     1Q Median     3Q    Max 
 -1843   -433     67    461   1631 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2837.64     243.63   11.65  < 2e-16 ***
ui             -524.44     134.65   -3.89  0.00014 ***
race.catBlack  -475.81     145.58   -3.27  0.00129 ** 
race.catOther  -350.00     112.34   -3.12  0.00213 ** 
smoke          -354.90     103.43   -3.43  0.00074 ***
ht             -585.11     199.61   -2.93  0.00381 ** 
lwt               4.24       1.68    2.53  0.01223 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 646 on 182 degrees of freedom
Multiple R-squared: 0.24,   Adjusted R-squared: 0.215 
F-statistic: 9.59 on 6 and 182 DF,  p-value: 0.00000000366 
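
For comparison, base R's step() automates the same forward search but ranks candidates by AIC rather than by F-test p-values, so it need not stop at the same model. In the last add1 table above, adding preterm lowers AIC from 2453 to 2452, so an AIC-based search may well keep preterm even though its p-value is 0.12. The sketch assumes MASS::birthwt holds the same data as lowbwt.csv.

```r
library(MASS)  # assumption: birthwt is the same study as the page's lowbwt.csv

# Recode as on this page
lbw <- within(birthwt, {
  race.cat <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))
  ftv.cat  <- relevel(cut(ftv, breaks = c(-Inf, 0, 2, Inf),
                          labels = c("None", "Normal", "Many")), ref = "Normal")
  preterm  <- factor(ptl >= 1, levels = c(FALSE, TRUE), labels = c("0", "1+"))
})

lm.null <- lm(bwt ~ 1, data = lbw)
full_scope <- ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat

# Forward search from the null model; trace = 0 suppresses the step-by-step log
fwd <- step(lm.null, scope = list(lower = ~ 1, upper = full_scope),
            direction = "forward", trace = 0)
formula(fwd)
```

Setting direction = "both" lets step() also drop terms after each addition, which is the AIC analogue of stepwise selection.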
statistical_regression_methods.txt · Last modified: 2018/06/15 08:30 by hkimscil