# Pre-assumptions in regression analysis

• Linearity - the relationships between the predictors and the outcome variable should be linear
• Normality - the errors should be normally distributed. Technically, normality is necessary only for the t-tests to be valid; estimation of the coefficients requires only that the errors be independently and identically distributed
• Homogeneity of variance (or Homoscedasticity) - the error variance should be constant
• Independence - the errors associated with one observation are not correlated with the errors of any other observation
• Model specification - the model should be properly specified (including all relevant variables, and excluding irrelevant variables)
• Influence - no individual observation should exert undue influence on the coefficients
• Collinearity or Singularity - predictors that are highly collinear, i.e. linearly related, can cause problems in estimating the regression coefficients.

## Outliers

For an example of dealing with outliers, see Outliers.

Model Summary(b)

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | Durbin-Watson |
|---|---|---|---|---|---|
| 1 | 0.375935755 | 0.141327692 | 0.093623675 | 277.9593965 | 1.770202598 |

a Predictors: (Constant), income

b Dependent Variable: sales

ANOVA(b)

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 228894.3304 | 1 | 228894.3304 | 2.962595204 | 0.102353085 |
| | Residual | 1390705.67 | 18 | 77261.42609 | | |
| | Total | 1619600 | 19 | | | |

a Predictors: (Constant), income

b Dependent Variable: sales

Coefficients(a)

| Model | | B (Unstandardized) | Std. Error (Unstandardized) | Beta (Standardized) | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | 524.9368996 | 176.8956007 | | 2.967495504 | 0.008247696 |
| | income | 0.527406291 | 0.306414384 | 0.375935755 | 1.721219104 | 0.102353085 |

a Dependent Variable: sales

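Note that R² = .141.
Further, the ANOVA test shows that the model is not significant (F(1, 18) = 2.96, p = .102), which means that the IV (income) does not appear to be related to (or to predict) sales.
Because the model has only one predictor, the t-test for B is equivalent to the F-test (t² = F), so it is not significant either.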
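The figures in the output above can be re-derived by hand. The following is a minimal sketch that only reuses numbers from the ANOVA and Coefficients tables; the variable names are illustrative and are not part of the SPSS output.

```python
# Values copied from the ANOVA and Coefficients tables of the first model.
ss_regression = 228894.3304
ss_total = 1619600.0
ms_regression = 228894.3304   # SS regression / df regression (1)
ms_residual = 77261.42609     # SS residual / df residual (18)
b_income = 0.527406291
se_income = 0.306414384

r_squared = ss_regression / ss_total    # proportion of variance explained
f_value = ms_regression / ms_residual   # overall model test
t_value = b_income / se_income          # test of the income coefficient

print("R squared:", round(r_squared, 4))
print("F:", round(f_value, 4))
print("t:", round(t_value, 4))
print("t squared:", round(t_value ** 2, 4))
```

With a single predictor, the squared t value should match F up to rounding, which is why the two tests report the same significance level.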

However, this result might be due to a few outliers, so check for outliers by examining:

• Scatter plot of the standardized predicted values (x) against the standardized residuals (y). The points should form a roughly rectangular band.
• Cook's distance
• Leverage
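To make the last two diagnostics concrete, here is a minimal sketch of leverage and Cook's distance for a regression with one predictor, using the standard textbook formulas. The function name and the data are hypothetical (they are not the sales/income data), and SPSS may report leverage in a differently scaled form.

```python
def simple_regression_diagnostics(x, y):
    """Return (leverage, cooks_d) lists for a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    p = 2  # estimated coefficients: constant and slope
    mse = sum(e ** 2 for e in residuals) / (n - p)

    # Leverage: how far each x value lies from the mean of x.
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    # Cook's distance: combines residual size and leverage.
    cooks_d = [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]
    return leverage, cooks_d


# Hypothetical data: the last case is far from the trend of the others.
x_demo = [1, 2, 3, 4, 5, 6, 7, 8]
y_demo = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 30.0]
leverage_demo, cooks_demo = simple_regression_diagnostics(x_demo, y_demo)
most_influential = cooks_demo.index(max(cooks_demo))
print("Most influential case (0-based index):", most_influential)
```

Commonly cited rules of thumb flag cases with a Cook's distance above 1 (or above 4/n), but these cutoffs are conventions rather than strict tests.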

Casewise Diagnostics(a)

| Case Number | Std. Residual | sales | Predicted Value | Residual |
|---|---|---|---|---|
| 10 | 3.425856521 | 1820 | 867.7509889 | 952.2490111 |

a Dependent Variable: sales
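The standardized residual for case 10 can be reproduced from numbers already shown in the output. This sketch assumes that the standardized residual is the raw residual divided by the Std. Error of the Estimate from the first Model Summary; the variable names are illustrative.

```python
# Values copied from the Casewise Diagnostics and the first Model Summary.
sales_case_10 = 1820.0
predicted_case_10 = 867.7509889
std_error_of_estimate = 277.9593965

residual = sales_case_10 - predicted_case_10
std_residual = residual / std_error_of_estimate

print("Residual:", round(residual, 4))
print("Std. residual:", round(std_residual, 4))
```

A standardized residual larger than 3 in absolute value is a common criterion for flagging a case as a potential outlier.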

Analysis after removing two cases:
The R² value increased from 14% to 70%.
The b value of the independent variable income increased from 0.527406291 to 1.618765817 (and, accordingly, the t value also increased).

Model Summary(b)

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | Durbin-Watson |
|---|---|---|---|---|---|
| 1 | 0.836338533 | 0.699462142 | 0.680678526 | 100.2063061 | 1.559375101 |

a Predictors: (Constant), income

b Dependent Variable: sales

ANOVA(b)

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 373916.9174 | 1 | 373916.9174 | 37.23788521 | 1.52771E-05 |
| | Residual | 160660.8604 | 16 | 10041.30378 | | |
| | Total | 534577.7778 | 17 | | | |

a Predictors: (Constant), income

b Dependent Variable: sales

Coefficients(a)

| Model | | B (Unstandardized) | Std. Error (Unstandardized) | Beta (Standardized) | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | -42.98345338 | 132.2567413 | | -0.325000094 | 0.749391893 |
| | income | 1.618765817 | 0.265272066 | 0.836338533 | 6.102285245 | 1.52771E-05 |

a Dependent Variable: sales

## Normality

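The following SPSS syntax regresses api00 on meals, ell, and emer, saves the residuals as apires, and then examines the distribution of those residuals with a boxplot, a stem-and-leaf plot, a histogram, and normal probability plots:

get file="drivename:\\elemapi2.sav".
regression
  /dependent api00
  /method=enter meals ell emer
  /save resid(apires).

examine
  variables=apires
  /plot boxplot stemleaf histogram npplot.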
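For readers without SPSS, the idea behind a normal probability (Q-Q) plot can be sketched in a few lines. This is an illustration only: the residuals are hypothetical (not taken from elemapi2.sav), the function names are made up for this example, and the plotting position (i + 0.5) / n is an assumption that may differ from the formula SPSS uses.

```python
from statistics import NormalDist


def normal_qq_points(residuals):
    """Pair each theoretical standard-normal quantile with a sorted residual."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting position (i + 0.5) / n is an assumption of this sketch.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def pearson_r(pairs):
    """Pearson correlation between the two coordinates of each pair."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    s_aa = sum((a - mean_a) ** 2 for a, _ in pairs)
    s_bb = sum((b - mean_b) ** 2 for _, b in pairs)
    return s_ab / (s_aa * s_bb) ** 0.5


# Hypothetical residuals for illustration.
example_residuals = [-12.0, 2.0, -4.0, 8.0, 0.5, -7.5, 13.0, -1.5, 4.5]
qq_points = normal_qq_points(example_residuals)
qq_correlation = pearson_r(qq_points)
print("Q-Q correlation:", round(qq_correlation, 3))
```

If the residuals are close to normal, the points fall near a straight line, so the correlation between the two coordinates is close to 1. Marked curvature or a few points far off the line suggests skewness or outliers.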

## Homoscedasticity

The residuals should NOT show a pattern across the x-values (the values of the independent variable): their spread should stay roughly constant rather than, for example, fanning out as x increases.
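As a rough numerical companion to inspecting the plot, one can compare the spread of the residuals for low and high x-values. The sketch below is an informal check and not a formal test; the function name and both data sets are hypothetical.

```python
def residual_spread_ratio(x, residuals):
    """Ratio of mean squared residuals: upper half of x over lower half of x."""
    pairs = sorted(zip(x, residuals))  # order the cases by x
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[-half:]]

    def mean_square(values):
        return sum(v * v for v in values) / len(values)

    return mean_square(high) / mean_square(low)


x_values = list(range(1, 11))
# Hypothetical residuals whose spread grows with x (heteroscedastic).
fanning = [0.5, -0.4, 0.6, -0.7, 0.5, -2.0, 2.5, -3.0, 3.5, -4.0]
# Hypothetical residuals with constant spread (homoscedastic).
constant = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]

fanning_ratio = residual_spread_ratio(x_values, fanning)
constant_ratio = residual_spread_ratio(x_values, constant)
print("Fanning ratio:", round(fanning_ratio, 2))
print("Constant ratio:", round(constant_ratio, 2))
```

A ratio near 1 is consistent with constant error variance, while a ratio far from 1 suggests that the spread changes with x and that the residual plot deserves a closer look.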

## Multi-collinearity

• Multicollinearity concerns correlations among the IVs (independent variables), not between the IVs and the DV
• Why? . . . .
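One common way to quantify correlation among IVs is the variance inflation factor (VIF). With exactly two predictors it reduces to 1 / (1 - r²), where r is the correlation between them. The sketch below illustrates that special case only; the function names and data are hypothetical.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    s_ab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    s_aa = sum((ai - mean_a) ** 2 for ai in a)
    s_bb = sum((bi - mean_b) ** 2 for bi in b)
    return s_ab / (s_aa * s_bb) ** 0.5


def vif_two_predictors(a, b):
    """VIF for the two-predictor case: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1 / (1 - r ** 2)


# Hypothetical predictors.
iv1 = [1, 2, 3, 4, 5, 6]
iv2_collinear = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]   # almost 2 * iv1
iv2_weak = [2, -1, 0, 1, -2, 0]                    # weakly related to iv1

vif_high = vif_two_predictors(iv1, iv2_collinear)
vif_low = vif_two_predictors(iv1, iv2_weak)
print("VIF, nearly collinear:", round(vif_high, 1))
print("VIF, weakly related:", round(vif_low, 2))
```

A frequently quoted rule of thumb treats a VIF above 10 (a tolerance below 0.1) as a sign of problematic collinearity, although this cutoff is a convention.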

## Number of cases

• As a rule of thumb, the number of cases should be about 20 times the number of IVs.
• For example, with 5 IVs you need 5 * 20 = 100 cases.
• The minimum is said to be 5 times the number of IVs (instead of 20 times).
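The rule of thumb above is simple arithmetic; a small helper makes it easy to apply to other designs. The function name and its default are illustrative.

```python
def required_cases(num_ivs, cases_per_iv=20):
    """Number of cases suggested by the cases-per-IV rule of thumb."""
    return num_ivs * cases_per_iv


recommended = required_cases(5)                  # 20 cases per IV
minimum = required_cases(5, cases_per_iv=5)      # 5 cases per IV
print("Recommended:", recommended)
print("Minimum:", minimum)
```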
pre-assumptions_of_regression_analysis.txt · Last modified: 2016/05/11 08:37 by hkimscil