Pre-assumptions in regression analysis

  • Linearity - the relationships between the predictors and the outcome variable should be linear
  • Normality - the errors should be normally distributed. Technically, normality is necessary only for the t-tests to be valid; estimation of the coefficients only requires that the errors be independently and identically distributed
  • Homogeneity of variance (or Homoscedasticity) - the error variance should be constant
  • Independence - the errors associated with one observation are not correlated with the errors of any other observation
  • Model specification - the model should be properly specified (including all relevant variables, and excluding irrelevant variables)
  • Influence - no individual observation should exert undue influence on the coefficients
  • Collinearity or Singularity - predictors that are highly collinear, i.e. linearly related, can cause problems in estimating the regression coefficients.

Outliers

For an example of dealing with outliers, see Outliers.

Model Summary(b)
Model  R            R Square     Adjusted R Square  Std. Error of the Estimate  Durbin-Watson
1      0.375935755  0.141327692  0.093623675        277.9593965                 1.770202598
a Predictors: (Constant), income
b Dependent Variable: sales

ANOVA(b)
Model         Sum of Squares  df  Mean Square  F            Sig.
1 Regression  228894.3304     1   228894.3304  2.962595204  0.102353085
  Residual    1390705.67      18  77261.42609
  Total       1619600         19
a Predictors: (Constant), income
b Dependent Variable: sales

Coefficients(a)
Model         Unstandardized B  Std. Error   Standardized Beta  t            Sig.
1 (Constant)  524.9368996       176.8956007                     2.967495504  0.008247696
  income      0.527406291       0.306414384  0.375935755        1.721219104  0.102353085
a Dependent Variable: sales

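Note that R² = .141, i.e. income explains only about 14% of the variance in sales.
Further, the ANOVA F test shows that the model is not significant (F(1, 18) = 2.96, p = .102), which means that the IV (income) does not seem to be related to (or to predict) sales.
With a single predictor, the F test and the t-test for B are the same test (F = t²), so the t-test for income is not significant either (t = 1.72, p = .102).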
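
The link between the three tables can be checked by hand. The following Python sketch is not part of the SPSS output; the variable names are my own, and the numbers are copied from the ANOVA and Coefficients tables above. It recomputes R², F and t² to show that they agree.

```python
# Values copied from the ANOVA and Coefficients tables (first model, n = 20).
ss_regression = 228894.3304
ss_residual = 1390705.67
df_regression = 1
df_residual = 18
t_income = 1.721219104

# R square is the share of the total sum of squares explained by the model.
ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total

# F is the ratio of the two mean squares.
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)

print(round(r_squared, 3))      # R square
print(round(f_stat, 3))         # F from the ANOVA table
print(round(t_income ** 2, 3))  # t squared for income; matches F with one predictor
```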

But the result might be due to some outliers. So, check for outliers by examining:

  • scatter plot of the standardized predicted values (x) against the standardized residuals (y). The points should form a roughly rectangular band.
  • Cook's distance
  • Leverage
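
As a rough illustration of the last two diagnostics, here is a minimal Python sketch for a regression with one predictor. The raw sales/income data are not listed on this page, so the data below are made up, and the function name `simple_ols_diagnostics` is hypothetical. The cut-off of 4/n for Cook's distance is one common rule of thumb, not a fixed standard.

```python
def simple_ols_diagnostics(x, y):
    """Fit y = a + b*x and return slope, intercept, leverage, Cook's distance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2  # number of estimated coefficients (intercept and slope)
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage with one predictor: h_i = 1/n + (x_i - mean_x)^2 / Sxx
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    # Cook's distance: D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2
    cooks = [e ** 2 / (p * mse) * h / (1 - h) ** 2
             for e, h in zip(residuals, leverage)]
    return slope, intercept, leverage, cooks


# Hypothetical data (NOT the sales/income data): the last case is an outlier.
income = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sales = [2, 4, 6, 8, 10, 12, 14, 16, 18, 40]

slope, intercept, leverage, cooks = simple_ols_diagnostics(income, sales)
n = len(income)
# Case numbers (1-based) whose Cook's distance exceeds the 4/n rule of thumb.
flagged = [i + 1 for i, d in enumerate(cooks) if d > 4 / n]
print(flagged)
```

Cases flagged this way are candidates for a closer look, not automatic deletions.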

scatter plot of zpre and zres

Casewise Diagnostics(a)
Case Number  Std. Residual  sales  Predicted Value  Residual
10           3.425856521    1820   867.7509889      952.2490111
a Dependent Variable: sales

Analysis after removing two cases:
The R² value increased from 14% to 70%.
The b value of the independent variable income increased from 0.527406291 to 1.618765817 (and, accordingly, the t value also increased).

Model Summary(b)
Model  R            R Square     Adjusted R Square  Std. Error of the Estimate  Durbin-Watson
1      0.836338533  0.699462142  0.680678526        100.2063061                 1.559375101
a Predictors: (Constant), income
b Dependent Variable: sales

ANOVA(b)
Model         Sum of Squares  df  Mean Square  F            Sig.
1 Regression  373916.9174     1   373916.9174  37.23788521  1.52771E-05
  Residual    160660.8604     16  10041.30378
  Total       534577.7778     17
a Predictors: (Constant), income
b Dependent Variable: sales

Coefficients(a)
Model         Unstandardized B  Std. Error   Standardized Beta  t             Sig.
1 (Constant)  -42.98345338      132.2567413                     -0.325000094  0.749391893
  income      1.618765817       0.265272066  0.836338533        6.102285245   1.52771E-05
a Dependent Variable: sales

scatter plot of zpre and zres

Normality

Data: elemapi2.sav

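* Load the data set (replace drivename: with the actual location of the file).
get file="drivename:\\elemapi2.sav".
* Regress api00 on meals, ell and emer, and save the residuals as apires.
regression
  /dependent api00
  /method=enter meals ell emer
  /save resid(apires).

* Inspect the saved residuals with a boxplot, stem-and-leaf plot, histogram and normal probability plot.
examine
  variables=apires
  /plot boxplot stemleaf histogram npplot.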
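
Outside SPSS, a quick numeric check on saved residuals is possible with skewness and kurtosis. The Python sketch below uses the Jarque-Bera statistic, which is a substitute technique and not what the examine command reports. The residual values are made up, and the function name `jarque_bera` is my own. The p-value relies on a large-sample chi-square approximation with 2 degrees of freedom, so it is only a rough guide for small samples.

```python
import math


def jarque_bera(residuals):
    """Return skewness, kurtosis, the Jarque-Bera statistic and its p-value."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments (divided by n, as the Jarque-Bera statistic assumes).
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of a chi-square with 2 df is exp(-x / 2).
    p_value = math.exp(-jb / 2)
    return skewness, kurtosis, jb, p_value


# Hypothetical residuals (NOT taken from elemapi2.sav).
apires = [-3, -1, -1, 0, 0, 0, 1, 1, 3]
skewness, kurtosis, jb, p_value = jarque_bera(apires)
print(skewness, kurtosis, jb, p_value)
```

A small p-value would suggest that the residuals depart from normality; the plots requested above remain the more informative check.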

Homoscedasticity

The residuals should NOT show a pattern along the x-values (the values of the independent variable); in particular, their spread should stay roughly constant across the range of x.
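
Besides looking at the plot, the "no pattern" requirement can be checked numerically. The following Python sketch uses a Breusch-Pagan style test (Koenker's n·R² version) for one predictor: the squared residuals are regressed on x, and n·R² of that auxiliary regression is compared with a chi-square distribution with 1 degree of freedom. This test is my own choice of illustration and is not mentioned in the original notes; the data and function names are made up.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """R square of the simple regression of y on x."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def breusch_pagan(x, y):
    """LM statistic (n * R square of squared residuals on x) and its p-value."""
    intercept, slope = fit_line(x, y)
    squared_resid = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
    lm = len(x) * r_squared(x, squared_resid)
    # Survival function of a chi-square with 1 df is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value


# Hypothetical data whose scatter around the line grows with x (a fan shape).
x = list(range(1, 13))
y = [xi * (1.2 if xi % 2 == 0 else 0.8) for xi in x]
lm, p_value = breusch_pagan(x, y)
print(lm, p_value)
```

A small p-value indicates that the error variance changes with x, i.e. heteroscedasticity.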

Multi-collinearity

  • It is about high correlations among the IVs
  • Why? . . . .
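
One common way to quantify correlation among IVs is the variance inflation factor (VIF). The Python sketch below covers the simplest case of two predictors, where VIF = 1 / (1 - r²) and r is the correlation between them. The data and function names are made up for illustration, and the frequently quoted cut-off of 10 is only a rule of thumb.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a model with exactly two IVs."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


# Hypothetical predictors: iv_b rises almost in step with iv_a.
iv_a = [1, 2, 3, 4, 5, 6]
iv_b = [2, 4, 5, 8, 11, 12]
print(vif_two_predictors(iv_a, iv_b))

# Uncorrelated predictors give the minimum possible VIF of 1.
print(vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1]))
```

With more than two IVs, each predictor has its own VIF, computed from the R² of regressing that predictor on all the others.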

Nonlinearity

Number of cases

  • You need about 20 times as many cases as IVs.
    • When you have 5 IVs, you need 5 * 20 = 100 cases.
  • The minimum is said to be 5 times the number of IVs (instead of 20 times).
pre-assumptions_of_regression_analysis.txt · Last modified: 2016/05/11 08:37 by hkimscil
