{{keywords>assumptions for regression analysis, statistics, research methods, 회귀분석의 기본가정, 통계, 조사방법론}} ====== pre-asumptions in regression test ====== * [[Linearity]] - the relationships between the predictors and the outcome variable should be linear * [[:Normality]] - the errors should be normally distributed - technically normality is necessary only for the t-tests to be valid, estimation of the coefficients only requires that the errors be identically and independently distributed * [[:Homoscedasticity|Homogeneity]] of variance (or [[Homoscedasticity]]) - the error variance should be constant * Independence - the errors associated with one observation are not correlated with the errors of any other observation * Model specification - the model should be properly specified (including all relevant variables, and excluding irrelevant variables) * [[Influence]] - individual observations that exert undue influence on the coefficients * [[Collinearity]] or [[Singularity]] - predictors that are highly collinear, i.e. linearly related, can cause problems in estimating the regression coefficients. ===== Outliers ===== For an example of dealing with outlier, see [[:Outliers]] | **Model Summary(b) ** |||||| | Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | Durbin-Watson | | 1 | 0.375935755 |@yellow: 0.141327692 | 0.093623675 | 277.9593965 | 1.770202598 | | a Predictors: (Constant), income |||||| | b Dependent Variable: sales |||||| | ANOVA(b) ||||||| | Model | | Sum of Squares | df | Mean Square | F | Sig. | | 1 | Regression | 228894.3304 | 1 | 228894.3304 | 2.962595204 | 0.102353085 | | | Residual | 1390705.67 | 18 | 77261.42609 | | | | | Total | 1619600 | 19 | | | | | a Predictors: (Constant), income ||||||| | b Dependent Variable: sales ||||||| | Coefficients(a) ||||||| | Model | | Unstandardized \\ Coefficients | | Standardized \\ Coefficients | t | Sig. | | | | B | Std. Error | Beta | | | | 1 | (Constant) | 524.9368996 | 176.8956007 | | 2.967495504 | 0.008247696 | | | income | 0.527406291 | 0.306414384 | 0.375935755 | 1.721219104 | 0.102353085 | | a Dependent Variable: sales ||||||| Note, R2 = .141 Further, Anova test shows that the model is not significant, which means that the IV (income) does not seem to be related (or predict) the sales. Since F test failed, t-test for B also failed. But, the result might be due to some outliers. So, check outliers by examining: * scatter plot: (z-predicted(x), z-residual(y)). The shape should be rectangular. * [[Mahalanobis distance]] score * Cook distance * Leverage {{regression04-outlier.jpg?450|scatter plot of zpre and zres}} | Casewise Diagnostics(a) ||||| | Case Number | Std. Residual | sales | Predicted Value | Residual | | 10 | 3.425856521 | 1820 | 867.7509889 | 952.2490111 | | a Dependent Variable: sales ||||| 두 개의 케이스를 제거한 후의 분석: r2 값이 14%에서 70% 로 증가하였다. 독립변인 income의 b 값이 0.527406291에서 1.618765817로 증가 (따라서, t value도 증가) 하였다. | Model Summary(b) |||||| | Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | Durbin-Watson | | 1 | 0.836338533 | 0.699462142 | 0.680678526 | 100.2063061 | 1.559375101 | | a Predictors: (Constant), income |||||| | b Dependent Variable: sales |||||| | ANOVA(b) ||||||| | Model | | Sum of Squares | df | Mean Square | F | Sig. | | 1 | Regression | 373916.9174 | 1 | 373916.9174 | 37.23788521 | 1.52771E-05 | | | Residual | 160660.8604 | 16 | 10041.30378 | | | | | Total | 534577.7778 | 17 | | | | | a Predictors: (Constant), income ||||||| | b Dependent Variable: sales ||||||| | | | | | | | Coefficients(a) | | Model | | Unstandardized Coefficients | | Standardized Coefficients | t | Sig. | | | | B | Std. Error | Beta | | | | 1 | (Constant) | -42.98345338 | 132.2567413 | | -0.325000094 | 0.749391893 | | | income | 1.618765817 | 0.265272066 | 0.836338533 | 6.102285245 | 1.52771E-05 | | a Dependent Variable: sales ||||||| {{regression04-outlier-removed.jpg?450|scatter plot of zpre and zres}} ===== Normality ===== [[:Normality]] Data: {{:elemapi2.sav}} \\ get file="drivename:\\elemapi2.sav". regression /dependent api00 /method=enter meals ell emer /save resid(apires). examine variables=apires /plot boxplot stemleaf histogram npplot. ===== Homoscedasticity ===== [[Homoscedasticity]] The distribution of residual along the x-values (independent values) should NOT have a pattern. ===== Multi-collinearity ===== [[Multicollinearity]] * It is about correlations among IVs * Why? . . . . ===== Nonlinearity ===== [[Linearity]] ===== case number ===== * About 20 times than IV numbers. * When you have 5 IVs, you need 5 * 20 = 100 cases. * Minimum is said to be 5 times (instead of 20 times).