{{keywords>outlier "multiple regression" statistics "research methods"}}
====== Outliers e.g., ======
This is further reading on detecting outliers, adapted from http://www.ats.ucla.edu/stat/spss/webbooks/reg/chapter2/spssreg2.htm .

 {{:crime.sav}} \\
 {{:outlierCheck.sps}} \\

<code>get file = "DirectoryOfYourComputer\crime.sav".

descriptives
  /var=crime murder pctmetro pctwhite pcths poverty single.
</code>

<code>		Descriptive Statistics
			N	Minimum	Maximum	Mean	Std. Deviation
violent crime rate	51	82	2922	612.84	441.100
murder rate		51	1.60	78.50	8.7275	10.71758
pct metropolitan	51	24.00	100.00	67.3902	21.95713
pct white		51	31.80	98.50	84.1157	13.25839
pct hs graduates	51	64.30	86.60	76.2235	5.59209
pct poverty		51	8.00	26.40	14.2588	4.58424
pct single parent	51	8.40	22.10	11.3255	2.12149
Valid N (listwise)	51				
</code>

<WRAP box 600px>
Suppose we want to predict crime from pctmetro, poverty, and single. That is, we run a regression with pctmetro, poverty, and single as the independent variables and crime as the dependent variable. The variables are described below. 

| crime: | violent crime rate  |  
| murder:  | murder rate  | 
| pctmetro:  | pct metropolitan  | 
| pctwhite:  | pct white  | 
| pcths:  | pct hs graduates  | 
| poverty:  | pct poverty  | 
| single:  | pct single parent  | 

First, the scatterplot matrix below gives an overall view of the relationships among the variables.
</WRAP>

<code>graph
  /scatterplot(matrix)=crime murder pctmetro pctwhite pcths poverty single .
</code>
{{:r.crime.scatterplot.for.all.variables.jpg|scatterplot for all variables}}

<WRAP box 600px>
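In the scatterplots between the dependent variable crime (shown first) and the other variables, one case lies far from the rest. We need to examine this case more closely and decide whether it is really needed, whether it contains an error, and whether, being an outlier on a numerically measured variable, it would be better to remove it before the analysis.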
</WRAP>
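
One way to take that closer look is to list the raw values of the far-off case. The sketch below is only an illustration: the cutoff of 2000 is an assumed value, chosen because the descriptives report a maximum crime rate of 2922 against a mean of about 613, and it should be adjusted to the data at hand. TEMPORARY limits the selection to the next procedure, so the working file is left unchanged.

<code>* Assumed cutoff (2000) for illustration only.
temporary.
select if (crime > 2000).
list variables = state crime pctmetro poverty single.
</code>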

<code>GRAPH /SCATTERPLOT(BIVAR)=pctmetro WITH crime BY state(name) .
</code>
{{:r.crime.scatterplot.for.crime.by.state.jpg|scatterplot of pctmetro by crime by state}}

<code>GRAPH /SCATTERPLOT(BIVAR)=poverty WITH crime BY state(name) .
</code>
{{:r.crime.scatterplot.for.poverty.by.state.jpg|scatterplot of poverty by state}}

<code>GRAPH /SCATTERPLOT(BIVAR)=single WITH crime BY state(name) .
</code>
{{:r.crime.scatterplot.for.single.by.state.jpg|scatterplot of single by state}}

<WRAP box 600px>
All three graphs above suggest that dc may be a problem. For later comparison, we first run the regression predicting the crime rate from pctmetro, poverty, and single without doing anything about dc (as shown below).
</WRAP>

<code>regression
  /dependent crime
  /method=enter pctmetro poverty single.
</code>

<code>		Model Summary
Model	R	R Square	Adjusted R Square	Std. Error of the Estimate
1	.916a	.840	.830	182.068
a. Predictors: (Constant), pct single parent, pct metropolitan, pct poverty

			ANOVA(b)
Model			Sum of Squares	df	Mean Square	F	Sig.
1	Regression	8170480.211	3	2723493.404	82.160	.000a
	Residual	1557994.534	47	33148.820		
	Total		9728474.745	50			
a. Predictors: (Constant), pct single parent, pct metropolitan, pct poverty
b. Dependent Variable: violent crime rate

			Coefficients(a)
		Unstandardized Coefficients		Standardized Coefficients
Model				B		Std. Error	Beta	t	Sig.
1	(Constant)		-1666.436	147.852			-11.271	.000
	pct metropolitan	7.829		1.255		.390	6.240	.000
	pct poverty		17.680		6.941		.184	2.547	.014
	pct single parent	132.408		15.503		.637	8.541	.000
a. Dependent Variable: violent crime rate
</code>

<WRAP box 600px>
After the regression above, let us collect the residuals (the errors the prediction failed to account for) and draw a histogram of them. The last subcommand below does this. 
</WRAP>
<code>regression
  /dependent crime
  /method=enter pctmetro poverty single
  /residuals=histogram.
</code>

<code>		Model Summary(b)
Model	R	R Square	Adjusted R Square	Std. Error of the Estimate
1	.916a	.840	.830	182.068
a. Predictors: (Constant), pct single parent, pct metropolitan, pct poverty
b. Dependent Variable: violent crime rate

			ANOVA(b)
Model			Sum of Squares	df	Mean Square	F	Sig.
1	Regression	8170480.211	3	2723493.404	82.160	.000a
	Residual	1557994.534	47	33148.820		
	Total		9728474.745	50			
a. Predictors: (Constant), pct single parent, pct metropolitan, pct poverty
b. Dependent Variable: violent crime rate

			Coefficients(a)
		Unstandardized Coefficients		Standardized Coefficients
Model				B		Std. Error	Beta	t	Sig.
1	(Constant)		-1666.436	147.852		-11.271	.000
	pct metropolitan	7.829		1.255		.390	6.240	.000
	pct poverty		17.680		6.941		.184	2.547	.014
	pct single parent	132.408		15.503		.637	8.541	.000
a. Dependent Variable: violent crime rate

		Residuals Statistics(a)
			Minimum		Maximum		Mean	Std.Deviation	N
Predicted Value	-30.51		2509.43		612.84	404.240		51
Residual		-523.013	426.111		.000	176.522		51
Std. Predicted Value	-1.592		4.692		.000	1.000		51
Std. Residual		-2.873		2.340		.000	.970		51
a. Dependent Variable: violent crime rate
</code>
{{:r.crime.residual.histogram.jpg|histogram}}

<WRAP box 600px>
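The figure shows that the residuals near -3.00 and beyond 2.00 are large compared with those of the other cases.  \\ 
\\ 
\\ 
The command below redraws the histogram using the studentized deleted residual. The [[:student deleted residual|studentized deleted residual]] is the residual obtained from the predicted value of a regression fitted with that case excluded, scaled by its standard error. 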
</WRAP>


<code>regression
  /dependent crime
  /method=enter pctmetro poverty single
  /residuals=histogram(sdresid).
</code>
{{:r.crime.residual.histogram.sdresidual.jpg|histogram sdresid}}

<WRAP box 600px>
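Judging that outliers exist, we use outliers(sdresid) and id(state) to find out which cases they are. This option lists the 10 most extreme values. The output (see the Outlier Statistics table further below) shows that "dc" has the largest value (3.766), followed by "ms" (-3.571) and "fl" (2.620).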
</WRAP>

<code>regression
  /dependent crime
  /method=enter pctmetro poverty single
  /residuals=histogram(sdresid) id(state) outliers(sdresid).
</code>
See [[https://www2.bc.edu/william-stevenson/MB875/mb875_Analyzing%20Residuals.htm|Analyzing Residuals Document]] for more on sdresid (studentized deleted residuals). 
<code>		Residuals Statistics(a)
					Minimum		Maximum		Mean	Std. Deviation	N
Predicted Value				-30.51		2509.43		612.84	404.240		51
Std. Predicted Value			-1.592		4.692		.000	1.000		51
Standard Error of Predicted Value	25.788		133.343		47.561	18.563		51
Adjusted Predicted Value		-39.26		2032.11		605.66	369.075		51
Residual				-523.013	426.111		.000	176.522		51
Std. Residual				-2.873		2.340		.000	.970		51
Stud. Residual				-3.194		3.328		.015	1.072		51
Deleted Residual			-646.503	889.885		7.183	223.668		51
Stud. Deleted Residual		-3.571		3.766		.018	1.133		51
Mahal. Distance			.023		25.839		2.941	4.014		51
Cook's Distance			.000		3.203		.089	.454		51
Centered Leverage Value		.000		.517		.059	.080		51
a. Dependent Variable: violent crime rate
</code>

<WRAP box 600px>
The /casewise subcommand lists the cases whose sdresid exceeds 2 in absolute value. See ''Casewise Diagnostics(a)'' below.
</WRAP>
<code>regression
  /dependent crime
  /method=enter pctmetro poverty single
  /residuals=histogram(sdresid) id(state) outliers(sdresid)
  /casewise=plot(sdresid) outliers(2)  .
</code>

<code>		Casewise Diagnostics(a)
Case Number	state	Stud. Deleted 	violent crime 	Predicted 	Residual
			Residual	rate		Value
9		fl	2.620		1206		779.89		426.111
25		ms	-3.571		434		957.01		-523.013
51		dc	3.766		2922		2509.43		412.566
a. Dependent Variable: violent crime rate
</code>

<WRAP box 600px #333>
The following shows how to examine leverage. Leverage indicates how much potential a case has to influence the regression coefficient estimates; it can be requested as an option of histogram() and outliers(). As a rule of thumb, leverage should not exceed (2k+2)/n; a case that does may be an outlier and deserves a closer look. Here k is the number of predictors and n is the number of cases. We should therefore examine cases whose leverage exceeds (2*3+2)/51 = .1569. 

In the output below, fl has a large positive studentized deleted residual (2.620), but its leverage (.083) is not extreme, so keeping fl in the analysis may well be the right decision. dc, however, is also judged extreme on leverage (.517).
</WRAP>
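
The cutoff check can also be applied directly to saved leverage values. The following is a minimal sketch, not part of the original analysis: it assumes the leverage values are saved under the hypothetical name lev and flagged in a hypothetical variable highlev, with k = 3 and n = 51 taken from the text above.

<code>* Save centered leverage as lev (hypothetical name).
regression
  /dependent crime
  /method=enter pctmetro poverty single
  /save lever(lev).

* Flag cases above (2k+2)/n with k = 3 and n = 51.
compute highlev = (lev > (2*3 + 2)/51).
execute.

* List only the flagged cases.
temporary.
select if (highlev = 1).
list variables = state lev.
</code>

Going by the Centered Leverage Value rows of the Outlier Statistics table on this page, dc (.517), ak (.241), ms (.171), and wv (.161) should be the cases that exceed .1569.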

<code>regression
  /dependent crime
  /method=enter pctmetro poverty single
  /residuals=histogram(sdresid lever) id(state) outliers(sdresid lever)
  /casewise=plot(sdresid) outliers(2).
</code>
<code>		Outlier Statistics(a)
				Case 	state	Statistic
				Number
Stud. Deleted Residual	1	51	dc 	3.766
			2	25	ms 	-3.571
			3	9	fl 	2.620
			4	18	la 	-1.839
			5	39	ri 	-1.686
			6	12	ia 	1.590
			7	47	wa 	-1.304
			8	13	id 	1.293
			9	14	il 	1.152
			10	35	oh 	-1.148
Centered Leverage Value	1	51	dc 	.517
			2	1	ak 	.241
			3	25	ms 	.171
			4	49	wv 	.161
			5	18	la 	.146
			6	46	vt 	.117
			7	9	fl 	.083
			8	26	mt 	.080
			9	31	nj 	.075
			10	17	ky 	.072
a. Dependent Variable: violent crime rate
</code>

{{:r.crime.residual.histogram.sdresidual.jpg|histogram sdresid}}
{{:r.crime.residual.histogram.leverage.outlierl.jpg|histogram leverage}}

<code>regression
  /dependent crime
  /method=enter pctmetro poverty single
  /residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever)
  /casewise=plot(sdresid)  outliers(2)
  /scatterplot(*lever, *sdresid).
</code>
{{:r.crime.residual.scatterplot.leverage.sdresid.jpg|scatterplot of leverage by sdresid}}

<code>regression
  /dependent crime
  /method=enter pctmetro poverty single
  /residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
  /casewise=plot(sdresid)  outliers(2) cook dffit
  /scatterplot(*lever, *sdresid).
</code>
<code>		Casewise Diagnostics(a)
Case Number	state	Stud. 		violent 	Cook's		DFFIT
			Deleted		crime		Distance	
			Residual	rate
9	fl		2.620		1206		.174		48.507
25	ms		-3.571		434		.602		-123.490
51	dc		3.766		2922		3.203		477.319
a. Dependent Variable: violent crime rate

		Outlier Statistics(a)
		Case Number	state	Statistic	Sig. F
Stud.  		1	51	dc 	3.766	
Deleted		2	25	ms 	-3.571	
Residual	3	9	fl 	2.620	
		4	18	la 	-1.839	
		5	39	ri 	-1.686	
		6	12	ia 	1.590	
		7	47	wa 	-1.304	
		8	13	id 	1.293	
		9	14	il 	1.152	
		10	35	oh 	-1.148	
Cook's 		1	51	dc 	3.203	.021
Distance	2	25	ms 	.602	.663
		3	9	fl 	.174	.951
		4	18	la 	.159	.958
		5	39	ri 	.041	.997
		6	12	ia 	.041	.997
		7	13	id 	.037	.997
		8	20	md 	.020	.999
		9	6	co 	.018	.999
		10	49	wv 	.016	.999
Centered  	1	51	dc 	.517	
Leverage	2	1	ak 	.241	
Value		3	25	ms 	.171	
		4	49	wv 	.161	
		5	18	la 	.146	
		6	46	vt 	.117	
		7	9	fl 	.083	
		8	26	mt 	.080	
		9	31	nj 	.075	
		10	17	ky 	.072	
a. Dependent Variable: violent crime rate
</code>


<code>regression
  /dependent crime
  /method=enter pctmetro poverty single
  /residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
  /casewise=plot(sdresid)  outliers(2) cook dffit
  /scatterplot(*lever, *sdresid)
  /save sdbeta(sdfb).
</code>
<code>list
  /variables state sdfb1 sdfb2 sdfb3
  /cases from 1 to 10.
</code>
<code>state       sdfb1       sdfb2       sdfb3

ak        -.10618     -.13134      .14518
al         .01243      .05529     -.02751
ar        -.06875      .17535     -.10526
az        -.09476     -.03088      .00124
ca         .01264      .00880     -.00364
co        -.03705      .19393     -.13846
ct        -.12016      .07446      .03017
de         .00558     -.01143      .00519
fl         .64175      .59593     -.56060
ga         .03171      .06426     -.09120

Number of cases read:  10    Number of cases listed:  10
</code>

<code>VARIABLE LABLES sdfb1 "Sdfbeta pctmetro"
                              /sdfb2 "Sdfbeta poverty"
                              /sdfb3 "Sdfbeta single" .

GRAPH
  /SCATTERPLOT(OVERLAY)=sid sid sid  WITH sdfb1 sdfb2 sdfb3 (PAIR) BY state(name)
  /MISSING=LISTWISE .
</code>
{{:r.crime.residual.scatterplot.dbfBeta.jpg|sdfbeta values}}
|  Note  ||
| Measure  | Value  | 
| leverage  | >(2k+2)/n  | 
| abs(rstu)  | > 2  | 
| Cook's D  | > 4/n  | 
| abs(DFBETA)  | > 2/sqrt(n)  | 
<WRAP clear />
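
As a rough illustration of the DFBETA rule in the table, the sketch below flags cases whose standardized DFBETA exceeds 2/sqrt(n) in absolute value for any predictor. It assumes the sdfb1 to sdfb3 variables saved by /save sdbeta(sdfb) above and n = 51, which gives a cutoff of about .28; dfbflag is a hypothetical variable name.

<code>* Flag cases with |sdfbeta| > 2/sqrt(n), n = 51 (dfbflag is a hypothetical name).
compute dfbflag = (abs(sdfb1) > 2/sqrt(51)
                or abs(sdfb2) > 2/sqrt(51)
                or abs(sdfb3) > 2/sqrt(51)).
execute.

* List only the flagged cases.
temporary.
select if (dfbflag = 1).
list variables = state sdfb1 sdfb2 sdfb3.
</code>

In the listing of the first ten cases above, fl (with values around .6 in absolute size) would be flagged by this rule.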

<code> PRED
  Unstandardized predicted values.
 RESID
  Unstandardized residuals.
 DRESID
  Deleted residuals.
 ADJPRED
  Adjusted predicted values.
 ZPRED
  Standardized predicted values.
 ZRESID
  Standardized residuals.
 SRESID
  Studentized residuals.
 SDRESID
  Studentized deleted residuals. 
 SEPRED
  Standard errors of the predicted values.
 MAHAL
  Mahalanobis distances.
 COOK
  Cook’s distances.
 LEVER
  Centered leverage values. 
 DFBETA
  Change in the regression coefficient that results from the deletion of the ith case. A DFBETA value is computed for each case for each regression coefficient generated by a model.
 SDBETA
  Standardized DFBETA. An SDBETA value is computed for each case for each regression coefficient generated by a model. 
 DFFIT
  Change in the predicted value when the ith case is deleted. 
 SDFIT
  Standardized DFFIT. 
 COVRATIO
  Ratio of the determinant of the covariance matrix with the ith case deleted to the determinant of the covariance matrix with all cases included. 
 MCIN
  Lower and upper bounds for the prediction interval of the mean predicted response. A lowerbound LMCIN and an upperbound UMCIN are generated. The default confidence interval is 95%. The confidence interval can be reset with the CIN subcommand. (See Dillon & Goldstein
 ICIN
  Lower and upper bounds for the prediction interval for a single observation. A lowerbound LICIN and an upperbound UICIN are generated. The default confidence interval is 95%. The confidence interval can be reset with the CIN subcommand. (See Dillon & Goldstein
</code>

<code>regression
  /dependent crime
  /method=enter pctmetro poverty single
  /residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
  /casewise=plot(sdresid)  outliers(2) cook dffit
  /scatterplot(*lever, *sdresid)
  /partialplot.  
</code>

{{:r.crime.regression.outlier.01.jpg}}
{{:r.crime.regression.outlier.02.jpg}}
{{:r.crime.regression.outlier.03.jpg}}

<code>regression
  /dependent crime
  /method=enter pctmetro poverty single.
</code>

<code>			Coefficients(a)
		Unstandardized Coefficients		Standardized Coefficients
Model		B	Std. Error	Beta	t	Sig.
1	(Constant)	-1666.436	147.852		-11.271	.000
	pct metropolitan	7.829	1.255	.390	6.240	.000
	pct poverty	17.680	6.941	.184	2.547	.014
	pct single parent	132.408	15.503	.637	8.541	.000
a. Dependent Variable: violent crime rate
</code>

<code>compute filtvar = (state NE "dc").
filter by filtvar.
regression
  /dependent crime
  /method=enter pctmetro poverty single . 
</code>

<code>
			Coefficients(a)
		Unstandardized Coefficients		Standardized Coefficients
Model		B	Std. Error	Beta	t	Sig.
1	(Constant)	-1197.538	180.487		-6.635	.000
	pct metropolitan	7.712	1.109	.565	6.953	.000
	pct poverty	18.283	6.136	.265	2.980	.005
	pct single parent	89.401	17.836	.446	5.012	.000
a. Dependent Variable: violent crime rate
</code>
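
Comparing the two coefficient tables, the estimate for pct single parent drops from 132.408 to 89.401 once dc is excluded, while the other two slopes change little. Note that the filter stays active until it is switched off, so any later analysis of this file would still exclude dc. A minimal sketch for restoring all cases:

<code>* Turn the filter off and use all cases again.
filter off.
use all.
execute.
</code>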

====== e.g., 2 ======
See [[:multiple_regression#eg2]]. \\ 
 {{:elemapi2.sav}} \\ 
 {{:r.api00.OutlierDetection.sps}} \\

===== inspection =====
<code>descriptives /var= ALL .
</code>

|  Descriptive Statistics  ||||||
|   | N  | Minimum  | Maximum  | Mean  | Std. Deviation  | 
|api 2000  | 400  | 369  | 940  | 647.62  | 142.249  | 
|english language learners  | 400  | 0  | 91  | 31.45  | 24.839  | 
|avg class size k-3  | 398  | 14  | 25  | 19.16  | 1.369  | 
|avg parent ed  | 381  | 1.00  | 4.62  | 2.6685  | .76379  | 
|pct free meals  | 400  | 0  | 100  | 60.32  | 31.912  | 
|Valid N (listwise)  | 379  |   |   |   |   | 
<WRAP clear />
<code>graph
  /scatterplot(matrix)=api00 ell acs_k3 avg_ed meals .
</code>
{{:r.graph.whole.jpg}} \\ 
This graph does not reveal any suspicious cases.

<code>GRAPH /SCATTERPLOT(BIVAR)=ell with api00 .
GRAPH /SCATTERPLOT(BIVAR)=acs_k3 with api00  .
GRAPH /SCATTERPLOT(BIVAR)=avg_ed with api00 .
GRAPH /SCATTERPLOT(BIVAR)=meals with api00  .
</code>
| {{:r.01.jpg?300}}  | {{:r.02.jpg?300|acsk3}}  |
| {{:r.03.jpg?300|avg_ed}}  | {{:r.04.jpg?300|meals}}  |

The second IV (average class size, acs_k3) appears to be only weakly related to the DV (api00), and no cases look particularly suspicious. 
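
The visual impression can be checked against the correlation coefficients. A minimal sketch, using the same variables as the scatterplot matrix:

<code>* Pairwise correlations among the DV and the four IVs.
correlations
  /variables = api00 ell acs_k3 avg_ed meals.
</code>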

----
<code>REGRESSION
  /DEPENDENT api00
  /METHOD=ENTER ell acs_k3 avg_ed meals 
  /residuals=histogram(sdresid lever) id(snum) outliers(sdresid, lever, cook)
  /casewise=plot(sdresid)  outliers(2) cook dffit
  /scatterplot(*lever, *sdresid)
  /save sdbeta(sdfb)
  /partialplot.
</code>

|  Model Summary  ||||| 
|Model  | R  | R Square  | Adjusted \\ R Square  | Std. Error \\ of the Estimate  | 
|1  | .912a  | .833  | .831  | 58.633  | 
| a. Predictors: (Constant), pct free meals, avg class size k-3, english language learners, avg parent ed  |||||
<WRAP clear />

|  ANOVA(b)  |||||||
|Model  |   | Sum of Squares  | df  | Mean Square  | F  | Sig.  | 
|1  | Regression  | 6393719.254  | 4  | 1598429.813  | 464.956  | .000a  | 
|  | Residual  | 1285740.498  | 374  | 3437.809  |   |   | 
|  | Total  | 7679459.752  | 378  |   |   |   | 
| a. Predictors: (Constant), pct free meals, avg class size k-3, english language learners, avg parent ed  |||||||
| b. Dependent Variable: api 2000  |||||||
<WRAP clear />

|  Coefficients(a)  |||||||
|  |   | Unstandardized \\ Coefficients  |   | Standardized \\ Coefficients  |   |   | 
|Model  |   | B  | Std. Error  | Beta  | t  | Sig.  | 
|1  | (Constant)  | 709.639  | 56.240  |   | 12.618  | .000  | 
|  | english language learners  | -.843  | .196  | -.147  | -4.307  | .000  | 
|  | avg class size k-3  | 3.388  | 2.333  | .032  | 1.452  | .147  | 
|  | avg parent ed  | 29.072  | 6.924  | .156  | 4.199  | .000  | 
|  | pct free meals  | -2.937  | .195  | -.655  | -15.081  | .000  | 
| a. Dependent Variable: api 2000  |||||||

|  Casewise Diagnostics(a)  ||||||
|Case Number  | school number  | Stud. Deleted \\ Residual  | api 2000  | Cook's \\ Distance  | DFFIT  | 
|93  | 1497  | 2.170  | 604  | .010  | 1.292  | 
|97  | 1539  | 2.230  | 700  | .006  | .826  | 
|100  | 1515  | 2.222  | 667  | .005  | .661  | 
|105  | 1516  | 2.128  | 597  | .010  | 1.380  | 
|135  | 1633  | 2.072  | 584  | .044  | 6.085  | 
|188  | 1731  | 2.121  | 719  | .015  | 2.126  | 
|203  | 1621  | 2.034  | 717  | .006  | .831  | 
|226  | 211  | -3.241  | 386  | .015  | -1.325  | 
|227  | 182  | -2.653  | 411  | .005  | -.581  | 
|228  | 167  | 2.903  | 774  | .010  | .987  | 
|232  | 210  | -2.369  | 432  | .018  | -2.263  | 
|234  | 165  | -2.734  | 449  | .019  | -1.997  | 
|252  | 3700  | 2.036  | 717  | .013  | 1.878  | 
|259  | 3537  | -2.425  | 694  | .012  | -1.436  | 
|271  | 3758  | 3.012  | 690  | .022  | 2.108  | 
|272  | 3794  | 2.083  | 610  | .010  | 1.400  | 
|274  | 3759  | -2.290  | 585  | .069  | -8.646  | 
|304  | 4507  | 2.011  | 751  | .013  | 1.917  | 
|327  | 4737  | 2.470  | 808  | .012  | 1.447  | 
|334  | 4744  | 2.160  | 700  | .005  | .645  | 
|346  | 5362  | -2.138  | 487  | .010  | -1.359  | 
| a. Dependent Variable: api 2000  ||||||

|  Residuals Statistics(a)  ||||||  
|  | Minimum  | Maximum  | Mean  | Std. Deviation  | N  | 
|Predicted Value  | 449.17  | 910.04  | 647.64  | 130.056  | 379  | 
|Std. Predicted Value  | -1.526  | 2.018  | .000  | 1.000  | 379  | 
|Standard Error of Predicted Value  | 3.218  | 14.681  | 6.496  | 1.780  | 379  | 
|Adjusted Predicted Value  | 449.44  | 909.36  | 647.65  | 130.056  | 379  | 
|Residual  | -187.020  | 173.697  | .000  | 58.322  | 379  | 
|Std. Residual  | -3.190  | 2.962  | .000  | .995  | 379  | 
|Stud. Residual  | -3.201  | 2.980  | .000  | 1.002  | 379  | 
|Deleted Residual  | -188.345  | 175.805  | -.016  | 59.138  | 379  | 
|Stud. Deleted Residual  | -3.241  | 3.012  | .000  | 1.005  | 379  | 
|Mahal. Distance  | .141  | 22.702  | 3.989  | 3.030  | 379  | 
|Cook's Distance  | .000  | .069  | .003  | .006  | 379  | 
|Centered Leverage Value  | .000  | .060  | .011  | .008  | 379  | 
| a. Dependent Variable: api 2000  |||||| 

|  Outlier Statistics(a)  |||||| 
|  |   | Case Number  | school number  | Statistic  | Sig. F  | 
|Stud. Deleted Residual  | 1  | 226  | 211  | -3.241  |   | 
|  | 2  | 271  | 3758  | 3.012  |   | 
|  | 3  | 228  | 167  | 2.903  |   | 
|  | 4  | 234  | 165  | -2.734  |   | 
|  | 5  | 227  | 182  | -2.653  |   | 
|  | 6  | 327  | 4737  | 2.470  |   | 
|  | 7  | 259  | 3537  | -2.425  |   | 
|  | 8  | 232  | 210  | -2.369  |   | 
|  | 9  | 274  | 3759  | -2.290  |   | 
|  | 10  | 97  | 1539  | 2.230  |   | 
|Cook's Distance  | 1  | 274  | 3759  | .069  | .997  | 
|  | 2  | 135  | 1633  | .044  | .999  | 
|  | 3  | 26  | 4299  | .030  | 1.000  | 
|  | 4  | 193  | 1952  | .025  | 1.000  | 
|  | 5  | 271  | 3758  | .022  | 1.000  | 
|  | 6  | 234  | 165  | .019  | 1.000  | 
|  | 7  | 232  | 210  | .018  | 1.000  | 
|  | 8  | 200  | 1872  | .018  | 1.000  | 
|  | 9  | 108  | 1606  | .018  | 1.000  | 
|  | 10  | 388  | 4878  | .017  | 1.000  | 
|Centered Leverage Value  | 1  | 274  | 3759  | .060  |   | 
|  | 2  | 37  | 4308  | .058  |   | 
|  | 3  | 209  | 1795  | .050  |   | 
|  | 4  | 135  | 1633  | .046  |   | 
|  | 5  | 26  | 4299  | .040  |   | 
|  | 6  | 69  | 3000  | .037  |   | 
|  | 7  | 372  | 6068  | .036  |   | 
|  | 8  | 30  | 4317  | .035  |   | 
|  | 9  | 147  | 1709  | .035  |   | 
|  | 10  | 193  | 1952  | .033  |   | 
| a. Dependent Variable: api 2000  |||||| 

{{:r.api.histogram.sdresid.jpg|sdresidual check}}
{{:r.api.histogram.leverage.jpg|leverage check}}
{{:r.api.regression.predbyresi.01.jpg|plot spred by sresid}}

===== Outlier detection =====
Suppose we decide to exclude cases whose studentized deleted residual is unusually large. We set the criterion at ABS(sdresid) > 2; cases that meet this criterion will be filtered out.

We first need to save some residual statistics with the REGRESSION command. The saved values include:
 PRED
 ZPRED
 MAHAL
 COOK
 LEVER
 RESID
 ZRESID
 SDRESID
 DFBETA
Among them, we look at SDRESID, which is saved as the variable SDR_1 in the SPSS data set.

For reference: 

| Note: outlier detection  ||| 
|Measure|Value  |   | 
|leverage  | >(2k+2)/n  | 0.026385224  | 
|abs(rstu)  | > 2  | 2  | 
|Cook's D  | > 4/n  | 0.01055409  | 
|abs(DFBETA)  | > 2/sqrt(n)  | 0.102733099  | 
<WRAP clear />


<code>REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL CHANGE ZPP
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT api00
  /METHOD=ENTER meals ell acs_k3 avg_ed
  /residuals=histogram(sdresid lever) id(snum) outliers(sdresid, lever, cook) Durbin
  /casewise=plot(sdresid)  outliers(2) cook dffit
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /SAVE PRED ZPRED MAHAL COOK LEVER RESID ZRESID SDRESID DFBETA.
</code>
Then we filter out the cases whose SDR_1 value satisfies 
 abs(SDR_1) > 2
with the command below.
<code>USE ALL.
COMPUTE filterVar=(abs(SDR_1) < 2).
FILTER BY filterVar.
EXECUTE.
</code>
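
Before re-running the regression, it is worth confirming how many cases the filter keeps. A minimal sketch, assuming filterVar was created as above:

<code>* With the filter active, only the kept cases are counted.
frequencies variables = filterVar.
</code>

The Casewise Diagnostics table above lists 21 cases with an absolute studentized deleted residual above 2, so 358 of the 379 complete cases should remain. That agrees with the total df of 357 in the ANOVA table of the re-run regression.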

Then we run the regression again, excluding the suspicious cases. This time we do not save the residuals.
<code>REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA CHANGE ZPP
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT api00
  /METHOD=ENTER ell avg_ed acs_k3 meals
  /SCATTERPLOT=(*ZRESID ,*ZPRED) .
</code>
 
Compare the output of this regression with that of the previous one. 

|  Model Summary(b)  ||||||||||  
|Model  | R  | R \\ Square  | Adjusted \\ R Square  | Std. Error of \\ the Estimate  | Change \\ Statistics  |   |   |   |   | 
|  |   |   |   |   | R Square Change  | F Change  | df1  | df2  | Sig. F Change  | 
|1  | .938a  | .880  | .879  | 49.914  | .880  | 649.458  | 4  | 353  | .000  | 

|  ANOVA(b)  ||||||| 
|Model  |   | Sum of \\ Squares  | df  | Mean \\ Square  | F  | Sig.  | 
|1  | Regression  | 6472284.822  | 4  | 1618071.206  | 649.458  | .000a  | 
|  | Residual  | 879470.664  | 353  | 2491.418  |   |   | 
|  | Total  | 7351755.486  | 357  |   |   |   | 

|  Coefficients(a)  |||||||||| 
|Model  |   | Unstandardized \\ Coefficients  |   | Standardized \\ Coefficients  | t  | Sig.  | Correlations  |   |   | 
|  |   | B  | Std. Error  | Beta  |   |   | Zero-order  | Partial  | Part  | 
|1  | (Constant)  | 705.495  | 51.072  |   | 13.814  | .000  |   |   |   | 
|  | ell  | -.915  | .170  | -.160  | -5.374  | .000  | -.789  | -.275  | -.099  | 
|  | avg_ed  | 25.661  | 6.061  | .138  | 4.234  | .000  | .809  | .220  | .078  | 
|  | acs_k3  | 4.452  | 2.127  | .040  | 2.093  | .037  | .204  | .111  | .039  | 
|  | meals  | -3.056  | .171  | -.683  | -17.868  | .000  | -.928  | -.689  | -.329  | 

{{tag>statistics "multiple regression" regression "research methods"}}