Differences

This shows you the differences between two versions of the page.

--- outliers [2016/05/04 06:35] – created hkimscil
+++ outliers [2017/04/05 07:55] (current) – hkimscil
@@ Line 3: / Line 3: @@
 This is further reading on detecting outliers, adapted from http://www.ats.ucla.edu/stat/spss/webbooks/reg/chapter2/spssreg2.htm .
- attachment:crime.sav
+ {{:crime.sav}} \\
- attachment:outlierCheck.sps
+ {{:outlierCheck.sps}} \\
 <code>get file = "DirectoryOfYourComputer\crime.sav".
@@ Line 23: / Line 23: @@
 Valid N (listwise)	51
 </code>
+<WRAP box 600px>
+Suppose we want to predict crime from pctmetro, poverty, and single. That is, we run a regression with pctmetro, poverty, and single as the independent variables and crime as the dependent variable. The variables are described below.
+| crime: | violent crime rate  |
+| murder:  | murder rate  |
+| pctmetro:  | pct metropolitan  |
+| pctwhite:  | pct white  |
+| pcths:  | pct hs graduates  |
+| poverty:  | pct poverty  |
+| single:  | pct single parent  |
+First, the scatterplot matrix below gives an overview of the relationships among the variables.
+</WRAP>
 <code>graph
   /scatterplot(matrix)=crime murder pctmetro pctwhite pcths poverty single .
 </code>
 {{:r.crime.scatterplot.for.all.variables.jpg|scatterplot for all variables}}
+<WRAP box 600px>
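+The scatterplots for the dependent variable crime, which comes first in the matrix, show a case that lies far from the rest when crime is paired with the other variables. We need to look at this case more closely and decide whether it is essential, whether it contains an error, and whether, as an outlier on a numeric variable, it is better removed before the analysis.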
+</WRAP>
 <code>GRAPH /SCATTERPLOT(BIVAR)=pctmetro WITH crime BY state(name) .
@@ Line 41: / Line 58: @@
 </code>
 {{:r.crime.scatterplot.for.single.by.state.jpg|scatterplot of single by state}}
+<WRAP box 600px>
+All three graphs above suggest that dc may be a problem. For later comparison, we first run the regression predicting the crime rate from pctmetro, poverty, and single (as below) without taking any action on dc.
+</WRAP>
 <code>regression
@@ Line 70: / Line 91: @@
 </code>
+<WRAP box 600px>
+After the regression above, let us collect the residuals (the errors the prediction fails to account for) and draw a histogram of them. The last subcommand below does this.
+</WRAP>
 <code>regression
   /dependent crime
@@ Line 110: / Line 132: @@
 </code>
 {{:r.crime.residual.histogram.jpg|histogram}}
+<WRAP box 600px>
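+The figure above shows that the errors beyond -3.00 and 2.00 are noticeably larger than those of the other cases.  \\
+\\
+\\
+The command below redraws the histogram using the studentized deleted residual. The [[:student deleted residual|studentized deleted residual]] of a case is the residual computed from the predicted value obtained when the regression is fit with that case excluded, scaled by its standard error.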
+</WRAP>
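The leave-one-out idea behind the deleted residual can be sketched in plain Python. This is only an illustration, not the SPSS computation: it uses a single predictor, it stops at the raw deleted residual (it does not studentize it), and the data values and the helper names `fit_line` and `deleted_residuals` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def deleted_residuals(xs, ys):
    """For each case, refit without it and measure how far it is from that fit."""
    result = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(rest_x, rest_y)
        result.append(ys[i] - (intercept + slope * xs[i]))
    return result


# Hypothetical data: the last case is far from the trend of the others.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]

for x, d in zip(xs, deleted_residuals(xs, ys)):
    print(f"x={x:.0f}  deleted residual={d:.3f}")
```

Because the unusual case is left out of the fit that predicts it, it cannot pull the line toward itself, which is why this kind of residual makes such cases stand out.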
 <code>regression
@@ Line 117: / Line 147: @@
 </code>
 {{:r.crime.residual.histogram.sdresidual.jpg|histogram sdresid}}
+<WRAP box 600px>
+Having judged that outliers exist, we use outliers(sdresid) and id(state) to identify them. This subcommand lists the 10 most extreme values. The output below shows that "dc" has the largest value (3.766), followed by "ms" (-3.571) and "fl" (2.620).
+</WRAP>
 <code>regression
@@ Line 123: / Line 157: @@
   /residuals=histogram(sdresid) id(state) outliers(sdresid).
 </code>
-see at http://www2.bc.edu/~stevenw/MB875/mb875_Analyzing%20Residuals.htm for sdresid (studentized deleted residuals).
+See [[https://www2.bc.edu/william-stevenson/MB875/mb875_Analyzing%20Residuals.htm|Analyzing Residuals Document]] for sdresid (studentized deleted residuals).
 <code>		Residuals Statistics(a)
 					Minimum		Maximum		Mean	Std. Deviation	N
@@ Line 141: / Line 175: @@
 </code>
+<WRAP box 600px>
+The /casewise subcommand can be used to list the cases whose sdresid is extreme enough to exceed 2 in absolute value. See ''Casewise Diagnostics(a)'' below.
+</WRAP>
 <code>regression
   /dependent crime
@@ Line 157: / Line 193: @@
 a. Dependent Variable: violent crime rate
 </code>
+<WRAP box 600px #333>
+The following shows how to examine leverage values. Leverage identifies cases that can strongly influence the regression coefficient estimates, and it is available as an option to the histogram() and outliers() keywords. As a rule of thumb, leverage should not exceed (2k+2)/n; a case that does may be an outlier and deserves attention. Here k is the number of predictors and n is the number of cases. So we should look at cases whose leverage exceeds (2*3+2)/51, which is about .157.
+In the output below, fl has an extreme studentized deleted residual but its leverage is not extreme, so keeping fl in the analysis may be the right call. dc, however, is also rated extreme by its leverage.
+</WRAP>
 <code>regression
@@ Line 278: / Line 320: @@
 fl         .64175      .59593     -.56060
 ga         .03171      .06426     -.09120
 Number of cases read:  10    Number of cases listed:  10
 </code>
 <code>VARIABLE LABELS sdfb1 "Sdfbeta pctmetro"
@@ Line 293: / Line 333: @@
 </code>
 {{:r.crime.residual.scatterplot.dbfBeta.jpg|dbfBeta value}}
-|| Note  |
+|  Note  ||
-|Measure|Value  |
+| Measure  | Value  |
-|leverage  | >(2k+2)/n  |
+| leverage  | >(2k+2)/n  |
-|abs(rstu)  | > 2  |
+| abs(rstu)  | > 2  |
-|Cook's D  | > 4/n  |
+| Cook's D  | > 4/n  |
-|abs(DFBETA)  | > 2/sqrt(n)  |
+| abs(DFBETA)  | > 2/sqrt(n)  |
 <WRAP clear />
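As a worked example of the cutoffs in the Note table, the short Python sketch below evaluates them for the crime data used above (n = 51 cases, k = 3 predictors). The function name `influence_cutoffs` and the dictionary keys are illustrative only and do not come from SPSS.

```python
import math


def influence_cutoffs(n, k):
    """Rule-of-thumb cutoffs from the Note table for n cases and k predictors."""
    return {
        "leverage": (2 * k + 2) / n,        # > (2k+2)/n
        "abs_rstu": 2.0,                    # abs(rstu) > 2
        "cooks_d": 4 / n,                   # Cook's D > 4/n
        "abs_dfbeta": 2 / math.sqrt(n),     # abs(DFBETA) > 2/sqrt(n)
    }


# crime.sav example: 51 cases, predictors pctmetro, poverty, single
cutoffs = influence_cutoffs(n=51, k=3)
for name, value in cutoffs.items():
    print(f"{name}: {value:.4f}")
```

For these values the leverage cutoff works out to 8/51, or about .157. The cutoffs are conventions for flagging cases worth a closer look, not formal tests.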
@@ Line 388: / Line 428: @@
 ====== e.g., 2 ======
- redirected from . . . [wiki:MultipleRegression#s-4 multiple regression].
+See [[:multiple_regression#eg2]]. \\
- attachment:elemapi2.sav
+ {{:elemapi2.sav}} \\
- attachment:r.api00.OutlierDetection.sps
+ {{:r.api00.OutlierDetection.sps}} \\
 ===== inspection =====
 <code>descriptives /var= ALL .
@@ Line 407: / Line 448: @@
   /scatterplot(matrix)=api00 ell acs_k3 avg_ed meals .
 </code>
-{{:r.graph.whole.jpg)]]
+{{:r.graph.whole.jpg}} \\
 This graph does not reveal any suspicious cases.
 <code>GRAPH /SCATTERPLOT(BIVAR)=ell with api00 .
 GRAPH /SCATTERPLOT(BIVAR)=acs_k3 with api00  .
@@ Line 414: / Line 456: @@
 GRAPH /SCATTERPLOT(BIVAR)=meals with api00  .
 </code>
-|{{:r.01.jpg,width=300|ell",selflink)]]|{{:r.02.jpg,width=300|acsk3",selflink)]]|
+| {{:r.01.jpg?300}}  | {{:r.02.jpg?300|acsk3}}  |
-|{{:r.03.jpg,width=300|ave_ed",selflink)]]|{{:r.04.jpg,width=300|meals",selflink)]]|
+| {{:r.03.jpg?300|ave_ed}}  | {{:r.04.jpg?300|meals}}  |
 We speculate that the second IV (average class size) is not strongly related to the DV (api00). Also, there seem to be no particularly suspicious data points.
@@ Line 430: / Line 472: @@
 </code>
-|  Model Summary  ||||||
+|  Model Summary  |||||
-|Model  | R  | R Square  | Adjusted[[br]]R Square  | Std. Error \\ of the Estimate  |
+|Model  | R  | R Square  | Adjusted \\ R Square  | Std. Error \\ of the Estimate  |
 |1  | .912a  | .833  | .831  | 58.633  |
-| a. Predictors: (Constant), pct free meals, avg class size k-3, english language learners, avg parent ed  ||||||
+| a. Predictors: (Constant), pct free meals, avg class size k-3, english language learners, avg parent ed  |||||
 <WRAP clear />
-|  ANOVA(b)  ||||||||
+|  ANOVA(b)  |||||||
 |Model  |   | Sum of Squares  | df  | Mean Square  | F  | Sig.  |
 |1  | Regression  | 6393719.254  | 4  | 1598429.813  | 464.956  | .000a  |
 |  | Residual  | 1285740.498  | 374  | 3437.809  |   |   |
 |  | Total  | 7679459.752  | 378  |   |   |   |
-| a. Predictors: (Constant), pct free meals, avg class size k-3, english language learners, avg parent ed  ||||||||
+| a. Predictors: (Constant), pct free meals, avg class size k-3, english language learners, avg parent ed  |||||||
-| b. Dependent Variable: api 2000  ||||||||
+| b. Dependent Variable: api 2000  |||||||
 <WRAP clear />
-|  Coefficients(a)  |||||||||
+|  Coefficients(a)  |||||||
-|  |   | Unstandardized[[br]]Coefficients  |   | Standardized[[br]]Coefficients  |   |   |
+|  |   | Unstandardized \\ Coefficients  |   | Standardized \\ Coefficients  |   |   |
 |Model  |   | B  | Std. Error  | Beta  | t  | Sig.  |
 |1  | (Constant)  | 709.639  | 56.240  |   | 12.618  | .000  |
@@ Line 453: / Line 495: @@
 |  | avg parent ed  | 29.072  | 6.924  | .156  | 4.199  | .000  |
 |  | pct free meals  | -2.937  | .195  | -.655  | -15.081  | .000  |
-| a. Dependent Variable: api 2000  |||||||||
+| a. Dependent Variable: api 2000  |||||||
-|  Casewise Diagnostics(a)  ||||||||
+|  Casewise Diagnostics(a)  ||||||
-|Case Number  | school number  | Stud. Deleted[[br]]Residual  | api 2000  | Cook's[[br]]Distance  | DFFIT  |
+|Case Number  | school number  | Stud. Deleted \\ Residual  | api 2000  | Cook's \\ Distance  | DFFIT  |
 |93  | 1497  | 2.170  | 604  | .010  | 1.292  |
 |97  | 1539  | 2.230  | 700  | .006  | .826  |
@@ Line 478: / Line 520: @@
 |334  | 4744  | 2.160  | 700  | .005  | .645  |
 |346  | 5362  | -2.138  | 487  | .010  | -1.359  |
-| a. Dependent Variable: api 2000  ||||||||
+| a. Dependent Variable: api 2000  ||||||
-|  Residuals Statistics(a)  ||||||||
+|  Residuals Statistics(a)  ||||||
 |  | Minimum  | Maximum  | Mean  | Std. Deviation  | N  |
 |Predicted Value  | 449.17  | 910.04  | 647.64  | 130.056  | 379  |
@@ Line 494: / Line 536: @@
 |Cook's Distance  | .000  | .069  | .003  | .006  | 379  |
 |Centered Leverage Value  | .000  | .060  | .011  | .008  | 379  |
-| a. Dependent Variable: api 2000  ||||||||
+| a. Dependent Variable: api 2000  ||||||
-|  Outlier Statistics(a)  ||||||||
+|  Outlier Statistics(a)  ||||||
 |  |   | Case Number  | school number  | Statistic  | Sig. F  |
 |Stud. Deleted Residual  | 1  | 226  | 211  | -3.241  |   |
@@ Line 528: / Line 570: @@
 |  | 9  | 147  | 1709  | .035  |   |
 |  | 10  | 193  | 1952  | .033  |   |
-| a. Dependent Variable: api 2000  |||||||
+| a. Dependent Variable: api 2000  |||||
 {{:r.api.histogram.sdresid.jpg|sdresidual check}}
 {{:r.api.histogram.leverage.jpg|leverage check}}
+{{:r.api.regression.predbyresi.01.jpg|plot spred by sresid}}
-{{:r.api.regression.predbyresi.01.jpg|plot spred by sresid)]]
 ===== Outlier detection =====
 Suppose we decide to exclude cases whose studentized deleted residual is unusually large. We set the criterion at ABS(sdresid) > 2; cases that meet this criterion will be filtered out.
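The filtering rule can be illustrated with a minimal Python sketch. It is not the SPSS filter itself; it only shows the logic. The sdresid values for cases 93, 97, 226, 334, and 346 are copied from the Casewise Diagnostics and Outlier Statistics tables above; case 9999 and its value are hypothetical, added only to show a case that is kept.

```python
CUTOFF = 2.0  # criterion from the text: ABS(sdresid) > 2


def is_outlier(sdresid, cutoff=CUTOFF):
    """True when the studentized deleted residual exceeds the cutoff in absolute value."""
    return abs(sdresid) > cutoff


sdresid_by_case = {
    93: 2.170,
    97: 2.230,
    226: -3.241,
    334: 2.160,
    346: -2.138,
    9999: 0.500,  # hypothetical case, not in the data set
}

kept = [case for case, r in sdresid_by_case.items() if not is_outlier(r)]
dropped = [case for case, r in sdresid_by_case.items() if is_outlier(r)]

print("kept:", kept)
print("dropped:", dropped)
```

Taking the absolute value matters here: case 346 has a negative residual (-2.138) and would be missed by a one-sided test of sdresid > 2.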
@@ Line 597: / Line 639: @@
 |  Model Summaryb  ||||||||||
-|Model  | R  | R[[br]]Square  | Adjusted[[br]]R Square  | Std. Error of[[br]]the Estimate  | Change[[br]]Statistics  |   |   |   |   |
+|Model  | R  | R \\ Square  | Adjusted \\ R Square  | Std. Error of \\ the Estimate  | Change \\ Statistics  |   |   |   |   |
 |  |   |   |   |   | R Square Change  | F Change  | df1  | df2  | Sig. F Change  |
 |1  | .938a  | .880  | .879  | 49.914  | .880  | 649.458  | 4  | 353  | .000  |
 |  | |  ANOVAb  |||||
-|Model  |   | Sum of[[br]]Squares  | df  | Mean[[br]]Square  | F  | Sig.  |
+|Model  |   | Sum of \\ Squares  | df  | Mean \\ Square  | F  | Sig.  |
 |1  | Regression  | 6472284.822  | 4  | 1618071.206  | 649.458  | .000a  |
 |  | Residual  | 879470.664  | 353  | 2491.418  |   |   |
@@ Line 608: / Line 650: @@
 |  Coefficientsa  ||||||||||
-|Model  |   | Unstandardized[[br]]Coefficients  |   | Standardized[[br]]Coefficients  | t  | Sig.  | Correlations  |   |   |
+|Model  |   | Unstandardized \\ Coefficients  |   | Standardized \\ Coefficients  | t  | Sig.  | Correlations  |   |   |
 |  |   | B  | Std. Error  | Beta  |   |   | Zero-order  | Partial  | Part  |
 |1  | (Constant)  | 705.495  | 51.072  |   | 13.814  | .000  |   |   |   |