Differences

This shows you the differences between two versions of the page.

--- factor_analysis [2018/12/05 12:14] – [Factor solution among many . . .] hkimscil
+++ factor_analysis [2019/11/01 02:53] – [E.g. 2] hkimscil
@@ Line 74: / Line 74: @@
 \end{equation}
-In equation [1] above, e is the error term, and F1 and F2 are latent factors. The finance, marketing, and policy scores are produced by the contributions of F1 and F2. Because F1 and F2 are not observed variables, estimating a regression from the data is appropriate. A different method is therefore needed.
+In equation [1] above, e is the error term, and F1 and F2 are latent factors. The finance, marketing, and policy scores are produced by the contributions of F1 and F2. Because F1 and F2 are not observed variables, estimating a regression from the data is not appropriate. A different method is therefore needed.
 Meanwhile, $\beta_{ij}$ is a standardized coefficient (the beta value in regression); in factor analysis it is usually called a factor loading. As with a regression beta, a factor loading indicates how much factor F1 or F2 contributes to the finance score (or to another variable's score).
@@ Line 91: / Line 91: @@
 The regression equations with factors shown above rest on the following assumptions.
   - $E(e_{i}) = 0, \quad Var(e_{i}) = \sigma^2_{i}$
+    * This concerns the distribution of the errors.
     * expected value = mean of error terms = 0, with standard deviation = $\sigma_{i}$
     * Errors are assumed to be scattered randomly around a mean of 0, which gives them the properties above.
   - $E(F_{j}) = 0, \quad Var(F_{j}) = 1 $
+    * F is a hypothetical (latent) factor whose magnitude is expressed through standardized coefficients.
     * Factors are standardized with mean =0, standard deviation = 1. Hence, Var(F) = 1.
     * The data are assumed to be standardized before the factor coefficients are computed. The mean and standard deviation of F must therefore be 0 and 1, so the variance of F is also 1.
@@ Line 280: / Line 282: @@
 |  Variable, \\ Y<sub>i</sub>  |  Observed \\ variance, S<sup>2</sup><sub>i</sub>   |  Loadings on  || Communality, \\ $b^2_{i1} + b^2_{i2} $  |  Percent \\ explained  |  spec. \\ variance  |
 |  (1)  |  (2)  |  $F_{1}, b_{i1}$ \\ (3)  |  $F_{2}, b_{i2}$ \\ (4)  |  (5)  |  (6) = 100 x (5)/(2)  |    |
-|  Finance, $Y_{1}$  |  9.84  \\ (7)  |  3.136773  |  0.023799  |  9.8399 \\ (8)  |  99.999 \\ (8) / (7) * 100 =    |  0.0001 \\ (7) - (8)  |
+|  Finance, $Y_{1}$  |  9.84  \\ (7)  |  __3.136773__ ((heavy on F1 side))  |  0.023799  |  9.8399 \\ (8)  |  99.999 \\ (8) / (7) * 100 =    |  0.0001 \\ (7) - (8)  |
-|  Marketing, $Y_{2}$  |  5.04  |  -0.132190  |  2.237858  |  5.0255  |  99.712  |    |
+|  Marketing, $Y_{2}$  |  5.04  |  -0.132190  |  __2.237858__ ((heavy on F2 side))  |  5.0255  |  99.712  |  0.0145  |
-|  policy, $Y_{3}$  |  3.04  |  0.127697  |  1.731884  |  3.0157  |  99.201  |    |
+|  policy, $Y_{3}$  |  3.04  |  0.127697  |  __1.731884__ ((heavy on F2 side)) |  3.0157  |  99.201  |  0.0243   |
 |  Overall \\ SS loadings  |  17.92 \\ (9)  |  9.873125((contribution of F1 over the total variance)) \\ (10)  |  8.007997 ((contribution of F2 over the total variance of Y<sub>i</sub>)) \\ (11)  |  17.8811  |  99.783  |
 |    |    |  55.1%  \\ (10) / (9) =  |  44.7%  \\  (11) / (9) =  |    |    |    |
+Footnote 1) -> finance = quantitative ability = F1
+Footnotes 2), 3) -> marketing, policy = verbal ability = F2
+Footnote 4) is obtained as follows:
+<code>
+> l.f <- 3.136773
+> l.m <- -0.132190
+> l.p <- 0.127697
+> loadings.f1 <- c(l.f,l.m,l.p)
+> sum(loadings.f1^2) # value of (10) in the above table
+[1] 9.873126
+> </code>
+<code>> fd <- data.frame(finance,marketing,policy)
+> fd
+  finance marketing policy
+1       3         6      5
+2       7         3      3
+3      10         9      8
+4       3         9      7
+5      10         6      5</code>
+The functions below compute the population variance and standard deviation.
+<code>> pvar <- function(x) {
++     sum((x - mean(x))**2) / length(x)
++ }
+> psd <- function(x) {
++     sqrt (sum((x - mean(x))**2) / length(x))
++ }</code>
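R's built-in `var()` and `sd()` divide by n - 1, which is why separate population versions are defined here. A minimal sketch of how the two relate, using the five finance scores from `fd` (the compact one-line `pvar` and the variable `x` are written only for this illustration):

```r
# Population variance: divide the sum of squared deviations by n
pvar <- function(x) sum((x - mean(x))^2) / length(x)

x <- c(3, 7, 10, 3, 10)     # finance column of fd
n <- length(x)

pvar(x)                     # population variance (divides by n)
var(x) * (n - 1) / n        # sample variance rescaled to the same quantity
```

Both expressions evaluate to 9.84, the observed variance of finance used in the table.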
+<code>> fds <- stack(fd)
+> tapply(fds$values, fds$ind, mean)
+  finance marketing    policy
+.6       6.6       5.6
+> tapply(fds$values, fds$ind, pvar)
+  finance marketing    policy
+.84      5.04      3.04
+> options(digits=5)
+> tapply(fds$values, fds$ind, psd)
+  finance marketing    policy
+.1369    2.2450    1.7436
+>
+ </code>
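The remaining columns of the unstandardized table can be reproduced the same way. The sketch below copies the loadings and observed variances from that table; the object names (`b`, `s2`, `communality`, and so on) are chosen only for this illustration.

```r
# Loadings (columns 3 and 4) and observed variances (column 2) from the table
b <- matrix(c( 3.136773, 0.023799,
              -0.132190, 2.237858,
               0.127697, 1.731884),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("finance", "marketing", "policy"),
                            c("F1", "F2")))
s2 <- c(finance = 9.84, marketing = 5.04, policy = 3.04)

communality   <- rowSums(b^2)              # column (5): b_i1^2 + b_i2^2
pct_explained <- 100 * communality / s2    # column (6)
specific      <- s2 - communality          # specific variance
ss_loadings   <- colSums(b^2)              # (10) and (11)
share         <- 100 * ss_loadings / sum(s2)  # share of the total variance (9)

round(cbind(s2, b, communality, pct_explained, specific), 4)
round(rbind(ss_loadings, share), 3)
```

The percentages can differ from the table in the last decimal place, because the table appears to divide the already rounded communalities.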
 |  Standardized \\ Variable, \\ Y<sub>i</sub>  |  Observed \\ variance, S<sup>2</sup><sub>i</sub>   |  Loadings on  || Communality, \\ $b^2_{i1} + b^2_{i2} $  |  Percent \\ explained  |  spec. \\ variance  |
@@ Line 387: / Line 433: @@
 > data <- read.csv("dataset_EFA.csv")
 > data <- read.csv("http://commres.net/wiki/_media/r/dataset_exploratoryfactoranalysis.csv")
+> data <- read.csv("https://raw.githubusercontent.com/manirath/BigData/master/dataset_EFA.csv")
 > #display the data (warning: large output - only the first 10 rows are shown here)
@@ Line 1102: / Line 1149: @@
 ====== e.g., 5 ======
 {{:r:EFA.csv}}
+====== e.g. secu com finance 2007 example  ======
+{{:r:secu_com_finance_2007.csv}}
+<code>
+Sys.setlocale("LC_ALL","Korean")
+secu_com_finance_2007 <- read.csv("http://commres.net/wiki/_media/r/secu_com_finance_2007.csv")
+secu_com_finance_2007
+# V1 : return on total capital
+# V2 : return on equity
+# V3 : equity ratio
+# V4 : debt ratio
+# V5 : equity turnover
+# standardization
+secu_com_finance_2007 <- transform(secu_com_finance_2007,
+    V1_s = scale(V1),
+    V2_s = scale(V2),
+    V3_s = scale(V3),
+    V4_s = scale(V4),
+    V5_s = scale(V5))
+# Reverse the direction of the debt ratio (V4_s): max(V4_s) - V4_s
+secu_com_finance_2007 <- transform(secu_com_finance_2007, V4_s2 = max(V4_s) - V4_s)
+# variable selection
+secu_com_finance_2007_2 <- secu_com_finance_2007[,c("company", "V1_s", "V2_s", "V3_s", "V4_s2", "V5_s")]
+# Correlation analysis
+cor(secu_com_finance_2007_2[,-1])
+round(cor(secu_com_finance_2007_2[,-1]), digits=3) # rounded to 3 decimals
+# Scatter plot matrix
+plot(secu_com_finance_2007_2[,-1])
+# Scree Plot
+plot(prcomp(secu_com_finance_2007_2[,c(2:6)]), type="l", sub = "Scree Plot")
+</code>
+<code>
+# Factor analysis (maximum likelihood factor analysis)
+# rotation = "varimax"
+secu_factanal <- factanal(secu_com_finance_2007_2[,2:6],
+    factors = 2,
+    rotation = "varimax", # "varimax", "promax", "none"
+    scores="regression") # "regression", "Bartlett"
+print(secu_factanal)
+</code>
+<code>
+print(secu_factanal$loadings, cutoff=0) # display all loadings, including small ones
+# factor scores plotting
+secu_factanal$scores
+plot(secu_factanal$scores, main="Biplot of the first 2 factors")
+# Label each observation with its company name
+text(secu_factanal$scores[,1], secu_factanal$scores[,2],
+   labels = secu_com_finance_2007$company,
+   cex = 0.7, pos = 3, col = "blue")
+# factor loadings plotting
+points(secu_factanal$loadings, pch=19, col = "red")
+text(secu_factanal$loadings[,1], secu_factanal$loadings[,2],
+   labels = rownames(secu_factanal$loadings),
+   cex = 0.8, pos = 3, col = "red")
+# plotting lines between (0,0) and (factor loadings by Var.)
+segments(0,0,secu_factanal$loadings[1,1], secu_factanal$loadings[1,2])
+segments(0,0,secu_factanal$loadings[2,1], secu_factanal$loadings[2,2])
+segments(0,0,secu_factanal$loadings[3,1], secu_factanal$loadings[3,2])
+segments(0,0,secu_factanal$loadings[4,1], secu_factanal$loadings[4,2])
+segments(0,0,secu_factanal$loadings[5,1], secu_factanal$loadings[5,2])
+</code>
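The communality idea from the earlier tables carries over to `factanal()` output: since `factanal()` works on the correlation matrix, each variable's squared loadings summed across factors plus its uniqueness should come to about 1. The sketch below checks this on simulated data rather than on secu_com_finance_2007.csv, so it runs without the file. The data frame `sim`, the factors `f1` and `f2`, and the loadings used to generate `x1` to `x6` are all hypothetical.

```r
set.seed(42)
n  <- 500
f1 <- rnorm(n)   # two independent latent factors (simulated)
f2 <- rnorm(n)

# Six observed variables, three driven by each factor (made-up loadings)
sim <- data.frame(
  x1 = 0.9 * f1 + rnorm(n, sd = 0.4),
  x2 = 0.8 * f1 + rnorm(n, sd = 0.5),
  x3 = 0.7 * f1 + rnorm(n, sd = 0.6),
  x4 = 0.9 * f2 + rnorm(n, sd = 0.4),
  x5 = 0.8 * f2 + rnorm(n, sd = 0.5),
  x6 = 0.7 * f2 + rnorm(n, sd = 0.6))

fit <- factanal(sim, factors = 2, rotation = "varimax")

communality <- rowSums(fit$loadings^2)   # variance explained by the factors
cbind(communality,
      uniqueness = fit$uniquenesses,
      total      = communality + fit$uniquenesses)
```

Applied to `secu_factanal`, the same two lines would give the communality of each of the five standardized ratios.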
 ====== etc.  ======
 <del>see http://geog.uoregon.edu/bartlein/courses/geog495/lec16.html</del>
 {{:r:boxes.csv}}
 {{:r:cities.csv}}
-{{:r:secu_com_finance_2007.csv}}
 ====== Reference ======
 {{:factor_analysis_lecture_note.pdf|Lecture Note}} from databaser