Differences

This shows you the differences between two versions of the page.

--- b:head_first_statistics:visualization [2023/09/11 08:11] – [Scatter plot] hkimscil
+++ b:head_first_statistics:visualization [2025/09/08 08:22] (current) – [Histogram Modality] hkimscil
@@ Line 79: / Line 79: @@
 | 999  | 2  |
+{{:b:head_first_statistics:pasted:20240904-082648.png}}
 in R . . . .
@@ Line 90: / Line 90: @@
 hist(dat, breaks=5)
 </code>
+{{:b:head_first_statistics:pasted:20240904-082258.png}}
+<code>
+# Without a seed, every run draws a different sample
+dat.iq <- rnorm(1000, 100, 15)
+head(dat.iq)
+tail(dat.iq)
+head(dat.iq, n=12)
+tail(dat.iq, n=12)
+mean(dat.iq)
+sd(dat.iq)
+hist(dat.iq)
+hist(dat.iq, breaks=30, col='lightblue')
+# With a fixed seed, the same sample is drawn on every run
+set.seed(101)
+dat.iq <- rnorm(1000, 100, 15)
+head(dat.iq)
+tail(dat.iq)
+head(dat.iq, n=12)
+tail(dat.iq, n=12)
+mean(dat.iq)
+sd(dat.iq)
+hist(dat.iq)
+hist(dat.iq, breaks=30, col='lightblue')
+</code>
 ====== Scatter plot ======
 <code>
@@ Line 138: / Line 165: @@
    pch=19)</code>
-{{:c:ps1-1:2019:pasted:20190909-075028.png}}
+{{:b:head_first_statistics:pasted:20240904-083016.png}}
 explanatory variable on the x axis
@@ Line 146: / Line 173: @@
 Drawing a line among the data.
 <code># Add fit lines
 abline(lm(mpg~wt), col="red") # regression line (y~x)
-lines(lowess(wt,mpg), col="blue") # lowess line (x,y)</code>
+</code>
-{{:c:ps1-1:2019:pasted:20190909-075639.png}}
+{{:b:head_first_statistics:pasted:20240904-083157.png}}
+Beware of outliers
+[{{:pearson-6.png? |}}]
+<WRAP clear />
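The outlier caution above can be illustrated with a minimal sketch. It uses the built-in mtcars data already plotted in the scatter plot section and appends one made-up point; the values 10 (weight) and 60 (mpg) are assumptions chosen only to be extreme, not observations from any source.

```r
# Pearson's r for weight vs. fuel economy in the built-in mtcars data
r.clean <- cor(mtcars$wt, mtcars$mpg)

# Append one hypothetical, deliberately extreme point
# (a very heavy car with very high mpg); illustration only
wt.out  <- c(mtcars$wt, 10)
mpg.out <- c(mtcars$mpg, 60)
r.outlier <- cor(wt.out, mpg.out)

# Compare the two coefficients side by side
round(c(clean = r.clean, with.outlier = r.outlier), 3)

# Plot the data with the added point in red and both fit lines
plot(wt.out, mpg.out, pch = 19,
     col = c(rep("black", nrow(mtcars)), "red"),
     xlab = "Car Weight", ylab = "Miles Per Gallon")
abline(lm(mtcars$mpg ~ mtcars$wt), col = "blue")  # fit without the outlier
abline(lm(mpg.out ~ wt.out), col = "red")         # fit with the outlier
```

In this sketch the single added point is enough to turn a strong negative correlation into a positive one, which is why a scatter plot should be checked before r is interpreted.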
-A bit more fancy line
-<code># Enhanced Scatterplot of MPG vs. Weight
-# by Number of Car Cylinders
-library(car)
-scatterplot(mpg ~ wt | cyl, data=mtcars,
-   xlab="Weight of Car", ylab="Miles Per Gallon",
-   main="Enhanced Scatter Plot",
-   labels=row.names(mtcars))</code>
-{{:c:ps1-1:2019:pasted:20190909-080032.png}}
-Line can be:
+====== Presentation ======
+For a very good example, see
+https://www.gapminder.org/answers/how-does-income-relate-to-life-expectancy/
+  * Life expectancy data: {{:life.exp.csv}}
-**__Direction of the relationship__**
+<WRAP clear/>
-^  Direction of the relationship  ^^
+====== Histogram skewness ======
-| {{:r.positive.png}}  | {{:r.negative.png}}  |
+<WRAP column half>
+<code>
+####
+# left-skewed distribution
+# 1.
+set.seed(1)
+data <- rbeta(500, shape1 = 10, shape2 = 2)
+hist(data, probability = TRUE,
+     main = "Histogram with Left-skewed data",
+     xlab = "Value", ylab = "Density",
+     col = "lightblue", border = "white")
+# 2.
+# install.packages("fitdistrplus")
+library(fitdistrplus)
-**__Shape of the relationship__**
+fit <- fitdist(data, "beta")
-^  Shape of the relationship  ^^
+alpha_est <- fit$estimate["shape1"]
-| {{:r.positive.png}}  | {{:r.curvepositive.png}}  |
+beta_est <- fit$estimate["shape2"]
-**__Strength of the relationship__**
+# 3.
-^  Strength of the relationship  ^^
+curve(dbeta(x, shape1 = alpha_est, shape2 = beta_est),
-| [{{:r.StrengthA.png|Figure_4-1}}]  | [{{:r.StrengthB.png|Figure 4-2}}]  |
+      add = TRUE, col = "red", lwd = 2)
-| [{{:r.StrengthC.png|Figure_4-3}}]  | [{{:r.StrengthD.png|Figure 4-4}}]  |
+</code>
-<WRAP clear />
+</WRAP>
-Meaning of Pearson's r
-__Relations, not cause-effect__
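-[{{:r_eg.15.6.png?250 |Figure 6. Correlation And Causation}}] A correlation coefficient only tells us that two variables (x, y) are related; it does not explain why they are related. In other words, obtaining a sufficiently large r does __not__ justify claiming a '''cause''' and '''effect''' relationship between the two variables. For example, even if ice cream sales and sex crimes are correlated, there is no basis for concluding that the former causes the latter. Such a conclusion depends on the researcher's logical or theoretical judgment.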
-<WRAP clear />
-__Interpretation with limited range__
+<WRAP column half>
-[{{:r_eg.15.71.png?250 |Figure_7._Correlation_And_Range}}]
+{{:b:head_first_statistics:pasted:20250903-074821.png}}
-[{{:r_eg.15.7b1.png?250 |Figure_7._Correlation_And_Range}}]
+</WRAP>
-Be careful when judging the [[Range]] of the data, because the value of r changes drastically depending on where the data are cut.
+<WRAP clear/>
-<WRAP clear />
+<WRAP column half>
-__Outliers__
+<code>
-[{{:r_eg.15.8a.png?250 |Figure_7._Correlation_And_Extreme_Data}}]
+set.seed(1)
-[{{:r_eg.15.8b.png?250 |Figure_7._Correlation_And_Extreme_Data}}]
+data <- rbeta(500, shape1 = 10, shape2 = 10)
-Related to the point above, a very extreme outlier strongly affects the correlation between the two variables.
+hist(data, probability = TRUE,
-[{{:pearson-6.png?300 |}}]
+     main = "Histogram with Symmetric (Near-normal) Data",
+     xlab = "Value", ylab = "Density",
+     col = "lightblue", border = "white")
-Make sure that there is __no data entry error__.
+# 2.
-{{:r.crime.scatterplot.for.single.by.state.jpg}}
+# install.packages("fitdistrplus")
+library(fitdistrplus)
+fit <- fitdist(data, "beta")
+alpha_est <- fit$estimate["shape1"]
+beta_est <- fit$estimate["shape2"]
-<WRAP clear />
+# 3.
+curve(dbeta(x, shape1 = alpha_est, shape2 = beta_est),
+      add = TRUE, col = "red", lwd = 2)
+</code>
+</WRAP>
-see
+<WRAP column half>
-https://www.gapminder.org/answers/how-does-income-relate-to-life-expectancy/
+{{:b:head_first_statistics:pasted:20250903-074830.png}}
-  * Life expectancy data: {{:life.exp.csv}}
+</WRAP>
+<WRAP clear/>
+<WRAP column half>
 <code>
-le <- as.data.frame(read.csv("http://commres.net/wiki/_media/life.exp.csv", header=T))
+##
-colnames(le)[1] <- "c.code" # not strictly necessary, but the first characters of an imported header are sometimes garbled
+# right-skewed distribution
-lea <- le$X2017
+# 1.
-leb <- lea[complete.cases(lea)]
+set.seed(1)
-hist(leb, col="grey")
+data <- rbeta(500, shape1 = 2, shape2 = 10)
+hist(data, probability = TRUE,
+     main = "Histogram with Right-skewed Distribution",
+     xlab = "Value", ylab = "Density",
+     col = "lightblue", border = "white")
+# install.packages("fitdistrplus")
+library(fitdistrplus)
+fit <- fitdist(data, "beta")
+alpha_est <- fit$estimate["shape1"]
+beta_est <- fit$estimate["shape2"]
+#
+curve(dbeta(x, shape1 = alpha_est, shape2 = beta_est),
+      add = TRUE, col = "red", lwd = 2)
 </code>
+</WRAP>
+<WRAP column half>
+{{:b:head_first_statistics:pasted:20250903-082513.png}}
+</WRAP>
+<WRAP clear/>
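As a rough numerical companion to the three histograms above, the sketch below compares mean, median, and a simple moment-based sample skewness for the same three beta shapes. The helper `skew()` is a hypothetical name defined here, not a base R function, and a single seed is set once, so the samples are not identical to the ones plotted above.

```r
# Sample skewness as the third standardized moment (simple moment estimator);
# skew() is a helper defined for this sketch, not part of base R
skew <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(1)
left  <- rbeta(500, shape1 = 10, shape2 = 2)   # left-skewed
sym   <- rbeta(500, shape1 = 10, shape2 = 10)  # roughly symmetric
right <- rbeta(500, shape1 = 2,  shape2 = 10)  # right-skewed

# One column per sample: mean, median, and skewness
sapply(list(left = left, symmetric = sym, right = right),
       function(x) c(mean = mean(x), median = median(x), skewness = skew(x)))
```

A negative skewness goes with a long left tail and a positive one with a long right tail; for the symmetric sample the value should be close to zero.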
-[{{:c:ps1-1:2019:pasted:20190909-110252.png|Life expectancy in 2017}}]
+====== Histogram Modality ======
-<WRAP clear/>.
+<WRAP column half>
-[{{:c:ps1-1:2019:pasted:20190909-104759.png|Distribution of temperature}}]
+Unimodal
-<WRAP clear/>.
+<code>
-[{{:c:ps1-1:2019:pasted:20190909-111117.png|skewness}}]
+### unimodal data
-<WRAP clear/>.
+set.seed(1)
-[{{:c:ps1-1:2019:pasted:20190909-111001.png|modality}}]
+d.1 <- rnorm(500, 10, 2)
-<WRAP clear/>.
+hist(d.1, breaks = 30, probability = TRUE,
-box plot
+     main = "Hist with Unimodal distrib",
+     xlab = "Value", ylab = "Density",
+     col = "lightblue", border = "black")
+lines(density(d.1),
+      col = "darkred", lwd = 2)
+</code>
+</WRAP>
+<WRAP column half>
+{{:b:head_first_statistics:pasted:20250903-083409.png}}
+</WRAP>
+<WRAP clear/>
+Bimodal distribution
+<WRAP column half>
+<code>
+### bimodal data
+set.seed(1)
+d.1 <- rnorm(500, 10, 2)
+d.2 <- rnorm(500, 20, 2)
+d.all <- c(d.1, d.2)
+hist(d.all, breaks = 30, probability = TRUE,
+     main = "Hist with bimodal distrib",
+     xlab = "Value", ylab = "Density",
+     col = "lightblue", border = "black")
+lines(density(d.all),
+      col = "darkred", lwd = 2)
+</code>
+</WRAP>
+<WRAP column half>
+{{:b:head_first_statistics:pasted:20250903-083524.png}}
+</WRAP>
+<WRAP clear/>
+<WRAP column half>
+<code>
+### multi-modal data
+# Parameters for the first normal distribution (Mode 1)
+m.1 <- 50
+sd.1 <- 5
+# Parameters for the second normal distribution (Mode 2)
+m.2 <- 100
+sd.2 <- 15
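+# Parameters for the third normal distribution (Mode 3)
+m.3 <- 160
+sd.3 <- 6
+# Cumulative cut-off for Mode 1: P(Mode 1) = 0.3
+prop.1 <- 0.3
+# Cumulative cut-off for Mode 2: P(Mode 2) = 0.6 - 0.3 = 0.3
+prop.2 <- 0.6
+# Upper bound for Mode 3 (not used in the loop): P(Mode 3) = 1.0 - 0.6 = 0.4
+prop.3 <- 1.0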
+# Number of samples to generate
+n.sam <- 1000
+# Create an empty vector to store the combined samples
+mm.dist <- numeric(n.sam)
+set.seed(1)
+for (i in 1:n.sam) {
+  # Randomly choose which distribution to sample from
+  tmp <- runif(1)
+  if (tmp < prop.1) {
+    mm.dist[i] <- rnorm(1, mean = m.1, sd = sd.1)
+  } else if (tmp < prop.2) {
+    mm.dist[i] <- rnorm(1, mean = m.2, sd = sd.2)
+  } else {
+    mm.dist[i] <- rnorm(1, mean = m.3, sd = sd.3)
+  }
+}
+hist(mm.dist, breaks = 30,
+     main = "Multimodal Distribution",
+     xlab = "Value", ylab = "Density",
+     freq = FALSE,
+     col = "lightblue", border = "black")
+lines(density(mm.dist),
+      col = "darkred", lwd = 2)
+</code>
+</WRAP>
+<WRAP column half>
+{{:b:head_first_statistics:pasted:20250908-082219.png}}
+</WRAP>
+<WRAP clear/>
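The loop above draws one value at a time. A vectorized sketch of the same three-component mixture is shown below; the weights 0.3, 0.3, and 0.4 are what the cut-offs 0.3 and 0.6 imply. The names `means`, `sds`, `w`, `comp`, and `mm.vec` are introduced here for illustration, and because random numbers are consumed in a different order, the result will not match the loop value for value.

```r
set.seed(1)
n.sam <- 1000
means <- c(50, 100, 160)    # component means (Mode 1, 2, 3)
sds   <- c(5, 15, 6)        # component standard deviations
w     <- c(0.3, 0.3, 0.4)   # mixing weights implied by the 0.3 / 0.6 cut-offs

# Draw a component label for every observation, then sample from that component
comp   <- sample(1:3, size = n.sam, replace = TRUE, prob = w)
mm.vec <- rnorm(n.sam, mean = means[comp], sd = sds[comp])

# Observed share of each component, expected to be close to the weights
table(comp) / n.sam

hist(mm.vec, breaks = 30, freq = FALSE,
     main = "Multimodal Distribution (vectorized)",
     xlab = "Value", ylab = "Density",
     col = "lightblue", border = "black")
lines(density(mm.vec), col = "darkred", lwd = 2)
```

Keeping the weights in one vector makes it easier to check that they sum to 1 and to change the number of modes.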
+====== Box plot ======
+<WRAP column half>
 <code>
 # Boxplot of MPG by Car Cylinders
@@ Line 228: / Line 388: @@
     ylab="Miles Per Gallon")
 </code>
-{{:c:ps1-1:2019:pasted:20190909-111438.png}}
+</WRAP>
+<WRAP column half>
+{{:c:ps1-1:2019:pasted:20190909-111438.png}}
+</WRAP>
+<WRAP clear/>
+====== See also ======
+https://r-graph-gallery.com/