Table of Contents
R Cookbook
Chapter 1 Getting Started and Getting Help
Chapter 2 Some Basics
Chapter 3 Navigating the Software
Chapter 4 Input and Output
Chapter 5 Data Structures
Chapter 6 Data Transformations
Chapter 7 Strings and Dates
Chapter 8 Probability
Chapter 9 General Statistics
Chapter 10 Graphics
Chapter 11 Linear Regression and ANOVA
Chapter 12 Useful Tricks
Chapter 13 Beyond Basic Numerics and Statistics
Chapter 14 Time Series Analysis
Week01 (Mar 16, 19)
ideas and concepts
https://youtu.be/6ExajWI_r2w
https://youtu.be/J8e5dEH8K_Q
https://youtu.be/W3DhUXI5cyQ
https://youtu.be/qCeTcvWBDNY
https://youtu.be/1hJm0O-RY4Q
Course Introduction –> syllabus
Introduction to R and others
- Downloading and Installing R
- Starting R
- Entering Commands
- Exiting from R
- Interrupting R
- Viewing the Supplied Documentation
- Getting Help on a Function
- Searching the Supplied Documentation
- Getting Help on a Package
- Searching the Web for Help
- Finding Relevant Functions and Packages
- Searching the Mailing Lists
- Submitting Questions to the Mailing Lists
Basic terms
Descriptive statistics
Inferential statistics
For the concepts below, read the sampling document first
- Population
- Sample
- Parameter
- Statistic
- sampling methods
- probability
- non-probability
Hypothesis
- Difference and association
Variables
Assignment
Week02 (Mar. 23, 26)
Concepts and ideas
Some basics
- Introduction
- Printing Something
- Setting Variables
- Listing Variables
- Deleting Variables
- Creating a Vector
- Computing Basic Statistics
- Creating Sequences
- Comparing Vectors
- Selecting Vector Elements
- Performing Vector Arithmetic
- Getting Operator Precedence Right
- Defining a Function
- Typing Less and Accomplishing More
- Avoiding Some Common Mistakes
from the previous lecture (research question and hypothesis)
- Research Questions (or Problems)
- Two ideas guided by theories
- Questions on their relationships
- Conceptualization
-
- Educated guess (via theories)
- Difference
- Association
- Variables (vs. ideas, concepts, and constructs)
-
- Control variable
- Mediating (Intervening) variable
Assignment
Week03 (Mar 30, April 2)
Week 3 online lecture videos
Variance and standard deviation, covered in Howell, Ch. 4, are the foundation for understanding the statistical tests that come later, so make sure you know them well.
Concepts and ideas
Navigating software
- Introduction
- Getting and Setting the Working Directory
- Saving Your Workspace
- Viewing Your Command History
- Saving the Result of the Previous Command
- Displaying the Search Path
- Accessing the Functions in a Package
- Accessing Built-in Datasets
- Viewing the List of Installed Packages
- Installing Packages from CRAN
- Setting a Default CRAN Mirror
- Suppressing the Startup Message
- Running a Script
- Running a Batch Script
- Getting and Setting Environment Variables
- Locating the R Home Directory
- Customizing R
Mean
Mode
Median
Variance
Standard Deviation
±1 sd ≈ 68%
±2 sd ≈ 95% (exactly 95% at ±1.96 sd)
±3 sd ≈ 99.7% (exactly 99% at ±2.58 sd)
Standard score (a score expressed in standard deviation units) = z score
Sampling distribution via random sampling
Central Limit Theorem
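The coverage percentages and the z score listed above can be checked numerically. This is a minimal sketch in base R; the helper name `within_sd` and the vector `x` are made up for illustration and are not course data.

```r
# Proportion of a normal distribution that lies within +/- k standard deviations
within_sd <- function(k) pnorm(k) - pnorm(-k)

within_sd(1)      # about 0.68
within_sd(1.96)   # about 0.95
within_sd(3)      # about 0.997

# z score: distance from the mean, expressed in standard deviation units
x <- c(52, 61, 70, 74, 83)        # hypothetical scores
z <- (x - mean(x)) / sd(x)
z                                  # z scores have mean 0 and sd 1
```

Because z scores are unit free, the same `within_sd()` proportions apply to any normally distributed variable once it is standardized.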
Assignment
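Find two research articles that list their hypotheses (social science research articles would be a good option). For each article:
- Write down each hypothesis.
- Explain which variables are the independent, dependent, and intervening (moderator) variables.
- Explain how each variable was measured.
- Explain what kind of hypothesis each one is (a hypothesis of difference or of association).
- Find and record which test was used to test each hypothesis.
Due date: complete this by midnight next Wednesday (2018/09/26 11:59).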
Week04 (April 6, 9)
Class Activity
Lecture materials for this week
- https://youtu.be/JvpOJPCBQkQ : R cookbook: data structure
- https://youtu.be/_ynGzFFmm7U : Howell Ch 4. Variance 01: Introduction (DS, error, and SS)
- https://youtu.be/HugtyhU7Im8 : Howell Ch. 4. Variance 02: Variance for sample and n-1
Concepts and ideas
- Introduction
- Entering Data from the Keyboard
- Printing Fewer Digits (or More Digits)
- Redirecting Output to a File
- Listing Files
- Dealing with “Cannot Open File” in Windows
- Reading Fixed-Width Records
- Reading Tabular Data Files
- Reading from CSV Files
- Writing to CSV Files
- Reading Tabular or CSV Data from the Web
- Reading Data from HTML Tables
- Reading Files with a Complex Structure
- Reading from MySQL Databases
- Saving and Transporting Objects
Assignment
Week05 (April 13, 16)
- https://youtu.be/RE6DSk1DcJI : Why does the variance use n-1?
- https://youtu.be/PrPoOCW3v1s : proof of n-1
- https://youtu.be/Ssznnbdj5Lg : degrees of freedom
- https://youtu.be/valhVpf-haY : standard deviation
- https://youtu.be/Qaxj6LZ-iL0 : sampling distribution
- https://youtu.be/AbeIQvJJ5Vw : sampling distribution e.g. in R
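The ideas in these videos (the sampling distribution and the n-1 denominator) can be tried out with a small simulation. The sketch below is only an illustration under assumed values: the simulated population, the seed, the sample size `n = 25`, and the 5000 repetitions are all arbitrary choices, not taken from the lectures.

```r
set.seed(1)                                   # arbitrary seed, only for reproducibility
pop <- rnorm(100000, mean = 100, sd = 15)     # simulated "population"
n <- 25                                       # assumed sample size

# Sampling distribution of the mean: draw many samples, keep each mean
sample_means <- replicate(5000, mean(sample(pop, n)))
mean(sample_means)       # close to mean(pop)
sd(sample_means)         # close to the standard error below
sd(pop) / sqrt(n)        # standard error = sd / sqrt(n)

# Dividing SS by n underestimates the population variance on average;
# dividing by n - 1 (which is what var() does) corrects the bias.
ss_over_n <- replicate(5000, {
  s <- sample(pop, n)
  sum((s - mean(s))^2) / n
})
ss_over_n1 <- replicate(5000, var(sample(pop, n)))
c(by_n = mean(ss_over_n), by_n_minus_1 = mean(ss_over_n1), population = var(pop))
```

A histogram of `sample_means` (for example `hist(sample_means)`) should look roughly normal and much narrower than the population itself.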
Concepts and ideas
- Introduction
- Appending Data to a Vector
- Inserting Data into a Vector
- Understanding the Recycling Rule
- Creating a Factor (Categorical Variable)
- Combining Multiple Vectors into One Vector and a Factor
- Creating a List
- Selecting List Elements by Position
- Selecting List Elements by Name
- Building a Name/Value Association List
- Removing an Element from a List
- Flatten a List into a Vector
- Removing NULL Elements from a List
- Removing List Elements Using a Condition
- Initializing a Matrix
- Performing Matrix Operations
- Giving Descriptive Names to the Rows and Columns of a Matrix
- Selecting One Row or Column from a Matrix
- Initializing a Data Frame from Column Data
- Initializing a Data Frame from Row Data
- Appending Rows to a Data Frame
- Preallocating a Data Frame
- Selecting Data Frame Columns by Position
- Selecting Data Frame Columns by Name
- Selecting Rows and Columns More Easily
- Changing the Names of Data Frame Columns
- Editing a Data Frame
- Removing NAs from a Data Frame
- Excluding Columns by Name
- Combining Two Data Frames
- Merging Data Frames by Common Column
- Accessing Data Frame Contents More Easily
- Converting One Atomic Value into Another
- Converting One Structured Data Type into Another
Assignment
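Working together with your group members:
- Find a social science article that contains a literature review and hypotheses.
- Use dbpia or kyobo scholar.
- Summarize the content of the literature review.
- Introduce the hypotheses.
- Identify the independent and dependent variables, or any other types of variables, in each hypothesis.
- State how each variable was measured and its level of measurement.
- Before picking an article, I recommend agreeing on a shared academic interest with your group members so that you find an article you enjoy. For example, a student who is interested in design will probably be more drawn to an article about UI. On top of that, if there were a readable social science article on the UI of self-driving cars (or just cars), it would be interesting (though I suspect there isn't one . . . )
- The deadline is midnight next Tuesday.
- I recommend holding group meetings in a KakaoTalk chat room or with other technology.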
Week06 (April 20, 23)
Today's agenda (live online meeting)
- Confirm groups
- Announce next week's quiz
- Explain the group assignment
- Group meetings
Concepts and ideas
- Introduction
- Splitting a Vector into Groups
- Applying a Function to Each List Element
- Applying a Function to Every Row
- Applying a Function to Every Column
- Applying a Function to Groups of Data
- Applying a Function to Groups of Rows
- Applying a Function to Parallel Vectors or Lists
Strings and Dates
Announcement
- First quiz on Week 07, Tuesday class (Oct. 16)
- RANGE: Week 01 - 03 materials + lecture content + textbook
- Textbook:
- chapter 2, 3, 4, 5
- NEXT quiz will be held on Oct. 23 during the mid term schedule.
- The 2nd quiz will cover 1st quiz + Week 05-07 materials.
Assignment
Week07 (April 27, 30)
Concepts and ideas
Assignment review –> groups
- In R, you need to understand the qnorm(proportion) and pnorm(z-score) functions
- see z_score
- In R, you need to understand the qt(proportion, df) and pt(t-score, df) functions
- see probability
Probability calculation in R ← Probability in R cookbook (textbook)
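The relationship between the four functions named above can be sketched as follows. This uses only base R; the value `df <- 9` is a hypothetical choice (a sample of 10 observations) made for the example.

```r
# pnorm: z score -> cumulative proportion; qnorm: proportion -> z score
pnorm(1.96)             # about 0.975
qnorm(0.975)            # about 1.96
qnorm(pnorm(1.5))       # 1.5: the two functions undo each other

# pt and qt do the same for the t distribution, which also needs df
df <- 9                 # hypothetical: a sample of n = 10
qt(0.975, df)           # two-tailed critical t value at alpha = .05
pt(qt(0.975, df), df)   # back to 0.975
```

The critical t value is larger than the matching z value of 1.96 because the t distribution has heavier tails at small df.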
. . . .
ANOVA
factorial anova
correlation
regression
- Introduction
- Counting the Number of Combinations
- Generating Combinations
- Generating Random Numbers
- Generating Reproducible Random Numbers
- Generating a Random Sample
- Generating Random Sequences
- Randomly Permuting a Vector
- Calculating Probabilities for Discrete Distributions
- Calculating Probabilities for Continuous Distributions
- Converting Probabilities to Quantiles
- Plotting a Density Function
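A few of the probability topics listed above can be tried in a handful of lines. This is only a sketch with arbitrary numbers (the seed 42, the range 1:100, and the distribution parameters are assumptions for the example), not an excerpt from the textbook.

```r
choose(5, 2)                  # number of ways to pick 2 items out of 5
combn(5, 2)                   # the combinations themselves, one per column

# Reproducible random sampling: the same seed gives the same sample
set.seed(42)                  # arbitrary seed
first <- sample(1:100, 10)
set.seed(42)
second <- sample(1:100, 10)
identical(first, second)      # TRUE

dbinom(3, size = 10, prob = 0.5)                     # discrete: P(X = 3)
pnorm(130, mean = 100, sd = 15, lower.tail = FALSE)  # continuous: P(X > 130)
qnorm(0.05, mean = 100, sd = 15)                     # probability -> quantile
```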
Assignment
- Practice writing hypotheses
- how to write hypotheses in behavioral science writing
- One sample hypothesis Hypothesis at www.socialresearchmethods.net
Individual assignment
Week08 (May 4, 7)
Exam period
Make-up video lecture
Week09 (May 11, 14)
Concepts and ideas
General Statistics
t-test
ANOVA
Factorial ANOVA
repeated measures anova
correlation and regression and multiple regression
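- Before regression, recall that SS is the sum of squared errors, where each error is the gap between an observed value and the estimate used to guess it.
- sum of squared errors = SS (sum of squares; the term is normally used without the word "error")
- For this, carefully read the standard error and residual variance (standard error residual) section in the Regression document.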
- Introduction
- Summarizing Your Data
- Calculating Relative Frequencies
- Tabulating Factors and Creating Contingency Tables
- Testing Categorical Variables for Independence
- Calculating Quantiles (and Quartiles) of a Dataset
- Inverting a Quantile
- Converting Data to Z-Scores
- Testing the Mean of a Sample (t Test)
- Forming a Confidence Interval for a Mean
- Forming a Confidence Interval for a Median
- Testing a Sample Proportion
- Forming a Confidence Interval for a Proportion
- Testing for Normality
- Testing for Runs
- Comparing the Means of Two Samples
- Comparing the Locations of Two Samples Nonparametrically
- Testing a Correlation for Significance
- Testing Groups for Equal Proportions
- Performing Pairwise Comparisons Between Group Means
- Testing Two Samples for the Same Distribution
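The note earlier this week that SS is a sum of squared errors can be made concrete by comparing two ways of guessing an outcome: with the mean, and with a regression line. This is a minimal sketch; the vectors `x` and `y` are hypothetical numbers chosen for the example.

```r
x <- c(1, 2, 3, 4, 5)      # hypothetical predictor
y <- c(2, 4, 5, 4, 6)      # hypothetical outcome

# Guess every y with the mean: the errors are deviations from the mean
ss_total <- sum((y - mean(y))^2)

# Guess y with a regression line: the errors are the residuals
fit <- lm(y ~ x)
ss_resid <- sum(resid(fit)^2)
ss_reg <- ss_total - ss_resid             # part of SS explained by x

ss_reg / ss_total                         # same as summary(fit)$r.squared
sqrt(ss_resid / (length(y) - 2))          # residual standard error
```

The last line divides the residual SS by its degrees of freedom (n - 2 for one predictor) before taking the square root, which matches the "Residual standard error" line in `summary(fit)`.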
Assignment
Week10 (May 18, 21)
Concepts and ideas
multiple regression continued.
using dummy variables
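As an illustration of a dummy variable in multiple regression, the sketch below uses the built-in `mtcars` data, where `am` is already coded 0 (automatic) and 1 (manual). The choice of `wt` as the second predictor and the factor labels are assumptions made for this example, not the lecture's own model.

```r
cars <- mtcars
cars$trans <- factor(cars$am, labels = c("auto", "manual"))

# lm() turns the factor into a 0/1 dummy column automatically
fit <- lm(mpg ~ wt + trans, data = cars)
coef(fit)      # "transmanual" is the coefficient of the dummy variable

# The same model with the explicit 0/1 variable gives identical estimates
fit2 <- lm(mpg ~ wt + am, data = cars)
all.equal(unname(coef(fit)), unname(coef(fit2)))   # TRUE
```

The dummy coefficient is read as the expected difference in `mpg` between manual and automatic cars of the same weight.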
Assignment
Week11 (May 25, 28)
Concepts and ideas
getting started
basics
navigating in r
input output in r
data structures
data transformations
- Introduction
- Creating a Scatter Plot
- Adding a Title and Labels
- Adding a Grid
- Creating a Scatter Plot of Multiple Groups
- Adding a Legend
- Plotting the Regression Line of a Scatter Plot
- Plotting All Variables Against All Other Variables
- Creating One Scatter Plot for Each Factor Level
- Creating a Bar Chart
- Adding Confidence Intervals to a Bar Chart
- Coloring a Bar Chart
- Plotting a Line from x and y Points
- Changing the Type, Width, or Color of a Line
- Plotting Multiple Datasets
- Adding Vertical or Horizontal Lines
- Creating a Box Plot
- Creating One Box Plot for Each Factor Level
- Creating a Histogram
- Adding a Density Estimate to a Histogram
- Creating a Discrete Histogram
- Creating a Normal Quantile-Quantile (Q-Q) Plot
- Creating Other Quantile-Quantile Plots
- Plotting a Variable in Multiple Colors
- Graphing a Function
- Pausing Between Plots
- Displaying Several Figures on One Page
- Opening Additional Graphics Windows
- Writing Your Plot to a File
- Changing Graphical Parameters
Assignment
Week12 (June 1, 4)
Announcement
Quiz 03: Nov. 23
Concepts and ideas
chi-square test
probability
general statistics
Graphics
Assignment
Week13 (June 8, 11)
Concepts and ideas
Do the following
S1 <- c(89, 85, 85, 86, 88, 89, 86, 82, 96, 85, 93, 91, 98, 87, 94, 77, 87, 98, 85, 89, 95, 85, 93, 93, 97, 71, 97, 93, 75, 68, 98, 95, 79, 94, 98, 95)
S2 <- c(60, 98, 94, 95, 99, 97, 100, 73, 93, 91, 98, 86, 66, 83, 77, 97, 91, 93, 71, 91, 95, 100, 72, 96, 91, 76, 100, 97, 99, 95, 97, 77, 94, 99, 88, 100, 94, 93, 86)
S3 <- c(95, 86, 90, 90, 75, 83, 96, 85, 83, 84, 81, 98, 77, 94, 84, 89, 93, 99, 91, 77, 95, 90, 91, 87, 85, 76, 99, 99, 97, 97, 97, 77, 93, 96, 90, 87, 97, 88)
S4 <- c(67, 93, 63, 83, 87, 97, 96, 92, 93, 96, 87, 90, 94, 90, 82, 91, 85, 93, 83, 90, 87, 99, 94, 88, 90, 72, 81, 93, 93, 94, 97, 89, 96, 95, 82, 97)
scores <- list(S1=S1,S2=S2,S3=S3,S4=S4)
- find means for each element in “scores” in a list format
- find standard deviation for each element in “scores” in a data frame format
- find variance for each element in “scores” in a data frame format without using “var” function
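One possible approach to tasks like these is shown below on a small made-up list rather than on `scores` itself. The list `toy` and the helper `my_var` are hypothetical names used only for this sketch.

```r
toy <- list(A = c(1, 2, 3, 4), B = c(10, 20, 30), C = c(5, 5, 8))  # hypothetical data

lapply(toy, mean)                    # lapply returns a list, one mean per element
data.frame(sd = sapply(toy, sd))     # sapply returns a vector, wrapped in a data frame

# Variance without var(): sum of squared deviations divided by n - 1
my_var <- function(v) sum((v - mean(v))^2) / (length(v) - 1)
data.frame(variance = sapply(toy, my_var))
```

`lapply()` is the natural choice when the result should stay a list, while `sapply()` simplifies to a named vector that drops straight into `data.frame()`.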
longdata<- c(-1.850152, -1.406571, -1.0104817, -3.7170704, -0.2804896, 0.9496313, 1.346517, -0.1580926, 1.6272786, -2.4483321, -0.5407272, -1.708678, -0.3480616, -0.2757667, -1.2177024)
- convert “longdata” into a 3 by 5 matrix
- name columns “trial1, trial2, . . . . trial5”
- name rows “subject1, subject2, subject3”
- get means for each subject
- attach the above data to the matrix data and name it “longtemp.”
- get standard deviation for each trial
- attach the above data to the matrix data, “longtemp.”
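The matrix steps above (naming rows and columns, then attaching row and column summaries) follow a common pattern, sketched here on a tiny made-up matrix instead of `longdata`. The object names `m` and `m2` are hypothetical.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)     # hypothetical data, filled column by column
colnames(m) <- paste0("trial", 1:3)
rownames(m) <- paste0("subject", 1:2)

row_means <- apply(m, 1, mean)           # margin 1 = rows: one mean per subject
m2 <- cbind(m, mean = row_means)         # attach the means as a new column

col_sds <- apply(m2, 2, sd)              # margin 2 = columns, including the new one
m2 <- rbind(m2, sd = col_sds)            # attach the sds as a new row
m2
```

Note that `matrix()` fills column by column by default; pass `byrow = TRUE` if the data were recorded row by row.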
suburbs <- read.csv("http://commres.net/wiki/_export/code/r/data_transformations?codeblock=15", head=T, sep=" ")
- get the suburbs data as above
- get population means by each state (listed in the data, suburbs)
- use aggregate and refer to the below e.g.
library(MASS)   # Cars93 comes from the MASS package
attach(Cars93)
aggregate(MPG.city ~ Origin, Cars93, mean)
- get population sum by each county with tapply function.
- tapply(number, byfactor, function)
- how many counties are there?
- Use Cars93 data, get MPG.city mean by Origin.
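The `aggregate` and `tapply` calls mentioned above produce the same group means in different shapes. A short sketch using the `Cars93` data from the MASS package (the object names `agg` and `tap` are made up for this example):

```r
library(MASS)    # provides the Cars93 data set

# Formula interface: returns a data frame with one row per group
agg <- aggregate(MPG.city ~ Origin, data = Cars93, FUN = mean)
agg

# tapply(number, byfactor, function): returns a named vector
tap <- tapply(Cars93$MPG.city, Cars93$Origin, mean)
tap

tapply(Cars93$MPG.city, Cars93$Origin, length)   # number of rows in each group
nlevels(Cars93$Origin)                           # number of distinct groups
```

Swapping `mean` for `sum` or `length` in either call answers "total per group" and "how many per group" questions in the same way.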
Using pnorm, qnorm
pnorm: returns the cumulative proportion (the area below a given value) of a normal distribution with the given mean and sd
pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
- What is the value of the below?
pnorm(1)
- How would you get 68, 95, and 99% from pnorm?
- use ?pnorm and see the default option
- generate 10 random numbers with runif function
year <- c(1900:2016) # years in vector year
world.series <- data.frame(year)
- get 10 year samples out of world.series data with “sample” command
- how would you get the same sample again later?
pnorm(110, mean=100, sd=10)
- What would be the result from the above?
library(MASS) # load the MASS package
tbl = table(survey$Smoke, survey$Exer)
tbl # the contingency table
summary(tbl)
- read the above output and interpret
- what about the below one?
chisq.test(tbl)
see first chi-square test
see chi-square test in r document space for more
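To see where the chi-square statistic comes from, the sketch below recomputes it by hand from the same `survey` table and compares it with `chisq.test`. The object names `res`, `expected`, and `by_hand` are made up for this example; R may warn that the approximation is unreliable because some expected counts in this table are small.

```r
library(MASS)                               # provides the survey data set
tbl <- table(survey$Smoke, survey$Exer)

res <- chisq.test(tbl)                      # may warn about small expected counts
res$statistic                               # chi-square value
res$parameter                               # df = (rows - 1) * (columns - 1)
res$p.value

# By hand: expected count = row total * column total / grand total
expected <- outer(rowSums(tbl), colSums(tbl)) / sum(tbl)
by_hand <- sum((tbl - expected)^2 / expected)
by_hand                                     # matches res$statistic
```

`res$expected` holds the same expected counts, which is a quick way to check whether the small-count warning applies.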
library(MASS)
cardata <- data.frame(Cars93$Origin, Cars93$Type)
cardata
- Can you say the types of cars are different by the Origins?
dur <- faithful$eruptions
dur
- make the above data into z-score (zdur).
- get mean of the zdur
- get sd of the zdur
set.seed(1123)
x <- rnorm(50, mean=100, sd=15)
- test x against population mean 95.
- test x against population mean 99.
- are they different from each other?
- what would you do if you want to see the different result from the second one?
a = c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)
> t.test(a, mu=60)

	One Sample t-test

data:  a
t = 2.3079, df = 9, p-value = 0.0464
alternative hypothesis: true mean is not equal to 60
95 percent confidence interval:
 60.22187 82.17813
sample estimates:
mean of x
     71.2
- find the t critical value with function qt.
- explain what happens in the next code
- read (or remind) what pnorm and qnorm do.
> s <- sd(x)
> m <- mean(x)
> n <- length(x)
> n
[1] 50
> m
[1] 96.00386
> s
[1] 17.38321
> SE <- s / sqrt(n)
> SE
[1] 2.458358
> E <- qt(.975, df=n-1)*SE
> E
[1] 4.940254
> m + c(-E, E)
[1]  91.0636 100.9441
- what's wrong with the below?
t.test(x)
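One way to explore this question is to compare the hand-built confidence interval with what `t.test` reports, and to look at which null value `t.test` uses when `mu` is not given. This sketch regenerates `x` with the same seed as above; the names `manual` and `auto` are made up here.

```r
set.seed(1123)
x <- rnorm(50, mean = 100, sd = 15)

n  <- length(x)
SE <- sd(x) / sqrt(n)                    # standard error of the mean
E  <- qt(.975, df = n - 1) * SE          # margin of error for a 95% interval
manual <- mean(x) + c(-E, E)

auto <- t.test(x)$conf.int               # the interval does not depend on mu
manual
auto

t.test(x)$null.value                     # default null hypothesis: mean = 0
t.test(x, mu = 95)$p.value               # tests against a stated population mean
```

Since the default `mu` is 0, a bare `t.test(x)` answers "is the mean different from zero?", which is rarely the question of interest for data like these.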
> mtcars
- using aggregate, get the mean for each transmission type.
- compare the difference of mileage between auto and manual cars.
- use t.test (two sample)
- use the “var.equal=T” option
a = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b = c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)
- stack them into data c
- rename the columns to score and trans
- t.test score by trans with var.equal option true.
- aov test
- see t.test t value, t = -0.9474 and F value, F = ?
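The link between the two tests can be checked directly: with two groups and equal variances assumed, the ANOVA F value is the square of the t value. A sketch using the `a` and `b` vectors above (the names `c_data`, `tt`, `av`, and `f_value` are made up for this example):

```r
a <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b <- c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

c_data <- stack(list(a = a, b = b))          # long format: columns values, ind
colnames(c_data) <- c("score", "trans")

tt <- t.test(score ~ trans, data = c_data, var.equal = TRUE)
av <- summary(aov(score ~ trans, data = c_data))
f_value <- av[[1]][["F value"]][1]           # F for the trans effect

tt$statistic                                 # t value
tt$statistic^2                               # t squared ...
f_value                                      # ... equals F
```

The p values of the two tests also agree, because an F distribution with 1 numerator df is the distribution of a squared t.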
Assignment
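Group assignment: write hypotheses related to the independent t-test, repeated measures t-test, ANOVA, Factorial ANOVA, repeated measures ANOVA, regression, and multiple regression, and write survey items using Google Docs. Collect data with the survey and then test the hypotheses. Discuss the test results in as much detail as possible. At a minimum, the assignment must do the following.
- Base the hypotheses on common sense, social science theories you know, and so on.
- Each hypothesis must come with an explanation. That is, writing the hypothesis alone is not enough.
- When you create the survey items with Google Survey, include the following.
- Respondent's student ID, name, and email (to evaluate participation: participation + (in)sincere responses)
- Items that can test each hypothesis
- Test the hypotheses using R.
- Discuss the test results meaningfully.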
—-
What to submit
- Introduction and explanation of the hypotheses
- independent t-test
- repeated measures t-test
- ANOVA
- Factorial ANOVA
- repeated measures ANOVA
- regression
- multiple regression
- Survey items for each hypothesis, with the IV and DV identified and the level of measurement explained
- independent t-test
- repeated measures t-test
- ANOVA
- Factorial ANOVA
- repeated measures ANOVA
- regression
- multiple regression
- Analysis results and discussion for each hypothesis test
- independent t-test
- repeated measures t-test
- ANOVA
- Factorial ANOVA
- repeated measures ANOVA
- regression
- multiple regression
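Attachments to submit
- List of survey participants (submitted separately as an Excel file with a name like survey.participants.group.01.xlsx)
- The instructor will first distribute the class roster (as an Excel file).
- In the spreadsheet, write each participant's family name and given name together, without a space.
- Members of your own group also take part in your group's survey.
- Full participation = 1
- No participation = 0
- Incomplete participation = 2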
Week14 (June 15, 18)
Concepts and ideas
ANOVA
Linear Regression and ANOVA
http://commres.net/wiki/text_mining_example_with_korean_songs
Assignment
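There will be a quiz this weekend (Saturday).
- Multiple choice (four options) or short answer
- You can choose your start time between 9:00 a.m. and 6:00 p.m. on Saturday
- The time limit for the quiz is about 40 minutes
The quiz covers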
stats part
r part
Week15 (June 22, 25)
Final quiz
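Part I (written exam): NO open book.
- factor analysis - the part concerning theoretical understanding
- from the R-related content, the part concerning understanding of statistics, for example
- understanding t-test, ANOVA, and Factorial ANOVA output
- understanding regression and multiple regression output, etc.
Part II (R practical exam): only the textbook and R help are allowed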
Week16 (June 29, July 2)
Final-term
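- There will be a quiz on Thursday, July 02.
- The quiz window is 12:00 - 9:00. The quiz is timed. No extensions or late submissions will be allowed. If the time limit is 60 minutes, the quiz is submitted automatically once that time has passed.
- The quiz covers the following.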
- t-test
- ANOVA
- Repeated measure ANOVA
- Factorial ANOVA
- correlation
- Regression
- Multiple regression
- Using dummy variable
- Interpreting the roles of IVs