Sequential Data Analysis
Starting with R and TraMineR

Gilbert Ritschard
Alexis Gabadinho, Matthias Studer
Institute for Demographic and Life Course Studies, University of Geneva and NCCR LIVES: Overcoming vulnerability, life course perspectives
http://mephisto.unige.ch/traminer
September - November, 2012

Outline
1 About the R statistical environment
2 A short introduction to R
3 Importing data and checking content of data frames
4 TraMineR and other useful packages
5 Basic statistical analysis in R
©
G. Ritschard (2012), 1/76. Distributed under licence CC BY-NC-ND 3.0

About the R statistical environment

What is R?
R is a free software environment for statistical computing and graphics (http://www.r-project.org)
R is derived from the S language
R is free and open source
R is easily extensible with numerous contributed modules
R is available for Linux, MacOS X, Windows

Installation
R and the modules can be downloaded from the CRAN http://cran.r-project.org
By default, no GUI is proposed under Linux.
Under Windows and MacOSX, the basic GUI remains limited. ... but try
  Rcmdr (an R package)
  Deducer http://www.deducer.org
  RStudio http://www.rstudio.org
The increasing use of R
Source: (Muenchen, 2012) http://r4stats.com/articles/popularity/

R Packages on the CRAN
Source: (Muenchen, 2012) http://r4stats.com/articles/popularity/

A short introduction to R

Using RStudio

The RStudio Environment

The R console and the R script editor
The prompt ‘>’ indicates that R is waiting for commands.
Several ways of sending commands to R
Using scripts allows you to
  store, re-use or later modify your statistical analysis.
  share code with others.
R scripts are text files containing a series of R commands.
The usual extension for such files is ‘.R’
Comments: everything between ‘#’ and the end of line. It is strongly recommended to document scripts with comments!

Objects, Functions and Libraries

R objects
In R you handle objects that can be of many different types, and have different content types
An object is created with the ‘assign’ operator: ‘<-’
R> a <- 5
R> b <- "my object"
R> a
[1] 5
R> b
[1] "my object"
Object names are case sensitive (a ≠ A) and can be of arbitrary length
R> My.very.first.R.object <- 13
R> My.very.first.R.object
[1] 13

Operators
Arithmetic: + (addition), - (subtraction), * (multiplication), / (division), ^ (power)
Comparison: == (equality), != (different), > (greater), >= (greater or equal), < (less than), <= (less or equal)
Logical: & (and), | (or), ! (not)
Operations on objects
R> a/2
[1] 2.5
R> A <- 8
R> a == A
[1] FALSE

Functions
R> b
[1] "my object"
R> c <- paste(b, "is beautiful")
R> c
[1] "my object is beautiful"
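
As a minimal sketch of how the comparison and logical operators combine, the lines below apply them to a small vector. The objects x, in.range, outside, not.five and msg are illustrative names that do not come from the slides.

```r
# Illustrative objects (x and the result names are assumptions)
a <- 5
A <- 8
x <- c(2, 5, 8, 11)

# Comparison operators return logical values
a == A   # a and A are distinct objects because names are case sensitive
a != A

# Logical operators combine comparisons element by element
in.range <- x > 3 & x < 10     # AND
outside  <- x <= 3 | x >= 10   # OR
not.five <- !(x == 5)          # NOT

# paste() concatenates its arguments, separated by a space by default
msg <- paste("a is smaller than A:", a < A)
```

Logical vectors such as in.range can be used directly as indexes, which is how conditions are used to select elements of vectors later in the course.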

Function arguments
Argument names can be omitted as long as you respect their order. For clarity, we recommend giving them explicitly.
R> seq(from = 1, to = 10, by = 2)
[1] 1 3 5 7 9
R> seq(1, 10, 2)
[1] 1 3 5 7 9
Using argument names, you can pass them in any order
R> seq(by = 2, to = 10, from = 1)
[1] 1 3 5 7 9
R> seq(2, 10, 1)
[1]  2  3  4  5  6  7  8  9 10

Functions and libraries
Many standard statistical functions are available as core functions: descriptive statistics, regression, etc.
Additional, specialized functions are available through
  pre-installed libraries such as foreign for reading data from other statistical packages, survival for survival analysis, etc.
  add-on libraries, available from the Comprehensive R Archive Network (CRAN), for example TraMineR for sequence analysis.

Installing libraries and exploring their content

Installing packages
Use install.packages() to install a library from the CRAN.
> install.packages("TraMineR", dependencies=TRUE)
Some packages use functions of other packages which must also be installed. This is automatically done with dependencies=TRUE.
Installing from other sites than the CRAN:
> install.packages("TraMineRextras", repos="http://R-forge.R-project.org")
When available, you can also use the menu.

Help on functions and libraries
Access the functions provided by a library
R> library(TraMineR)
Get information on a library
> library(help = "TraMineR")
Help on a particular function
> library("foreign")
> help(read.spss)
Access the index of the functions provided by a package
> help(package="foreign")
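
A script that is shared with others may run on a machine where a package is missing. The sketch below checks for the package before loading it. The helper is.installed() is a hypothetical name introduced here and is not part of R or TraMineR; it only wraps the base function requireNamespace(). The installation itself is left as a message so that the script does not need network access.

```r
# Hypothetical helper (not from the slides): TRUE if the package can be loaded
is.installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Load TraMineR when available, otherwise remind the user how to install it
if (is.installed("TraMineR")) {
  library(TraMineR)
} else {
  message("TraMineR is missing; run install.packages(\"TraMineR\", dependencies = TRUE)")
}
```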

Vectors, Matrices and Data Frames

Higher dimensional objects - Vectors
Vectors are one-dimensional objects containing numeric or character values. The c() function combines values into a vector
R> v1 <- c(1, 2, 4, 8)
R> v1
[1] 1 2 4 8
R> v2 <- c("A", "B", "C", "D")
R> v2
[1] "A" "B" "C" "D"
Specific elements of vectors can be retrieved with indexes
R> v1[3]
[1] 4
R> v2[1:3]
[1] "A" "B" "C"
R> v2[c(1, 4)]
[1] "A" "D"

Indexing with logical expressions
One can use logical expressions to retrieve vector elements
R> v1[v1 >= 4]
[1] 4 8
R> v2[v2 %in% c("A", "C")]
[1] "A" "C"
Use which() to get the indexes of the elements that satisfy a given condition
R> which(v1 >= 4)
[1] 3 4
R> which(v2 %in% c("A", "C"))
[1] 1 3

Higher dimensional objects - Matrices
Matrices are two dimensional objects containing numeric or character values
R> m1 <- matrix(1:16, nrow = 4)
R> m1
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
Specific elements of matrices are retrieved with row and column indexes
Element in the second row, fourth column
R> m1[2, 4]
[1] 14
Whole fourth column (by omitting row index)
R> m1[, 4]
[1] 13 14 15 16

Higher dimensional objects - Data frames
Data frames combine columns (vectors) of any type: factors, numeric, character strings
R> data(iris)
R> iris[1:4, ]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
You can access a variable in a data frame by giving its name preceded by a ‘$’ instead of the column index
R> iris$Sepal.Width[1:4]
[1] 3.5 3.0 3.2 3.1
R> iris[1:4, 2]
[1] 3.5 3.0 3.2 3.1

Referencing subsets of objects: a summary
Vectors
  x[n]                          nth element
  x[-n]                         all but the nth element
  x[1:n]                        first n elements
  x[-(1:n)]                     elements from n+1 to the end
  x[c(1,4,2)]                   specific elements
  x["name"]                     element named "name"
  x[x > 3]                      all elements greater than 3
  x[x > 3 & x < 5]              all elements between 3 and 5
  x[x %in% c("a","and","the")]  elements in the given set
Matrices
  x[i,j]                        element at row i, column j
  x[i,]                         row i
  x[,j]                         column j
  x[,c(1,3)]                    columns 1 and 3
  x["name",]                    row named "name"
Data frames (same as matrix plus the following)
  x[["name"]]                   column named "name"
  x$name                        idem

Factors
A factor is a categorical variable, that is, a variable that takes (usually non-measurable) categorical values.
The Species variable in the iris data frame is a factor
R> class(iris$Species)
[1] "factor"
Possible categories of a factor are called levels
R> levels(iris$Species)
[1] "setosa"     "versicolor" "virginica"
You can change the labels of the levels (by respecting their order)
R> levels(iris$Species) <- c("Species 1", "Species 2", "Species 3")
R> head(iris$Species)
[1] Species 1 Species 1 Species 1 Species 1 Species 1 Species 1
Levels: Species 1 Species 2 Species 3
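
The subsetting rules and the relabeling of factor levels can be tried on small throwaway objects. The sketch below is only an illustration: the objects x, m and df and their values are assumptions. The relabeling is done on a copy of the built-in iris data so that the original data set keeps its level names.

```r
# Named vector (illustrative values)
x <- c(first = 2, second = 4, third = 6, fourth = 8)
x[2]                # second element
x[-1]               # all but the first element
x["third"]          # element named "third"
x[x > 3 & x < 7]    # elements strictly between 3 and 7

# Matrix with row and column names
m <- matrix(1:6, nrow = 2, dimnames = list(c("r1", "r2"), c("a", "b", "c")))
m[2, 3]             # element at row 2, column 3
m["r1", ]           # row named "r1"
m[, c(1, 3)]        # columns 1 and 3

# Data frame: work on a copy of the built-in iris data
df <- datasets::iris
df[["Species"]][1:3]   # column selected by name
df$Sepal.Width[1:3]    # same column mechanism with $

# Relabel the factor levels on the copy, respecting the order of the levels
levels(df$Species) <- c("Species 1", "Species 2", "Species 3")
```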

Methods

Objects and methods
There are many other types of objects in R
For example contingency tables or outputs of regression models are objects of a specific type
There are usually specific methods such as print(), plot() or summary() for each type of object
Entering just the name of an object displays its content through an automatic call of the associated print() method.
R> A
[1] 8
R> print(A)
[1] 8

Objects and methods - Example
We use the table() function to produce a contingency table
R> my.table <- table(...)
R> class(my.table)
[1] "table"
The dedicated plot() method produces the following figure
R> plot(my.table)
[Figure: plot of my.table, both variables with categories no and yes]
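
To see how the class of an object determines which method is called, the following sketch builds a contingency table and a regression model from the built-in mtcars data. The choice of mtcars and of the variables is an assumption made for illustration; the slides use their own table.

```r
# Two objects of different classes, built from the built-in mtcars data
tab <- table(mtcars$am, mtcars$cyl)   # contingency table
fit <- lm(mpg ~ wt, data = mtcars)    # linear regression

class(tab)            # "table"
class(fit)            # "lm"

# The same generic function dispatches to a class-specific method
tab.sum <- summary(tab)   # summary method for tables
fit.sum <- summary(fit)   # summary method for linear models
class(fit.sum)            # "summary.lm"

# Typing the name of an object is equivalent to calling print() on it
print(tab)
```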

Working environment

The workspace
The objects created during an R session are stored in the working environment, i.e., in memory.
The objects in the R environment are listed in the workspace panel of RStudio.
When you quit R, you are asked if you want to save your working environment in a ‘.RData’ file.

Loading and saving data
Objects can be saved using the save function. Several objects may be saved in the same file. The usual extension for R data files is ‘.RData’.
R> save(mvad, a, b, file = "myfile.RData")
Objects can be loaded using the load function.
R> load(file = "myfile.RData")
You can save all the objects in your environment with save.image.
R> save.image(file = "myenvironment.RData")
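
A quick way to convince yourself that save() and load() restore objects under their original names is a round trip through a temporary file. In this sketch the file is created with tempfile() instead of the "myfile.RData" of the slides, so that nothing is written to the working directory; rdata.file and loaded are illustrative names.

```r
a <- 5
b <- "my object"

# Save both objects in one temporary '.RData' file
rdata.file <- tempfile(fileext = ".RData")
save(a, b, file = rdata.file)

# Remove them from the workspace, then restore them from the file
rm(a, b)
loaded <- load(file = rdata.file)   # load() returns the names of the restored objects
loaded
```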

Working directory
R loads and saves files in the working directory.
Check your current working directory with getwd().
You should set a working directory using setwd("path").
On Windows, the full path should be specified using ‘/’ and not ‘\’.

Importing data and checking content of data frames

Importing data into R

Importing text files
Import ‘.csv’ (comma separated values) text files
R> my.data <- read.csv(...)
R> my.data
  Id Age    Sex
1  1  22   Male
2  2  18 Female
3  3  40   Male
4  4  27 Female
5  5  33 Female
Import tab separated text files, with read.table()
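
Since the text files used in the course are not available here, the sketch below first writes the five records shown above to temporary files and then reads them back, once as a comma separated file and once as a tab separated file. The file names come from tempfile(), and the use of stringsAsFactors = TRUE to get Sex as a factor is a choice made for this illustration.

```r
# The five records displayed in the slide
my.data <- data.frame(Id = 1:5,
                      Age = c(22, 18, 40, 27, 33),
                      Sex = c("Male", "Female", "Male", "Female", "Female"))

# Write them as comma separated and tab separated text files
csv.file <- tempfile(fileext = ".csv")
tab.file <- tempfile(fileext = ".txt")
write.csv(my.data, csv.file, row.names = FALSE)
write.table(my.data, tab.file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read them back; header = TRUE tells read.table() that line 1 holds the names
from.csv <- read.csv(csv.file, stringsAsFactors = TRUE)
from.tab <- read.table(tab.file, header = TRUE, sep = "\t", stringsAsFactors = TRUE)
from.csv
```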

Importing from other statistical packages
Import SPSS ‘.sav’ files, with read.spss()
R> titanic <- read.spss(...)
R> head(titanic)
  ID CLASS   AGE  SEX LIVING
1  1    c1 adult Male    yes
2  2    c1 adult Male    yes
3  3    c1 adult Male    yes
4  4    c1 adult Male    yes
5  5    c1 adult Male    yes
6  6    c1 adult Male    yes
Import Stata ‘.dta’ files, with read.dta()

Exploring data frames

The mvad data frame
The mvad data set is included in the TraMineR library (with permission of the authors). We load it with
R> data(mvad)
The mvad object is of type ‘data frame’. It contains data of different types (numeric values, factors)
R> class(mvad)
[1] "data.frame"
It contains 712 rows and 86 variables
R> dim(mvad)
[1] 712  86
R> nrow(mvad)
[1] 712
R> ncol(mvad)
[1] 86

The data editor
The View() function opens a simple data editor
> View(mvad)

Variable names
List the variables in the ‘data frame’
R> names(mvad)
 [1] "id"        "weight"    "male"      "catholic"  "Belfast"   "N.Eastern"
 [7] "Southern"  "S.Eastern" "Western"   "Grammar"   "funemp"    "gcse5eq"
[13] "fmpr"      "livboth"   "Jul.93"    "Aug.93"    "Sep.93"    "Oct.93"
[19] "Nov.93"    "Dec.93"    "Jan.94"    "Feb.94"    "Mar.94"    "Apr.94"
[25] "May.94"    "Jun.94"    "Jul.94"    "Aug.94"    "Sep.94"    "Oct.94"
[31] "Nov.94"    "Dec.94"    "Jan.95"    "Feb.95"    "Mar.95"    "Apr.95"
[37] "May.95"    "Jun.95"    "Jul.95"    "Aug.95"    "Sep.95"    "Oct.95"
[43] "Nov.95"    "Dec.95"    "Jan.96"    "Feb.96"    "Mar.96"    "Apr.96"
[49] "May.96"    "Jun.96"    "Jul.96"    "Aug.96"    "Sep.96"    "Oct.96"
[55] "Nov.96"    "Dec.96"    "Jan.97"    "Feb.97"    "Mar.97"    "Apr.97"
[61] "May.97"    "Jun.97"    "Jul.97"    "Aug.97"    "Sep.97"    "Oct.97"
[67] "Nov.97"    "Dec.97"    "Jan.98"    "Feb.98"    "Mar.98"    "Apr.98"
[73] "May.98"    "Jun.98"    "Jul.98"    "Aug.98"    "Sep.98"    "Oct.98"
[79] "Nov.98"    "Dec.98"    "Jan.99"    "Feb.99"    "Mar.99"    "Apr.99"
[85] "May.99"    "Jun.99"
Access the description of the data set and its variables
> help(mvad)

Summary
We get a summary for the first five variables with
R> summary(mvad[, 1:5])
       id          weight       male     catholic  Belfast
 Min.   :  1   Min.   :0.130   no :342   no :368   no :624
 1st Qu.:179   1st Qu.:0.450   yes:370   yes:344   yes: 88
 Median :356   Median :0.690
 Mean   :356   Mean   :0.999
 3rd Qu.:534   3rd Qu.:1.070
 Max.   :712   Max.   :4.460
The weight variable is numeric while catholic is a factor
R> levels(mvad$catholic)
[1] "no"  "yes"

Frequency tables
The successive states forming the sequences are in variables Jul.93 ... Jun.99, that is in columns 15 to 86. Here are the data for the first 6 months and first 4 records.
R> mvad[1:4, 15:20]
       Jul.93      Aug.93     Sep.93     Oct.93     Nov.93     Dec.93
1    training    training employment employment employment employment
2 joblessness joblessness         FE         FE         FE         FE
3 joblessness joblessness   training   training   training   training
4    training    training   training   training   training   training
Frequency table of the gcse5eq variable (qualifications gained by the end of compulsory education)
R> table(mvad$gcse5eq)
 no yes
452 260

Contingency tables
Cross tabulate variables funemp (father unemployed) and gcse5eq (qualification gained at the end of compulsory school)
Assign more informative value labels to the two factors (both are dummy variables with ‘yes’/‘no’ labels in the original file)
R> levels(mvad$funemp) <- c("employed", "unemployed")
R> levels(mvad$gcse5eq) <- c("Lower qual.", "Higher qual.")
R> ct1 <- table(mvad$funemp, mvad$gcse5eq)

Row and column distributions
R> prop.table(ct1, 1)
             Lower qual. Higher qual.
  employed        0.6084       0.3916
  unemployed      0.7692       0.2308
R> prop.table(ct1, 2)

We perform a Chi-squared independence test using the chisq.test() function
R> chisq.test(ct1)
Pearson's Chi-squared test with Yates' continuity correction
data: ct1
X-squared = 10.23, df = 1, p-value = 0.001384

R> hist(mvad$weight, col = "cyan")
[Figure: "Histogram of mvad$weight"; x-axis mvad$weight, y-axis Frequency]
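
The same sequence of steps (cross tabulation, row and column distributions, independence test) can be run without TraMineR on a built-in table. The sketch below uses the Titanic data shipped with R, collapsed to a 2 x 2 table of Sex by Survived; this data set and the names ct, row.dist, col.dist and test are assumptions made for illustration and are unrelated to the mvad results above.

```r
# Titanic is a 4-way table (Class, Sex, Age, Survived); keep Sex and Survived
ct <- margin.table(Titanic, c(2, 4))
ct

# Row distributions: each row sums to 1
row.dist <- prop.table(ct, 1)
# Column distributions: each column sums to 1
col.dist <- prop.table(ct, 2)

# Chi-squared independence test; for a 2 x 2 table the Yates
# continuity correction is applied by default
test <- chisq.test(ct)
test$statistic
test$p.value
```

The second argument of prop.table() is the margin over which proportions are computed: 1 for rows, 2 for columns.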

Barplot
Applying plot() on factors (categorical variables) generates a bar plot
R> plot(mvad$gcse5eq, col = c("red", "green"), main = "Variable gcse5eq")
[Figure: bar plot "Variable gcse5eq" with bars for Lower qual. and Higher qual.]

XY scatterplot
Applying plot() on two numerical variables generates a scatterplot
R> plot(iris$Sepal.Length, iris$Sepal.Width, col = "red")
[Figure: scatterplot of iris$Sepal.Width against iris$Sepal.Length]

Boxplots
The boxplot() function accepts a formula as argument to produce a boxplot for each category of a factor
R> boxplot(iris$Sepal.Length ~ iris$Species, col = "cyan", main = "Sepal length, by species")
[Figure: boxplots "Sepal length, by species" for Species 1, Species 2 and Species 3]

Saving graphics
To save graphics in files, depending on the format you can use the pdf(), jpeg() or png() function with the name of the file as argument
Once you have issued all plotting commands you have to close the file with the dev.off() function
R> pdf(file = "hist")
R> plot(mvad$Sep.93, mvad$Sep.94)
R> dev.off()
pdf
  2
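
A minimal sketch of the open, plot, close cycle for saving graphics, using the built-in iris data and a temporary file so that it runs anywhere. The name out.file is illustrative, and giving the file a ‘.pdf’ extension is a suggestion of this example rather than something the slides require.

```r
# Open a PDF device; every plot issued from now on goes to this file
out.file <- tempfile(fileext = ".pdf")
pdf(file = out.file)

# Each high-level plotting command starts a new page in the PDF
plot(iris$Sepal.Length, iris$Sepal.Width, col = "red")
boxplot(Sepal.Length ~ Species, data = iris, col = "cyan",
        main = "Sepal length, by species")

# Close the device; the file is only complete after this call
invisible(dev.off())
file.exists(out.file)
```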

Some useful functions - A
Add a new variable to the mvad data frame
R> mvad$weight100 <- mvad$weight * 100
R> head(mvad$weight100)
Some basic statistical functions
R> mean(mvad$weight100)
[1] 99.94
R> min(mvad$weight100)
[1] 13
R> max(mvad$weight100)
[1] 446

seq(from,to)
R> my.seq <- ...
R> head(my.seq)
[1] 1 2 3 4 5 6
R> my.groups <- ...
R> table(my.groups)
my.groups
  0-19  20-39  40-59 60-100
    19     20     20     40
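
The commands that created my.seq and my.groups are not recoverable from the slides. As a hedged illustration, the sketch below shows one way to obtain groups with the labels 0-19, 20-39, 40-59 and 60-100 and the counts 19, 20, 20 and 40, using seq() and the base function cut(). The break points, the labels argument and right = FALSE are assumptions of this example.

```r
# Assumed input: the integers from 1 to 100
my.seq <- seq(1, 100)
head(my.seq)

# Assumed break points and labels
breaks <- c(0, 20, 40, 60, 100)
labels <- c("0-19", "20-39", "40-59", "60-100")

# right = FALSE makes intervals closed on the left: [0,20), [20,40), ...
my.groups <- cut(my.seq, breaks = breaks, labels = labels, right = FALSE)
table(my.groups)

# With these settings the value 100 falls outside [60,100) and becomes NA
sum(is.na(my.groups))

# include.lowest = TRUE closes the last interval, so 100 is kept
my.groups.all <- cut(my.seq, breaks = breaks, labels = labels,
                     right = FALSE, include.lowest = TRUE)
table(my.groups.all)
```

Under these assumptions the counts sum to 99 rather than 100 because the upper bound is excluded; whether the original code behaved this way cannot be confirmed from the slides.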

TraMineR and other useful packages

TraMineR is available from the CRAN http://cran.r-project.org.
It is just one of more than 3500 packages on the CRAN, and there are many more on other repositories such as http://R-forge.R-project.org and http://www.bioconductor.org.
Whatever you want to do, there most probably exists a package which does it: Just Google for R + what you are interested in.

Suggested packages
From the TraMineR team on R-Forge
  TraMineR development version (Gabadinho et al., 2011, 2009)
  TraMineRextras: ancillary functions to be used with TraMineR (Ritschard et al., 2012)
  PST: Probabilistic suffix trees (Gabadinho and Ritschard, 2012)
  WeightedCluster: clustering and measures of cluster quality (Studer, 2012a,b)
Some other packages for clustering analysis (CRAN):
  cluster (Kaufman and Rousseeuw, 2005; Maechler et al., 2005)
  fastcluster (Müllner, 2012)
  flashClust (Langfelder and Horvath, 2012)

Basic statistical analysis in R

Linear Regression

Statistical modeling: Regression
We use the mvad data of TraMineR
Regression of longitudinal entropies on male, catholic, ...

Loading the data
R> mvad.lab <- ...
 [>] sequence object created with TraMineR version 1.9-2
 [>] 712 sequences in the data set, 490 unique
 [>] min/max sequence length: 70/70
 [>] alphabet (state labels):
     1=EM (employment)
     2=FE (FE)
     3=HE (HE)
     4=JL (joblessness)
     5=SC (school)
     6=TR (training)
 [>] dimensionality of the sequence space: 350
 [>] colors: 1=#7FC97F 2=#BEAED4 3=#FDC086 4=#FFFF99 5=#386CB0 6=#F0027F

R> lm.entrop <- lm(entrop ~ male + catholic + gcse5eq, data = mvad)
R> summary(lm.entrop)
Call:
lm(formula = entrop ~ male + catholic + gcse5eq, data = mvad)

Residuals:
    Min      1Q  Median      3Q     Max
-0.4713 -0.0951  0.0000  0.1286  0.4479

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)           0.3892     0.0128   30.48  < 2e-16 ***
maleyes              -0.0418     0.0133   -3.14   0.0017 **
catholicyes           0.0177     0.0131    1.35   0.1764
gcse5eqHigher qual.   0.0645     0.0138    4.68  3.4e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.174 on 708 degrees of freedom
Multiple R-squared: 0.0538, Adjusted R-squared: 0.0497
F-statistic: 13.4 on 3 and 708 DF, p-value: 1.61e-08

[Figure: Normal Q-Q plot, standardized residuals against theoretical quantiles, for lm(entrop ~ male + catholic + gcse5eq); observations 193, 310 and 421 are labelled]
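
The pieces of a fitted regression (coefficient table, R-squared, diagnostic plot) can be extracted programmatically instead of being read from the printed summary. Since the entropy variable of the slides requires TraMineR, the sketch below fits an unrelated model on a copy of the built-in iris data; the model formula and the names dat, fit, fit.sum and coef.table are assumptions chosen only for illustration.

```r
# Fresh copy of the built-in data (unaffected by any relabeling done earlier)
dat <- datasets::iris

# Regression of a numeric response on a factor and a numeric covariate
fit <- lm(Sepal.Length ~ Species + Petal.Width, data = dat)
fit.sum <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef.table <- fit.sum$coefficients
coef.table

# Goodness of fit
fit.sum$r.squared
fit.sum$adj.r.squared

# Normal Q-Q plot of the standardized residuals (second diagnostic plot)
plot(fit, which = 2)
```

The factor Species enters the model as dummy variables, one per level except the reference level, in the same way as male, catholic and gcse5eq in the mvad regression.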

Logistic regression

Logistic regression I
Logistic regression: specific case of the generalized linear model glm() with family = binomial
R> lg.gr <- glm(gcse5eq ~ male + catholic, family = binomial, data = mvad)
R> summary(lg.gr)
Call:
glm(formula = gcse5eq ~ male + catholic, family = binomial, data = mvad)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.129  -1.082  -0.793   1.276   1.619

Logistic regression II
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.228      0.131   -1.74    0.082 .
maleyes       -0.768      0.159   -4.83  1.3e-06 ***
catholicyes    0.112      0.159    0.71    0.478
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 934.62 on 711 degrees of freedom
Residual deviance: 910.51 on 709 degrees of freedom
AIC: 916.5

Number of Fisher Scoring iterations: 4

Computing the ‘odds ratios’
Retrieve coefficients and compute their exp()
R> exp(lg.gr$coefficients)
(Intercept)     maleyes catholicyes
     0.7959      0.4641      1.1190
Completing the table of coefficients, standard errors and significance with exp(β)
R> lg.gr.coeff <- ...
R> lg.gr.coeff <- ...
R> lg.gr.coeff
            Estimate Std. Error z value  Pr(>|z|) Exp Estim.
(Intercept)  -0.2283     0.1315  -1.736 8.250e-02     0.7959
maleyes      -0.7677     0.1588  -4.833 1.343e-06     0.4641
catholicyes   0.1124     0.1586   0.709 4.783e-01     1.1190

Defining a custom function
R> discretize <- ...
R> discretize(0.33)
[1] 1
R> table(apply(entrop, 1, discretize))
  1   2   3
385 243  84
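
The body of discretize() is not recoverable from the slides. The sketch below shows what such a function could look like: it maps a numeric value to one of three classes. The cut points 0.4 and 0.6 and the arguments low and high are purely hypothetical; they were chosen only so that discretize(0.33) falls in class 1. The sample values are also made up.

```r
# Hypothetical cut points; the thresholds used in the course are unknown
discretize <- function(x, low = 0.4, high = 0.6) {
  if (x < low) {
    1
  } else if (x < high) {
    2
  } else {
    3
  }
}

discretize(0.33)

# Apply the function to each element of a vector (made-up values)
values <- c(0.10, 0.45, 0.90, 0.33)
groups <- sapply(values, discretize)
table(groups)

# For a one-column matrix, apply() over the rows gives the same result;
# the slides call apply(entrop, 1, discretize) in this way
m <- matrix(values, ncol = 1)
groups.m <- apply(m, 1, discretize)
```

Because the function uses if and else, it handles one value at a time, which is why it is called through sapply() or apply() rather than directly on a whole vector.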

Thank you! See you next week.

References
Gabadinho, A. and G. Ritschard (2012). PST: Probabilistic Suffix Trees. R package version 0.66/r157.
Gabadinho, A., G. Ritschard, N. S. Müller, and M. Studer (2011). Analyzing and visualizing state sequences in R with TraMineR. Journal of Statistical Software 40 (4), 1–37.
Gabadinho, A., G. Ritschard, M. Studer, and N. S. Müller (2009). Mining sequence data in R with the TraMineR package: A user’s guide. Technical report, Department of Econometrics and Laboratory of Demography, University of Geneva, Geneva.
Kaufman, L. and P. J. Rousseeuw (2005). Finding Groups in Data. Hoboken: John Wiley & Sons.
Langfelder, P. and S. Horvath (2012). Fast R functions for robust correlations and hierarchical clustering. Journal of Statistical Software 46 (11), 1–17.
Maechler, M., P. Rousseeuw, A. Struyf, and M. Hubert (2005). Package ‘cluster’: Cluster analysis basics and extensions. Reference manual, R-project, CRAN.
Maindonald, J. and J. Braun (2010). Data Analysis and Graphics Using R: An Example-based Approach (3rd ed.). Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press.
Maindonald, J. H. (2008). Using R for data analysis and graphics: Introduction, code and commentary. Manual, Centre for Mathematics and Its Applications, Australian National University.
Muenchen, R. A. (2012). The popularity of data analysis software. Online at r4stat.
Müllner, D. (2012). fastcluster: Fast hierarchical clustering routines for R and Python. Version 1.1.6.
Paradis, E. (2006). R for beginners. Manual, Institut des Sciences de l’Évolution, Université Montpellier II.
Ritschard, G., R. Bürgin, M. Studer, and N. Müller (2012). TraMineRextras: Extras for use with the TraMineR package. R package version 0.1-111/r226.
Studer, M. (2012a). Étude des inégalités de genre en début de carrière académique à l’aide de méthodes innovatrices d’analyse de données séquentielles, Volume SES-777 of Collection des thèses. Université de Genève, Faculté des sciences économiques et sociales.
Studer, M. (2012b). WeightedCluster: Clustering of Weighted Data. R package version 0.9.
Venables, W. N., D. M. Smith, and the R Development Core Team (2011). An introduction to R. Manual, The R-project.