Outline. Installation


Author: Posy Horton


Sequential Data Analysis

Starting with R and TraMineR

Alexis Gabadinho, Matthias Studer

Gilbert Ritschard

Institute for Demographic and Life Course Studies, University of Geneva and NCCR LIVES: Overcoming vulnerability, life course perspectives

http://mephisto.unige.ch/traminer

September - November, 2012

Outline

1 About the R statistical environment
2 A short introduction to R
3 Importing data and checking content of data frames
4 TraMineR and other useful packages
5 Basic statistical analysis in R

© G. Ritschard (2012). Distributed under licence CC BY-NC-ND 3.0

About the R statistical environment

What is R?

R is a free software environment for statistical computing and graphics (http://www.r-project.org)

R is derived from the S language

R is free and open source

R is easily extensible with numerous contributed modules

R is available for Linux, MacOS X, Windows

Installation

R and the modules can be downloaded from the CRAN http://cran.r-project.org

By default, no GUI is provided under Linux.

Under Windows and MacOSX, the basic GUI remains limited. ... but try

Rcmdr (an R package)
Deducer http://www.deducer.org
RStudio http://www.rstudio.org

The increasing use of R

Source: (Muenchen, 2012) http://r4stats.com/articles/popularity/

R Packages on the CRAN

Source: (Muenchen, 2012) http://r4stats.com/articles/popularity/

A short introduction to R - Using RStudio

The RStudio Environment

The R console and the R script editor

The prompt ‘>’ indicates that R is waiting for commands.

There are several ways of sending commands to R.

Using scripts allows you to

store, re-use or later modify your statistical analysis.
share code with others.

R scripts are text files containing a series of R commands.

The usual extension for such files is ‘.R’

Comments: everything between ‘#’ and the end of line. Strongly recommended to document scripts with comments!

A short introduction to R - Objects, Functions and Libraries

R objects

In R you handle objects that can be of many different types, and have different content types

An object is created with the ‘assign’ operator: ‘<-’

R> a <- 5
R> b <- "my object"
R> a
[1] 5
R> b
[1] "my object"

Object names are case sensitive (a ≠ A) and can be of arbitrary length

R> My.very.first.R.object <- 13
R> My.very.first.R.object
[1] 13

Operators

Arithmetic: + (addition), - (subtraction), * (multiplication), / (division), ^ (power)
Comparison: == (equality), != (different), > (greater), >= (greater or equal), < (less than), <= (less or equal)
Logical: & (and), | (or), ! (not)

Operations on objects

R> a/2
[1] 2.5
R> A <- 8
R> a == A
[1] FALSE

Functions

R> b
[1] "my object"
R> c <- paste(b, "is beautiful")
R> c
[1] "my object is beautiful"
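The assignment and operator examples above can be collected into a short script. This is a minimal sketch: the values of a, A and b come from the slides, while the names half_a, same, in_range and sentence are hypothetical and only serve the illustration.

```r
# Assignment with <- ; object names are case sensitive, so a and A differ
a <- 5
A <- 8
b <- "my object"

half_a   <- a / 2                     # arithmetic operator
same     <- a == A                    # comparison operator
in_range <- a > 3 & a <= 5            # two comparisons joined by a logical operator
sentence <- paste(b, "is beautiful")  # paste() joins strings with a space

print(half_a)
print(same)
print(in_range)
print(sentence)
```

Saving these lines in a file with the ‘.R’ extension makes the analysis reusable, as recommended above.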

Function arguments

Argument names can be omitted as long as you respect their order. For clarity we recommend giving them explicitly.

R> seq(from = 1, to = 10, by = 2)
[1] 1 3 5 7 9
R> seq(1, 10, 2)
[1] 1 3 5 7 9

Using argument names, you can pass them in any order

R> seq(by = 2, to = 10, from = 1)
[1] 1 3 5 7 9
R> seq(2, 10, 1)
[1]  2  3  4  5  6  7  8  9 10

Functions and libraries

Many standard statistical functions are available as core functions: descriptive statistics, regression, etc ...

Additional, specialized functions are available through

pre-installed libraries such as foreign for reading data from other statistical packages, survival for survival analysis, etc ...
add-on libraries, available from the Comprehensive R Archive Network (CRAN), for example TraMineR for sequence analysis.
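The same matching rules apply to functions you write yourself. The sketch below uses a hypothetical function power(), which is not part of the course material, to show positional matching, named matching and a default value.

```r
# Hypothetical function with two arguments; exponent has a default value
power <- function(base, exponent = 2) {
  base^exponent
}

by_position <- power(3, 2)                 # arguments matched by order: 3^2
by_name     <- power(exponent = 2, base = 3)  # matched by name, order is free
swapped     <- power(2, 3)                 # positions swapped: 2^3
with_default <- power(3)                   # exponent falls back to its default

print(c(by_position, by_name, swapped, with_default))
```

Naming the arguments costs a few keystrokes but protects against the silent mistake illustrated by swapped.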

A short introduction to R - Installing libraries and exploring their content

Installing packages

Use install.packages() to install a library from the CRAN.

R> install.packages("TraMineR", dependencies=TRUE)

Some packages use functions of other packages, which must also be installed. This is done automatically with dependencies=TRUE.

Installing from other sites than the CRAN:

R> install.packages("TraMineRextras", repos="http://R-forge.R-project.org")

When available, you can also use the menu.

Help on functions and libraries

Access the functions provided by a library

R> library(TraMineR)

Get information on a library

R> library(help = "TraMineR")

Help on a particular function

R> library("foreign")
R> help(read.spss)

Access the index of the functions provided by a package

R> help(package="foreign")
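In a script it can be convenient to install a package only when it is missing. The following is a minimal sketch under that assumption; missing_packages() is a hypothetical helper, not a function of R or TraMineR, and the actual install call is left commented out so that the script has no side effect. Note that requireNamespace() loads the namespace of each package it finds.

```r
# Hypothetical helper: return the names of the packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("stats", "TraMineR")
to_install <- missing_packages(wanted)

if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install, dependencies = TRUE)  # uncomment to install
}
```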

A short introduction to R - Vectors, Matrices and Data Frames

Higher dimensional objects - Vectors

Vectors are one-dimensional objects containing numeric or character values. The c() function combines values into a vector

R> v1 <- c(1, 2, 4, 8)
R> v1
[1] 1 2 4 8
R> v2 <- c("A", "B", "C", "D")
R> v2
[1] "A" "B" "C" "D"

Specific elements of vectors can be retrieved with indexes

R> v1[3]
[1] 4
R> v2[1:3]
[1] "A" "B" "C"
R> v2[c(1, 4)]
[1] "A" "D"

Indexing with logical expressions

One can use logical expressions to retrieve vector elements

R> v1[v1 >= 4]
[1] 4 8
R> v2[v2 %in% c("A", "C")]
[1] "A" "C"

Use which() to get the indexes of the elements that satisfy a given condition

R> which(v1 >= 4)
[1] 3 4
R> which(v2 %in% c("A", "C"))
[1] 1 3
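Logical indexing works in two steps, which the sketch below makes explicit: the condition first yields a vector of TRUE/FALSE values, and that vector then selects the elements. The names keep, selected and positions are hypothetical.

```r
v1 <- c(1, 2, 4, 8)

keep      <- v1 >= 4      # one TRUE/FALSE value per element of v1
selected  <- v1[keep]     # keeps the elements where keep is TRUE
positions <- which(keep)  # indexes of the TRUE values

print(keep)
print(selected)
print(positions)
```

Indexing with positions gives the same elements as indexing with keep.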

Higher dimensional objects - Matrices

Matrices are two dimensional objects containing numeric or character values

R> m1 <- matrix(1:16, nrow = 4)
R> m1
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Specific elements of matrices are retrieved with row and column indexes

Element in the second row, fourth column

R> m1[2, 4]
[1] 14

Whole fourth column (by omitting row index)

R> m1[, 4]
[1] 13 14 15 16

Higher dimensional objects - Data frames

Data frames combine columns (vectors) of any type: factors, numeric, character strings

R> data(iris)
R> iris[1:4, ]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa

You can access a variable in a data frame by giving its name preceded by a ‘$’ instead of the column index

R> iris$Sepal.Width[1:4]
[1] 3.5 3.0 3.2 3.1
R> iris[1:4, 2]
[1] 3.5 3.0 3.2 3.1
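A data frame column can be reached in several equivalent ways. The sketch below, using the iris data that ships with R, checks that they return the same values; the object names by_dollar, by_index, by_label and by_double are hypothetical.

```r
data(iris)

by_dollar <- iris$Sepal.Width[1:4]        # name preceded by $
by_index  <- iris[1:4, 2]                 # row and column indexes
by_label  <- iris[1:4, "Sepal.Width"]     # column selected by its name
by_double <- iris[["Sepal.Width"]][1:4]   # double brackets with the name

print(by_dollar)
print(identical(by_dollar, by_index))
```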

Referencing subsets of objects: a summary

Vectors

x[n]                          nth element
x[-n]                         all but the nth element
x[1:n]                        first n elements
x[-(1:n)]                     elements from n+1 to the end
x[c(1,4,2)]                   specific elements
x["name"]                     element named "name"
x[x > 3]                      all elements greater than 3
x[x > 3 & x < 5]              all elements between 3 and 5
x[x %in% c("a","and","the")]  elements in the given set

Matrices

x[i,j]       element at row i, column j
x[i,]        row i
x[,j]        column j
x[,c(1,3)]   columns 1 and 3
x["name",]   row named "name"

Data frames (same as matrix plus the following)

x[["name"]]  column named "name"
x$name       idem

Factors

A factor is a categorical variable, that is, a variable that takes (usually non-measurable) categorical values.

The Species variable in the iris data frame is a factor

R> class(iris$Species)
[1] "factor"

Possible categories of a factor are called levels

R> levels(iris$Species)
[1] "setosa"     "versicolor" "virginica"

You can change the labels of the levels (by respecting their order)

R> levels(iris$Species) <- c("Species 1", "Species 2", "Species 3")
R> head(iris$Species)
[1] Species 1 Species 1 Species 1 Species 1 Species 1 Species 1
Levels: Species 1 Species 2 Species 3
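Relabelling levels changes the data frame in place. A cautious variant, sketched here as an assumption about good practice rather than as the course's method, relabels a copy of the factor so that the original iris data stay untouched; the name species is hypothetical.

```r
species <- iris$Species   # work on a copy of the factor

# New labels are assigned in the order of the existing levels
levels(species) <- c("Species 1", "Species 2", "Species 3")

counts <- table(species)  # frequencies under the new labels
print(counts)
print(levels(iris$Species))  # the original labels are unchanged
```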

A short introduction to R - Methods

Objects and methods

There are many other types of objects in R

For example contingency tables or outputs of regression models are objects of a specific type

There are usually specific methods such as summary(), print(), plot() for each type of object

Entering just the name of an object displays its content through an automatic call of the associated print() method.

R> A
[1] 8
R> print(A)
[1] 8

Objects and methods - Example

We use the table() function to produce a contingency table

R> my.table <- table(...)
R> class(my.table)
[1] "table"

The dedicated plot() method produces the following figure

R> plot(my.table)

[Figure: plot of my.table, with categories "no" and "yes" on both axes]
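The way a generic function adapts to the type of its argument can be seen by calling summary() on two objects of different classes. This is a minimal sketch on the built-in iris data; the names tab, fit, s_tab and s_fit are hypothetical.

```r
tab <- table(iris$Species)                          # a contingency table
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)  # a regression output

print(class(tab))
print(class(fit))

# The same generic call is routed to a different method for each class
s_tab <- summary(tab)
s_fit <- summary(fit)

print(class(s_tab))
print(class(s_fit))
```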

A short introduction to R - Working environment

The workspace

The objects created during an R session are stored in the working environment, i.e., in the memory.

The objects in the R environment are listed in the workspace panel of RStudio.

When you quit R, you are asked if you want to save your working environment in a ‘.RData’ file.

Loading and saving data

Objects can be saved using the save function. Several objects may be saved in the same file. The usual extension for R data files is ‘.RData’.

R> save(mvad, a, b, file = "myfile.RData")

Objects can be loaded using the load function.

R> load(file = "myfile.RData")

You can save all the objects in your environment with save.image.

R> save.image(file = "myenvironment.RData")
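A save and load round trip can be sketched as follows. The example writes to a temporary file, an assumption made so that it leaves nothing behind in the working directory; rdata_file and restored are hypothetical names.

```r
a <- 5
b <- "my object"

rdata_file <- tempfile(fileext = ".RData")  # temporary file, assumed for the demo
save(a, b, file = rdata_file)               # several objects in the same file

rm(a, b)                                    # remove the objects from memory
restored <- load(file = rdata_file)         # load() returns the restored names

print(restored)
print(a)
print(b)
```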

Working directory

R loads and saves files in the working directory.

Check your current working directory with getwd().

You should set a working directory using setwd("path").

On Windows, the full path should be specified using ‘/’ and not ‘\’.

Importing data and checking content of data frames - Importing data into R

Importing text files

Import ‘.csv’ (comma separated values) text files

R> my.data <- read.csv(...)
R> my.data
  Id Age    Sex
1  1  22   Male
2  2  18 Female
3  3  40   Male
4  4  27 Female
5  5  33 Female

Import tab separated text files, with read.table()
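Since the file used on the slide is not available, the sketch below first writes the five records shown above to temporary files and then reads them back. The file locations and the names people, csv_file, tab_file, from_csv and from_tab are assumptions made for the illustration.

```r
people <- data.frame(
  Id  = 1:5,
  Age = c(22, 18, 40, 27, 33),
  Sex = c("Male", "Female", "Male", "Female", "Female")
)

csv_file <- tempfile(fileext = ".csv")
tab_file <- tempfile(fileext = ".txt")

# Write the same records once comma separated and once tab separated
write.csv(people, csv_file, row.names = FALSE)
write.table(people, tab_file, sep = "\t", row.names = FALSE, quote = FALSE)

from_csv <- read.csv(csv_file, header = TRUE)
from_tab <- read.table(tab_file, header = TRUE, sep = "\t")

print(from_csv)
```

Both calls return a data frame with the same content, whatever the separator used in the file.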

Importing from other statistical packages

Import SPSS ‘.sav’ files, with read.spss()

R> titanic <- read.spss(...)
R> head(titanic)
  ID CLASS   AGE  SEX LIVING
1  1    c1 adult Male    yes
2  2    c1 adult Male    yes
3  3    c1 adult Male    yes
4  4    c1 adult Male    yes
5  5    c1 adult Male    yes
6  6    c1 adult Male    yes

Import Stata ‘.dta’ files, with read.dta()

Importing data and checking content of data frames - Exploring data frames

The mvad data frame

The mvad data set is included in the TraMineR library (with permission of the authors). We load it with

R> data(mvad)

The mvad object is of type ‘data frame’. It contains data of different types (numeric values, factors)

R> class(mvad)
[1] "data.frame"

It contains 712 rows and 86 variables

R> dim(mvad)
[1] 712  86
R> nrow(mvad)
[1] 712
R> ncol(mvad)
[1] 86

The data editor

The View() function opens a simple data editor

R> View(mvad)

Variable names

List the variables in the ‘data frame’

R> names(mvad)
 [1] "id"        "weight"    "male"      "catholic"  "Belfast"   "N.Eastern"
 [7] "Southern"  "S.Eastern" "Western"   "Grammar"   "funemp"    "gcse5eq"
[13] "fmpr"      "livboth"   "Jul.93"    "Aug.93"    "Sep.93"    "Oct.93"
[19] "Nov.93"    "Dec.93"    "Jan.94"    "Feb.94"    "Mar.94"    "Apr.94"
[25] "May.94"    "Jun.94"    "Jul.94"    "Aug.94"    "Sep.94"    "Oct.94"
[31] "Nov.94"    "Dec.94"    "Jan.95"    "Feb.95"    "Mar.95"    "Apr.95"
[37] "May.95"    "Jun.95"    "Jul.95"    "Aug.95"    "Sep.95"    "Oct.95"
[43] "Nov.95"    "Dec.95"    "Jan.96"    "Feb.96"    "Mar.96"    "Apr.96"
[49] "May.96"    "Jun.96"    "Jul.96"    "Aug.96"    "Sep.96"    "Oct.96"
[55] "Nov.96"    "Dec.96"    "Jan.97"    "Feb.97"    "Mar.97"    "Apr.97"
[61] "May.97"    "Jun.97"    "Jul.97"    "Aug.97"    "Sep.97"    "Oct.97"
[67] "Nov.97"    "Dec.97"    "Jan.98"    "Feb.98"    "Mar.98"    "Apr.98"
[73] "May.98"    "Jun.98"    "Jul.98"    "Aug.98"    "Sep.98"    "Oct.98"
[79] "Nov.98"    "Dec.98"    "Jan.99"    "Feb.99"    "Mar.99"    "Apr.99"
[85] "May.99"    "Jun.99"

Access the description of the data set and its variables

R> help(mvad)
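The same checks can be tried without installing TraMineR. The sketch below applies them to the iris data that ships with R, an assumption made only so that the example runs on any installation; size and first_rows are hypothetical names.

```r
data(iris)

print(class(iris))    # type of the object
size <- dim(iris)     # number of rows, then number of columns
print(size)
print(nrow(iris))
print(ncol(iris))
print(names(iris))    # variable names

first_rows <- head(iris, 4)  # first four records
print(first_rows)
```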

Summary

We get a summary for the first five variables with

R> summary(mvad[, 1:5])
       id          weight       male     catholic  Belfast
 Min.   :  1   Min.   :0.130   no :342   no :368   no :624
 1st Qu.:179   1st Qu.:0.450   yes:370   yes:344   yes: 88
 Median :356   Median :0.690
 Mean   :356   Mean   :0.999
 3rd Qu.:534   3rd Qu.:1.070
 Max.   :712   Max.   :4.460

The weight variable is numeric while catholic is a factor

R> levels(mvad$catholic)
[1] "no"  "yes"

Frequency tables

The successive states forming the sequences are in variables Jul.93 ... Jun.99, that is in columns 15 to 86. Here are the data for the first 6 months and first 4 records.

R> mvad[1:4, 15:20]
       Jul.93      Aug.93     Sep.93     Oct.93     Nov.93     Dec.93
1    training    training employment employment employment employment
2 joblessness joblessness         FE         FE         FE         FE
3 joblessness joblessness   training   training   training   training
4    training    training   training   training   training   training

Frequency table of the gcse5eq variable (qualifications gained by the end of compulsory education)

R> table(mvad$gcse5eq)

 no yes
452 260

Contingency tables

Cross tabulate variables funemp (father unemployed) and gcse5eq (qualification gained at the end of compulsory school)

Assign more informative value labels to the two factors (both are dummy variables with ‘yes’/‘no’ labels in the original file)

R> levels(mvad$funemp) <- c("employed", "unemployed")
R> levels(mvad$gcse5eq) <- c("Lower qual.", "Higher qual.")
R> ct1 <- table(...)
R> ct1

Row and marginal distributions

Row and column distributions

R> prop.table(ct1, 1)
            gcse5eq
             Lower qual. Higher qual.
  employed        0.6084       0.3916
  unemployed      0.7692       0.2308

R> prop.table(ct1, 2)

We perform a Chi-squared independence test using the chisq.test() function

R> chisq.test(ct1)

Pearson's Chi-squared test with Yates' continuity correction

data: ct1
X-squared = 10.23, df = 1, p-value = 0.001384

R> hist(mvad$weight, col = "cyan")

[Figure: histogram titled "Histogram of mvad$weight"; x-axis mvad$weight, y-axis Frequency]
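The whole chain, from relabelled factors to the independence test, can be sketched on the mtcars data that ships with R. The data set and the names gearbox, engine, ct, row_dist, col_dist and test are assumptions chosen so that the example runs without TraMineR; the labels follow the mtcars documentation (am: 0 = automatic, 1 = manual; vs: 0 = V-shaped, 1 = straight).

```r
data(mtcars)

# Two 0/1 dummy variables turned into factors with informative labels
gearbox <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))
engine  <- factor(mtcars$vs, levels = c(0, 1), labels = c("V-shaped", "straight"))

ct <- table(gearbox, engine)     # contingency table
row_dist <- prop.table(ct, 1)    # each row sums to 1
col_dist <- prop.table(ct, 2)    # each column sums to 1
test <- chisq.test(ct)           # Yates' correction is applied to 2 x 2 tables

print(ct)
print(round(row_dist, 4))
print(round(col_dist, 4))
print(test)
```

The second argument of prop.table() selects the margin: 1 for row distributions, 2 for column distributions.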

Barplot

Applying plot() on factors (categorical variables) generates a bar plot

R> plot(mvad$gcse5eq, col = c("red", "green"), main = "Variable gcse5eq")

[Figure: bar plot titled "Variable gcse5eq" with bars for Lower qual. and Higher qual.]

XY scatterplot

Applying plot() on two numerical variables generates a scatterplot

R> plot(iris$Sepal.Length, iris$Sepal.Width, col = "red")

[Figure: scatterplot of iris$Sepal.Width against iris$Sepal.Length]

Boxplots

The boxplot() function accepts a formula as argument to produce a boxplot for each category of a factor

R> boxplot(iris$Sepal.Length ~ iris$Species, col = "cyan", main = "Sepal length, by species")

[Figure: boxplots titled "Sepal length, by species" for Species 1, Species 2 and Species 3]

Saving graphics

To save graphics in files, depending on the format you can use the pdf(), jpeg() or png() function with the name of the file as argument

Once you have issued all plotting commands you have to close the file with the dev.off() function

R> pdf(file = "hist")
R> plot(mvad$Sep.93, mvad$Sep.94)
R> dev.off()
pdf
  2
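The open, plot, close sequence can be sketched with the boxplot shown earlier. The temporary file name is an assumption made so that the example leaves the working directory untouched; plot_file is a hypothetical name.

```r
plot_file <- tempfile(fileext = ".pdf")  # assumed location for the demo

pdf(file = plot_file)                    # open the graphics file
boxplot(Sepal.Length ~ Species, data = iris, col = "cyan",
        main = "Sepal length, by species")
invisible(dev.off())                     # close the file, otherwise it stays incomplete

print(file.exists(plot_file))
```

Replacing pdf() by png() or jpeg() changes the output format without changing the plotting commands.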

Some basic statistical functions

Add a new variable to the mvad data frame

R> mvad$weight100 <- mvad$weight * 100
R> head(mvad$weight100)
R> mean(mvad$weight100)
[1] 99.94
R> min(mvad$weight100)
[1] 13
R> max(mvad$weight100)
[1] 446

Some useful functions

seq(from,to)

R> my.seq <- seq(...)
R> head(my.seq)
[1] 1 2 3 4 5 6

R> my.groups <- ...
R> table(my.groups)
my.groups
  0-19  20-39  40-59 60-100
    19     20     20     40
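One common way to build such groups is the base R function cut(). The sketch below is an assumption about how a grouping of this kind can be obtained, not a reconstruction of the command used on the slide; the sequence 0 to 100 and the break points are chosen for the illustration, so the counts differ from those shown above.

```r
my.seq <- seq(0, 100)   # assumed sequence, for illustration only

# right = FALSE makes each interval closed on the left: [0,20), [20,40), ...
my.groups <- cut(my.seq,
                 breaks = c(0, 20, 40, 60, 101),
                 right  = FALSE,
                 labels = c("0-19", "20-39", "40-59", "60-100"))

group_counts <- table(my.groups)
print(group_counts)
```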

TraMineR and other useful packages

TraMineR is available from the CRAN http://cran.r-project.org.

It is just one of more than 3500 packages on the CRAN, and there are many more on other repositories such as http://R-forge.R-project.org and http://www.bioconductor.org.

Whatever you want to do, there most probably exists a package which does it: Just Google for R + what you are interested in.

Suggested packages

From the TraMineR team on R-Forge

TraMineR development version (Gabadinho et al., 2011, 2009)
TraMineRextras: ancillary functions to be used with TraMineR (Ritschard et al., 2012)
PST: Probabilistic suffix trees (Gabadinho and Ritschard, 2012)
WeightedCluster: clustering and measures of cluster quality (Studer, 2012a,b)

Some other packages for clustering analysis (CRAN):

cluster (Kaufman and Rousseeuw, 2005; Maechler et al., 2005)
fastcluster (Müllner, 2012)
flashClust (Langfelder and Horvath, 2012)

Basic statistical analysis in R - Linear Regression

Statistical modeling: Regression

We use the mvad data of TraMineR

Regression of longitudinal entropies on male, catholic, ...

Loading the data

 [>] sequence object created with TraMineR version 1.9-2
 [>] 712 sequences in the data set, 490 unique
 [>] min/max sequence length: 70/70
 [>] alphabet (state labels):
     1=EM (employment)
     2=FE (FE)
     3=HE (HE)
     4=JL (joblessness)
     5=SC (school)
     6=TR (training)
 [>] dimensionality of the sequence space: 350
 [>] colors: 1=#7FC97F 2=#BEAED4 3=#FDC086 4=#FFFF99 5=#386CB0 6=#F0027F

R> lm.entrop <- lm(entrop ~ male + catholic + gcse5eq, data = mvad)
R> summary(lm.entrop)

Call:
lm(formula = entrop ~ male + catholic + gcse5eq, data = mvad)

Residuals:
    Min      1Q  Median      3Q     Max
-0.4713 -0.0951  0.0000  0.1286  0.4479

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)           0.3892     0.0128   30.48  < 2e-16 ***
maleyes              -0.0418     0.0133   -3.14   0.0017 **
catholicyes           0.0177     0.0131    1.35   0.1764
gcse5eqHigher qual.   0.0645     0.0138    4.68  3.4e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.174 on 708 degrees of freedom
Multiple R-squared: 0.0538, Adjusted R-squared: 0.0497
F-statistic: 13.4 on 3 and 708 DF, p-value: 1.61e-08

[Figure: Normal Q-Q plot of the standardized residuals against theoretical quantiles for lm(entrop ~ male + catholic + gcse5eq); observations 193, 310 and 421 are labelled]
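The parts of a regression summary can also be extracted as objects for later use. The sketch below fits a linear model on the mtcars data that ships with R, an assumption made so that the example runs without the entropies computed with TraMineR; fit, fit_summary, coef_table and r2 are hypothetical names.

```r
data(mtcars)

# Fuel consumption explained by weight and gearbox type
fit <- lm(mpg ~ wt + factor(am), data = mtcars)
fit_summary <- summary(fit)

coef_table <- fit_summary$coefficients  # estimates, std. errors, t values, p-values
r2 <- fit_summary$r.squared             # multiple R-squared

print(round(coef_table, 4))
print(round(r2, 4))
```

Calling plot(fit) produces the usual diagnostic figures, among them a normal Q-Q plot of the residuals.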

Basic statistical analysis in R - Logistic regression

Logistic regression I

Logistic regression: specific case of the generalized linear model glm() with family = binomial

R> lg.gr <- glm(gcse5eq ~ male + catholic, family = binomial, data = mvad)
R> summary(lg.gr)

Call:
glm(formula = gcse5eq ~ male + catholic, family = binomial, data = mvad)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.129  -1.082  -0.793   1.276   1.619

Logistic regression II

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.228      0.131   -1.74    0.082 .
maleyes       -0.768      0.159   -4.83  1.3e-06 ***
catholicyes    0.112      0.159    0.71    0.478
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 934.62  on 711  degrees of freedom
Residual deviance: 910.51  on 709  degrees of freedom
AIC: 916.5

Number of Fisher Scoring iterations: 4

Computing the ‘odds ratios’

Retrieve coefficients and compute their exp()

R> exp(lg.gr$coefficients)
(Intercept)     maleyes catholicyes
     0.7959      0.4641      1.1190

Completing the table of coefficients, standard errors and significance with exp(β)

R> lg.gr.coeff <- ...
R> lg.gr.coeff
            Estimate Std. Error z value  Pr(>|z|) Exp Estim.
(Intercept)  -0.2283     0.1315  -1.736 8.250e-02     0.7959
maleyes      -0.7677     0.1588  -4.833 1.343e-06     0.4641
catholicyes   0.1124     0.1586   0.709 4.783e-01     1.1190

Basic statistical analysis in R - Defining a custom function

Defining a custom function

R> discretize <- ...
R> discretize(0.33)
[1] 1
R> table(apply(entrop, 1, discretize))

  1   2   3
385 243  84
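Completing a coefficient table with exp(β) is a natural candidate for a custom function. The sketch below is one possible way to do it, assuming the table is taken from summary() and extended with cbind(); odds_ratio_table() is a hypothetical helper, and the model is fitted on the mtcars data that ships with R rather than on mvad.

```r
# Hypothetical helper: coefficient table of a fitted glm plus exp(estimate)
odds_ratio_table <- function(fit) {
  coeff <- summary(fit)$coefficients
  cbind(coeff, "Exp Estim." = exp(coeff[, "Estimate"]))
}

data(mtcars)

# Probability of a manual gearbox explained by car weight
fit <- glm(am ~ wt, family = binomial, data = mtcars)
or_table <- odds_ratio_table(fit)

print(round(or_table, 4))
```

Because the logic sits in a function, the same table can be produced for any other logistic model with a single call.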

Thank you! See you next week.

References

Gabadinho, A. and G. Ritschard (2012). PST: Probabilistic Suffix Trees. R package version 0.66/r157.

Gabadinho, A., G. Ritschard, N. S. Müller, and M. Studer (2011). Analyzing and visualizing state sequences in R with TraMineR. Journal of Statistical Software 40 (4), 1–37.

Gabadinho, A., G. Ritschard, M. Studer, and N. S. Müller (2009). Mining sequence data in R with the TraMineR package: A user’s guide. Technical report, Department of Econometrics and Laboratory of Demography, University of Geneva, Geneva.

Kaufman, L. and P. J. Rousseeuw (2005). Finding Groups in Data. Hoboken: John Wiley & Sons.

Langfelder, P. and S. Horvath (2012). Fast R functions for robust correlations and hierarchical clustering. Journal of Statistical Software 46 (11), 1–17.

Maechler, M., P. Rousseeuw, A. Struyf, and M. Hubert (2005). Package ‘cluster’: Cluster analysis basics and extensions. Reference manual, R-project, CRAN.

Maindonald, J. and J. Braun (2010). Data Analysis and Graphics Using R: An Example-based Approach (3rd ed.). Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press.

Maindonald, J. H. (2008). Using R for data analysis and graphics: Introduction, code and commentary. Manual, Centre for Mathematics and Its Applications, Australian National University.

Muenchen, R. A. (2012). The popularity of data analysis software. Online at r4stat.

Müllner, D. (2012). fastcluster: Fast hierarchical clustering routines for R and Python. Version 1.1.6.

Paradis, E. (2006). R for beginners. Manual, Institut des Sciences de l’Évolution, Université Montpellier II.

Ritschard, G., R. Bürgin, M. Studer, and N. Müller (2012). TraMineRextras: Extras for use with the TraMineR package. R package version 0.1-111/r226.

Studer, M. (2012a). Étude des inégalités de genre en début de carrière académique à l’aide de méthodes innovatrices d’analyse de données séquentielles, Volume SES-777 of Collection des thèses. Université de Genève, Faculté des sciences économiques et sociales.

Studer, M. (2012b). WeightedCluster: Clustering of Weighted Data. R package version 0.9.

Venables, W. N., D. M. Smith, and the R Development Core Team (2011). An introduction to R. Manual, The R-project.