A brief introduction to R

Author: Hugh Williamson

R is software for data manipulation, calculation and graphical display, and it is widely used among statisticians. It is free and open source. Many new statistical methods are implemented in R and distributed as packages, which makes R attractive to many practitioners. R grew out of the S language, which was developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks. This short introduction provides some basic commands and functions to help you get started with R. However, I believe that the most efficient way to learn a language is to use it.

To get more information on any specific named function in R, for example solve, use the command
> help(solve)
An alternative is
> ?solve
The help.search command (alternatively ??) allows searching for help in various ways. For example,
> ??solve
For more detail and a comprehensive introduction to R, please refer to http://cran.r-project.org/manuals.html

1

Vectors

R operates on objects, such as vectors of real or complex values, which are the basic units we work with. To set up a vector named x, consisting of the four numbers (1.2, 1.5, 5.6, 2.5), use the R command
> x <- c(1.2, 1.5, 5.6, 2.5)
A vector x holding the integers from 1 to 10 is displayed as
> x
 [1]  1  2  3  4  5  6  7  8  9 10
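As a small sketch of how such vectors can be built: c() combines individual values, while the colon operator and seq() are two common ways to generate a regular sequence. The names s1 and s2 are illustrative only and do not come from the original text.

```r
# Build the four-element vector described in the text
x <- c(1.2, 1.5, 5.6, 2.5)
length(x)

# Two common ways to produce the integers 1, 2, ..., 10
s1 <- 1:10
s2 <- seq(from = 1, to = 10, by = 1)

# Both sequences hold the same values
all(s1 == s2)
```

The colon form is the shortest; seq() is handy when the step size is not 1.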

The elementary arithmetic operators, such as +, -, * and /, are performed element by element between vectors. In addition, many commonly used arithmetic functions can be applied to vectors, for example log, exp, sin, cos, tan and sqrt.
> x <- c(1.2, 1.5, 5.6, 2.5)
> x^2 # square each component in x
[1] 1.44 2.25 31.36 6.25
> x^4 # x to the power of 4
[1] 2.0736 5.0625 983.4496 39.0625
> sqrt(x)
[1] 1.095445 1.224745 2.366432 1.581139
> tan(x)
[1] 2.5721516 14.1014199 -0.8139433 -0.7470223
R provides many convenient ways to manipulate vectors. x[1] represents the first component of the vector x, where [] encloses the index set of the vector. To select a subvector of x, append an index vector in brackets [] to x. Such an index vector can be (a) a logical vector, (b) a vector of positive integers within the range of the length of the vector, or (c) a vector of negative integers.
> x[c(T,F,F,T)] # select the 1st and 4th components in x. T for TRUE and F for FALSE
[1] 1.2 2.5
> x>2
[1] FALSE FALSE TRUE TRUE
> x[x>2] # select the sub vector greater than 2
[1] 5.6 2.5
> x[c(1,4)] # select the 1st and 4th components in x
[1] 1.2 2.5
> x[-1] # exclude the 1st component
[1] 1.5 5.6 2.5

2

Matrices

A matrix can be created easily in R with the function matrix(), which is considerably simpler than manipulating arrays in a lower-level language such as C. For example,
> amatrix <- matrix(1:6, nrow=2, ncol=3) # Create a 2 by 3 matrix named amatrix
> amatrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
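The printed matrix shows that matrix() fills its entries column by column by default. The following sketch contrasts this with the byrow argument; the names m_col and m_row are hypothetical and used only for illustration.

```r
# Default: the data are placed down the columns
m_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE places the same data across the rows instead
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

dim(m_col)      # number of rows and columns
m_col[2, 1]     # second entry of the first column
m_row[2, 1]     # first entry of the second row
```

Filling by row is often convenient when the data are typed in the same layout as they appear on paper.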

The following examples show how to extract rows or columns from the matrix amatrix:
> amatrix[1,]
[1] 1 3 5
> amatrix[,2]
[1] 3 4
> amatrix[,2:3]
     [,1] [,2]
[1,]    3    5
[2,]    4    6
We can also bind a vector or matrix to another matrix or vector to form new vectors or matrices.
> cmat <- c(2,3)
> cmat
[1] 2 3
> dmat <- cbind(amatrix, cmat)
> dmat
           cmat
[1,] 1 3 5    2
[2,] 2 4 6    3
> emat <- c(2,3,8)
> dmat <- rbind(amatrix, emat)
> dmat
     [,1] [,2] [,3]
        1    3    5
        2    4    6
emat    2    3    8
R contains many operators and functions that are available only for matrices. For example, t(X) is the matrix transpose function, and the functions nrow(A) and ncol(A) give the number of rows and columns of the matrix A respectively. Matrix multiplication is done with the operator %*%. For a 3 by 2 matrix bmatrix,
> amatrix%*%bmatrix
     [,1] [,2]
[1,]   22   49
[2,]   28   64
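The matrix functions just mentioned can be tried out with the sketch below. The definition of bmatrix is an assumption: a 3 by 2 matrix filled column by column with 1 to 6, which is one choice that gives the product 22, 49, 28, 64; the original definition is not shown.

```r
amatrix <- matrix(1:6, nrow = 2, ncol = 3)

# Assumption: bmatrix is a 3 by 2 matrix filled column by column with 1..6;
# the original definition of bmatrix is not shown in the text.
bmatrix <- matrix(1:6, nrow = 3, ncol = 2)

t(amatrix)                       # transpose: 3 rows, 2 columns
nrow(amatrix)                    # number of rows
ncol(amatrix)                    # number of columns

prod_ab <- amatrix %*% bmatrix   # matrix product: 2 by 2
prod_ab

amatrix * amatrix                # by contrast, * multiplies element by element
```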

To get an inverse, eigenvalues and eigenvectors, determinant and singular value decomposition of a matrix, the corresponding functions are solve(), eigen(), det(), svd().
> solve(dmat)
                    emat
[1,] -1.75  1.125  0.25
[2,]  0.50  0.250 -0.50
[3,]  0.25 -0.375  0.25
> eigen(dmat)
$values
[1] 12.1204947  1.3635608 -0.4840555
$vectors
           [,1]       [,2]       [,3]
[1,] -0.4572209 -0.3248032 -0.9607523
[2,] -0.5985983 -0.8212344  0.2381445
[3,] -0.6577455  0.4691235  0.1422753
> det(dmat)
[1] -8
> svd(dmat)
$d
[1] 12.883717  1.339795  0.463458
$u
           [,1]       [,2]       [,3]
[1,] -0.4572923  0.2742784  0.8459640
[2,] -0.5767985  0.6325630 -0.5168824
[3,] -0.6768953 -0.7243172 -0.1310628
$v
           [,1]        [,2]       [,3]
[1,] -0.2301106  0.06774931 -0.9708033
[2,] -0.4431762  0.88083306  0.1665171
[3,] -0.8663971 -0.46855431  0.1726641
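One way to check these results is to multiply the matrix by its inverse and confirm that the product is the identity. The sketch below rebuilds the 3 by 3 matrix dmat directly and also uses the two-argument form solve(A, b) to solve a linear system; the right-hand side b is a hypothetical example.

```r
# Same 3 by 3 matrix as dmat in the text, entered column by column
dmat <- matrix(c(1, 2, 2,
                 3, 4, 3,
                 5, 6, 8), nrow = 3)

inv <- solve(dmat)

# The product of a matrix and its inverse should be the identity,
# up to floating-point rounding
all.equal(dmat %*% inv, diag(3))

# solve(A, b) solves A %*% z = b directly; b is a hypothetical right-hand side
b <- c(1, 0, 2)
z <- solve(dmat, b)
all.equal(as.vector(dmat %*% z), b)
```

For solving a system, solve(A, b) is usually preferable to forming the inverse explicitly and multiplying.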

3

Reading data from external files

There are several ways to read an external data file into R. Here we introduce the functions read.table() and scan(). For more detail about reading external data files and exporting data, please see the R Data Import/Export manual. The function read.table() is used to import data laid out in a regular tabular form. For example, consider the airline passenger data below:

     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
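A file with this layout can be tried out without any external data. The sketch below writes the first two rows of the table to a temporary file and reads them back with read.table() and scan(). The names tmp_file, air and air_vec are hypothetical, and the temporary file merely stands in for a real data file.

```r
# Write the first two rows of the table to a temporary file
# (tmp_file is a hypothetical stand-in for a real data file)
tmp_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",
  "1949 112 118 132 129 121 135 148 148 136 119 104 118",
  "1950 115 126 141 135 125 149 170 170 158 133 114 140"
), tmp_file)

# The header has one field fewer than the data rows, so read.table()
# uses the first column (the year) as row names
air <- read.table(tmp_file, header = TRUE)
dim(air)
air["1950", "Mar"]

# scan() returns one long numeric vector; skip = 1 skips the header line
air_vec <- scan(tmp_file, skip = 1)
length(air_vec)
```

read.table() keeps the tabular structure as a data frame, whereas scan() flattens everything, including the year column, into a single vector.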

The first row contains the names of the columns, each row is one sample, and all the rows contain the same number of columns. Suppose the file name of the above data is ‘AirPassenger.txt’. To read this file into R, we could use the following commands:
> AirPassenger <- read.table("AirPassenger.txt", header=T)
> AirPassenger
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
If the data file does not contain column names, you should change header=T to header=F. The above file can also be read using scan().
> AirData <- scan("AirPassenger.txt", skip=1) # skip the column name
> AirData
 [1] 1949  112  118  132  129  121  135  148  148  136  119  104  118
[14] 1950  115  126  141  135  125  149  170  170  158  133  114  140
[27] 1951  145  150  178  163  172  178  199  199  184  162  146  166
[40] 1952  171  180  193  181  183  218  230  242  209  191  172  194
[53] 1953  196  196  236  235  229  243  264  272  237  211  180  201

4

Probability distributions

As a statistical package, R provides easy commands to generate random variables and to evaluate densities or mass functions, distribution functions and quantile functions for a comprehensive list of probability distributions. Some of the commonly used distributions are listed in Table 1.

Table 1: Commonly used distributions
Distribution    R name     Additional arguments
beta            beta       shape1, shape2, ncp
binomial        binom      size, prob
chi-squared     chisq      df, ncp
exponential     exp        rate
F               f          df1, df2, ncp
gamma           gamma      shape, scale
geometric       geom       prob
normal          norm       mean, sd
Poisson         pois       lambda
Student’s t     t          df, ncp
uniform         unif       min, max
Weibull         weibull    shape, scale

To generate random variables from the above distributions, prefix the R name by ’r’. For example, to generate 10 random variables from binomial(5,0.5), use the following command
> rbinom(10,5,0.5)
[1] 3 1 2 4 3 3 4 1 1 4
The first argument is the number of random variables, and the second and third arguments are the parameters n and p respectively. In addition, prefix the R name given here by d for the density, p for the CDF and q for the quantile function. The first argument is x for dxxx, q for pxxx, p for qxxx and n for rxxx. For example,
> dbinom(4,5,0.5)
[1] 0.15625
The above command gives the probability mass at 4 for the binomial distribution with parameters 5 and 0.5.
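The four prefixes can be seen side by side in the sketch below, which uses the binomial and normal distributions from Table 1. The seed value and the names q975 and draws are illustrative assumptions, not part of the original text.

```r
# d: density / mass, p: CDF, q: quantile, r: random draws
dbinom(4, size = 5, prob = 0.5)          # P(X = 4) for binomial(5, 0.5)
pbinom(4, size = 5, prob = 0.5)          # P(X <= 4)
sum(dbinom(0:5, size = 5, prob = 0.5))   # the mass function sums to 1

q975 <- qnorm(0.975, mean = 0, sd = 1)   # 97.5% quantile of the standard normal
pnorm(q975)                              # the CDF maps the quantile back to 0.975

set.seed(1)                              # hypothetical seed, for reproducibility
draws <- rnorm(5, mean = 10, sd = 2)     # five draws from normal(10, sd = 2)
length(draws)
```

Calling set.seed() before a random draw makes the generated values repeatable from run to run.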

5

Graphics

Graphical facilities are important components of the R environment. Plotting commands are divided into two basic groups:
• High-level plotting functions create a new plot on the graphics device, possibly with axes, labels, titles and so on.
• Low-level plotting functions add more information to an existing plot, such as extra points, lines and labels.
Some of the often used high-level plotting functions are:
plot(x, y): If x and y are vectors, plot(x, y) produces a scatter plot of y against x.
hist(x): Produces a histogram of the numeric vector x.
qqnorm(x): Distribution-comparison plots.
contour(x, y, z, ...): Draws a contour plot for z, as a function of x and y.
Useful lower-level plotting functions are:
points(x, y), lines(x, y): Add points or connected lines to the current plot.
text(x, y, labels, ...): Adds text to a plot at the points given by x, y.
abline(a, b): Adds a line of slope b and intercept a to the current plot.
legend(x, y, legend, ...): Adds a legend to the current plot at the specified position.
To access and modify the list of graphics parameters for the current graphics device, use the par() function.
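The sketch below combines high-level and low-level functions with par(). It draws into a temporary PDF file so that it also runs without an interactive display; the simulated data, the seed and the names out_file and old_par are assumptions made for illustration.

```r
set.seed(42)                      # hypothetical seed
x <- rnorm(50)
y <- 2 * x + rnorm(50)            # simulated data, for illustration only

out_file <- tempfile(fileext = ".pdf")   # hypothetical output file
pdf(out_file)                     # open a PDF graphics device

old_par <- par(mfrow = c(1, 2))   # two panels side by side
plot(x, y, main = "Scatter plot") # high-level: starts a new plot
abline(a = 0, b = 2, lty = 2)     # low-level: line with intercept 0, slope 2
legend("topleft", legend = "y = 2x", lty = 2)
hist(x, main = "Histogram of x")  # high-level: fills the second panel
par(old_par)                      # restore the previous settings

dev.off()                         # close the device and write the file
```

In an interactive session the pdf() and dev.off() calls can be dropped, and the plots then appear in the default graphics window.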

6

Loops and apply() functions

Like many other languages, R has three ways to construct loops.
(a) for (name in expr1) expr2
where name is the loop variable, expr1 is a vector expression (often a sequence like 1:20), and expr2 is often a grouped expression. expr2 is repeatedly evaluated as name ranges through the values in the vector result of expr1.
(b) repeat expr1 if expr2 break;
(c) while (condition) expr
For example, the following code is for “calculating averages for Bootstrap samples” given in Lecture 2.
> x myvec times for (i in 1:times) + { + y