Some R basics

[This is a cut down version, intended for people already used to using Matlab or similar software, covering some essential points of the tutorials available from http://ww2.coastal.edu/kingw/statistics/R-tutorials/ I recommend you look through these to improve your understanding. On the other hand you may find there are parts of the following you can safely skip, particularly if you already feel pretty secure about basic descriptive stats. Another handy quick reference is http://www.statmethods.net/]

Start in terminal with command ‘R’. You will enter interactive programming environment (read–eval–print loop 'REPL') just like in Python or Matlab.

You can exit any time with:

> quit()

And you can interrupt execution of a command with Ctrl+C

Up arrow will scroll through previous commands.

If you're thinking of using R in your project, I suggest you install R Studio which is a convenient IDE (similar to Matlab) http://www.rstudio.com/

R is case sensitive, but not generally picky about spacing.

Anything following # is a comment

> system("mkdir Rspace") # run linux command to create directory to work in

Note you might usually do this before starting R from within the relevant directory.

> setwd("Rspace") # changes working directory to Rspace (but does not change current workspace contents)

> rm(list=ls()) # This completely clears the workspace.

help(mean), for example, gives help information on the function ‘mean’, exit help by typing q. Can also use help.search("mean") and apropos("mean") to find relevant functions. ?mean also works, ??mean also useful.

Can continue a command over several lines (will get + prompt if command is apparently unfinished)

Can create script (sequence of commands) in separate editor and run like this:

> source("script.r") # n.b. the file extension can be anything you like

Can create new functions within R:

> times3 = function(x) {x+x+x}

> times3(7)

> times3 # will print the function definition

Functions can be considered like objects (see below) in terms of their existence in the workspace, saving, loading, etc.

Objects

For assignment can use

> x=7

> x = 7

> x <- 7

> x<-7

> 7 -> x

Original versions of R and many online and printed tutorials insist on using '<-' instead of ‘=’ assignment operator, because of the potential confusion with logic testing. Note <- must not have a space in the middle!

Throughout this workshop we'll use '=' mainly for convenience (one keystroke instead of two) and consistency with other programming environments which you might already know.

Variable names cannot have spaces, standard use is ‘.’ separator, e.g. my.data also accepted. Avoid giving variables function names (R will not object but can cause confusion) or T,F (used for boolean coding).

Strings can be encapsulated in single or double quotation marks interchangeably .

> x # to see what the object contains

> ls() # List the contents of the workspace.

> rm(x) # Removes object x from workspace (permanently!)

> save.image # saves copy of workspace to .RData file in working directory; will reload next time you start R in that directory; note you will also be automatically prompted to do this on quit.

R has built in objects, including many functions but also various data sets, and you can use them without them being explicitly loaded into the workspace (R will look through search path to find them). To put a copy into the workspace can use assignment:

> rivers # will show you the content – lengths of major rivers in North America

> x = rivers # will put copy in workspace called x

> data(rivers) # will put copy in workspace called rivers

Can save and load objects, with save(x,file="name") and load("name").

Vectors (matrices, arrays):

Consist of one type of data: numbers or character strings. Note R does not distinguish between row and column vectors.

> x = c(1,2,3,4,5) # c for concatenate; note can also use…

> y = scan() # and then type the elements, which is often easier for entry

Try it! You can continue data entry over multiple lines, with as many elements per line as you like, a blank line will end entry.

If x and y are the same length can do a scatter plot

> plot(x,y)

NOTE ON PRODUCING GRAPHS

There are lots of ways to change details of the plot, see here for some examples: http://www.bio.ic.ac.uk/research/crawley/statistics/exercises/R1Plots.pdf

Also, see http://www.harding.edu/fmccown/r/ for a comprehensive guide on different types of plots and charts.

In order to save a graph to your disk you can use variety of methods.

In the most common one, you first specify the target file and format then produce the graph and finally close the device:

> png(filename="your_filename.png")

> plot(x,y)

> dev.off()

png function can be replaced with any other format, including svg, pdf, ps and eps which are useful for big posters and latex documents. In case of eps the function to use is called postscript. In order to save our scatterplot in .eps we need to type:

> setEPS()

> postscript("our_scatterplot.eps")

> plot(x,y)

> dev.off()

Alternatively, in order to save whatever is currently displayed you can use:

> dev.copy(png,'your_filename.png')

> dev.off()

Same as before, you can replace png with any other format.

If you want to make your graphs look pretty and have better control over what happens in your figure consider using modules such as lattice or ggplot2 which will allow you to produce complex publication-quality figures. That's not necessary for this workshop though.

> y = c("A","Bit","Charlie","A","B") # a vector of strings

Note that to enter character strings with scan() you need to specify scan(what="character")

Vectors of numbers or strings are conceptually similar to matlab. You can also have 2-d matrices and multidimensional arrays. Can create ranges:

> x = 1:5 # creates a vector containing 1 2 3 4 5

repeats:

> x = rep(1,5) # creates a vector containing 5 ones

random variables:

> x = rnorm(50,0,1) # creates vector of 50 random values from a normal distribution with mean 0 and standard deviation 1

and extract elements by indexes (use square brackets):

> x[3]

or logical statements:

> x2 = x[x>2]

The data set ‘rivers’ is a vector (lengths of major rivers in North America). You could visualise it using plot:

>plot(rivers)

This does give you some information (e.g. that there’s one very long river) but plotting this way is really a bad idea. Why?

Better ways to look at the distribution:

> hist(rivers)

> boxplot(rivers)

A boxplot displays the data using the range, median and quartiles, i.e. the highest and lowest values, the middle value, and the values at the 25% and 75% points. It thus contains more information about the distribution than the conventional plot of means with standard deviations, and is also less subject to bias, so generally should be preferred. To check what is going on, compare the plot to the output of:

> summary(rivers)

If you want to explore some more:

> data() # will list the available datasets

Following is different to matlab!!:

Named vectors: can give each item in a vector an associated name, and can then use name rather than index to refer to it.

> x = 1:5

> names(x)= c("jo","di","al","pip","mo")

> x["pip"]

Note confusion can arise because the ‘name’ could be a number. e.g.

> names(x)=5:1 # names the elements of x ‘5’ ‘4’ ‘3’ ‘2’ ‘ 1’

> x

5 4 3 2 1 # names of elements in x

1 2 3 4 5 # actual elements of x

> x[2] # indexing the second element of x

> x["2"] # indexing the element of x named "2", i.e. the fourth element

This may seem obscure (why would we use such a naming system?!?) but can be a source of confusion when we get to data frames (see below), because R will name your rows with numbers by default, and it is important to be aware that they are not indices.

Look at dataset ‘precip’

>str(precip) # str() describes the structure of an object

This gives rainfall by US city, so you could use the city name as index to extract the rainfall for that location. Look at whether this data is normally distributed by plotting against a normal distribution with the same mean and s.d.

> precip.norm = rnorm(length(precip),mean(precip),sd(precip))

> plot(sort(precip.norm),sort(precip))

There’s a direct function for this as well

>qqnorm(precip) # does a ‘quantile-quantile’ plot

and

> shapiro.test(precip)

does a statistical test for normality, where a p-value <.05 indicates the data is not well explained by an assumption of normality (see next session for more on logic of such tests).

Factors: the vector is treated as categorical. Can be ordered or unordered. Coding can be numerical, character or Boolean.

E.g. if a vector is a factor you will have a different result from the summary() function. Try:

> x = 1:5

> summary(x)

> y = c("A","B","C","A","B")

> summary(y)

Then change y to a factor:

> y = factor(y)

> summary(y)

This will also have a different effect on plot:

> plot(y,x)

The factor will be treated as a way of grouping values of the numeric vector and the default will be a boxplot.

Lists: Any set of objects, can mix types. Indexed using double brackets. Can name elements of a list, and then index using list$name.

Data frame: list of column vectors (can be of mixed types) of equal length.

The basic idea is that each column is the measurement on one variable, and each row is one case. E.g. groups in an experiment will be entered as levels in a factor vector that makes up one column of the data frame. For a simple example look at:

> ? InsectSprays

> str(InsectSprays)

> InsectSprays

Important to get used to this as it structures your ideas and analysis, particularly as you start doing more complex experiments. In particular, incorrect arrangement of data in a data frame may lead to incorrect application of statistical tests. See later sessions.

You could create a dataframe by entering the vectors and then combining them. The columns will have the names of the vectors.

> x = 1:5

> y = c("A","B","C","A","B")

> y = factor(y)

> my.frame = data.frame(x,y)

Note that by default the rows will be ‘named’ by numbers, and you will see these when the data frame is displayed. WARNING! These are not indices. In particular, if you order the data frame (sort all rows by one variable) these names will be sorted as well. Can change to actual names.

> rownames(my.frame) = c("Bob","Fred","Al","Sue","Jeff")

> my.frame$x = NULL # This is how you erase a variable from a data frame.

> my.frame$z = 5:1 # You can add a new column to a data frame like this.

Typically you would not create a data frame directly in R but read it in from a file. E.g. download this file into your current working directory of R, and look at the contents using any text editor:

gardens.txt

Read it into R using read.table function:

> test.data=read.table("gardens.txt",header=T) # the header option tells R to use the first line as names.

> str(test.data) # as before, shows the structure and indicates the content

> head(test.data) # or just look directly at the first few rows

You can refer to the columns by name using the list$name construct

> test.data$gardenA

You can also ‘attach’ the frame to your search path so that you can refer to individual columns directly

> attach(test.data)

> gardenA

NOTE ABOUT DATA IMPORT:

You can find comprehensive information on how to import data sets from other programmes here. Note that when reading in files, character strings will be interpreted as factors (unless you specify otherwise). If numbers should be treated as factors (e.g. you have coded the groups in your experiment as 1,2,3 etc.) you will need to specify this too.

read.table and related functions (such as read.csv) have more optional arguments which allow you to specify the separating character (in read.table it's a tab character \t), number of lines to skip, what to do with strings (e.g. whether to change them into factors, keep the quotation marks, etc.). When using R to analyse your data you may need to adjust the values of these arguments to the characteristics of the data.

Also, R is able to read tables directly from gzip archives thus skipping the step of decompression and saving space on your disk.

Finally, if the datasets are big, it might be sensible to read them into the memory using a local (temporary) SQLite database as a staging point. Sometimes you may want want to consider only reading parts of the table to computer's working memory while keeping the rest in a local SQLite database on your disk, this way we're not limited by the size of RAM and we can easily work with >1GB tables.

We won't use these methods during the workshop but if you're interested, some of the procedures are outlined here http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r?lq=1 and in many other stackoverflow threads.

Back to our small data set “gardens”.

>test.data

This example illustrates the potential ambiguity in arrangement of data. According to Crawley, it shows "the ozone concentration in parts per hundred million (pphm) in three market gardens on ten summer days". But is the ordering in the columns meaningful, e.g. 10 consecutive days (so potentially numeric data), or each row representing a strictly later date than the previous one (ordinal data)? Or could we reorder the rows without losing information (treat as ‘random’ samples)? Were the measurements taken on the same 10 summer days in all three gardens? Or is there no real relationship between data appearing in the same row but different columns?

To be consistent with the data frame concept we should really rearrange these data, so that there is one column for ‘ozone’, one factor for ‘garden’ and (assuming the order is meaningful) one factor for ‘day’. In theory there are functions called stack() and reshape() to help do this but their syntax is rather obscure. We can do it by hand:

> ozone = c(gardenA,gardenB,gardenC)

> garden =c(rep("A",10),rep("B",10),rep("C",10))

> day = rep(1:10,3)

> garden = factor(garden)

> day = factor(day)

> garden.frame = data.frame(day,garden,ozone)

> garden.frame

There are several command structures in R that exploit the data frame structure and learning to use them can help to make your processing more clear and efficient , e.g. with(), by() and tapply(). Try:

> mean(garden.frame$ozone[garden=="A"]) # gives the mean for garden A

> tapply(garden.frame$ozone, garden.frame$garden, mean) #applies the mean function to ozone for each level of garden

> cor(women$height,women$weight)

> with(women, cor(height,weight)) # with the data frame women, correlate the vectors height and weight

Either has the same effect, to produce the rstatistic for the correlation of the two variables, i.e. their covariance divided by the square root of the product of the individual group variances:

Or, equivalently, this is measure of fit of a least squares linear regression of y on x. If you are not familiar with linear regression (or want to brush up) see some basic notes here. We will come back to regression in a later session.

You can fit a linear model using lm(). For example to fit a line for the relationship of height to weight

> lm(women$weight~women$height)

And you could look at how well this fits using

> plot(women$height,women$weight)

> abline(lm(women$weight~women$height)) # plots the line y=a+bx, with a,b taken from the regression.

Purely categorical data sets are normally stored as tables rather than data frames. For more information see http://ww2.coastal.edu/kingw/statistics/R-tutorials/