Evolving with R #1: Getting started & your first simulation

R is an application for doing statistics, graphing, and all kinds of computations. Because it is free, runs on Macs, PCs, and Unix, and is extremely versatile, it is used throughout the sciences and in medicine, business, government, and in non-profits–in short, everywhere people are asked to analyze data. Because proficiency with R is not trivially learned, it is a skill that demands respect and can open doors for you.

In this lab, we use R to simulate biology in order to learn about ecology and evolution. But we also use it to analyze data from simulations or experiments, to prepare statistical results and figures for publication, and to share our methods with colleagues after a paper is published. Learning R will therefore help you participate in every part of the scientific activities of this lab, and of most research environments you might find yourself in later on.

“R” is not very useful to Google on its own; you can look up the acronym “CRAN,” which stands for the Comprehensive R Archive Network. Or you can go to the URL https://cran.r-project.org/  Once there, you want to download the latest “base” version of R as a precompiled binary for your Mac or PC, then install it. Bookmark the CRAN site, as it has many resources that will prove helpful once you get started.

These notes are inspired by Michael Crawley’s R book and my own experience with R over the last ten years or so. While not very comprehensive, these notes should help you quickly learn to use R for doing simulations of population-level models in biology.

On a Mac, you should have an application called “R” which you can double-click to open. You’ll see a window open with some version information and a greater-than symbol with a blinking cursor next to it. This “>” symbol is called the command prompt; basically, you type next to it then hit “Enter” or “Return” to have R run whatever you typed.

I’ll use blue text to indicate things you should type into R, and orange text to indicate output that you get from R in response to commands. Following Crawley’s R book, I will bold text to indicate key terms that I want you to be sure to understand.

Let’s try typing the following at the command prompt and hitting return. You should see the output given below the blue command.

1:10

[1] 1 2 3 4 5 6 7 8 9 10

The command “1:10” made a set of integers from 1 to 10 and printed that to the console–that is, the window in which we can directly type commands and see text output. Instead of just printing the results, we can store the results of this command in a variable, which is just a container with a name1 that we use to store information. Try this out with the following use of the assignment operator2, “=”.

x = 1:10

x

[1] 1 2 3 4 5 6 7 8 9 10

Notice that assigning the results to a variable prevents them from being printed; we can then see what’s in the container “x” by just typing its name and hitting enter.

Let’s break down what we just did. An operator is a special symbol that tells R to perform some operation. There’s only a handful of operators to remember; mostly, you’ll use functions to tell R what to do. In this case, the single equals sign “=” assigns the results of whatever is to the right of it to the variable with the name on the left.

In fact, the “:” symbol is an operator as well; it handles the common task of making a sequences of integers. It has an equivalent function called seq(). A function is the basic unit of doing complex computations. Each function works like a little machine that takes input and does something to that input, producing output. A program is basically like an assembly line, where you might take the output from one machine and put it as input to another machine until you have a finished product.

Let’s try out seq() a bit. First, why am I writing it with a pair of empty parentheses after the name? This is to remind us that this is a function; functions are typically going to be used by putting inputs, or arguments, inside those parentheses. Even if you’re not giving the function any arguments, you still need the parentheses to help R realize that you’re giving it the name of a function. Arguments are either going to be pieces of data that we want the function to do something with, or instructions about what we want that function to do.

seq(from=1, to=10)
[1] 1 2 3 4 5 6 7 8 9 10

seq(1, 10)
[1] 1 2 3 4 5 6 7 8 9 10

In the first example, I gave the function two arguments by name: “from” tells it where to start, and “to” tells it where to stop. In the second example I left off the names, which works because I gave those arguments in the expected order. To see the difference, we can reverse the arguments in both examples and see what the results look like.

seq(to=10, from=1)
[1] 1 2 3 4 5 6 7 8 9 10

seq(10,1)
[1] 10 9 8 7 6 5 4 3 2 1

We can see that if we write “seq(x,y)”, R interprets this as from=x and to=y. How do we know what the arguments are called and that it’s going to read the arguments in that way? Well, we can look at the help file, which you can call up by typing:

?seq

The help files are useful only if you’ve put in some effort to think through what you want to do with the function. Never be afraid to just try out examples!


Exercise

Using the help file for seq(), find a method for producing the odd integers starting with 1 and ending with 99.


 

seq() is an example of a function with an output–it makes an object and hands it back to you when it is done, allowing us to put that object in a container with a name by assigning it to a variable. This output is sometimes called a return value; in the help files, this is described under the heading Value. Just like we need to know what arguments a function expects to see, we need to know what the return value will look like to use it properly. Again, the easiest way to get started is just to experiment. Let’s try to figure out what kind of thing seq() is giving back to us.

seq(1:100)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
[20] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
[39] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
[58] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
[77] 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
[96] 96 97 98 99 100

This is going to look a little different on your screen because the formatting depends on the set-up of your fonts and window, but you should see that there’s a number in a pair of square brackets at the beginning of each line. This is a hint that this object is a container that is subdivided; the numbers in square brackets are like addresses for houses on a street, or mailboxes in an apartment building. This kind of object is called a vector.

In the next lesson, we’ll see how to manipulate individual parts of a vector using square brackets. Right now we’ll look at how we can make a vector and what we can do with it. Vectors are basically the simplest way to store data, so we’ll be seeing them a lot in simulations.

x = 10:1
x
[1] 10 9 8 7 6 5 4 3 2 1
length(x)
[1] 10
x + 1
[1] 11 10 9 8 7 6 5 4 3 2

Try this yourself, and notice that typing and executing “x + 1” doesn’t change the values in the variable x–instead, it just prints the result of adding 1 to each value in x.

This is a good time to reinforce the most important reality about programming: the computer does not do what you want it to do, it does only what you tell it to do. It is easy to write “x + 1” when what you want is to change the values in x. The strength of R over other programming environments is that you can test and check the results of each step you perform in a program. This is only a strength if you actually do this!

x = x + 1
x
[1] 11 10 9 8 7 6 5 4 3 2
mean(x)
[1] 6.5

One of the most important functions is c(), where the “c” stands for “concatenate”–join end-to-end. c() is one tool for building a vector from scratch.

x = c(1, 3, 5)
x
[1] 1 3 5
x = c(x, 7, 9, 11)
x
[1] 1 3 5 7 9 11
length(x)
[1] 6
sum(x)
[1] 36

I recommend you keep your own file with examples and notes for each of the functions you come across. Complex programs are generally just combinations of very simple components–often the first step in understanding is going to be understanding what each component does in isolation.

We can start to get a sense of how simple pieces add up to a complex program by building a simulation of a scientific model. Before we do any more coding, we need to be clear on the distinctions between these terms. A model in science is a representation of nature: we choose how to abstract specific properties and interactions between certain things in nature by writing down rules, either verbally or in math. A simulation is then a way of exploring how that model works by using a computer. For example, we might say that the game of chess is a model of war: you have limited assets with different properties and must deploy them strategically to defeat an opponent. This model can be written down in words. If we wanted to simulate this model, we could program a computer to play chess, meaning that we would have to program the computer to store information about the model (which pieces are where, what the rules are, etc.) and how to choose between different moves. We would have to add a lot of other elements, like an interface so we can tell it what to do, and some way to store and output information about the games that it played.

Imagine we have a population of organisms that starts off very small, with only a handful of individuals. As that population grows, the frequencies of different alleles in that population might change because of randomness in reproduction. For example, if we have two bacteria in our population, then just by chance one of them will happen to divide before the other one does. Let’s try to model this with bacteria in mind–so, haploids that reproduce by asexual, binary fission. We start with two individuals in our population–one with genotype A and one with genotype B. We can model growth by doing the following algorithm:

  1. Pick an individual at random (we want to pick each individual with an equal chance because we are modeling how pure randomness affects allele frequencies).
  2. Make a new individual that is identical to this parent and add it to the population.
  3. Go back to step 1.

To execute this algorithm, we need a way to do these steps repeatedly and to store the results. We’ll use c() with a vector to store the information about the population, and a for loop to do the steps repeatedly.

So far, all of our “programs” have fit on a single line of text. Generally, though, we want to write programs that contain many steps, and some of those steps might be easier to read if we spread them across multiple lines. Command-N will open a new window in R which is really just a simple text editor. We can then write a script–a bunch of lines forming a program–in this window, and save that script to use later. Open a new window and type the following. Then, select all that text and press command-return.

for(i in 1:10)
{
   print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

A for loop runs through the code within the squiggly brackets {} for a set number of times. Each time through the loop, the variable “i” takes on the next value in the sequence 1:10. We’ll play with these elements more in later lessons, but right now the for loop lets us do something repeatedly.

We need one more element to simulate our model, which is a way to randomly pick who will reproduce in each step. We can do that with the (very powerful) function sample().

sample(1:10, 1)
[1] 6
sample(1:10, 1)
[1] 8
sample(1:10, 1)
[1] 9

The first argument to sample() gives it a vector to pick from, and the second argument tells it how many things you want it to pick. Each time you run it you’ll get a random result, so it won’t look the same for you as the examples above.

Let’s put these elements together to simulate our model of growth. We’ll use the number 0 to represent bacteria with genotype A and 1 to represent genotype B. Then we can set up our initial population and let it grow until there are a hundred individuals. We’ll then use mean() to see the frequency of the B allele.

pop = c(0, 1)
for(i in 1:98)
{
   pop = c(pop, sample(pop, 1))
}
pop
[1] 0 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 1 0 1 1 1 1 1
[40] 0 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0 0 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1
[79] 0 0 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0
mean(pop)
[1] 0.71

This is going to come out pretty differently each time you run it, so your results won’t look the same as mine. In the next lesson we’ll work on visualizing repeated trials of this whole experiment. In the meantime, try the exercise below to get an understanding of what each part of this simple script is doing.


Exercise

Change the script above to add a 0 or 1 with an equal chance at each time step, regardless of the contents of the population. Run repeated trials with both versions–how do the results differ?


 

Footnotes:

  1. Variables have additional properties besides a name, but these are often specified implicitly. For example, our variable storing the sequence from 1 to 10 has a length of 10 and stores integers specifically. These characteristics were set when we assigned a vector of ten integers to the variable. You can use the functions length() to see how big a vector is and class() to see what type of variables it can hold.
  2. Some style guides and other sources recommend the slightly different assignment operator “<-“. There are arguments in favor of both which I won’t repeat here; what I do recommend is that you strive for consistency in your own code.
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s