Evolving with R #3: Conditionals & replicates

In the last two lessons you’ve looked at evolutionary change in a growing population by simulating reproduction of two types of alleles. This is a model of how genetic drift–the name for random factors that change alleles frequencies from one generation to the next–works in populations that start very small. In this lesson we’ll make a more standard model of genetic drift in a population of a fixed size. This has the advantage that we can more easily leverage the mathematical theory of population genetics to predict the average outcome of our simulations, giving us a basis for comparison to see if we’ve coded things correctly.

Let’s start with the following code, which simulates genetic drift in a population with size N with allele B starting at a frequency p.

Fixed-N drift simulator

N = 100
p = 0.1
maxT = 100
alleleFrequencies = rep(0, maxT)
pop = rep(0, N)
pop[1:round(p*N)] = 1

for(t in 1:maxT)
{
   alleleFrequencies[t] = mean(pop)
   pop = sample(pop, N, replace=T)
}

plot(1:maxT, alleleFrequencies, type=”b”)

This code starts very simply by defining important parameters as variables; our intent isn’t to modify these parameters during the execution of the code, but merely to set them once and then use them throughout. As our code examples get more complicated you’ll notice more decisions that are intended to keep things organized and help readers understand the purpose of each piece of the whole. At a minimum, write your code so that you could come back to it six months later and quickly grasp it again.

In addition to defining what are essentially constants, this code also sets up several vectors to store the results of calculations. Because we know how big the population is we can make these once and reuse them.

In the for loop, we use a few tricks to make the evolutionary simulation simple. By encoding the alleles as ‘0’ and ‘1,’ we can use the mean() function to quickly get the frequency of the ‘1’ allele, which we can call B. Here, sample() acts on the whole population at once, giving us a new generation descended from the old alleles.

While this code is a little more complicated than previous examples, you can apply the same principles of testing individual pieces to understand the whole. Use that approach in the next exercise.


Exercise

The code below contains several bugs. Find each one and describe the consequences of that bug and how you found it.

N = 100
p = 0.1
maxT == 100
alleleFrequencies = rep(0, maxT)
pop = rep(0, N)
pop[round(p*N)] = 1

for(t in 1:maxT)
{
  alleleFrequencies[maxT] = mean(pop)
  pop = sample(pop, N, replace=F)
}

plot(1:maxT, alleleFrequencies, type=”b”)


The behavior of the Fixed-N drift simulator is quite variable, which makes it challenging to estimate its average behavior by eye. It would be better to summarize the results of many different simulations, and for that we need two innovations. The first is that we need code to efficiently run our simulations repeatedly and store the results. Of equal importance, we also need to make a good choice of what data we should collect, which requires focusing on a specific question. Guided by population genetics, I’ll pick this one to start: how often does the B allele fix (go to a frequency of 1)?

This choice of question creates its own problem; experimentation will show you that the B allele can take a very long time to fix, or might go extinct very quickly. If we make maxT too small we will have simulations that end without a definitive answer; if we make it sufficiently large we’ll waste a lot of time simulating populations in which we already have an answer. What we need is to continue a given simulation only if neither allele has fixed, and we can do this with a while loop combined with a conditional expression.

A conditional expression asks R to compare two variables and return either TRUE if they are equal or FALSE if they are not. This is done to control the flow of the program, meaning the order and number of times each part gets executed. We’ll look more at program flow in future lessons; for now we want to start practicing with the ‘equal to’ operator ‘==’. One concern is training ourselves to see the difference between the assignment operator ‘=’ and ‘==’ in a complicated piece of code.

A while loop behaves like a for loop in that a block of code is repeated until a condition is met. In fact, a for loop can be written with while with just a few extra lines.

x = 1:3

for(i in x)
{
  print(i^2)
}
[1] 1
[1] 4
[1] 9
 i = 1
while(i <= length(x))
{
  print(x[i]^2)
  i = i + 1
}
[1] 1
[1] 4
[1] 9

Here we used a different conditional operator, ‘<=’. In this case a for loop was much more efficiently, as it typically is when we know in advance how many times we want to go through the loop. The while loop shines when we can’t possibly know how many times to do the code in the loop without actually doing it, as is the case here.

Fixed-N drift simulator v.2

N = 100
p = 0.1
replicates = 100
results = rep(0, replicates)

for(i in 1:replicates)
{
  pop = rep(0, N)
  pop[1:round(p*N)] = 1
  while(mean(pop) > 0 & mean(pop) < 1)
  {
    pop = sample(pop, N, replace=T)
  }
  results[i] = mean(pop)
}

The revised code above uses a while loop to keep simulating evolution until one of the two alleles fixes. This is simple enough conceptually but uses three more conditional operators to code: ‘<‘ and ‘>’ for strictly less than and greater than, and the ‘&’ operator to specify that we want the mean(pop) to be BOTH above zero AND below one. The other trick here is nesting our while loop inside a for loop, so that our code can run many different, entirely independent simulations with the same starting conditions.


Exercise

Use the code above to measure and plot the relationship between the starting frequency of an allele and its chance of fixing. What did you discover?


Once we start using while loops it’s inevitable that you’ll eventually code an infinite loop– a loop with no attainable exit condition. You can hit the ‘Esc’ key or hit the stop sign symbol in your R console to halt the execution of an R command, which is the only way short of quitting the program to get out of an infinite loop.


Exercise

Our code is inefficient for large populations because it stores the population as a vector of alleles. We could instead store the same information in a single variable corresponding to the number of B alleles. Write a version of simulator v.2 that produces the same expected results but stores only one number to represent the population. Test it out and confirm that its behavior is indistinguishable to the other version.


 

Advertisements

Evolving with R #2: Loops & vectors

We’ll explore the simple simulation from the last lesson to learn more about controlling the program with loops, and storing data with vectors. To start, let’s try to understand the format and effect of the squiggly brackets used below.

pop = c(0, 1)
for(i in 1:98)
{
   pop = c(pop, sample(pop, 1))
}

We can actually rewrite this to be a little shorter without any change to how the program runs.

pop = c(0, 1)
for(i in 1:98) pop = c(pop, sample(pop, 1))

So why did we use the squiggly brackets? They create a block of code, which is basically a bunch of lines that are grouped together. This lets us control the execution of those lines as a group. Compare the execution of these two versions:

Code A

pop = c(0,1)
for(i in 1:98) newIndividual = sample(pop, 1)
pop = c(pop, newIndividual)

Code B

pop = c(0,1)
for(i in 1:98)
{
   newIndividual = sample(pop, 1)
   pop = c(pop, newIndividual)
}

Why didn’t version A work in the same way as version B? Simply because, without the brackets to define a block, the first statement following the declaration of the for loop is treated as the entirety of the code to loop over. Then, after 98 repetitions, pop = c(pop, newIndividual) executes once. Blocking the two lines together lets us control them together using the for loop.

Code B also has some formatting that doesn’t matter to R, but helps us; I’ve put each statement in the block on its own line and given each the same indent. This is optional; the following code, using the semicolon to separate statements, works just as well.

pop = c(0,1); for(i in 1:98) {newIndividual = sample(pop, 1); pop = c(pop, newIndividual)}

Generally, you want to use whitespace to format the code nicely and consistently to help you read it. This becomes much more important when you start to nest blocks–that is, write loops or other control structures within other loops. Nested loops are a simple way to perform replicate simulations.

Founder effect with replication

reps = 10
results = NULL
for(r in 1:reps)
{
   pop = c(0,1)
   for(i in 1:98)
   {
       newIndividual = sample(pop, 1)
       pop = c(pop, newIndividual)
   }
   results = c(results, mean(pop))
}

results
[1] 0.37 0.96 0.44 0.02 0.46 0.70 0.09 0.09 0.02 0.69

Note that we assembled this code inside-out: we wrote and tested the inner loop, which performs a single simulation, then wrapped another loop around it to form a more complex program.

The above code used a new trick: declaring results as equal to the special word NULL. This makes a variable with no contents and a length of zero which we can then grow by using c(). If you’re using R’s built-in editor then words like NULL should appear in a distinct color, along with other reserved words: terms that mean special things and can’t be reassigned to variables or functions. These are case-sensitive, so the special coloring helps you see if you’ve typed them correctly.

Using c() to grow a vector by adding results as you compute them is a pretty flexible way to store data because you don’t need to predict how many things you will have to store. However, the cost of this flexibility is execution speed: overuse of c() can be slow, because R has to allocate more memory to the variable as it grows. One option is to make an empty vector of the correct length before you start filling it in with data. We can easily do this using the rep() function.

results = rep(0, reps)
results
[1] 0 0 0 0 0 0 0 0 0 0

Now we need a way to access each box within the vector. There are actually a few options, but the simplest is to used the built-in address of each box. We call the address of a particular box in a vector an index; indices start with 1 and count up.

Note again the number in square brackets that prints to the side of the vector contents above; this is a little hint that we can use square brackets to access the contents of the vector by index.

results[5] = 10
results
[1] 0 0 0 0 10 0 0 0 0 0


Exercise

Modify the code ‘Founder effect with replication’ to make a vector of zeros to store the results of each replicate.


Accessing single boxes in a vector is useful, but R lets you do a lot more than that. You can access multiple boxes at once by putting a vector of indices inside the square brackets.

results[3:5] = 15
results
[1] 0 0 15 15 15 0 0 0 0 0
results[1:4] = c(10, 20, 30, 40)
results
[1] 10 20 30 40 15 0 0 0 0 0
results[1:4] = c(10, 20)
results
[1] 10 20 10 20 15 0 0 0 0 0

With this last example, you may be sensing a theme: R’s flexibility means that code might execute but not do what you want. Because the length of the subvector “results[1:4]” is a multiple of the length of the vector on the right, R cycles through the shorter vector to assign values to the longer one. If you made a mistake in constructing the vector on the right and left out two values, you might wish this command had generated an error. In fact, R will complain if we modify the example a little.

> results[1:5] = c(10, 20)
Warning message:
In results[1:5] = c(10, 20) :
number of items to replace is not a multiple of replacement length

Notice that this is a warning; a warning doesn’t stop the statement from executing, but it does send a message for your attention. Pay attention! Warnings can almost always be fixed and generally indicate real problems.

Another aspect of this flexibility is visible in how R treats the length of a vector. We’ve already seen that you can grow a vector by using c() to add elements to the end. You can also use square brackets to do this. For example, the code below would result in serious errors in a C-like programming environment, but in R it executes without error, but with some interesting side effects.

 x = 1:10
x
[1] 1 2 3 4 5 6 7 8 9 10
x[15]
[1] NA
x[15] = 12
x
[1] 1 2 3 4 6 7 8 9 10 NA NA NA NA NA 12

NA is a special type of object; we’ll look at its properties in the next lesson when we learn about comparing variables. Here it serves to indicate a value in a vector that you haven’t set, and it’s the only indication that you’ve extended this vector past the limits of its original contents.

One less intuitive ability of square brackets is in removing elements from a vector.

x = 1:10
x
[1] 1 2 3 4 5 6 7 8 9 10
x[-5]
[1] 1 2 3 4 6 7 8 9 10
x
[1] 1 2 3 4 5 6 7 8 9 10
x = x[-5]
x
[1] 1 2 3 4 6 7 8 9 10


Exercise

Modify the code ‘Founder effect with replication’ to start with a population of size 100, with fifty ‘1’s and fifty ‘0’s. Then modify the inner loop to remove a random individual after each offspring is added, so you simulate reproduction and death in equal quantities. Gather and analyze the results of doing 100 replication-and-death events for multiple replicate populations.


 

Evolving with R #1: Getting started & your first simulation

R is an application for doing statistics, graphing, and all kinds of computations. Because it is free, runs on Macs, PCs, and Unix, and is extremely versatile, it is used throughout the sciences and in medicine, business, government, and in non-profits–in short, everywhere people are asked to analyze data. Because proficiency with R is not trivially learned, it is a skill that demands respect and can open doors for you.

In this lab, we use R to simulate biology in order to learn about ecology and evolution. But we also use it to analyze data from simulations or experiments, to prepare statistical results and figures for publication, and to share our methods with colleagues after a paper is published. Learning R will therefore help you participate in every part of the scientific activities of this lab, and of most research environments you might find yourself in later on.

“R” is not very useful to Google on its own; you can look up the acronym “CRAN,” which stands for the Comprehensive R Archive Network. Or you can go to the URL https://cran.r-project.org/  Once there, you want to download the latest “base” version of R as a precompiled binary for your Mac or PC, then install it. Bookmark the CRAN site, as it has many resources that will prove helpful once you get started.

These notes are inspired by Michael Crawley’s R book and my own experience with R over the last ten years or so. While not very comprehensive, these notes should help you quickly learn to use R for doing simulations of population-level models in biology.

On a Mac, you should have an application called “R” which you can double-click to open. You’ll see a window open with some version information and a greater-than symbol with a blinking cursor next to it. This “>” symbol is called the command prompt; basically, you type next to it then hit “Enter” or “Return” to have R run whatever you typed.

I’ll use blue text to indicate things you should type into R, and orange text to indicate output that you get from R in response to commands. Following Crawley’s R book, I will bold text to indicate key terms that I want you to be sure to understand.

Let’s try typing the following at the command prompt and hitting return. You should see the output given below the blue command.

1:10

[1] 1 2 3 4 5 6 7 8 9 10

The command “1:10” made a set of integers from 1 to 10 and printed that to the console–that is, the window in which we can directly type commands and see text output. Instead of just printing the results, we can store the results of this command in a variable, which is just a container with a name1 that we use to store information. Try this out with the following use of the assignment operator2, “=”.

x = 1:10

x

[1] 1 2 3 4 5 6 7 8 9 10

Notice that assigning the results to a variable prevents them from being printed; we can then see what’s in the container “x” by just typing its name and hitting enter.

Let’s break down what we just did. An operator is a special symbol that tells R to perform some operation. There’s only a handful of operators to remember; mostly, you’ll use functions to tell R what to do. In this case, the single equals sign “=” assigns the results of whatever is to the right of it to the variable with the name on the left.

In fact, the “:” symbol is an operator as well; it handles the common task of making a sequences of integers. It has an equivalent function called seq(). A function is the basic unit of doing complex computations. Each function works like a little machine that takes input and does something to that input, producing output. A program is basically like an assembly line, where you might take the output from one machine and put it as input to another machine until you have a finished product.

Let’s try out seq() a bit. First, why am I writing it with a pair of empty parentheses after the name? This is to remind us that this is a function; functions are typically going to be used by putting inputs, or arguments, inside those parentheses. Even if you’re not giving the function any arguments, you still need the parentheses to help R realize that you’re giving it the name of a function. Arguments are either going to be pieces of data that we want the function to do something with, or instructions about what we want that function to do.

seq(from=1, to=10)
[1] 1 2 3 4 5 6 7 8 9 10

seq(1, 10)
[1] 1 2 3 4 5 6 7 8 9 10

In the first example, I gave the function two arguments by name: “from” tells it where to start, and “to” tells it where to stop. In the second example I left off the names, which works because I gave those arguments in the expected order. To see the difference, we can reverse the arguments in both examples and see what the results look like.

seq(to=10, from=1)
[1] 1 2 3 4 5 6 7 8 9 10

seq(10,1)
[1] 10 9 8 7 6 5 4 3 2 1

We can see that if we write “seq(x,y)”, R interprets this as from=x and to=y. How do we know what the arguments are called and that it’s going to read the arguments in that way? Well, we can look at the help file, which you can call up by typing:

?seq

The help files are useful only if you’ve put in some effort to think through what you want to do with the function. Never be afraid to just try out examples!


Exercise

Using the help file for seq(), find a method for producing the odd integers starting with 1 and ending with 99.


 

seq() is an example of a function with an output–it makes an object and hands it back to you when it is done, allowing us to put that object in a container with a name by assigning it to a variable. This output is sometimes called a return value; in the help files, this is described under the heading Value. Just like we need to know what arguments a function expects to see, we need to know what the return value will look like to use it properly. Again, the easiest way to get started is just to experiment. Let’s try to figure out what kind of thing seq() is giving back to us.

seq(1:100)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
[20] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
[39] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
[58] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
[77] 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
[96] 96 97 98 99 100

This is going to look a little different on your screen because the formatting depends on the set-up of your fonts and window, but you should see that there’s a number in a pair of square brackets at the beginning of each line. This is a hint that this object is a container that is subdivided; the numbers in square brackets are like addresses for houses on a street, or mailboxes in an apartment building. This kind of object is called a vector.

In the next lesson, we’ll see how to manipulate individual parts of a vector using square brackets. Right now we’ll look at how we can make a vector and what we can do with it. Vectors are basically the simplest way to store data, so we’ll be seeing them a lot in simulations.

x = 10:1
x
[1] 10 9 8 7 6 5 4 3 2 1
length(x)
[1] 10
x + 1
[1] 11 10 9 8 7 6 5 4 3 2

Try this yourself, and notice that typing and executing “x + 1” doesn’t change the values in the variable x–instead, it just prints the result of adding 1 to each value in x.

This is a good time to reinforce the most important reality about programming: the computer does not do what you want it to do, it does only what you tell it to do. It is easy to write “x + 1” when what you want is to change the values in x. The strength of R over other programming environments is that you can test and check the results of each step you perform in a program. This is only a strength if you actually do this!

x = x + 1
x
[1] 11 10 9 8 7 6 5 4 3 2
mean(x)
[1] 6.5

One of the most important functions is c(), where the “c” stands for “concatenate”–join end-to-end. c() is one tool for building a vector from scratch.

x = c(1, 3, 5)
x
[1] 1 3 5
x = c(x, 7, 9, 11)
x
[1] 1 3 5 7 9 11
length(x)
[1] 6
sum(x)
[1] 36

I recommend you keep your own file with examples and notes for each of the functions you come across. Complex programs are generally just combinations of very simple components–often the first step in understanding is going to be understanding what each component does in isolation.

We can start to get a sense of how simple pieces add up to a complex program by building a simulation of a scientific model. Before we do any more coding, we need to be clear on the distinctions between these terms. A model in science is a representation of nature: we choose how to abstract specific properties and interactions between certain things in nature by writing down rules, either verbally or in math. A simulation is then a way of exploring how that model works by using a computer. For example, we might say that the game of chess is a model of war: you have limited assets with different properties and must deploy them strategically to defeat an opponent. This model can be written down in words. If we wanted to simulate this model, we could program a computer to play chess, meaning that we would have to program the computer to store information about the model (which pieces are where, what the rules are, etc.) and how to choose between different moves. We would have to add a lot of other elements, like an interface so we can tell it what to do, and some way to store and output information about the games that it played.

Imagine we have a population of organisms that starts off very small, with only a handful of individuals. As that population grows, the frequencies of different alleles in that population might change because of randomness in reproduction. For example, if we have two bacteria in our population, then just by chance one of them will happen to divide before the other one does. Let’s try to model this with bacteria in mind–so, haploids that reproduce by asexual, binary fission. We start with two individuals in our population–one with genotype A and one with genotype B. We can model growth by doing the following algorithm:

  1. Pick an individual at random (we want to pick each individual with an equal chance because we are modeling how pure randomness affects allele frequencies).
  2. Make a new individual that is identical to this parent and add it to the population.
  3. Go back to step 1.

To execute this algorithm, we need a way to do these steps repeatedly and to store the results. We’ll use c() with a vector to store the information about the population, and a for loop to do the steps repeatedly.

So far, all of our “programs” have fit on a single line of text. Generally, though, we want to write programs that contain many steps, and some of those steps might be easier to read if we spread them across multiple lines. Command-N will open a new window in R which is really just a simple text editor. We can then write a script–a bunch of lines forming a program–in this window, and save that script to use later. Open a new window and type the following. Then, select all that text and press command-return.

for(i in 1:10)
{
   print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

A for loop runs through the code within the squiggly brackets {} for a set number of times. Each time through the loop, the variable “i” takes on the next value in the sequence 1:10. We’ll play with these elements more in later lessons, but right now the for loop lets us do something repeatedly.

We need one more element to simulate our model, which is a way to randomly pick who will reproduce in each step. We can do that with the (very powerful) function sample().

sample(1:10, 1)
[1] 6
sample(1:10, 1)
[1] 8
sample(1:10, 1)
[1] 9

The first argument to sample() gives it a vector to pick from, and the second argument tells it how many things you want it to pick. Each time you run it you’ll get a random result, so it won’t look the same for you as the examples above.

Let’s put these elements together to simulate our model of growth. We’ll use the number 0 to represent bacteria with genotype A and 1 to represent genotype B. Then we can set up our initial population and let it grow until there are a hundred individuals. We’ll then use mean() to see the frequency of the B allele.

pop = c(0, 1)
for(i in 1:98)
{
   pop = c(pop, sample(pop, 1))
}
pop
[1] 0 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 1 0 1 1 1 1 1
[40] 0 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0 0 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1
[79] 0 0 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0
mean(pop)
[1] 0.71

This is going to come out pretty differently each time you run it, so your results won’t look the same as mine. In the next lesson we’ll work on visualizing repeated trials of this whole experiment. In the meantime, try the exercise below to get an understanding of what each part of this simple script is doing.


Exercise

Change the script above to add a 0 or 1 with an equal chance at each time step, regardless of the contents of the population. Run repeated trials with both versions–how do the results differ?


 

Footnotes:

  1. Variables have additional properties besides a name, but these are often specified implicitly. For example, our variable storing the sequence from 1 to 10 has a length of 10 and stores integers specifically. These characteristics were set when we assigned a vector of ten integers to the variable. You can use the functions length() to see how big a vector is and class() to see what type of variables it can hold.
  2. Some style guides and other sources recommend the slightly different assignment operator “<-“. There are arguments in favor of both which I won’t repeat here; what I do recommend is that you strive for consistency in your own code.