Evolving with R #3: Conditionals & replicates

In the last two lessons you’ve looked at evolutionary change in a growing population by simulating reproduction of two types of alleles. This is a model of how genetic drift–the name for random factors that change alleles frequencies from one generation to the next–works in populations that start very small. In this lesson we’ll make a more standard model of genetic drift in a population of a fixed size. This has the advantage that we can more easily leverage the mathematical theory of population genetics to predict the average outcome of our simulations, giving us a basis for comparison to see if we’ve coded things correctly.

Let’s start with the following code, which simulates genetic drift in a population with size N with allele B starting at a frequency p.

Fixed-N drift simulator

N = 100
p = 0.1
maxT = 100
alleleFrequencies = rep(0, maxT)
pop = rep(0, N)
pop[1:round(p*N)] = 1

for(t in 1:maxT)
{
   alleleFrequencies[t] = mean(pop)
   pop = sample(pop, N, replace=T)
}

plot(1:maxT, alleleFrequencies, type=”b”)

This code starts very simply by defining important parameters as variables; our intent isn’t to modify these parameters during the execution of the code, but merely to set them once and then use them throughout. As our code examples get more complicated you’ll notice more decisions that are intended to keep things organized and help readers understand the purpose of each piece of the whole. At a minimum, write your code so that you could come back to it six months later and quickly grasp it again.

In addition to defining what are essentially constants, this code also sets up several vectors to store the results of calculations. Because we know how big the population is we can make these once and reuse them.

In the for loop, we use a few tricks to make the evolutionary simulation simple. By encoding the alleles as ‘0’ and ‘1,’ we can use the mean() function to quickly get the frequency of the ‘1’ allele, which we can call B. Here, sample() acts on the whole population at once, giving us a new generation descended from the old alleles.

While this code is a little more complicated than previous examples, you can apply the same principles of testing individual pieces to understand the whole. Use that approach in the next exercise.


Exercise

The code below contains several bugs. Find each one and describe the consequences of that bug and how you found it.

N = 100
p = 0.1
maxT == 100
alleleFrequencies = rep(0, maxT)
pop = rep(0, N)
pop[round(p*N)] = 1

for(t in 1:maxT)
{
  alleleFrequencies[maxT] = mean(pop)
  pop = sample(pop, N, replace=F)
}

plot(1:maxT, alleleFrequencies, type=”b”)


The behavior of the Fixed-N drift simulator is quite variable, which makes it challenging to estimate its average behavior by eye. It would be better to summarize the results of many different simulations, and for that we need two innovations. The first is that we need code to efficiently run our simulations repeatedly and store the results. Of equal importance, we also need to make a good choice of what data we should collect, which requires focusing on a specific question. Guided by population genetics, I’ll pick this one to start: how often does the B allele fix (go to a frequency of 1)?

This choice of question creates its own problem; experimentation will show you that the B allele can take a very long time to fix, or might go extinct very quickly. If we make maxT too small we will have simulations that end without a definitive answer; if we make it sufficiently large we’ll waste a lot of time simulating populations in which we already have an answer. What we need is to continue a given simulation only if neither allele has fixed, and we can do this with a while loop combined with a conditional expression.

A conditional expression asks R to compare two variables and return either TRUE if they are equal or FALSE if they are not. This is done to control the flow of the program, meaning the order and number of times each part gets executed. We’ll look more at program flow in future lessons; for now we want to start practicing with the ‘equal to’ operator ‘==’. One concern is training ourselves to see the difference between the assignment operator ‘=’ and ‘==’ in a complicated piece of code.

A while loop behaves like a for loop in that a block of code is repeated until a condition is met. In fact, a for loop can be written with while with just a few extra lines.

x = 1:3

for(i in x)
{
  print(i^2)
}
[1] 1
[1] 4
[1] 9
 i = 1
while(i <= length(x))
{
  print(x[i]^2)
  i = i + 1
}
[1] 1
[1] 4
[1] 9

Here we used a different conditional operator, ‘<=’. In this case a for loop was much more efficiently, as it typically is when we know in advance how many times we want to go through the loop. The while loop shines when we can’t possibly know how many times to do the code in the loop without actually doing it, as is the case here.

Fixed-N drift simulator v.2

N = 100
p = 0.1
replicates = 100
results = rep(0, replicates)

for(i in 1:replicates)
{
  pop = rep(0, N)
  pop[1:round(p*N)] = 1
  while(mean(pop) > 0 & mean(pop) < 1)
  {
    pop = sample(pop, N, replace=T)
  }
  results[i] = mean(pop)
}

The revised code above uses a while loop to keep simulating evolution until one of the two alleles fixes. This is simple enough conceptually but uses three more conditional operators to code: ‘<‘ and ‘>’ for strictly less than and greater than, and the ‘&’ operator to specify that we want the mean(pop) to be BOTH above zero AND below one. The other trick here is nesting our while loop inside a for loop, so that our code can run many different, entirely independent simulations with the same starting conditions.


Exercise

Use the code above to measure and plot the relationship between the starting frequency of an allele and its chance of fixing. What did you discover?


Once we start using while loops it’s inevitable that you’ll eventually code an infinite loop– a loop with no attainable exit condition. You can hit the ‘Esc’ key or hit the stop sign symbol in your R console to halt the execution of an R command, which is the only way short of quitting the program to get out of an infinite loop.


Exercise

Our code is inefficient for large populations because it stores the population as a vector of alleles. We could instead store the same information in a single variable corresponding to the number of B alleles. Write a version of simulator v.2 that produces the same expected results but stores only one number to represent the population. Test it out and confirm that its behavior is indistinguishable to the other version.


 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s