Fishing for the right Wright-Fisher

Forward simulating evolutionary processes with lots of diversity and a dense fitness spectrum is something I do a lot in my research, so I've thought a fair amount about how to get more speed and avoid cluster jobs. This is a post about what I learned while trying to make forward simulations go really fast.

Ignoring population structure and crossing and stuff, there are many ways to forward simulate, all ending up with the same diffusion equation in the end (that's the important thing). But as far as I can tell, a few of those ways are so much better than the others in terms of CPU cycles that if you use one of the others, well, you might as well be writing for-loops in an uncompiled language to do matrix multiplication.

Buuuurn.

Maybe this is all obvious. But people just assume everyone knows how to forward simulate, and the mathematical details are sparse. "Wright-Fisher" and "Moran" processes have standard meanings that are so standard and obvious that it's possible no one really knows what they are in the context of time-varying selection or fluctuating population size, or what the detailed algorithmic implementation entails.

What I imagine are standard Wright-Fisher and Moran implementations

First, let me give a plug for using units of real time, and not generations. As a physicist I'm used to checking my equations by checking my units.

Let the generation rate be $λ$, which we can always set to one later. Wright-Fisher simulations with $N$ individuals require $O(N^2)$ calculations per generation. This is because a multinomial draw from the Boltzmann distribution, \begin{equation} p_i = \frac{e^{f_i/λ}}{\sum_j e^{f_j /λ}} \end{equation} requires $O(N)$ comparisons, and we must make $N$ such draws to fill up the next generation.

Sampling from a multinomial distribution involves inverting the CDF, which requires $O(N)$ comparisons. Shown are two random draws (red diamonds) from a six-outcome multinomial distribution, resulting in 2 and 5.
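To make that cost concrete, here is a minimal Julia sketch of CDF inversion (the function names are mine, not from any package):

```julia
# One multinomial draw by inverting the CDF: O(N) comparisons per draw.
# p is a vector of N probabilities that sums to one.
function draw_index(p::Vector{Float64})
    u = rand()
    c = 0.0
    for i in eachindex(p)
        c += p[i]
        u <= c && return i
    end
    return length(p)          # guard against floating-point round-off
end

# Filling a Wright-Fisher generation means N such draws: O(N^2) per generation.
next_generation(p) = [draw_index(p) for _ in 1:length(p)]
```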

The observable diversity parameter for this process is $θ = (4 N / λ) μ$.

The story is much the same for (what I think of as) the standard Moran process. I think the most common way of implementing a Moran process is first selecting an individual according to the Boltzmann distribution ($O(N)$) to be born, and then selecting a random individual to die. If we do this at a rate $Nλ$, then we cover a generation in $O(N^2)$ operations.

The diversity resulting from this process is $θ = (2 N / λ) μ$, so it is twice as drifty as the WF. We can understand this by thinking about how well each of these processes approximates the deterministic evolution. Consider the neutral case, where the frequencies don't change deterministically: the best estimate for the future frequency is the observed frequency now. That is exactly what WF uses to generate the next generation. In the Moran process, the errors in the frequency are continuously compounding on each other. The first individual gets its genotype at the "optimal" WF estimate, but the second individual in the generation is off a little bit because of the randomness in the first individual's genotype. These compounding errors increase the stochasticity, resulting in exactly double the sampling variance of WF, and exactly half the effective population size for the same number of individuals.
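To put numbers on that argument (my own bookkeeping, using the standard neutral single-locus results): in one WF generation the frequency of a neutral allele at frequency $x$ moves with variance $x(1-x)/N$, while each Moran birth-death event shifts it by $\pm 1/N$ with probability $x(1-x)$ each way, and a generation is $N$ such events, \begin{aligned} \mathrm{Var}_{\mathrm{WF}}(\Delta x) &= \frac{x(1-x)}{N} \quad \text{per generation}, \\ \mathrm{Var}_{\mathrm{Moran}}(\Delta x) &= N \cdot \frac{2\,x(1-x)}{N^2} = \frac{2\,x(1-x)}{N} \quad \text{per generation}, \end{aligned} which is the factor of two, and hence the halved effective population size.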

The Wright stuff.

Half as fast! Why do it then? Just because continuous generations are prettier? Well, the most common way of implementing a Moran process is, quite frankly, shit. In fact, we can avoid the $O(N)$ cost of multinomial sampling altogether. For a population size of $10^5$, a $10^5$-fold speedup is more than great, it's practically apotheosis!

To be fair, it's fairly easy to construct a faster Boltzmann sampler using some fiddly rejection sampling or quick-sorting of the random reals. But IMO, that's really not even close to the full glory of a well-implemented Moran-type process.

This is because we can store our population as a dense vector of reduced genotypes: think C++-style contiguously allocated memory, say one UInt64 per individual for a genotype of 32 selection-mediating nucleotides (2 bits each for the four nucleotide possibilities). One individual is born, one individual dies. That's just copying one set of bits onto another register, in place. This is what CPUs are born to do, so it's dang fast. If we are not sampling anything from the Boltzmann distribution, we can avoid almost all allocations, and each event becomes an $O(1)$ operation on the genes. A really CPU-friendly $O(1)$ operation.
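A minimal sketch of what I mean (the names and types here are illustrative, not from a library):

```julia
# Population as a dense vector of packed genotypes:
# 32 selection-mediating sites × 2 bits each = one UInt64 per individual.
pop = rand(UInt64, 100_000)

# A birth-death event without any Boltzmann sampling is just an in-place bit copy:
# individual i reproduces and its copy overwrites individual j.
@inline function birth_death!(pop::Vector{UInt64}, i::Int, j::Int)
    pop[j] = pop[i]           # O(1), allocation-free, cache-friendly
    return pop
end
```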

Computers love this algorithm! I can run trillions of birth-death events easily in a coffee break on a single core. The gentle hum of happy CPUs playing with dense vectors of UInts, never stopping for garbage collection, never defragmenting, for millions of generations, warms my cold nerd heart.

Chemical potential Moran

Stat-mech refresher: a canonical ensemble is defined by a collection of interacting particles coupled to a heat bath through a constant-volume, thermally conductive, impermeable wall. The particle number is a defining parameter. In the grand-canonical ensemble, the system is coupled to a bath that allows particles across the boundary. The particle number is not fixed; instead the Legendre dual of particle number (dual with respect to the free energy of the canonical ensemble), the chemical potential, is fixed, and the particle number is allowed to fluctuate. One of the coolest ideas ever.

Well, let's use the same idea here: the population size is allowed to fluctuate. We define a chemical-potential-like thing $φ$ that biases the birth-death balance to keep the particle number fluctuating around $N_k$. We impose the condition that the net growth rate obeys Lotka-Volterra dynamics, which leads to the correct replicator equation,

\begin{equation} β_i - δ_i = f_i - φ \end{equation}

where $φ = \sum_i n_i f_i / N_k$ is a global, linear parameter in the population that mediates the carrying capacity. Essentially, it plays the role of a chemical potential for our system, modulating the total particle number.

We also impose the condition of constant variance for all genotypes, \begin{equation} β_i + δ_i = λ. \end{equation} Together, these equations enforce \begin{equation} β_i = (λ + f_i - φ)/2 \qquad δ_i = (λ - f_i + φ)/2. \end{equation}

The magic is that the variance is constant. Because every particle has the same total birth-plus-death rate, every particle is equally likely to have something happen to it. Therefore, we can do the Gillespie algorithm in two parts:

  1. Draw UNIFORMLY from the population (i.e. from 1:N). This is an $O(1)$ operation, since it's essentially just ceiling(U*N) = i for a uniform random U.
  2. Then, decide whether the particle chose to divide or die:

\begin{aligned} P(\text{birth}) &= β_i/λ \\ P(\text{death}) &= δ_i/λ \end{aligned}

This is also an $O(1)$ operation, as long as we know $φ$. We got rid of the $O(N)$ operation. We can keep $φ$ updated on the fly by tracking how the total fitness changes at each event:

\begin{equation} φ \leftarrow \begin{cases} φ + f_i/N_k & \text{for birth} \\ φ - f_i/N_k & \text{for death} \\ φ + (f_i-f_j)/N_k & \text{for mutation} \end{cases} \end{equation} Now everything is $O(1)$!
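Put together, one event might look roughly like this in Julia (a sketch with my own names; fitness is stored alongside each genotype and $φ$ is carried as a running total):

```julia
# One event of the chemical-potential Moran. pop holds genotypes, fit their
# fitnesses, ϕ is the running Σᵢ fᵢ / N_k, λ the generation rate.
function moran_step!(pop, fit, ϕ, λ, N_k)
    i  = rand(1:length(pop))              # uniform draw: O(1)
    fᵢ = fit[i]
    βᵢ = (λ + fᵢ - ϕ) / 2                 # birth rate; the death rate is δᵢ = λ - βᵢ
    if rand() * λ < βᵢ                    # birth with probability βᵢ/λ
        push!(pop, pop[i]); push!(fit, fᵢ)
        ϕ += fᵢ / N_k
    else                                  # death with probability δᵢ/λ
        deleteat!(pop, i); deleteat!(fit, i)   # naive O(N) delete; see the memory tricks below
        ϕ -= fᵢ / N_k
    end
    return ϕ                              # hand the updated potential back to the caller
end
```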

Advanced moves: memory management

Birth is not a problem: new individuals can be appended to your population in constant (amortized) time. But if you draw a dude and he decides to die, deleting him from the middle of a vector leaves a hole in your memory, an empty register.

At some point, your vector registers will start looking like swiss cheese: more empty registers than living individuals. You will have to recopy your vector into a dense form once in a while if you don't want to be dealing with a population spread uniformly across your computer's RAM. Your computer will take care of this sooner or later, but at some point you will pay (an $O(N)$ cost)!

If you just delete from inside your vector of individuals, your memory starts to look like swiss cheese from the death events.

Instead, we can do our memory management manually. First, swap your randomly chosen dude with the dude at the end of the queue. Then, if your dude dies, it's a pop! operation, so your memory stays dense every time! Good memory management makes for happy CPUs.
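In Julia, the death branch then looks something like this (again a sketch, with made-up names):

```julia
# Remove individual i in O(1): overwrite it with the last individual, then pop!
# the tail. The vector stays dense, so there is no swiss cheese to clean up.
function swap_remove!(pop::Vector{UInt64}, fit::Vector{Float64}, i::Int)
    pop[i] = pop[end];  fit[i] = fit[end]
    pop!(pop);          pop!(fit)
    return nothing
end
```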

Anyway, that's what I used to write my last paper; the algorithm probably fits easily in fifty lines of Julia.

The ultimate Game of Life: The balanced Moran

For a while I thought the chemical potential Moran was just about the best way to do forward simulation.

Yes, there was quite a bit of push!ing and pop!ing, which can give the garbage collector in Julia whiplash, but overall, compared to WF it's a blaze of glory where the steady stream of unavoidable twisted pseudo-rands is the dominant hit in profiling.

However, last week I was up thinking in bed about a process so fast, so tight, that I was immediately ashamed I hadn't thought of it before. Surely it couldn't give the right equations? Actually, it has a factor-of-two speedup, requires no memory management, and is especially mathematically pretty. Here's what you do:

Instead of drawing one particle every $\frac{1}{N λ}$ time units, you draw two, $i$ and $j$, uniformly from the population every $\frac{2}{N λ}$. So we are waiting twice as long between events, but you can check that the noise and the fitness effects per event are twice as big, so we need half as many events for the same diffusion process. One particle will get copied and will replace the other one. We define the two rates \begin{aligned} β_{ij} &= (λ + f_i - f_j )/ 2 \\ δ_{ij} &= (λ - f_i + f_j) / 2. \end{aligned}

Now we choose between the two processes with \begin{aligned} P(i \rightarrow j) &= \beta_{ij} / λ \\ P(j \rightarrow i ) &= \delta_{ij} / λ \end{aligned}
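One event of the balanced Moran then looks roughly like this in Julia (my sketch; fitness is an assumed function mapping a packed genotype to its fitness):

```julia
# One event of the balanced Moran: draw two individuals uniformly and let one
# overwrite the other. Population size is fixed and the vector stays dense.
function balanced_step!(pop::Vector{UInt64}, fitness, λ)
    N = length(pop)
    i, j = rand(1:N), rand(1:N)
    βᵢⱼ = (λ + fitness(pop[i]) - fitness(pop[j])) / 2
    if rand() * λ < βᵢⱼ        # i replaces j with probability βᵢⱼ/λ
        pop[j] = pop[i]
    else                       # otherwise j replaces i
        pop[i] = pop[j]
    end
    return pop
end
```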

Note that if you only look at what the $i$'th particle sees, in terms of the average effect, \begin{aligned} β_i &= \sum_j x_j β_{ij} \\ &= (λ + f_i - \sum_j x_j f_j )/ 2 = (λ + f_i - φ )/ 2. \end{aligned} So the fitness potential is there in a stochastic form, but since the Bernoulli trial already has so much entropy, there's no reason to calculate it exactly.

This principle of calculating Markov transition probabilities approximately, while still sampling from the exact distribution, underlies the magic of so-called exact-approximate samplers. A very cool branch of statistics indeed. The only downside of this algorithm is that the population size is fixed, but on the plus side, the in-place overwriting of one of your vector registers means your vector will stay dense without the end-swapping described above.

This algorithm works especially well with time-varying fitness. Normally, time-varying fitness means you have to recompute $φ$ from scratch. But with this version, there is no need to keep track of $φ$ at all. You can calculate the fitness every time a particle gets hit by an event, and so you can have time-varying fitness at no extra cost.
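Concretely, the only change to the sketch above is evaluating an assumed time-dependent fitness function f(genotype, t) fresh at each event:

```julia
# Balanced Moran event with time-varying fitness f(genotype, t): nothing global
# to track, just evaluate f fresh at the two individuals you drew.
function balanced_step!(pop::Vector{UInt64}, f, t, λ)
    N = length(pop)
    i, j = rand(1:N), rand(1:N)
    βᵢⱼ = (λ + f(pop[i], t) - f(pop[j], t)) / 2
    rand() * λ < βᵢⱼ ? (pop[j] = pop[i]) : (pop[i] = pop[j])
    return pop
end
```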

But the truly remarkable thing about this process is that the math is beautiful. You can just see the Kimura correlation matrix, with $λ x_i x_j$ on the off-diagonal and $(1-x_i)x_i$ on the diagonal, in the activity of these processes (where you hit particles of types $i$ and $j$, or particles of type $i$ twice, respectively).

In fishing for the right Wright-Fisher, we have surely fished our wish.
