LabKitty: A Primer on Mathematical Epidemiology

A LabKitty Primer is an introduction to some technical topic I am smitten with. They differ from most primers as to the degree of mathematical blasphemy on display. I gloss over details aplenty and mostly work only the simplest cases. I show mistakes and dead ends and things any normal person would try that don't work. Mostly I want to get you, too, excited about the topic, perhaps enough to get you motivated to learn it properly. And, just maybe, my little irreverent introduction might help you understand a more sober treatment when the time comes.

Note for the confused: This Primer makes reference to events from the recent (2015) Ebola outbreak, which featured some arguably questionable decisions on the part of various healthcare professionals. FYI.

Now that we're all going to die in a global apocalyptic Ebola apocalypse, it's a good time to learn some epidemiology. This will soon be all over the "for your health" segment of the news, so you best get yourself prepared. When survivors have devolved into samizdat armed camps, you'll need something to make yourself useful to the local pod of crazies. Demonstrating facility with this material could be your golden ticket to an MRE and a place in the repopulation efforts.

The bad news is it's differential equations. The good news is it's epidemiology differential equations, and if recent actions of the CDC are any indication, plenty of knuckleheads understand epidemiology. Or don't understand epidemiology, at least the "keep the infected from flying coach" portion of the program. Although, to be fair to the CDC, perhaps the larger problem here is common sense. Apparently we're no longer covering the germ theory of disease in our nursing schools.

As always, I blame the hip hop.

The Basic Epidemiology Model

We begin simply. Consider a population of N individuals. This could be, say, the population of Earth (N = 7x10^9), the population of Circle Pines (N = 4918), or the population of passengers on a 737 (N = 100).

We categorize each individual in the population as either susceptible (S), infected (I), or recovered (R). You are susceptible if you can get the disease we are modeling, you are infected (and can infect others) if you have the disease, and you are recovered if you no longer have the disease and cannot get it again. We can summarize this state of affairs graphically using provocative arrows that demonstrate an individual moving through the three possible stages of existence:

    S → I → R

Hence, the basic epidemiology model is aka the SIR model.

During an epidemic, the number of susceptible, infected, and recovered changes over time. So we write S(t), I(t), and R(t), to indicate these numbers are functions of time (t). Since the three categories are exhaustive and exclusive, we have S(t) + I(t) + R(t) = N, at all time t. In essence, we are simply counting people. We get up every morning, and count how many villagers do not yet have the plague, how many currently have the plague, and how many have recovered from the plague, and enter these numbers into our ledger. Repeat daily until everyone is either healthy (or dead) and the crisis has passed. Day one of the infection is usually denoted time zero, and we need to pick S(0), I(0), and R(0) to get the party started. A convenient modeling choice is I(0) = 1, S(0) = N − 1, and R(0) = 0.

We desire equations describing the three subpopulations S(t), I(t), and R(t) at each time t. This would give us a tool to predict the behavior of the current outbreak, plan management strategies, and so on. A simplification that makes the math easier is to assume the counts are real numbers rather than integers. If half of Congress is barfing up Bloody Marys this morning, we write I(t) = 535/2 = 267.5. Clearly, you can't have half a person, but the error introduced is small because we typically work with large populations.

Footnote: Some epidemiologists prefer to divide S, I, and R by N and work in densities rather than counts. The two approaches are equivalent, but you may have to convert parameters when comparing models.

We seek equations for S(t), I(t), and R(t). As is often the case in modeling, we don't know what these quantities are, but we do have a good idea how these quantities are changing. Consider how you get infected. You have to be susceptible and you have to encounter an infected. Logic dictates that the more susceptible there are today (for a given number of infecteds) the more new infections there will probably be tomorrow. The more infecteds there are today (for a given number of susceptibles) the more new infections there will probably be tomorrow. This suggests that the rate of new infections will be proportional to the product S times I.

Yet, not all diseases are equally transmittable. If I can infect you with just a come-hither glance, the rate of new infections will be higher than if I have to literally poop on your head to transmit the disease, all else being equal. Epidemiologists quantify this state of affairs using a parameter called the (per capita) contact rate, usually denoted using the Greek letter β.

Putting all this together, we have the rate of infection is equal to βSI. The "rate of infection" is just the time derivative of I, ergo we can write a tentative equation:

    dI/dt = βSI [1a]

Each gain of one infected means the loss of one susceptible (the susceptible moves from the S category into the I category), so simple logic tells us dS/dt = −dI/dt. Convince yourself the minus sign makes sense -- if the number of infected is going up, the number of susceptibles must be going down, and vice-versa.

Combining this knowledge with Equation 1a, we can write a differential equation for S(t):

    dS/dt = −βSI [2]

Finally, consider recovery. In the basic SIR model, it is assumed you stay sick for a certain amount of time, then move to category R. The rate of new recoveries is the reciprocal of the infectious period. Example: if, on average, people are sick for two days, then the rate of recovery is 0.5 person per day. The recovery rate is usually denoted using the Greek letter γ. Since each infected contributes to this process, multiplying the recovery rate times the number of infected gives us the overall rate of change of R:

    dR/dt = γI [3]

However, each new recovery means one less infected, so we need to tweak Equation [1a]:

    dI/dt = βSI − γI [1]

Equations [1], [2], and [3] are collectively called the SIR epidemiology model, also known as the Kermack McKendrick model.

Footnote: I was a little sloppy in defining contact rate. Epidemiology wonks will tell you it's (usually) the density of infecteds not their raw numbers that drive new infections. Ergo, the infection rate should be written βSI/N (i.e., βS ⋅ I/N). To keep the equations (and, later, code) neat, we divide the per capita contact rate by N at the get-go, and that is the "β" that appears in all the equations (and, later, code). On the third hand, sometimes it really is the raw numbers that drive new infections, and when modeling those diseases we don't divide by N. It's all very confusing. Long story short: if you can't reproduce some published result, make sure you are using the correct contact rate.

Footnote: Just to mention some terminology you may encounter, the infection rate generalizes as f(I)⋅S, where f(I) is the force of infection. In the Kermack McKendrick model, f(I) = βI/N. More sophisticated models generate more realistic results by using more complicated f(I).

Footnote: Here are two illustrative values of β -- influenza: 0.01; leprosy: 0.001. It is much harder to transmit leprosy than the flu, hence its β value is much smaller. Here are two illustrative values of γ -- influenza: 0.1; leprosy: 0. You have the flu for about 10 days. You have leprosy forever.

Now that we have a model, we can use it to generate some epidemics. For this, we turn to the computer.

Modeling an Epidemic with a Computer

The SIR model is nonlinear and hence it does not have an analytic solution (the βSI term screws the pooch). However, it's rather straightforward to slap it into MATLAB (or Octave, if you're a freeware hippy) and solve it numerically.

The lion's share of the work is writing an ode function that defines the system to be solved. This is an ordinary MATLAB function with an argument list I will describe in a moment. The function is stored in an m-file having the same name as the function (usually MATLAB does this for you, filling in the name field when you Save). The ode function file must live somewhere in your MATLAB path. The current working directory, for example.

Here is an ode function I wrote that implements the SIR model:

function xdot = SIR(t, x, flag, beta, gamma)

    xdot = zeros(3,1);
    S = x(1); I = x(2); R = x(3);
    xdot(1) = -beta*S*I;
    xdot(2) = beta*S*I - gamma*I;
    xdot(3) = gamma*I;

The MATLAB ode solver calls the ode function repeatedly, passing in time t, a vector x containing the current values of the RHS of the system, a collection of optional flags (see "help odefile" for details), followed by any user-defined parameters (here, beta and gamma). The function must return a column vector of derivatives -- a 3x1 vector in our case -- which contains the LHS of the system. These need to go into the variable named in the function declaration (I have used the name xdot). Hopefully, the rest of the code is self-explanatory. The xdot = zeros(3,1) initialization is necessary to force xdot to be a column vector. Octave doesn't require this, but most (all?) versions of MATLAB do. There are, of course, many alternate approaches -- I was shooting for maximum code readability.

To run a simulation, pick a solver (I'll use ode45) and call it like so:

> [t,x]=ode45('SIR',[0 20],[762 1 0],[],0.0022,0.44);
> plot(t,x)

The six arguments here are: (1) a string specifying the ode function name, (2) the start and end times (here I ran the sim from t = 0 to t = 20), (3) a vector of initial values [S(0) I(0) R(0)], (4) a vector of optional flag settings (if any), (5) the contact rate (ß), and (6) the recovery rate (γ). I didn't specify any flags, but an empty vector needs to be passed as a placeholder. The ">" isn't part of the command -- its the default command line prompt. Of course, you will probably put your code in an m-file instead of typing it at the command line.

Footnote: the t values returned are not necessarily evenly spaced -- they are chosen by the solver according to some internal convergence criteria (i.e., you can't just do plot(x)). If you require the solution value at one or more specific times of interest, include those values explicitly in the time vector (e.g., [0 2 7.5 20]).

Footnote: You might come across the alternate syntax ode45(@SIR, ...), where "@" defines a function handle. These offer more flexibility compared to string passing. See the on-line help for details. Note function handles aren't implemented in some (read: older) versions of MATLAB.

Footnote: If you call the solver sans an output vector, it will use a built-in plotter and plot the results. This saves you a few lines of code, but it's generally slower and you won't have the data for post-processing.

Footnote: The syntax for the (default) ode solver in Octave is a little different. You need to pass in a complete time vector and it returns the solution at those values:

> t = linspace(0, 20);
> x = lsode('SIR', [762 1 0], t);
> plot(t,x)

Also, IIRC, you can't pass parameters to lsode, so you'll need to define β and γ in the ode file.

Footnote: The default plotter in Octave is pretty sucky, at least on a Mac. Upgrading to the Aqua version is worth the effort (although I seem to recall a nasty X11 bug in my release). YMMV.

Footnote: The hardest part of solving differential equations may well be deciding which solver to use. In MATLAB, there's ode45, and ode23, and ode23s, and a bunch of others, each with its own quirks. I suspect the names are a holdover from whatever pub domain FORTRAN package Mathworks cribbed their solvers from back in the day when function names had to be six characters or less. I'm told these functions are available in Octave, but I've only ever used the default solver lsode. The SIR equations are well-behaved for reasonable parameters, numerically speaking, so it really doesn't matter which solver you choose.

If you type in and run the code, you should get a plot like this:

The susceptible population is plotted in blue, the infected in green, and the recovered in red. There is an initial rise in the number of infecteds that reaches a peak value then falls off as the pool of susceptibles is exhausted. Eventually, all members of the population enter the recovered class.

I took this example from a famous (in epidemiology circles) 1978 study published in BMJ that applied the SIR model to an outbreak of influenza at an English boarding school. The data points are the number of sick children recorded by school authorities over the 14-day epidemic (3 6 15 73 222 294 258 237 191 125 69 27 11 4) which I've superimposed on the plot for comparison. The SIR rather nicely reproduces the observed disease progression, in no small part because most of the model assumptions (homogeneous population, complete mixing of susceptibles and infected, direct transmission of the pathogen, no demographic or spatial variation) are satisfied.

The boarding school results are typical of the SIR model. However, disease dynamics generated by the model can vary substantially depending on parameter values. Consider the following:

SIR influenca model - parameter exploration

Here, results from five additional simulations of the boarding school epidemic are superimposed. In the left panel, the recovery rate was fixed at 0.44 and the contact rate was repeatedly increased by 50%. At right, the contact rate was fixed at 0.0022 and the recovery rate was repeatedly decreased by 50%. Note the range of behaviors. Importantly, for some combinations of parameters, the disease does not spread. This is easier to demonstrate if we change I(0) to 100:

SIR influenca model - parameter exploration pt II

Using the original contact and recovery rates results in an epidemic (left panel). When these parameters are changed to 0.0011 and 0.88, respectively, the model predicts the number of infecteds falls monotonically from its initial value (right panel). That is, no epidemic. Additionally, a number of susceptibles escape the disease -- a demonstration of the (counterintuitive) epidemiology principle that epidemics fail (or terminate) not when the susceptible pool is exhausted, but rather when the chain of infection is broken.

Not surprisingly, identifying combinations of β and γ that result in an epidemic is of primary interest in epidemiology. We now investigate how these values can be determined analytically.

Modeling an Epidemic without a Computer

Kermack and McKendrick published their model in 1927, long before there were computers that could solve systems of nonlinear differential equations. You couldn't solve systems of nonlinear differential equations analytically back then either, so a fair response might have been: what's the point? This is where things would have likely remained had Kermack and McKendrick not applied a bit of cleverness and obtained a useful and important result.

The fundamental question epidemiologists ask is: given a model (and assuming the parameters are correct), does an epidemic occur? By "epidemic" we mean the number of infecteds increases after time zero. Clearly, if I(0) is some positive number there's going to be sick people in the population for a while (until they all move into the recovered category). That's not an epidemic. In an epidemic, the number of infecteds increases before tapering off. It turns out, this is determined by a simple expression involving N, β, and γ.

Recall, we have for the infected class:

    dI/dt = βSI − γI

Rewrite this as:

    dI/dt = (βS − γ)I

At the beginning of the epidemic, S is approximately equal to N. Thus:

    dI/dt = (βN − γ)I

We seek conditions under which dI/dt > 0 (i.e., the math equivalent of "the number of infecteds is increasing"). Since I is a non-negative number, dI/dt is positive only if (βN − γ) > 0. In other words, βN/ γ > 1. Here, then, is a simple indicator whether an epidemic occurs.

The quantity βN/ γ is usually denoted R0 and is called the basic reproduction ratio (aka the basic reproductive rate). It is more properly called the basic reproduction ratio of the disease, because each disease -- each epidemic, even -- has a characteristic R0. It is the single most important parameter that summarizes disease dynamics, used by epidemiologists to this very day.

Footnote: Here's how to think about R0 in words. The length of time you are infected is 1/γ. The rate at which you infect new people is βS, or, at the start of an epidemic, βN. Hence, the number of people each infected infects at the beginning of an epidemic is βN ⋅ 1/ γ aka βN/ γ. If, on average, each infected person infects more than one other person (R0 > 1), the disease will spread (epidemic). If, on average, each infected person infects less than one other person (R0 < 1) the disease will simply burn out (no epidemic).

For illustrative purposes, here are some R0 I found in various sources around the web:

Bubonic plague (England, 1667): 1.6
Ebola (Zaire, 1976): 2.6-8.6
Ebola (Zaire, 1995): 1.5- 5.0
Measles (England, 1950): 16-18
HIV (Uganda, 1985): 10- 11

Such numbers must be interpreted with care, subject as they often are to questionable modeling assumptions and/or dubious record keeping. Yet, they are instructive. A large reproductive ratio for measles is consistent with the high contact rate of the school-age children it affects (at least until a school holiday is called). Similarly, the prolonged symptomless infectious period of HIV leads to its high R0. The second outbreak of Ebola in Zaire exhibited a lower R0 than the first, suggesting lessons learned in 1976 were applied to contain the outbreak in 1995. We may yet have need for these lessons in days to come.

There is additional pencil-and-paper cleverness you can apply to extract additional useful tidbits from the SIR model, such as the maximum size of the epidemic, the initial per capita growth rate, and more. There's also much to be gleaned by working in the phase plane and examining qualitative solutions. Consult any epidemiology textbook for details.

A Menagerie of Epidemiology Models

As you might imagine, mathematical epidemiology has come a long way in the four score interim that has passed since Kermack and McKendrick published their model. Here is a brief tour of the varieties of epidemiology experience currently serving scientists manning the front lines of disease prevention and control.

A simple improvement of the SIR is to include more classification categories. This can be done to tailor the model to known features of a specific disease of interest. For example, if recovery does not confer immunity (think: the common cold or gonorrhea) we include a transition from R back to S and obtain the SIRS model. This can be simplified by omitting the recovered class, in which case we obtain the SIS model. If there is a latent period following infection during which an individual carries the disease but cannot yet infect others, we can model this by inserting an "exposed" class to obtain the SEIR model. (If nurse whats-her-name was E while jetting around the country, we dodged a bullet. If she had already moved into I, you are probably reading this from a well-fortified bunker, if at all.) In fact, you can pretty much add whatever category(s) you like to the model. The modified differential equations are not difficult to write down and code; the difficulty is estimating values for the additional parameters.

Time is treated as a continuous variable in all of the models considered so far. This leads to a system of differential equations. It is possible to treat time as discrete, in which case we obtain a discrete epidemiology model, leading to a system of difference equations. Discrete models can exhibit interesting chaotic dynamics not observed in continuous models. Additionally, our models are deterministic, in that no matter how many times you re-solve the system, you always get the same results. Clearly, chance can play a deciding role in disease spread, which leads us to stochastic epidemiology models. The core idea is a random walk, with the number of infecteds treated as a random variable. Flip a coin: if it comes up heads add one infected; if tails subtract one. Repeat until the epidemic is over. This is, of course, a gross oversimplification, and stochastic epidemiology models are more properly described as a discrete or continuous time Markov chain.

The SIR model and its cousins treat the members of the population as a homogeneous pool. However, real-life populations exhibit natural class structure that can dramatically alter disease dynamics. This can include age structure, with the young and the old generally less resilient to disease than adults, and therefore requiring separate parameters describing their infection and recovery. Dividing the population into male and female may be important to capture salient features of a disease, especially those which are sexually transmitted. Additionally, the SIR model and its cousins assume disease spreads directly from host-to-host (reflected in the βSI term). These are called microparasitic diseases. However, many pathogens pass through (literally) the host into an external vector such as a worm or tick which then infects another host at a later time. A proper model of such macroparasitic diseases must include categories for the vector. Some of the most devastating diseases known are macroparasitic. For example, malaria (and its protozoan vector) is, by some estimates, responsible for half of all human deaths, ever.

Finally, diseases do no just evolve in time, they also spread in space. Here, the math gets unpleasant right quick. As soon as we add a spatial dimension we enter into the realm of partial differential equations. At minimum we have an infected population moving in a 2D landscape (think: rabid foxes roaming the countryside), which is modeled using a diffusion-reaction equation borrowed from chemistry. However, the spatial spread in many diseases -- particularly human diseases -- is not well-modeled by diffusion. Rabid foxes don't usually fly coach (cf. nurse). For complex spatial modeling, the catchphrase is epidemiology on networks where a "network" can be a literal network of locations connected by, say, air routes, or a network of social connections, such as might occur in a high school or brothel. Standard mathematical approaches like differential equations begin to break down here. Instead, state-of-the-art approaches include individual-based models, which are exactly what they sound like: epidemiology models that implement hundreds (or thousands or millions) of simulated actors and follow them individually to assess disease spread. You may choose to find the thought that the CDC is using what is essentially glorified SimCity to decide public policy to be either amazing or terrifying.

Postscript: The End of the World

Of course, you mischievous little monkeys came here expecting to see the end of the world, current events being what they are. Fair dinkum. Let's see what Kermack and McKendrick have to say about the end times.

Professional epidemiologists will take me to task for using SIR to model a global Ebola epidemic, as it does not capture several important features of the disease such as its pre-infectious exposure period as well as any critical spatial dynamics. However, you don't see me clearing nurses who treated Ebola infecteds to get on an airplane, so maybe the professional epidemiologists can just have a seat for now.

The only model modification we require is a reclassification of the recovered class. Ebola has about a 70% mortality rate (the rate is lower with better treatment; higher in novels, cable news, and other works of fiction). We shall simplify things and assume the mortality rate is 100%. This allows us to use the SIR model as-is by reclassifying R as "dead" rather than "recovered." If you are R, you are neither infected nor susceptible, and that's all that matters as far as the dynamics are concerned. Epidemiologists use the term "removed" (it's more PC than "corpse mountain bonfire") so we don't even have to change any letters in the equations.

Next, we need β and γ for our global pandemic. Epidemiologists who studied Ebola outbreaks in Zaire estimated β to be about 0.5. The value will be smaller if people keep their wits about them, so I doubled it to 1.0. Data suggest a value of 15 days for the infectious period. However, people will die quicker once the health care infrastructure collapses along with the rest of society, so let's take γ = 1/10.

Lastly, the world population is approximately 7 billion, which we will assume is one big homogeneous bucket ripe for the plucking. Ergo, S(0) = 7x10^9. Into this we release a single infected: I(0) = 1. Press play...

Well, there it is. The epidemic burns through the entire population in about 90 days. At this plot scale, the early growth phase is eerily invisible. A flash point is reached around day 20, and 10 days later the epidemic peaks. After that, infecteds bleed out in overwhelmed health care facilities and back alleys at their petty pace from day to day to the last syllable of recorded time. See y'all in Valhalla.

The take home is if you wake up tomorrow with a 105° Ebola fever and go into work and cough on your boss, humanity would vanish in a couple of months. Except the three guys on the ISS, I guess.

If that's not depressing enough for you, if you need one more dark cloud darkening your door, one more epidemiology football yoinked out from under your little Charlie Brown of a life, consider this: the World Health Organization felt the need to warn everyone that the Ebola virus remains viable for several weeks in semen.

Yes, there are people with Ebola who are having more sex than you.

Wednesday, October 22, 2014

A Primer on Mathematical Epidemiology

No comments:

Post a Comment