By Eric Winsberg (external contributor)
It is a home truth in philosophy of science that models are not, in the first instance, evaluated for their truth or accuracy, but for their adequacy for purpose. A model of the same system or phenomenon that is adequate for one purpose might be inadequate for another. Consequently, if we want to evaluate how well models performed during the Covid-19 pandemic, we have to keep a keen eye on their intended purposes.
So what purposes were these models intended to serve? Two obvious candidates are prediction and projection. When a model is used for prediction, we expect it to tell us what state the system will, in fact, be in in the future (and perhaps what path it will follow to get there). When a model is used for projection, we expect it to tell us what state the system would evolve into, conditional on various choices that we make as agents who act on the system.
If I am using a model, for example, to predict how much demand there will be for intensive care unit beds, ventilators, or support staff in my hospital system during a pandemic, then I am using it for prediction. If, on the other hand, I am using a model to evaluate the likely effects of various possible public health interventions during a pandemic, then I am using the model for projection. How I evaluate a model will depend to a great extent on whether I intend to use it for prediction or projection.
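To make the distinction concrete, here is a minimal sketch in Python using a toy SIR model. This is emphatically not the ICL microsimulation, and every quantity in it (the transmission rate, the recovery rate, the assumed effect of each intervention) is an illustrative assumption. The point is only that the very same model code is used for prediction when it is run once with best-estimate inputs and read as a forecast, and for projection when it is run across scenarios conditional on choices we might make.

```python
# Toy SIR sketch (not the ICL model): illustrates prediction vs. projection.
# All parameters below are invented for illustration, not estimates.

def sir_run(beta, gamma=0.1, days=180, n=1_000_000, i0=100):
    """Integrate a discrete-time SIR model; return daily infection counts."""
    s, i, r = n - i0, i0, 0
    infected = []
    for _ in range(days):
        new_inf = beta * s * i / n   # new infections this day
        new_rec = gamma * i          # new recoveries this day
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        infected.append(i)
    return infected

# Prediction: a single run with our best-estimate transmission rate, read as a
# forecast of what will in fact happen (e.g., to size ICU capacity).
forecast = sir_run(beta=0.25)
print(f"Predicted peak infections: {max(forecast):,.0f}")

# Projection: several runs conditional on choices we might make, where each
# scenario's beta encodes an assumed effect of an intervention.
scenarios = {"business as usual": 0.25, "moderate distancing": 0.18, "strict lockdown": 0.12}
for label, beta in scenarios.items():
    peak = max(sir_run(beta=beta))
    print(f"Projected peak under '{label}': {peak:,.0f}")
```

The evaluation question differs accordingly: a prediction is checked against what actually happened, while a projection has to be checked against what would have happened under the stated scenario.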
When a model is intended for projection, it is likely that it will steer policymakers away from certain courses of action. For example, a microsimulation model developed by a team at Imperial College London (ICL), the model used to generate the famous 'Report 9', is widely acknowledged to have steered UK policymakers away from 'business as usual' during the Covid-19 pandemic.
In general, if a model says 'if you do X, millions will die' and policymakers are consulting the model when they make policy choices, then it is unlikely that they will do X. This makes it difficult to determine whether the model made a correct projection concerning what would have happened if we had done X. It makes it difficult, but not impossible. Sometimes, it will be possible to find datasets that can be used as controls. So, for example, though the model used to generate 'Report 9' was never used to make a projection for Sweden, a close cousin of the model was run by a Swedish group, as was the slightly simpler model used in 'Report 13'.
We can use these results to assess how good the model was at projecting outcomes under relatively mild interventions. If we assume that the projection made by the ICL team for the United States was intended to be more or less spatially homogeneous, we can also use American states, like Florida, that adopted only mild measures. In most such cases, it certainly appears that the ICL models (both the Report 9 and the Report 13 varieties) made overly pessimistic projections for these scenarios, particularly if we pay attention not only to the total number of casualties projected, but also to the pace and tempo at which those casualties were expected to arrive.
Of course, defenders of the ICL model will be quick to point out that Report 9 contained the following proviso: the projections for outcomes in the absence of mitigation were applicable only in 'the (unlikely) absence of any control measures or spontaneous changes in individual behaviour'. They will then point out that it is impossible to say of any place that there was a complete absence of any spontaneous changes in individual behavior. This is of course true, even if extremely convenient. Some have even argued that it is especially unfair to criticize models like the ICL models for being overly pessimistic if it was the models' projections themselves that caused people to change their behavior spontaneously.
In light of this, many have explored the idea that, when a model changes the behavior of the people it is modeling in a way that adversely affects its predictive abilities, it does not follow that the model's 'suitability, adequacy, or usefulness is diminished' (van Basshuysen et al). In fact, the authors suggest that we might want, under some conditions, to consider a model's 'performative' impact to be a potential virtue.
Similar suggestions have been made in the epidemiology literature: Biggs and Littlejohn (2021), for example, remark that '[i]nitial projections [of the ICL model] built in worst-case scenarios that would never happen as a means of spurring leadership into action' (92), while Ioannidis et al. (2020) speculate that '[i]n fact, erroneous predictions may have even been useful. A wrong, doomsday prediction may incentivize people towards better personal hygiene'.
This raises an important question: should modelers pat themselves on the back when they build models that incorporate 'doomsday predictions' that 'incentivize people' towards the behavior that the models deem virtuous? Should we praise their efforts? I think the answer here has to be 'no'. Changes in behavior away from what people would have done in the absence of the 'doomsday predictions' typically have costs. And one of the most important functions of models for policymaking is that we expect them to facilitate cost-benefit analysis.
Somewhat crudely put (but only somewhat): we need accurate and reliable models to determine which behaviors are in fact the best ones to carry out. If the very purpose of the model is constituted in part by a prior assumption about what the best course of action is, then things have become unacceptably circular.
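To see how a biased projection distorts this, consider a deliberately crude cost-benefit sketch. Every number below is invented purely for illustration and is not drawn from any real model or dataset; the point is simply that the ranking of policy options tracks the model's projected harms, so an inflated projection for one option can flip which option the analysis recommends.

```python
# Toy cost-benefit comparison (all figures are hypothetical illustrations).
# Policy ranking depends directly on the model's projected harms, so a
# systematically pessimistic projection can change which option looks best.

# Hypothetical projected deaths under each option, from some model.
projected_deaths = {"business as usual": 80_000, "strict lockdown": 20_000}

# Hypothetical non-health costs of each option (arbitrary units).
intervention_cost = {"business as usual": 0, "strict lockdown": 50_000}

# Hypothetical weight placed on one projected death, in the same units.
cost_per_death = 1.0

def total_cost(option):
    return projected_deaths[option] * cost_per_death + intervention_cost[option]

best = min(projected_deaths, key=total_cost)
print(f"Option favoured by this analysis: {best}")

# If the 'business as usual' projection were inflated (say, fourfold),
# correcting it can reverse the ranking:
projected_deaths["business as usual"] = 20_000  # hypothetical corrected figure
best_corrected = min(projected_deaths, key=total_cost)
print(f"Option favoured after correcting the projection: {best_corrected}")
```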
The whole enterprise risks undermining both the credibility of science and the right of the public to policy choices that reflect their own values, rather than simply those of the model makers. At the very least, the public deserves a say in this matter. I talk about these issues in more detail in this recent film. Readers: do you want scientific model builders to have the goal of persuading people to change their behavior? Or only of producing the most accurate possible projections?