The increase in the use of the term biomarker is a recent one. When one looks back at the use of this term in the literature over the last fifty years, there was an explosive increase in its use in the 1980s and 1990s, and it continues to grow today. However, biomarker research as we now know it has a much deeper history.
Here we are going to focus on just one paper, published in 1965, twelve years before the term “biomarker” appeared in either the title or abstract of any paper in the PubMed database[i]. This is a paper by Sir Austin Bradford Hill, which appeared in the Proceedings of the Royal Society of Medicine entitled “The Environment and Disease: Association or Causation?”.
Sir Austin neatly and eloquently describes nine factors that he feels should be taken into account when assessing the relationship between an environmental factor and disease. These are:
- Strength
- Consistency
- Specificity
- Temporality
- Biological gradient
- Plausibility
- Coherence
- Experiment
- Analogy
In this blog we discuss the applicability of each of these factors to biomarker research today. Before we do, it is important to note that the aims of biomarker research today are much broader than the primary aim of Sir Austin’s paper – which was to discuss the ways in which an observed association between the environment and some disease may be assessed for the degree of causality involved. Only a very few biomarkers lie directly on this causal path (some biomarkers change in response to the disease itself, others are only indirectly associated with the disease and its causes), but crucially their utility does not depend upon a causal association. Nevertheless, particularly when biomarkers are used to aid the identification of disease, there are clear parallels between Sir Austin Bradford Hill’s assessment of causality and our current need to assess utility.
1. Strength. Sir Austin’s primary factor to consider in the interpretation of causality was the strength of the association. He argues that the stronger the association between two factors, the more likely it is that they are causally related. However, he cautions against the converse interpretation – that a weak association implies a lack of causality. In fact, the strength of an association depends on the proportion of the variance in one factor that is explained by the other over the relevant sampling timescale. In other words, there may be a completely causal relationship between X and Y, but X may be only one factor (possibly a small factor) controlling Y. The remaining variance in Y may even be random fluctuations (so X is the only factor causally associated with Y), yet the strength of the observed association will be weak, unless time-averaged measurements are taken for both variables.
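To make this point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration rather than anything taken from Sir Austin’s paper: X is treated as a stable per-subject level (so only Y needs averaging), Y is X plus random fluctuation with three times the standard deviation, and the subject and timepoint counts are arbitrary. X is the only cause of Y, yet a single reading of Y correlates only weakly with X, while the time-averaged reading correlates strongly.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n_subjects=2000, n_timepoints=25, noise_sd=3.0, seed=1):
    """Return (r for a single reading of Y, r for time-averaged Y).

    Assumed model: Y_t = X + noise_t, so X is the ONLY causal driver of Y
    and everything else is random fluctuation.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
    y_single, y_averaged = [], []
    for xi in x:
        readings = [xi + rng.gauss(0.0, noise_sd) for _ in range(n_timepoints)]
        y_single.append(readings[0])                      # one snapshot
        y_averaged.append(sum(readings) / n_timepoints)   # time-averaged
    return pearson(x, y_single), pearson(x, y_averaged)


r_single, r_averaged = simulate()
print(f"single measurement: r = {r_single:.2f}")
print(f"time-averaged:      r = {r_averaged:.2f}")
```

Under these assumed settings the expected correlations are about 0.32 for the single reading and about 0.86 for the average of 25 readings, even though the causal relationship is identical in both cases.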
The strength of the association is probably an even more important factor for assessing the utility of biomarkers than it was for assessing causality. Firstly, it is clear to all that the stronger the association between a putative biomarker and the disease under examination, the more likely it is to have clinical application. However, as with the arguments for causality, there are important caveats to insert. The clinical utility of a putative biomarker often depends upon the shape of the receiver operating characteristic (ROC) curve, not just the area under it. For example, a test whose specificity remains at 100%, even with lower sensitivity, may have far more clinical utility than a test where both sensitivity and specificity are 90% – depending on the application – even if the overall strength of the association is weaker.
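A quick worked example shows why the shape of the curve can matter more than the overall strength. The numbers below are hypothetical: we assume a screening setting with 1% disease prevalence and give the 100%-specific test a sensitivity of only 50%. The sketch compares the two tests on Youden’s J (sensitivity + specificity − 1, a simple summary of overall discrimination) and on positive predictive value, the probability that a positive result is a true positive.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    # PPV is undefined if the test never returns a positive result.
    ppv = true_pos / (true_pos + false_pos) if (true_pos + false_pos) > 0 else float("nan")
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


def youden_j(sensitivity, specificity):
    """Youden's J statistic: one simple summary of overall discrimination."""
    return sensitivity + specificity - 1.0


PREVALENCE = 0.01  # assumed: 1 in 100 people tested has the disease

# Hypothetical test A: perfectly specific but misses half of all cases.
ppv_a, npv_a = predictive_values(0.50, 1.00, PREVALENCE)
# Hypothetical test B: 90% sensitivity and 90% specificity.
ppv_b, npv_b = predictive_values(0.90, 0.90, PREVALENCE)

print(f"Test A: J = {youden_j(0.50, 1.00):.2f}, PPV = {ppv_a:.3f}, NPV = {npv_a:.4f}")
print(f"Test B: J = {youden_j(0.90, 0.90):.2f}, PPV = {ppv_b:.3f}, NPV = {npv_b:.4f}")
```

With these assumed figures every positive result from test A is a true positive, whereas only about 8% of positives from test B are, despite test B having the higher Youden’s J. Which test is more useful still depends on the application: test A is the better rule-in test, but it misses half of the cases.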
It’s also possible to improve the strength of a crude association, for example by subsetting the patient population. A given biomarker may perform much better in, say, males than females, or younger people rather than older people. The applicability of the biomarker may be restricted, but the strength, and hence the clinical utility, of the association may be improved dramatically. But despite these caveats, the strength of the association is a good “first pass” screening criterion for assessing the utility of biomarkers – much as for Sir Austin it yielded a good “first guess” as to whether an association was likely to be causal.
2. Consistency. Sir Austin Bradford Hill puts this essential feature of any biomarker programme second on his list of causality factors. He states “Has [it] been repeatedly observed by different persons, in different places, circumstances and times?”. This is an absolutely crucial issue, and one on which many a biomarker programme has failed. One only has to look at the primary literature to realise that there have been dozens of potential biomarkers published, of which most have not been validated, as indicated by the lack of positive follow-on studies. Much of this attrition can be put down to study design, something that was discussed in an earlier blog.
3. Specificity. The discussion of specificity by Sir Austin Bradford Hill is also highly relevant to today’s biomarker research. We live in an ’omics world, with the ability to measure levels of dozens, hundreds or even thousands of potential biomarkers with an ease that must have seemed like science fiction in 1965. As a result, it is often trivial (in both the technical logical sense of the word as well as the everyday use) to identify a biomarker apparently associated with a disease. Consider, however, how markers of inflammation might behave: they will likely be strongly associated with any selected inflammatory disease, but they are unlikely to have any specificity over other inflammatory conditions. For example, serum levels of C-reactive protein correlate well with rheumatoid arthritis (RA), but because C-reactive protein is also associated with dozens of other inflammatory conditions it has little clinical utility for the diagnosis of RA (although, of course, it may be useful for monitoring disease activity once you have secured a robust differential diagnosis by other means). Again, this raises the issue of study design: preliminary studies are often set up with the aim of identifying differences in levels of biomarkers between subjects with disease and healthy controls. Such studies may provide a list of candidates, but ultimately most of these will not show adequate specificity, an issue identified when a more suitable control population is used.
4. Temporality. This is perhaps the most obvious of Bradford Hill’s concepts: for a causal relationship between X and Y, changes in X must precede changes in Y. Similarly, a biomarker is more useful in disease diagnosis when it changes before the disease is manifestly obvious. On the face of it, the earlier the change can be detected before the disease exhibits clinically relevant symptoms, the more useful that advance warning becomes. In the limit, however, differences that are exhibited long before the disease (perhaps even for the whole life of the individual, such as genetic markers) become markers of risk rather than markers of the disease process itself.
5. Biological gradient. This feature of biomarker studies is just as important as it was when Sir Austin discussed it in relation to the causality of associations. Our assessment of the utility of a biomarker increases if there is a dose-response association between levels of the biomarker and presence or severity of disease. So, taking colorectal cancer as an example, one might give greater weight to a biomarker whose levels are elevated somewhat in patients who have large polyps and strongly elevated in patients who have overt cancer. A gradient of elevation across patients with different stages of cancer would also add to the plausibility of the putative biomarker (see below).
6. Plausibility. Of all of the criteria put forward in the paper by Sir Austin Bradford Hill back in 1965, we find this the most interesting. Prior to the ’omics era, the majority of experimental designs were already based on a hypothesis of some sort – that is, plausibility was inherently built in to all experiments, just because the act of measuring most analytes or potential biomarkers was expensive in both time and money. To Sir Austin, it must have been the norm rather than the exception that observed associations had at least a degree of plausibility.
In the modern era this is no longer the case. Thousands of genes, metabolites or proteins may now be examined in a very short period of time and (for the amount of data obtained) at a very reasonable cost. And because recruiting additional subjects into a clinical study is typically significantly more expensive than measuring an additional analyte or ten, one often finds that the resulting dataset for most modern studies is “short and fat” – that is, you have measured many more analytes (variables) than you had patients (observations) in the first place. Moreover, there is often no particular reason why many of the analytes have been measured – other than the fact that they composed part of a multi-analyte panel or some pre-selected group of biomarkers. Post-hoc justification becomes the norm. It is almost impossible to avoid. We find a few “statistically significant” differences[ii], and then rush to explain them either from our own background knowledge or by some hurried literature searches. The sum of biological knowledge (or at least published data) is orders of magnitude greater than it was in Hill’s day, and nowadays it is entirely undemanding to construct a plausibility argument for any association one might find in such a trawl.
We caution strongly against this approach, however. Tempting though it is to take this route, the likelihood that any biomarkers identified in such experiments have any validity is almost nil, and enthusiastic but unwitting over-interpretation is often the outcome. This does not mean that such datasets cannot be mined successfully, but doing so is a job for a professional, wary of the pitfalls. And no such biomarker should be considered useful until it has been validated in some well-accepted manner.
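The scale of the problem with “short and fat” datasets is easy to demonstrate. In the sketch below, every figure is an assumption picked for illustration: 1,000 analytes, 10 cases and 10 controls, and a conventional 0.05 threshold. All values are pure random noise, so no analyte has any real association with disease, and for simplicity we use a two-sided z-test with known unit variance rather than any particular test used in real studies.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def count_chance_hits(n_analytes=1000, n_per_group=10, alpha=0.05, seed=0):
    """Count analytes that look 'significant' when there is NO real effect.

    Every value is drawn from the same standard normal distribution for
    cases and controls alike, so each hit is a false positive by design.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of a difference in means with known unit variance.
    std_error = sqrt(2.0 / n_per_group)
    hits = 0
    for _ in range(n_analytes):
        cases = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        controls = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        z = (mean(cases) - mean(controls)) / std_error
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))  # two-sided
        if p_value < alpha:
            hits += 1
    return hits


hits = count_chance_hits()
print(f"'Significant' analytes found in pure noise: {hits} of 1000")
```

On average about 5% of the analytes (around 50 of the 1,000 here) will cross the threshold by chance alone, and a plausible-sounding story could be written for each of them. This is why validation in independent samples matters far more than the initial hit list.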
Interestingly, from the perspective of 1965, Sir Austin Bradford Hill came to the conclusion that it would be “helpful if the causation we suspect is biologically plausible”, but today we do not share that perspective. Armed with so much published data, an argument for plausibility can be built for any association – this lack of specificity therefore means that such plausibility has little predictive value as a criterion for assessing utility. He did, however, state that from the perspective of the biological knowledge of the day, an association that we observe may be one new to science and it must not be dismissed “light-heartedly as just too odd.” This holds true as much today as it did then. When faced with two associations, one plausible and one off-the-wall, plausibility is not necessarily the primary criterion that we apply to determine utility.
7. Coherence. Similar to plausibility, this criterion highlights that while there may be no grounds to interpret something positively based on currently available biological knowledge, there may nevertheless be reason to doubt data based on existing scientific evidence. The arguments against using coherence to assess utility of candidate biomarkers are the same as for plausibility.
8. Experiment. This is another crucial factor that is just as relevant in today’s world of biomarkers as it was in 1965. Sometimes the fields of diagnostic medicine and experimental biology are not as well integrated as they should be. Interpretation of biomarker identification or biomarker validation experiments is often limited by the availability of samples or data. However, there is much to be said for taking the information learnt in the examination of biomarkers in patients back to the bench. Here much tighter control may be applied to your experimental system, and hypotheses generated in vivo may be tested in vitro. This may seem back-to-front, but it is an essential feature of any well-designed biomarker programme that it be tested experimentally. This may be possible in patients, but it may often be carried out more cheaply and quickly at the bench or in animal models of disease.
9. Analogy. Analogy falls into the same category as plausibility and coherence. The huge range of published data, much of it from work that was carried out poorly and/or not followed through, means that testing the validity of a finding by analogy to existing biological knowledge is becoming ever more difficult. It is not analogy that’s needed, but consistency – and that means more well-designed experiments.
Perhaps it’s time to bring Bradford Hill’s criteria bang up to date for the 21st century? Much of his pioneering work on assessing causality between environmental factors and disease is just as valuable in assessing modern biomarkers for clinical utility. For the initial assessment of biomarkers, as data begin to emerge from the first discovery studies, it is consistency and specificity that carry the greatest weight, with temporality, strength of the association and biological gradient only a short distance behind. The key is to design efficient studies that allow each of these critical parameters to be assessed at the earliest stages of the biomarker discovery programme – too often biomarkers are trumpeted as ready for use before this checklist has been completed, and quite often before any experiment has even been conceived of that might properly test each of them.
Experiment is a crucial component of the eventual validation of any biomarker, but the effort involved means that preliminary prioritization of candidate biomarkers will likely have to be undertaken without it. Our Total Scientific Criteria (with appropriate deference to Sir Austin Bradford Hill) for assessing the utility of biomarkers might look something like this:
- Consistency
- Specificity
- Temporality
- Strength
- Biological gradient
There may be inflation in almost everything in the modern world, but at least when it comes to criteria for judging the utility of biomarkers we have gone from nine criteria to just five. The pleasures of living in a simpler world!
Dr. David Mosedale and Dr. David Grainger
CEO and CBO, Total Scientific Ltd.
References
[i] Source: PubMed search carried out in March 2011.
[ii] We are deliberately avoiding discussion of what might be statistically significant in such a short and fat dataset. Interestingly, Sir Austin’s paper finishes with a discussion on statistical tests, and their potential overuse back in 1965. This is well worth a read!