If a test with prognostic value exists, should it be used for population screening? On the face of it, it’s a simple question, but it doesn’t have a simple answer. Like most things in life, it depends on the context: how prevalent and how dangerous is the disease? How invasive and how expensive is the test?
So if we are dealing with cancer, which can be fatal if not diagnosed early, and a screening test such as a mammogram or a blood test for PSA, then it seems obvious that the case for population screening must be impregnable. Such was the basis for the wave of enthusiasm for screening twenty or thirty years ago that led to the introduction of a number of national screening campaigns, of which mammography was only the most high-profile.
But the pendulum has swung the other way: October 2011 saw the US Preventive Services Task Force conclude that the mortality benefit of PSA screening for prostate cancer was small to none, while in the UK the NHS announced a review of the evidence for the effectiveness of its flagship breast cancer screening programme, after recent research suggested the benefits were being exaggerated.
If earlier diagnosis really does improve the outcome for those patients, what can possibly be the problem? The problems are two-fold: over-diagnosis and cost-effectiveness.
The “obvious” case for screening focuses entirely on the benefit gained by the ‘true positives’ – that is, the people who are correctly identified as having the disease. On the negative side is the harm done to the ‘false positives’ – the people who are treated for the disease, but who did not really have it. This harm can be significant, both physically and mentally. Being told you have cancer can be traumatic enough (interpreted by many people, even today, as an automatic death sentence), but undergoing an unnecessary mastectomy, or an unnecessary course of radiotherapy or chemotherapy, is arguably even tougher.
A quantitative accounting of benefit and harm is tricky because the benefit (in terms of the harm avoided) and the harm of over-diagnosis (in terms of the side-effects of the treatment) are different in kind and so difficult to compare. But the number of people affected by each outcome is easy enough to ascertain. Take a test with 90% sensitivity and specificity (better than most diagnostic tests in clinical use) applied to a disease like breast cancer, with an incidence of 5 per 10,000 per year, and the numbers look something like this:
For every million people screened, you will make a correct early diagnosis in 450 of the roughly 500 people who will go on to develop breast cancer; the remaining 50 will be missed (though of course, in the absence of a screening programme all 500 would have had to wait until clinical symptoms were obvious). That looks pretty good.
But a specificity of 90% means 10 ‘false positives’ in every hundred healthy people screened. That is a shocking figure of close to 100,000 people given a positive diagnosis when in fact they did not have cancer at all!
Suddenly, the performance of the test doesn’t look so great. Of the roughly 100,400 people given a positive diagnosis, fewer than half of one per cent really had cancer. More than 200 people were given a wrong diagnosis for every one that was correctly identified. Clearly, that’s not a good enough performance to initiate treatment (whether mastectomy or chemotherapy).
Even if the test had been 99% specific, there would still have been around 10,000 ‘false positives’, outnumbering the 450 real positives by more than twenty to one.
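If you want to check the arithmetic, a minimal sketch in Python does the job; the population size, incidence, sensitivity and specificity values below are simply the illustrative figures used above.

```python
# Screening outcomes for a rare disease: how many true and false
# positives, and what fraction of positives really have the disease
# (the positive predictive value, PPV).
def screening_outcomes(population, prevalence, sensitivity, specificity):
    diseased = population * prevalence
    healthy = population - diseased
    true_pos = diseased * sensitivity          # cases correctly flagged
    false_pos = healthy * (1 - specificity)    # healthy people wrongly flagged
    ppv = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, ppv

# Illustrative figures from the text: 1,000,000 screened, incidence of
# 5 per 10,000 per year, 90% sensitivity, 90% or 99% specificity.
for spec in (0.90, 0.99):
    tp, fp, ppv = screening_outcomes(1_000_000, 5 / 10_000, 0.90, spec)
    print(f"specificity {spec:.0%}: {tp:,.0f} true positives, "
          f"{fp:,.0f} false positives, PPV {ppv:.2%}")
```

With these inputs, 90% specificity gives roughly 100,000 false positives against 450 true positives (a positive predictive value below 0.5%), and even 99% specificity leaves around 10,000 false positives – more than twenty for every true positive.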
What this quantitative analysis clearly shows is that to have any chance of being useful for population screening (at least for a relatively rare condition, as most cancers are) the usual kind of diagnostic performance criteria have to be replaced with a new paradigm in which it is the decimal places beyond 99% specificity that are scrutinized before the test is introduced. Few, if any, molecular tests can reach this level of performance (at least while retaining any useful degree of sensitivity at the same time). The US Preventive Services Task Force was certainly right to conclude that PSA testing, which most definitely doesn’t approach this level of diagnostic performance, has little value when used in screening mode.
Let me correct that: PSA testing, when used in screening mode, does a whole lot more harm than good. The US Preventive Services review found that over a 10-year period, 15-20% of men had a positive test triggering a biopsy (of which at least 80% were false positives). The biopsy itself is not free from harm, being accompanied by fever, infection, bleeding, urinary incontinence and pain. But the damning evidence comes from the trials of intervention in prostate tumours identified through screening. Here, there was a small reduction in all-cause mortality following surgery or radiotherapy, but only in men under 65; by contrast, there was a 0.5% peri-operative mortality rate associated with surgery and a big increase in bowel dysfunction and urinary incontinence in the radiotherapy group. The review rightly concluded that the screening programme yielded questionable benefits at the cost of substantial harms.
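To put those percentages in concrete terms, the same kind of quick sketch (using only the figures quoted from the review above) shows what they imply per 1,000 men screened:

```python
# Biopsies triggered per 1,000 men screened over 10 years, using the
# review figures quoted above: 15-20% positive tests, of which at
# least 80% are false positives.
men_screened = 1_000
for positive_rate in (0.15, 0.20):
    biopsies = men_screened * positive_rate
    false_pos = biopsies * 0.80        # "at least 80%" were false positives
    print(f"positive rate {positive_rate:.0%}: {biopsies:.0f} biopsies, "
          f"at least {false_pos:.0f} of them in men without cancer")
```

In other words, somewhere between 120 and 160 of every 1,000 men screened undergo a biopsy, with its attendant risks, despite not having cancer at all.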
With that kind of conclusion, there is no need even to enter into a cost-effectiveness assessment. Clearly, population screening is inherently costly (because of the very large number of tests that must be performed). Even when the unit cost of the test is very low indeed, the cost burden is substantial. Even if there were a net benefit (and the argument is closer for mammographic screening in breast cancer than it is for PSA screening and prostate cancer), the cost-effectiveness of the screening programme would not approach the levels required to justify spending on a new therapeutic product (at least not based on current NICE cost-effectiveness frameworks). A back-of-the-envelope calculation suggests that mammography would have to be at least 10-fold cheaper than at present to win approval if it were a therapeutic.
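For what it’s worth, a back-of-the-envelope calculation of that kind might look something like the sketch below. Every input is an illustrative assumption rather than a figure from the review – the cost per mammogram, the number of women screened per death averted and the QALYs gained per death averted are all placeholders; only the threshold of roughly £20,000-£30,000 per QALY is the familiar NICE benchmark.

```python
# Illustrative cost-effectiveness sketch - all inputs below are
# assumptions chosen for illustration, not figures from the text.
cost_per_screen = 50                 # assumed cost per mammogram (GBP)
women_per_death_averted = 2_000      # assumed number screened per death averted
screens_per_woman = 10               # assumed rounds of screening each
qalys_per_death_averted = 8          # assumed QALYs gained per death averted
nice_threshold = 25_000              # ~GBP 20,000-30,000 per QALY (NICE benchmark)

cost_per_qaly = (cost_per_screen * women_per_death_averted * screens_per_woman
                 / qalys_per_death_averted)
print(f"cost per QALY: ~GBP {cost_per_qaly:,.0f} "
      f"vs threshold ~GBP {nice_threshold:,.0f}")
print(f"the screen would need to be ~{cost_per_qaly / nice_threshold:.0f}x cheaper")
```

With different but equally defensible assumptions the required price reduction can easily exceed ten-fold; the point is simply that the gap between screening economics and NICE-style thresholds is large.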
Proponents of screening are quick to argue that the solution lies in proper stratification before applying the test – so instead of screening the whole population, only a higher-risk sub-group is screened. The stratification might be on the basis of age, symptoms or some other demographic factor (indeed, such stratification takes place even in the current ‘universal’ breast cancer screening programme in the UK, since males are not screened even though breast cancer can and does occur among men, albeit at a much lower prevalence).
Fine. But if you want to incorporate stratification into the screening paradigm, it’s critical that the data on the performance of the test are gathered using that same paradigm. Failing to do so can over-estimate the value of a test that discriminates very well between the disease and the general healthy population but discriminates poorly between the disease and similar maladies with which it shares symptoms. This has proven to be the difficulty for many, if not all, of the new range of molecular colon cancer tests currently in development. These molecular tests typically have reasonably good sensitivity and specificity when comparing colon cancer with the general healthy population (achieving, perhaps, 90% sensitivity and specificity in the best studies). That, though, as we have already seen, is nowhere near good enough performance to adopt as a general population screening tool. No matter, suggest the proponents of such tests: let’s instead use it only in people with symptoms of colon cancer (such as fecal occult blood, intestinal pain or changes in bowel habits). Now, with a prevalence of colon cancer of 10-20% in this group, a test with 90% specificity would be more attractive – at least now the number of real positives might (just) outnumber the ‘false positives’. True, but only if the test still has 90% specificity in this selected patient group! In most cases, sadly, diagnostic performance falls away once you have stratified the subjects, precisely because the chance of a positive test is increased by inflammatory bowel conditions as well as by cancer. There is nowhere left to go: for a test like this, there is no application in which it is sufficiently useful to justify clinical adoption (even if it were not a premium-priced molecular test).
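The same simple arithmetic as before shows how finely balanced this is, and how much it depends on specificity holding up after stratification. In the sketch below, the 10-20% prevalence and 90% specificity figures are the ones quoted above, while the 80% specificity line is purely an assumed illustration of performance falling away in the symptomatic group.

```python
# True vs false positives per 1,000 people tested in the symptomatic
# group, as prevalence and the specificity actually achieved vary.
def positives_per_1000(prevalence, sensitivity, specificity):
    true_pos = 1_000 * prevalence * sensitivity
    false_pos = 1_000 * (1 - prevalence) * (1 - specificity)
    return true_pos, false_pos

for prevalence in (0.10, 0.20):            # quoted prevalence in the symptomatic group
    for specificity in (0.90, 0.80):       # quoted figure vs assumed fall-off
        tp, fp = positives_per_1000(prevalence, 0.90, specificity)
        print(f"prevalence {prevalence:.0%}, specificity {specificity:.0%}: "
              f"{tp:.0f} true vs {fp:.0f} false positives per 1,000 tested")
```

At the quoted 90% specificity the real positives do indeed (just) hold their own; let specificity slip to 80% in the symptomatic group and the false positives pull back ahead at the lower prevalence and close most of the gap at the higher one.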
Janet Woodcock, Director of the Center for Drug Evaluation and Research (CDER) at the FDA, summed it up perfectly at the recent US conference on Rare Diseases and Orphan Products, saying “How can something that is so widely used have such a small evidence base? The FDA has never accepted PSA as a biomarker for that very reason – we don’t know what it means.”
What the analysis presented here shows is that you need a low-cost, minimally burdensome test with superb diagnostic power, coupled with a reasonably prevalent, but very nasty, disease that clearly benefits from early diagnosis and treatment. That’s a pretty demanding set of criteria.
Neither this analysis, nor the review by the US Preventive Services team published on October 11th, proves that PSA screening is not useful, because the answer depends on a subjective trade-off of benefits and harms (and in any case, some statisticians have been quick to point out inadequacies in the meta-analysis framework that was used). But the evidence that prostate cancer really does benefit a great deal from early diagnosis and aggressive treatment is weak, and PSA testing certainly doesn’t have outstanding diagnostic performance. So the weight of argument is heavily stacked against it.
For colon cancer, there is no doubt that the disease is relatively prevalent and benefits from early diagnosis and treatment. By contrast, the tests that are available (whether immuno-FOBT or newer molecular tests) are nowhere near good enough in terms of diagnostic performance to justify use in a screening programme.
For breast cancer, the case is the strongest of the three. Again, there is clear benefit from early diagnosis and treatment, and the test itself has the greatest diagnostic power. The question is simply whether it is good enough. It will be interesting indeed to read the conclusions of Sir Mike Richards, National Cancer Director for the UK, who has been charged with reviewing the evidence. It will be even more interesting to see whether the review uses this opportunity to attempt a cost-effectiveness assessment at the same time, using a framework similar to NICE’s. After all, the breast cancer screening programme is paid for out of the same global NHS budget as all the rest of UK healthcare, including, interestingly, treatment for breast cancer with expensive new drugs such as Herceptin™. It would be fascinating to know whether screening, or more rapid treatment once symptoms appear, would represent the better use of the available cash for the benefit of breast cancer sufferers in the UK. Sadly, if the nature of the debate on PSA is anything to go by, I doubt the review will yield that much clarity.
The emotional, but evidence-light, arguments in favour of screening exert enormous pressure on healthcare providers. For example, the American Urological Association (AUA) condemned the US Preventive Services report on prostate cancer screening, saying the recommendations against PSA “will ultimately do more harm than good to the many men at risk for prostate cancer” – although they provided no evidence to support their emotive statement. After all, the general population find it hard to imagine how screening can possibly be harmful. The debate will no doubt continue, generating much heat and only a little light. Sadly, despite all the evidence to the contrary, it is very hard to see wasteful and possibly even harmful national screening programmes being halted any time soon.
Dr. David Grainger
CBO, Total Scientific