How to Read a Paper: The Basics of Evidence-Based Medicine
Authors: Trisha Greenhalgh
Question Two: Has the test been compared with a true gold standard?
You need to ask, first, whether the test has been compared with anything at all! Papers have occasionally been written (and, in the past, published) in which nothing has been done except perform the new test on a few dozen participants. This exercise may give a range of possible results for the test, but it certainly does not confirm that the ‘high’ results indicate that the target disorder (the disease or risk state that you are interested in) is present or that the ‘low’ results indicate that it isn't.
Next, you should verify that the ‘gold standard’ test used in the survey merits the term. A good way of assessing a gold standard is to use the ‘so what?’ questions listed earlier. For many conditions, there is no absolute gold standard diagnostic test that will say for certain if it is present or not. Unsurprisingly, these tend to be the very conditions for which new tests are most actively sought! Hence, the authors of such papers may need to develop and justify a combination of criteria against which the new test is to be assessed. One specific point to check is that the test being validated here (or a variant of it) is not being used to contribute to the definition of the gold standard.
Question Three: Did this validation study include an appropriate spectrum of participants?
If you validated a new test for cholesterol in 100 healthy male medical students, you would not be able to say how the test would perform in women, children, older people, those with diseases that seriously raise the cholesterol level, or even those who had never been to medical school. Although few people would be naive enough to select quite such a biased sample for their validation study, it is surprisingly common for published studies to omit to define the spectrum of participants tested in terms of age, gender, symptoms and/or disease severity and specific eligibility criteria.
Defining both the range of participants and the spectrum of disease to be included is essential if the values for the different features of the test are to be worth quoting—that is, if they are to be transferable to other settings. A particular diagnostic test may, conceivably, be more sensitive in female participants than in male participants, or in younger rather than in older participants. For the same reasons, the participants on which any test is verified should include those with both mild and severe disease, treated and untreated and those with different but commonly confused conditions.
Whilst the sensitivity and specificity of a test are virtually constant whatever the prevalence of the condition, the positive and negative predictive values are crucially dependent on prevalence. This is why GPs are, often rightly, sceptical of the utility of tests developed exclusively in a secondary care population, where the severity of disease tends to be greater (see section ‘Whom is the study about?’), and why a good diagnostic test (generally used when the patient has some symptoms suggestive of the disease in question) is not necessarily a good screening test (generally used in people without symptoms, who are drawn from a population with a much lower prevalence of the disease).
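The dependence of predictive values on prevalence is easy to see with a few lines of arithmetic. The sketch below is in Python and assumes a hypothetical test that is 90% sensitive and 90% specific; the three prevalence figures are invented to stand for a hospital clinic, general practice and a screening programme, and are not taken from any study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with the given prevalence."""
    # Expected proportions of the whole population falling in each cell of the two-by-two table
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical settings: same test, very different prevalence of the target disorder
settings = [("hospital clinic", 0.30), ("general practice", 0.05), ("screening", 0.005)]
for label, prevalence in settings:
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"{label:17s} prevalence={prevalence:.3f}  PPV={ppv:.2f}  NPV={npv:.3f}")
```

Sensitivity and specificity are held fixed throughout, yet the positive predictive value falls steeply as prevalence falls, which is the point about diagnostic versus screening use.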
Question Four: Has work-up (verification) bias been avoided?
This is easy to check. It simply means, ‘did everyone who got the new diagnostic test also get the gold standard, and vice versa?’. I hope you have no problem spotting the potential bias in studies where the gold standard test is only performed on people who have already tested positive for the test being validated. There are, in addition, a number of more subtle aspects of work-up or verification bias that are beyond the scope of this book but which are covered in specialist statistics textbooks [11].
Question Five: Has expectation bias been avoided?
Expectation bias occurs when pathologists and others who interpret diagnostic specimens are subconsciously influenced by the knowledge of the particular features of the case—for example, the presence of chest pain when interpreting an electrocardiogram (ECG). In the context of validating diagnostic tests against a gold standard, the question means, ‘did the people who interpreted one of the tests know what result the other test had shown on each particular participant?’. As I explained in section ‘Was assessment “blind”?’, all assessments should be ‘blind’—that is, the person interpreting the test should not be given any inkling of what the result is expected to be in any particular case.
Question Six: Was the test shown to be reproducible both within and between observers?
If the same observer performs the same test on two occasions on a participant whose characteristics have not changed, they will get different results in a proportion of cases. All tests show this feature to some extent, but a test with a reproducibility of 99% is clearly in a different league from one with a reproducibility of 50%. Factors that may contribute to the poor reproducibility of a diagnostic test include the technical precision of the equipment, observer variability (e.g. in comparing a colour with a reference chart), arithmetical errors and so on.
Look back again at section ‘Was assessment “blind”?’ to remind yourself of the problem of inter-observer agreement. Given the same result to interpret, two people will agree in only a proportion of cases, generally expressed as the kappa score. If the test in question gives results in terms of numbers (such as the serum cholesterol level in millimole per litre), inter-observer agreement is hardly an issue. If, however, the test involves reading X-rays (such as the mammogram example in section ‘Was assessment “blind”?’) or asking a person questions about their drinking habits [10], it is important to confirm that reproducibility between observers is at an acceptable level.
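For readers who like to see the arithmetic, here is a minimal Python sketch of the usual two-observer (Cohen's) kappa: observed agreement corrected for the agreement expected by chance. The ten readings are invented for illustration, and the function name `cohen_kappa` is my own.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two observers rating the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both observers must rate the same non-empty set of items")
    n = len(ratings_a)
    # Proportion of items on which the two observers actually agree
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each observer's overall use of each category
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both observers use a single category")
    return (observed - expected) / (1 - expected)


# Hypothetical data: two radiologists reading the same ten films
reader_1 = ["abn", "abn", "abn", "norm", "norm", "norm", "norm", "norm", "abn", "norm"]
reader_2 = ["abn", "abn", "norm", "norm", "norm", "norm", "norm", "abn", "abn", "norm"]
print(f"kappa = {cohen_kappa(reader_1, reader_2):.2f}")
```

In this made-up example the readers agree on 8 films out of 10, but because about half of that agreement would be expected by chance, kappa comes out well below 0.8.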
Question Seven: What are the features of the test as derived from this validation study?
All these standards could have been met, but the test may still be worthless because the test itself is not valid (i.e. its sensitivity, specificity and other crucial features are too low). That is clearly the case for using urine glucose to screen for diabetes (see section ‘Ten questions to ask about a paper describing a complex intervention’). After all, if a test has a false-negative rate of nearly 80%, it is more likely to mislead the clinician than assist the diagnosis if the target disorder is actually present.
There are no absolutes for the validity of a screening test, because what counts as acceptable depends on the condition being screened for. Few of us would quibble about a test for colour blindness that was 95% sensitive and 80% specific, but nobody ever died of colour blindness. The Guthrie heel-prick screening test for congenital hypothyroidism, performed on all babies in the UK soon after birth, is over 99% sensitive but has a positive predictive value of only 6% (in other words, it picks up almost all babies with the condition at the expense of a high false-positive rate) [11], and rightly so. It is far more important to pick up every single baby with this treatable condition who would otherwise develop severe mental handicap than to save hundreds of parents the relatively minor stress of a repeat blood test on their baby.
Question Eight: Were confidence intervals given for sensitivity, specificity and other features of the test?
As section ‘Probability and confidence’ explained, a confidence interval, which can be calculated for virtually every numerical aspect of a set of results, expresses the range of results within which the true value is likely to lie. Go back to the jury example in section ‘Complex interventions’. If they had found just one more murderer not guilty, the sensitivity of their verdict would have gone down from 67% to 33%, and the positive predictive value of the verdict from 33% to 20%. This enormous (and quite unacceptable) sensitivity to a single case decision is because we only validated the jury's performance on 10 cases. The confidence intervals for the features of this jury are so wide that my computer programme refuses to calculate them! Remember, the larger the sample size, the narrower the confidence interval, so it is particularly important to look for confidence intervals if the paper you are reading reports a study on a relatively small sample. If you would like the formula for calculating confidence intervals for diagnostic test features, see the excellent textbook ‘Statistics with Confidence’ [12].
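To make the jury arithmetic concrete, here is a rough Python sketch. The two-by-two counts are reconstructed from the percentages quoted above (10 cases, sensitivity 67% falling to 33%, positive predictive value 33% falling to 20%), so treat them as my reading of the example rather than the original table. The interval is the Wilson score interval, one standard method for a proportion; it is an assumption on my part and not necessarily the formula given in [12].

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def describe(label, true_pos, false_pos, false_neg):
    """Print sensitivity and positive predictive value with their intervals."""
    for name, hits, total in [
        ("sensitivity", true_pos, true_pos + false_neg),
        ("PPV", true_pos, true_pos + false_pos),
    ]:
        low, high = wilson_interval(hits, total)
        print(f"{label}: {name} = {hits}/{total} = {hits / total:.0%} "
              f"(95% CI {low:.0%} to {high:.0%})")


# Reconstructed counts: 3 true murderers, 2 convicted; 6 guilty verdicts in all
describe("original verdicts", true_pos=2, false_pos=4, false_neg=1)
# One more murderer found not guilty
describe("one verdict changed", true_pos=1, false_pos=4, false_neg=2)
```

With only three true murderers in the sample, the interval around the sensitivity spans most of the range from 0 to 100%, which is another way of saying that the estimate tells us very little.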
Question Nine: Has a sensible ‘normal range’ been derived from these results?
If the test gives non-dichotomous (continuous) results, in other words, if it gives a numerical value rather than a yes/no result, someone will have to say at what value the test result will count as abnormal. Many of us have been there with our own blood pressure reading. We want to know if our result is ‘okay’ or not, but the doctor insists on giving us a value such as ‘142/92’. If 140/90 were chosen as the cut-off for high blood pressure, we would be placed in the ‘abnormal’ category, even though our risk of problems from our blood pressure is very little different from that of a person with a blood pressure of 138/88. Quite sensibly, many practising doctors and nurses advise their patients, ‘Your blood pressure isn't quite right, but it doesn't fall into the danger zone. Come back in three months for another check’. Nevertheless, the clinician must at some stage make the decision that this blood pressure needs treating with tablets but this one does not. When and how often to repeat a borderline test is often addressed in guidelines; you might, for example, like to look up the detailed guidance and prevailing controversies on how to measure blood pressure [13].
Defining relative and absolute danger zones for a continuous physiological or pathological variable is a complex science, which should take into account the actual likelihood of the adverse outcome that the proposed treatment aims to prevent. This process is made considerably more objective by the use of likelihood ratios (see section ‘Likelihood ratios’). For an entertaining discussion on the different possible meanings of the word ‘normal’ in diagnostic investigations, see Sackett and colleagues' [14] textbook, p. 59.
Question Ten: Has this test been placed in the context of other potential tests in the diagnostic sequence for the condition?
In general, we treat high blood pressure on the basis of the blood pressure reading alone (although as mentioned, guidelines recommend basing management on a series of readings rather than a single value). Compare this with the sequence we use to diagnose stenosis (‘hardening’) of the coronary arteries. First, we select patients with a typical history of effort angina (chest pain on exercise). Next, we usually do a resting ECG, an exercise ECG, and, in some cases, a radionuclide scan of the heart to look for areas short of oxygen. Most patients only come to a coronary angiogram (the definitive investigation for coronary artery stenosis) after they have produced an abnormal result on these preliminary tests.
If you took 100 people off the street and sent them straight for a coronary angiogram, the test might display very different positive and negative predictive values (and even different sensitivity and specificity) than it did in the sicker population on which it was originally validated. This means that the various aspects of validity of the coronary angiogram as a diagnostic test are virtually meaningless unless these figures are expressed in terms of what they contribute to the overall diagnostic work-up.
Likelihood ratios
Question Nine described the problem of defining a normal range for a continuous variable. In such circumstances, it can be preferable to express the test result not as ‘normal’ or ‘abnormal’, but in terms of the actual chances of a patient having the target disorder if the test result reaches a particular level. Take, for example, the use of the prostate-specific antigen (PSA) test to screen for prostate cancer. Most men will have some detectable PSA in their blood (say, 0.5 ng/ml), and most of those with advanced prostate cancer will have very high levels of PSA (above about 20 ng/ml). But a PSA level of, say, 7.4 ng/ml may be found either in a perfectly normal man or in someone with early cancer. There simply is not a clean cut-off between normal and abnormal [15].
We can, however, use the results of a validation study of the PSA test against a gold standard for prostate cancer (say, a biopsy) to draw up a whole series of two-by-two tables. Each table would use a different definition of an abnormal PSA result to classify patients as ‘normal’ or ‘abnormal’. From these tables, we could generate different likelihood ratios associated with a PSA level above each different cut-off point. Then, when faced with a PSA result in the ‘grey zone’, we would at least be able to say, ‘this test has not proved that the patient has prostate cancer, but it has increased [or decreased] the odds of that diagnosis by a factor of x’. In fact, as I mentioned earlier, the PSA test is not a terribly good discriminator between the presence and absence of cancer, whatever cut-off value is used; in other words, there is no value for PSA that gives a particularly high likelihood ratio in cancer detection. The latest advice is to share these uncertainties with the patient and let him decide whether to have the test [16].
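The idea of a series of two-by-two tables, one per cut-off, can be sketched in a few lines of Python. The sixteen PSA values and biopsy results below are entirely made up to show the mechanics and say nothing about how the real test performs; the function names are my own, and the likelihood ratio computed is the one for a positive result (sensitivity divided by one minus specificity).

```python
def two_by_two(results, cutoff):
    """Classify (value, diseased) pairs as test-positive when value >= cutoff."""
    tp = sum(1 for value, diseased in results if diseased and value >= cutoff)
    fn = sum(1 for value, diseased in results if diseased and value < cutoff)
    fp = sum(1 for value, diseased in results if not diseased and value >= cutoff)
    tn = sum(1 for value, diseased in results if not diseased and value < cutoff)
    return tp, fp, fn, tn


def positive_likelihood_ratio(tp, fp, fn, tn):
    """Sensitivity divided by the false-positive rate (1 - specificity)."""
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    if false_positive_rate == 0:
        return float("inf")  # no false positives observed at this cut-off
    return sensitivity / false_positive_rate


# Hypothetical validation data: (PSA in ng/ml, cancer on biopsy?)
results = [
    (0.5, False), (1.2, False), (2.8, False), (3.9, False),
    (4.4, False), (5.1, False), (7.4, False), (11.0, False),
    (3.1, True), (4.6, True), (6.2, True), (7.4, True),
    (9.8, True), (15.0, True), (22.0, True), (40.0, True),
]

for cutoff in (4.0, 10.0, 20.0):
    table = two_by_two(results, cutoff)
    print(f"cut-off {cutoff:>4} ng/ml: (tp, fp, fn, tn) = {table}, "
          f"LR+ = {positive_likelihood_ratio(*table):.2f}")
```

Each cut-off gives its own table and its own likelihood ratio, so a result in the grey zone can be reported as a change in the odds rather than forced into ‘normal’ or ‘abnormal’.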
Although the likelihood ratio is one of the more complicated aspects of a diagnostic test to calculate, it has enormous practical value, and it is becoming the preferred way of expressing and comparing the usefulness of different tests. The likelihood ratio is particularly helpful for ruling a particular diagnosis in or out. For example, if a person enters my consulting room with no symptoms at all, I know (on the basis of some rather old epidemiological studies) that they have a 5% chance of having iron-deficiency anaemia, because around one person in 20 in the UK population has this condition. In the language of diagnostic tests, this means that the pre-test probability of anaemia, equivalent to the prevalence of the condition, is 0.05.
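A pre-test probability is combined with a likelihood ratio by working in odds: convert the probability to odds, multiply by the likelihood ratio, and convert back. A minimal Python sketch follows; the pre-test probability of 0.05 comes from the example above, but the likelihood ratios of 10 and 0.1 are made-up figures for illustration and do not describe any particular test for anaemia.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a pre-test probability using a likelihood ratio, via odds."""
    if not 0 < pre_test_probability < 1:
        raise ValueError("pre-test probability must lie strictly between 0 and 1")
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


pre_test = 0.05  # prevalence of iron-deficiency anaemia quoted in the text
for lr in (10, 0.1):  # hypothetical likelihood ratios for a positive and a negative result
    print(f"LR = {lr:>4}: post-test probability = {post_test_probability(pre_test, lr):.3f}")
```

A likelihood ratio of 1 leaves the probability unchanged; the further the ratio is from 1 in either direction, the more the result helps to rule the diagnosis in or out.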