Indian Journal of Medical Microbiology

Year : 2017  |  Volume : 35  |  Issue : 2  |  Page : 184--193

Statistical analysis of microbiological diagnostic tests

CP Baveja1, Prabhav Aggarwal2,  
1 Department of Microbiology, Maulana Azad Medical College, New Delhi, India
2 Department of Microbiology, ESIC Dental College and Hospital, New Delhi, India

Correspondence Address:
Prabhav Aggarwal
E-49, Sector-55, Noida - 201 301, Uttar Pradesh


No study in medical science is complete without the application of statistical principles. Incorrect application of statistical tests leads to incorrect interpretation of study results obtained through hard work. Yet statistics remains one of the most neglected and loathed areas, probably because of a lack of understanding of its basic principles. In microbiology, rapid progress is being made in the field of diagnostic tests, and a large number of the studies being conducted relate to the evaluation of these tests. Therefore, a good knowledge of statistical principles will help a microbiologist to plan and conduct a study and to interpret its results. The initial part of this review discusses study designs, types of variables, principles of sampling, calculation of sample size, types of errors and power of the study. Subsequently, the performance characteristics of a diagnostic test, the receiver operator characteristic curve and tests of significance are explained. Lack of a perfect gold standard test against which the new test is compared can hamper the study results; thus, it becomes essential to apply the remedial measures described here. Rapid computerisation has made statistical calculations much simpler, obviating the need for the routine researcher to learn derivations by rote and apply complex formulae. Thus, greater focus has been laid on developing an understanding of principles. Finally, it should be kept in mind that a diagnostic test may show exemplary statistical results and still not be useful in the routine laboratory or in the field; thus, its operational characteristics are as important as the statistical results.

How to cite this article:
Baveja C P, Aggarwal P. Statistical analysis of microbiological diagnostic tests. Indian J Med Microbiol 2017;35:184-193

How to cite this URL:
Baveja C P, Aggarwal P. Statistical analysis of microbiological diagnostic tests. Indian J Med Microbiol [serial online] 2017 [cited 2020 Nov 27 ];35:184-193
Available from:

Full Text


The importance of statistical knowledge in any scientific study should never be underestimated. Even in the simplest scientific study, statistics has a role to play, all the more so in studies evaluating newer diagnostic tests. However, very little emphasis is laid on this subject during undergraduate teaching and almost none during post-graduate training; thus, statistics remains the most neglected subject in the curriculum. Despite the enormous amount of material available both online and in print, microbiologists are often left stranded when it comes to statistics. Only a handful of good resources deal specifically with microbiological diagnostic tests.

The history of medical science has seen objections raised because improper or inappropriate statistical tests or methods were applied.[1],[2],[3] Not only is correct analysis of data important, but correct interpretation and citation of data by readers are equally important. Otherwise, wrong information keeps passing from one article to another. Therefore, a basic knowledge of statistics is necessary to plan the study and to analyse and interpret the results. If assistance is sought from a statistician, the researcher should be able to discuss the matter in a more informed manner and make conscious decisions independently.

In this review, the principles of statistics applicable to the analysis of microbiological diagnostic tests are discussed. As the world is rapidly progressing towards computerisation, so is statistics; most statistical calculations are now performed using statistical software or online statistical calculators. Thus, detailed discussion of formulae, derivations and methods of manual calculation is kept to a minimum, which should also make the principles easier to understand.

 Choosing the Study Design: Retrospective Versus Prospective

Whenever we have diagnostic tests at hand that need to be compared with one another or with a previously established 'gold standard' test, the first step is the preparation of the study protocol, and the first step in that is deciding the study design. The two basic study designs are 'prospective' and 'retrospective'. In simple terms, prospective studies, by definition, are designed to go from the present to the future, whereas retrospective studies go from the present to the past.

In a prospective study, the basic steps include determining the diagnostic tests that need comparison, designing the study, collecting specimens, performing the diagnostic tests and finally analysing the results. On the other hand, in a retrospective study, the diagnostic tests have already been performed; the researcher plans a study, goes back in time, searches for the results in old records and finally analyses them. In another type of study, well-preserved archived specimens may be used to carry out more tests; these are sometimes called 'prospective-retrospective' studies.[4]

Prospective studies have several advantages. Because the study is planned in advance, we can choose the tests that we wish to evaluate; control groups can be included, bias can be minimised, case history and patient information can be collected, and informed consent can be taken if required. For lack of these factors, retrospective studies are considered to be inferior. While prospective studies are the clear winner here, many studies are actually retrospective in nature for the simple reason that they are easier to perform: specimen collection and testing have already been completed, so the researcher only has to search for the test results in records and compare them. Studies that use archived specimens take a middle path; however, it is essential to ensure quality storage and to validate the test for use with archived samples.[5]

 Type of Data Variable: Qualitative Versus Quantitative

The statistical analysis depends heavily on the type of variable the study is likely to generate. There are two broad categories: qualitative and quantitative variables.[6] The majority of microbiological diagnostic tests generate qualitative variables. In a 'nominal' qualitative variable, no significance is attached to the magnitude or size of the characteristic; the test results may be binary with two possible outcomes (e.g., positive/negative, reactive/non-reactive, growth/no growth) or there may be more than two categories (non-binary), such as colony morphology and colour on growth media. Another type of qualitative variable comprises 'titers', data expressed in terms of logarithms (such as copies of RNA or DNA obtained in real-time polymerase chain reaction (PCR) expressed as a logarithm) and other data with logical ordering (such as 1+, 2+, 3+, 4+ grading of acid-fast stained slides for tuberculosis diagnosis); these are called 'ordinal' qualitative variables or, sometimes loosely, 'semi-quantitative' variables.[7]

Relatively few microbiological tests generate quantitative data, in which the variable can take any numerical value signifying a quantity or amount. A good example is the measurement of serum sepsis markers such as procalcitonin, which yields biomarker levels in ng/ml.[8]


A large part of the science of statistics deals with calculating the probability that our results were caused by chance. This part could be avoided if the entire population were included in the study. However, this is almost never the case because of obvious practical issues. No study can aim to test all individuals with the suspected disease using the diagnostic test under evaluation; this would require astronomical amounts of workforce, resources, time and commitment. This is where sampling plays an important role. Sampling involves selecting only a few of the patients/subjects, who should be representative of the entire population, making the study easier, more economical and quicker to conduct. Since the sample represents the population, it is essential that the samples are selected randomly, which gives a good probability that the results obtained from the sample will be close to those that would be obtained if the entire population were studied. The term 'population' here refers to all the target persons to whom the test would be applicable once evaluated. Also important in several studies is the inclusion of another group of healthy control individuals.

If attention is not paid, it is easy to introduce sampling errors, which would defeat the very purpose of the study by giving erroneous results. The most common errors are biased sampling technique and inadequate sample size. Selection bias in sampling can be introduced if sampling technique is not truly random and some members of the population are less likely to be included as compared to others.[9] Selecting samples from patients with a particular clinical presentation or from a particular area/region and then extrapolating the results to the entire population is just one example of such bias.

Sample size

Results obtained from a study on a sample are unlikely to be exactly the same as those that would be obtained if the entire population were tested. Therefore, an adequate sample size is essential if we want the results of the study to be close to the population results with good probability. An inadequate sample size leads to imprecise results, and it becomes difficult to rule out that the results obtained are due to chance. On the other hand, too large a sample size results in wasteful use of resources. There are numerous formulae for calculating sample size, depending on the nature of the variables (quantitative or qualitative) and the purpose of the study. Since the majority of microbiological tests are qualitative in nature with a binary (positive/negative) outcome, and the aim is usually to determine the performance characteristics (sensitivity, specificity, etc.) of the tests, sample size calculation of this type will be discussed.[10]

However, before delving into the calculation of sample size, it is important to understand the concepts of statistical errors, confidence interval (CI) and power of the study, all of which have a great influence on the sample size.

Errors and power of the study

Type 1 error (also known as alpha 'α' error) is the probability of incorrectly rejecting a true null hypothesis (that is, there is no significant difference between groups, yet we say there is). The acceptable level of α error is fixed at the beginning of the study by choosing the level of significance, most commonly α = 0.05. Type 2 error (beta 'β' error), in contrast, is the probability of incorrectly accepting a false null hypothesis (i.e., there is a significant difference between groups, yet we say there isn't).

The power of the study is the probability that the study will correctly identify a difference between the two groups under evaluation, if such a difference actually exists in the population from which the samples were drawn.[11] The greater the power, the better the study. It is calculated as 1 − β. For a good study, this value should be 0.8 or higher. The power depends on the sample size, the chosen value of α, the magnitude of the difference between the groups and the nature of the statistical test.
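The dependence of power on sample size and α can be illustrated with a rough sketch. The function below is not taken from the article: it assumes a two-sided z-test comparing two independent proportions and uses the usual normal approximation, so it is only indicative (paired designs, which are common in diagnostic test evaluation, need a different calculation). The function name and the example proportions are hypothetical.

```python
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two independent proportions.

    Assumption: normal approximation, equal group sizes, unpaired samples.
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    p_bar = (p1 + p2) / 2
    # Standard error of the difference under the null and the alternative.
    se_null = (2 * p_bar * (1 - p_bar) / n_per_group) ** 0.5
    se_alt = (p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group) ** 0.5
    return z.cdf((abs(p1 - p2) - z_crit * se_null) / se_alt)


# Hypothetical example: two tests with true positivity rates of 70% and 90%.
for n in (30, 100):
    print(n, round(power_two_proportions(0.7, 0.9, n), 2))
```

With these illustrative inputs, the estimated power rises as the sample size grows and falls when a stricter α is chosen, which is the qualitative behaviour described in the text.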

Confidence intervals

CIs are based on the concept that whenever we draw a sample set from a population and obtain a result (e.g., the sensitivity of the diagnostic test), the same result may not be obtained again if we repeat the experiment using a different sample set from the same population. Therefore, a range of values (the 95% CI) is reported, constructed in such a way that if we were to repeat the whole experiment a very large number of times, each time choosing a different sample set from the same population, about 95% of the intervals so obtained would contain the true population value. It is often interpreted as: We can be 95% certain that the true value would lie in this interval.[5] In the same way, we can calculate 90% or 99% CIs; however, the 95% CI is traditionally the most frequently used. A wide CI indicates that our results are imprecise, as it means that the true value may lie anywhere over a wide range. As can be seen in the formula below, the CI can be narrowed by increasing the sample size, thereby making our results more precise:


$$95\%\ \mathrm{CI} = P \pm 1.96\sqrt{\frac{P(1-P)}{n}}$$

(95% CI = 95% confidence interval; P = proportion for which the 95% CI is to be calculated, e.g., prevalence, sensitivity, specificity, etc.; n = sample size).

For example, for a test with a calculated sensitivity of 90% (0.9) in a study that included 100 subjects, the 95% CI is 84.12%–95.88%, as follows:

$$0.9 \pm 1.96\sqrt{\frac{0.9 \times 0.1}{100}} = 0.9 \pm 0.0588$$


If the study included 10,000 subjects with the same sensitivity (90%; 0.9), the 95% CI is much narrower (89.4%–90.6%):

$$0.9 \pm 1.96\sqrt{\frac{0.9 \times 0.1}{10000}} = 0.9 \pm 0.0059$$
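The two worked examples can be reproduced with a few lines of code. This is a minimal sketch assuming the simple normal-approximation interval, P ± 1.96 × √(P(1 − P)/n), which is consistent with the figures quoted above; the function name `wald_ci` is hypothetical, and other interval methods exist that behave better for small samples or proportions close to 0 or 1.

```python
from math import sqrt


def wald_ci(p, n, z=1.96):
    """Normal-approximation CI for a proportion p estimated from n subjects."""
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Sensitivity of 0.9 estimated from 100 and from 10,000 subjects.
for n in (100, 10_000):
    low, high = wald_ci(0.9, n)
    print(f"n={n}: {low:.4f} to {high:.4f}")
```

Increasing the sample size a hundredfold narrows the interval tenfold, because the half-width shrinks with the square root of n.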


Calculation of sample size

Hajian-Tilaki has reported that there are four essential determinants for calculation of sample size for studies estimating sensitivity and specificity of diagnostic tests.[12] These are: (i) Approximate expected sensitivity or specificity of the test based on previous studies. (ii) The confidence level (1 – α) for the results obtained (α level or the highest risk of making a false positive error that the researcher is willing to accept; 'α' is usually kept at 0.05). (iii) Precision of estimates of sensitivity (or specificity) that is the margin of allowable error pre-determined by the judgment of investigators. (iv) Prevalence of the disease. Calculations for the required sample sizes for estimation of sensitivity or specificity are discussed here. For other purposes, the reader is referred to the review of the literature by Hajian-Tilaki.[12]



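$$n_1 = \frac{z_{\alpha/2}^{2}\, Se\,(1 - Se)}{d^{2} \times Prev}$$

$$n_2 = \frac{z_{\alpha/2}^{2}\, Sp\,(1 - Sp)}{d^{2} \times (1 - Prev)}$$

where $n_1$ = sample size for estimation of sensitivity; $n_2$ = sample size for estimation of specificity; $z_{\alpha/2}$ = 1.96 (for α set at 0.05); Se = approximate expected sensitivity; Sp = approximate expected specificity; Prev = prevalence of disease; d = precision of estimate (margin of allowable error).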

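For example, the sample size required for a test with an estimated sensitivity of 80%, in a population with a disease prevalence of 20%, with 'α' set at 0.05 and a maximum allowable error of 10%, is calculated as follows:

$$n_1 = \frac{1.96^{2} \times 0.8 \times (1 - 0.8)}{0.1^{2} \times 0.2} = \frac{0.614656}{0.002} \approx 307.3$$

that is, at least 308 subjects.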


If the sample sizes for sensitivity and specificity are different, which is likely to be the case, one should always opt for the larger number. Allowances must be made for patients who might not meet the inclusion criteria, refuse to be included in the study or wish to exit the study prematurely. Despite being aware of the requirement for optimum sample sizes, researchers often choose sample sizes based on convenience, or choose a time frame and include all or randomly selected patients (such as every 10th patient) presenting during that time frame. Such methods are often used in small-scale studies or, more commonly, in post-graduate dissertations. Unfortunately, while these are definitely more convenient, the results obtained may not be reliable.
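The sample size rule described in this section can be sketched in code. The sketch assumes the commonly used formulae in which the number of subjects needed for sensitivity is inflated by the prevalence and that for specificity by (1 − prevalence); the function names and the expected specificity of 90% are hypothetical values chosen only for illustration.

```python
from math import ceil


def sample_size_sensitivity(se, prev, d, z=1.96):
    """Total subjects needed to estimate sensitivity `se` within +/- d."""
    return ceil(z**2 * se * (1 - se) / (d**2 * prev))


def sample_size_specificity(sp, prev, d, z=1.96):
    """Total subjects needed to estimate specificity `sp` within +/- d."""
    return ceil(z**2 * sp * (1 - sp) / (d**2 * (1 - prev)))


# Expected sensitivity 80%, prevalence 20%, allowable error 10%;
# the expected specificity of 90% is an assumed value.
n1 = sample_size_sensitivity(0.8, prev=0.2, d=0.1)
n2 = sample_size_specificity(0.9, prev=0.2, d=0.1)
required = max(n1, n2)  # opt for the larger of the two numbers
print(n1, n2, required)
```

In practice the final figure would be increased further to allow for exclusions, refusals and drop-outs.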

Paired or unpaired samples

In addition to the qualitative or quantitative characteristics of the data, statistical tests often require one more essential piece of information: whether the samples in the two groups are dependent on each other or not. By dependent samples, we mean that the samples in one group (in our case, samples tested by one test) are linked or related to those in the other group (tested by the second test). In most studies that evaluate diagnostic tests, the same specimen is tested by two or more diagnostic tests. This makes the specimens paired (or dependent on each other, because a sample positive by one test is also likely to be positive by another test). On the other hand, if two diagnostic tests are performed on two different sets of unrelated specimens, they are said to be unpaired or independent of each other.

It is important to note that even if specimens are derived from separate groups, they may still be treated as dependent if each individual in one group is very closely matched to an individual in the other group on a one-to-one basis (e.g., on the basis of age, sex, clinical presentation, etc.).[7]

 Presentation of Data in Tables

As mentioned previously, most microbiological tests are qualitative in nature with binary results (positive/negative, growth/no growth, etc.). The most popular way of presenting data comparing two diagnostic tests with this kind of outcome is a 2 × 2 contingency table [Table 1]. The positive and negative results of the 'gold standard' test (or the true disease status) are arranged vertically, whereas those of the test under evaluation are arranged horizontally.{Table 1}

True positives (cell 'a') are those patients with the disease (positive by gold standard test) who test positive by the diagnostic test under evaluation; true negatives (cell 'd') are those without the disease who also test negative by our test. On the other hand, false negative values (cell 'c') are those who have the disease (positive by gold standard test), but the test under evaluation falsely gives a negative result; false positives (cell 'b') are those without the disease, but our test falsely indicates the presence of disease.

Presentation of data other than above-mentioned category requires larger tables with more complicated calculations which are beyond the scope of this review.

 Performance Characteristics of Diagnostic Test

Among the performance characteristics of diagnostic test, sensitivity and specificity are most frequently reported; however, there are several other measures of the performance of the tests, namely, positive and negative predictive values (PPVs and NPVs); false positive rates (FPRs), false negative rates (FNRs), positive and negative likelihood ratios (LR+ and LR−). These are explained below with the help of a typical two-by-two contingency table [Table 1],[Table 2],[Table 3].{Table 2}{Table 3}

Sensitivity and specificity

In simple terms, 'sensitivity' is the probability (presented as a percentage) that the diseased patients (as determined by the gold standard test) will have a positive result when tested by the diagnostic test under evaluation. In other words, it is calculated by dividing the true positives ('a') by the total positives given by the gold standard test ('a + c').

On the other hand, 'specificity' is the probability (presented as a percentage) that the non-diseased patients (as determined by gold standard test) will have a negative result when tested by the diagnostic test under evaluation. In other words, it is calculated by dividing true negatives ('d') by total negatives given by gold standard test ('d + b').

Predictive values

Sensitivity and specificity tell us, given that the patient does or does not have the disease, how likely the new test is to give a positive or a negative result, respectively. While these concepts are important from the microbiologist's point of view, the patient (and the clinician) actually wants to know the reverse: if the patient has tested positive or negative, what are the chances that he/she actually has or does not have the disease, respectively? These questions are answered by the PPV and NPV, respectively. PPV is calculated by dividing the true positives ('a') by the total positives given by the test under evaluation ('a + b'). Similarly, NPV is calculated by dividing the true negatives ('d') by the total negative results given by the test under evaluation ('c + d').
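The four measures defined so far follow directly from the cells of the 2 × 2 table. The sketch below uses the cell labels from the text (a = true positives, b = false positives, c = false negatives, d = true negatives); the function name and the counts are hypothetical.

```python
def basic_measures(a, b, c, d):
    """Sensitivity, specificity, PPV and NPV from 2 x 2 table cells.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    return {
        "sensitivity": a / (a + c),  # positives among the diseased
        "specificity": d / (b + d),  # negatives among the non-diseased
        "ppv": a / (a + b),          # diseased among test positives
        "npv": d / (c + d),          # non-diseased among test negatives
    }


# Hypothetical study: 100 diseased and 200 non-diseased subjects.
for name, value in basic_measures(a=90, b=20, c=10, d=180).items():
    print(f"{name}: {value:.1%}")
```

Note that sensitivity and specificity are computed within the columns of the gold standard result, whereas PPV and NPV are computed within the rows of the test under evaluation.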

An important difference between sensitivity/specificity and PPV/NPV is that while sensitivity and specificity are not influenced by the prevalence of disease in the population, PPV and NPV are greatly influenced by it. Lower prevalence of the disease will result in lower PPV and higher NPV, and vice versa.[13] This can be seen in the example shown in [Table 3]. Since the PPV and NPV depend on the prevalence of the disease in the population being tested, these values are meaningful only if the prevalence of disease in the population in which the test was evaluated is the same as that in the population now being tested (in real-world usage). This reduces the significance of these predictive values. Alternatively, PPV and NPV can be recalculated for the current population to which the test is being applied, if the sensitivity and specificity of the test and the disease prevalence in that population are known. This is performed using Bayes' theorem:

$$PPV = \frac{Se \times Prev}{Se \times Prev + (1 - Sp)(1 - Prev)}$$

$$NPV = \frac{Sp \times (1 - Prev)}{Sp \times (1 - Prev) + (1 - Se) \times Prev}$$
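The effect of prevalence on the predictive values can be made concrete with a short sketch of the Bayes' theorem recalculation, in which PPV = Se × Prev / [Se × Prev + (1 − Sp)(1 − Prev)] and NPV = Sp × (1 − Prev) / [Sp × (1 − Prev) + (1 − Se) × Prev]. The function name and the test characteristics (sensitivity and specificity of 90%) are hypothetical and are not the values of Table 3.

```python
def predictive_values(se, sp, prev):
    """PPV and NPV for a given sensitivity, specificity and prevalence."""
    ppv = se * prev / (se * prev + (1 - sp) * (1 - prev))
    npv = sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)
    return ppv, npv


# Same hypothetical test applied to populations with different prevalence.
for prev in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(se=0.9, sp=0.9, prev=prev)
    print(f"prevalence {prev:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With these inputs the PPV falls steeply as the disease becomes rarer while the NPV rises, even though the sensitivity and specificity of the test are unchanged.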



Likelihood ratios

LR+ and LR− offer a key advantage over PPV and NPV: because they are derived from sensitivity and specificity alone, they are not influenced by the prevalence of disease.[14] LR+ is the ratio between the probability of the test being positive in those with the disease and the probability of the test being positive in those not diseased ([a/(a + c)]/[b/(b + d)]). Likewise, LR− is the ratio between the probability of the test being negative in the diseased and the probability of the test being negative in those not diseased ([c/(a + c)]/[d/(b + d)]). Therefore, a higher LR+ and a lower LR− indicate a superior test. LR+ can also be calculated as Sensitivity/(1 − Specificity), and LR− as (1 − Sensitivity)/Specificity.

False positive and negative rates

FPR is calculated as the false positives divided by the total negatives given by the gold standard test (i.e., all who are not diseased; b/(b + d)). FNR is the false negatives divided by the total positives given by the gold standard test (i.e., all those who are diseased; c/(a + c)).
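The error rates and likelihood ratios can be computed from the same 2 × 2 cells. The following sketch again uses the cell labels a to d from the text with hypothetical counts; it does not guard against a zero denominator, which would occur for LR+ if there were no false positives.

```python
def rates_and_likelihood_ratios(a, b, c, d):
    """FPR, FNR, LR+ and LR- from 2 x 2 table cells (a=TP, b=FP, c=FN, d=TN)."""
    fpr = b / (b + d)  # equals 1 - specificity
    fnr = c / (a + c)  # equals 1 - sensitivity
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    lr_pos = sensitivity / fpr   # Sensitivity / (1 - Specificity)
    lr_neg = fnr / specificity   # (1 - Sensitivity) / Specificity
    return fpr, fnr, lr_pos, lr_neg


fpr, fnr, lr_pos, lr_neg = rates_and_likelihood_ratios(a=90, b=20, c=10, d=180)
print(f"FPR {fpr:.2f}, FNR {fnr:.2f}, LR+ {lr_pos:.2f}, LR- {lr_neg:.2f}")
```

The comments show why FPR and FNR add no independent information: they are the complements of specificity and sensitivity, respectively.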

 Testing Significance of Differences Obtained between Two or More Tests - Choosing the Correct Statistical Tests

So far, we have described the performance characteristics of the diagnostic test under evaluation. We might have found that one test seemingly performs better than the other. However, we do not know whether the differences in the results obtained by the two tests are statistically significant or mere chance findings. To decide this, the following process is suggested:

1. Set up a null hypothesis. In the present situation, the null hypothesis states that there is no real difference between the proportions of positive results obtained by the diagnostic methods (in the case of binary qualitative results) or between the means (in the case of quantitative results).
2. Set an appropriate alpha level, which is the highest risk of making a false positive error that the investigator is willing to accept. This is usually set at 0.05.
3. Perform a suitable test of statistical significance:

A large number of statistical tests are available for estimation of the P value [Table 4]. However, choosing the right test is important to avoid major errors such as those reported in the past.[15] Each of these tests has certain assumptions that help in choosing the correct one, such as quantitative or qualitative (with two or more outcomes, or ordinal) variables, dependent or independent sample sets, number of groups to be compared, sample size, Gaussian or non-Gaussian distribution and homogeneity of variances.

Even though the data from a single study may have a non-Gaussian distribution, the statistical tests for Gaussian distribution are frequently used, particularly for large sample sizes (usually > 30). This is explained on the basis of the central limit theorem, according to which, on repeating the experiment an infinite number of times with different sample sets obtained from the same population, the individual means obtained from each repeat experiment will have a normal or Gaussian distribution around the population mean.[16]

The majority of microbiological studies evaluating diagnostic tests generate qualitative (binary) data on dependent (paired) samples; for these studies, McNemar's test is most appropriate. For comparison of more than two dependent groups, Cochran's Q test is applied. Fisher's exact test may be used if the samples are unpaired, particularly when the sample size is small, with the expected value in any cell of the 2 × 2 table <5. The Chi-square test may also be used when we are comparing two or more diagnostic tests with two or more outcomes, provided the sample size is large. Yates' correction is applied with the Chi-square test for sample sizes below thirty; however, the expected value in any cell should not be <5.

The choice of tests for studies generating quantitative data is entirely different. For comparison of two groups with paired or unpaired data, use the paired t-test or the unpaired t-test, respectively. Similarly, when comparing more than two tests, use repeated measures ANOVA or the one-way ANOVA test for dependent or independent variables, respectively. If the sample size is small (usually <30, with non-Gaussian distribution) or for ordinal data, the Wilcoxon signed-rank test or the Mann–Whitney U test is used for comparison of two paired or unpaired groups, respectively; the Kruskal–Wallis test is used for comparison of more than two independent groups, whereas the Friedman test is used for more than two dependent groups.

4. Compare the P value with the previously chosen alpha value.
5. Reject or fail to reject the null hypothesis. If the P value obtained is lower than the alpha level (0.05), the null hypothesis is rejected, indicating that the observed differences are statistically significant; if it is > 0.05, we fail to reject the null hypothesis.
6. Interpret the result correctly.

Nickerson states that null hypothesis statistical testing can provide important information regarding the interpretation of experimental data; however, it can easily be misunderstood and misinterpreted. The reader is referred to this excellent review, in which important misconceptions are cleared up with detailed examples, such as the (1) 'belief that P is the probability that the null hypothesis is true and that 1 − P is the probability that the alternative hypothesis is true' whereas, in fact, the P value is the probability that a difference as large as the one observed might have occurred by chance if the null hypothesis were true; (2) 'belief that rejection of the null hypothesis establishes the truth of a theory that predicts it to be false'; (3) 'belief that a small P is evidence that the results are replicable (or reliable)'; (4) 'belief that statistical significance means theoretical or practical significance', among several others.[17]

While interpreting the P values, it is also important to keep in mind that low P value is not a proof of 'difference', likewise, the high P value is not a proof of 'no difference'. High or low P values also do not indicate greater or lesser magnitude of the difference between the tests.[11],[18]{Table 4}
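For the most common situation described above, two tests with binary results performed on the same specimens, McNemar's test can be sketched as follows. This sketch uses the exact (binomial) form of the test, which looks only at the discordant pairs, that is, specimens positive by one test and negative by the other; under the null hypothesis a discordant specimen is equally likely to fall either way. The function name, argument names and counts are hypothetical, and statistical packages may default to a chi-square approximation instead.

```python
from math import comb


def mcnemar_exact(n_pos_neg, n_neg_pos):
    """Two-sided exact McNemar P value from the two discordant-pair counts.

    n_pos_neg: specimens positive by test A and negative by test B.
    n_neg_pos: specimens negative by test A and positive by test B.
    """
    n = n_pos_neg + n_neg_pos
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(n_pos_neg, n_neg_pos)
    # Probability of k or fewer "successes" in n fair coin flips.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical data: 15 specimens positive only by test A, 5 only by test B.
p_value = mcnemar_exact(15, 5)
print(f"P = {p_value:.4f}")
```

The resulting P value would then be compared with the alpha level chosen in step 2; concordant specimens (positive or negative by both tests) do not enter the calculation at all.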

 More Definitions and Concepts Like Reproducibility, Repeatability, Accuracy, Precision, Etc.

Repeatability and reproducibility are similar-sounding terms with distinct meanings.[19] Repeatability denotes the closeness of independent results to one another when the test is performed again and again with no conditions changed (e.g., testing method, laboratory, physical conditions, personnel, etc., remaining the same). On the other hand, reproducibility refers to the closeness of independent results to one another when the same test is performed again on the identical specimen, but under different conditions (e.g., different personnel, different kit lots, different laboratories). Therefore, depending on the factor that varies, reproducibility may be defined in terms of intra- or interobserver, lot-to-lot, or laboratory-to-laboratory reproducibility.

The terms 'accuracy' and 'precision' are another pair of similar-sounding terms that need to be distinguished. Accuracy refers to how close the value obtained by a test is to the 'true value' on average. On the other hand, precision measures the spread of individual values, or how close the values are to each other on repeat testing.[20] For example, a new test measuring serum procalcitonin levels in patients with sepsis gives values that are centred around the true value but vary widely on repeat testing; another test gives values that are very close to each other but far from the true value; yet another test gives values close to one another as well as to the true value. The first test is accurate but not precise, the second is precise but not accurate, while the third is both accurate and precise [Table 5].{Table 5}

 Correlation and Agreement

It is easy to confuse these two terms, even though they have entirely different uses. Correlation attempts to determine whether the value of one variable increases or decreases as the value of a second, independently measured variable increases (positive or negative correlation, respectively). Agreement, on the other hand, measures the extent to which the results obtained by two diagnostic tests measuring the same variable are similar. Correlation is generally used when one wants to find an association between two distinct variables (e.g., serum procalcitonin levels and acute physiology and chronic health evaluation II score).[8] It is not recommended for repeated measures data. Agreement, in contrast, is useful when the same variable is measured by two different tests. Correlation evaluates the relationship between two variables and not their differences, and is, therefore, not recommended as a method for assessing the comparability between methods.

Correlation between two variables is usually presented using either of the two statistical coefficients: Pearson's (r) and Spearman's (ρ) coefficients. The details of these are beyond the scope of this review, but, briefly, Pearson's correlation is based on the assumption that the data are normally distributed (Gaussian distribution), and there exists a linear and homoscedastic relationship between the two variables. Spearman's correlation has no such assumptions; the only condition is that scores on one variable must be monotonically (either entirely increasing or entirely decreasing) related to the other variable.[21]

Agreement may be presented as 'percent agreement', that is, the number of samples that have the same result by both tests divided by the total number of samples, expressed as a percentage [Table 6]. Percent agreement can vary from 0% to 100%. Another statistic, developed by Cohen in 1960 and known as Cohen's kappa statistic, provides more robust results as it accounts for the possibility that the tests may sometimes agree by chance. Kappa is expressed as the ratio of the observed improvement over chance agreement to the maximum possible improvement over chance agreement [Table 6]. The value of the kappa coefficient varies from −1 (perfect disagreement) to +1 (perfect agreement); a value of '0' indicates agreement no better than that expected by chance. One major disadvantage of agreement measures is that they are not a measure of correctness; both diagnostic tests could agree and be wrong.{Table 6}
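Both agreement measures can be computed from a 2 × 2 table that cross-classifies the results of the two tests. In the sketch below the argument names and counts are hypothetical; chance agreement is estimated from the marginal totals, which is the usual definition of Cohen's kappa.

```python
def agreement(both_pos, only_first_pos, only_second_pos, both_neg):
    """Percent agreement and Cohen's kappa for two tests with binary results."""
    n = both_pos + only_first_pos + only_second_pos + both_neg
    observed = (both_pos + both_neg) / n
    # Agreement expected by chance, from each test's overall positivity rate.
    first_pos = (both_pos + only_first_pos) / n
    second_pos = (both_pos + only_second_pos) / n
    expected = first_pos * second_pos + (1 - first_pos) * (1 - second_pos)
    kappa = (observed - expected) / (1 - expected)
    return 100 * observed, kappa


# Hypothetical data: 100 specimens tested by both methods.
percent, kappa = agreement(both_pos=40, only_first_pos=10,
                           only_second_pos=5, both_neg=45)
print(f"percent agreement {percent:.0f}%, kappa {kappa:.2f}")
```

In this example the tests agree on 85% of specimens, but half of that agreement would be expected by chance alone, so kappa is noticeably lower than the raw percentage suggests.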

For determining agreement between diagnostic tests with quantitative results, Bland and Altman proposed a graphical plot in 1986, the Bland–Altman plot. In this, the differences (or alternatively the ratios) between the two diagnostic techniques are plotted against the averages of the two diagnostic techniques, and the discrepancy between the two techniques is noted. For details, the reader is referred to the original article, which has received more than 35,000 citations.[22]
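The quantities behind a Bland–Altman plot are simple to compute. The sketch below returns the per-specimen averages and differences (the x and y coordinates of the plot) together with the mean difference and the conventional 95% limits of agreement, taken here as the mean difference ± 1.96 standard deviations; the limits are an addition not described in the text above, and the function name and measurements are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b, z=1.96):
    """Coordinates and summary statistics for a Bland-Altman plot."""
    averages = [(a + b) / 2 for a, b in zip(method_a, method_b)]  # x-axis
    differences = [a - b for a, b in zip(method_a, method_b)]     # y-axis
    bias = mean(differences)
    spread = stdev(differences)  # sample standard deviation
    limits = (bias - z * spread, bias + z * spread)
    return averages, differences, bias, limits


# Hypothetical paired measurements of the same specimens by two assays.
assay_a = [10, 12, 15, 20, 25]
assay_b = [11, 11, 16, 18, 26]
averages, differences, bias, limits = bland_altman(assay_a, assay_b)
print(f"bias {bias}, limits of agreement {limits[0]:.2f} to {limits[1]:.2f}")
```

Plotting `differences` against `averages` with horizontal lines at the bias and at the two limits gives the usual form of the plot; whether the limits are narrow enough is a clinical judgement rather than a statistical one.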

Receiver Operator Characteristic Curve

Although less frequently, situations arise in the field of diagnostic microbiology where a new test is devised that generates numerical values. This test has to be standardised, and cut-off points have to be decided above which the levels are said to be positive and below which they are said to be negative. It is commonly known that as we raise the cut-off point, the sensitivity goes down and specificity increases, while if we decrease the cut-off, then sensitivity goes up and specificity decreases. Therefore, trade-off has to be made such that at the cut-off point, sensitivity and specificity are optimum.[23] This is made possible by preparing a receiver operator characteristic (ROC) curve.

The ROC curve was initially used during World War II for the analysis of radar signals.[24] It is now extensively used in medical science to determine the optimum cut-off value that gives the best possible combination of sensitivity and specificity.

For plotting a ROC curve, the sensitivities and specificities of the diagnostic test at various cut-off points are first calculated. Subsequently, the sensitivities are plotted on the y-axis, and the corresponding (1 − specificity) values are plotted on the x-axis. Thus, each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular cut-off threshold. As a rule, on selecting a higher criterion value, the false positive fraction will decrease and the specificity will increase, but on the other hand, the true positive fraction and hence the sensitivity will decrease.
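The construction described above can be sketched as follows. The example assumes that higher test values indicate disease, so that a value at or above the cut-off is called positive, and that the true disease status of every sample is known from a reference test. The function name and the values are hypothetical.

```python
def roc_points(diseased, non_diseased, cutoffs):
    """Return (cut-off, sensitivity, 1 - specificity) for each cut-off.

    A test value greater than or equal to the cut-off is called positive.
    """
    points = []
    for cutoff in cutoffs:
        true_pos = sum(v >= cutoff for v in diseased)
        false_pos = sum(v >= cutoff for v in non_diseased)
        sensitivity = true_pos / len(diseased)           # y-axis
        false_pos_fraction = false_pos / len(non_diseased)  # x-axis
        points.append((cutoff, sensitivity, false_pos_fraction))
    return points


# Hypothetical test values in subjects with and without the disease
diseased = [3, 5, 6, 7, 8, 9]
non_diseased = [1, 2, 3, 4, 5, 6]
points = roc_points(diseased, non_diseased, cutoffs=[2, 4, 6, 8])
for cutoff, sens, fpf in points:
    print(f"cut-off {cutoff}: sensitivity {sens:.2f}, 1 - specificity {fpf:.2f}")
```

As the cut-off rises, both the sensitivity and the false positive fraction fall, which is the trade-off the curve displays.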

The ROC curve is interpreted as follows:

The area under the ROC curve (AUC) is a measure of how well the diagnostic test can distinguish between the diseased and the non-diseased.
The curve for a test with perfect discrimination passes through the upper left corner of the graph, where both sensitivity and specificity are 100%, with no false positives or false negatives.
The closer the curve is to the upper left corner, the better the test.
If a point falls on the 45° diagonal line (also known as the no-benefit line or no-discrimination line), it indicates completely random guessing.
Points below the 45° diagonal line indicate outcomes even worse than random guessing.

Youden's J statistic is often used in conjunction with the ROC curve and is calculated as Youden's J index = Sensitivity + Specificity − 1. This index is calculated for all possible cut-off points, and the cut-off point with the highest J index is chosen as the optimum cut-off point. For a test that performs no worse than chance, its value ranges from 0 to 1. A J index of zero indicates that the test is basically useless, giving the same proportion of positive results in those with the disease and those without it. A J index of 1 indicates a perfect test that completely separates the diseased from the non-diseased. On the ROC graph, the J index is represented by the maximum vertical distance between the ROC curve and the 45° diagonal line [Figure 1]. The point on the curve at which this distance is greatest indicates the best possible combination of sensitivity and specificity and thus the optimum cut-off point.{Figure 1}
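The selection of a cut-off by Youden's J index, together with a rough AUC, can be sketched as below. The AUC here is approximated by the trapezoidal rule over the available points plus the two corners of the graph; this is one common approximation, assumed for the example and not a method prescribed in this review. The function names and the sensitivity/specificity values are hypothetical.

```python
def youden_optimum(results):
    """Pick the (cut-off, sensitivity, specificity) triple with the highest J."""
    return max(results, key=lambda r: r[1] + r[2] - 1)


def auc_trapezoid(results):
    """Approximate the area under the ROC curve by the trapezoidal rule."""
    # ROC coordinates are (1 - specificity, sensitivity); add both corners
    pts = sorted({(1 - sp, se) for _, se, sp in results} | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Hypothetical (cut-off, sensitivity, specificity) triples for one test
results = [
    (10, 0.95, 0.40),
    (20, 0.90, 0.70),
    (30, 0.80, 0.90),
    (40, 0.50, 0.98),
]
best_cutoff, best_sens, best_spec = youden_optimum(results)
best_j = best_sens + best_spec - 1
auc = auc_trapezoid(results)
print(f"Optimum cut-off: {best_cutoff} (J = {best_j:.2f}); AUC = {auc:.3f}")
```

With only a handful of cut-offs the trapezoidal AUC is a coarse estimate; statistical packages generally evaluate every observed value as a candidate cut-off.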

 What to Do in the Absence of a True Gold Standard Test?

A major fallacy in the statistical evaluation of diagnostic tests is the assumption that the 'gold standard' test is an absolute reference standard which can perfectly distinguish between the diseased and the non-diseased. However, this is rarely the case. While the gold standard may give the most accurate results among the available tests, in most circumstances it is far from perfect. Such tests have often simply been in use long enough for people to believe in them. Nevertheless, in the absence of any other diagnostic modality, they are used to classify individuals as diseased or not diseased. In some situations, the gold standard test may be too invasive to perform and associated with great risks, for example, splenic biopsy for Kala-Azar. In other situations, tests that detect the microorganism directly (e.g., culture, PCR, antigen detection tests, etc.) cannot be suitably compared with serological tests detecting antibodies, as they are likely to be positive at different stages of the disease.

As technology advances, new problems have crept in; newer tests under evaluation are often more accurate than the previously set 'gold standards'. A classical example is the use of molecular techniques for the diagnosis of bacterial infections. The previous 'gold standard' test was traditional culture; however, for a plethora of reasons, such as prior antibiotic therapy, a non-viable state of the bacteria, and inadequate sample collection and transport, bacteria may fail to grow in culture. Thus, if we still consider culture to be the gold standard, all the extra positive results obtained by PCR would be classified as false positives, resulting in an apparently low specificity; this defeats the very purpose for which these tests were developed. Hence, the statistical evaluation in such circumstances cannot be the same as has been discussed so far. Inaccuracies in the 'gold standard' tests have led researchers to replace this term with the non-absolute term 'reference standard', thereby indicating their imperfect nature.[25]
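The effect described above can be illustrated with a small worked calculation. All figures are hypothetical: 100 truly infected and 900 uninfected patients, a culture that detects 60% of infections and never gives a false positive, and a PCR with a true sensitivity of 95% and a true specificity of 99%. The sketch further assumes that the two tests err independently of each other given the true infection status, which may not hold in practice.

```python
def apparent_accuracy(n_infected, n_uninfected,
                      new_sens, new_spec, ref_sens, ref_spec):
    """Expected sensitivity and specificity of a new test when it is
    scored against an imperfect reference test.

    Assumes the two tests err independently given true infection status.
    """
    # Expected classification of patients by the reference test
    ref_pos_inf = n_infected * ref_sens
    ref_neg_inf = n_infected - ref_pos_inf
    ref_neg_uninf = n_uninfected * ref_spec
    ref_pos_uninf = n_uninfected - ref_neg_uninf

    # New-test results within reference-positive and reference-negative groups
    new_pos_in_ref_pos = ref_pos_inf * new_sens + ref_pos_uninf * (1 - new_spec)
    new_neg_in_ref_neg = ref_neg_inf * (1 - new_sens) + ref_neg_uninf * new_spec

    apparent_sens = new_pos_in_ref_pos / (ref_pos_inf + ref_pos_uninf)
    apparent_spec = new_neg_in_ref_neg / (ref_neg_inf + ref_neg_uninf)
    return apparent_sens, apparent_spec


# Hypothetical PCR evaluated against culture as the reference
app_sens, app_spec = apparent_accuracy(
    n_infected=100, n_uninfected=900,
    new_sens=0.95, new_spec=0.99,  # assumed true PCR performance
    ref_sens=0.60, ref_spec=1.00,  # assumed culture performance
)
print(f"Apparent sensitivity: {app_sens:.3f}")
print(f"Apparent specificity: {app_spec:.3f}")
```

In this scenario, 38 of the 40 infections missed by culture are PCR-positive and are counted as false positives, so the specificity measured against culture falls below the assumed true value of 99%.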

Several methods suggested by other authors may provide solutions on an individual basis:

In situations where there is no gold standard, or it is insufficiently accurate, it has been opined that the concepts of sensitivity, specificity, etc., should not be used. Instead, the 2 × 2 table of results comparing the new test with the imperfect reference test, together with measures of agreement between the tests, may be better used, along with discussion and investigation of the causes of disagreement; however, these provide incomplete knowledge about the performance of the tests.[9]
In the absence of a true 'gold standard' test, the best available test may be used as the 'reference standard'. This is likely to introduce some errors in calculation, resulting in what has been called 'reference standard bias'. In such situations, corrections should be made for the imperfections in the reference standard by making adjustments based on previous research regarding the degree and causes of imperfection in the reference test.[25]
Composite reference standard: A 'composite reference standard' may be created using a combination of two or more laboratory tests and/or clinical features which define the case. This may be used as an alternative reference standard, but it is essential that the components are chosen using standard methods and based on strong evidence. In combination testing, considering at least one positive result as positive increases the sensitivity, whereas setting a minimum requirement of more than one positive result increases the specificity.[26]
Discrepant analysis: In this method, the discordant results between the imperfect reference test and the test under evaluation are retested by another test of good accuracy (generally more invasive or costlier).[25]
Latent class analysis: One classic example of this type is the studies attempting to evaluate tests for the diagnosis of latent tuberculosis, which are hampered by the lack of any confirmatory gold standard test. This methodology has been demonstrated by Girardi et al., who employed results from at least three different diagnostic tests on the same individual, based on the concept that different tests for the same disease are influenced by a common latent variable, the disease status, which cannot be measured directly.[27]
Correlation with clinical data: The results of the tests may be correlated with the clinically available data such as the distribution of disease in time, place and person, people at risk, known exposure, clinical presentation, panel diagnosis (diagnosis made by a panel of experts in the field), etc.
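The statement on combination testing can be made concrete with a short sketch for two component tests. It assumes that the results of the two tests are independent of each other given the true disease status, which is a simplifying assumption and often only approximately true. The function names and the sensitivity and specificity values are hypothetical.

```python
def either_positive(sens1, spec1, sens2, spec2):
    """Composite called positive if at least one test is positive.

    Assumes the two tests are independent given true disease status.
    """
    sens = 1 - (1 - sens1) * (1 - sens2)  # missed only if both tests miss
    spec = spec1 * spec2                  # negative only if both are negative
    return sens, spec


def both_positive(sens1, spec1, sens2, spec2):
    """Composite called positive only if both tests are positive.

    Assumes the two tests are independent given true disease status.
    """
    sens = sens1 * sens2                  # detected only if both detect
    spec = 1 - (1 - spec1) * (1 - spec2)  # false positive only if both err
    return sens, spec


# Hypothetical component tests: (sensitivity, specificity)
test1 = (0.80, 0.90)
test2 = (0.70, 0.95)
or_sens, or_spec = either_positive(*test1, *test2)
and_sens, and_spec = both_positive(*test1, *test2)
print(f"At least one positive: sensitivity {or_sens:.3f}, specificity {or_spec:.3f}")
print(f"Both positive:         sensitivity {and_sens:.3f}, specificity {and_spec:.3f}")
```

Under these assumptions, the 'at least one positive' rule is more sensitive than either test alone but less specific, and the 'both positive' rule shows the reverse.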

None of the above methods can, however, replace a true gold standard test. Whenever any of the above methods are employed, it is imperative to clearly mention the methods that have been used for statistical calculations and comparison.

 Statistical Software and Online Statistical Calculators

Rapid progress in information technology has brought statistical calculations within the reach of all researchers. Previously, manual calculations required great effort and a lot of time, and often a dedicated statistical expert. Now, however, several software packages are available for all commonly used operating systems. These can be purchased and downloaded online. Some of the popular software includes Statistical Package for Social Sciences and packages from GraphPad Software, Inc. (California, USA) and Minitab Inc. (Pennsylvania, USA), etc. EpiInfo is free software available for Microsoft Windows, developed by the Centers for Disease Control and Prevention, Atlanta, for epidemiological purposes. For new users, entering and analysing data can be a daunting task; however, several tutorial texts and videos are available on the web to aid them.

For calculations and analysis of data not involving a very large number of samples and requiring simpler statistical tests, numerous options are available in the form of online website-based calculators. Most of them provide this service free of cost, while also providing suggestions, brief explanations and interpretation of the statistical tests. Thus, these self-explanatory online calculators can be easily used by novices. Many such calculators are available.

 Beyond Statistics: Operational Characteristics of the Tests in Real World Scenario

A diagnostic test may seem ideal on paper, yet in a real world scenario it may turn out to be the least appropriate test. Hence, it is essential to understand that a statistically sound test must also be evaluated for its practical applicability and usability for routine diagnosis.[5] The host, agent and environmental factors should be the same in the population in which the test was evaluated and in the population in which it is intended for use. Minor genetic variations in the agent may render the test useless in a different geographical region. Likewise, host characteristics are important, particularly in the case of serological tests, where people from one region may already have high antibody titres due to continual exposure (e.g., Widal test for diagnosis of typhoid fever or Mantoux test for tuberculosis).

The choice of the test also depends on the purpose for which the test is intended to be used. The test may be useful for patient management when clinical features suggest a possible diagnosis, or it may be useful for screening patients who are at risk of a particular infection. A test may be required for surveillance, to determine the prevalence of disease in a community. Some tests may be used for monitoring the progress of disease (such as CD4 cell count) or to determine prognosis. It is generally recommended that a test for screening and surveillance should have high sensitivity, even if the specificity has to be sacrificed to some extent. On the other hand, a test required for patient care ought to have good specificity.

Then, there are practical issues that require attention, such as cost-effectiveness, the time required to perform the test, the time required after infection for the test to become positive, technical simplicity and ease of use, applicability of the test in peripheral regions, stability of the test in ambient conditions, shelf life, invasiveness of the test and acceptability to the users.


While there is a huge amount of information available on statistics, a microbiologist is not expected to be aware of all of it. This review was, therefore, not designed as a detailed operational manual. Basic principles were discussed in this review which should aid most researchers to design their studies and statistically analyse the results. Nevertheless, it is advisable to supplement this review with more detailed descriptions of statistical methods, guides and references whenever required. The advice of a professional statistician should always be sought whenever a need is felt, for it is always better to err on the side of safety.

Financial support and sponsorship


Conflicts of interest

There are no conflicts of interest.


1. Lang T. Twenty statistical errors even you can find in biomedical research articles. Croat Med J 2004;45:361-70.
2. Strasak AM, Zaman Q, Pfeiffer KP, Göbel G, Ulmer H. Statistical errors in medical research – A review of common pitfalls. Swiss Med Wkly 2007;137:44-9.
3. Ercan I, Demirtas H. Statistical errors in medical publication. Biom Biostat Int J 2015;2:21.
4. Simon RM, Paik S, Hayes DF. Use of archived specimens in evaluation of prognostic and predictive biomarkers. J Natl Cancer Inst 2009;101:1446-52.
5. Banoo S, Bell D, Bossuyt P, Herring A, Mabey D, Poole F, et al. Evaluation of diagnostic tests for infectious diseases: General principles. Nat Rev Microbiol 2006;4 9 Suppl: S17-29.
6. Ganesh GN, Hiware SK, Shinde HT, Mahatme MS. Basic biostatistics for post-graduate students. Indian J Pharmacol 2012;44:435-42.
7. Ilstrup DM. Statistical methods in microbiology. Clin Microbiol Rev 1990;3:219-26.
8. Meisner M, Adina H, Schmidt J. Correlation of procalcitonin and C-reactive protein to inflammation, complications, and outcome during the intensive care unit course of multiple-trauma patients. Crit Care 2006;10:R1.
9. U.S. Department of Health and Human Services. Food and Drug Administration. Guidance for Industry and FDA Staff. Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests; 2007. Available from: [Last accessed on 2016 Jun 20].
10. Jones SR, Carley S, Harrison M. An introduction to power and sample size estimation. Emerg Med J 2003;20:453-8.
11. Kyrgidis A, Triaridis S. Methods and biostatistics: A concise guide for peer reviewers. Hippokratia 2010;14 Suppl 1:13-22.
12. Hajian-Tilaki K. Sample size estimation in diagnostic test studies of biomedical informatics. J Biomed Inform 2014;48:193-204.
13. Manja V, Lakshminrusimha S. Principles of use of biostatistics in research. Neoreviews 2014;15:e133-50.
14. Attia J. Moving beyond sensitivity and specificity: Using likelihood ratios to help interpret diagnostic tests. Aust Prescr 2003;26:111-3.
15. Ranganathan P. The (mis) use of statistics: Which test where? Perspect Clin Res 2014;5:197.
16. Singh A, Lucas AF, Dalpatadu RJ, Murphy DJ. Casino games and the central limit theorem. UNLV Gaming Res Rev J 2013;17:45-61.
17. Nickerson RS. Null hypothesis significance testing: A review of an old and continuing controversy. Psychol Methods 2000;5:241-301.
18. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. Eur J Epidemiol 2016;31:337-50.
19. Slezák P, Waczulíková I. Reproducibility and repeatability. Physiol Res 2011;60:203-4.
20. Katz DL, Elmore JG, Wild DMG, Lucan SC. Bivariate Analysis. Jekel's Epidemiology, Biostatistics, Preventive Medicine, and Public Health: 4th ed. Philadelphia, USA: Elsevier Saunders; 2014. p. 81-90.
21. Katz DL, Elmore JG, Wild DMG, Lucan SC. Bivariate Analysis. Jekel's Epidemiology, Biostatistics, Preventive Medicine, and Public Health: 4th ed. Philadelphia, USA: Elsevier Saunders; 2014. p. 134-52.
22. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307-10.
23. Kumar R, Indrayan A. Receiver operating characteristic (ROC) curve for medical researchers. Indian Pediatr 2011;48:277-87.
24. Fan J, Upadhye S, Worster A. Understanding receiver operating characteristic (ROC) curves. CJEM 2006;8:19-20.
25. Rutjes AW, Reitsma JB, Coomarasamy A, Khan KS, Bossuyt PM. Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technol Assess 2007;11:iii, ix-51.
26. Weinstein S, Obuchowski NA, Lieber ML. Clinical evaluation of diagnostic tests. AJR Am J Roentgenol 2005;184:14-9.
27. Girardi E, Angeletti C, Puro V, Sorrentino R, Magnavita N, Vincenti D, et al. Estimating diagnostic accuracy of tests for latent tuberculosis infection without a gold standard among healthcare workers. Euro Surveill 2009;14. pii: 19373.