This document discusses receiver operating characteristic (ROC) curves and their use in evaluating diagnostic tests. It begins by defining sensitivity and specificity as metrics for diagnostic test performance. It then explains that ROC curves plot the sensitivity vs 1-specificity for varying diagnostic thresholds. The area under the ROC curve (AUC) provides a single measure of test accuracy. Methods for calculating AUC include parametric and nonparametric approaches. The document also discusses extensions of ROC analysis like free-response ROC (FROC) curves which evaluate tests with multiple lesion detections. It concludes by outlining a study that used JAFROC analysis to evaluate the effect of a computer-aided detection (CAD) system on radiologist performance in detecting lung nodules on
(20180524) vuno seminar roc and extension
3. ▪ Most important and widely used metrics for evaluating the performance of a diagnostic test
– Sensitivity : number of true positive decisions / number of positive cases
– Specificity : number of true negative decisions / number of negative cases
Performance Measures
Diagnostic Test
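As a minimal sketch of the two definitions above, the function below computes both metrics from confusion-matrix counts. The function name and the example counts are hypothetical, chosen only for illustration.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positives / all positive cases
    specificity = tn / (tn + fp)  # true negatives / all negative cases
    return sensitivity, specificity


# Hypothetical counts: 100 positive cases and 100 negative cases.
sens, spec = sensitivity_specificity(tp=80, fn=20, tn=90, fp=10)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```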
4. ▪ Diagnostic decision making itself is ambiguous
– No clear-cut boundary between ‘Normal’ and ‘Abnormal’
– Therefore it is more natural to rate the case using some scale.
– Ex) Five-point scale for nodules in chest radiograph
• 1(definitely benign), 2(probably benign), 3(possibly malignant), 4(probably malignant), 5(definitely malignant)
• There are four cut-off values : ≥2, ≥3, ≥4, ≥5
– Each cut-off yields a (sensitivity, specificity) pair, which can be plotted on a graph with sensitivity as the y-axis and (1-specificity) as the x-axis
– These discrete points are called ‘operating points’.
– We need a way to assess the performance of diagnostic test independently of the decision threshold
Why Do We Need a Curve for Performance Measure?
Operating Points
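The step from ratings to operating points can be sketched as follows: a case is called positive when its rating is at or above the cut-off, and each cut-off gives one (1-specificity, sensitivity) pair. The function name and the ratings are hypothetical.

```python
def operating_points(pos_ratings, neg_ratings, cutoffs=(2, 3, 4, 5)):
    """Return one (1 - specificity, sensitivity) pair per cut-off.

    A case is called positive when its rating is >= the cut-off.
    """
    points = []
    for c in cutoffs:
        sens = sum(r >= c for r in pos_ratings) / len(pos_ratings)
        fpr = sum(r >= c for r in neg_ratings) / len(neg_ratings)
        points.append((fpr, sens))
    return points


# Hypothetical five-point ratings for 5 malignant and 5 benign cases.
pos = [5, 4, 4, 3, 2]
neg = [1, 1, 2, 3, 1]
for fpr, sens in operating_points(pos, neg):
    print(f"1-specificity={fpr:.1f}, sensitivity={sens:.1f}")
```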
5. ▪ The ROC curve is an estimate of all possible (1-specificity, sensitivity) pairs, obtained from these operating points(A).
– Fitted or Smoothed ROC Curve(B) : Parametric estimation
• Smooth curve estimated from the operating points based on a binormal distribution assumption on the test results for both
positive and negative cases.
– Empirical ROC Curve(C) : Nonparametric estimation
• Connect all operating points with straight lines
▪ Why is it called ROC?
– The term ROC refers to the performance of a human or mechanical observer (the receiver) that has to discriminate between radio signals contaminated by noise and noise alone. It was developed in the 1950s.
Receiver Operating Characteristic
ROC Curve
6. ▪ Even Googler …
Receiver Operating Characteristic
ROC Curve
7. ▪ AUROC or AUC
– Average value of sensitivity over all possible values of specificity
– AUC takes a value between 0 and 1 and is independent of disease prevalence
– An AUC of 1 means a perfectly accurate test, while the practical lower bound is 0.5 (random guessing).
– The rating scheme (discrete or continuous) affects the bias in the estimate of AUC
– It can be interpreted as a figure of merit (FOM): the probability that a randomly chosen positive case is rated higher than a randomly chosen negative case.
▪ Frequentist Method
– Parametric AUC
• Obtained with fitted ROC curve.
• Based on some assumptions (test results are well described by a binormal distribution, the number of sample cases is not extremely small)
– Nonparametric AUC
• Estimated by the summation of trapezoids formed under
empirical ROC curve
• Underestimates AUC when discrete ratings are used.
▪ Bayesian Method
– Exploits a prior or a latent variable to express the unknown disease status
– Especially useful when the ‘gold standard’ is absent or uncertain.
Measure of Overall Diagnostic Performance
Area Under ROC Curve
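A small sketch of the nonparametric AUC: it builds the empirical ROC curve, sums the trapezoids under it, and compares the result with the "probability that a positive case is rated higher than a negative case" interpretation, counting ties as one half. All names and ratings are hypothetical.

```python
def empirical_roc(pos, neg):
    """Empirical ROC points (FPR, TPR), from (0, 0) up to (1, 1)."""
    points = [(0.0, 0.0)]
    for t in sorted(set(pos) | set(neg), reverse=True):
        tpr = sum(r >= t for r in pos) / len(pos)
        fpr = sum(r >= t for r in neg) / len(neg)
        points.append((fpr, tpr))
    return points


def trapezoid_auc(points):
    """Sum of the trapezoids under a piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def pairwise_auc(pos, neg):
    """P(positive rated higher than negative), with ties counted as 0.5."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


pos = [5, 4, 4, 3, 2]  # hypothetical ratings of positive cases
neg = [1, 1, 2, 3, 1]  # hypothetical ratings of negative cases
print(trapezoid_auc(empirical_roc(pos, neg)), pairwise_auc(pos, neg))
```

Both numbers agree, which is why the trapezoidal AUC can be read as a probability.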
8. ▪ BiNormal Assumption
▪ Proof(Caution! Proof by KH, thus not guaranteed)
Fitted ROC Curve
Parametric ROC Curve
http://www.navan.name/roc/
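For reference, the standard textbook form of the binormal model is sketched below. It is not the proof from the slide, and the symbols $\mu_0, \sigma_0$ (negative cases) and $\mu_1, \sigma_1$ (positive cases) are notation assumed here.

```latex
% Test results: X_0 \sim N(\mu_0, \sigma_0^2) for negative cases,
%               X_1 \sim N(\mu_1, \sigma_1^2) for positive cases.
a = \frac{\mu_1 - \mu_0}{\sigma_1}, \qquad b = \frac{\sigma_0}{\sigma_1}

% ROC curve, with \Phi the standard normal CDF:
\mathrm{TPR} = \Phi\!\left(a + b\,\Phi^{-1}(\mathrm{FPR})\right)

% Area under the fitted curve:
\mathrm{AUC} = \Phi\!\left(\frac{a}{\sqrt{1 + b^2}}\right)
```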
9. ▪ AUC can vary according to the sample cases.
– Even with the same diagnostic test, the measured performance varies with the test samples.
– We can therefore give a range of AUC in which the true value lies with a certain degree of confidence.
– A 95% confidence interval is often used.
▪ Computation of confidence interval for AUC
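– Confidence Interval : $\mathrm{AUC} \pm 1.96 \cdot \mathrm{SE}(\mathrm{AUC})$ at the 95% level, with the standard error of Hanley and McNeil (1982):
$\mathrm{SE}(\mathrm{AUC}) = \sqrt{\dfrac{\mathrm{AUC}(1-\mathrm{AUC}) + (n_P - 1)(Q_1 - \mathrm{AUC}^2) + (n_N - 1)(Q_2 - \mathrm{AUC}^2)}{n_P \, n_N}}$
where $n_P$ and $n_N$ are the numbers of positive and negative cases, and
$Q_1 = \dfrac{\mathrm{AUC}}{2 - \mathrm{AUC}}$, $Q_2 = \dfrac{2\,\mathrm{AUC}^2}{1 + \mathrm{AUC}}$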
Assessing Statistical Significance of AUC
Confidence Interval of AUC
J. A. Hanley and B. J. McNeil(1982)
https://pubs.rsna.org/doi/pdf/10.1148/radiology.143.1.7063747
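A minimal sketch of a confidence interval for AUC, assuming the Hanley and McNeil (1982) standard error with the approximations Q1 = AUC/(2 - AUC) and Q2 = 2·AUC²/(1 + AUC). The function names and the example numbers are hypothetical, and the interval is not clipped to [0, 1].

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of AUC, using the Q1/Q2 approximations."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (auc * (1 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)


def auc_confidence_interval(auc, n_pos, n_neg, z=1.96):
    """Normal-approximation interval AUC +/- z * SE (z=1.96 for 95%)."""
    se = hanley_mcneil_se(auc, n_pos, n_neg)
    return auc - z * se, auc + z * se


# Hypothetical study: AUC of 0.85 with 50 positive and 50 negative cases.
low, high = auc_confidence_interval(0.85, n_pos=50, n_neg=50)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```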
10. ▪ Overall performance of different diagnostic tests can be compared using AUCs
– However, equal AUCs do not mean that two tests are identical.
– The equality of two ROC curves can be statistically tested using the binormal parameters ‘a’ and ‘b’, which completely specify the shape of the fitted ROC curve.
▪ Partial AUC
– Depending on the diagnostic situation, the full AUC may not be clinically meaningful.
– For screening a serious disease in a high-risk group, high sensitivity is important.
– For a disease with low prevalence and a risky subsequent confirmatory test, high specificity is important.
– In these cases, we can set a specific FPR range (or sensitivity range) and calculate the mean sensitivity (or FPR) within that range
Comparison of Overall Diagnostic Performance
Comparing AUCs
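The partial AUC idea can be sketched as the area under a piecewise-linear ROC curve restricted to FPR in [0, fpr_max]. Dividing by fpr_max gives the mean sensitivity in that range. The function name, the choice of a range starting at 0, and the ROC points are assumptions for illustration.

```python
def partial_auc(points, fpr_max):
    """Area under a piecewise-linear ROC curve for FPR in [0, fpr_max].

    `points` are (FPR, TPR) pairs sorted by increasing FPR.
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= fpr_max:
            break
        if x1 > fpr_max:
            # Clip the segment at fpr_max by linear interpolation.
            y1 = y0 + (y1 - y0) * (fpr_max - x0) / (x1 - x0)
            x1 = fpr_max
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical empirical ROC points.
roc = [(0.0, 0.0), (0.0, 0.6), (0.2, 0.8), (0.4, 1.0), (1.0, 1.0)]
pauc = partial_auc(roc, fpr_max=0.3)
print(f"partial AUC={pauc:.3f}, mean sensitivity={pauc / 0.3:.3f}")
```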
11. ▪ The Need for Extension to ROC
– ROC can only deal with a binary decision per case and does not take lesion locations into account.
– Localization ROC (LROC) handles predefined regions in the image separately and computes ROC based on the number of regions and their decisions (e.g., left or right lung, or lobe). The readers are informed that there can be at most one lesion per image.
– Both ROC and LROC have trouble handling multiple lesions or suspicious locations in an image.
– The region-of-interest (ROI) method is similar to LROC but deals with regions independently.
– Neither LROC nor the ROI method can account for the correlations among regions in the same image.
– A free-response task means the reader is given no prior information about the number of lesions in the image, and is therefore free to mark any number of lesions (or none).
▪ Free-response ROC(FROC)
– A plot of lesion localization performance in which the y-axis is the fraction of lesions detected and the x-axis is the number of false positives per image.
– The most widely used plot for assessing lesion detection tasks such as lung nodule detection or liver tumor detection.
– A mark is a true positive when the indicated location falls within a specified distance of a true lesion.
– Here, the x-axis has no upper bound.
The Free-Response Task
Free-response ROC
12. ▪ Generation of FROC Curve
– Below, green circles denote true positives while red circles denote false positives.
– The circles are ordered with the confidence level (z) increasing to the right.
– Starting at the extreme right-hand side, at positive infinity, we move the cutoff to the left. Whenever we pass a green circle, we move the operating point up by 1/L, where L is the number of lesions.
– Whenever we pass a red circle, we move the operating point to the right by 1/N, where N is the number of images.
▪ Pros and Cons of FROC
– Pros
• It visualizes the utilization of the rating scale -> Ideally, the FROC curve should end in a plateau.
• We can deal with multiple lesion marks and their corresponding ratings.
– Cons
• It does not account for unmarked non-diseased cases (true negatives), which make up most of the cases in many diagnostic imaging settings.
• The x-axis is unbounded, so the area under the curve cannot serve as a bounded figure of merit.
Interpretation of FROC
Free-response ROC
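The FROC construction (sweep the cutoff from high to low confidence, step up by 1/L at each true positive and right by 1/N at each false positive) can be sketched as below. The data layout, a list of (rating, is_true_positive) marks, is an assumption, and tied ratings are simply processed one mark at a time.

```python
def froc_points(marks, n_lesions, n_images):
    """Return FROC operating points (FPs per image, lesion fraction).

    `marks` is a list of (rating, is_true_positive) tuples pooled
    over all images.
    """
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the cutoff from the highest rating to the lowest.
    for _, is_tp in sorted(marks, key=lambda m: m[0], reverse=True):
        if is_tp:
            tp += 1  # move up by 1/L
        else:
            fp += 1  # move right by 1/N
        points.append((fp / n_images, tp / n_lesions))
    return points


# Hypothetical marks: 4 lesions spread over 2 images.
marks = [(0.9, True), (0.8, False), (0.7, True), (0.4, False), (0.3, False)]
for x, y in froc_points(marks, n_lesions=4, n_images=2):
    print(f"FP/image={x:.2f}, lesion fraction={y:.2f}")
```

In this example the curve ends at 1.5 false positives per image, which shows that the x-axis is not bounded by 1.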
13. ▪ AFROC Definition
– When we change the x-axis of FROC to the false positive fraction, the plot is called alternative FROC or AFROC.
– The plot is constrained to lie within the unit square, so a figure of merit is computable.
– However, AFROC ignores intra-image lesion correlations and is used only in limited situations.
Solving Problem of FROC by Bounding Characteristics
Alternative FROC
15. ▪ Bootstrapping
– A method for evaluating the variance of an estimator
▪ Jackknifing
– Instead of generating a set of random samples, we generate n samples of size n-1 by leaving out one observation at a time.
Bootstrapping and Jackknifing
JAFROC
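A minimal sketch of the jackknife: it recomputes an estimator on each of the n leave-one-out samples and turns the spread of those values into a standard error. The function names are hypothetical. For the sample mean, the result equals the usual standard error s/√n, which makes a convenient sanity check.

```python
import math


def jackknife_se(data, estimator):
    """Jackknife standard error of `estimator` evaluated on `data`."""
    n = len(data)
    # n samples of size n-1, each leaving out one observation.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((v - mean_loo) ** 2 for v in leave_one_out)
    return math.sqrt(variance)


def mean(values):
    return sum(values) / len(values)


print(jackknife_se([1.0, 2.0, 3.0, 4.0, 5.0], mean))
```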
16. ▪ Method for analyzing free-response multiple-reader multiple-case (MRMC) studies.
Jackknife Analysis of Free-Response ROC Data
JAFROC
=> Probability that a lesion rating exceeds non-lesion rating
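The figure of merit stated above can be sketched as a pairwise comparison between every lesion rating and the highest false-positive rating on every normal case. Treating unmarked lesions and unmarked normal cases as rated minus infinity, and counting ties as one half, are assumptions of this sketch rather than a description of the JAFROC software. All names and ratings are hypothetical.

```python
def jafroc_fom(lesion_ratings, normal_case_fp_ratings):
    """Estimate P(lesion rating > highest FP rating on a normal case).

    lesion_ratings: one rating per lesion, None if the lesion was not marked.
    normal_case_fp_ratings: one list of FP ratings per normal case
    (an empty list means the case has no marks).
    """
    unmarked = float("-inf")  # assumption: unmarked means lowest possible rating
    lesions = [unmarked if r is None else r for r in lesion_ratings]
    highest_fp = [max(fps, default=unmarked) for fps in normal_case_fp_ratings]
    score = sum(1.0 if les > fp else 0.5 if les == fp else 0.0
                for les in lesions for fp in highest_fp)
    return score / (len(lesions) * len(highest_fp))


# Hypothetical reader data: 3 lesions (one missed) and 3 normal cases.
print(jafroc_fom([80, 60, None], [[70, 20], [], [50]]))
```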
17. ▪ Excel File Format
– The worksheets must be named Truth, TP and FP.
– The first row of each worksheet is reserved for data labels.
– Truth denotes ground truth information for each image.
– TP = ratings of "true positives", i.e., lesions that are correctly localized.
– FP = ratings of "false positives", i.e., marked normal regions.
Data Format in JAFROC Analysis Software
JAFROC
20. ▪ Inclusion Criteria
– 300 PA and lateral chest radiographs were retrospectively selected from 4 hospitals in the Netherlands
(Radboud University Medical Center, University Medical Center, Academic Medical Center, Meander Medical Center)
– Presence of a solid solitary nodule (< 30mm in diameter, mean 16.2mm) and the availability of a PA and lateral chest radiograph and a chest CT scan obtained within 3 months. (189 negative, 111 positive cases).
– Radiographs showing signs of other diseases (except COPD) were excluded.
– All subjects were older than 40. (44-88 years with average 65 years, 177 male, 123 female.).
– Absence of disease was ascertained by radiographs and CT scans (taken within 6 months) with negative findings.
– To cover a wide range of lesion conspicuities, two experienced radiologists rated the visibility in consensus.
• Category 1(Well visible), Category 2(Moderately subtle), Category 3(Subtle), Category 4(Very subtle)
– Nodule volume was assessed using the CT scan, and the diameter was calculated assuming each nodule to be a sphere.
▪ Image Acquisition
– Chest radiographs were obtained with digital x-ray devices from Agfa Healthcare, Philips Healthcare and Siemens.
▪ Image Processing
– A commercially available CAD (ClearRead +Detect 5.2, Riverain Technologies) was used.
– This CAD is optimized for the detection of nodules between 9 and 30mm in diameter, which are marked by circles.
– Bone suppression images (BSIs) were computed by using software (ClearRead Bone Suppression 2.4, Riverain Technologies) which digitally removes ribs and clavicles.
– Both software packages are FDA approved.
Data
JAFROC in Clinical Applications
21. ▪ Readers
– Five radiologists(5, 13, 3, 17, 17 years of experience), and three residents(2nd-year, 4th year and 4th year).
– The readers had no prior experience with CAD or BSIs
▪ Reading Setting
– Evaluation was performed in different randomized orders.
– Readers reviewed the cases first without and subsequently with the use of CAD.
– BSIs were always available.
– A training session was provided to familiarize the readers with the software (40 cases, 22 with and 18 without nodules)
▪ Reading Method
– Readers marked suspicious regions in the chest radiograph with a degree of suspicion (confidence) that a nodule was present (0: not suspicious, 100: definitely suspicious).
– Readers were allowed to mark multiple regions per image and were not able to change their decisions.
– After the first scoring phase without CAD but with BSIs, CAD marks were automatically displayed and could be toggled on/off.
– The readers were then asked to score new regions, remove regions marked in the first phase, or change the scores of marked regions.
– The readers were informed that at most one nodule is present in each case and that there are more normal cases than nodule cases, but they did not know the exact numbers.
Reading Method
JAFROC in Clinical Applications
22. ▪ Statistics
– Multireader multicase jackknife alternative free-response receiver operating characteristic (JAFROC) analysis was performed.
– A reader finding was considered a TP finding when the mark was within 1cm of the center of the ground-truth annotation.
– As input for the jackknife AFROC analysis, only one reader score per image is used.
– For cases with negative findings, the FP finding with the highest score was used.
– For cases with positive findings, marks at nonlesion locations are ignored and only TP marks are used.
– The AUC, which represents the probability that a lesion is rated higher than a nonlesion in a negative case, was calculated by using the trapezoidal integration method (equivalent to the Wilcoxon rank-sum statistic).
– AUCs without and with the help of CAD were compared with the Dorfman-Berbaum-Metz method (DBMMRMC, ver. 2.33)
Statistical Analysis
JAFROC in Clinical Applications
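The scoring rules on this slide (a mark is a TP within 1 cm of the lesion center, and one score per image) can be sketched as follows. Coordinates in millimetres, the tuple layout of a mark, and taking the highest-rated TP mark when several qualify are assumptions made for this illustration.

```python
import math


def case_score(marks, lesion_center=None, max_dist_mm=10.0):
    """Return the single score used for one case, or None if there is none.

    marks: list of (x_mm, y_mm, rating) reader marks on the image.
    lesion_center: (x_mm, y_mm) of the ground-truth nodule, or None
    for a case with negative findings.
    """
    if lesion_center is None:
        # Negative case: every mark is an FP; keep the highest score.
        ratings = [r for _, _, r in marks]
    else:
        # Positive case: ignore nonlesion marks; keep only TP marks.
        ratings = [r for x, y, r in marks
                   if math.dist((x, y), lesion_center) <= max_dist_mm]
    return max(ratings, default=None)


marks = [(10.0, 10.0, 70), (50.0, 50.0, 90)]  # hypothetical marks
print(case_score(marks, lesion_center=(12.0, 11.0)))  # positive case
print(case_score(marks))  # the same marks on a negative case
```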
OR-DBM MRMC Data Format
(http://perception.radiology.uiowa.edu/Software/ReceiverOperatingCharacteristicROC/MRMCAnalysis/tabid/116/Default.aspx)
23. ▪ Stand-alone CAD
– CAD reached a sensitivity of 74% (82 of 111) at 1.0 FP mark per image (0~5 marks per image).
• 91% (29 of 32) for well-visible nodules, 88% (28 of 32) for moderately subtle nodules.
• 62% (18 of 29) for subtle and 39% (7 of 18) for very subtle nodules.
• 91% (21 of 23) for nodules > 20mm, 62% (20 of 32) for nodules between 15 and 20mm, 77% (36 of 47) for nodules between 10 and 15mm, 56% (5 of 9) for nodules < 10mm
• CAD reached an area under the AFROC curve of 0.656. CAD generated 196 FP marks in the 189 cases with negative findings.
▪ Observer Performance
– The area under the AFROC curve for human readers was 0.812 without CAD vs 0.841 with CAD (p=0.0001)
– CAD detected 53% (127 of 239) of the nodules that were missed by the readers.
– Readers dismissed 55% (70 of 127) of these TP CAD candidates.
– CAD helped the readers to place new correct lesion labels (57) and to increase the confidence score of lesions (220).
– CAD also had adverse effects: readers placed new wrong labels (92) or increased the confidence score of nonlesions (66)
Results
JAFROC in Clinical Applications
24. ▪ Positive and Negative Effect of CAD
Results
JAFROC in Clinical Applications
25. ▪ Advantage of using CAD
Results
JAFROC in Clinical Applications
27. ▪ Advantage of using CAD
Results
JAFROC in Clinical Applications
28. ▪ Conclusion
– CAD improves observer performance for the detection of lung nodules on chest radiographs, beyond the
application of bone suppression alone
– CAD detected 3/7 nodules missed by all radiologists and 12/25 nodules missed by most of the radiologists.
– The CAD was most helpful for moderately subtle and subtle lesions.
– For well-visible nodules, both CAD and radiologists had high sensitivity
– For very subtle nodules, the sensitivity of CAD was much better than that of the readers (39% vs 22%), but readers could not take advantage of CAD because of the difficulty of differentiating TPs from FPs.
– The beneficial effect of CAD is limited by the insufficient ability of the observers to differentiate true-positive
from false-positive CAD candidates.
– Combination of CAD and bone suppression in chest radiography improves detection of potentially early lung
cancer.
=> We need a principled and clinically realistic method to assess more complex use cases of CAD (multiple diseases with lesion markings).
Discussion and Conclusion
JAFROC in Clinical Applications