Tasks
The PI-CAI: AI Study (grand challenge) aims to evaluate the performance of modern AI algorithms at patient-level diagnosis and lesion-level detection of csPCa (ISUP ≥ 2 cancer) in bpMRI. Like radiologists, the AI is expected to read bpMRI exams (imaging + clinical variables) and produce clinically significant lesion detections, a likelihood score of harboring csPCa for each detection, and an overall patient-level score for csPCa diagnosis.
Figure. (top) Lesion-level csPCa detection (modeled by 'AI'): For a given patient case, using the bpMRI exam (and optionally all clinical/acquisition variables), predict a 3D detection map of non-overlapping, non-connected csPCa lesions (with the same dimensions and resolution as the T2W image). For each predicted lesion, all voxels must share a single floating-point value between 0 and 1, representing that lesion's likelihood of harboring csPCa. (bottom) Patient-level csPCa diagnosis (modeled by 'f(x)'): For a given patient case, using the predicted csPCa lesion detection map (and optionally all clinical/acquisition variables), compute a single floating-point value between 0 and 1, representing that patient's overall likelihood of harboring csPCa. For instance, f(x) can simply be a function that takes the maximum of the csPCa lesion detection map, or it can be a more complex heuristic (defined by the AI developer).
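As a minimal sketch of the simplest f(x) mentioned in the caption, the patient-level score can be taken as the maximum of the detection map. The function name patient_level_score and the toy volume are illustrative assumptions, not part of the challenge code.

```python
import numpy as np


def patient_level_score(detection_map: np.ndarray) -> float:
    """Return the highest lesion likelihood in a detection map.

    An all-zero (or empty) map means no lesion was detected, so the
    patient-level score is 0.0.
    """
    if detection_map.size == 0:
        return 0.0
    return float(detection_map.max())


# Toy 3D detection map (assumed shape) with two non-connected lesions.
# Every voxel of a lesion carries that lesion's single likelihood value.
detection_map = np.zeros((4, 8, 8), dtype=np.float32)
detection_map[1, 1:3, 1:3] = 0.35
detection_map[2, 5:7, 5:7] = 0.80

score = patient_level_score(detection_map)
print(f"patient-level csPCa likelihood: {score:.2f}")
```

Because the score is derived directly from the detection map, a high patient-level likelihood cannot occur without a matching lesion detection.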
We require detection maps as the model output (rather than softmax predictions), so that we can definitively evaluate object/lesion-level detection performance using precision-recall (PR) and free-response receiver operating characteristic (FROC) curves. With volumes of softmax predictions, it is ambiguous how this should be handled. For example: what is the overall single likelihood of csPCa per predicted lesion, what constitutes the spatial boundaries of each predicted lesion, and in turn, what constitutes an object-level hit (TP) or miss (FN) under any given hit criterion?
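To illustrate the choices a developer has to make, here is one naive way to turn a softmax volume into a detection map. It is a sketch under stated assumptions, not the method used by the organizers or the baselines: the fixed threshold of 0.5, the use of the per-lesion maximum as the lesion likelihood, and the function name softmax_to_detection_map are all illustrative. Lesion boundaries are taken to be the connected components of the thresholded volume, found with scipy.ndimage.label.

```python
import numpy as np
from scipy import ndimage


def softmax_to_detection_map(softmax: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert a voxel-wise softmax volume into a lesion detection map.

    Assumptions (illustrative only):
    - lesion boundaries = connected components of `softmax >= threshold`
    - lesion likelihood = maximum softmax value inside the component
    """
    mask = softmax >= threshold
    # Default structuring element: face connectivity (6-connected in 3D).
    labels, num_lesions = ndimage.label(mask)

    detection_map = np.zeros_like(softmax, dtype=np.float32)
    for lesion_id in range(1, num_lesions + 1):
        lesion_voxels = labels == lesion_id
        # All voxels of one lesion receive the same single value.
        detection_map[lesion_voxels] = softmax[lesion_voxels].max()
    return detection_map


# Toy softmax volume with two separate blobs and one low-confidence voxel.
softmax = np.zeros((3, 6, 6), dtype=np.float32)
softmax[1, 1:3, 1:3] = 0.6
softmax[1, 1, 1] = 0.9   # peak inside the first blob
softmax[1, 4, 4] = 0.7   # second, single-voxel blob
softmax[0, 0, 0] = 0.2   # below threshold, discarded

detection_map = softmax_to_detection_map(softmax)
print(np.unique(detection_map))
```

A different threshold or aggregation rule would yield different lesions and scores, which is exactly the ambiguity that a fixed detection-map output format removes from the evaluation.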
Similar to clinical practice, PI-CAI mandates coupling the tasks of lesion detection and patient diagnosis to promote interpretability and disincentivize AI solutions that produce inconsistent outputs (e.g. a high patient-level csPCa likelihood score without any significant csPCa detections, and vice versa). Organizers will provide end-to-end baseline solutions, adapted from the standard U-Net (Ronneberger et al., 2015), the nnU-Net (Isensee et al., 2021) and the nnDetection (Baumgartner et al., 2021) models in a GitHub repo.
Preprocessing scripts for 3D medical images, geared towards csPCa detection in MRI:
Evaluation
Performance Metrics
Patient-level diagnosis performance is evaluated using the Area Under the Receiver Operating Characteristic curve (AUROC) metric. Lesion-level detection performance is evaluated using the Average Precision (AP) metric. The overall score used to rank each AI algorithm is the average of both task-specific metrics:
Overall Ranking Score = (AP + AUROC) / 2
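A minimal, dependency-free sketch of how the ranking score combines the two metrics. The AUROC here uses the standard pairwise (Mann-Whitney) formulation; the lesion-level AP is passed in as a given number, because computing it requires lesion matching under the hit criterion. The function names, the toy labels and scores, and the AP value of 0.5 are assumptions for illustration only.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def overall_ranking_score(ap, auroc_value):
    """Overall Ranking Score = (AP + AUROC) / 2."""
    return (ap + auroc_value) / 2.0


# Toy patient-level labels (1 = csPCa) and predicted likelihoods.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1]

patient_auroc = auroc(labels, scores)
hypothetical_ap = 0.5  # placeholder lesion-level AP, not a real result
ranking = overall_ranking_score(hypothetical_ap, patient_auroc)
print(f"AUROC = {patient_auroc:.3f}, ranking score = {ranking:.3f}")
```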
The Free-Response Receiver Operating Characteristic (FROC) curve is used for secondary analysis of AI detections (as recommended in Penzkofer et al., 2022). We highlight the performance on the FROC curve using the SensX metric. SensX refers to the sensitivity of a given AI system at detecting clinically significant prostate cancer (i.e., Gleason grade group ≥ 2 lesions) on MRI, given that it generates the same number of false positives per examination as the PI-RADS ≥ X operating point of radiologists. Here, by radiologists, we refer to the radiology readings that were historically made for these cases during multidisciplinary routine practice. For instance, in the FROC curve in the top-right corner of Fig. 4 of Bosma et al., 2023, the Report-guided SSL model has a Sens5 of around 0.62, a Sens4 of around 0.75 and a Sens3 of around 0.78. Across the PI-CAI testing leaderboards (Open Development Phase - Testing Leaderboard, Closed Testing Phase - Testing Leaderboard), SensX is computed at thresholds that are specific to the testing cohort (i.e., they depend on the radiology readings and the set of cases). While it doesn't make sense to compute this metric for a different dataset using the exact same thresholds, you can compute the general metric (i.e., lesion detection sensitivity at a given false positives per examination rate) using the picai_eval repo (via metrics.lesion_TPR_at_FPR).
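The general metric described above (lesion detection sensitivity at a given false positives per examination rate) can be sketched without any dependencies. This is a step-wise simplification and not the picai_eval implementation, whose results may differ (for example through interpolation). The function name sensitivity_at_fp_rate, its arguments and the toy candidates are assumptions for illustration, and each ground-truth lesion is assumed to be matched by at most one true-positive candidate.

```python
def sensitivity_at_fp_rate(candidates, num_lesions, num_cases, max_fp_per_case):
    """Best lesion-level sensitivity reachable within a false-positive budget.

    candidates: list of (confidence, is_true_positive) over all cases
    num_lesions: total number of ground-truth csPCa lesions
    num_cases: total number of examinations
    max_fp_per_case: allowed average number of false positives per examination
    """
    if num_lesions <= 0 or num_cases <= 0:
        raise ValueError("num_lesions and num_cases must be positive")

    best = 0.0
    # Each distinct confidence is one operating point on the FROC curve.
    thresholds = sorted({conf for conf, _ in candidates}, reverse=True)
    for t in thresholds:
        tp = sum(1 for conf, hit in candidates if conf >= t and hit)
        fp = sum(1 for conf, hit in candidates if conf >= t and not hit)
        if fp / num_cases <= max_fp_per_case:
            best = max(best, tp / num_lesions)
    return best


# Toy data: 7 lesion candidates across 4 exams containing 5 true lesions.
candidates = [
    (0.9, True), (0.8, False), (0.7, True), (0.6, True),
    (0.4, False), (0.3, False), (0.2, True),
]

for fp_rate in (0.0, 0.25, 1.0):
    sens = sensitivity_at_fp_rate(candidates, num_lesions=5, num_cases=4,
                                  max_fp_per_case=fp_rate)
    print(f"sensitivity at {fp_rate} FP/exam: {sens:.2f}")
```

To obtain a SensX-style value, max_fp_per_case would be set to the false positives per examination of the radiologists' PI-RADS ≥ X operating point for the same cohort.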
Intersection over Union (IoU) on its own is not used for evaluating detection or diagnostic performance, given that IoU is ill-posed to accurately validate these tasks (Reinke et al., 2022).
Hit Criterion for Lesion Detection
A "hit criterion" is a condition that must be satisfied for each predicted lesion to count as a hit or true positive. For csPCa detection in recent prostate-AI literature, hit criteria have typically been fulfilled by achieving a minimum degree of prediction-ground truth overlap, by localizing predictions within a maximum distance from the ground truth, or by localizing predictions to a specific region (as defined by sector maps).
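The first two kinds of hit criteria can be sketched as follows, for a single predicted lesion and a single ground-truth lesion given as binary masks. The function names, the thresholds (min_iou, max_distance_voxels) and the use of centroid distance measured in voxels are illustrative assumptions, not values or definitions prescribed here.

```python
import numpy as np


def iou(pred_mask, gt_mask) -> float:
    """Intersection over Union of two binary masks (0.0 if both are empty)."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred_mask, gt_mask).sum() / union)


def is_hit_by_overlap(pred_mask, gt_mask, min_iou: float) -> bool:
    """Overlap-based criterion: hit if IoU reaches a minimum value."""
    return iou(pred_mask, gt_mask) >= min_iou


def is_hit_by_distance(pred_mask, gt_mask, max_distance_voxels: float) -> bool:
    """Distance-based criterion: hit if the centroids are close enough."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    if not pred_mask.any() or not gt_mask.any():
        return False
    pred_centroid = np.argwhere(pred_mask).mean(axis=0)
    gt_centroid = np.argwhere(gt_mask).mean(axis=0)
    return float(np.linalg.norm(pred_centroid - gt_centroid)) <= max_distance_voxels


# Toy masks: two 3x3 squares shifted by one voxel along each in-plane axis.
gt = np.zeros((1, 6, 6), dtype=bool)
gt[0, 1:4, 1:4] = True
pred = np.zeros((1, 6, 6), dtype=bool)
pred[0, 2:5, 2:5] = True

print(f"IoU = {iou(pred, gt):.3f}")
print("overlap hit:", is_hit_by_overlap(pred, gt, min_iou=0.25))
print("distance hit:", is_hit_by_distance(pred, gt, max_distance_voxels=2.0))
```

The same prediction can count as a hit under one criterion and a miss under another, so the criterion and its threshold need to be fixed before detection metrics are compared.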
For the 3D detections predicted by AI, we opt for a hit criterion based on object overlap: