Tasksย ๐Ÿ‘ฉโ€๐Ÿ’ป๐Ÿ‘จโ€๐Ÿ’ป

The PI-CAI: AI Study (grand challenge) aims to evaluate the performance of modern AI algorithms at patient-level diagnosis and lesion-level detection of csPCa (ISUP ≥ 2 cancer) in bpMRI. Similar to radiologists, the objective for AI is to read bpMRI exams (imaging + clinical variables) and produce detections of clinically significant lesions, a likelihood score for each detected lesion harboring csPCa, and an overall patient-level score for csPCa diagnosis.
Figure. (top) Lesion-level csPCa detection (modeled by 'AI'): For a given patient case, using the bpMRI exam (and optionally all clinical/acquisition variables), predict a 3D detection map of non-overlapping, non-connected csPCa lesions (with the same dimensions and resolution as the T2W image). For each predicted lesion, all voxels must share a single floating point value between 0 and 1, representing that lesion's likelihood of harboring csPCa. (bottom) Patient-level csPCa diagnosis (modeled by 'f(x)'): For a given patient case, using the predicted csPCa lesion detection map (and optionally all clinical/acquisition variables), compute a single floating point value between 0 and 1, representing that patient's overall likelihood of harboring csPCa. For instance, f(x) can simply be a function that takes the maximum of the csPCa lesion detection map, or it can be a more complex heuristic (defined by the AI developer).
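As a minimal sketch of the simplest f(x) described above, assuming a NumPy array `detection_map` holding the predicted 3D csPCa detection map (the function name is illustrative, not part of any official API):

```python
import numpy as np

def patient_level_score(detection_map: np.ndarray) -> float:
    """Simplest f(x): take the maximum lesion likelihood in the 3D detection map
    as the patient-level likelihood of harboring csPCa."""
    return float(detection_map.max()) if detection_map.size else 0.0
```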
We require detection maps as the model output (rather than softmax predictions), so that we can definitively evaluate object/lesion-level detection performance using precision-recall (PR) and free-response receiver operating characteristic (FROC) curves. Volumes of softmax predictions leave considerable ambiguity: what is the overall single likelihood of csPCa per predicted lesion, what constitutes the spatial boundaries of each predicted lesion, and, in turn, what counts as an object-level hit (TP) or miss (FN) under any given hit criterion?
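As one illustration of how such a detection map could be derived from a voxel-level probability volume (a sketch under assumed conventions, not a prescribed method; the threshold and per-lesion score are arbitrary choices), one can threshold the volume, extract 3D connected components, and assign each component a single likelihood:

```python
import numpy as np
from scipy import ndimage

def probabilities_to_detection_map(prob: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert a voxel-level csPCa probability volume into a detection map of
    non-overlapping, non-connected lesion candidates, where every voxel of a
    candidate carries that lesion's single likelihood (here: its peak probability)."""
    detection_map = np.zeros_like(prob, dtype=np.float32)
    labels, num_lesions = ndimage.label(prob > threshold)  # 3D connected components
    for lesion_id in range(1, num_lesions + 1):
        mask = labels == lesion_id
        detection_map[mask] = prob[mask].max()              # one likelihood per lesion
    return detection_map
```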
Similar to clinical practice, PI-CAI mandates coupling the tasks of lesion detection and patient diagnosis, to promote interpretability and to disincentivize AI solutions that produce inconsistent outputs (e.g. a high patient-level csPCa likelihood score without any significant csPCa detections, or vice versa). Organizers will provide end-to-end baseline solutions, adapted from the standard U-Net (Ronneberger et al., 2015), nnU-Net (Isensee et al., 2021) and nnDetection (Baumgartner et al., 2021) models, in public GitHub repositories:

  • Preprocessing scripts for 3D medical images, geared towards csPCa detection in MRI: github.com/DIAGNijmegen/picai_prep/
  • Baseline AI models for 3D csPCa detection/diagnosis in bpMRI: github.com/DIAGNijmegen/picai_baseline


Evaluation 📊


Performance Metrics

Patient-level diagnosis performance is evaluated using the Area Under the Receiver Operating Characteristic curve (AUROC) metric. Lesion-level detection performance is evaluated using the Average Precision (AP) metric. The overall score used to rank each AI algorithm is the average of both task-specific metrics: Overall Ranking Score = (AUROC + AP) / 2.
Intersection over Union (IoU) on its own is not used for evaluating detection or diagnostic performance, given that IoU is ill-suited for accurately validating these tasks (Reinke et al., 2022).
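A minimal sketch of how these metrics combine into the ranking score, assuming scikit-learn and toy, purely illustrative inputs (the exact construction of the lesion-candidate list follows the hit criterion below and the official picai_eval utilities):

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy values, purely illustrative.
patient_y_true  = [0, 1, 0, 1, 1]              # case-level csPCa ground truth
patient_y_score = [0.1, 0.8, 0.3, 0.6, 0.9]    # case-level likelihoods from f(x)

# Lesion candidates after applying the hit criterion: 1 = true positive,
# 0 = false positive; missed ground-truth lesions appended with likelihood 0.
lesion_y_true  = [1, 0, 1, 0, 1]
lesion_y_score = [0.9, 0.4, 0.7, 0.2, 0.0]

auroc = roc_auc_score(patient_y_true, patient_y_score)        # patient-level diagnosis
ap = average_precision_score(lesion_y_true, lesion_y_score)   # lesion-level detection
ranking_score = (auroc + ap) / 2                              # overall ranking score
print(f"AUROC={auroc:.3f}  AP={ap:.3f}  Overall Ranking Score={ranking_score:.3f}")
```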

Hit Criterion for Lesion Detection

A โ€œhit criterionโ€ is a condition that must be satisfied for each predicted lesion to count as a hit or true positive. For csPCa detection in recent prostate-AI literature, hit criteria have been typically fulfilled by achieving a minimum degree of prediction-ground truth overlap, by localizing predictions within a maximum distance from the ground-truth, or on the basis of localizing predictions to a specific region (as defined by sector maps).

For the 3D detections predicted by AI, we opt for a hit criterion based on object overlap:

  • True Positives: For a predicted csPCa lesion detection to be counted as a true positive, it must share a minimum overlap of 0.10 IoU in 3D with the ground-truth annotation. This threshold value is in agreement with other lesion detection studies from recent literature (Bosma et al., 2023, Duran et al., 2022, Baumgartner et al., 2021, Saha et al., 2021, Hosseinzadeh et al., 2021, McKinney et al., 2020, Jaeger et al., 2019).
  • False Positives: Predictions with no or insufficient overlap count towards false positives, regardless of their size or location.
  • Edge Cases: When there are multiple predicted lesions with sufficient overlap (≥ 0.10 IoU), only the prediction with the largest overlap is counted, while all other overlapping predictions are discarded. Predictions with sufficient overlap that are discarded in this manner do not count towards false positives, to account for split-merge scenarios.
Performance evaluation utilities for 3D csPCa detection/diagnosis in bpMRI: github.com/DIAGNijmegen/picai_eval
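For illustration, a rough sketch of the overlap-based hit criterion above (an assumed, simplified implementation, not the official picai_eval code; function names are hypothetical):

```python
import numpy as np

def iou_3d(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection over Union between two binary 3D lesion masks."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(intersection) / union if union else 0.0

def match_lesions(pred_masks, gt_masks, min_iou=0.10):
    """Greedily match predicted lesions to ground-truth lesions.

    For each ground-truth lesion, only the sufficiently overlapping prediction
    (IoU >= min_iou) with the largest overlap counts as a true positive; other
    sufficiently overlapping predictions are discarded (neither TP nor FP), and
    predictions without sufficient overlap count as false positives.
    Returns (true_positive_pairs, false_positive_indices, missed_gt_indices).
    """
    true_positives, discarded = [], set()
    for gt_idx, gt_mask in enumerate(gt_masks):
        overlaps = [(iou_3d(pred_mask, gt_mask), pred_idx)
                    for pred_idx, pred_mask in enumerate(pred_masks)]
        overlaps = [(iou, idx) for iou, idx in overlaps if iou >= min_iou]
        if overlaps:
            _, best_idx = max(overlaps)
            true_positives.append((best_idx, gt_idx))
            discarded.update(idx for _, idx in overlaps)   # split lesions are not FPs
    matched_preds = {pred_idx for pred_idx, _ in true_positives}
    matched_gts = {gt_idx for _, gt_idx in true_positives}
    false_positives = [i for i in range(len(pred_masks))
                       if i not in matched_preds and i not in discarded]
    missed = [g for g in range(len(gt_masks)) if g not in matched_gts]
    return true_positives, false_positives, missed
```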