Infrared thermography is a promising non-invasive modality for breast cancer screening. However, ensuring the clinical reliability of deep learning models under small-data conditions remains challenging. This exploratory study examines the trade-off between predictive performance and model interpretability. DenseNet121 and VGG16 were evaluated across four input configurations: full versus segmented images, with or without edge-enhancement channels. Although DenseNet121 achieved the highest performance on full images, Explainable AI (Grad-CAM) revealed reliance on non-pathological artifacts, such as axillary folds. This shortcut learning was confirmed by a marked performance drop when segmentation and edge detection removed these cues. In contrast, VGG16 demonstrated greater robustness and more consistent anatomical focus, maintaining stability across input variations despite lower sensitivity. These findings provide empirical evidence that high accuracy may obscure underlying decision biases in limited datasets. Therefore, interpretability should be considered an essential component for validating clinical reliability rather than merely a visualization tool.
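For context, the Grad-CAM attributions referenced above weight each convolutional feature map by the global-average-pooled gradient of the class score with respect to that map, then apply a ReLU to the weighted sum. A minimal pure-Python sketch of that weighting is below; the toy activations and gradients are illustrative placeholders, not the study's models or thermograms:

```python
# Grad-CAM weighting on toy data (illustrative sketch, not the study's pipeline).
# For target class c and conv-layer feature maps A_k:
#   alpha_k = spatial mean of dY_c/dA_k   (global average pooling of gradients)
#   CAM(i, j) = ReLU(sum_k alpha_k * A_k(i, j))

def grad_cam(activations, gradients):
    """activations, gradients: K feature maps, each an HxW list of lists."""
    h, w = len(activations[0]), len(activations[0][0])
    # Channel importance weights: average each gradient map over all positions.
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # Weighted channel sum followed by ReLU, giving one HxW localization map.
    return [[max(0.0, sum(a * feat[i][j]
                          for a, feat in zip(alphas, activations)))
             for j in range(w)] for i in range(h)]

# Toy example: 2 channels, 2x2 maps; channel 0 gets weight +1, channel 1 gets -1.
acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [3.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(acts, grads))  # -> [[1.0, 0.0], [0.0, 2.0]]
```

In practice the resulting low-resolution map is upsampled to the input size and overlaid on the image, which is how spurious focus on regions such as axillary folds becomes visible.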