This study investigates the relationship between software complexity metrics and defect density using five NASA-PROMISE datasets (JM1, CM1, KC1, KC2, and PC1). Five widely validated complexity metrics were analyzed through both statistical and machine learning approaches: Cyclomatic Complexity, Lines of Code (LOC), Halstead Volume, Halstead Difficulty, and Essential Complexity. Correlation analyses reveal that linear (Pearson) correlations are weak and negative, whereas monotonic (Spearman) relationships are consistently positive and statistically significant (p < 0.01), with LOC demonstrating the strongest association. Four machine learning models (Logistic Regression, Support Vector Machine, Random Forest, and Gradient Boosting) were evaluated under two resampling techniques (SMOTE and Borderline-SMOTE) on each of the five datasets, resulting in 40 experimental configurations. Random Forest achieved the best overall performance, with the highest result obtained on KC1 using Borderline-SMOTE (AUC = 0.868, F1 = 0.500, MCC = 0.432). Feature importance analysis indicates that LOC, Halstead Volume, and Halstead Difficulty collectively contribute over 81% of the predictive power, whereas Essential Complexity shows minimal impact. Compared to prior work, this study improves performance on multiple datasets through class balancing techniques and provides a more comprehensive evaluation using AUC and MCC. The results confirm that a small set of well-established complexity metrics can effectively support software defect prediction when combined with appropriate preprocessing and learning strategies.
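
To make the experimental design concrete, the following minimal sketch enumerates the 40 configurations (5 datasets, 4 models, 2 resampling techniques) and computes two of the reported metrics, F1 and MCC, from confusion-matrix counts using their standard definitions. It is an illustration only and not the authors' code: the names `experiment_grid`, `f1_score_from_counts`, and `mcc_from_counts` are hypothetical, and it does not train any model or reproduce any reported result.

```python
from itertools import product
from math import sqrt

# Factors named in the abstract; list order is arbitrary.
DATASETS = ["JM1", "CM1", "KC1", "KC2", "PC1"]
MODELS = [
    "Logistic Regression",
    "Support Vector Machine",
    "Random Forest",
    "Gradient Boosting",
]
RESAMPLERS = ["SMOTE", "Borderline-SMOTE"]


def experiment_grid():
    """Return every (dataset, model, resampler) combination (hypothetical helper)."""
    return list(product(DATASETS, MODELS, RESAMPLERS))


def f1_score_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc_from_counts(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention).
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    grid = experiment_grid()
    print(len(grid), "configurations")
    print(grid[0])
```

Unlike F1, MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside AUC on imbalanced defect datasets.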