O’Connell et al. recently published a study on predicting transplant risk from biopsy transcriptome data. The study collected 204 renal allograft biopsies after kidney transplantation and used microarrays to identify genes that correlate with kidney damage, measured by the Chronic Allograft Damage Index, 12 months after the biopsy was taken. A machine learning model was developed using penalized regression; it consisted of 13 genes that predicted the development of transplant complications. The study concluded that the model had superior predictive potential (area under the curve = 0.967), better than the clinical data normally used to identify complications.

Criticism of the study was quick to follow, however, concerning the possibility of overfitting (a machine learning model fitting noise instead of true signal). The authors of the original study did not separate the test set from the training set when selecting genes for the model. This results in obvious “contamination” of the test set, which should be kept completely isolated until it is time to report the final performance metrics. The authors replied that they took steps to avoid overfitting (leave-one-out cross-validation), which produced gene sets similar to those of the original method. Note, however, that cross-validation only guards against this leakage if the gene selection is repeated inside every fold; selecting the genes once on the full dataset and then cross-validating the classifier leaks information all the same.

We also wonder why the authors chose AUC as their performance metric when the aim was to predict a continuous variable. Other metrics, such as root mean square error, would arguably have been more appropriate. In our experience it is very easy to end up with overfitted models when biomolecular data are the predictors, and extreme care must be taken from the very beginning of model development.
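The test-set contamination at issue here can be demonstrated on purely synthetic data. The sketch below, assuming scikit-learn, compares two protocols on random noise: selecting the most outcome-correlated "genes" on the full dataset before cross-validating, versus re-running the selection inside every training fold. The sample size, the gene count of 13, and the logistic classifier are illustrative choices, not the authors' actual pipeline:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Pure noise: 100 "biopsies", 5000 "genes", random binary outcome,
# so the honest predictive performance is chance (AUC ~0.5)
X = rng.normal(size=(100, 5000))
y = rng.integers(0, 2, size=100)

# Leaky protocol: pick the 13 genes most correlated with the outcome
# using ALL samples, then cross-validate only the downstream classifier
X_leaky = SelectKBest(f_classif, k=13).fit_transform(X, y)
auc_leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y,
                            cv=5, scoring="roc_auc").mean()

# Correct protocol: gene selection is part of the pipeline, so it is
# refitted on each training fold and never sees the held-out samples
pipe = make_pipeline(SelectKBest(f_classif, k=13),
                     LogisticRegression(max_iter=1000))
auc_clean = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

print(f"leaky AUC: {auc_leaky:.2f}")  # well above 0.5 despite pure noise
print(f"clean AUC: {auc_clean:.2f}")  # typically near chance
```

The gap between the two numbers is entirely an artifact of performing feature selection before the train/test split, which is why cross-validation results are only trustworthy when every data-dependent step, including gene selection, happens inside the folds.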