Background: Class prediction models have been shown to perform variably in clinical gene expression datasets. Previous evaluation studies, mostly done in the field of cancer, showed that the accuracy of class prediction models differs from dataset to dataset and depends on the type of classification function. While much is known about the characteristics of classification functions, little has been done to determine which characteristics of gene expression data affect the performance of a classifier. This study aims to empirically identify data characteristics that affect the predictive accuracy of classification models, outside of the field of cancer.

Results: Datasets from twenty-five studies meeting predefined inclusion and exclusion criteria were downloaded. Nine classification functions were chosen, falling within the following categories: discriminant analyses or Bayes classifiers, tree-based methods, regularization and shrinkage methods, and nearest neighbors methods. Nine class prediction models were then built for each dataset using the same procedure, and their performances were evaluated by calculating their accuracies. The characteristics of each experiment were recorded (i.e., observed disease, medical question, tissue/cell types and sample size), together with characteristics of the gene expression data, namely the number of differentially expressed genes, the fold changes and the within-class correlations. Their effects on the accuracy of a class prediction model were statistically assessed by random effects logistic regression. The number of differentially expressed genes and the average fold change had a significant impact on the accuracy of a classification model, individually explaining up to 72% and 57% of the variation in prediction accuracy, respectively. Multivariable random effects logistic regression with forward selection yielded these two factors and the within-class correlation as factors affecting the accuracy of classification functions, together explaining 91.5% of the between-study variation.

Conclusions: We evaluated study- and data-related factors that might explain the varying performances of classification functions in non-cancer datasets. Our results showed that the number of differentially expressed genes, the fold change, and the correlation in gene expression data significantly affect the accuracy of class prediction models.
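To make the three data characteristics named above concrete, the following is a minimal sketch of how they might be computed for a two-class dataset. It is not the procedure used in the study: the function names, the log2-scale input, the threshold-based count of differentially expressed genes (instead of a statistical test), and the definition of within-class correlation as the mean absolute pairwise gene correlation after centering each class are all assumptions made for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def data_characteristics(expr, labels, fc_threshold=1.0):
    """Summarize a two-class expression dataset (hypothetical helper).

    expr   -- list of samples, each a list of log2 expression values per gene
    labels -- class label for each sample (exactly two distinct values)
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    n_genes = len(expr[0])

    # Per-class mean expression of every gene.
    class_means = {}
    for c in classes:
        rows = [row for row, lab in zip(expr, labels) if lab == c]
        class_means[c] = [mean(row[g] for row in rows) for g in range(n_genes)]

    # On the log2 scale, the fold change is the difference of class means.
    log_fc = [class_means[classes[1]][g] - class_means[classes[0]][g]
              for g in range(n_genes)]

    # Assumption: a gene counts as differentially expressed when its
    # absolute log2 fold change reaches the threshold (no significance test).
    n_de = sum(abs(fc) >= fc_threshold for fc in log_fc)

    # Center each sample by its own class mean, then average the absolute
    # pairwise gene-gene correlations (assumed definition).
    centered = [[row[g] - class_means[lab][g] for g in range(n_genes)]
                for row, lab in zip(expr, labels)]
    cols = list(zip(*centered))
    cors = [abs(pearson(cols[i], cols[j]))
            for i in range(n_genes) for j in range(i + 1, n_genes)]

    return {
        "n_differentially_expressed": n_de,
        "mean_abs_log_fold_change": mean(abs(fc) for fc in log_fc),
        "within_class_correlation": mean(cors),
    }


# Toy data: 6 samples, 3 genes, two classes.
toy_expr = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 5.5],
    [3.0, 6.0, 4.5],
    [5.0, 2.5, 5.0],
    [6.0, 4.5, 5.5],
    [7.0, 6.5, 4.5],
]
toy_labels = [0, 0, 0, 1, 1, 1]
summary = data_characteristics(toy_expr, toy_labels)
```

In a setting like the one described, summaries of this kind would be computed once per dataset and then used as study-level covariates when modelling classifier accuracy. The sketch assumes every gene varies within its class; a gene with zero within-class variance would make the correlation undefined.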