Open Access Research

Prediction of Drosophila melanogaster gene function using Support Vector Machines

Nicholas Mitsakakis1*, Zak Razak2, Michael Escobar3 and J Timothy Westwood2,4*

Author Affiliations

1 Toronto Health Economics and Technology Assessment (THETA) Collaborative, University of Toronto, Toronto, Canada

2 Canadian Drosophila Microarray Centre, University of Toronto at Mississauga, Mississauga, Canada

3 Dalla Lana School of Public Health, University of Toronto, Toronto, Canada

4 Department of Cell and Systems Biology, University of Toronto at Mississauga, Mississauga, Canada

BioData Mining 2013, 6:8  doi:10.1186/1756-0381-6-8

Published: 2 April 2013

Abstract

Background

While the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein‐encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes for which there is no annotation. Large gene expression datasets from microarray experiments already exist, and many of these can be used to help assign potential functions to these genes. We have applied Support Vector Machines (SVM), a sigmoid fitting function and a stratified cross‐validation approach to analyze a large microarray experiment dataset from Drosophila melanogaster in order to predict possible functions for previously un‐annotated genes. A total of approximately 5043 different genes, or about one‐third of the predicted genes in the D. melanogaster genome, are represented in the dataset, and 1854 (or 37%) of these genes are un‐annotated.

Results

Thirty‐nine Gene Ontology Biological Process (GO‐BP) categories were found with a precision of at least 0.75 when recall was fixed at 0.4. For two of those categories, we have provided additional support for assigning given genes to the category by showing that the majority of transcripts for the genes belonging to that category have a similar localization pattern during embryogenesis. Additionally, by assessing the predictions using a confidence score, we have been able to provide a putative GO‐BP term for 1422 previously un‐annotated genes, or about 77% of the un‐annotated genes represented on the microarray and about 19% of all of the un‐annotated genes in the D. melanogaster genome.
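
The abstract does not spell out how precision at a fixed recall level is computed, so the following is only a minimal sketch of one common way to do it, not the authors' procedure. It assumes each gene has a classifier score for a single GO‐BP category and a 0/1 label saying whether the gene is annotated to that category. The function name `precision_at_recall` and the toy data are hypothetical.

```python
def precision_at_recall(scores, labels, target_recall=0.4):
    """Precision at the highest score threshold whose recall reaches target_recall.

    scores: classifier outputs, higher means more confident (assumed convention).
    labels: 1 if the gene is annotated to the category, else 0.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank genes from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    tp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        # Genes with tied scores share one threshold, so evaluate after the tie.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        if tp / total_pos >= target_recall:
            # i + 1 genes are predicted positive at this threshold.
            return tp / (i + 1)
    raise ValueError("target_recall must be at most 1")


# Toy example: six genes, four of them annotated to the category.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 0, 1, 1, 0, 1]
print(precision_at_recall(toy_scores, toy_labels, target_recall=0.5))
```

Under this reading, a category meets the reported criterion when the value returned for `target_recall=0.4` is 0.75 or higher.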

Conclusions

Our study successfully employs several SVM classifiers, accompanied by detailed calibration and validation techniques, to generate predictions of new annotations for D. melanogaster genes. The probabilistic analysis applied to the SVM output improves the interpretability of the prediction results and the objectivity of the validation procedure.

Keywords:
Gene ontology; Support Vector Machines; Drosophila melanogaster; Gene expression data; Gene function prediction