Skip to main content

Prediction of Drosophila melanogaster gene function using Support Vector Machines

Abstract

Background

While the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes for which there is no annotation. Large gene expression datasets from microarray experiments already exist and many of these can be used to help assign potential functions to these genes. We have applied Support Vector Machines (SVM), a sigmoid fitting function and a stratified cross‐validation approach to analyze a large microarray experiment dataset from Drosophila melanogaster in order to predict possible functions for previously un‐annotated genes. A total of approximately 5043 different genes, or about one‐third of the predicted genes in the D. melanogaster genome, are represented in the dataset and 1854 (or 37%) of these genes are un‐annotated.

Results

39 Gene Ontology Biological Process (GO‐BP) categories were found with precision value equal or larger than 0.75, when recall was fixed at the 0.4 level. For two of those categories, we have provided additional support for assigning given genes to the category by showing that the majority of transcripts for the genes belonging in a given category have a similar localization pattern during embryogenesis. Additionally, by assessing the predictions using a confidence score, we have been able to provide a putative GO‐BP term for 1422 previously un‐annotated genes or about 77% of the un‐annotated genes represented on the microarray and about 19% of all of the un‐annotated genes in the D. melanogaster genome.

Conclusions

Our study successfully employs a number of SVM classifiers, accompanied by detailed calibration and validation techniques, to generate a number of predictions for new annotations for D. melanogaster genes. The applied probabilistic analysis to SVM output improves the interpretability of the prediction results and the objectivity of the validation procedure.

Peer Review reports

Background

While the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes for which there is no annotation. For example, for Drosophila melanogaster, approximately 28% of the 14,029 predicted genes have no Gene Ontology (GO) term (either Molecular Function, Biological Process and/or Cellular Component) associated with them (including both curated and electronic annotations) and only 58% have a GO‐BP (Biological Process) term [1]. If one excludes genes that only have an electronic annotation, then only 41% have a GO‐BP term [1]. Functional genomic data and in particular microarray mRNA expression data have been used by numerous researchers as a means to help predict the function of un‐annotated genes. The analysis of such data is based on the premise that genes involved in a particular biological, molecular, and/or biochemical process are often co‐expressed. This co‐expression is dependent on the presence of common cis‐regulatory elements of the co‐regulated genes that bind one or more common transcription factors. A common approach to examine co‐expression profiles from microarray experiment data is to use clustering analysis. In this type of analysis, genes are organized and grouped based on their expression profile, with genes having similar expression pattern being “clustered” (grouped) together. The results of clustering analysis depend (a) on the metric used for comparing the expression profiles of genes that are analyzed and (b) on the clustering algorithm used. Euclidean distance and Pearson’s correlation coefficient are two simple commonly used metrics [2]. Popular clustering algorithms are hierarchical clustering, k‐means clustering and Self Organizing Maps [3].

Although clustering provides an insightful way of exploring gene co‐expression patterns, it does not directly convey any information regarding potential functions of un‐annotated genes. For that purpose, researchers have applied a further layer of analysis to the results of clustering, making use of additional data containing annotations of genes being analyzed. According to this type of analysis, clusters that are “enriched” for a particular gene function (i.e. they contain a larger than by chance number of genes having that function) are “assigned” that function. Subsequently, and following the “guilt‐by‐association” principle, all un‐annotated genes in that cluster are predicted to have the same function [4]‐[7].

In addition to the clustering analyses of genes in an unsupervised fashion and without annotation information, a number of supervised methods have been proposed for predicting gene functions. These methods make use of available information about the annotations of genes in order to discover gene expression patterns that characterize those annotations. Then, functional predictions for un‐annotated genes are made based on how well their expression profiles are matched with “annotated” patterns. Perhaps the most popular class of supervised methods are the binary classifiers, where first, expression data are separated into two classes either having or not having a particular annotation, and then un‐annotated genes are predicted of having or not having the annotation, based on which class their expression profile falls under. Examples of applications of these methods include Support Vector Machines (SVM) [8, 9], random forest [10], neural networks [11], factor analysis [12], logistic regression, linear discriminant and quadratic discriminant analysis [13]. The application of the aforementioned approaches for gene function prediction benefits from the use of large data sets where many different experimental treatments or conditions make up the microarray expression dataset (e.g. developmental time points, mutations, specific tissues, environmental conditions, drug treatments, etc.). For the most part, the studies that have tried to do large‐scale gene prediction assignment have used well‐known model organisms where large microarray datasets were available. These include studies in Saccharomyces cerevisiae[7, 14], Arabidiopsis thaliana[13], D. melanogaster[15], and Mus musculus[9]. Other approaches include assembling and integrating a large number of datasets from different experimental approaches (e.g. microarray expression, genetic interaction, and protein‐protein interaction) to create a network which in turn improves the robustness of the gene function predictions. These include studies in Saccharomyces cerevisiae[16], C. elegans[17], D. melanogaster[10, 18] and Mus musculus[19].

In this study we focus on the use of SVM for the prediction of D. melanogaster gene functions. SVM is a popular machine learning method for classification and regression. Its proven high performance as well as its solid theoretical basis justify its frequent use in many fields, including bioinformatics and predictions of gene functions. As a two‐class classification tool, SVM attempts to separate the data points not in the original feature space but in an “enlarged” higher‐dimensional space instead. The seemingly highly computationally expensive data transformation is not performed but instead, ingeniously, the separation is performed “implicitly” based on their distances measured with the use of a kernel function. The SVM algorithm uses optimization techniques to find the surface that gives the optimal margin between the points of the two classes [20, 21].

Despite its popularity in bioinformatics along with other research areas, SVM has been used for the prediction of GO‐BP annotations for genes of various organisms (such as S. cerevisiae[8] and M. musculus[9]) but not for D. melanogaster. Our study fills this gap, investigating how well this very popular method works for D. melanogaster gene expression data, in particular when the dataset is of a specific structure imposed by the nature of time‐course experiments, as the one we use in our study.

Using microarray data from the life cycle of D. melanogaster[22], and a controlled vocabulary for annotation of biological processes associated with D. melanogaster genes from the Gene Ontology Consortium (GO‐BP) [23], in this study we propose a method of predicting gene function of un‐annotated genes in the D. melanogaster genome by using Support Vector Machines and a two‐level data splitting rotation scheme for validation (double cross‐validation). Our prediction method was evaluated also externally with the use of an independent dataset.

Using this approach we have been able to provide a putative GO‐BP term for about 77% of the un‐annotated genes represented in the dataset and about 19% of all of the un‐annotated genes in the D. melanogaster genome helping to bridge the gap for the large number of genes that have little or no annotation. In addition, this SVM approach provides a precision and probability estimate that can help guide users as to the likelihood a given gene belongs to GO annotation class.

Methods

Microarray data and annotation sources

The microarray data used in this study was obtained from the series of 138 cDNA microarrays spanning the life cycle of Drosophila melanogaster[22]. The microarrays contained probes from 6765 cDNA clones and represented 5043 genes, roughly one third of the total number of genes in the D. melanogaster genome. From those genes, 1854 were not annotated with any GO‐BP term at the time of the analysis. Data were obtained from the Stanford Microarray Database [24] and normalized using a ratio‐based method according to the original publication. The dataset can also be obtained from the Gene Expression Omnibus [25], GEO accession number GSE4347. cDNA clone names were converted to primary Flybase Gene Identifiers (FBgn ids) from release 4.2 of the D. melanogaster genome using annotation available at Flybase [26]. Biological process annotation was downloaded from the Gene Ontology Consortium [23] in February 2006. Only GO‐BP categories containing a minimum of 10 and a maximum of 999 genes in the dataset were included, and 788 categories met these criteria. Clones with duplicate computer gene (CG) numbers were purposely not removed as we were interested in investigating the consistency of the predictions across the duplicates.

Support Vector Machines

A Support Vector Machine (SVM) is a classification and regression method originally developed by Vapnik [21]. Given a set of p‐dimensional vector data x i and their labels y i taking the values {-1,+1}, a linear Support Vector Machine finds the optimal hyperplane that separates the “positive” from the “negative” class. This plane is maximizing the margin between the two classes. According to the mathematical formulation of the problem, the solution refers to a weight vector w and a scalar b that satisfy the optimization problem

min | | w | | , y i ( w T x i + b ) 1 , i = 1 , , N ,

where N is the number of samples.

When classes are overlapping and misclassifications are allowed, the above constraints become

y i ( w T x i + b ) 1 - ξ i , ξ i 0 , ( ξ i ) C ,

where ξ i are slack variables and C a constant. Solutions can be generated by using the “Lagrange” formulation of the problem and its “Wolfe dual” problem. Given the solution, the weight vector w, for a new point x, the function f(x)=w T x+b is calculating the discriminant value for x, which can be used for its classification to the positive or negative class. It turns out that only a number of training points are important for the determination of the solution, the support vectors.

This mechanism can be applied to problems of non‐linear separation, after mapping the data onto a higher dimensional space, with the use of a mapping Φ. Φ is determined by a kernel K(.,.) such that K(x,y)=<Φ(x),Φ(y)>. The optimization problem is similar with the linear separation case, and following the Lagrange formulation and given a solution of Lagrange multipliers α i , the discriminant function is given by f(x)= i N s α i y i K(x, x i )+b, where x i ,i=1,…,N s are the support vectors. It is important to note here that the solution and discriminant function depend on the data only through the kernel, and also that the explicit expression of the mapping Φ is not needed for the solutions. This is one of the reasons that make SVM computationally attractive. More information on SVM can be found in [27].

It is evident that SVM depends on the choice of the kernel. After performing a small set of experiments with pilot runs evaluating a number of kernel choices, we decided to use a radial basis kernel,

K ( x , y ) = exp { | | x - y | | 2 / σ 2 } ,

where ||X||= < x , x > = x T x , over a linear or polynomial kernel. Our decision was also supported by the large popularity of this kernel (previously used, for instance, in [8, 9]). We used as σ a heuristic value equal to the median value of the distances between positive training points and their nearest negative training points. To combat asymmetric classes, a suggestion from [28] is implemented, where K(x,x) is augmented by λ·|C|/N where |C| is the size of the class C that x belongs to, and λ a tuning parameter. For our calculations we use λ=m/2, where m is the median value of the diagonal of the kernel matrix. For the implementation of the SVM algorithm we use the publicly available Gist package [29].

Estimation of class membership probabilities

For the translation of the discriminant values f(x) to posterior probabilities of class membership, p(y=1|x), we use the method proposed in [30]. The author proposes a sigmoid function model, where

p ( y = 1 | x ) = 1 1 + exp ( a · f ( x ) + b ) ,

with parameters a,b. The estimation of the parameters a,b is done by solving a maximum likelihood problem

min a , b L ( a , b ) ,

where

L ( a , b ) = i = 1 N ( t i log ( p i ) + ( 1 - t i ) log ( 1 - p i ) )

and

p i = 1 1 + exp ( a · f i + b ) , f i = f ( x i ) , t i = N + + 1 N + + 2 if y i = 1 1 N - + 2 if y i = - 1 , i = 1 , , N ,

where N +,N - are the sizes of the positive and negative class respectively.

For the sigmoid model we use the model fitting algorithm proposed by [31], which is implemented in the Gist package.

Cross‐validation and performance evaluation

For each selected GO‐BP category a set of annotated genes consisting from the positively annotated and a subset of the negatively annotated of size equal to up to four times the size of the positively annotated is participating in the evaluation of the SVM prediction algorithm. If I is the index set of all these genes participating in the cross‐validation, a partition {I 1,I 2,I 3,I 4} is generated, where each I j has equal size and equal number of positively labelled genes (to the degree that this is achievable). Two of those sets are used as “training set” for the SVM, one set is used for calibration and sigmoid fitting, and the last set is used as “test” set, where probability estimate values are output from SVM and compared with the known annotations of the genes for the evaluation of the method. For each permutation (α,β,γ,δ) of {1,2,3,4}, and if gene g i belongs to set I α , we denote with p β|γ,δ (i) the probability estimate for gene g i from an SVM classifier that was trained with training set I γ I δ and calibrated with the set I β . Keeping the test set fixed we can have three different ways of constructing the training and calibration sets, and therefore each gene of the test set has three different probability estimate values. For example, for gene g i in I 1, we have p 2|3,4(i),p 4|2,3(i) and p 3|4,2(i). Those three values are averaged to produce a unique probability estimate for every gene. For each test set (or fold) the probability estimates and true annotations of the genes are used for the calculation of performance measures such as precision at 20, 30 and 40. The values of those measures are then averaged to produced one final measure for the whole SVM procedure. Note here the distinction between the probability estimate generation for each gene that involves averaging over the three possible arrangements of training and calibration sets, and the calculation of the performance measure that is done independently for each one of the four folds (test sets), using each time only the genes of the test set.

For the calculation of precision and recall and the generation of the precision‐recall plots the R package ROCR [32] was used. For the precision‐recall plots, the curves corresponding to the different folds were averaged vertically (i.e. precision values were averaged for the same recall value).

Results

Gene function prediction using Support Vector Machines

The prediction of gene functions was performed with the use of a number of SVM classifiers. Annotation predictions were made independently for each of 788 GO‐BP categories selected based on their size (see Microarray data and annotation sources in Methods section), based on the results of an SVM classifier. Each SVM was trained with the use of a training set of data points of known labels (in this case, genes with Gene Ontology Biological Processes (GO‐BP) annotation). To remedy any model fitting problem caused by the high imbalance between the positive and negative classes, for every GO‐BP category and SVM classifier we limited the negative examples by randomly selecting a subset of them with a ratio of positive to negative examples being 1:4. The trained system was used for the classification of new, unlabelled data (i.e. prediction of functions for un‐annotated genes). The classification was based on the discriminant value of the data point, which is output by the SVM. As the discriminant value measures the “distance” of the data point from the hyperplane that separates the two classes, traditionally points with discriminant value larger than some threshold (typically zero) are classified as positives and the rest as negatives. Here, we refrain from translating the discriminant values to binary classifications, but, instead, we used a previously published algorithm to estimate the posterior probability of a gene membership to each GO‐BP category. The algorithm fits a sigmoid function to the discriminant values of labeled data and it uses this function to approximate posterior class probabilities for unlabelled data [31]. A flowchart describing the probability estimation procedure using SVM and the sigmoid fitting function is shown in Figure 1.

Figure 1
figure 1

Flowchart Describing the Probability Estimation Procedure using SVM and the Sigmoid Fitting Function. First, SVM is trained using dataset A (SVM training set). Then, classification predictions (in the form of discriminant values) for dataset B (sigmoid training or tuning set) are generated. Those predictions along with the known labels of B are used for the fitting of the sigmoid function. Finally, classification results for dataset C (test set) are mapped to estimated class membership probabilities using the fitted sigmoid.

We validated the performance of SVM and sigmoid fitting function with the use of stratified cross‐validation, where the initial set of known data points (positively and selected negatively annotated genes) was randomly partitioned into four equal size sets, with equal number of positively annotated genes [33]. For every possible set combination, two fourths were used as “SVM‐training” set (used to train the SVM classifier), one fourth as a “sigmoid‐training” set (used for fitting the sigmoid function to the discriminant values) and one fourth as an overall “test” set, for which the membership probabilities were estimated. Predictions results for the latter set can then be validated, since the true labels (gene functions) are known. For more details in the cross‐validation procedure used see Methods section.

For any chosen probability threshold value p t we can calculate the corresponding precision value, defined as the ratio of the number of genes that have average probability estimate equal or larger than p t and belong to the class, divided by the number of genes that have a probability estimate equal or larger than p t (i.e. True Positives / Predicted Positives). Similarly, recall is defined as the proportion of the genes annotated with the term that also have average probability estimate equal or larger than the threshold (i.e. True Positives / Total Positives). While precision gives a measure of the accuracy of the prediction, recall measures the coverage or completeness. The prediction performance of SVM for a specific GO‐BP category was evaluated with the use of a measure equal to the precision value corresponding to a specific recall value (e.g. 20%, 30%, 40%). Precision at 20, 30 and 40 were calculated for each fold and averaged over all four folds, and the number of categories with high performance measure values were calculated. More specifically, the method identified 39 high‐precision categories reaching precision at 40 values equal or larger than 0.75. Table 1 presents these categories. Some of these categories are related through ancestor‐predecessor relationships and therefore have a certain amount of redundancy. (The relationship between the categories is shown in a GO graph presented in Additional file 1: Figure S1.) Taking into account these relationships as well as the size of the categories, we came up with a subset of 24 categories of minimum redundancy.

Table 1 Selected GO‐BP categories with high precision values

Two of the high‐precision GO‐BP categories, DNA‐dependent DNA replication and oxidative phosphorylation, were chosen for further analysis. Precision‐recall plots were generated for both categories and presented in Figures 2A and 2B.

Figure 2
figure 2

Precision‐recall Plots for Two of the GO‐ Biological Process Categories. Plots show the fold‐average precision that corresponds to a recall value for (A) DNA‐dependent DNA replication, and (B) oxidative phosphorylation GO categories. Vertical averaging method was used.

Developmental gene expression profiles for the genes in these two categories are shown in Figure 3. For all of the expression profiles, the gene expression ratio for a given gene at a given time point is calculated by dividing the intensity signal for that gene at that time point by the average intensity for that gene throughout development. For the genes belonging to the DNA‐dependent DNA replication category, there is a strong up‐regulation of these transcripts from the earliest time points in development (0.5‐1.5h embryos) through approximately mid‐embryogenesis (11‐12h embryos) (Figure 3A). By way of comparison, the gene expression profiles of an equivalent number of genes selected at random are shown in the lower half of the figure and in general, the pattern seen for the DNA‐dependent DNA replication genes is not observed for other genes. The genes belonging to the oxidative phosphorylation category have a very different transcriptional profile. They are generally expressed at low levels during early embryogenesis and begin to show increased expression beginning in late embryogenesis and throughout the three larval stages. At late third instar (L 96h and 105h), prepupal and the majority of pupal stages (M00h‐M48h) these genes are down‐regulated and then are dramatically up‐regulated in late metamorphosis (M60h) and stay high until early adulthood (Figure 3B). It should be noted that during embryogenesis and metamorphosis the organisms are not feeding or moving but during the larval and adult stages they are. Interestingly, the oxidative phorphorylation genes are up‐regulated well in advance of the organisms moving or feeding, suggesting that the organism is preparing itself for the next stage of the development that will require the enzyme activities to carry out this process.

Figure 3
figure 3

Developmental Transcription Profiles for Two of the GO‐ Biological Process Categories. The gene expression profiles of two of the GO‐BP categories: (A) DNA‐dependent DNA replication, and (B) oxidative phosphorylation are shown. The top portion of each of the category figures contains the expression profiles throughout development for each gene in the category in the same order as they appear in either Table 2 or 3 respectively. The bottom portion of each category represents an equal number of genes randomly selected from the expression profiles in the entire data set. Red denotes genes having up‐regulated transcription at a given time point and green down‐regulated genes. The scale at the top of the figure indicates the degree of up‐ (in red) and down‐ (in green) regulation (in fold change).

Table 2 Gene Lists and Fluorescent In Situ Hybridization (FISH) analysis for DNA‐dependent DNA replication GO‐BP category
Table 3 Gene Lists and Fluorescent In Situ Hybridization (FISH) analysis for oxidative phosphorylation GO category

Gene function predictions for un‐annotated genes

One approach of declaring an un‐annotated gene as predicted to be annotated with some GO‐BP term, is to ascertain if its probability estimate output from the SVM specific for this term was larger than some threshold value (e.g. 0.5) and then assign the un‐annotated gene with that term. However, drawbacks to this approach include the arbitrary choice of the threshold and the fact that it ignores the evaluation of the performance of the classifier, as measured by cross‐validation. In this approach, any gene that has probability estimate score larger than 0.5 (or any other pre‐specified threshold value) for two categories A and B, will be predicted to belong to both A and B, regardless how well SVM is performing for these categories. A third problem with this method is that it assumes that probability estimate values are comparable between different GO‐BP categories. Although the purpose of the sigmoid function fit method is to provide with comparable estimates, one cannot be certain that this is necessarily obtained.

We have chosen an approach that takes into account the predictive behaviour of SVM for a specific GO‐BP term, when a prediction is made for an un‐annotated gene, as well as it addresses the other issues mentioned above. For each gene‐GO category pair a “gene precision” score is assigned to be equal to the maximum precision that can be achieved using thresholds smaller or equal to the probability estimate of the gene (output from SVM). If the gene is to be included in the “positive” predictions from SVM (and therefore to be assigned the GO‐BP function), the threshold has to be such that it will give a precision value for the classifier not larger than the “gene precision” score for the gene‐GO category pair. This score measures how probable is for the gene to truly belong to GO‐BP, given its probability estimate output from SVM, and also given the precision value of SVM for the particular category. This gene precision score is calculated separately for each one of the four folds (since for each fold probability estimates are generated for all un‐annotated genes) and the four fold‐specific scores are averaged.

We can then declare as gene function predictions all the gene‐category pairs that have “gene precision” score equal or larger than some threshold value. More specifically, 9887 new predictions representing 1422 genes (CG numbers) were generated corresponding to a gene precision score larger than 0.75. A total of approximately 5043 different genes, or about one‐third of the predicted genes in the D. melanogaster genome, are represented in the dataset. 1854 (or 37%) of these genes are un‐annotated and therefore this method is providing gene function predictions for about 77% (1422/1854) of the un‐annotated genes represented on the microarray and about 19% of all of the un‐annotated genes in the D. melanogaster genome. A graphical representation of these results is shown in Figure 4. The complete list of the these predictions can be found in Additional file 2: Table S1.

Figure 4
figure 4

The Proportion of genes that are Annotated, Un‐annotated, and Predicted by the SVM to Belong to a GO‐BP Category. Genes from the developmental time course data set that had no GO‐BP annotation were broken down into genes that were predicted with high confidence to belong to a GO‐BP category (predicted), and those that had low prediction values (not predicted).

Experimental support of predicted gene functions

The un‐annotated genes contained in two of the high‐precision GO‐BP categories, DNA‐dependent DNA replication and oxidative phosphorylation are listed in Tables 2 and 3. For each gene, the gene precision score is indicated. The transcripts for genes involved in the same function are often co‐localized within cells. Each of the genes (i.e. CG numbers) on each list was searched in a D. melanogaster FISH database [34, 35] to visually examine the spatial and temporal mRNA expression for that gene during early embryogenesis. Genes that are marked with a (+) or (-) had images present in the database. For the genes with images present in the FISH database, their spatial and temporal pattern were compared with those of known genes annotated with the category. Genes that had patterns that largely matched that of known genes in the category are marked with a (+). If either their temporal or their spatial pattern did not match the known gene pattern, they are marked with (-).

One annotated gene from the dataset known to belong to the GO‐BP and one un‐annotated gene from each of the two GO‐BP categories listed in each of Tables 2 and 3 were chosen and FISH images for those genes was retrieved from the FlyFISH database. Four different stage categories of early embryogenesis are shown in Figure 5. Green fluorescence represents the mRNA localization pattern for that transcript/gene and red fluorescence is showing the position of nuclei within the organism at that stage. For the known gene belonging to the GO‐BP DNA‐dependent DNA replication, the annotation for gene CG1584 (Gene name = Orc6, i.e. Origin recognition complex subunit 6) has a molecular function described as DNA binding. It is involved in the following biological processes: DNA replication initiation; DNA‐dependent DNA replication; chromatin silencing. The un‐annotated gene, CG1578 (from Table 2), had a very similar in situ hybridization pattern (Figure 5A). For the oxidative phophorylation GO‐BP category, the known gene is CG1970, annotated as having NADH dehydrogenase (ubiquinone) activity. It is involved in the biological process mitochondrial electron transport. The un‐annotated gene, CG1715 (from Table 3), has an in situ hybridization pattern closely resembling the pattern seen for CG1970.

Figure 5
figure 5

Fluorescent In Situ Hybridization (FISH) Images of Annotated and Un‐annotated Gene mRNAs. One annotated gene and one un‐annotated gene from each of the two GO‐BP categories shown in Tables 2 and 3 were chosen and FISH images for those genes was retrieved from the FlyFISH database. Four different stage categories of early embryogenesis are shown. Green fluorescence represents the mRNA localization pattern for that transcript/gene and red fluorescence is showing the position of nuclei within the organism at that stage. CG1584 = Orc6, Origin recognition complex subunit 6. Its molecular function is described as DNA binding and it is involved in the biological processes: DNA replication initiation; DNA‐dependent DNA replication; chromatin silencing. CG1970‐ NADH dehydrogenase (ubiquinone) activity. It is involved in the biological process mitochondrial electron transport.

Of the un‐annotated genes that are predicted to belong to the DNA‐dependent DNA‐replication GO‐BP genes, 22 had images in the FlyFISH database and 18 out of 22 (82%) had FISH images that were the same or very similar to the known genes belonging to the category (Table 2). Similarly, of the un‐annotated genes that are predicted to belong to the oxidative phosphorylation GO‐BP category, 19 had images in the FlyFISH database and 16 out of 19 (84%) had FISH images that were the same or very similar to the known genes belonging to the category (Table 3).

Discussion

We have applied Support Vector Machines (SVM) and a stratified cross‐validation approach to analyze a large microarray experiment dataset from D. melanogaster in order to predict possible functions for previously un‐annotated genes. This approach successfully generated a preliminary GO Biological Process annotation for a large number of these genes.

The SVM analysis, employing a sigmoid fitting function, generated annotation probability values for all the genes in the dataset. Using a recall value of 40%, 39 high‐precision GO‐BP categories were identified and 77% of the un‐annotated genes in the dataset have been assigned to one or more of these GO‐BP categories. Assigning new annotations based on gene‐precision scores takes into account both the overall precision of the GO‐BP category, as well as the individual probability of the gene, outcome of SVM and the sigmoid function. In that way, gene predictions of high confidence can be considered even for GO‐BP categories that have an overall poor precision performance.

We chose two of the high‐precision GO‐BP categories to validate using an independent data set (the FlyFISH database). These two categories, DNA‐dependent DNA replication and oxidative phosphorylation, were selected for a number of reasons but chiefly because each contained a fairly large number of previously unannotated genes allowing for a larger number of genes that could be cross‐checked in the FlyFISH database and that the GO‐BP description was fairly precise allowing for the potential to examine whether the unannotated genes are actually involved in the biological process they had been assigned to. While it could be argued that the use of FlyFISH data is perhaps too similar of a data type to be useful (i.e. is also gene transcription data), our main goal was to use data that: was from a completely independent source; had at least some data that differed in type from the original microarray data (in this case spatial data in addition to temporal data); where the raw data could be inspected independently at the gene level so that any computer or user assigned functional classification of the genes would not bias the validation. Given the high degree of precision that some of the unannotated genes have for belonging to a specific GO‐BP category, especially for the oxidative phosphorylation category, we feel quite confident that a large number of these genes will be shown in the future to be directly or indirectly involved in these biological processes. Gene expression pattern data has been used in the past to uncover the function of a gene. In a study by Hughes and coworkers [14], a large compendium of yeast gene expression data was examined for co‐expression patterns using hierarchical clustering and among several findings, they discovered that a previously unannotated gene, YER044c (now ERG28), had a gene expression pattern that was highly similar to genes that were known to be involved in ergosterol biosynthesis. Further biochemical and genetic experiments supported that the gene was indeed involved in this process [14].

In the original Arbeitman and coworkers study [22], the authors performed different hierarchical clustering analyses to identify groups of genes that were temporally co‐regulated and/or restricted to specific biological sample type (e.g. mutant or sex). The authors also identified several GO classes that could be assigned to subsets of the co‐regulated genes. While many of the GO classes reported were fairly low (i.e. general) in the GO annotation hierarchy (e.g. enzyme, metabolism) others were more detailed. Five of the more detailed GO categories reported by Arbeitman and coworkers [22] overlap with the 39 GO biological categories identified in this study including: mitochondria/electron transport; ribosome; cell cycle; cell adhesion; and phototransduction.

We attempted to compare our results with those from the study of Yan and co‐workers [10], who also assigned GO terms to groups of similarly behaving genes to a dataset which in part included the data we used in our study. The authors of that study apply a different computational method (Random Forest) to a number of data sources (including gene expression data) for the prediction of GO‐BP annotations. For the comparison with our study, first, predictions based on expression data only have to be extracted and, subsequently, precision and recall measures need to be calculated using methods similar to those used in our study. There were a few issues that made the comparison between the two studies difficult. First, Yan and co‐workers do not apply stratification in their 10‐fold cross‐validation. As a result, for some of the GO‐BP categories (and the associated classifiers) there are a number of folds that contain only negative classes, and therefore classification (and calculation of precision) cannot be performed. Estimates of measures such as precision at 40 could be generated based on the remaining folds, but this practice is generally not recommended as it may introduce bias [33]. Second, the number of categories that are represented in both studies (273) is much smaller than those that are represented in each study individually. If we exclude those that give folds without positively labeled representatives, we are left out with a very small number of GO‐BP categories for comparison. Due to these issues a strict comparison of the results from the two studies is not particularly valuable. For reference purposes we mention here that in Yan et al. study, 8 GO‐BP categories gave a precision at 40 value equal or larger than 0.75 after excluding the GO terms with less than 5 folds representing both classes, while the number of selected categories dropped to 1 if we exclude those with less than 8 folds representing both classes. Also, the reported new predictions in Yan et al. involve 2062 genes at the confidence level of 0.2, 213 genes at the confidence level of 0.5 and 11 genes at the confidence level of 0.75. It should be noted, though, that the confidence level is Yan et al. study was computed differently from the gene precision score in our study and the two are not necessarily comparable.

We believe that the methods used in this study can be applied to other microarray and RNA‐seq datasets. Our own experience trying this approach on other data suggests that the robustness of the results is largely dependent on both the number of array experiments that are in the dataset as well as on the type of experiment(s) the data represent. In this case, a developmental time course with numerous time points provided the ability to identify a large number of co‐expressed genes.

Conclusions

In this study we have applied Support Vector Machines for the prediction of GO‐BP annotation terms for D. melanogaster genes, using published time‐course gene expression data. We have assessed the predictive ability of the method using an elaborate stratified double cross‐validation procedure, involving the fitting of a sigmoid function on the raw output from SVM. Precision and recall values for each GO‐BP function were calculated, describing in detail the performance of the prediction. On a second level, our results were validated externally using independent data from FlyFISH database. Finally, we applied SVM for the prediction of GO‐BP annotations of un‐annotated genes, providing certainty estimates for those predictions, based on a method that uses the results from the cross‐validation and the predictive performance of SVM for the specific GO‐BP category.

We believe that this is the first study that investigates the performance of SVM for GO‐BP prediction for D. melanogaster genes and provides a comprehensive list of new annotation predictions that can be further researched experimentally. We believe that our study is of great value to researchers interested in computational annotation of genes, as well as to the community of D. melanogaster researchers.

Abbreviations

GO‐BP:

Gene Ontology Biological Processes

SVM:

Support Vector Machine

References

  1. Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, Osumi‐Sutherland D, Schroeder A, Seal R, Zhang H, the Fly Base Consortium: FlyBase: enhancing Drosophila Gene Ontology annotations. Nucl Acids Res. 2009, 37: D555-D559. 10.1093/nar/gkn788.

    Article  CAS  PubMed  Google Scholar 

  2. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome‐wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Tibshirani R, Hastie T, Eisen M, Ross D, Botstein D, Brown P: Clustering methods for the analysis of DNA microarray data. Tech. rep., Department of Statistics, Stanford University, Stanford, California; 1999

  4. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D: A combined algorithm for genome‐wide prediction of protein function. Nature. 1999, 402: 83-86. 10.1038/47048.

    Article  CAS  PubMed  Google Scholar 

  5. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge‐based approach for interpreting genome‐wide expression profiles. Proc Natl Acad Sci USA. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet. 1999, 22: 281-285. 10.1038/10343.

    Article  CAS  PubMed  Google Scholar 

  7. Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, Altschuler S: Large‐scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat Genet. 2002, 31: 255-265. 10.1038/ng906.

    Article  CAS  PubMed  Google Scholar 

  8. Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, M Ares J, Haussler D: Knowledge‐based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA. 2000, 97: 262-267. 10.1073/pnas.97.1.262.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Zhang W, Morris QD, Chang R, Shai O, Bakowski MA, Mitsakakis N, Mohammad N, Robinson MD, Zirngibl R, Somogyi E, Laurin N, Eftekharpour E, Sat E, Grigull J, Pan Q, Peng WT, Krogan N, Greenblatt J, Fehlings M, van der Kooy D, Aubin J, Bruneau BG, Rossant J, Blencowe BJ, Frey BJ, Hughes TR: The functional landscape of mouse gene expression. J Biol. 2004, 3: 21-10.1186/jbiol16.

    Article  PubMed  PubMed Central  Google Scholar 

  10. Yan H, Venkatesan K, Beaver J, Klitgord N, Yildirim M, Hao T, Hill D, Cusick M, Perrimon N, Roth F, Vidal M: A genome‐wide gene function prediction resource for Drosophila melanogaster. PLoS ONE. 2010, 5: e12139-10.1371/journal.pone.0012139.

    Article  PubMed  PubMed Central  Google Scholar 

  11. Mateos A, Dopazo J, Jansen R, Tu Y, Gerstein M, Stolovitzky G: Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons. Genome Res. 2002, 12: 1703-1715. 10.1101/gr.192502.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Kustra R, Shioda R, Zhu M: A factor analysis model for functional genomics. BMC Bioinformatics. 2006, 7: 216-10.1186/1471-2105-7-216.

    Article  PubMed  PubMed Central  Google Scholar 

  13. Lan H, Carson R, Provart NJ, Bonner AJ: Combining classifiers to predict gene function in Arabidopsis thaliana using large‐scale gene expression measurements. Bioinformatics. 2007, 8: 358-

    PubMed  PubMed Central  Google Scholar 

  14. Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH: Functional discovery via a compendium of expression profiles. Cell. 2000, 102: 109-126. 10.1016/S0092-8674(00)00015-5.

    Article  CAS  PubMed  Google Scholar 

  15. Zhang W, Zou S, Song J: Term‐tissue specific models for prediction of gene ontology biological processes using transcriptional profiles of aging in Drosophila melanogaster. BMC Bioinformatics. 2008, 9: 129-10.1186/1471-2105-9-129.

    Article  PubMed  PubMed Central  Google Scholar 

  16. Lee I, Li Z, Marcotte EM: An improved, bias‐reduced probabilistic functional gene network of baker’s yeast, Saccharomyces cerevisiae. PLoS ONE. 2007, 2: e988-10.1371/journal.pone.0000988.

    Article  PubMed  PubMed Central  Google Scholar 

  17. Lee I, Lehner B, Crombie C, Wong W, Fraser A, Marcotte E: A single gene network accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nat Genet. 2008, 40: 181-188. 10.1038/ng.2007.70.

    Article  CAS  PubMed  Google Scholar 

  18. Costello JC, Dalkilic MM, Beason SM, Gehlhausen JR, Patwardhan R, Middha S, Eads BD, Andrews JR: Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function. Genome Biol. 2009, 10: R97-10.1186/gb-2009-10-9-r97.

    Article  PubMed  PubMed Central  Google Scholar 

  19. Pena‐Castillo L, Tasan M, Myers CL, Lee H, Joshi T, Zhang C, Guan Y, Leone M, Pagnani A, Kim WK, Krumpelman C, Tian W, Obozinski G, Qi Y, Mostafavi S, Lin GN, Berriz GF, Gibbons FD, Lanckriet G, Qiu J, Grant C, Barutcuoglu Z, Hill DP, Warde‐Farley D, Grouios C, Ray D, Blake JA, Deng M, Jordan MI, Noble WS, Morris Q, Klein‐Seetharaman J, Bar‐Joseph Z, Chen T, Sun F, Troyanskaya OG, Marcotte EM, Xu D, Hughes TR, Roth F: A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol. 2008, 9: S2-

    Article  PubMed  PubMed Central  Google Scholar 

  20. Noble WS: What is a support vector machine?. Nat Biotechnol. 2006, 24: 1565-1567. 10.1038/nbt1206-1565.

    Article  CAS  PubMed  Google Scholar 

  21. Vapnik V: Statistical Learning Theory. 1998, New York: Wiley‐Interscience

    Google Scholar 

  22. Arbeitman MN, Furlong EE, Imam F, Johnson E, Null BH, Baker BS, Krasnow MA, Scott MP, Davis RW, White KP: Gene expression during the life cycle of Drosophila melanogaster. Science. 2002, 297: 2270-2275. 10.1126/science.1072152.

    Article  CAS  PubMed  Google Scholar 

  23. The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556. [http://www.geneontology.org]

    Article  PubMed Central  Google Scholar 

  24. Stanford Microarray Database. [http://smd.stanford.edu]

  25. Gene Expression Omnibus. [http://www.ncbi.nlm.nih.gov/geo/]

  26. FlyBase: A Database of Drosophila Genes & Genomes. [http://flybase.org]

  27. Burges CJC: A tutorial on support vector machines for pattern recognition. Data Min Knowl Dis. 1998, 2: 121-167. 10.1023/A:1009715923555.

    Article  Google Scholar 

  28. Veropoulos K, Campbell C, Cristianini N: Controlling the sensitivity of support vector machines. Proceedings of the International Joint Conference on AI. 1999, San Francisco: Morgan Kaufmann Publishers, 55-60.

    Google Scholar 

  29. Gist: Support vector machine and kernel principal components analysis, Version 2.0.9. [http://www.bioinformatics.ubc.ca/gist]

  30. Platt J: Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers. Edited by: Smola A, Bartlett P, Schölkopf B, Schuurmans D. 2000, Cambridge, MA: The MIT Press

    Google Scholar 

  31. Lin HT, Lin CJ, Weng RC: A note on Platt’s probabilistic outputs for support vector machines. Mach Learn. 2007, 68: 267-276. 10.1007/s10994-007-5018-6.

    Article  Google Scholar 

  32. Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics. 2005, 21: 3940-3941. 10.1093/bioinformatics/bti623.

    Article  CAS  PubMed  Google Scholar 

  33. Forman G, Scholz M: Apples‐to‐apples in cross‐validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsl. 2010, 12: 49-57. 10.1145/1882471.1882479.

    Article  Google Scholar 

  34. Fly-FISH: A database of Drosophila embryo mRNA localization patterns. [http://fly-fish.ccbr.utoronto.ca]

  35. Lécuyer E, Yoshida H, Parthasarathy N, Alm C, Babak T, Cerovina T, Hughes TR, Tomancak P, Krause HM: Global analysis of mRNA localization reveals a prominent role in organizing cellular architecture and function. Cell. 2007, 131: 174-187. 10.1016/j.cell.2007.08.003.

    Article  PubMed  Google Scholar 

Download references

Acknowledgements

This work was supported by CIHR Emerging Teams Grant to M.E., J.T.W. and others. J.T.W. is also supported by an NSERC Discovery Grant. We would like to thank Sarah Gonsalves (Department of Cell and Systems Biology, University of Toronto) for helping prepare Figure 3, Eric Lècuyer and Henry Krause (Banting and Best Department of Medical Research, University of Toronto) for the use of the Fly‐FISH data found in Tables 2 and 3 and Figure 5, Quaid Morris (Banting and Best Department of Medical Research, University of Toronto) for scripts to parse data from Gene Ontology, Han Yan for making available the data used in the Yan et al. study [10] and William Noble (Departments of Genome Sciences and of Computer Science and Engineering, University of Washington) for personal communication about the SVM algorithm.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Nicholas Mitsakakis or J Timothy Westwood.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

NM participated in the conception and design of the study, performed the analysis of the data, participated in the creation of the figures and tables and wrote part of the manuscript. ZR pre‐processed the data and participated in the creation of the figures and tables. ME participated in the conception and design of the study and revised the manuscript for critically important content. TW participated in the conception and design of the study, participated in the creation of figures and tables, wrote part of the manuscript and revised the manuscript for critically important content. All authors read and approved the final manuscript.

Electronic supplementary material

13040_2012_89_MOESM1_ESM.tiff

Additional file 1: The Parent‐Child Relationship Between Ontology Biological Process Categories. Figure showing a joined Gene Ontology graph providing an overview of the relationship between the 39 Gene Ontology Biological Process (GO‐BP) categories that were identified from the output of the SVM as having a precision‐at‐40 value equal or larger than 0.75, as indicated in the highlighted polygonal nodes. (TIFF 1 MB)

13040_2012_89_MOESM2_ESM.txt

Additional file 2: Selected GO‐BP Predictions for Previously Unannotated Genes. Table showing all new predictions with average gene‐precision score equal or larger than 0.75. Detailed information on gene‐precision scores and probability estimates for each one of the four folds is shown. (TXT 2 MB)

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Mitsakakis, N., Razak, Z., Escobar, M. et al. Prediction of Drosophila melanogaster gene function using Support Vector Machines. BioData Mining 6, 8 (2013). https://doi.org/10.1186/1756-0381-6-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1756-0381-6-8

Keywords