<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1756-0381-2-3</ui>
   <ji>1756-0381</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>Partitioning clustering algorithms for protein sequence data sets</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Fayech</snm>
               <fnm>Sondes</fnm>
               <insr iid="I1"/>
               <email>sondes_el_feyech@yahoo.fr</email>
            </au>
            <au id="A2">
               <snm>Essoussi</snm>
               <fnm>Nadia</fnm>
               <insr iid="I1"/>
               <email>nadia.essoussi@isg.rnu.tn</email>
            </au>
            <au id="A3">
               <snm>Limam</snm>
               <fnm>Mohamed</fnm>
               <insr iid="I1"/>
               <email>mohamed.limam@isg.rnu.tn</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Computer Science, LARODEC Laboratory, Higher Institute of Management, University of Tunis, Tunis, Tunisia</p>
            </ins>
         </insg>
         <source>BioData Mining</source>
         <issn>1756-0381</issn>
         <pubdate>2009</pubdate>
         <volume>2</volume>
         <issue>1</issue>
         <fpage>3</fpage>
         <url>http://www.biodatamining.org/content/2/1/3</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">19341454</pubid>
               <pubid idtype="doi">10.1186/1756-0381-2-3</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>13</day>
               <month>11</month>
               <year>2008</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>02</day>
               <month>4</month>
               <year>2009</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>02</day>
               <month>4</month>
               <year>2009</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2009</year>
         <collab>Fayech et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Genome-sequencing projects are producing an enormous number of new sequences, causing protein sequence databases to grow rapidly. The unsupervised classification of these data into functional groups or families (clustering) has become one of the principal research objectives in structural and functional genomics, and computer programs that classify sequences into families automatically and accurately have become a necessity. A significant number of methods have addressed the clustering of protein sequences, and most of them fall into three major groups: hierarchical, graph-based and partitioning methods. Among the sequence clustering methods in the literature, hierarchical and graph-based approaches have been widely used. Although partitioning clustering techniques are widely used in other fields, few applications have been reported for protein sequence clustering. It has not been fully demonstrated whether partitioning methods can be applied to protein sequence data and whether they perform well compared with published clustering methods.</p>
            </sec>
            <sec>
               <st>
                  <p>Methods</p>
               </st>
               <p>We developed four partitioning clustering approaches that use the Smith-Waterman local alignment algorithm to determine pair-wise similarities between sequences. Four different sets of protein sequences were used as evaluation data sets for the proposed methods.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We show that these methods outperform several other published clustering methods in terms of sensitivity (correctly predicting a classifier) and especially in terms of specificity (the correctness of the provided prediction). The software is available to academic users from the authors upon request.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The number of available protein sequences is more than half a million, and meaningful partitions of them are needed in order to detect their functions. The earliest approaches to comparing and grouping protein sequences are alignment methods: pair-wise alignment is used to compare and to cluster sequences. There are two types of pair-wise sequence alignment, local and global <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. The Smith and Waterman local alignment algorithm <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> helps in finding conserved amino acid patterns in protein sequences. The Needleman and Wunsch global alignment algorithm <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> attempts to align the entire sequences, using as many characters as possible, up to both ends of each sequence. When a large data set of proteins is clustered, pair-wise alignment is computationally expensive because of the large number of comparisons carried out: each protein of the data set must be compared with all the others.</p>
         <p>For this reason, pair-wise alignment methods are not efficient for clustering large data sets. These approaches also do not take into account that the data set can be too large to fit into the main memory of some computers.</p>
         <p>The main objective of the unsupervised learning technique is to find a natural grouping or meaningful partition using a distance function <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. Clustering is a technique which has been extensively applied in a variety of fields related to life science and biology. In sequence analysis, clustering is used to group homologous sequences into gene or protein families.</p>
         <p>Many methods are currently available for clustering protein sequences into families, and most of them fall into three major groups: hierarchical, graph-based and partitioning methods. Most are based on hierarchical or graph-based techniques, which are well established. COG <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> uses a hierarchical merging of clusters and a manual validation to prevent chaining of multi-domain families. ProtoNet <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> uses a special metric as described by Sasson et al. <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> to merge clusters. Picasso <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> uses multiple alignments as profiles which are then merged hierarchically. ClusTr <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> and GeneRage <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> use standard single linkage clustering approaches. SYSTERS <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> combines hierarchical clustering with graph-based clustering. ProtoMap <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> and N-cut <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp> methods use graph-based clustering approaches. ProClust <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> uses an extension of the graph-based clustering approach proposed in <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. The ProClust algorithm is based on a transitivity criterion and is capable of handling multi-domain proteins. TribeMCL <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> applies the Markov clustering approach (MCL) described by Van Dongen <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. This method operates on a graph that contains similarity information obtained by pair-wise alignment of sequences.</p>
         <p>Few partitioning techniques have been used in the protein sequence clustering field. Guralnik and Karypis <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> proposed a method based on a standard k-means approach in which proteins are represented by vectors. However, no tool or database resulting from this interesting work has been made available to the scientific community. JACOP <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> uses the partitioning algorithm implemented under the name PAM (Partitioning Around Medoids) in the R statistical package <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. JACOP is based on a random sampling of sequences into groups. It is available on the MyHits platform <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, where users can submit their own data sets. The methods presented above are generally not tools, since they cannot be applied to cluster a user-provided data set: several of them have been applied to large known data sets, and users can only consult the resulting classifications stored in databases. Among the protein sequence clustering methods described above, only ProClust, TribeMCL and JACOP are accessible to the community and allow users to classify their own sequence sets.</p>
         <p>The main idea here is to design and develop efficient clustering algorithms based on partitioning techniques, which have been little investigated in the protein sequence clustering field, in order to cluster large sets of protein sequences. The number of protein sequences now available is very large (in the order of millions), and hierarchical methods are computationally very expensive, so they cannot be extended to cluster large protein sets. Partitioning methods, in contrast, are simple and more appropriate for clustering large data sets <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. For these reasons, we propose new clustering algorithms based on partitioning techniques, which aim to find meaningful partitions, to improve the quality of the classification and to reduce the computation time compared with the published clustering tools ProClust, TribeMCL and JACOP on different data sets.</p>
         <p>Several partitioning clustering algorithms have been proposed in the literature. K-means <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp> is a standard partitioning clustering method based on the K centroids of a random initial partition, which is iteratively improved. LEADER <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr></abbrgrp> is an incremental partitioning clustering algorithm in which each of the K clusters is represented by a leader. CLARA (Clustering LARge Applications) <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> is a partitioning algorithm based on a combination of a sampling approach and the PAM algorithm. The CLARANS (Clustering Large Applications based on RANdomized Search) algorithm <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> views the process of finding optimal medoids as a search through a graph in which each node represents a set of medoids.</p>
         <p>We adapted the partitioning algorithms cited above to protein sequence data sets. The proposed algorithms are named Pro-Kmeans, Pro-LEADER, Pro-CLARA and Pro-CLARANS. Performance measures are used to evaluate the proposed methods and to compare them with the results of ProClust, TribeMCL and JACOP.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Algorithms Implementation</p>
            </st>
            <p>To facilitate subsequent discussion, the main symbols used throughout the paper and their definitions are summarized in Table <tblr tid="T1">1</tblr>.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Summary of symbols and definitions</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Symbols</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Definitions</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>D</p>
                     </c>
                     <c ca="left">
                        <p>Data set of protein sequences to be clustered</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>K</p>
                     </c>
                     <c ca="left">
                        <p>Number of clusters</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>n</p>
                     </c>
                     <c ca="left">
                        <p>Number of proteins in <it>D</it></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>O<sub>i</sub></p>
                     </c>
                     <c ca="left">
                        <p>A protein sequence <it>i </it>in <it>D</it></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>q</p>
                     </c>
                     <c ca="left">
                        <p>Number of iterations</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>The main objective in the proposed algorithms Pro-Kmeans, Pro-LEADER, Pro-CLARA and Pro-CLARANS is to produce K clusters from a data set D of n protein sequences, so that the objective function f(V) is maximized.</p>
            <p>f(V) is the global score function that evaluates the clustering quality. It is defined as follows</p>
            <p>
               <display-formula id="M1">
                  <graphic file="1756-0381-2-3-i1.gif"/>
               </display-formula>
            </p>
            <p>where R<sub>i </sub>is the centroid of the group i to which the object O<sub>j </sub>belongs and Score (O<sub>j</sub>, R<sub>i</sub>) is the alignment score of the protein sequences O<sub>j </sub>and R<sub>i</sub>, calculated as follows</p>
            <p>
               <display-formula id="M2">
                  <graphic file="1756-0381-2-3-i2.gif"/>
               </display-formula>
            </p>
            <p>where <it>S (A</it><sub><it>i</it></sub>, <it>B</it><sub><it>j</it></sub><it>) </it>is the substitution score of the amino acid <it>A</it><sub><it>i </it></sub>by <it>B</it><sub><it>j </it></sub>as determined from a scoring matrix and <it>g(n) </it>is the total gap penalty for a gap of length <it>n</it>. The gap penalty is defined as follows</p>
            <p>
               <display-formula id="M3">
                  <graphic file="1756-0381-2-3-i3.gif"/>
               </display-formula>
            </p>
            <p>Where P<sub>o </sub>is the gap opening penalty and P<sub>e </sub>is the gap extension penalty.</p>
            <p>We chose the Smith and Waterman local alignment algorithm for computing alignment scores. This choice was motivated by its sensitivity for low-scoring alignments <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> compared with heuristic algorithms such as FASTA <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> and BLAST <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>, and by its execution time <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> compared with the Needleman and Wunsch global alignment algorithm.</p>
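            <p>As an illustration of the scoring scheme described above, the following minimal Python sketch computes a local alignment score with affine gap penalties. It is not the implementation used in this study: the names <it>smith_waterman </it>and <it>toy_sub </it>are hypothetical, the toy substitution scores stand in for a real scoring matrix, and the gap cost is assumed to be P<sub>o </sub>+ (n - 1) P<sub>e </sub>for a gap of length n, which may differ from the definition given in Equation (3).</p>
            <p><![CDATA[
```python
def smith_waterman(a, b, sub, p_open=10, p_ext=2):
    """Best local alignment score of sequences a and b.

    Assumption: a gap of length n costs p_open + (n - 1) * p_ext.
    `sub(x, y)` is a hypothetical substitution function standing in
    for a scoring matrix such as BLOSUM62.
    """
    neg = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]    # best local score ending at (i, j)
    E = [[neg] * cols for _ in range(rows)]  # alignments ending with a gap in a
    F = [[neg] * cols for _ in range(rows)]  # alignments ending with a gap in b
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Either open a new gap or extend an existing one.
            E[i][j] = max(H[i][j - 1] - p_open, E[i][j - 1] - p_ext)
            F[i][j] = max(H[i - 1][j] - p_open, F[i - 1][j] - p_ext)
            diag = H[i - 1][j - 1] + sub(a[i - 1], b[j - 1])
            # The 0 lets a local alignment restart anywhere.
            H[i][j] = max(0, diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def toy_sub(x, y):
    # Toy substitution scores (not BLOSUM62): +5 for a match, -4 otherwise.
    return 5 if x == y else -4
```
]]></p>
            <p>With these toy scores, two identical four-residue sequences score 20, and two matching four-residue blocks separated by a one-residue gap score 20 - 10 + 20 = 30.</p>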
            <p>We present here Pro-Kmeans, Pro-LEADER, Pro-CLARA and Pro-CLARANS partitioning clustering algorithms for protein sequence sets.</p>
            <sec>
               <st>
                  <p>Pro-Kmeans algorithm</p>
               </st>
               <p>The Pro-Kmeans algorithm proposed here starts with a random partition of the data set D into K clusters and then uses the Smith Waterman algorithm to compare the proteins of each cluster S<sub>i</sub><sub>(i &#8712; [1..K]) </sub>and to compute SumScore(S<sub>i</sub>, O<sub>j</sub>) for each protein j in S<sub>i </sub>as follows</p>
               <p>
                  <display-formula id="M4">
                     <graphic file="1756-0381-2-3-i4.gif"/>
                  </display-formula>
               </p>
               <p>where <it>m </it>is the size of the subset <it>S</it><sub><it>i </it></sub>to which the object <it>O</it><sub><it>j </it></sub>belongs.</p>
               <p>The sequence <it>O</it><sub><it>j </it></sub>in each cluster <it>S</it><sub><it>i </it></sub>which has the maximum <it>SumScore</it>(<it>S</it><sub><it>i</it></sub>, <it>O</it><sub><it>j</it></sub>) is considered as the centroid <it>R</it><sub><it>i </it></sub>of the cluster. The Smith Waterman algorithm is then used to compare each protein <it>O</it><sub><it>h </it></sub>of the data set <it>D </it>with the centroids and to assign it to the nearest cluster, that is, the cluster whose centroid <it>R</it><sub><it>i </it></sub>has the maximum similarity score with <it>O</it><sub><it>h</it></sub>. Pro-Kmeans repeats this procedure <it>q </it>times in order to maximize the <it>f(V) </it>function. The input parameters are the number of clusters, <it>K</it>, and the number of iterations, <it>q</it>; as outputs, the algorithm returns the best partition of the training base <it>D </it>and the center, or mean, of each cluster <it>S</it><sub><it>i</it></sub>. The Pro-Kmeans algorithm is illustrated in Figure <figr fid="F1">1</figr>.</p>
               <fig id="F1">
                  <title>
                     <p>Figure 1</p>
                  </title>
                  <caption>
                     <p>Pseudo code for Pro-Kmeans algorithm</p>
                  </caption>
                  <text>
                     <p><b>Pseudo code for Pro-Kmeans algorithm</b>.</p>
                  </text>
                  <graphic file="1756-0381-2-3-1"/>
               </fig>
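               <p>The procedure described in this section can be sketched in Python as follows. This is a simplified illustration rather than the authors' Java code: all function names are hypothetical, <it>toy_score </it>is a stand-in for a Smith Waterman score, and the summed score of a sequence is assumed to run over all members of its cluster, including itself.</p>
               <p><![CDATA[
```python
import random


def toy_score(a, b):
    # Hypothetical stand-in for an alignment score: number of matching positions.
    return sum(x == y for x, y in zip(a, b))


def pick_centroid(cluster, score):
    """Member with the largest summed score against the cluster (assumed SumScore)."""
    return max(cluster, key=lambda o: sum(score(o, x) for x in cluster))


def assign(data, centroids, score):
    """Put each sequence in the cluster whose centroid scores highest with it."""
    clusters = [[] for _ in centroids]
    for seq in data:
        best = max(range(len(centroids)), key=lambda i: score(seq, centroids[i]))
        clusters[best].append(seq)
    return clusters


def global_score(clusters, centroids, score):
    """f(V): sum of the scores between every sequence and its own centroid."""
    return sum(score(o, r) for c, r in zip(clusters, centroids) for o in c)


def pro_kmeans_sketch(data, k, q, score, seed=0):
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    clusters = [shuffled[i::k] for i in range(k)]  # random initial partition
    best = None
    for _ in range(q):
        # Empty clusters are dropped, so fewer than k clusters may remain.
        centroids = [pick_centroid(c, score) for c in clusters if c]
        clusters = assign(data, centroids, score)
        f = global_score(clusters, centroids, score)
        if best is None or f > best[0]:
            best = (f, clusters, centroids)  # keep the partition with maximal f(V)
    return best
```
]]></p>
               <p>The function returns the best value of f(V) found over the <it>q </it>iterations together with the corresponding partition and centroids.</p>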
            </sec>
            <sec>
               <st>
                  <p>Pro-LEADER algorithm</p>
               </st>
               <p>Pro-LEADER is an incremental algorithm which selects the first sequence of the data set <it>D </it>as the first leader and uses the Smith Waterman algorithm to compute the similarity score of each sequence in <it>D </it>with all leaders. The algorithm detects the nearest leader <it>R</it><sub><it>i </it></sub>to each sequence <it>O</it><sub><it>j </it></sub>and compares the score, <it>Score(R</it><sub><it>i</it></sub>, <it>O</it><sub><it>j</it></sub><it>)</it>, with a pre-fixed <it>Threshold</it>. If the similarity score of <it>R</it><sub><it>i </it></sub>and <it>O</it><sub><it>j </it></sub>is lower than the <it>Threshold</it>, <it>O</it><sub><it>j </it></sub>is too dissimilar from every existing leader and is considered as a new leader; otherwise, the sequence <it>O</it><sub><it>j </it></sub>is assigned to the cluster defined by the leader <it>R</it><sub><it>i</it></sub>. Each of the <it>K </it>clusters is thus represented by a leader. The <it>K </it>clusters are generated using a suitable <it>Threshold </it>value. Pro-LEADER also aims to maximize the <it>f(V) </it>function. The input parameter is the similarity score <it>Threshold </it>used to decide whether an object <it>O</it><sub><it>j </it></sub>becomes a new leader; as outputs, the algorithm returns the best partition of the training base <it>D </it>and the <it>K </it>leaders of the obtained clusters. The Pro-LEADER algorithm is fast, requiring only one pass through the data set <it>D</it>. It is depicted in Figure <figr fid="F2">2</figr>.</p>
               <fig id="F2">
                  <title>
                     <p>Figure 2</p>
                  </title>
                  <caption>
                     <p>Pseudo code for Pro-LEADER algorithm</p>
                  </caption>
                  <text>
                     <p><b>Pseudo code for Pro-LEADER algorithm</b>.</p>
                  </text>
                  <graphic file="1756-0381-2-3-2"/>
               </fig>
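               <p>A single-pass leader clustering of this kind can be sketched in Python as follows. The sketch assumes that a sequence becomes a new leader when its best similarity score with the existing leaders falls below the threshold, which is the usual reading of the leader algorithm when similarities rather than distances are used. The function names and <it>toy_score </it>are hypothetical.</p>
               <p><![CDATA[
```python
def toy_score(a, b):
    # Hypothetical stand-in for an alignment score: number of matching positions.
    return sum(x == y for x, y in zip(a, b))


def pro_leader_sketch(data, threshold, score):
    leaders = [data[0]]        # the first sequence is the first leader
    clusters = [[data[0]]]     # clusters[i] holds the members led by leaders[i]
    for seq in data[1:]:
        nearest = max(range(len(leaders)), key=lambda i: score(seq, leaders[i]))
        if score(seq, leaders[nearest]) < threshold:
            # Assumption: too dissimilar from every leader, so start a new cluster.
            leaders.append(seq)
            clusters.append([seq])
        else:
            clusters[nearest].append(seq)
    return leaders, clusters
```
]]></p>
               <p>Because each sequence is compared only with the current leaders, the data set is read once, and the number of clusters follows from the threshold instead of being fixed in advance.</p>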
            </sec>
            <sec>
               <st>
                  <p>Pro-CLARA algorithm</p>
               </st>
               <p>Pro-CLARA relies on the sampling approach to handle large data sets <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. Instead of finding medoids for the entire data set, the Pro-CLARA algorithm draws a small sample <it>S </it>of 40 + 2<it>K </it>sequences from the data set <it>D</it>. To generate an optimal set of medoids for this sample, Pro-CLARA applies Pro-PAM, the adaptation of the PAM algorithm to protein sequence data sets proposed here.</p>
               <p>The Pro-PAM algorithm selects K sequences at random from the data set as the initial medoids, and then uses the Smith Waterman algorithm to compute the total score TS<sub>ih </sub>of each pair of a selected sequence R<sub>i </sub>and a non-selected sequence O<sub>h</sub>. TS<sub>ih </sub>is defined as follows</p>
               <p>
                  <display-formula id="M5">
                     <graphic file="1756-0381-2-3-i5.gif"/>
                  </display-formula>
               </p>
               <p>Where <it>S</it><sub><it>jih </it></sub>is the differential score of each pair of non-selected object <it>O</it><sub><it>h </it></sub>in <it>D </it>and selected object <it>R</it><sub><it>i</it>(<it>i </it>&#8712; [1..<it>K</it>]) </sub>with all non-selected objects <it>O</it><sub><it>j </it></sub>in <it>D. S</it><sub><it>jih </it></sub>is as follows</p>
               <p>
                  <display-formula id="M6">
                     <graphic file="1756-0381-2-3-i6.gif"/>
                  </display-formula>
               </p>
               <p>Pro-PAM selects the maximal <it>TS</it><sub><it>ih</it></sub>, <it>MaxTS</it><sub><it>ih</it></sub>. If <it>MaxTS</it><sub><it>ih </it></sub>is positive, the corresponding non-selected sequence <it>O</it><sub><it>h </it></sub>is selected as a medoid in place of <it>R</it><sub><it>i</it></sub>; otherwise, the Smith Waterman algorithm is used to compare each protein <it>O</it><sub><it>h </it></sub>of the data set with all medoids <it>R</it><sub><it>i</it>(<it>i </it>&#8712; [1..<it>K</it>])</sub> and to assign the sequence <it>O</it><sub><it>h </it></sub>to the nearest cluster. The input parameter of Pro-PAM is the number of clusters, <it>K</it>; as outputs, the algorithm returns the best partition of the protein sequence base and the medoid of each cluster. The Pro-PAM algorithm is depicted in Figure <figr fid="F3">3</figr>.</p>
               <fig id="F3">
                  <title>
                     <p>Figure 3</p>
                  </title>
                  <caption>
                     <p>Pseudo code for Pro-PAM algorithm</p>
                  </caption>
                  <text>
                     <p><b>Pseudo code for Pro-PAM algorithm</b>.</p>
                  </text>
                  <graphic file="1756-0381-2-3-3"/>
               </fig>
               <p>Pro-CLARA uses the optimal set of medoids <it>R</it><sub><it>i</it></sub><sub>(<it>i </it>&#8712; [1..<it>K</it>]) </sub>obtained by Pro-PAM and the Smith Waterman algorithm to compare each protein <it>O</it><sub><it>h </it></sub>of the data set <it>D </it>with all medoids <it>R</it><sub><it>i</it>(<it>i </it>&#8712; [1..<it>K</it>])</sub>, and to assign the sequence <it>O</it><sub><it>h </it></sub>to the nearest cluster. In order to alleviate sampling bias, Pro-CLARA repeats the sampling and the clustering process a pre-defined number of times, <it>q</it>, and subsequently selects as the final clustering result the set of medoids with the maximal <it>f (V)</it>. Input parameters of Pro-CLARA algorithm are the number of clusters, <it>K</it>, and of iterations, <it>q</it>, and as outputs the algorithm returns the best partition of the training base <it>D </it>and the <it>K </it>medoids of the obtained clusters. Pro-CLARA algorithm is detailed in Figure <figr fid="F4">4</figr>.</p>
               <fig id="F4">
                  <title>
                     <p>Figure 4</p>
                  </title>
                  <caption>
                     <p>Pseudo code for Pro-CLARA algorithm</p>
                  </caption>
                  <text>
                     <p><b>Pseudo code for Pro-CLARA algorithm</b>.</p>
                  </text>
                  <graphic file="1756-0381-2-3-4"/>
               </fig>
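               <p>The sampling scheme described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: <it>pam_like </it>is a simplified greedy swap search that accepts any swap raising the total score, standing in for Pro-PAM and its <it>TS</it><sub><it>ih </it></sub>criterion, and <it>toy_score </it>and all function names are hypothetical.</p>
               <p><![CDATA[
```python
import random


def toy_score(a, b):
    # Hypothetical stand-in for an alignment score: number of matching positions.
    return sum(x == y for x, y in zip(a, b))


def total_score(data, medoids, score):
    """Sum over all sequences of the score with their best-matching medoid."""
    return sum(max(score(o, m) for m in medoids) for o in data)


def pam_like(sample, k, score, rng):
    """Greedy swap search on the sample (a simplified stand-in for Pro-PAM)."""
    medoids = rng.sample(sample, k)
    improved = True
    while improved:
        improved = False
        current = total_score(sample, medoids, score)
        for i in range(k):
            for cand in sample:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                gain = total_score(sample, trial, score) - current
                if gain > 0:  # keep a swap only if it raises the total score
                    medoids, current, improved = trial, current + gain, True
    return medoids


def pro_clara_sketch(data, k, q, score, seed=0):
    rng = random.Random(seed)
    size = min(len(data), 40 + 2 * k)  # sample of 40 + 2K sequences
    best_f, best_medoids = None, None
    for _ in range(q):
        medoids = pam_like(rng.sample(data, size), k, score, rng)
        f = total_score(data, medoids, score)  # f(V) evaluated on the full data set
        if best_f is None or f > best_f:
            best_f, best_medoids = f, medoids
    return best_f, best_medoids
```
]]></p>
               <p>Only the medoid search runs on the sample; the quality of each candidate set of medoids is always measured on the whole data set, which is what allows the <it>q </it>repetitions to compensate for an unlucky sample.</p>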
            </sec>
            <sec>
               <st>
                  <p>Pro-CLARANS algorithm</p>
               </st>
               <p>The Pro-CLARANS algorithm starts from an arbitrary node <it>C </it>in the graph, <it>C = [R</it><sub>1</sub>, <it>R</it><sub>2</sub>,..., <it>R</it><sub><it>k</it></sub><it>]</it>, which represents an initial set of medoids. Pro-CLARANS randomly selects one of the neighbors of <it>C</it>, <it>C*</it>, which differs from <it>C </it>by only one sequence. If the total score of the selected neighbor, <it>TS</it><sub><it>ih </it></sub>(Equation (5)), is higher than that of the current node, <it>TS'</it><sub><it>ih</it></sub>, Pro-CLARANS proceeds to this neighbor and continues the neighbor selection and comparison process. Otherwise, Pro-CLARANS randomly checks another neighbor until a better neighbor is found or the pre-determined maximal number of neighbors to check, <it>Maxneighbor</it>, has been reached. In this study, <it>Maxneighbor </it>is defined as proposed in <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>:</p>
               <p>
                  <display-formula id="M7">
                     <graphic file="1756-0381-2-3-i7.gif"/>
                  </display-formula>
               </p>
               <p>That is, <it>Maxneighbor </it>is the larger of a threshold value of 250 and the value obtained from the number of clusters <it>K </it>and the number of sequences <it>n </it>in the data set as <it>1.25%*K*(n-K)</it>.</p>
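               <p>As a small worked example, the rule for the number of neighbors to check can be written in Python as follows. The function name is hypothetical, and truncating the product to an integer is an assumption, since the rounding convention is not stated.</p>
               <p><![CDATA[
```python
def max_neighbor(k, n, floor=250):
    """Larger of `floor` and 1.25% of K * (n - K), truncated to an integer."""
    # Integer arithmetic (125 / 10000 = 1.25%) avoids floating-point rounding.
    return max(floor, (125 * k * (n - k)) // 10000)
```
]]></p>
               <p>For <it>K </it>= 28 and <it>n </it>= 3500 this gives 1215 neighbors, whereas for small data sets such as <it>K </it>= 2 and <it>n </it>= 100 the threshold value of 250 applies.</p>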
               <p>The Pro-CLARANS algorithm aims to maximize the total score, <it>TS</it><sub><it>ih</it></sub>.</p>
               <p>The Pro-CLARANS algorithm then uses the Smith Waterman algorithm to compute the similarity score of each sequence <it>O</it><sub><it>h </it></sub>in <it>D </it>with each medoid <it>R</it><sub><it>i</it>(<it>i </it>&#8712; [1..<it>K</it>]) </sub>and to assign it to the nearest cluster. The algorithm repeats the clustering process a pre-defined number of times, <it>q</it>, and selects as the final clustering result the set of medoids with the maximal <it>f (V)</it>. The input parameters of the Pro-CLARANS algorithm are the number of clusters, <it>K</it>, and the number of iterations, <it>q</it>; as outputs, the algorithm returns the best partition of the training base <it>D </it>and the <it>K </it>medoids of the obtained clusters. The Pro-CLARANS algorithm is detailed in Figure <figr fid="F5">5</figr>.</p>
               <fig id="F5">
                  <title>
                     <p>Figure 5</p>
                  </title>
                  <caption>
                     <p>Pseudo code for Pro-CLARANS algorithm</p>
                  </caption>
                  <text>
                     <p><b>Pseudo code for Pro-CLARANS algorithm</b>.</p>
                  </text>
                  <graphic file="1756-0381-2-3-5"/>
               </fig>
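               <p>The randomized neighbor search described in this section can be sketched in Python as follows. The sketch is a simplification under stated assumptions: neighbors are compared through the total score of every sequence with its best-matching medoid rather than through <it>TS</it><sub><it>ih</it></sub>, a randomly drawn candidate that is already a medoid is counted as a checked neighbor, and <it>toy_score </it>and all function names are hypothetical.</p>
               <p><![CDATA[
```python
import random


def toy_score(a, b):
    # Hypothetical stand-in for an alignment score: number of matching positions.
    return sum(x == y for x, y in zip(a, b))


def total_score(data, medoids, score):
    """Sum over all sequences of the score with their best-matching medoid."""
    return sum(max(score(o, m) for m in medoids) for o in data)


def pro_clarans_sketch(data, k, q, score, max_neighbor=250, seed=0):
    rng = random.Random(seed)
    best_f, best_medoids = None, None
    for _ in range(q):
        current = rng.sample(data, k)  # arbitrary starting node
        current_f = total_score(data, current, score)
        checked = 0
        while checked < max_neighbor:
            i = rng.randrange(k)       # medoid to replace
            cand = rng.choice(data)    # replacement sequence
            if cand in current:
                checked += 1
                continue
            neighbor = current[:i] + [cand] + current[i + 1:]
            f = total_score(data, neighbor, score)
            if f > current_f:
                # Move to the better neighbor and restart the count.
                current, current_f, checked = neighbor, f, 0
            else:
                checked += 1
        if best_f is None or current_f > best_f:
            best_f, best_medoids = current_f, current
    return best_f, best_medoids
```
]]></p>
               <p>Each accepted move strictly increases the score, so the inner loop always stops, either at a node where <it>max_neighbor </it>consecutive random neighbors bring no improvement or earlier if the search stalls.</p>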
               <p>The proposed algorithms have been implemented as a Java package. All of them use the EMBOSS <url>ftp://emboss.open-bio.org/pub/EMBOSS/</url> implementation of the Smith and Waterman local alignment algorithm for computing alignment scores.</p>
               <p>BLOSUM62 (Blocks Substitution Matrix) <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> was chosen to compute amino acid substitution scores <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>. We chose the default penalties proposed by the EMBOSS implementation of the Smith and Waterman algorithm as gap opening (<it>P</it><sub><it>o</it></sub>) and gap extension (<it>P</it><sub><it>e</it></sub>) penalties (<it>P</it><sub><it>o </it></sub>= 10 and <it>P</it><sub><it>e </it></sub>= 2).</p>
            </sec>
            <sec>
               <st>
                  <p>Performance measure</p>
               </st>
               <p>To evaluate the Pro-Kmeans, Pro-LEADER, Pro-CLARA and Pro-CLARANS clustering algorithms, a large data set, the training data set, is used. The training phase yields <it>K </it>clusters, each defined by a medoid (centroid or leader). The training phase results are then used to cluster a different data set, the test data set. The Smith Waterman algorithm is used to compare each protein sequence of the test data set with all medoids <it>R</it><sub><it>i</it>(<it>i </it>&#8712; [1..<it>K</it>]) </sub>obtained from the training phase, and to assign each sequence to the nearest cluster. The predicted family group of each sequence is that of the nearest medoid.</p>
               <p>The results obtained from the test phase are used to calculate the <it>Sensitivity </it>and the <it>Specificity </it>of each algorithm and to compare them with the results of the published clustering tools ProClust, TribeMCL and JACOP, tested on the same test data set.</p>
               <p><it>Sensitivity </it>specifies the probability of correctly predicting a classifier and it is defined as</p>
               <p>
                  <display-formula id="M8">
                     <graphic file="1756-0381-2-3-i8.gif"/>
                  </display-formula>
               </p>
               <p>and <it>Specificity </it>is the probability that the provided prediction is correct, defined as</p>
               <p>
                  <display-formula id="M9">
                     <graphic file="1756-0381-2-3-i9.gif"/>
                  </display-formula>
               </p>
               <p>where <it>TP </it>(True Positives) is the number of correctly identified truly homologous pairs, <it>FN </it>(False Negatives) is the number of truly homologous pairs that were not identified and <it>FP </it>(False Positives) is the number of non-homologous pairs predicted to be homologous. A pair of sequences is considered truly homologous if both sequences are in the same family group.</p>
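               <p>The pair-based counts defined above can be computed as in the following Python sketch. It assumes that Sensitivity is TP / (TP + FN) and Specificity is TP / (TP + FP), which is consistent with the three counts defined in the text but should be checked against Equations (8) and (9). The function name and the label lists are hypothetical.</p>
               <p><![CDATA[
```python
from itertools import combinations


def pair_metrics(true_labels, predicted_labels):
    """Sensitivity and specificity over all pairs of sequences."""
    tp = fn = fp = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]            # truly homologous pair
        same_pred = predicted_labels[i] == predicted_labels[j]  # predicted homologous
        if same_true and same_pred:
            tp += 1
        elif same_true:
            fn += 1
        elif same_pred:
            fp += 1
    # Assumed definitions: TP / (TP + FN) and TP / (TP + FP).
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity
```
]]></p>
               <p>For five sequences with family labels a, a, a, b, b and predicted clusters 0, 0, 1, 1, 1, there are two true positive, two false negative and two false positive pairs, so both measures equal 0.5.</p>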
            </sec>
            <sec>
               <st>
                  <p>Protein sequence data sets</p>
               </st>
               <p>To evaluate the performance of the proposed clustering algorithms Pro-Kmeans, Pro-LEADER, Pro-CLARA and Pro-CLARANS, and to compare their results with the available graph-based clustering tools ProClust and TribeMCL and the only available partitioning clustering tool, JACOP, protein sequence families with known subfamilies/groups are considered. Protein sequences of the HLA protein family were collected from <url>ftp://ftp.ebi.ac.uk/pub/databases/imgt/mhc/hla</url>. From this set, we randomly selected 893 sequences, named DS1, grouped into 12 classes. Protein sequences of the Hydrolases protein family were collected from <url>http://www.brenda-enzymes.org/</url>. They are categorized into 8 classes according to their function, and 3737 sequences, named DS2, were considered from this family. From the Globins protein family <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>, 292 sequences, named DS3, were selected randomly from 8 different classes of the data set <it>IPR000971 </it>in <url>http://srs.ebi.ac.uk</url>. In total, 28 different classes of sequences are considered, as classified by scientists/experts.</p>
               <p>The combined data set, named DS4, has a total of 4922 sequences, of which 3500 sequences (approximately 70% of DS4) are randomly selected for training and 1422 (approximately 30% of DS4) for testing. The same method is used to obtain the training and test sets of DS1, DS2 and DS3: 70% of each set is randomly selected for training and 30% for testing [see Additional file <supplr sid="S1">1</supplr>].</p>
               <suppl id="S1">
                  <title>
                     <p>Additional File 1</p>
                  </title>
                  <text>
                     <p><b>Training and test data sets</b>. The file contains text files, in FASTA format, corresponding to the data set used in this study. It contains two directories: the training base, which has 3500 sequences, and the test base, which has 1422 sequences. The data set, named DS4, has a total of 4922 sequences, of which 3500 (approximately 70% of DS4) are randomly selected for training and 1422 (approximately 30% of DS4) for testing. It contains proteins selected from the HLA (DS1), Hydrolases (DS2) and Globins (DS3) protein families.</p>
                  </text>
                  <file name="1756-0381-2-3-S1.zip">
                     <p>Click here for file</p>
                  </file>
               </suppl>
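               <p>The random split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the fixed random seed and the placeholder identifiers are hypothetical, and only the data set sizes are taken from the text.</p>

```python
import random


def split_train_test(sequence_ids, n_train, seed=0):
    """Randomly split identifiers into a training and a test list."""
    rng = random.Random(seed)  # the seed is an assumption, for repeatability
    shuffled = list(sequence_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Data set sizes reported in the text.
sizes = {"DS1": 893, "DS2": 3737, "DS3": 292}
total = sum(sizes.values())  # size of DS4

# Placeholder identifiers stand in for the real FASTA records.
ids = [f"seq{i}" for i in range(total)]
train, test = split_train_test(ids, n_train=3500)
train_share = round(100 * len(train) / total, 1)
```

               <p>With these sizes the three families sum to the 4922 sequences of DS4, and the 3500 training sequences correspond to about 71.1% of the set, which is why the text speaks of approximately 70%.</p>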
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and Discussion</p>
         </st>
         <p>Experiments were conducted on an Intel Pentium 4 machine with a clock frequency of 2.4 GHz and 512 MB of RAM. Experimental results were obtained using the following default values. In the Pro-Kmeans, Pro-CLARA and Pro-CLARANS algorithms, the number of iterations <it>q</it> is fixed to 5 <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> and the number of clusters <it>K</it> is fixed to 28: after a number of simulations, we found that the best clustering results are obtained with <it>K</it> = 28 <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. In the Pro-LEADER algorithm, the <it>Threshold</it> value is fixed to 350 after a number of simulations <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>.</p>
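         <p>To make the role of a threshold parameter concrete, the following sketch shows a generic single-pass leader clustering. It is not the Pro-LEADER implementation: the similarity function is a hypothetical stand-in that counts matching positions, the rule of joining the first leader whose score is at least the threshold is an assumption, and the toy threshold of 4 is unrelated to the value of 350 used with the paper's own similarity scores.</p>

```python
def leader_clustering(items, similarity, threshold):
    """Generic single-pass leader clustering.

    Each item joins the first cluster whose leader it matches with a
    similarity of at least `threshold`; otherwise it starts a new cluster.
    """
    leaders = []
    clusters = []
    for item in items:
        for index, leader in enumerate(leaders):
            if similarity(item, leader) >= threshold:
                clusters[index].append(item)
                break
        else:
            # No leader was similar enough: the item becomes a new leader.
            leaders.append(item)
            clusters.append([item])
    return clusters


def toy_similarity(a, b):
    """Hypothetical stand-in score: number of matching positions."""
    return sum(1 for x, y in zip(a, b) if x == y)


# Toy sequences and a toy threshold; the paper's value of 350 applies to
# its own similarity scores, not to this stand-in.
sequences = ["MKVLA", "MKVLG", "TTPQR", "TTPQW", "MKALA"]
groups = leader_clustering(sequences, toy_similarity, threshold=4)
```

         <p>In this kind of scheme a higher threshold produces more and smaller clusters, and the result depends on the order in which the sequences are presented.</p>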
         <p>Experimental results of Pro-Kmeans, Pro-LEADER, Pro-CLARA, Pro-CLARANS, ProClust, TribeMCL and JACOP algorithms on DS1, DS2, DS3 and DS4 are summarized in Table <tblr tid="T2">2</tblr>.</p>
         <tbl id="T2">
            <title>
               <p>Table 2</p>
            </title>
            <caption>
               <p>Performance of three existing tools (ProClust, TribeMCL and JACOP) and our four proposed methods on the DS1, DS2, DS3 and DS4 data sets with respect to two clustering quality measures: Sensitivity (Sens.) and Specificity (Spec.)</p>
            </caption>
            <tblbdy cols="9">
               <r>
                  <c ca="left">
                     <p>Algorithms</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>DS1</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>DS2</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>DS3</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>DS4</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c cspan="8">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Sens.</p>
                  </c>
                  <c ca="left">
                     <p>Spec.</p>
                  </c>
                  <c ca="left">
                     <p>Sens.</p>
                  </c>
                  <c ca="left">
                     <p>Spec.</p>
                  </c>
                  <c ca="left">
                     <p>Sens.</p>
                  </c>
                  <c ca="left">
                     <p>Spec.</p>
                  </c>
                  <c ca="left">
                     <p>Sens.</p>
                  </c>
                  <c ca="left">
                     <p>Spec.</p>
                  </c>
               </r>
               <r>
                  <c cspan="9">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>ProClust</p>
                  </c>
                  <c ca="left">
                     <p>50.64</p>
                  </c>
                  <c ca="left">
                     <p>56.77</p>
                  </c>
                  <c ca="left">
                     <p>48.71</p>
                  </c>
                  <c ca="left">
                     <p>61.86</p>
                  </c>
                  <c ca="left">
                     <p>46.09</p>
                  </c>
                  <c ca="left">
                     <p>55.14</p>
                  </c>
                  <c ca="left">
                     <p>46.39</p>
                  </c>
                  <c ca="left">
                     <p>51.07</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>TribeMCL</p>
                  </c>
                  <c ca="left">
                     <p>46.09</p>
                  </c>
                  <c ca="left">
                     <p>52.89</p>
                  </c>
                  <c ca="left">
                     <p>41.42</p>
                  </c>
                  <c ca="left">
                     <p>52.14</p>
                  </c>
                  <c ca="left">
                     <p>41.04</p>
                  </c>
                  <c ca="left">
                     <p>47.48</p>
                  </c>
                  <c ca="left">
                     <p>51.22</p>
                  </c>
                  <c ca="left">
                     <p>56.46</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>JACOP</p>
                  </c>
                  <c ca="left">
                     <p>99.92</p>
                  </c>
                  <c ca="left">
                     <p>66.27</p>
                  </c>
                  <c ca="left">
                     <p>99.96</p>
                  </c>
                  <c ca="left">
                     <p>70.06</p>
                  </c>
                  <c ca="left">
                     <p>99.96</p>
                  </c>
                  <c ca="left">
                     <p>73.96</p>
                  </c>
                  <c ca="left">
                     <p>99.92</p>
                  </c>
                  <c ca="left">
                     <p>94.42</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Pro-Kmeans</p>
                  </c>
                  <c ca="left">
                     <p>92.38</p>
                  </c>
                  <c ca="left">
                     <p>99.90</p>
                  </c>
                  <c ca="left">
                     <p>55.32</p>
                  </c>
                  <c ca="left">
                     <p>98.01</p>
                  </c>
                  <c ca="left">
                     <p>63.30</p>
                  </c>
                  <c ca="left">
                     <p>96.92</p>
                  </c>
                  <c ca="left">
                     <p>56.06</p>
                  </c>
                  <c ca="left">
                     <p>99.56</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Pro-LEADER</p>
                  </c>
                  <c ca="left">
                     <p>90.21</p>
                  </c>
                  <c ca="left">
                     <p>91.40</p>
                  </c>
                  <c ca="left">
                     <p>53.15</p>
                  </c>
                  <c ca="left">
                     <p>91.24</p>
                  </c>
                  <c ca="left">
                     <p>52.96</p>
                  </c>
                  <c ca="left">
                     <p>74.06</p>
                  </c>
                  <c ca="left">
                     <p>23.34</p>
                  </c>
                  <c ca="left">
                     <p>95.70</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Pro-CLARA</p>
                  </c>
                  <c ca="left">
                     <p>93.60</p>
                  </c>
                  <c ca="left">
                     <p>99.92</p>
                  </c>
                  <c ca="left">
                     <p>73.28</p>
                  </c>
                  <c ca="left">
                     <p>99.26</p>
                  </c>
                  <c ca="left">
                     <p>81.53</p>
                  </c>
                  <c ca="left">
                     <p>98.60</p>
                  </c>
                  <c ca="left">
                     <p>77.84</p>
                  </c>
                  <c ca="left">
                     <p>99.66</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Pro-CLARANS</p>
                  </c>
                  <c ca="left">
                     <p>93.10</p>
                  </c>
                  <c ca="left">
                     <p>99.90</p>
                  </c>
                  <c ca="left">
                     <p>78.62</p>
                  </c>
                  <c ca="left">
                     <p>98.70</p>
                  </c>
                  <c ca="left">
                     <p>76.24</p>
                  </c>
                  <c ca="left">
                     <p>97.34</p>
                  </c>
                  <c ca="left">
                     <p>62.06</p>
                  </c>
                  <c ca="left">
                     <p>99.09</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>DS4 is a very large data set which contains all sequences of DS1 (HLA protein family), DS2 (Hydrolases protein family) and DS3 (Globins protein family).</p>
            </tblfn>
         </tbl>
         <p>In our experiments, the partitioning clustering methods Pro-Kmeans, Pro-LEADER, Pro-CLARA, Pro-CLARANS and JACOP improved on the sensitivity and specificity of the graph-based methods ProClust and TribeMCL, the one exception being the sensitivity of Pro-LEADER on DS4.</p>
         <p>These results demonstrate the performance of the proposed Pro-Kmeans, Pro-LEADER, Pro-CLARA and Pro-CLARANS algorithms for the similarity-based clustering of protein sequences.</p>
         <p>The experiments also show that, on the considered data sets DS1, DS2, DS3 and DS4, the highest probability of correctly predicting a classifier (<it>Sensitivity</it>) is obtained with the JACOP method. The Pro-CLARA method gives the highest probability that the provided prediction is correct (<it>Specificity</it>), although Pro-Kmeans, Pro-LEADER and Pro-CLARANS also obtain good results. The Pro-LEADER method is less useful on the very large and heterogeneous set DS4: the number of true homologue pairs that are not identified (False Negatives) is very large, so the obtained <it>Sensitivity</it> is limited to 23.34.</p>
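         <p>These observations can be checked directly against the values of Table 2. The short sketch below transcribes the table and reports, for each data set, the method with the highest value of each measure; the variable and function names are illustrative only.</p>

```python
# Table 2 values as (sensitivity, specificity) for DS1, DS2, DS3, DS4.
TABLE2 = {
    "ProClust":    [(50.64, 56.77), (48.71, 61.86), (46.09, 55.14), (46.39, 51.07)],
    "TribeMCL":    [(46.09, 52.89), (41.42, 52.14), (41.04, 47.48), (51.22, 56.46)],
    "JACOP":       [(99.92, 66.27), (99.96, 70.06), (99.96, 73.96), (99.92, 94.42)],
    "Pro-Kmeans":  [(92.38, 99.90), (55.32, 98.01), (63.30, 96.92), (56.06, 99.56)],
    "Pro-LEADER":  [(90.21, 91.40), (53.15, 91.24), (52.96, 74.06), (23.34, 95.70)],
    "Pro-CLARA":   [(93.60, 99.92), (73.28, 99.26), (81.53, 98.60), (77.84, 99.66)],
    "Pro-CLARANS": [(93.10, 99.90), (78.62, 98.70), (76.24, 97.34), (62.06, 99.09)],
}
DATASETS = ["DS1", "DS2", "DS3", "DS4"]


def best_method(measure_index):
    """Map each data set to the method with the highest value of a measure.

    measure_index is 0 for sensitivity and 1 for specificity.
    """
    return {
        name: max(TABLE2, key=lambda method: TABLE2[method][col][measure_index])
        for col, name in enumerate(DATASETS)
    }


best_sensitivity = best_method(0)
best_specificity = best_method(1)
```

         <p>On the tabulated values, JACOP is selected for sensitivity and Pro-CLARA for specificity on all four data sets.</p>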
         <p>The results of Pro-Kmeans, Pro-LEADER, Pro-CLARA and Pro-CLARANS confirm that the proposed partitioning methods are valuable, reliable tools for the automatic functional clustering of protein sequences. Using these methods instead of alignment methods or the classic clustering methods known to biologists can improve clustering sensitivity and specificity and significantly reduce computational time. The proposed methods can be used by biologists, in particular to cluster a large protein data set into meaningful clusters in order to detect protein functions.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>Similar protein sequences probably have similar biochemical functions and three-dimensional structures. If two sequences from different organisms are similar, they may have a common ancestor sequence, in which case they are said to be homologous. Protein sequence clustering using the Pro-Kmeans, Pro-LEADER, Pro-CLARA and Pro-CLARANS methods helps to classify a new sequence, retrieve a set of similar sequences for a given query sequence and predict the protein structure of an unknown sequence. We observed that classifying large protein sequence data sets with clustering techniques, instead of alignment methods alone, greatly reduces the execution time and improves the efficiency of this important task in molecular biology.</p>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The authors declare that they have no competing interests.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>SF conceived and directed the research and wrote the manuscript. NE and ML participated in the coordination and the direction of the whole study. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Computational Molecular Biology &#8211; An Introduction</p>
            </title>
            <aug>
               <au>
                  <snm>Clote</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Backofen</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <publisher>John Wiley &amp; Sons, Ltd</publisher>
            <pubdate>2000</pubdate>
            <xrefbib>
               <pubid idtype="pmpid">10902160</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Bioinformatics &#8211; Sequence and Genome Analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Mount</snm>
                  <fnm>DW</fnm>
               </au>
            </aug>
            <publisher>Cold Spring Harbor Laboratory Press, New York</publisher>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Identification of common molecular subsequences</p>
            </title>
            <aug>
               <au>
                  <snm>Smith</snm>
                  <fnm>TF</fnm>
               </au>
               <au>
                  <snm>Waterman</snm>
                  <fnm>MS</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1981</pubdate>
            <volume>147</volume>
            <fpage>195</fpage>
            <lpage>197</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0022-2836(81)90087-5</pubid>
                  <pubid idtype="pmpid" link="fulltext">7265238</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>A general method applicable to the search for similarities in the amino acid sequence of the proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Needleman</snm>
                  <fnm>SB</fnm>
               </au>
               <au>
                  <snm>Wunsch</snm>
                  <fnm>CD</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1970</pubdate>
            <volume>48</volume>
            <fpage>443</fpage>
            <lpage>453</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0022-2836(70)90057-4</pubid>
                  <pubid idtype="pmpid" link="fulltext">5420325</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Discovering Data Mining: From Concept to Implementation</p>
            </title>
            <aug>
               <au>
                  <snm>Cabena</snm>
                  <fnm>P</fnm>
               </au>
               <etal/>
            </aug>
            <publisher>Prentice Hall PTR, Upper Saddle River, NJ</publisher>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Data mining and knowledge discovery: Making sense out of data</p>
            </title>
            <aug>
               <au>
                  <snm>Fayyad</snm>
                  <fnm>UM</fnm>
               </au>
            </aug>
            <source>IEEE Expert</source>
            <pubdate>1996</pubdate>
            <volume>11</volume>
            <fpage>20</fpage>
            <lpage>25</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1109/64.539013</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>The COG database: an updated version includes eukaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Tatusov</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Fedorova</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Jackson</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Jacobs</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kiryutin</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Krylov</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Mazumder</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Mekhedov</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Nikolskaya</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Rao</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Smirnov</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Sverdlov</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Vasudevan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Wolf</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Yin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Natale</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <fpage>41</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">222959</pubid>
                  <pubid idtype="pmpid" link="fulltext">12969510</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-4-41</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>ProtoNet 4.0: a hierarchical classification of one million protein sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Kaplan</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Sasson</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Inbar</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Friedlich</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Fromer</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Fleischer</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Portugaly</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Linial</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Linial</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <issue>Database issue</issue>
            <fpage>D216</fpage>
            <lpage>8</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">539961</pubid>
                  <pubid idtype="pmpid" link="fulltext">15608180</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>The metric space of proteins-comparative study of clustering algorithms</p>
            </title>
            <aug>
               <au>
                  <snm>Sasson</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Linial</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Linial</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <issue>Suppl 1</issue>
            <fpage>S14</fpage>
            <lpage>21</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12169526</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Picasso: generating a covering set of protein family profiles</p>
            </title>
            <aug>
               <au>
                  <snm>Herger</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Holm</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <issue>3</issue>
            <fpage>272</fpage>
            <lpage>9</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/17.3.272</pubid>
                  <pubid idtype="pmpid" link="fulltext">11294792</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Improvements to CluSTr: the database of SWISS-PROT + TrEMBL protein clusters</p>
            </title>
            <aug>
               <au>
                  <snm>Kriventseva</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Servant</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>1</issue>
            <fpage>388</fpage>
            <lpage>9</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">165482</pubid>
                  <pubid idtype="pmpid" link="fulltext">12520029</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg035</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>GeneRAGE: a robust algorithm for sequence clustering and domain detection</p>
            </title>
            <aug>
               <au>
                  <snm>Enright</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Ouzounis</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2000</pubdate>
            <volume>16</volume>
            <issue>5</issue>
            <fpage>451</fpage>
            <lpage>7</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/16.5.451</pubid>
                  <pubid idtype="pmpid" link="fulltext">10871267</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Large scale hierarchical clustering of protein sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Krause</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Stoye</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Vingron</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>15</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">547898</pubid>
                  <pubid idtype="pmpid" link="fulltext">15663796</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-15</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>ProtoMap: automatic classification of protein sequences and hierarchy of protein families</p>
            </title>
            <aug>
               <au>
                  <snm>Yona</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Linial</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Linial</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <issue>1</issue>
            <fpage>49</fpage>
            <lpage>55</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102438</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592179</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.49</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Normalized cuts and image segmentation</p>
            </title>
            <aug>
               <au>
                  <snm>Shi</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Malik</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
            <pubdate>1997</pubdate>
            <fpage>731</fpage>
            <lpage>737</lpage>
         </bibl>
         <bibl id="B16">
            <title>
               <p>An optimal graph theoretic approach to data clustering: theory and its application to image segmentation</p>
            </title>
            <aug>
               <au>
                  <snm>Wu</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Leahy</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>IEEE Trans Pattern Anal Mach Intell</source>
            <pubdate>1993</pubdate>
            <volume>15</volume>
            <issue>11</issue>
            <fpage>1101</fpage>
            <lpage>1113</lpage>
         </bibl>
         <bibl id="B17">
            <title>
               <p>ProClust: improved clustering of protein sequences with an extended graph-based approach</p>
            </title>
            <aug>
               <au>
                  <snm>Pipenbacher</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Schliep</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Schneckener</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Sch&#246;nhuth</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Schomburg</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Schrader</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <issue>Suppl 2</issue>
            <fpage>S182</fpage>
            <lpage>91</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12386002</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Clustering protein sequences-structure prediction by transitive homology</p>
            </title>
            <aug>
               <au>
                  <snm>Bolten</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Schliep</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Schneckener</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Schomburg</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Schrader</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <issue>10</issue>
            <fpage>935</fpage>
            <lpage>41</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/17.10.935</pubid>
                  <pubid idtype="pmpid" link="fulltext">11673238</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>An efficient algorithm for large-scale detection of protein families</p>
            </title>
            <aug>
               <au>
                  <snm>Enright</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Van Dongen</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ouzounis</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <issue>7</issue>
            <fpage>1575</fpage>
            <lpage>84</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">101833</pubid>
                  <pubid idtype="pmpid" link="fulltext">11917018</pubid>
                  <pubid idtype="doi">10.1093/nar/30.7.1575</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Graph clustering by flow simulation</p>
            </title>
            <aug>
               <au>
                  <snm>Van Dongen</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>PhD Thesis</source>
            <publisher>University of Utrecht, The Netherlands</publisher>
            <pubdate>2000</pubdate>
         </bibl>
         <bibl id="B21">
            <title>
               <p>A scalable algorithm for clustering sequential data</p>
            </title>
            <aug>
               <au>
                  <snm>Guralnik</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Karypis</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>SIGKDD Workshop on Bioinformatics, BIOKDD</source>
            <pubdate>2001</pubdate>
         </bibl>
         <bibl id="B22">
            <title>
               <p>JACOP: a simple and robust method for the automated classification of protein sequences with modular architecture</p>
            </title>
            <aug>
               <au>
                  <snm>Sperisen</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Pagni</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>216</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1208858</pubid>
                  <pubid idtype="pmpid" link="fulltext">16135248</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-216</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Finding Groups in Data: An Introduction to Cluster Analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Kaufman</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Rousseeuw</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <publisher>John Wiley &amp; Sons, Inc., New York</publisher>
            <pubdate>1990</pubdate>
         </bibl>
         <bibl id="B24">
            <title>
               <p>MyHits: improvements to an interactive resource for analyzing protein sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Pagni</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ioannidis</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Cerutti</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Zahn-Zabal</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Jongeneel</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Hau</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Kuznetsov</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Falquet</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2007</pubdate>
            <volume>35</volume>
            <issue>Web Server issue</issue>
            <fpage>W433</fpage>
            <lpage>37</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1933190</pubid>
                  <pubid idtype="pmpid" link="fulltext">17545200</pubid>
                  <pubid idtype="doi">10.1093/nar/gkm352</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Algorithms for Clustering Data</p>
            </title>
            <aug>
               <au>
                  <snm>Jain</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Dubes</snm>
                  <fnm>RC</fnm>
               </au>
            </aug>
            <publisher>Prentice-Hall</publisher>
            <pubdate>1988</pubdate>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Clustering and the continuous k-means algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Faber</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Los Alamos Science</source>
            <pubdate>1994</pubdate>
            <volume>22</volume>
            <fpage>138</fpage>
            <lpage>144</lpage>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Algorithm AS136: A k-means clustering algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Hartigan</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Wong</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Applied Statistics</source>
            <pubdate>1979</pubdate>
            <volume>28</volume>
            <fpage>100</fpage>
            <lpage>108</lpage>
            <xrefbib>
               <pubid idtype="doi">10.2307/2346830</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Incremental clustering for dynamic information processing</p>
            </title>
            <aug>
               <au>
                  <snm>Can</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>ACM Trans Inf Syst</source>
            <pubdate>1993</pubdate>
            <volume>11</volume>
            <issue>2</issue>
            <fpage>143</fpage>
            <lpage>164</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1145/130226.134466</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Cluster analysis algorithms</p>
            </title>
            <aug>
               <au>
                  <snm>Sp&#228;th</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <publisher>Ellis Horwood, Chichester, UK</publisher>
            <pubdate>1980</pubdate>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Efficient and Effective Clustering Methods for Spatial Data Mining</p>
            </title>
            <aug>
               <au>
                  <snm>Ng</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Han</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proceedings of International Conference on Very Large Data Bases</source>
            <publisher>Santiago, Chile</publisher>
            <pubdate>1994</pubdate>
            <fpage>144</fpage>
            <lpage>155</lpage>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships</p>
            </title>
            <aug>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Chothia</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Hubbard</snm>
                  <fnm>TJ</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <fpage>6073</fpage>
            <lpage>6078</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">27587</pubid>
                  <pubid idtype="pmpid" link="fulltext">9600919</pubid>
                  <pubid idtype="doi">10.1073/pnas.95.11.6073</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Improved tools for biological sequence comparison</p>
            </title>
            <aug>
               <au>
                  <snm>Pearson</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1988</pubdate>
            <volume>85</volume>
            <fpage>2444</fpage>
            <lpage>2448</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">280013</pubid>
                  <pubid idtype="pmpid">3162770</pubid>
                  <pubid idtype="doi">10.1073/pnas.85.8.2444</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Basic local alignment search tool</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Gish</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1990</pubdate>
            <volume>215</volume>
            <fpage>403</fpage>
            <lpage>410</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">2231712</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>A comparison of four pair-wise sequence alignment methods</p>
            </title>
            <aug>
               <au>
                  <snm>Essoussi</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Fayech</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Bioinformation</source>
            <pubdate>2007</pubdate>
            <volume>2</volume>
            <fpage>166</fpage>
            <lpage>168</lpage>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Performance evaluation of amino acid substitution matrices</p>
            </title>
            <aug>
               <au>
                  <snm>Henikoff</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Henikoff</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>1993</pubdate>
            <volume>17</volume>
            <fpage>49</fpage>
            <lpage>61</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.340170108</pubid>
                  <pubid idtype="pmpid">8234244</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Positionsgenaues Alignment von Proteinsequenzen</p>
            </title>
            <aug>
               <au>
                  <snm>Schneckener</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>PhD Thesis</source>
            <publisher>Universit&#228;t zu K&#246;ln</publisher>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B37">
            <title>
               <p>The Universal Protein Resource (UniProt): an expanding universe of protein information</p>
            </title>
            <aug>
               <au>
                  <snm>Wu</snm>
                  <fnm>CH</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2006</pubdate>
            <volume>34</volume>
            <fpage>D187</fpage>
            <lpage>D191</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/nar/gkl485</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>How many clusters are best?</p>
            </title>
            <aug>
               <au>
                  <snm>Dubes</snm>
                  <fnm>RC</fnm>
               </au>
            </aug>
            <source>Pattern Recogn</source>
            <pubdate>1987</pubdate>
            <volume>20</volume>
            <issue>6</issue>
            <fpage>645</fpage>
            <lpage>663</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/0031-3203(87)90034-3</pubid>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>

