An overview and comparison of voting methods for pattern recognition


Merijn van Erp, NICI, P.O. Box 9104, 6500 HE Nijmegen, The Netherlands (M.vanErp@nici.kun.nl)
Louis Vuurpijl, NICI, P.O. Box 9104, 6500 HE Nijmegen, The Netherlands (vuurpijl@nici.kun.nl)
Lambert Schomaker, Artificial Intelligence, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands (L.Schomaker@ai.rug.nl)

Abstract

In pattern recognition, multiple-classifier combinations are increasingly used with the goal of improving recognition performance. In many cases, plurality voting is part of the combination process. In this article, we discuss and test several well-known voting methods from politics and economics on classifier combination, in order to see whether an alternative to the simple plurality vote exists. We found that, given a number of prerequisites, better methods are available that are comparatively simple and fast.

1 Introduction

Multiple-classifier combination is rapidly establishing itself as a major topic in the pattern recognition community [4, 7, 11]. In this article we explore and evaluate the application of several well-known voting methods to the combination of multiple-classifier hypotheses. Broadly, there are two forms of classifier combination: multi-stage/hierarchical methods [1, 10] and ensemble (or late-fusion) methods [4, 8]. In the first approach, the classifiers are placed in a multi-layered architecture in which the output of one layer limits the possible classes, or chooses the most applicable classifier, in the next layer. The second approach uses ensembles of classifiers, trained on different or similar data and using different or similar features. The classifiers are run simultaneously and their outputs are merged into one compound classification. In most cases, this combination of output hypotheses is done with the simplest of voting methods, the plurality vote (often erroneously called the majority vote), though more elaborate combination schemes have been proposed (e.g.
Dempster-Shafer, BKS and DCS [5]). Plurality voting is most often used in classifier combination because it is simple and yields acceptable results. However, we will show that alternative, and sometimes better, voting methods exist. The concept of voting is well known from, e.g., politics and economics, where multiple opinions held by people must be merged into one final decision. Many different voting methods stem from these areas; all are relatively simple to perform, but they use different amounts of information. In this paper we present and discuss the best known of these voting methods that are suitable for application in classifier combination. Their performance is assessed by combining various ensembles of classifiers.

This paper is organized as follows. Section 2 presents an overview and discussion of the voting methods. Section 3 contains the setup of the experiment in which we tested the voting methods on ensembles of classifiers. Sections 4 and 5 describe and discuss the results of the experiment.

2 Voting methods

In this section the voting methods are presented. We start in Section 2.1 with a general overview of voting methods and their application to combining classifiers, and then discuss the actual voting methods in three distinguishable classes: unweighted voting methods, confidence voting methods and ranked voting methods.

2.1 General overview

In human society, voting is a formal way of expressing opinions. A well-known example is the election of a president. Here, the voters are the people, who express their opinion by means of a vote. When voting, a voter chooses one of the candidates or indicates some kind of rank order that expresses his or her preference. The voting method is the mechanism that integrates all votes into one final decision; the winner is the candidate chosen as the result of the voting method. Voting translates to multiple-classifier combination as follows: the classifiers are the voters, the possible classes are the candidates, and an election is the classification of one sample. This produces a winner, which is the resulting classification of the sample by the ensemble of classifiers. The actual voting depends on the voting method used, but a classifier expresses its opinion simply by classifying a sample. The result of this classification, be it a single class, a ranked list of all classes or even a scored list of the classes, can be interpreted as a vote. The voting methods are simple, formal, step-by-step procedures (see the next section), so implementing them, given the translation above, is straightforward. In fact, the process of voting by pattern classifiers is simpler than the process of voting by humans: classifiers are programmed to classify a sample independently of the results of other classifiers. Therefore, current classifiers will not alter their results in order to exploit the voting method to the benefit of a preferred class, as a human might do.
In other words, a classifier does not cheat (yet).

2.2 Unweighted voting methods

The unweighted voting methods are those in which each vote carries equal weight; the only differentiation between the candidates is the number of votes they have received. As a consequence, voters cannot express a degree of preference of one candidate over another. Although this discards relevant information, it also results in less complex methods, because no elaborate measures need to be taken to limit the power of a voter when expressing degrees of preference. Another drawback is the greater chance of a tied result; with the lack of extra information, a tie can only be resolved by a random draw. Three of the voting methods presented here (amendment, run-off and Condorcet) are multi-step procedures. These methods require that the classifiers are able to give a preference choice between any two given classes, which makes them more difficult to apply than the other unweighted voting methods. It might be argued that the multi-step methods should be placed under the ranked methods (see Section 2.4), but the separate steps are inherently unweighted votes, so they are discussed here. In terms of classifier combination, the single-step unweighted voting methods demand no prerequisites from the classifiers, but they also do not use any extra information the classifiers may provide. The multi-step methods expect the classifiers to be able to handle two-class subdomains of a larger population of classes.

Plurality: Also known as first past the post, plurality is the simplest form of voting. Every voter has one vote, which it can cast for any one candidate; the candidate with the highest number of votes wins. The benefits of this method are its simplicity and ease of use. The major drawback of plurality voting is the real possibility of a win on a small number of votes, and thus of a minority (and very probably erroneous) winner.
Majority: In majority voting, every voter has one vote that can be cast for any one candidate. The candidate that receives the majority (i.e. more than half) of the votes wins the election. Note that majority voting is often confused with plurality voting, in which no majority is needed to win. The benefits of this method are its simplicity and its low error count: the method only appoints a winner when there is a majority candidate, so to produce an error the majority of the classifiers has to be wrong, and the chances of this happening are low, especially with a large number of classifiers. The downside is that when no majority candidate is present, no result is produced and the sample is rejected by the voting method.

Amendment vote: Amendment voting starts with a majority vote between the first two candidates. The winner of that election is pitted against the next available candidate, and so on, until the winner of the final pairwise vote is declared the overall winner. This voting method favors the candidates that enter the election last; this lack of neutrality should be recognized when using it.
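As a minimal sketch of the difference between the two single-step methods above (the class labels are illustrative, and returning None for a rejected sample is an assumption of this sketch; a tie would be broken by a random draw):

```python
from collections import Counter

def plurality(votes):
    # The candidate with the most votes wins, majority or not.
    return Counter(votes).most_common(1)[0][0]

def majority(votes):
    # A candidate wins only with more than half the votes; otherwise the
    # sample is rejected (returned here as None).
    candidate, count = Counter(votes).most_common(1)[0]
    return candidate if count > len(votes) / 2 else None

ballots = ["A", "A", "B", "C"]   # one top choice per classifier
print(plurality(ballots))        # "A" wins with only 2 of 4 votes
print(majority(ballots))         # None: 2 of 4 is not a majority
```

The same ballot profile thus exhibits both the minority-winner risk of plurality and the rejection behavior of the majority vote.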

Run-off vote: The run-off vote is a two-step voting process. In the first step, each voter can vote for any one candidate, and the two candidates with the highest numbers of votes advance to the second round. The second round is a majority vote between these two candidates, in which all voters participate again. The run-off vote solves the biggest problem of the plurality vote and has no rejections like the majority vote, at the cost of a slight decrease in transparency: it always delivers a winner, and the chance of electing a minority candidate is considerably decreased.

Condorcet count: In this method, all candidates are compared in pairwise elections, and the winner of each election scores a point. The candidate with the highest number of points wins the total election. This method is more complex than the other unweighted voting methods, but it also suffers least from their problems.

2.3 Confidence voting methods

In confidence voting methods, voters can express the degree of their preference for a candidate. This is done by assigning a value (the confidence value, hence the name of these voting methods) to each candidate; the higher the confidence value, the more the candidate is preferred by the voter. Examples of confidence scores in pattern recognition are probabilities and distances. The prerequisite for using these voting methods in classifier combination is not only that the classifiers produce such a confidence value, but also that these confidence values are scaled correctly. Questions such as "is there a limit to the confidence value, or will any number do?" and "how does one translate a preference for a candidate into a proportionally correct value?" should therefore be answered.

Pandemonium: Every voter is given one vote, which it can cast for any one candidate. The voter casts the vote by stating its confidence in the candidate, and the candidate that receives the vote with the highest confidence of all votes cast wins.
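The Condorcet count above can be sketched as follows (the per-voter preference lists are an illustrative assumption, and a tied pairwise election simply scores no point in this sketch):

```python
from itertools import combinations

def condorcet_count(rankings, candidates):
    # Each candidate scores a point per pairwise election it wins;
    # the candidate with the most points wins the total election.
    points = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        a_wins = sum(r.index(a) < r.index(b) for r in rankings)
        if 2 * a_wins > len(rankings):
            points[a] += 1
        elif 2 * a_wins < len(rankings):
            points[b] += 1
    return max(points, key=points.get)

rankings = [["A", "B", "C"], ["B", "C", "A"], ["A", "C", "B"]]
print(condorcet_count(rankings, ["A", "B", "C"]))  # "A" wins both of its pairwise elections
```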
This method, known as Selfridge's Pandemonium [9], is one of the very first examples of using separate experts/agents in computer science. It is very simple, but it offers a voter no way to express differences of preference between candidates: only the voter's top choice and its confidence are known. Furthermore, there is neither a limit on the confidence value nor a scale for voters to adhere to. While limits are easily added to the method, a correct scale is still difficult to implement. With well-scaled classifiers, however, this method can be sufficient.

Sum rule: Under the sum rule, each voter gives a confidence value for each candidate. The confidence values are then summed per candidate, and the candidate with the highest sum wins the election.

Product rule: As with the sum rule, each voter gives a confidence value for each candidate. The confidence values are then multiplied per candidate, and the candidate with the highest confidence product wins. The product rule is highly sensitive to low confidence values: a single very low value can ruin a candidate's chances of winning, no matter what its other confidence values are.

2.4 Ranked voting methods

In ranked voting methods, the voters are asked for a preference ranking of the candidates. This way, more information on the voter's preference is used than in the unweighted voting methods. On the other hand, the degree of preference between two classes is conveyed only in fixed steps (the ranks) rather than in the confidence values of the confidence voting methods. This constitutes a loss of information, but these methods are easier to use (no problems in scaling the voters' confidences) and they prevent over-confidence in voters (see also [6]). Ranked voting methods are useful in classifier combination when the classifiers can give some kind of confidence value that is hard to scale correctly.
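The sum and product rules of Section 2.3 can be sketched as follows; the example also illustrates the product rule's sensitivity to a single low value (the normalized per-class confidences are an illustrative assumption):

```python
import math

def sum_rule(scores):
    # scores: one dict of candidate -> confidence value per voter.
    candidates = scores[0].keys()
    return max(candidates, key=lambda c: sum(s[c] for s in scores))

def product_rule(scores):
    candidates = scores[0].keys()
    return max(candidates, key=lambda c: math.prod(s[c] for s in scores))

scores = [{"A": 0.9, "B": 0.6},
          {"A": 0.9, "B": 0.6},
          {"A": 0.05, "B": 0.6}]  # one very low confidence for A
print(sum_rule(scores))      # "A": 1.85 beats 1.80
print(product_rule(scores))  # "B": A's single 0.05 ruins its product
```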
Borda count: This method, developed by Jean-Charles de Borda [2], needs a complete preference ranking over all candidates from every voter. It computes the mean rank of each candidate over all voters; the candidates are then re-ranked by their mean rank, and the top-ranked candidate wins the election. Note that the Borda count is the ranked variant of the sum rule.

Single transferable vote (STV): Also known as alternative voting (in single-winner situations). Each voter gives a preference ranking of the candidates; incomplete rankings are allowed, though they may result in a voter losing his or her vote altogether. A majority vote is held on the highest-ranked candidate of each voter's preference ranking. If some candidate gains the majority, it wins the election; otherwise, the candidate with the fewest votes in that majority election is eliminated from further participation and removed from all preference rankings. The process then repeats, starting with the majority vote, until one candidate gains a majority. One low rank in an STV election is less disruptive to a candidate's chances of winning than in the Borda count. However, due to the elimination procedure, complex and illogical side effects may occur (e.g. voting for a candidate

may result in its loss of the election).

3 Method

3.1 Bagging

To test the effect of the different voting methods when combining classifier outputs, the technique of bagging was used [3]. Bagging is a simple method to increase the recognition performance of a classification technique that depends on training the classifier. Bagging consists of the following steps:

1. Create new training sets by randomly sampling, with replacement, from the original training set. Between 10 and 20 training sets is sufficient [3]. The number of samples in each new training set is normally equal to the number of samples in the original training set; note that the number of distinct samples is probably smaller, as duplicates are possible (and even very likely).

2. For each new training set, train a classifier using the same technique. The bagged classifier is now complete.

3. For classification, each sample is classified by all classifiers.

4. When the classifiers return their results for a sample, these results are combined using a plurality vote.

We use bagging for two reasons. First, it creates a situation in which a large number of classifiers are used, and these classifiers are easily constructed; creating a sufficiently large number of classifiers based on different techniques would have been a project in itself. Second, bagging is a relatively new, successful technique that uses a voting mechanism, yet this mechanism is not the core of the technique; it is more an afterthought, used just to combine the results. By testing the voting mechanisms with bagging, we immediately gain a useful application for the results of the experiment.

3.2 Procedure

In the experiment, two datasets from the UNIPEN release train r01 v07 were used. UNIPEN is a large database of online handwriting data made available by the International Unipen Foundation. The first dataset contained 10636 samples of handwritten digits; the second contained 19552 samples of handwritten capitals.
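The four bagging steps above can be sketched as follows (the toy one-nearest-neighbour "fit" in the usage lines is an illustrative assumption; any trainable classifier would do):

```python
import random
from collections import Counter

def bagging_train(train_set, n_classifiers, fit):
    # Steps 1 and 2: train each classifier on a bootstrap sample, drawn with
    # replacement and of the same size as the original training set.
    models = []
    for _ in range(n_classifiers):
        bootstrap = [random.choice(train_set) for _ in train_set]
        models.append(fit(bootstrap))
    return models

def bagging_classify(models, sample):
    # Steps 3 and 4: classify with every model, then take a plurality vote.
    votes = [model(sample) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Usage with a toy 1-nearest-neighbour "fit" on (feature, label) pairs:
fit = lambda data: (lambda x: min(data, key=lambda p: abs(p[0] - x))[1])
models = bagging_train([(0.0, "a"), (0.1, "a"), (1.0, "b"), (1.1, "b")], 15, fit)
print(bagging_classify(models, 0.05))
```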
All samples are given in angular velocity features [10]. In these feature sets, the original sample is spatially normalized and described by 30 sets of x-y-z coordinates. Next, for each consecutive pair of x-y-z coordinates, the running angle is added to the feature vector as a sine/cosine pair. Finally, for each pair of running angles, the angular difference (again a sine/cosine pair) brings the total to 204 features. Both datasets were divided equally into a training set and a test set.

The classifiers used in the bagging procedure are multi-layered perceptrons (MLPs). For both datasets, two runs of the experiment were performed, each with a different MLP structure: a large MLP in the first run and a small MLP in the second, each with 204 input units and one output unit per class (10 for the digits, 26 for the capitals). This setup was chosen to test the voting methods in good (the large MLPs) and bad (the small MLPs) situations. The experiment procedure was as follows for each dataset:

1. 17 MLP classifiers are created by bagging. This number was chosen well over 10 to make certain that any late bagging effects are captured.

2. For each possible ensemble size n (1 <= n <= 17), randomly draw an ensemble of n classifiers from the available 17 classifiers. Repeat this 100 times, so that there are 100 such ensembles for each size n.

3. For each ensemble of classifiers drawn (1435 in all), classify all test samples.

4. For every ensemble size n, record the mean recognition performance over all ensembles of that size.

4 Results

The results of the experiment are displayed in Figure 1 (digit sample set) and Figure 2 (capital sample set), which show the recognition performance of the combined classifiers. The performance is given as the percentage of correctly classified samples (with the y-axis starting at 50%), and the x-axis denotes the number of classifiers in the combination.
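Both totals quoted in Section 3.2 can be verified with a little arithmetic (the sine/cosine encoding of the angles is inferred here from the 204-feature total):

```python
from math import comb

# 30 coordinate triplets, then sine/cosine pairs for the 29 running angles
# and for the 28 angular differences:
n_features = 30 * 3 + 29 * 2 + 28 * 2
print(n_features)  # 204

# 100 random ensembles per size n, capped by the number of distinct
# ensembles of 17 classifiers (sizes 1, 16 and 17 have fewer than 100):
n_ensembles = sum(min(100, comb(17, n)) for n in range(1, 18))
print(n_ensembles)  # 1435
```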
The results shown for each combination size are the average over all 100 random combinations of classifiers (a classifier can appear only once in an ensemble, so for the combination sizes 1, 16 and 17, all possible ensembles are used instead of 100). Only the results of the small-MLP combinations on both sample sets are shown, as the large-MLP combinations did not show any appreciable differences in performance between the voting methods.

[Figure 1. Recognition rate (in %) of the voting methods tested on the digit sample set with the small MLPs. Curves shown over ensemble sizes 1 to 17: sum rule, amendment, Borda count, Condorcet count, majority, product rule, Pandemonium, plurality, run-off and STV.]

The serrated edge of some performance curves (e.g. that of the majority vote) is due to the difference between an odd and an even number of classifiers: an even number of classifiers is more likely to produce a tied result when two classes vie for the top spot.

[Figure 2. Recognition rate (in %) of the voting methods tested on the capital sample set with the small MLPs.]

In Figure 3, the recognition rates for ensemble sizes 16 and 17 are shown. For each voting method, the standard deviation over one million samples per ensemble size was calculated; this yielded standard deviations of less than 0.05 in all cases. With a desired confidence level of 0.01, the confidence limits of the recognition rates are therefore less than 0.12 (two-tailed), so the performance rates of the different voting methods differ very significantly.

Voting method  | Size 16 (digits) | Size 17 (digits) | Size 16 (capitals) | Size 17 (capitals)
Borda count    | 89.73 | 89.83 | 80.81 | 80.98
Sum rule       | 89.60 | 89.60 | 79.76 | 79.90
Product rule   | 88.81 | 88.91 | 81.21 | 81.37
Condorcet      | 88.08 | 88.81 | 79.44 | 79.43
Amendment      | 87.02 | 88.74 | 78.86 | 79.30
Run-off        | 86.89 | 88.74 | 77.67 | 77.70
Plurality      | 86.40 | 85.92 | 77.26 | 77.44
STV            | 84.17 | 84.51 | 69.59 | 71.24
Pandemonium    | 83.73 | 83.64 | 74.20 | 74.14
Majority       | 78.35 | 79.84 | 60.69 | 62.06

Figure 3. Recognition rates (in %) for ensemble sizes 16 and 17, for both sample sets and the small MLPs. Note that plurality voting is 7th in all cases.

5 Discussion

The main goal of this experiment was to determine which classifier combination method performs best on bagged classifiers. On the small-MLP combinations, three voting methods outperformed the others. The product and sum rules, both confidence voting methods, performed especially well on the smaller ensemble sizes, indicating that the confidence rules benefit fastest from additional classifiers (Pandemonium shows this as well). Their success can be attributed to the experimental setup: the main problem of the confidence methods, i.e. the requirement of a well-scaled and limited confidence output, is automatically solved here. An MLP uses a threshold function with a limited range, and because bagging uses the same basic classifier throughout, the results of the classifiers in a combination are readily comparable.

The third method that performs very well on the small-MLP combinations is the Borda count. While the Borda count does require a complete ranking, it is less demanding than the confidence methods, which makes it an ideal alternative for classifier combination when the scales of the classifiers' confidence values differ or are not applicable. Interestingly, the Borda count performs better on the larger ensemble sizes, thus forming a nice complement to the product and sum rules.

Also striking is the difference between the combinations of the large and the small MLPs. The large MLPs, with their better individual performance, do not show any noticeable difference between the combination methods. This can be attributed to the hard samples: the data concern unconstrained handwriting, which allows for samples that are very hard to recognize, even for humans. Such samples will most probably be misclassified by most MLPs, and no amount of combining can correct those results.

Finally, note that the majority-vote performance is low compared with all the other voting methods. What is not shown here is that the majority vote also rejects a large number of samples as indecisive (no majority candidate). As a result, the actual number of errors that the majority vote makes is much lower than the recognition performance suggests. The difference is especially strong in the small-MLP combinations, where the majority vote makes between 50% and 75% fewer errors than any other method.

6 Conclusion

In this paper we have presented an overview of ten voting methods in the context of combining classifiers. We tested these voting methods on combinations of ensembles of classifiers trained by bagging.
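The Borda count, the best-performing method in Figure 3, can be sketched as follows (the preference lists are an illustrative assumption; ranks are summed here, which orders candidates identically to ordering by mean rank):

```python
def borda(rankings):
    # Sum each candidate's rank position over all voters (0 = top rank);
    # the candidate with the lowest total (equivalently, lowest mean) wins.
    candidates = rankings[0]
    totals = {c: sum(r.index(c) for r in rankings) for c in candidates}
    return min(totals, key=totals.get)

rankings = [["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]
print(borda(rankings))  # "A": rank total 1 beats B (3) and C (5)
```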
The best voting methods for classifier combination are the product rule, the sum rule and the Borda count. The product and sum rules performed best on the smaller ensembles,

while the Borda count gave good recognition results on the larger ensembles. Of special interest is the performance of the widely used plurality voting method: six of the proposed alternative voting methods outperformed plurality voting in several of the experiments, while performing comparably in the others (a conclusion that confirms the outcome of [7]). Therefore, where applicable, it is preferable to use the product rule, the sum rule or the Borda count instead of plurality voting.

References

[1] E. Alpaydin, C. Kaynak, and F. Alimoglu. Cascading multiple classifiers and representations for optical and pen-based handwritten digit recognition. In IWFHR VII, pages 453-462, 2000.
[2] J.-C. de Borda. Memoire sur les elections au scrutin. Histoire de l'Academie Royale des Sciences, Paris, 1781.
[3] L. Breiman. Bagging predictors. Technical Report 421, Department of Statistics, University of California, Berkeley, California 94720, USA, September 1994.
[4] T. G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli, editors, Multiple Classifier Systems, pages 1-15, 2000.
[5] G. Giacinto and F. Roli. Dynamic classifier selection. In Multiple Classifier Systems, pages 177-189, 2000.
[6] T. K. Ho, J. J. Hull, and S. N. Srihari. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66-75, 1994.
[7] J. Kittler and F. Alkoot. Relationship of sum and vote fusion strategies. In J. Kittler and F. Roli, editors, Multiple Classifier Systems, pages 339-348. Springer, 2001.
[8] L. I. Kuncheva. A theoretical study on six classifier fusion strategies. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):281-286, 2002.
[9] O. Selfridge. Pandemonium: a paradigm for learning. In Mechanisation of Thought Processes: Proceedings of a Symposium Held at the National Physical Laboratory, pages 513-526, London, November 1958. HMSO.
[10] L. Vuurpijl and L. Schomaker. Two-stage character classification: a combined approach of clustering and support vector classifiers. In IWFHR VII, pages 423-432, 2000.
[11] L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics, 22(3):418-435, May/June 1992.