<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JILSA</journal-id><journal-title-group><journal-title>Journal of Intelligent Learning Systems and Applications</journal-title></journal-title-group><issn pub-type="epub">2150-8402</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jilsa.2015.74010</article-id><article-id pub-id-type="publisher-id">JILSA-60996</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  A KNN Undersampling Approach for Data Balancing
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Marcelo</surname><given-names>Beckmann</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Nelson</surname><given-names>F. F. Ebecken</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Beatriz</surname><given-names>S. L. Pires de Lima</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>Civil Engineering Program/COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil</addr-line></aff><author-notes><corresp id="cor1">* E-mail: <email>beckmann.marcelo@gmail.com</email>;</corresp></author-notes><pub-date pub-type="epub"><day>29</day><month>09</month><year>2015</year></pub-date><volume>07</volume><issue>04</issue><fpage>104</fpage><lpage>116</lpage><history><date date-type="received"><day>30</day> <month>April</month> <year>2015</year></date><date date-type="rev-recd"><day>accepted</day> <month>8</month> <year>November</year> </date><date date-type="accepted"><day>11</day> <month>November</month> <year>2015</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  In supervised learning, an imbalanced number of instances among the classes in a dataset can cause algorithms to classify an instance from the minority class as one from the majority class. To address this problem, the KNN algorithm provides a basis for other balancing methods. These balancing methods are revisited in this work, and a new and simple KNN undersampling approach is proposed. The experiments demonstrated that the KNN undersampling method outperformed other sampling methods. The proposed method also outperformed the results of other studies, which indicates that the simplicity of KNN can be used as a base for efficient algorithms in machine learning and knowledge discovery.
 
</p></abstract><kwd-group><kwd>Machine Learning</kwd><kwd>Class Overlapping</kwd><kwd>Imbalanced Datasets</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>In supervised learning, one of the main problems in classification lies in the treatment of datasets where one or more classes have comparatively few instances. This condition denotes an imbalanced dataset, which leads the algorithm to incorrectly classify an instance from the minority class as belonging to the majority class. In highly skewed datasets, this is also known as the “needle in the haystack” problem [<xref ref-type="bibr" rid="scirp.60996-ref1">1</xref>], because the large number of instances from one class overwhelms one or more minority classes. Nevertheless, in most cases the minority class represents an abnormal event in a dataset, and this is usually the most interesting and valuable information to be discovered.</p><p>Learning from imbalanced datasets is still considered an open problem in data mining and knowledge discovery, and needs real attention from the scientific community [<xref ref-type="bibr" rid="scirp.60996-ref2">2</xref>]. The experiments performed in [<xref ref-type="bibr" rid="scirp.60996-ref3">3</xref>] demonstrated that class overlapping is commonly associated with the class imbalance problem. An empirical study was performed by [<xref ref-type="bibr" rid="scirp.60996-ref1">1</xref>] in order to identify why classifiers perform worse in the presence of class imbalance. In [<xref ref-type="bibr" rid="scirp.60996-ref4">4</xref>] the authors built a taxonomy of methods applied to correct or mitigate this problem, and found three main approaches: data adjusting, cost-sensitive learning, and algorithm adjusting. 
In data adjusting, there are two main sub-approaches: creation of instances from the minority class (oversampling), and removal of instances from the majority class (undersampling).</p><p>This work focuses on data adjusting algorithms and proposes a KNN undersampling (KNN-Und) algorithm. KNN-Und is a very simple algorithm that uses the neighbor count to remove instances from the majority class. Despite its simplicity, the classification experiments performed with KNN-Und balancing resulted in better G-Mean [<xref ref-type="bibr" rid="scirp.60996-ref5">5</xref>] and AUC [<xref ref-type="bibr" rid="scirp.60996-ref6">6</xref>] in most of the 33 datasets when compared with three methods based on KNN, namely SMOTE [<xref ref-type="bibr" rid="scirp.60996-ref7">7</xref>], ENN [<xref ref-type="bibr" rid="scirp.60996-ref8">8</xref>], and NCL [<xref ref-type="bibr" rid="scirp.60996-ref9">9</xref>], and with the random undersampling method. KNN-Und balancing also performed better than the results published previously by [<xref ref-type="bibr" rid="scirp.60996-ref10">10</xref>] in 11 of 15 datasets, and had higher average values of G-Mean and AUC than the evolutionary algorithm proposed by [<xref ref-type="bibr" rid="scirp.60996-ref11">11</xref>]. The results obtained in the experiments show that KNN-Und and other balancing methods based on KNN are an interesting approach to the imbalanced dataset problem. Especially when datasets approach petabytes in size [<xref ref-type="bibr" rid="scirp.60996-ref12">12</xref>], the targeted removal of majority instances can be a better solution than generating new synthetic data, as oversampling methods do.</p><p>This paper is organized as follows. Section 2 presents a literature review of KNN balancing methods, and Section 3 explains the KNN-Und methodology in more detail. 
Section 4 presents, compares, and discusses the experiments conducted in this work, followed by the conclusions.</p><sec id="s1_1"><title>Imbalanced Dataset Definition</title><p>This section establishes some notation that will be used in this work.</p><p>Given the training set T with m examples and n attributes, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x6.png" xlink:type="simple"/></inline-formula>, and where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x7.png" xlink:type="simple"/></inline-formula> is an instance in the set of attributes <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x8.png" xlink:type="simple"/></inline-formula>, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x9.png" xlink:type="simple"/></inline-formula> is an instance in the set of classes <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x10.png" xlink:type="simple"/></inline-formula>, there is a subset of positive instances <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x11.png" xlink:type="simple"/></inline-formula>, and a subset of negative instances <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x12.png" xlink:type="simple"/></inline-formula>, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x13.png" xlink:type="simple"/></inline-formula>. Every subset of P created by sampling methods will be denoted S. The pre-processing strategies applied to datasets aim to balance the training set T, such that <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x14.png" xlink:type="simple"/></inline-formula>.</p></sec></sec><sec id="s2"><title>2. Literature Review</title><p>Over the years, the scientific community has made a great effort to solve or mitigate the imbalanced dataset problem. 
Several balancing methods are based specifically on the KNN algorithm. This section reviews KNN and its derived algorithms for dataset balancing. The random oversampling and undersampling methods, the class overlapping problem, and evaluation measures are also reviewed.</p><sec id="s2_1"><title>2.1. KNN Classifier</title><p>The k Nearest Neighbor (KNN) is a supervised classification algorithm and, despite its simplicity, it is considered one of the top 10 data mining algorithms [<xref ref-type="bibr" rid="scirp.60996-ref13">13</xref>]. It creates a decision surface that adapts to the shape of the data distribution, enabling it to obtain good accuracy rates when the training set is large or representative. KNN was initially introduced by [<xref ref-type="bibr" rid="scirp.60996-ref14">14</xref>], and it was developed out of the need to perform discriminant analysis when reliable parametric estimates of probability densities are unknown or difficult to determine.</p><p>KNN is a nonparametric lazy learning algorithm. It is nonparametric because it does not make any assumptions about the underlying data distribution. Most practical data in the real world do not obey the typical theoretical assumptions (for example, Gaussian mixtures or linear separability). Nonparametric algorithms like KNN are more suitable in these cases [<xref ref-type="bibr" rid="scirp.60996-ref15">15</xref>] [<xref ref-type="bibr" rid="scirp.60996-ref16">16</xref>].</p><p>It is also considered a lazy algorithm. A lazy algorithm works with a nonexistent or minimal training phase but a costly testing phase. For KNN this means the training phase is fast, but all the training data is needed during the testing phase, or at least a subset with the most representative data must be present. 
This contrasts with other techniques like SVM, where all non-support vectors can be discarded.</p><p>The classification algorithm is performed according to the following steps:</p><p>1. Calculate the distance (usually Euclidean) between an instance x<sub>i</sub> and all instances of the training set T;</p><p>2. Select the k nearest neighbors;</p><p>3. The instance x<sub>i</sub> is classified (labeled) with the most frequent class among the k nearest neighbors. It is also possible to use the neighbors' distance to weight the classification decision.</p><p>The value of k is training-data dependent. A small value of k means that noise will have a higher influence on the result. A large value makes it computationally expensive and defeats the basic philosophy behind KNN: points that are close might have similar densities or classes. Odd values of k are typical in the literature, normally k = 5 or k = 7, and [<xref ref-type="bibr" rid="scirp.60996-ref15">15</xref>] reports that k = 3 achieves a performance very close to the Bayesian classifier in large datasets. An approach to determine k as a function (1) of data size m is proposed in [<xref ref-type="bibr" rid="scirp.60996-ref16">16</xref>].</p><disp-formula id="scirp.60996-formula86"><label>(1)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601307x15.png"  xlink:type="simple"/></disp-formula><p>The algorithm may use distance metrics other than Euclidean [<xref ref-type="bibr" rid="scirp.60996-ref17">17</xref>] [<xref ref-type="bibr" rid="scirp.60996-ref18">18</xref>].</p></sec><sec id="s2_2"><title>2.2. 
SMOTE</title><p>The SMOTE algorithm proposed by [<xref ref-type="bibr" rid="scirp.60996-ref7">7</xref>] is one of the best-known oversampling techniques. It has been successful in several areas of application and is also a base for other oversampling algorithms [<xref ref-type="bibr" rid="scirp.60996-ref9">9</xref>] [<xref ref-type="bibr" rid="scirp.60996-ref19">19</xref>] - [<xref ref-type="bibr" rid="scirp.60996-ref22">22</xref>].</p><p>SMOTE balances a set P of minority instances by creating n synthetic instances from each instance <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x16.png" xlink:type="simple"/></inline-formula> of the set P. Each synthetic instance is created based on a minority instance and its nearest neighbors. One synthetic instance is generated from the instances <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x17.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x18.png" xlink:type="simple"/></inline-formula>, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x19.png" xlink:type="simple"/></inline-formula> is an instance randomly selected among the k nearest neighbors (KNN) of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x20.png" xlink:type="simple"/></inline-formula>, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x21.png" xlink:type="simple"/></inline-formula> is a random number between 0 and 1, according to Equation (2). 
The process is repeated n times for each instance <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x22.png" xlink:type="simple"/></inline-formula> from the set P, where n = b/100, and b is a parameter that defines the percentage of oversampling required to balance the dataset.</p><disp-formula id="scirp.60996-formula87"><label>(2)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601307x23.png"  xlink:type="simple"/></disp-formula><p><xref ref-type="fig" rid="fig1">Figure 1</xref> shows the SMOTE process with k = 5. In (a), there is an imbalanced dataset, where (−) denotes majority instances, also known as negative instances, and (+) denotes minority instances, also known as positive instances. In (b), KNN selects the 5 nearest neighbors of a minority instance <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x24.png" xlink:type="simple"/></inline-formula>. In (c), one of the five nearest neighbors, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x25.png" xlink:type="simple"/></inline-formula>, is randomly selected. 
In (d), a new synthetic instance is generated with random attributes between <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x26.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x27.png" xlink:type="simple"/></inline-formula>. The process is repeated for every minority instance (+) from the subset P.</p><fig-group id="fig1"><label><xref ref-type="fig" rid="fig1">Figure 1</xref></label><caption><title> SMOTE process for k = 5. (a) An imbalanced dataset, with negative (−) and positive (+) instances. 
An instance x<sub>i</sub> is selected; (b) the k = 5 nearest instances (neighbors) of x<sub>i</sub> are selected; (c) one of the k = 5 neighbors, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x30.png" xlink:type="simple"/></inline-formula>, is randomly selected; (d) a new synthetic instance is created with random values of v1 and v2 between x<sub>i</sub> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x31.png" xlink:type="simple"/></inline-formula></title></caption><fig id ="fig1_1"><label> (b)</label><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601307x28.png"/></fig><fig id ="fig1_2"><label>(c)</label><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601307x29.png"/></fig></fig-group><p>An extensive comparison of several oversampling and undersampling methods was performed in [<xref ref-type="bibr" rid="scirp.60996-ref23">23</xref>]. The authors concluded that SMOTE combined with the Tomek Links [<xref ref-type="bibr" rid="scirp.60996-ref24">24</xref>] and ENN [<xref ref-type="bibr" rid="scirp.60996-ref8">8</xref>] methods presented better performance in 50% of the experiments. Nevertheless, the SMOTE algorithm alone had better performance in 16% of the cases, and in most cases it presented similar results in terms of AUC when compared with the combined methods.</p><p>According to the experiments conducted in [<xref ref-type="bibr" rid="scirp.60996-ref25">25</xref>], one of the weaknesses of SMOTE lies in the fact that all the positive instances act as a base for synthetic instance generation. 
The authors argue that this strategy does not take into account that a homogeneous distribution of synthetic instances is not always applicable to an imbalance problem, since such a practice could cause overfitting and class overlapping. Another weakness reported is the variance in results, caused by the random components present at some points of the algorithm.</p></sec><sec id="s2_3"><title>2.3. Edited Nearest Neighbor Rule (ENN)</title><p>The ENN method proposed by [<xref ref-type="bibr" rid="scirp.60996-ref8">8</xref>] removes the instances of the majority class whose KNN prediction differs from the majority class. So, if an instance <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x32.png" xlink:type="simple"/></inline-formula> has more neighbors of a different class, this instance <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x33.png" xlink:type="simple"/></inline-formula> will be removed. The ENN works according to the steps below:</p><p>1. Obtain the k nearest neighbors of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x34.png" xlink:type="simple"/></inline-formula>, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x35.png" xlink:type="simple"/></inline-formula>;</p><p>2. <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x36.png" xlink:type="simple"/></inline-formula> will be removed if the number of neighbors from another class is predominant;</p><p>3. 
The process is repeated for every majority instance of the subset N.</p><p>According to the experiments conducted in [<xref ref-type="bibr" rid="scirp.60996-ref26">26</xref>], the ENN method removes both noisy and borderline examples, providing a smoother decision surface.</p></sec><sec id="s2_4"><title>2.4. Neighbor Cleaning Rule (NCL)</title><p>The Neighbor Cleaning Rule (NCL) proposed by [<xref ref-type="bibr" rid="scirp.60996-ref9">9</xref>] improves the ENN method for two-class problems in the following way: for each example <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x37.png" xlink:type="simple"/></inline-formula>, find its k = 3 nearest neighbors. If <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x38.png" xlink:type="simple"/></inline-formula> belongs to the majority class and there is a prediction error related to its nearest neighbors, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x39.png" xlink:type="simple"/></inline-formula> will be removed. 
If <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x40.png" xlink:type="simple"/></inline-formula> belongs to the minority class and there is a prediction error related to its nearest neighbors, the nearest neighbors belonging to the majority class will be removed.</p></sec><sec id="s2_5"><title>2.5. Random Sampling</title><p>This is one of the simplest strategies for dataset adjusting, and basically consists of the random removal (undersampling) or addition (oversampling) of instances. For oversampling, instances of the positive set P are randomly selected, duplicated and added to the set T. For undersampling, instances from the negative set N are randomly selected for removal.</p><p>Although both strategies operate similarly and bring some benefit over simply classifying without any preprocessing [<xref ref-type="bibr" rid="scirp.60996-ref1">1</xref>] [<xref ref-type="bibr" rid="scirp.60996-ref9">9</xref>], they also introduce problems in learning. For instance removal, there is the risk of removing important concepts related to the negative class. In the case of adding positive instances, the risk is overfitting, i.e., a classifier can construct rules that are apparently precise, but in fact cover only a replicated example.</p></sec><sec id="s2_6"><title>2.6. 
Class Overlapping Problem</title><p>According to the experiments of [<xref ref-type="bibr" rid="scirp.60996-ref3">3</xref>], the low classification performance on imbalanced datasets is not associated only with the class distribution, but is also related to class overlapping. The authors concluded that, in highly skewed datasets, the “needle in the haystack” problem normally comes together with a class overlapping problem.</p></sec><sec id="s2_7"><title>2.7. Evaluation Measures</title><p>In supervised learning, it is necessary to use some measure to evaluate the results obtained with a classification algorithm. The confusion matrix in <xref ref-type="table" rid="table1">Table 1</xref>, also known as a contingency table, is frequently applied for such purposes, providing not only the count of errors and hits, but also the variables needed to calculate other measures.</p><p>The confusion matrix is able to represent either two-class or multiclass problems. Nevertheless, the research and literature related to imbalanced datasets are concentrated on two-class problems, also known as binary or binomial</p>
<table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> Confusion matrix</title></caption><table><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >Positive prediction</th><th align="center" valign="middle" >Negative prediction</th></tr></thead><tbody><tr><td align="center" valign="middle" >Positive class</td><td align="center" valign="middle" >True Positive (TP)</td><td align="center" valign="middle" >False Negative (FN)</td></tr><tr><td align="center" valign="middle" >Negative class</td><td align="center" valign="middle" >False Positive (FP)</td><td align="center" valign="middle" >True Negative (TN)</td></tr></tbody></table></table-wrap>
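<p>The four counts in <xref ref-type="table" rid="table1">Table 1</xref> are sufficient to derive the measures discussed in this section. The following Python sketch is illustrative only and is not part of the original experiments: the function name confusion_measures and the example counts are hypothetical, and the formulas assume the standard textbook definitions of error rate, accuracy, precision, recall, and specificity, with G-Mean taken as the geometric mean of recall and specificity.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
import math


def confusion_measures(tp, fn, fp, tn):
    """Derive evaluation measures from the four confusion-matrix counts.

    Assumes the standard definitions; a ratio with a zero denominator
    is reported as 0.0 instead of raising an error.
    """
    total = tp + fn + fp + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    def ratio(num, den):
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)        # sensitivity, true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    return {
        "error_rate": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "precision": ratio(tp, tp + fp),
        "recall": recall,
        "specificity": specificity,
        "g_mean": math.sqrt(recall * specificity),
    }


# Hypothetical dataset: 10 positive and 90 negative instances.
# A classifier that labels every instance as negative:
trivial = confusion_measures(tp=0, fn=10, fp=0, tn=90)
print(trivial["accuracy"], trivial["g_mean"])
```
</preformat>
<p>Under these assumed counts, a classifier that labels every instance as negative still reaches an accuracy of 0.90 while its G-Mean is 0, which illustrates why accuracy alone can be misleading on imbalanced data.</p>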
<p>problems, in which the less frequent class is named positive, and the remaining classes are merged and named negative.</p><p>Some of the best-known measures derived from this matrix are the error rate (3) and the accuracy (4). Nevertheless, such measures are not appropriate for evaluating imbalanced datasets, because they do not take into account the number of examples distributed among the classes. On the other hand, there are measures that compensate for this disproportion in their calculation. The Precision (5), Recall (7) and F-Measure [<xref ref-type="bibr" rid="scirp.60996-ref27">27</xref>] are appropriate when the positive class is the main concern. The G-Mean, ROC and AUC are appropriate when the performance on both classes (positive and negative) is important.</p><disp-formula id="scirp.60996-formula88"><label>(3)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601307x41.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.60996-formula89"><label>(4)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601307x42.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.60996-formula90"><label>(5)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601307x43.png"  xlink:type="simple"/></disp-formula><p>In this work, both classes are considered of equal importance; therefore, the G-Mean and AUC measures will be used to evaluate the experiments.</p><p>The G-Mean [<xref ref-type="bibr" rid="scirp.60996-ref5">5</xref>] verifies the performance on both classes, taking into account the distribution between them, by computing the geometric mean of the true positive rate and the true negative rate (6).</p><disp-formula id="scirp.60996-formula91"><label>(6)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601307x44.png"  xlink:type="simple"/></disp-formula><p>The Receiver Operating Characteristics (ROC) chart, also called the ROC curve, has been 
applied in detection and signal analysis since the Second World War, and more recently in data mining and classification. It consists of a two-dimensional chart, where the y-axis refers to Sensitivity or Recall (7), and the x-axis is calculated as 1 − Specificity (8). According to [<xref ref-type="bibr" rid="scirp.60996-ref6">6</xref>], there are several points in this chart that deserve attention. By analyzing this chart it is possible not only to identify the classifier performance, but also to deduce classifier behaviors such as conservative, aggressive, or random.</p><disp-formula id="scirp.60996-formula92"><label>(7)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601307x45.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.60996-formula93"><label>(8)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601307x46.png"  xlink:type="simple"/></disp-formula><p>The AUC measure (9) summarizes as a single scalar the information represented by a ROC chart, and is insensitive to class imbalance problems.</p><disp-formula id="scirp.60996-formula94"><label>(9)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601307x47.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x48.png" xlink:type="simple"/></inline-formula> is the normal cumulative distribution, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x49.png" xlink:type="simple"/></inline-formula> is the Euclidean distance between the centroids of the two classes, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x50.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x51.png" xlink:type="simple"/></inline-formula> are the standard deviations of the positive and negative classes. An algorithm to calculate AUC is also provided in [<xref ref-type="bibr" rid="scirp.60996-ref6">6</xref>].</p></sec></sec><sec id="s3"><title>3. Proposed Method: KNN-Und</title><p>The KNN-Und method removes instances from the majority classes based on their k nearest neighbors, according to the steps below:</p><p>1. Obtain the k nearest neighbors of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x52.png" xlink:type="simple"/></inline-formula>;</p><p>2. <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x53.png" xlink:type="simple"/></inline-formula> will be removed if the count of its neighbors belonging to the minority subset P is greater than or equal to t;</p><p>3. The process is repeated for every majority instance of the subset N.</p><p>The parameter t defines the minimum count of neighbors around <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x54.png" xlink:type="simple"/></inline-formula> belonging to the P (minority) subset. 
If this count is greater than or equal to t, the instance <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x55.png" xlink:type="simple"/></inline-formula> will be removed from the training set T. The valid values of t are <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x56.png" xlink:type="simple"/></inline-formula>, and the lower t is, the more aggressive the undersampling. This algorithm can also be used in multiclass problems, since the negative subset N can contain instances from several majority classes. The KNN-Und algorithm was developed as a preprocessing plug-in for the Weka platform [<xref ref-type="bibr" rid="scirp.60996-ref28">28</xref>].</p><p>Compared with ENN, KNN-Und has a more aggressive behavior in terms of instance removal, because KNN-Und does not depend on a wrong KNN prediction to remove an instance <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601307x57.png" xlink:type="simple"/></inline-formula>. KNN-Und only acts in the class overlapping areas, because an instance from the majority class will only be removed if at least t instances from other classes are present in its neighborhood. When an instance of the majority class is not surrounded by t instances of other classes, that instance will not be removed. This situation only occurs in non-overlapping areas. Despite this aggressive behavior, t = 1 was kept in most of our experiments, because KNN-Und only acts in overlapping areas. The non-overlapping areas, which are far from the decision surface, are left untouched. 
This explains why KNN-Und can also be used to address the class-overlapping problem, which is commonly associated with imbalanced datasets [<xref ref-type="bibr" rid="scirp.60996-ref3">3</xref>]. Nevertheless, in highly skewed problems KNN-Und alone is not effective at balancing the dataset. In these cases, combining KNN-Und with another sampling method could improve the results.</p><p>KNN-Und is a very simple algorithm and has the advantage of being deterministic: unlike other methods, it has no random component. In the literature review, only one study [<xref ref-type="bibr" rid="scirp.60996-ref29">29</xref>] mentioned a similar methodology and application, but the results published there suggest that this alternative was not fully exploited at the time.</p></sec><sec id="s4"><title>4. Experiments</title><p>This section describes the experiments conducted to validate the applicability of KNN-Und. <xref ref-type="table" rid="table2">Table 2</xref> lists the 33 datasets prepared for the experiments, ordered by imbalance rate (IR) [<xref ref-type="bibr" rid="scirp.60996-ref30">30</xref>], which is the ratio between the numbers of negative and positive instances. All 33 datasets originate from the UCI machine learning repository [<xref ref-type="bibr" rid="scirp.60996-ref31">31</xref>]. For the datasets with more than two classes, one or more classes with fewer examples were selected as the positive class, and the remainder were collapsed into the negative class. The datasets were balanced with KNN-Und and submitted to a classifier, then the evaluation measures were compared with three methods based on KNN: SMOTE [<xref ref-type="bibr" rid="scirp.60996-ref7">7</xref>], ENN [<xref ref-type="bibr" rid="scirp.60996-ref8">8</xref>], NCL [<xref ref-type="bibr" rid="scirp.60996-ref9">9</xref>], and with the random undersampling method. 
The performance of KNN-Und was also compared with the published results of two other studies [<xref ref-type="bibr" rid="scirp.60996-ref10">10</xref>] [<xref ref-type="bibr" rid="scirp.60996-ref11">11</xref>]. All the algorithms tested in this work were implemented in the Weka platform [<xref ref-type="bibr" rid="scirp.60996-ref28">28</xref>].</p><p>In all datasets and algorithms that use KNN, the parameter k was determined according to (1). The parameter t was adjusted to control the undersampling level of the KNN-Und method, and in most datasets with IR &lt; 2 this parameter was set to t &gt; 1 to avoid excessive undersampling. <xref ref-type="table" rid="table2">Table 2</xref> shows the values of k and t for each dataset, together with the undersampling effect of KNN-Und on the negative (majority) class and the resulting IR. The parameters for the SMOTE and Random Undersampling methods were adjusted to obtain a balanced class distribution. The ENN and NCL methods do not have a balancing control parameter.</p><p>The classification results in terms of AUC and G-Mean are presented in <xref ref-type="table" rid="table3">Table 3</xref> and <xref ref-type="table" rid="table4">Table 4</xref>, respectively.</p><p>Those tables show the results averaged over 10 runs, with the standard deviation in parentheses, for the 33 datasets.</p><p>The classification results in terms of AUC with KNN-Und data preparation were compared with our previous work [<xref ref-type="bibr" rid="scirp.60996-ref10">10</xref>]. That paper proposed a Genetic Algorithm (GA) as an oversampling method, with the aim of evolving sub-regions filled with synthetic instances to adjust imbalanced datasets. To reproduce the same experimental conditions as before, the classifier used is the C4.5 decision tree algorithm [<xref ref-type="bibr" rid="scirp.60996-ref32">32</xref>], with 25% pruning and 10-fold cross-validation. 
The Euclidean distance was used for numeric attributes, and the superposition distance for nominal attributes [<xref ref-type="bibr" rid="scirp.60996-ref17">17</xref>] [<xref ref-type="bibr" rid="scirp.60996-ref18">18</xref>].</p><p>The same experimental setup was applied to the C4.5 classifier without balancing (as a baseline comparison) and to the other balancing methods: SMOTE, ENN, NCL and Random Undersampling. The last columns of <xref ref-type="table" rid="table3">Table 3</xref> and <xref ref-type="table" rid="table4">Table 4</xref> present the results of the proposed balancing method, KNN-Und, with the C4.5 and 1-NN classifiers.</p>
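<p>As an illustration of the distance described above, the following minimal Python sketch combines a squared-difference (Euclidean) term for numeric attributes with a superposition (0/1 mismatch) term for nominal attributes. The function name mixed_distance, the argument nominal, and the choice to place both terms under a single square root are assumptions made for this example; attribute normalization is omitted, and this is not the Weka implementation used in the experiments.</p>
<preformat preformat-type="code">
```python
import math


def mixed_distance(a, b, nominal):
    """Distance between two instances with numeric and nominal attributes.

    a, b    : sequences of attribute values of equal length
    nominal : set of attribute indices holding nominal values
              (hypothetical argument, introduced for this sketch)
    """
    total = 0.0
    for i, (u, v) in enumerate(zip(a, b)):
        if i in nominal:
            # Superposition term: 0 when the categories match, 1 otherwise.
            total += 0.0 if u == v else 1.0
        else:
            # Euclidean term: squared difference of the numeric values.
            total += (u - v) ** 2
    return math.sqrt(total)


# Two numeric attributes and one nominal attribute (index 2).
same_color = mixed_distance([1.0, 2.0, "red"], [4.0, 6.0, "red"], nominal={2})
other_color = mixed_distance([1.0, 2.0, "red"], [4.0, 6.0, "blue"], nominal={2})
```
</preformat>
<p>In this sketch a nominal mismatch adds exactly 1 to the squared distance, so numeric attributes on much larger scales would dominate unless they are rescaled first.</p>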
<table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> Datasets ordered by IR before balancing with KNN-Und and the respective effect after KNN-Und</title></caption></table-wrap>
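<p>The kind of effect reported in Table 2 (instances removed from the majority class and the resulting IR) can be reproduced in miniature. The sketch below follows the three steps of Section 3 for a two-class problem with numeric attributes. It is a simplified illustration, not the Weka plug-in: the names knn_und and imbalance_rate, the toy data, and the decision to search neighbors in the original (unreduced) training set are all assumptions made for this example.</p>
<preformat preformat-type="code">
```python
import math


def imbalance_rate(labels, minority_label):
    """IR: number of negative (majority) over positive (minority) instances."""
    negatives = sum(1 for label in labels if label != minority_label)
    positives = len(labels) - negatives
    return negatives / positives


def knn_und(X, y, k, t, minority_label):
    """Return the indices of the instances kept after undersampling.

    X : list of numeric feature vectors
    y : list of class labels, one per instance
    k : number of nearest neighbors inspected for each majority instance
    t : removal threshold on the count of minority neighbors
    """
    kept = []
    for i, xi in enumerate(X):
        if y[i] == minority_label:
            kept.append(i)  # minority instances are never removed
            continue
        # Step 1: k nearest neighbors of the majority instance xi
        # (assumption: searched in the original, unreduced training set).
        others = [(math.dist(xi, X[j]), j) for j in range(len(X)) if j != i]
        neighbors = sorted(others)[:k]
        minority_count = sum(1 for _, j in neighbors if y[j] == minority_label)
        # Step 2: xi is removed when at least t neighbors are minority
        # instances, so it is kept only while t exceeds that count.
        if t > minority_count:
            kept.append(i)
        # Step 3: the loop repeats this for every majority instance.
    return kept


# Toy data: four majority points far from the minority class,
# one majority point (index 4) inside the overlapping area.
X = [[0.0], [0.1], [0.2], [0.3], [1.0], [1.1], [1.2]]
y = ["N", "N", "N", "N", "N", "P", "P"]

kept = knn_und(X, y, k=3, t=1, minority_label="P")
ir_before = imbalance_rate(y, "P")
ir_after = imbalance_rate([y[i] for i in kept], "P")
```
</preformat>
<p>With k = 3 and t = 1, only the majority instance at index 4 is removed, because two of its three nearest neighbors are minority instances; the IR of the toy data drops from 2.5 to 2.0. Raising t to 3 keeps every instance, which illustrates how larger values of t soften the undersampling.</p>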

<p>The 1-NN classifier was included to allow a comparison with the evolutionary algorithm EBUS-MS-GM developed in [11].</p>
<p>The best result for each dataset is marked in bold.</p>
<p>The AUC results in Table 3 show that KNN-Und with the C4.5 algorithm outperformed the four other sampling methods in 19 of 33 datasets and tied in one. Compared with our previous results published in [10], KNN-Und performed better in 11 of 15 datasets.</p>
<p>Figure 2 illustrates the AUC results of the balancing methods with the C4.5 classifier for the 33 datasets. It shows the AUC values of KNN-Und (in green) at or near the top in all datasets.</p>

<p>The results in terms of G-Mean (Table 4) show that KNN-Und outperformed the other methods in 20 of 33 datasets and tied in one. Unlike the GA, SMOTE and Random Undersampling methods, KNN-Und, C4.5, ENN and NCL are deterministic, which leads to more stable results with standard deviations equal to 0. The second-best results were obtained with the NCL algorithm, but excessive undersampling was observed in datasets with IR &lt; 2, which led to G-Mean values of 0.</p>
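<p>A G-Mean of 0 follows directly from the form of the measure. Assuming the usual definition of G-Mean as the square root of the product of the true positive rate and the true negative rate, the short Python sketch below shows that the measure collapses to 0 as soon as one class is entirely misclassified. The function name g_mean and the confusion-matrix counts are invented for illustration and are not taken from the experiments.</p>
<preformat preformat-type="code">
```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of the true positive and true negative rates."""
    tp_rate = tp / (tp + fn)
    tn_rate = tn / (tn + fp)
    return math.sqrt(tp_rate * tn_rate)


# Both classes mostly recognized: rates 0.8 and 0.9.
balanced = g_mean(tp=8, fn=2, tn=45, fp=5)

# Every negative instance misclassified: true negative rate is 0.
collapsed = g_mean(tp=10, fn=0, tn=0, fp=50)
```
</preformat>
<p>In the second case the true positive rate is 1, yet the product of the two rates is 0, so the G-Mean is 0 regardless of how well the other class is recognized.</p>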
<p>Table 5 summarizes the count of the best results of the balancing methods with the C4.5 classifier in terms of AUC and G-Mean. KNN-Und has the highest counts.</p>
<p>These results can be explained by the fact that KNN-Und removes instances from the majority classes and at the same time cleans the decision surface, reducing the class overlapping. Figure 3 and Figure 4 show scatter plots of the datasets EcoliIMU and Satimage4 before and after the balancing methods. The points in blue belong to the majority class, the points in red to the minority class. These plots show the behavior of the methods, as described previously. The SMOTE algorithm distributes synthetic instances homogeneously around each positive instance. ENN removes negative instances around the positive instances, and KNN-Und performs a more aggressive removal of negative instances in the decision surface region.</p>
<p>G-Mean and AUC values were not published per dataset for the evolutionary algorithm EBUS-MS-GM in [11], so the comparison was made with the available results, that is, the average and standard deviation of G-Mean and AUC over the 28 evaluated datasets. Table 6 compares the average results of the KNN-Und and EBUS-MS-GM methods. The KNN-Und results are at least 13 points higher than those of EBUS-MS-GM. It is not reasonable to compare standard deviations here, as the 28 datasets have independent results in both cases. One explanation for the high values obtained could be the 1-NN classifier used for the comparison, which uses a decision boundary similar to that of KNN-Und. However, the results for KNN-Und with the C4.5 decision tree also had higher average values, showing that KNN-Und can also improve the classification results with other algorithms.</p>
<fig id="fig2"><label><xref ref-type="fig" rid="fig2">Figure 2</xref></label><caption>
<title>Comparison of classification with AUC after the balancing methods.</title></caption>
<graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601307x58.png"/></fig> 
<fig id="fig3"><label><xref ref-type="fig" rid="fig3">Figure 3</xref></label><caption>
<title>Scatter plot of EcoliIMU dataset before and after balancing methods. For x = aac (score of amino acid content), y = alm1 (score of the ALOM membrane). Panels, in order: No Balancing, SMOTE, ENN, KNN-Und.</title></caption>
<graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601307x59.png"/><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601307x60.png"/><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601307x61.png"/><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601307x62.png"/></fig> 
<fig id="fig4"><label><xref ref-type="fig" rid="fig4">Figure 4</xref></label><caption>
<title>Scatter plot of Satimage4 dataset before and after balancing. For x = pixel column 6, y = pixel column 31. Panels, in order: No Balancing, SMOTE, ENN, KNN-Und.</title></caption>
<graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601307x63.png"/><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601307x64.png"/><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601307x65.png"/><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601307x66.png"/></fig> 


</sec>
<sec id="s5"><title>5. Conclusions</title>
<p>This work proposed an algorithm, KNN-Und, to adjust datasets with an imbalanced number of instances among the classes, also known as imbalanced datasets. The proposed method is an undersampling method based on the KNN algorithm: it removes instances from the majority class according to the count of neighbors from different classes. In classification experiments on 33 datasets, KNN-Und outperformed six other methods: two studies based on evolutionary algorithms, and the SMOTE, ENN, NCL and Random Undersampling methods.</p>
<p>The good results obtained with KNN-Und can be explained by the fact that it removes instances from the majority classes, thereby reducing the “needle in a haystack” effect, while at the same time cleaning the decision surface, reducing the class overlapping and removing noisy examples. These results indicate that the simplicity of KNN can be used as a basis for constructing efficient algorithms in machine learning and knowledge discovery. They also show that the selective removal of instances from the majority class is a promising alternative to generating instances to balance datasets. This issue is important nowadays, as datasets are approaching the size of petabytes with big data, and retaining only the representative data can be better than creating more data.</p></sec></body>

<back><ref-list><title>References</title><ref id="scirp.60996-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Weiss, G.M. and Provost, F. (2001) The Effect of Class Distribution on Classifier Learning: An Empirical Study. Technical Report MLTR-43, Department of Computer Science, Rutgers University, New Brunswick, NJ, USA.</mixed-citation></ref><ref id="scirp.60996-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">He, H. and Ma, Y. (2013) Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley-IEEE Press, Hoboken, NJ, USA. http://dx.doi.org/10.1002/9781118646106</mixed-citation></ref><ref id="scirp.60996-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Japkowicz, N. (2003) Class Imbalances. Are We Focusing on the Right Issue? Proceedings of the ICML’2003, Workshop on Learning from Imbalanced Data Sets II, Washington DC.</mixed-citation></ref><ref id="scirp.60996-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Gu, Q., Cai, Z., Zhu, L. and Huang, B. (2008) Data Mining on Imbalanced Data Sets. International Conference on Advanced Computer Theory and Engineering, Phuket, 20-22 December, 1020-1024.</mixed-citation></ref><ref id="scirp.60996-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Barandela, R., Sánchez, J.S., García, V. and Rangel, E. (2003) Strategies for Learning in Class Imbalance Problems. Pattern Recognition, 36, 849-851. http://dx.doi.org/10.1016/S0031-3203(02)00257-1</mixed-citation></ref><ref id="scirp.60996-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Fawcett, T. (2004) ROC Graphs: Notes and Practical Considerations for Researchers, HP Laboratories.</mixed-citation></ref><ref id="scirp.60996-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Chawla, N., Bowyer, K., Hall, L. and Kegelmeyer, W.P. 
(2002) SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.</mixed-citation></ref><ref id="scirp.60996-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Wilson, D.L. (1972) Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics, 2, 408-421. http://dx.doi.org/10.1109/TSMC.1972.4309137</mixed-citation></ref><ref id="scirp.60996-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Laurikkala, J. (2001) Improving Identification of Difficult Small Classes by Balancing Class Distribution. Technical Report A-2001-2, University of Tampere, Tampere, Finland.</mixed-citation></ref><ref id="scirp.60996-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Beckmann, M., De Lima, B.S.L.P. and Ebecken, N.F.F. (2011) Genetic Algorithms as a Pre-Processing Strategy for Imbalanced Datasets. Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation, Dublin, 12-16 July 2011, 131-132.</mixed-citation></ref><ref id="scirp.60996-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">García, S. and Herrera, F. (2009) Evolutionary Undersampling for Classification with Imbalanced Datasets: Proposals and Taxonomy. Evolutionary Computation, 17, 275-306. http://dx.doi.org/10.1162/evco.2009.17.3.275</mixed-citation></ref><ref id="scirp.60996-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Hilbert, M. and López, P. (2011) The World’s Technological Capacity to Store, Communicate, and Compute Information. Science, 332, 60-65. 
http://dx.doi.org/10.1126/science.1200970</mixed-citation></ref><ref id="scirp.60996-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Wu, X.D., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J. and Steinberg, D. (2007) Top 10 Algorithms in Data Mining. Knowledge and Information Systems, 14, 1-37. http://dx.doi.org/10.1007/s10115-007-0114-2</mixed-citation></ref><ref id="scirp.60996-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Fix, E. and Hodges, J.L. (1951) Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field.</mixed-citation></ref><ref id="scirp.60996-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Dasarathy, B.V. (1991) Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos.</mixed-citation></ref><ref id="scirp.60996-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Duda, R.O., Hart, P.E. and Stork, D.G. (2001) Pattern Classification. 2nd Edition, John Wiley &amp; Sons Ltd., New York, 202-220.</mixed-citation></ref><ref id="scirp.60996-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Boriah, S., Chandola, V. and Kumar, V. (2007) Similarity Measures for Categorical Data: A Comparative Evaluation. Proceedings of the SIAM International Conference on Data Mining, Minneapolis, 26-28 April 2007, 243-254.</mixed-citation></ref><ref id="scirp.60996-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Wilson, D.R. and Martinez, T.R. (1997) Improved Heterogeneous Distance Functions. 
Journal of Artificial Intelligence Research, 6, 1-34.</mixed-citation></ref><ref id="scirp.60996-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Chawla, N.V., Lazarevic, A., Hall, L.O. and Bowyer, K.W. (2003) SMOTEBoost: Improving Prediction of the Minority Class in Boosting. Proceeding of 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, 22-26 September 2003, 107-119. http://dx.doi.org/10.1007/978-3-540-39804-2_12</mixed-citation></ref><ref id="scirp.60996-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Chen, L., Cai, Z., Chen, L. and Gu, Q. (2010) A Novel Differential Evolution-Clustering Hybrid Resampling Algorithm on Imbalanced Datasets. 3rd International Conference on Knowledge Discovery and Data Mining, Phuket, 9-10 January 2010, 81-85.</mixed-citation></ref><ref id="scirp.60996-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">Han, H., Wang, W.Y. and Mao, B.H. (2005) Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. Proceedings of International Conference on Intelligent Computing, Hefei, 23-26 August 2005, 878-887. http://dx.doi.org/10.1007/11538059_91</mixed-citation></ref><ref id="scirp.60996-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">He, H., Bai, Y. and Garcia, E.A. (2008) ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. Proceedings of International Joint Conference on Neural Networks, Hong Kong, 1-8 June 2008, 1322-1328.</mixed-citation></ref><ref id="scirp.60996-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">Batista, G.E.A.P.A., Prati, R.C. and Monard, M.C. (2004) A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. ACM SIGKDD Explorations Newsletter, 6, 20-29. 
http://dx.doi.org/10.1145/1007730.1007735</mixed-citation></ref><ref id="scirp.60996-ref24"><label>24</label><mixed-citation publication-type="other" xlink:type="simple">Tomek, I. (1976) Two Modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, 6, 769-772. http://dx.doi.org/10.1109/TSMC.1976.4309452</mixed-citation></ref><ref id="scirp.60996-ref25"><label>25</label><mixed-citation publication-type="other" xlink:type="simple">Wang, B.X. and Japkowicz, N. (2004) Imbalanced Data Set Learning with Synthetic Samples. Proceedings of IRIS Machine Learning Workshop, Ottawa, 9 June 2004.</mixed-citation></ref><ref id="scirp.60996-ref26"><label>26</label><mixed-citation publication-type="other" xlink:type="simple">Wilson, D.R. and Martinez, T.R. (2000) Reduction Techniques for Instance-Based Learning Algorithms. Machine Learning, 38, 257-286. http://dx.doi.org/10.1023/A:1007626913721</mixed-citation></ref><ref id="scirp.60996-ref27"><label>27</label><mixed-citation publication-type="other" xlink:type="simple">Van Rijsbergen, C.J. (1979) Information Retrieval. 2nd Edition, Butterworths, Waltham.</mixed-citation></ref><ref id="scirp.60996-ref28"><label>28</label><mixed-citation publication-type="other" xlink:type="simple">Witten, I.H. and Frank, E. (2005) Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco.</mixed-citation></ref><ref id="scirp.60996-ref29"><label>29</label><mixed-citation publication-type="other" xlink:type="simple">Zhang, J.P. and Mani, I. (2003) KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction. Proceeding of International Conference on Machine Learning (ICML 2003), Workshop on Learning from Imbalanced Data Sets, Washington DC, 21 August 2003.</mixed-citation></ref><ref id="scirp.60996-ref30"><label>30</label><mixed-citation publication-type="other" xlink:type="simple">Orriols-Puig, A. and Bernadó-Mansilla, E. 
(2009) Evolutionary Rule-Based Systems for Imbalanced Datasets. Soft Computing, 13, 213-225. http://dx.doi.org/10.1007/s00500-008-0319-7</mixed-citation></ref><ref id="scirp.60996-ref31"><label>31</label><mixed-citation publication-type="other" xlink:type="simple">Blake, C. and Merz, C. (1998) UCI Repository of Machine Learning Databases. Department of Information and Computer Sciences, University of California, Irvine. http://www.ics.uci.edu/~mlearn/MLRepository.html</mixed-citation></ref><ref id="scirp.60996-ref32"><label>32</label><mixed-citation publication-type="book" xlink:type="simple">Kohavi, R. and Quinlan, J.R. (2002) Decision Tree Discovery. In: Klosgen, W. and Zytkow, J.M., Eds., Handbook of Data Mining and Knowledge Discovery, Oxford University Press, New York, 267-276.</mixed-citation></ref></ref-list></back></article>