Title
Analysis of Data Preprocessing Increasing the Oversampling Ratio for Extremely Imbalanced Big Data Classification
Abstract
The term "big data" has caught the attention of experts in the context of learning from data. It describes the exponential growth and availability of data, both structured and unstructured. Designing effective models that can process these data and extract useful knowledge from them is an immense challenge. In classification problems, many real-world applications present a class distribution in which one or more classes are represented by a large number of examples, while the other classes, which are precisely those of primary interest, have a negligible number of examples. This circumstance is known as the problem of classification with imbalanced datasets. In this work, we analyze a hypothesis for increasing the accuracy of the underrepresented class when dealing with extremely imbalanced big data problems under the MapReduce framework. The performance of our solution has been analyzed in an experimental study carried out on the extremely imbalanced big data problem that was used in the ECBDL'14 Big Data Competition. The results obtained show that it is necessary to find a balance between the classes in order to obtain the highest precision.
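The idea of raising the oversampling ratio can be illustrated with a small single-machine sketch. The abstract does not name the oversampling method or its parameters, so random oversampling with replacement is assumed here, and the function name `random_oversample` and its `ratio` parameter are hypothetical. A ratio of 1.0 balances the classes; a ratio above 1.0 makes the originally underrepresented class larger than the majority class. The MapReduce distribution of the work is not modeled.

```python
import random
from collections import Counter


def random_oversample(rows, labels, minority_label, ratio=1.0, seed=0):
    """Replicate minority examples at random (with replacement) until the
    minority count reaches ratio * majority count.

    Assumption: plain random oversampling; the abstract does not specify
    which oversampling technique was used.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    if not minority:
        raise ValueError("no minority examples to replicate")
    majority_count = len(labels) - len(minority)

    # Number of minority examples wanted after oversampling.
    target = int(round(ratio * majority_count))
    extra = max(0, target - len(minority))

    # Draw the extra examples uniformly from the existing minority examples.
    picks = [rng.choice(minority) for _ in range(extra)]
    new_rows = list(rows) + [rows[i] for i in picks]
    new_labels = list(labels) + [labels[i] for i in picks]
    return new_rows, new_labels


# Toy data: 2 minority examples (label 1) against 98 majority examples (label 0).
rows = [[float(i)] for i in range(100)]
labels = [1 if i < 2 else 0 for i in range(100)]

for ratio in (1.0, 1.5):
    _, y = random_oversample(rows, labels, minority_label=1, ratio=ratio)
    print(ratio, Counter(y))
```

Varying `ratio` and measuring the per-class accuracy of a classifier trained on the resampled data is one way to look for the balance point between the classes that the abstract refers to.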
Year
2015
DOI
10.1109/Trustcom-BigDataSe-ISPA.2015.579
Venue
TrustCom/BigDataSE/ISPA
Keywords
Big data, Hadoop, MapReduce, Imbalance classification, Preprocessing
DocType
Conference
Volume
2
ISSN
2324-9013
Citations
6
PageRank
0.48
References
13
Authors
3
Name
Order
Citations
PageRank
S. del Río12438.92
José Manuel Benítez288856.02
Francisco Herrera3273911168.49