SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data

The volume of data in today’s applications has changed the way Machine Learning problems are addressed. Indeed, the Big Data scenario imposes scalability constraints that can only be met through intelligent model design and the use of distributed technologies. In this context, solutio...

Bibliographic Details
Main authors: Basgall, María José, Hasperué, Waldo, Naiouf, Marcelo, Fernández, Alberto, Herrera, Francisco
Format: Article
Language: English
Published: 2018
Subjects:
Online access: http://sedici.unlp.edu.ar/handle/10915/71652
http://journal.info.unlp.edu.ar/JCST/article/view/1122
Contributed by:
id I19-R120-10915-71652
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer Science
big data
imbalanced classification
SMOTE
pre-processing
Spark
description The volume of data in today’s applications has changed the way Machine Learning problems are addressed. Indeed, the Big Data scenario imposes scalability constraints that can only be met through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important framework within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and it is therefore usually harder to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples across classes. In this work we present SMOTE-BD, a fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is designed to be independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.
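The description says that SMOTE creates synthetic instances from the neighborhood of each minority-class example. As a rough illustration of that idea only, the following is a minimal single-machine sketch of classic SMOTE interpolation. It is not the SMOTE-BD implementation and involves no Spark or partitioning; the function names `smote` and `squared_distance`, the parameters `n_synthetic`, `k`, and `seed`, and the brute-force neighbor search are all assumptions made for this example.

```python
import random
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority: List[Sequence[float]], n_synthetic: int,
          k: int = 5, seed: int = 0) -> List[List[float]]:
    """Generate n_synthetic points by interpolating minority examples.

    Hypothetical helper for illustration: brute-force neighbor search,
    which is exact but only practical for small inputs.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    # For every minority example, find its k nearest minority neighbors.
    neighbors = []
    for i, point in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: squared_distance(point, minority[j]))
        neighbors.append(others[:k])

    synthetic = []
    for n in range(n_synthetic):
        i = n % len(minority)           # cycle through the minority examples
        j = rng.choice(neighbors[i])    # pick one of its k neighbors
        gap = rng.random()              # interpolation factor in [0, 1)
        base, other = minority[i], minority[j]
        # The new point lies on the segment between the example and its neighbor.
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


if __name__ == "__main__":
    minority_examples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for new_point in smote(minority_examples, n_synthetic=3, k=2):
        print(new_point)
```

Because every synthetic point is a convex combination of two existing minority examples, the new points stay inside the region the minority class already occupies. How SMOTE-BD distributes the neighbor search across partitions is not covered by this record and is not modeled here.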
format Article
author Basgall, María José
Hasperué, Waldo
Naiouf, Marcelo
Fernández, Alberto
Herrera, Francisco
title SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
publishDate 2018