FDR²-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems

Bibliographic Details
Main authors: Basgall, María; Naiouf, Marcelo; Fernández, Alberto
Format: Article
Language: English
Published: 2021
Subjects: Computer Science; Big data; Data reduction; Classification; Preprocessing techniques; Apache Spark
Online access: http://sedici.unlp.edu.ar/handle/10915/125448
https://www.mdpi.com/2079-9292/10/15/1757
Contributed by: SEDICI (UNLP), Universidad Nacional de La Plata
id I19-R120-10915-125448
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer Science
Big data
Data reduction
Classification
Preprocessing techniques
Apache Spark
description In this paper, we present FDR²-BD, a methodological data condensation approach for reducing tabular big datasets in classification problems. The key idea is to analyze the data in two directions (vertical and horizontal), combining feature selection, which generates dense clusters of data, with uniform sampling reduction, which keeps only a few representative samples from each problem area. Its main advantage is that the model's predictive quality is kept within a range set by a user-defined threshold. Its robustness rests on a hyper-parametrization process in which all data are taken into consideration by following a k-fold procedure. It is also fast and scalable, as it relies on fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, outperforming state-of-the-art solutions such as FCNN_MR, which barely reach 70%. The most promising outcome is that the representativeness of the original data is maintained, with predictive quality values within about 1% of the baseline.
format Article
author Basgall, María
Naiouf, Marcelo
Fernández, Alberto
title FDR²-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems
publishDate 2021
url http://sedici.unlp.edu.ar/handle/10915/125448
https://www.mdpi.com/2079-9292/10/15/1757
bdutipo_str Repositories
_version_ 1764820451687661569
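
The description in this record mentions two ideas that a small example may help make concrete: keeping a uniform sample from each dense cluster, and choosing the strongest reduction whose predictive quality stays within a user threshold of the baseline. The following is only a minimal single-machine sketch in plain Python, not the authors' Spark implementation. The function names `uniform_sample_per_cluster` and `recommend_fraction`, the assumption that cluster labels are already available, and all numbers in the usage lines are hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def uniform_sample_per_cluster(rows, cluster_of, fraction, seed=0):
    """Keep a uniform random `fraction` of the rows in each cluster.

    Assumption: `cluster_of(row)` returns an already computed, sortable
    cluster label. At least one row is kept per cluster so that no
    region of the data disappears entirely.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for row in rows:
        clusters[cluster_of(row)].append(row)
    kept = []
    for label in sorted(clusters):
        members = clusters[label]
        n_keep = max(1, int(len(members) * fraction))
        kept.extend(rng.sample(members, n_keep))
    return kept


def recommend_fraction(fractions, quality_of, baseline, threshold):
    """Return the smallest fraction whose quality loss is within `threshold`.

    Assumption: `quality_of(fraction)` returns a quality score (for
    example, one averaged over k folds) for a model trained on data
    reduced to that fraction. Falls back to 1.0 (no reduction) when no
    candidate fraction satisfies the threshold.
    """
    for fraction in sorted(fractions):
        if baseline - quality_of(fraction) <= threshold:
            return fraction
    return 1.0


# Toy usage with made-up data: 100 rows in cluster 0, 60 rows in cluster 1.
rows = [(i, 0) for i in range(100)] + [(i, 1) for i in range(100, 160)]
kept = uniform_sample_per_cluster(rows, cluster_of=lambda r: r[1], fraction=0.1)

# Made-up quality scores per candidate fraction, for illustration only.
toy_quality = {0.01: 0.80, 0.05: 0.895, 0.25: 0.90}
best = recommend_fraction(toy_quality, toy_quality.get, baseline=0.90, threshold=0.01)
```

In this toy run, `kept` holds 10 rows from cluster 0 and 6 from cluster 1, and `best` is 0.05, the smallest candidate fraction whose quality loss does not exceed the threshold. A distributed version would replace the in-memory grouping with the corresponding parallel operations of the chosen framework.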