FDR²-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems
Main authors: | Basgall, María; Naiouf, Marcelo; Fernández, Alberto |
---|---|
Format: | Article |
Language: | English |
Published: | 2021 |
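Subjects: | Computer Science; Big data; Data reduction; Classification; Preprocessing techniques; Apache Spark |
Online access: | http://sedici.unlp.edu.ar/handle/10915/125448 https://www.mdpi.com/2079-9292/10/15/1757 |
Contributed by: |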
id | I19-R120-10915-125448 |
---|---|
record_format | dspace |
institution | Universidad Nacional de La Plata |
institution_str | I-19 |
repository_str | R-120 |
collection | SEDICI (UNLP) |
language | English |
topic | Computer Science; Big data; Data reduction; Classification; Preprocessing techniques; Apache Spark |
description | This paper presents FDR²-BD, a methodological data condensation approach for reducing big tabular datasets in classification problems. The key idea is to analyze the data along two dimensions, vertical (features) and horizontal (instances), combining feature selection, which generates dense clusters of data, with uniform sampling reduction, which keeps only a few representative samples from each problem area. Its main advantage is that the model’s predictive quality is kept within a range determined by a user-defined threshold. Its robustness rests on a hyperparameterization process that takes all the data into account through a k-fold procedure. It is also fast and scalable, because it relies on fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the reduction percentages obtained are above 95%, outperforming state-of-the-art solutions such as FCNN_MR, which barely reach 70%. The most promising outcome is that the representativeness of the original data is maintained, with predictive quality within about 1% of the baseline. |
format | Article |
author | Basgall, María; Naiouf, Marcelo; Fernández, Alberto |
author_sort | Basgall, María |
title | FDR²-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems |
publishDate | 2021 |
url | http://sedici.unlp.edu.ar/handle/10915/125448 https://www.mdpi.com/2079-9292/10/15/1757 |
bdutipo_str | Repositories |
_version_ | 1764820451687661569 |
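
The description in this record mentions uniform sampling reduction, a user threshold on predictive quality, and a k-fold procedure. The following is a minimal illustrative sketch of how those three ideas could fit together; it is not the authors' implementation. It covers only the horizontal (instance) reduction, omits feature selection and Apache Spark, and uses a nearest-centroid classifier as a stand-in model. All function names, the candidate fractions, and the default threshold are assumptions made for illustration.

```python
import random
from collections import defaultdict


def uniform_sample_per_class(indices, labels, keep_fraction, rng):
    """Uniformly keep about keep_fraction of the given indices within each class."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    kept = []
    for members in by_class.values():
        n_keep = max(1, round(len(members) * keep_fraction))
        kept.extend(rng.sample(members, n_keep))
    return kept


def fit_centroids(rows, labels, indices):
    """Stand-in classifier (an assumption): one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for i in indices:
        y = labels[i]
        if y not in sums:
            sums[y] = [0.0] * len(rows[i])
        sums[y] = [s + v for s, v in zip(sums[y], rows[i])]
        counts[y] += 1
    return {y: [s / counts[y] for s in total] for y, total in sums.items()}


def predict(centroids, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(row, centroids[y])),
    )


def kfold_accuracy(rows, labels, folds, keep_fraction, rng):
    """Mean test accuracy when each training split is reduced before fitting."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for fold in folds for i in fold if i not in held]
        reduced = uniform_sample_per_class(train, labels, keep_fraction, rng)
        model = fit_centroids(rows, labels, reduced)
        hits = sum(predict(model, rows[i]) == labels[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / len(scores)


def recommend_keep_fraction(rows, labels, threshold=0.01,
                            candidates=(0.01, 0.05, 0.1, 0.25, 0.5),
                            k=5, seed=0):
    """Smallest candidate fraction whose accuracy loss stays within threshold."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    folds = [order[f::k] for f in range(k)]  # the same folds are reused for every candidate
    baseline = kfold_accuracy(rows, labels, folds, 1.0, rng)
    for fraction in sorted(candidates):
        score = kfold_accuracy(rows, labels, folds, fraction, rng)
        if baseline - score <= threshold:
            return fraction, score, baseline
    return 1.0, baseline, baseline  # no candidate qualified: keep all data


if __name__ == "__main__":
    # Synthetic two-class data, used only to exercise the sketch.
    data_rng = random.Random(42)
    rows = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(500)]
    rows += [[data_rng.gauss(6, 1), data_rng.gauss(6, 1)] for _ in range(500)]
    labels = [0] * 500 + [1] * 500
    fraction, score, baseline = recommend_keep_fraction(rows, labels)
    print(f"keep {fraction:.0%} of rows: accuracy {score:.3f} vs baseline {baseline:.3f}")
```

Trying candidate fractions from smallest to largest returns the most aggressive reduction that still satisfies the threshold. Reusing the same folds for the baseline and for every candidate keeps the accuracy comparison consistent.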