Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets

One of the main challenges in automatic email classification arises when the number of classes is relatively large and the classes are highly imbalanced. This happens even when unlabeled text collections are available, because manual labeling is costly. In this respect...

Bibliographic Details
Main authors: Fernández, Juan Manuel, Errecalde, Marcelo Luis
Format: Conference object
Language: English
Published: 2022
Subjects: Computer Science, imbalanced data, automatic classification, information retrieval
Online access: http://sedici.unlp.edu.ar/handle/10915/149456
id I19-R120-10915-149456
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer Science
imbalanced data
automatic classification
information retrieval
spellingShingle Computer Science
imbalanced data
automatic classification
information retrieval
Fernández, Juan Manuel
Errecalde, Marcelo Luis
Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
topic_facet Computer Science
imbalanced data
automatic classification
information retrieval
description One of the main challenges in automatic email classification arises when the number of classes is relatively large and the classes are highly imbalanced. This happens even when unlabeled text collections are available, because manual labeling is costly. All automatic text classification strategies are sensitive, to a greater or lesser extent, to class imbalance. The most widely used approaches for learning from imbalanced datasets consist of resampling techniques, which either undersample or oversample the data. However, existing techniques still have unresolved problems. In this work we present a new proposal that balances the classes of the dataset by retrieving unlabeled instances (e-mails) that are similar to those of the minority classes. For the dataset used, the strategy is shown to be valid, viable and competitive with the resampling strategies currently used to learn from imbalanced email databases.
format Conference object
Conference object
author Fernández, Juan Manuel
Errecalde, Marcelo Luis
author_facet Fernández, Juan Manuel
Errecalde, Marcelo Luis
author_sort Fernández, Juan Manuel
title Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
title_short Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
title_full Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
title_fullStr Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
title_full_unstemmed Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
title_sort instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
publishDate 2022
url http://sedici.unlp.edu.ar/handle/10915/149456
work_keys_str_mv AT fernandezjuanmanuel instanceretrievalfromnonlabeleddataasastrategyforautomaticclassifcationofimbalancedemaildatasets
AT errecaldemarceloluis instanceretrievalfromnonlabeleddataasastrategyforautomaticclassifcationofimbalancedemaildatasets
bdutipo_str Repositorios
_version_ 1764820462820392960
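
The description field above summarizes the proposal as balancing classes by retrieving unlabeled e-mails that are similar to those of the minority classes. The record does not say how similarity is computed, so the following is only a minimal sketch under stated assumptions: it uses plain bag-of-words cosine similarity, summarizes a minority class by the summed term counts of its labeled examples, and runs on invented sample messages. The function names (`bag_of_words`, `cosine`, `retrieve_similar`) are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased, whitespace-tokenized term counts (an assumed, very simple representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(minority_docs, unlabeled_docs, k):
    """Return the k unlabeled documents most similar to a minority class.

    Assumption: the class is represented by the summed term counts of its
    labeled examples; the paper may build its query or ranking differently.
    """
    query = Counter()
    for doc in minority_docs:
        query.update(bag_of_words(doc))
    ranked = sorted(
        unlabeled_docs,
        key=lambda doc: cosine(query, bag_of_words(doc)),
        reverse=True,
    )
    return ranked[:k]


# Invented sample data, for illustration only.
minority = ["please reset my campus password", "password reset request"]
unlabeled = [
    "exam schedule for next week",
    "i forgot my password and need a reset",
    "library opening hours",
]
candidates = retrieve_similar(minority, unlabeled, k=1)
```

In a sketch like this, the retrieved candidates would be considered as additional training examples for the minority class, in contrast to undersampling or oversampling, which only reuse or discard already labeled data.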