Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets

One of the main challenges in automatic email classification problems occurs when it is necessary to work with a relatively large number of classes and the classes are highly imbalanced. That happens even when non-labeled textual bases are available because manual labeling is costly. In this respect...

Bibliographic Details
Main authors:	Fernández, Juan Manuel; Errecalde, Marcelo Luis
Format:	Conference object
Language:	English
Published:	2022
Subjects:	Computer Science; imbalanced data; automatic classification; information retrieval
Online access:	http://sedici.unlp.edu.ar/handle/10915/149456
Contributed by:	SEDICI (UNLP), Universidad Nacional de La Plata

id	I19-R120-10915-149456
record_format	dspace
institution	Universidad Nacional de La Plata
institution_str	I-19
repository_str	R-120
collection	SEDICI (UNLP)
language	English
topic	Computer Science; imbalanced data; automatic classification; information retrieval
spellingShingle	Computer Science; imbalanced data; automatic classification; information retrieval; Fernández, Juan Manuel; Errecalde, Marcelo Luis; Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
topic_facet	Computer Science; imbalanced data; automatic classification; information retrieval
description	One of the main challenges in automatic email classification arises when the number of classes is relatively large and the classes are highly imbalanced. This happens even when unlabeled text collections are available, because manual labeling is costly. All automatic text classification strategies are, to a greater or lesser extent, sensitive to class imbalance. The most widely used approaches for learning from imbalanced datasets consist of resampling techniques, either undersampling or oversampling the data. However, existing techniques still have unresolved problems. In this work we present a new proposal that balances the classes of the dataset by retrieving unlabeled instances (e-mails) that are similar to those of the minority classes. For the dataset used, it is shown to be a valid, viable and competitive strategy compared with the resampling strategies currently used to learn from imbalanced email datasets.
format	Conference object
author	Fernández, Juan Manuel; Errecalde, Marcelo Luis
author_facet	Fernández, Juan Manuel; Errecalde, Marcelo Luis
author_sort	Fernández, Juan Manuel
title	Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
title_short	Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
title_full	Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
title_fullStr	Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
title_full_unstemmed	Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
title_sort	instance retrieval from non-labeled data as a strategy for automatic classifcation of imbalanced e-mail datasets
publishDate	2022
url	http://sedici.unlp.edu.ar/handle/10915/149456
work_keys_str_mv	AT fernandezjuanmanuel instanceretrievalfromnonlabeleddataasastrategyforautomaticclassifcationofimbalancedemaildatasets AT errecaldemarceloluis instanceretrievalfromnonlabeleddataasastrategyforautomaticclassifcationofimbalancedemaildatasets
bdutipo_str	Repositories
_version_	1764820462820392960
