Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish

Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time-consuming, and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to...

Bibliographic Details
Main authors: Tessore, Juan Pablo; Esnaola, Leonardo; Lanzarini, Laura Cristina; Baldassarri, Sandra
Format: Article
Language: English
Published: 2021
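Subjects: Computer science, Sentiment analysis, Dataset construction, Dataset validation, Facebook, Text mining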
Online access: http://sedici.unlp.edu.ar/handle/10915/138899
id I19-R120-10915-138899
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer science
Sentiment analysis
Dataset construction
Dataset validation
Facebook
Text mining
description Tagged language resources are essential for developing machine-learning text classifiers. However, manual tagging is extremely time-consuming, and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity; redundant classification is therefore required to validate the assigned tag. Although the number of emotion-tagged text datasets in Spanish has been growing in recent years, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particular concern, as few datasets in Spanish include a validation step in their construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss' kappa inter-rater agreement measure. Error analysis is performed using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to that of other manually tagged datasets, with the advantages that the presented dataset contains several hundred thousand tagged comments and does not require extensive manual tagging. The agreement measured among human raters is very similar to that between human raters and the original tag. Every measure presented falls in the moderate agreement range, so the dataset is suitable for training classification algorithms in the field of sentiment analysis.
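The description names Fleiss' kappa as the agreement measure but does not show how it is computed. The following is a minimal sketch of the standard formula, not the authors' code: the function name `fleiss_kappa` and the input layout (one row per comment, one column per emotion category, each cell holding the number of raters who chose that category) are assumptions made for illustration.

```python
import math
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j (hypothetical layout)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if math.isclose(p_e, 1.0):
        return float("nan")             # undefined when only one category is used
    return (p_bar - p_e) / (1.0 - p_e)
```

For example, `fleiss_kappa([[3, 0], [0, 3]])` describes two comments on which three raters fully agree. On the commonly used Landis and Koch scale, values between 0.41 and 0.60 are labeled moderate; whether the article relies on that exact scale is an assumption here.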
format Article
author Tessore, Juan Pablo
Esnaola, Leonardo
Lanzarini, Laura Cristina
Baldassarri, Sandra
author_sort Tessore, Juan Pablo
title Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
publishDate 2021
url http://sedici.unlp.edu.ar/handle/10915/138899
bdutipo_str Repositories
_version_ 1764820458097606656