Exploiting user-frequency information for mining regionalisms in Argentinian Spanish from Twitter

The task of detecting regionalisms (expressions or words used in certain regions) has traditionally relied on the use of questionnaires and surveys, heavily depending on the expertise and intuition of the surveyor. The emergence of social media and microblogging services has produced an unprecede...

Bibliographic Details
Main authors: Gravano, Agustín; Pérez, Juan Manuel; Aleman, Damian E.; Kalinowski, Santiago N.
Format: Article (publishedVersion)
Language: Spanish
Published: Procesamiento del Lenguaje Natural, Revista 2022
Subjects: Lexical dialectology, Social media, Spanish variants, Entropy
Online access: https://repositorio.utdt.edu/handle/20.500.13098/11442
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6427
Contributed by:
id I57-R163-20.500.13098-11442
record_format dspace
institution Universidad Torcuato Di Tella
institution_str I-57
repository_str R-163
collection Repositorio Digital Universidad Torcuato Di Tella
language Spanish
orig_language_str_mv spa
topic Lexical dialectology
Social media
Spanish variants
Entropy
spellingShingle Lexical dialectology
Social media
Spanish variants
Entropy
Gravano, Agustín
Pérez, Juan Manuel
Aleman, Damian E.
Kalinowski, Santiago N.
Exploiting user-frequency information for mining regionalisms in Argentinian Spanish from Twitter
topic_facet Lexical dialectology
Social media
Spanish variants
Entropy
description The task of detecting regionalisms (expressions or words used in certain regions) has traditionally relied on the use of questionnaires and surveys, heavily depending on the expertise and intuition of the surveyor. The emergence of social media and microblogging services has produced an unprecedented wealth of content (mainly informal text generated by users), opening new opportunities for linguists to extend their studies of language variation. Previous work on the automatic detection of regionalisms depended mostly on word frequencies. In this work, we present a novel metric based on Information Theory that incorporates user frequency. We tested this metric on a corpus of Argentinian Spanish tweets in two ways: via manual annotation of the relevance of the retrieved terms, and also as a feature selection method for geolocation of users. In both cases, our metric outperformed other techniques based on word frequency, suggesting that measuring the number of users that use a word is an informative feature. This tool has helped lexicographers discover several unregistered words of Argentinian Spanish, as well as different meanings assigned to registered words.
format Article
publishedVersion
author Gravano, Agustín
Pérez, Juan Manuel
Aleman, Damian E.
Kalinowski, Santiago N.
author_facet Gravano, Agustín
Pérez, Juan Manuel
Aleman, Damian E.
Kalinowski, Santiago N.
author_sort Gravano, Agustín
title Exploiting user-frequency information for mining regionalisms in Argentinian Spanish from Twitter
title_sort exploiting user-frequency information for mining regionalisms in argentinian spanish from twitter
publisher Procesamiento del Lenguaje Natural, Revista
publishDate 2022
url https://repositorio.utdt.edu/handle/20.500.13098/11442
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6427
work_keys_str_mv AT gravanoagustin exploitinguserfrequencyinformationforminingregionalismsinargentinianspanishfromtwitter
AT perezjuanmanuel exploitinguserfrequencyinformationforminingregionalismsinargentinianspanishfromtwitter
AT alemandamiane exploitinguserfrequencyinformationforminingregionalismsinargentinianspanishfromtwitter
AT kalinowskisantiagon exploitinguserfrequencyinformationforminingregionalismsinargentinianspanishfromtwitter
bdutipo_str Repositorios
_version_ 1764820542380048386