Learning the costs for a string edit distance-based similarity measure for abbreviated language

We present work in progress on word normalization for user-generated content. The approach is simple and helps reduce the amount of manual annotation characteristic of more classical approaches. First, orthographic variants of a word, mostly abbreviations, are grouped together. From these manual...

Bibliographic Details
Main author:	Alonso i Alemany, Laura
Format:	Conference object
Language:	English
Published:	2010
Subjects:	Computer Science; Natural Language Processing; String Edit Distances
Online access:	http://sedici.unlp.edu.ar/handle/10915/152590 http://39jaiio.sadio.org.ar/sites/default/files/39jaiio-asai-07.pdf
Contributed by:	SEDICI (UNLP), Universidad Nacional de La Plata

id	I19-R120-10915-152590
record_format	dspace
spelling	2023-05-08T20:03:59Z issn:1850-2784 2023-05-08T17:53:41Z en Sociedad Argentina de Informática e Investigación Operativa http://creativecommons.org/licenses/by-nc-sa/4.0/ Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) application/pdf 72-81
institution	Universidad Nacional de La Plata
institution_str	I-19
repository_str	R-120
collection	SEDICI (UNLP)
language	English
topic	Computer Science; Natural Language Processing; String Edit Distances
description	We present work in progress on word normalization for user-generated content. The approach is simple and helps reduce the amount of manual annotation characteristic of more classical approaches. First, orthographic variants of a word, mostly abbreviations, are grouped together. From these manually grouped examples, we learn an automated classifier that, given a previously unseen word, determines whether it is an orthographic variant of a known word or an entirely new word. To do that, we calculate the similarity between the unseen word and all known words, and classify the new word as an orthographic variant of its most similar word. The classifier applies a string similarity measure based on the Levenshtein edit distance. To improve the accuracy of this measure, we assign each edit operation an error-based cost. This cost-assignment scheme aims to maximize the distance between similar strings that are variants of different words. The custom similarity measure achieves an accuracy of 0.68, a substantial improvement over the 0.54 obtained with the plain Levenshtein distance.
format	Conference object
author	Alonso i Alemany, Laura
title	Learning the costs for a string edit distance-based similarity measure for abbreviated language
publishDate	2010
url	http://sedici.unlp.edu.ar/handle/10915/152590 http://39jaiio.sadio.org.ar/sites/default/files/39jaiio-asai-07.pdf
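
Illustrative sketch (not part of the catalogued record). The description field above outlines a nearest-word classifier built on a Levenshtein distance with per-operation costs. The Python below is a minimal sketch of that general idea, not the author's implementation: the function names, the cheap_vowel_deletion cost table, the example word list, and the max_distance threshold are all hypothetical, and the error-based procedure the paper uses to learn the costs is not reproduced here.

```python
from typing import Callable, Dict, Optional

# A cost function takes (source_char, target_char); "" marks the empty side
# of an insertion or a deletion.
Cost = Callable[[str, str], float]


def unit_cost(a: str, b: str) -> float:
    """Plain Levenshtein: every insertion, deletion, or substitution costs 1."""
    return 0.0 if a == b else 1.0


def cheap_vowel_deletion(a: str, b: str) -> float:
    """Hypothetical hand-set costs: dropping a vowel is cheap (assumption)."""
    if a == b:
        return 0.0
    if b == "" and a in "aeiou":
        return 0.25
    return 1.0


def weighted_edit_distance(s: str, t: str, cost: Cost = unit_cost) -> float:
    """Dynamic-programming edit distance from s to t with pluggable costs."""
    # prev[j] holds the cost of turning the processed prefix of s into t[:j].
    prev = [0.0]
    for ch in t:
        prev.append(prev[-1] + cost("", ch))
    for a in s:
        cur = [prev[0] + cost(a, "")]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + cost(a, ""),     # delete a from s
                cur[j - 1] + cost("", b),  # insert b into s
                prev[j - 1] + cost(a, b),  # substitute a with b (or match)
            ))
        prev = cur
    return prev[-1]


def classify(unseen: str, known: Dict[str, str], max_distance: float,
             cost: Cost = unit_cost) -> Optional[str]:
    """Return the word whose known variant is closest to `unseen`,
    or None when nothing is within max_distance (treated as a new word)."""
    best_form = min(known, key=lambda form: weighted_edit_distance(form, unseen, cost))
    if weighted_edit_distance(best_form, unseen, cost) > max_distance:
        return None
    return known[best_form]


# Hypothetical manually grouped variants: surface form -> normalized word.
KNOWN = {"please": "please", "pls": "please",
         "tomorrow": "tomorrow", "tmrw": "tomorrow"}

if __name__ == "__main__":
    print(classify("plz", KNOWN, max_distance=1.5))
    print(weighted_edit_distance("tomorrow", "tmrrw", cheap_vowel_deletion))
```

Swapping unit_cost for a different cost function is the only change needed to move from the plain Levenshtein baseline to a custom measure; how those costs are learned from the grouped examples is left open in this sketch.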
