Language modeling tools for massive historical OCR post-processing

Nowadays, a large number of historical documentary collections are available that have not yet been exploited to extract information. Many efforts are being made to digitize these volumes and make them available on digital platforms. However, various obstacles appear in the task of processi...

Bibliographic Details
Main authors: Xamena, Eduardo; Maguitman, Ana Gabriela
Format: Conference object
Language: English
Published: 2020
Subjects:
Online access: http://sedici.unlp.edu.ar/handle/10915/116420
http://49jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-15.pdf
Contributed by:
id I19-R120-10915-116420
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer Science
OCR post-processing
Neural language models
Information retrieval.
spellingShingle Computer Science
OCR post-processing
Neural language models
Information retrieval.
Xamena, Eduardo
Maguitman, Ana Gabriela
Language modeling tools for massive historical OCR post-processing
topic_facet Computer Science
OCR post-processing
Neural language models
Information retrieval.
description Nowadays, a large number of historical documentary collections are available that have not yet been exploited to extract information. Many efforts are being made to digitize these volumes and make them available on digital platforms. However, various obstacles arise when processing their content. Because of document deterioration and other factors, such as differing dialects and language variants, the quality of the digitizations is usually low. NLP tools make it possible to improve the quality of these texts. The current proposal consists of applying NLP tools, particularly neural language models, to post-process the output of different OCR mechanisms. Important improvements in text quality are expected, as has been the case in many related tasks. The ultimate purpose of this work is to use the resulting digitized texts in information retrieval (IR) and information extraction (IE) platforms.
format Conference object
Conference object
author Xamena, Eduardo
Maguitman, Ana Gabriela
author_facet Xamena, Eduardo
Maguitman, Ana Gabriela
author_sort Xamena, Eduardo
title Language modeling tools for massive historical OCR post-processing
title_short Language modeling tools for massive historical OCR post-processing
title_full Language modeling tools for massive historical OCR post-processing
title_fullStr Language modeling tools for massive historical OCR post-processing
title_full_unstemmed Language modeling tools for massive historical OCR post-processing
title_sort language modeling tools for massive historical ocr post-processing
publishDate 2020
url http://sedici.unlp.edu.ar/handle/10915/116420
http://49jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-15.pdf
work_keys_str_mv AT xamenaeduardo languagemodelingtoolsformassivehistoricalocrpostprocessing
AT maguitmananagabriela languagemodelingtoolsformassivehistoricalocrpostprocessing
bdutipo_str Repositories
_version_ 1764820446994235396