A web platform for collaborative semi-automatic OCR post-processing

Digital Humanities researchers often use software to help them find non-trivial relationships among characters in historical texts. The source texts usually come from OCR-acquired volumes, which carry a high number of errors. This wo...

Bibliographic Details
Main authors: Mechaca C., Ana L.; Marmanillo, Walter G.; Xamena, Eduardo; Ramirez-Orta, Juan; Maguitman, Ana Gabriela; Milios, Evangelos E.
Format: Conference object
Language: English
Published: 2021
Online access: http://sedici.unlp.edu.ar/handle/10915/140119
http://50jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-02.pdf
id I19-R120-10915-140119
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer Science
OCR Post-processing
Digital Humanities
Language Models
description Digital Humanities researchers often use software to help them find non-trivial relationships among characters in historical texts. The source texts usually come from OCR-acquired volumes, which carry a high number of errors. This work describes the development of a web platform for OCR post-processing and ground-truth generation. The platform uses machine learning to predict the correct text from noisy OCR strings. The method relies on transformers trained as character-based denoising language models. An active learning workflow is proposed: users feed their corrections back to the platform, generating new annotated data for re-training the underlying machine learning correction models.
format Conference object
author Mechaca C., Ana L.
Marmanillo, Walter G.
Xamena, Eduardo
Ramirez-Orta, Juan
Maguitman, Ana Gabriela
Milios, Evangelos E.
author_sort Mechaca C., Ana L.
title A web platform for collaborative semi-automatic OCR post-processing
publishDate 2021
url http://sedici.unlp.edu.ar/handle/10915/140119
http://50jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-02.pdf
bdutipo_str Repositories
_version_ 1764820458279010304