A web platform for collaborative semi-automatic OCR post-processing
Digital Humanities researchers often use software that helps them find non-trivial relationships among characters in historical texts. Usually, the source texts that contain such information come from OCR-acquired volumes that carry a high number of errors. This wo...
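Main authors: | Mechaca C., Ana L.; Marmanillo, Walter G.; Xamena, Eduardo; Ramirez-Orta, Juan; Maguitman, Ana Gabriela; Milios, Evangelos E. |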
---|---|
Format: | Conference object |
Language: | English |
Published: | 2021 |
Subjects: | |
Online access: | http://sedici.unlp.edu.ar/handle/10915/140119 http://50jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-02.pdf |
Contributed by: |
id |
I19-R120-10915-140119 |
---|---|
record_format |
dspace |
institution |
Universidad Nacional de La Plata |
institution_str |
I-19 |
repository_str |
R-120 |
collection |
SEDICI (UNLP) |
language |
English |
topic |
Computer Science OCR Post-processing Digital Humanities Language Models |
spellingShingle |
Computer Science OCR Post-processing Digital Humanities Language Models Mechaca C., Ana L. Marmanillo, Walter G. Xamena, Eduardo Ramirez-Orta, Juan Maguitman, Ana Gabriela Milios, Evangelos E. A web platform for collaborative semi-automatic OCR post-processing |
topic_facet |
Computer Science OCR Post-processing Digital Humanities Language Models |
description |
Digital Humanities researchers often use software that helps them find non-trivial relationships among characters in historical texts. Usually, the source texts that contain such information come from OCR-acquired volumes that carry a high number of errors. This work explains the development of a web platform for OCR post-processing and ground-truth generation. The platform employs machine learning to accurately predict the correct text from noisy OCR strings. The method relies on transformers used as character-based denoising language models. An active learning workflow is proposed: users can feed their corrections to the platform, generating new annotated data for re-training the underlying machine learning correction models. |
format |
Conference object |
author |
Mechaca C., Ana L. Marmanillo, Walter G. Xamena, Eduardo Ramirez-Orta, Juan Maguitman, Ana Gabriela Milios, Evangelos E. |
author_facet |
Mechaca C., Ana L. Marmanillo, Walter G. Xamena, Eduardo Ramirez-Orta, Juan Maguitman, Ana Gabriela Milios, Evangelos E. |
author_sort |
Mechaca C., Ana L. |
title |
A web platform for collaborative semi-automatic OCR post-processing |
title_short |
A web platform for collaborative semi-automatic OCR post-processing |
title_full |
A web platform for collaborative semi-automatic OCR post-processing |
title_fullStr |
A web platform for collaborative semi-automatic OCR post-processing |
title_full_unstemmed |
A web platform for collaborative semi-automatic OCR post-processing |
title_sort |
web platform for collaborative semi-automatic ocr post-processing |
publishDate |
2021 |
url |
http://sedici.unlp.edu.ar/handle/10915/140119 http://50jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-02.pdf |
work_keys_str_mv |
AT mechacacanal awebplatformforcollaborativesemiautomaticocrpostprocessing AT marmanillowalterg awebplatformforcollaborativesemiautomaticocrpostprocessing AT xamenaeduardo awebplatformforcollaborativesemiautomaticocrpostprocessing AT ramirezortajuan awebplatformforcollaborativesemiautomaticocrpostprocessing AT maguitmananagabriela awebplatformforcollaborativesemiautomaticocrpostprocessing AT miliosevangelose awebplatformforcollaborativesemiautomaticocrpostprocessing AT mechacacanal webplatformforcollaborativesemiautomaticocrpostprocessing AT marmanillowalterg webplatformforcollaborativesemiautomaticocrpostprocessing AT xamenaeduardo webplatformforcollaborativesemiautomaticocrpostprocessing AT ramirezortajuan webplatformforcollaborativesemiautomaticocrpostprocessing AT maguitmananagabriela webplatformforcollaborativesemiautomaticocrpostprocessing AT miliosevangelose webplatformforcollaborativesemiautomaticocrpostprocessing |
bdutipo_str |
Repositories |
_version_ |
1764820458279010304 |