Language modeling tools for massive historical OCR post-processing
Nowadays, a large number of historical documentary collections are available that have not yet been exploited to extract information. Many efforts are being made to digitize these volumes and make them available on digital platforms. However, various obstacles appear in the task of processi...
| Main authors: | Xamena, Eduardo; Maguitman, Ana Gabriela |
|---|---|
| Format: | Conference object |
| Language: | English |
| Published: | 2020 |
| Subjects: | Computer Science; OCR post-processing; Neural language models; Information retrieval |
| Online access: | http://sedici.unlp.edu.ar/handle/10915/116420 http://49jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-15.pdf |
| Contributed by: | |
| id | I19-R120-10915-116420 |
|---|---|
| record_format | dspace |
| institution | Universidad Nacional de La Plata |
| institution_str | I-19 |
| repository_str | R-120 |
| collection | SEDICI (UNLP) |
| language | English |
| topic | Computer Science; OCR post-processing; Neural language models; Information retrieval. |
| spellingShingle | Computer Science OCR post-processing Neural language models Information retrieval. Xamena, Eduardo Maguitman, Ana Gabriela Language modeling tools for massive historical OCR post-processing |
| topic_facet | Computer Science; OCR post-processing; Neural language models; Information retrieval. |
| description | Nowadays, a large number of historical documentary collections are available that have not yet been exploited to extract information. Many efforts are being made to digitize these volumes and make them available on digital platforms. However, various obstacles appear in the task of processing their content. Due to the deterioration of documents and other factors, such as different dialects and language variants, the quality of the digitizations is usually low. NLP tools make it possible to increase the quality of the texts. The current proposal consists of applying NLP tools, particularly neural language models, to the output of different OCR mechanisms. Important improvements in the quality of the texts are expected, as has been the case in many related tasks. The ultimate purpose of this work is to use the resulting digitized texts in information retrieval (IR) and information extraction (IE) platforms. |
| format | Conference object; Conference object |
| author | Xamena, Eduardo; Maguitman, Ana Gabriela |
| author_facet | Xamena, Eduardo; Maguitman, Ana Gabriela |
| author_sort | Xamena, Eduardo |
| title | Language modeling tools for massive historical OCR post-processing |
| title_short | Language modeling tools for massive historical OCR post-processing |
| title_full | Language modeling tools for massive historical OCR post-processing |
| title_fullStr | Language modeling tools for massive historical OCR post-processing |
| title_full_unstemmed | Language modeling tools for massive historical OCR post-processing |
| title_sort | language modeling tools for massive historical ocr post-processing |
| publishDate | 2020 |
| url | http://sedici.unlp.edu.ar/handle/10915/116420 http://49jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-15.pdf |
| work_keys_str_mv | AT xamenaeduardo languagemodelingtoolsformassivehistoricalocrpostprocessing AT maguitmananagabriela languagemodelingtoolsformassivehistoricalocrpostprocessing |
| bdutipo_str | Repositories |
| _version_ | 1764820446994235396 |