Is an average OCR quality of 70% good enough for my study? What OCR quality should we require from external suppliers? Should we redo the OCR of our collections to bring it from 80% to 85%? Libraries and researchers alike face the same dilemma in our times of textual abundance: when is OCR quality good enough? User access, scientific results and the investment of limited resources increasingly depend on answering this question.
CREATE member Giovanni Colavizza did a research residency at the National Library of the Netherlands (KB), working with the Library’s DH team on this question. During the project, Giovanni conducted a comprehensive assessment of the impact of OCR quality on Dutch newspaper, journal and book collections via extrinsic evaluation: measuring performance on a set of representative downstream tasks, such as text classification and clustering.
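To make the setup concrete, here is a minimal sketch of what such an extrinsic evaluation can look like: a classifier is trained and tested separately on documents grouped by OCR-quality band, and accuracy is compared across bands. This is an illustration, not the project’s pipeline; the input file, its columns (`text`, `label`, `ocr_quality`) and the quality bands are assumptions.

```python
# Illustrative extrinsic evaluation: compare downstream classification
# accuracy across OCR-quality bands. The CSV layout is hypothetical,
# not the project's actual data format.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("articles.csv")  # assumed columns: text, label, ocr_quality

# Bin documents by estimated OCR quality (e.g. mean word confidence).
bands = [(0.0, 0.7), (0.7, 0.8), (0.8, 0.9), (0.9, 1.0)]

for lo, hi in bands:
    subset = df[(df.ocr_quality >= lo) & (df.ocr_quality < hi)]
    if len(subset) < 100:
        continue  # too few documents for a stable estimate
    X_train, X_test, y_train, y_test = train_test_split(
        subset.text, subset.label, test_size=0.2, random_state=42
    )
    vec = TfidfVectorizer(max_features=20_000)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_train), y_train)
    acc = accuracy_score(y_test, clf.predict(vec.transform(X_test)))
    print(f"OCR quality [{lo:.1f}, {hi:.1f}): accuracy = {acc:.3f}")
```

If accuracy stays roughly flat across bands, the downstream task is robust to the OCR noise present in the collection; a sharp drop in the lowest band would indicate where re-OCRing pays off.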
The results are very encouraging: topic modelling and document classification were found to work well on OCRed texts of varying quality. While more work is needed, including evaluation on other datasets and with other methods, these results suggest that the quality of existing OCR may already be sufficient for a variety of machine learning tasks. All results, a link to the final project webinar, and the code and data are provided below.
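To illustrate the kind of robustness behind that finding, the sketch below injects synthetic character noise into a toy corpus at increasing rates and fits a small topic model at each rate; with moderate noise, the dominant topic words tend to survive. The corpus and the uniform noise model are illustrative assumptions (real OCR errors are not uniformly distributed), and the sketch does not use the project’s data or code.

```python
# Robustness probe: does a topic model still recover sensible topics
# as synthetic OCR-style noise increases? Corpus and noise model are
# illustrative assumptions only.
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def add_ocr_noise(text, rate, rng):
    # Replace a fraction of letters with random ones: a crude stand-in
    # for OCR character errors.
    return "".join(
        rng.choice("abcdefghijklmnopqrstuvwxyz")
        if c.isalpha() and rng.random() < rate else c
        for c in text
    )

# Toy corpus standing in for OCRed newspaper text.
docs = [
    "the parliament debated the new railway budget",
    "heavy storms flooded the harbour and the docks",
    "the railway company opened a station in the city",
    "fishermen reported damage to boats after the storm",
] * 25

rng = random.Random(42)
for rate in (0.0, 0.05, 0.15):
    noisy = [add_ocr_noise(d, rate, rng) for d in docs]
    vec = CountVectorizer()
    counts = vec.fit_transform(noisy)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
    vocab = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = ", ".join(vocab[i] for i in topic.argsort()[-4:][::-1])
        print(f"noise {rate:.2f} | topic {k}: {top}")
```

The project itself worked with real collections and quality estimates rather than synthetic noise; see the repository linked below for the actual code and data.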
The project was conducted in collaboration with Mirjam Cuper (KB) and Konstantin Todorov (UvA, CREATE).
External references:
- Blogpost 1 (Introduction): https://lab.kb.nl/about-us/blog/your-ocr-good-enough-comprehensive-assessment-impact-ocr-quality-downstream-tasks
- Blogpost 2 (Results): https://lab.kb.nl/about-us/blog/your-ocr-good-enough-probably-so-results-assessment-impact-ocr-quality-downstream
- Final webinar: https://www.youtube.com/watch?v=i2YVRK-o4SM
- Code and data: https://github.com/Giovanni1085/KB_OCR_impact