Language Technology Programming Competition 2017
2017 Shared Task Description: Correcting OCR Errors
Basic Task Description
Many digital documents are the result of scanning printed copies. These documents, although in digital form, are in fact images, and as such, standard natural language processing techniques such as text search cannot be applied to them.
The National Library of Australia maintains an archive of scanned Australian publications in the Trove database. Many of these scans have been processed through Optical Character Recognition (OCR) and form a searchable resource with over 500 million items. However, the OCR output may contain errors that need to be corrected. Trove has corrected the errors through collaborative editing of the OCR output, a process described in the Trove help centre.
The goal of this task is to automatically correct the OCR errors in a subset of scans from the Trove database. We have downloaded over 7,000 documents and obtained the original output of the OCR system, together with the corrected versions. Note that the corrected versions may still contain errors. We provide 6,000 documents and their corrected versions as the training set. Your goal is to apply automated techniques to correct the OCR errors of a separate test set.
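As an illustration of how the paired training data might be used, the following is a minimal baseline sketch, not the organisers' method. It assumes each training item is available as a pair of strings (raw OCR text, corrected text); the actual file format is given on the competition website. The function names `learn_substitutions`, `build_table` and `correct`, and the `min_count` threshold, are hypothetical. The sketch aligns the two token sequences with Python's standard `difflib.SequenceMatcher`, counts one-to-one token replacements, and applies the most frequent ones to unseen text.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def learn_substitutions(pairs):
    """Count how each OCR token was rendered in the corrected text.

    `pairs` is an iterable of (ocr_text, corrected_text) strings
    (an assumed representation of the training set).
    """
    counts = defaultdict(Counter)
    for ocr_text, corrected_text in pairs:
        ocr_tokens = ocr_text.split()
        gold_tokens = corrected_text.split()
        matcher = SequenceMatcher(a=ocr_tokens, b=gold_tokens, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Record tokens left unchanged, so that a frequently
                # correct token is not rewritten later.
                for tok in ocr_tokens[i1:i2]:
                    counts[tok][tok] += 1
            elif tag == "replace" and (i2 - i1) == (j2 - j1):
                # Keep only replacements that pair tokens one-to-one.
                for wrong, right in zip(ocr_tokens[i1:i2], gold_tokens[j1:j2]):
                    counts[wrong][right] += 1
    return counts


def build_table(counts, min_count=2):
    """Keep, per OCR token, its most frequent non-identity correction."""
    table = {}
    for wrong, rights in counts.items():
        right, n = rights.most_common(1)[0]
        if right != wrong and n >= min_count:
            table[wrong] = right
    return table


def correct(text, table):
    """Apply the table token by token (whitespace is normalised to single spaces)."""
    return " ".join(table.get(tok, tok) for tok in text.split())


# Toy data for illustration only; not taken from Trove.
toy_pairs = [
    ("tbe cat sat on tbe mat", "the cat sat on the mat"),
    ("tbe dog", "the dog"),
]
toy_table = build_table(learn_substitutions(toy_pairs))
print(correct("tbe bird sat", toy_table))
```

A lookup table of this kind only repairs errors seen in training and ignores context, so it is best read as a starting point against which more capable systems can be compared.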
We will use Kaggle in Class to evaluate the systems.
Data Files and Submission
We will use Kaggle in Class for this year's competition (look for the ALTA 2017 Challenge). The data files and submission instructions will be provided on the competition website.
In order to access the Kaggle in Class pages, you need to register with this shared task.