ALTA 2017 Shared Task Description

Language Technology Programming Competition 2017

2017 Shared Task Description: Correcting OCR Errors

Basic Task Description

Many digital documents are the result of scanning printed copies. Although in digital form, these documents are in fact images, so standard natural language processing techniques such as text search cannot be applied to them directly.

The National Library of Australia maintains an archive of scanned Australian publications in the Trove database. Many of these scans have been processed through Optical Character Recognition (OCR) and form a searchable resource with over 500 million items. However, the OCR output may contain errors that need to be corrected. Trove corrects these errors through collaborative editing of the OCR output, a process described in the Trove help centre.

The goal of this task is to automatically correct OCR errors in a subset of scans from the Trove database. We have downloaded over 7,000 documents and obtained the original output of the OCR system, together with the corrected versions. Note that the corrected versions may still contain errors. We provide 6,000 documents and their corrected versions as the training set. Your goal is to apply automated techniques to correct the OCR errors in a separate test set.
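
To make the setup concrete, here is a minimal sketch of one possible baseline. It is not a method prescribed by the task. It assumes the training data can be loaded as pairs of (OCR text, corrected text) strings; the actual file format is given on the competition website, and the function names `learn_substitutions` and `apply_substitutions` are hypothetical. The sketch aligns the tokens of each pair with Python's standard `difflib.SequenceMatcher`, counts one-to-one token replacements, and rewrites a token only when a different token is its most frequent counterpart in the training pairs.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def learn_substitutions(pairs, min_count=2):
    """Build a token replacement table from (ocr_text, corrected_text) pairs.

    Assumption: each pair holds the raw OCR output and its corrected version
    as plain strings. min_count is an illustrative threshold, not a tuned value.
    """
    counts = defaultdict(Counter)
    for ocr_text, corrected_text in pairs:
        ocr_tokens = ocr_text.split()
        fixed_tokens = corrected_text.split()
        matcher = SequenceMatcher(a=ocr_tokens, b=fixed_tokens, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Use only spans that map one-to-one; unchanged tokens are counted
            # too, so a token that is usually correct is left alone.
            if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
                for src, dst in zip(ocr_tokens[i1:i2], fixed_tokens[j1:j2]):
                    counts[src][dst] += 1

    table = {}
    for src, targets in counts.items():
        dst, n = targets.most_common(1)[0]
        if dst != src and n >= min_count:
            table[src] = dst
    return table


def apply_substitutions(text, table):
    """Replace each whitespace-separated token that has a learned correction."""
    return " ".join(table.get(token, token) for token in text.split())


if __name__ == "__main__":
    # Toy pairs invented for illustration; they are not taken from Trove.
    training_pairs = [
        ("tbe cat sat", "the cat sat"),
        ("tbe dog ran", "the dog ran"),
    ]
    table = learn_substitutions(training_pairs)
    print(apply_substitutions("tbe bird flew", table))  # prints "the bird flew"
```

A lookup table like this cannot fix errors it has never seen, and it ignores split or merged words. It also inherits any errors left in the corrected versions, so it is best treated as a starting point.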

Evaluation

We will use Kaggle in Class to evaluate the systems.

Data Files and Submission

We will use Kaggle in Class for this year's competition (look for the ALTA 2017 Challenge). The data files and submission instructions will be provided on the competition website.

To access the Kaggle in Class pages, you need to register for this shared task.

Important Dates

Release of training data	On registration
Deadline for submission of results over test data	20 Oct 2017
Notification of results	24 Oct 2017
Deadline for submission of system description	3 Nov 2017
Presentation of results at ALTA workshop, Queensland University of Technology	8 Dec 2017