Language Technology Programming Competition 2013
2013 Shared Task Description
Basic Task Description
July 10, 2013 | Version 1
The goal of this task is to recover missing information about word casing and punctuation in English text. Such low-quality text can result from automated speech transcription or optical character recognition (OCR), or it can be text written in a hurry, such as quick notes in minutes, instant messages, or web forum posts.
To make this task easier, we have simplified it from a more ambitious task whose goal is to recover all the capitalisation and punctuation marks. For example, given the following text:
We would hope to restore it to:
In this task we only ask you to predict whether the word in its original form has any uppercase characters, and whether the word is followed by one of these punctuation marks:
You do not need to determine what particular characters of the word are in uppercase, or what punctuation mark follows the word.
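As a rough illustration of these two binary labels, the sketch below derives them from a single token of the original text. The function name `label_token` is hypothetical, and the set of punctuation marks is passed in by the caller because the official list is not reproduced here; the set used in the usage lines is an illustrative assumption only.

```python
def label_token(token, punctuation_marks):
    """Return (has_uppercase, followed_by_punct) for one whitespace-delimited token.

    `punctuation_marks` is supplied by the caller; this sketch does not
    define the official list of marks used in the task.
    """
    # Strip trailing punctuation to isolate the word itself.
    word = token.rstrip("".join(punctuation_marks))
    followed_by_punct = len(word) < len(token)
    # Any uppercase character counts; its position is irrelevant.
    has_uppercase = any(ch.isupper() for ch in word)
    return has_uppercase, followed_by_punct


# Illustrative set only (an assumption, not the official list).
EXAMPLE_MARKS = {",", "."}
print(label_token("Archives.", EXAMPLE_MARKS))  # (True, True)
print(label_token("telecast", EXAMPLE_MARKS))   # (False, False)
```

Note that the labels carry no information about which characters are uppercase or which mark follows the word, matching the simplification described above.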
You will be given a file that lists a word per line like this:
ID WORD
255 stored
256 at
257 the
258 ucla
259 television
260 archives
261 the
262 archived
263 episodes
264 were
265 telecast
266 march
267 8
268 16
269 and
270 24
271 1971
272 april
273 1
274 and
The first line contains header information that you can ignore. Each of the following lines contains a word ID and the actual word.
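A minimal sketch of reading such a file follows. It assumes the ID and the word are separated by whitespace and that the first line is the header; the function name `read_words` is hypothetical, and the actual data files may use a different delimiter.

```python
import io


def read_words(lines):
    """Parse 'ID WORD' lines into a list of (id, word) pairs.

    Assumes whitespace-separated columns and a header on the first line.
    """
    rows = []
    for i, line in enumerate(lines):
        if i == 0:
            continue  # the header line can be ignored
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        rows.append((int(parts[0]), parts[1]))
    return rows


# An in-memory stand-in for the input file, using rows from the example.
sample = io.StringIO("ID WORD\n255 stored\n256 at\n257 the\n258 ucla\n")
print(read_words(sample))
# [(255, 'stored'), (256, 'at'), (257, 'the'), (258, 'ucla')]
```

The same function accepts an open file handle, since it only iterates over lines.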
You will need to produce a file that lists the IDs of all words that have at least one capitalised character and the IDs of all words that are followed by a punctuation mark. The correct submission for the above example is:
Id,documents
Case,258 259 260 261 266 272
Punct,260 265 267 268 270 271
This submission says that the word with ID 258 has at least one character in uppercase, that word 260 has at least one uppercase character and is also followed by a punctuation mark, and so on.
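The submission format can be sketched as follows, assuming the file consists of a header row, a Case row, and a Punct row, each on its own line, with IDs separated by single spaces. The function name `format_submission` is hypothetical; the competition website remains the authority on the exact format.

```python
def format_submission(case_ids, punct_ids):
    """Build the submission text from two collections of word IDs.

    Assumes three lines: a header, then the Case row, then the Punct row.
    """
    lines = [
        "Id,documents",
        "Case," + " ".join(str(i) for i in sorted(case_ids)),
        "Punct," + " ".join(str(i) for i in sorted(punct_ids)),
    ]
    return "\n".join(lines) + "\n"


# IDs taken from the example submission above.
case_ids = {258, 259, 260, 261, 266, 272}
punct_ids = {260, 265, 267, 268, 270, 271}
print(format_submission(case_ids, punct_ids), end="")
```

Sorting the IDs is a design choice for readability; the task description does not say whether order matters.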
Data Files and Submission
We will use Kaggle in Class for this year's competition (look for the ALTA 2013 Challenge). The data files and submission instructions will be provided on the competition website.