Cognate words are word pairs that are similar in spelling, pronunciation, and meaning across two languages. For instance, the English word chair (university position) and the French word chaire are cognates, since both refer to the same role and are spelled and pronounced similarly. The difficulty arises with false cognate pairs: pairs of words that have similar spelling or pronunciation but different meanings. For instance, the French word pain means bread, which is not what an English speaker might expect.
Cognate pairs are useful when learning a second language since they are easy to understand: a learner can use knowledge from their first language to understand unseen words in the second language. Being able to detect cognate pairs in texts therefore has exciting applications in computer-assisted second language learning.
Your task is to identify, in English texts, all the cognates from the perspective of the French language. In other words, identify all the words in the English text that would acceptably translate into a similar word in French. The winner will receive prize money of at least AUD 350. This amount is guaranteed, and it may be higher depending on additional sponsorship.
To register, send an email to Diego Molla (firstname.lastname@example.org) with the following information:
We will send you a confirmation message and the training data for this year's task.
There will be two competition categories:
The data is divided into a training set and a test set, each comprising 30 documents, 5 in each of the following genres: novel, subtitles, sports news, political news, technology news, and cooking recipes. While the boundaries between documents will be included in both the training and test data, the genre of each document will only be released after results are submitted. This allows genre to be taken into account in the detailed results analysis, but not to be used as input on the test data.
Data is divided into document text and annotation files. Document text files are formatted with one word (with punctuation attached, if present) per line and each line starts with the line number followed by a space. Document boundaries are indicated by a document id marker.
1 <docid 1>
2 Chewy
3 little
4 drops
5 of
6 chocolate
7 cookies,
8 covered
9 with
10 peanuts
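The document text format described above can be read with a short parser. The following is a minimal sketch in Python, assuming that each line holds an index and a single token separated by one space, and that document markers look like `<docid N>` as in the example. The function name `parse_document_text` is hypothetical and not part of any official task tooling.

```python
import re

# Assumed marker shape, taken from the example line "1 <docid 1>".
DOCID_RE = re.compile(r"<docid\s+(\d+)>")


def parse_document_text(lines):
    """Group tokens by document.

    Returns a dict mapping document id -> {line index: token}.
    Tokens keep any attached punctuation, as in the source files.
    """
    docs = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        # Split only on the first space: the index, then the token.
        index_str, _, token = line.partition(" ")
        marker = DOCID_RE.fullmatch(token.strip())
        if marker:
            # A marker line opens a new document.
            current = int(marker.group(1))
            docs[current] = {}
            continue
        if current is None:
            raise ValueError("token found before any document marker")
        docs[current][int(index_str)] = token
    return docs
```

Passing the ten example lines to `parse_document_text` would yield a single document with id 1 whose index 6 maps to the token at that position, so that indices from the annotation file can be looked up directly.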
Annotation files are in .csv format. Each line comprises a document number in the first column, and a space-delimited list of cognate term indices in the second column. Each document has two lines in this annotation file: the first line lists sure cognates only, while the second line lists both borderline and sure cognates.
For instance, to indicate that 'chocolate' (index 6) and 'cookies' (index 7) are cognates of French words, the annotation file will include the entry:
Eval_id,Cognates_id
1, 6 7
Prior work has been done on this task, and is reported at: Wang & Sitbon (2014). Multilingual Lexical Resources to Detect Cognates in Non-aligned Texts. Proceedings of the Australasian Language Technology Association Workshop 2014, Melbourne, Australia, pp. 14-22. Please feel free to use this description as a starting point for your submission.
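Returning to the annotation file format, the sketch below shows one way it might be loaded in Python. It assumes a header row of `Eval_id,Cognates_id`, and that the first row seen for a document is the sure-only list while the second is the sure-plus-borderline list, as described above. The function name `parse_annotations` and the two returned dictionaries are hypothetical choices for illustration.

```python
import csv
import io


def parse_annotations(text):
    """Parse annotation CSV content.

    Returns (sure, lenient): two dicts mapping document id -> set of
    token indices. `sure` comes from the first row for a document,
    `lenient` (sure plus borderline) from the second row.
    """
    sure, lenient = {}, {}
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].strip() == "Eval_id":
            continue  # skip blank lines and the header row
        doc_id = int(row[0])
        # The second column is a space-delimited list of indices;
        # it may be empty if a document has no cognates.
        cell = row[1] if len(row) > 1 else ""
        indices = {int(i) for i in cell.split()}
        if doc_id not in sure:
            sure[doc_id] = indices
        else:
            lenient[doc_id] = indices
    return sure, lenient
```

Keeping the two lists separate makes it straightforward to work against either the strict or the lenient annotation without re-reading the file.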
Preparation of the 2015 task:
Laurianne Sitbon (QUT), Haoxing Wang (QUT).
Shared Task Coordinator and primary contact:
Diego Molla-Aliod (Macquarie University).
Please contact Diego if you are interested in sponsoring this event.