ALTA 2015

Identifying French Cognates in English Text

Cognates are word pairs that are similar in spelling, pronunciation, and meaning between two languages. For instance, the English word chair (university position) and the French word chaire are cognates since both refer to the same role, and are spelled and pronounced similarly. The difficulty arises when we consider false cognate pairs. These are pairs of words that have similar spelling or pronunciation but different meanings. For instance, the French word pain means bread, which is not what an English speaker might expect.
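As a rough illustration of why spelling alone is not enough, the sketch below scores two word pairs with Python's standard difflib.SequenceMatcher. The function name spelling_similarity is hypothetical, and this measure is an assumption chosen for illustration; it is not the method used by the task organisers or by the sample solution.

```python
from difflib import SequenceMatcher


def spelling_similarity(english_word, french_word):
    """Return a spelling similarity score in [0, 1] for two words.

    Hypothetical helper: it compares characters only and knows nothing
    about pronunciation or meaning.
    """
    return SequenceMatcher(None, english_word.lower(), french_word.lower()).ratio()


# A true cognate pair from the text above: the spellings are close.
true_pair = spelling_similarity("chair", "chaire")

# A false cognate pair: English "pain" and French "pain" (bread) are
# spelled identically, so the score is maximal despite different meanings.
false_pair = spelling_similarity("pain", "pain")

print(f"chair / chaire: {true_pair:.2f}")
print(f"pain / pain:    {false_pair:.2f}")
```

Both pairs score highly, so a system that relies on spelling similarity alone would accept the false cognate as well. Some source of meaning, such as a bilingual lexical resource, is needed to separate the two cases.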

Cognate pairs are useful when learning second languages since they are easy to understand: a learner can use knowledge from their first language to understand unseen words in the second language. Therefore, being able to detect cognate pairs in texts has exciting applications in computer-assisted second language learning.

Your task is to identify all the cognates in English texts from the perspective of the French language. In other words, identify all the words in the English text that would acceptably translate into a similar word in French. The winner will receive prize money of at least AUD 350. This amount is guaranteed, and it may be higher depending on additional sponsorship.

Participating

We will use Kaggle in Class for this year's competition (look for the ALTA 2015 Challenge). The data files and submission instructions will be provided on the competition website.

To register, send an email to Diego Molla (shared.task@alta.asn.au) with the following information:

Name of your team;
Name and current studies, including the current year, degree name, and university (e.g. "third year of the Bachelor of IT at Macquarie University") of each team member.

We will send you a confirmation message and the training data for this year's task.

There will be two competition categories:

Student category: all members of the team must be university students. The team cannot have members who are employed full-time or who have completed a PhD.
Open category: any other team.

Rules

The winner will be the team that obtains the best results. The winner must outperform the results of a sample solution that will be provided. There are no limitations on the size of the teams or the means that they can use to solve the problem, as long as the processing is fully automatic: there should be no human intervention. As soon as you join the competition you will have access to a sample of data that you can use to develop your system. To qualify for the prize you need to submit your results on the test dataset by 21st October and a poster that describes the methods that you used to obtain the results by 7th November. Selected posters will be displayed at ALTA 2015.

Data

The data is divided into a training set and a test set, each comprising 30 documents, 5 in each of the following genres: novel, subtitles, sports news, political news, technology news, and cooking recipes. The separations between documents will be included in both the training and test data, but the genre of each document will only be released after the results have been submitted. This allows genre to be taken into account in the detailed analysis of results, but not to be used as input when processing the test data.

Data is divided into document text and annotation files. Document text files are formatted with one word (with punctuation attached, if present) per line and each line starts with the line number followed by a space. Document boundaries are indicated by a document id marker.

1 <docid 1>
2 Chewy
3 little
4 drops
5 of
6 chocolate
7 cookies,
8 covered
9 with
10 peanuts
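The format above can be read with a few lines of Python. The following is a minimal sketch, assuming that the document id marker always has the form <docid N> and that each line holds exactly one number, one space, and one token; the function name parse_documents is hypothetical and is not part of any tool provided with the task.

```python
import re

# Assumed shape of the document id marker, based on the sample above.
DOCID_RE = re.compile(r"^<docid\s+(\d+)>$")


def parse_documents(lines):
    """Group (line_number, word) pairs by document id.

    Punctuation stays attached to the word, exactly as in the file.
    """
    docs = {}
    current = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        # Split at the first space: line number on the left, token on the right.
        number, _, token = raw.partition(" ")
        marker = DOCID_RE.match(token)
        if marker:
            current = int(marker.group(1))
            docs[current] = []
        elif current is not None:
            docs[current].append((int(number), token))
    return docs


sample = """1 <docid 1>
2 Chewy
3 little
4 drops
5 of
6 chocolate
7 cookies,
8 covered
9 with
10 peanuts""".splitlines()

docs = parse_documents(sample)
for line_number, word in docs[1]:
    print(line_number, word)
```

The line numbers are kept alongside the words because the annotation files refer to words by these indices.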

Annotation files are in .csv format. Each line comprises a document number in the first column, and a space-delimited list of cognate term indices in the second column. Each document has two lines in this annotation file: the first line lists sure cognates only, while the second line lists both borderline and sure cognates.

For instance, to indicate that 'chocolate' (index 6) and 'cookies' (index 7) are cognates of French words, the annotation file will include the entry:

Eval_id,Cognates_id
1, 6 7
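An annotation file of this shape can be read and written with Python's standard csv module. This is a minimal sketch under stated assumptions: the first row is the header shown above, the first column is an integer, and whitespace around the indices is not significant. The function names read_annotations and write_annotations are hypothetical, and whether a submission must follow exactly this layout is an assumption to be checked against the instructions on the competition website.

```python
import csv
import io


def read_annotations(handle):
    """Return a list of (eval_id, [cognate indices]) tuples, one per data row."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    rows = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        indices = [int(i) for i in row[1].split()]
        rows.append((int(row[0]), indices))
    return rows


def write_annotations(handle, rows):
    """Write rows in the same two-column layout, with the same header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Eval_id", "Cognates_id"])
    for eval_id, indices in rows:
        writer.writerow([eval_id, " ".join(str(i) for i in indices)])


# Read the example entry shown above.
rows = read_annotations(io.StringIO("Eval_id,Cognates_id\n1, 6 7\n"))
print(rows)

# Write it back out in the same layout.
out = io.StringIO()
write_annotations(out, rows)
print(out.getvalue())
```

The reader keeps one tuple per row rather than one per document, since each document contributes two rows (sure cognates, then sure plus borderline cognates).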

Prior work on this task is reported in: Wang & Sitbon (2014). Multilingual Lexical Resources to Detect Cognates in Non-aligned Texts. Proceedings of the Australasian Language Technology Association Workshop 2014, Melbourne, Australia, pp. 14-22. Please feel free to use this description as a starting point for your submission.

Organisers

Preparation of the 2015 task:
Laurianne Sitbon (QUT), Haoxing Wang (QUT).

Shared Task Coordinator and primary contact:
Diego Molla-Aliod (Macquarie University).

Please contact Diego if you are interested in sponsoring this event.

13th annual workshop of
The Australasian Language Technology Association

University of Western Sydney, Parramatta
