Language Technology Programming Competition 2014
2014 Shared Task Description
Basic Task Description
The goal of this task is to identify all mentions of locations in the text of tweets. A location is any specific mention of a country, city, suburb, street, or POI (Point of Interest). A POI can be the name of a shopping centre, such as "Macquarie Centre", or the name of a hospital, e.g., "Ryde Hospital". This is an information extraction task that is important for applications that want to find out where people are or which locations they are talking about. The task only requires you to identify which words in the text of a tweet refer to a location; it does not expect you to find the location on a map. For example, in the tweets:
Locations can be in the text itself, or in hashtags (e.g., #australia), URLs, or sometimes even in mentions (e.g., @australia). As location mentions can span several words, all of these words must be identified; however, partial identification of location names will be rewarded too. For example, if your system identifies only "Ukraine" from "eastern Ukraine", it will be half correct.
You will be given a list of tweet-ids and a script to help you download the tweets from Twitter. If a tweet has been deleted by its author, it will not be retrieved. Your system should find the location mentions and list them all in lowercase, as blank-separated words, next to their tweet-id. For example,
Input
id: 493450763931512832
author: BBCBreaking
tweet text: France and Germany join the US and UK in advising their nationals in Libya to leave immediately http://bbc.in/1rVmrDJ
and your output should be:
493450763931512832,france germany us uk libya
All punctuation in the word containing the location must be removed, including the hash symbol (#).
If a location is repeated in a tweet, you need to number its occurrences from the second one onwards. For example, if there are three mentions of Australia, then you will have
australia australia2 australia3
If a location has multiple words, separate them with blank space so that, in effect, it does not matter whether it is one location expression with two words or two different location expressions. Thus, if a tweet with ID "1234" has two location expressions "London" and "United States" the following are valid and equivalent descriptions:
1234,london united states
1234,united london states
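The formatting rules above (lowercase, no punctuation, numbering of repeats, blank-separated words) can be sketched in Python as follows. This is only an illustration, not an official script: the function names are made up for this example, punctuation is taken to mean ASCII punctuation, and numbering is assumed to apply to each word, so a word shared by two different location names in the same tweet would also be numbered. Tweets with no location mention are not handled here.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character, including '#'.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalise_word(word):
    """Lowercase a word and remove all ASCII punctuation from it."""
    return word.lower().translate(_STRIP_PUNCT)


def submission_words(mentions):
    """Turn location mentions (in tweet order) into numbered, lowercase words.

    Assumption: repeats are counted per word, starting at 2 for the
    second occurrence (australia, australia2, australia3, ...).
    """
    seen = Counter()
    words = []
    for mention in mentions:
        for raw in mention.split():
            word = normalise_word(raw)
            if not word:  # the token was punctuation only
                continue
            seen[word] += 1
            words.append(word if seen[word] == 1 else f"{word}{seen[word]}")
    return words


def submission_line(tweet_id, mentions):
    """Build one 'tweet-id,word word ...' line for a tweet with mentions."""
    return f"{tweet_id},{' '.join(submission_words(mentions))}"


print(submission_line("493450763931512832",
                      ["France", "Germany", "US", "UK", "Libya"]))
# 493450763931512832,france germany us uk libya
print(submission_words(["Australia", "#australia", "Australia"]))
# ['australia', 'australia2', 'australia3']
```

Because multi-word mentions are split into individual words, "London" and "United States" simply become the three words london, united and states, which is why their order in the output line does not matter.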
If a tweet does not have any location mention, then use the marker
We will use Kaggle in Class to evaluate the systems using F-measure on the word level.
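To get a feel for word-level scoring before submitting, a local estimate can be computed along the following lines. This is a sketch under assumptions: the exact F-measure variant used by Kaggle in Class (for example, how scores are averaged over tweets) is not specified here, so the function below only computes precision, recall and F1 for the words of a single tweet, and the name `word_f1` is made up for this example.

```python
def word_f1(gold_words, predicted_words):
    """Return (precision, recall, F1) over the words of one tweet.

    Words are compared as sets, so their order in the line is irrelevant.
    """
    gold, pred = set(gold_words), set(predicted_words)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Partial match: only "ukraine" is found out of "eastern ukraine".
precision, recall, f1 = word_f1(["eastern", "ukraine"], ["ukraine"])
print(precision, recall, round(f1, 3))
```

In this partial-match case the single predicted word is correct (precision 1.0) but only half of the gold words are found (recall 0.5), which is one way to read "half correct".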
Data Files and Submission
We will use Kaggle in Class for this year's competition (look for the ALTA 2014 Challenge). The data files and submission instructions will be provided on the competition website.
There is a training set and a test set. The training set contains 2000 tweets sorted by time, together with their location mentions. The format of this file is exactly the same as the format of the submission file. The test set contains just over 1000 tweets, also sorted by time, but without the location mentions. Your task is to find the location mentions in this test set and submit the results to Kaggle in Class. The timestamps of the tweets in the test set are later than those of the training set, to model a realistic scenario where we train on known tweets and want to predict the location mentions in future tweets.
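Since the test tweets come after the training tweets in time, one reasonable way to evaluate a system locally is to hold out the most recent part of the training set rather than a random sample. The sketch below assumes that the rows are already in time order and that each line has the form "tweet-id,word word ..."; whether the actual file has a header row is not stated here, and the function names and the 20% hold-out size are choices made for this example only.

```python
def parse_line(line):
    """Split one 'tweet-id,word word ...' line into (tweet_id, [words])."""
    tweet_id, _, words = line.strip().partition(",")
    return tweet_id, words.split()


def temporal_split(rows, dev_fraction=0.2):
    """Keep the earliest rows for training and the latest ones for development.

    Assumes `rows` is already sorted by time, oldest first.
    """
    dev_size = int(len(rows) * dev_fraction)
    cut = len(rows) - dev_size
    return rows[:cut], rows[cut:]


lines = [
    "1001,france germany",
    "1002,libya",
    "1003,london united states",
    "1004,australia australia2",
    "1005,ukraine",
]
rows = [parse_line(line) for line in lines]
train_rows, dev_rows = temporal_split(rows)
print(len(train_rows), len(dev_rows))
# 4 1
```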