Proceedings of ALTSS/ALTW, Melbourne, December 2003
Annotated corpora have been a critical component of research in the speech and language sciences for some years. Today, these corpora are being created and deployed for a rapidly expanding set of languages, disciplines and technologies. A wealth of formats and tools has sprung up around this enterprise, many of which are documented on the Linguistic Annotation page [http://www.ldc.upenn.edu/annotation/]. The term "linguistic annotation" covers any descriptive or analytic notation applied to raw language data. The basic data may be in the form of time functions (audio, video and/or physiological recordings) or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, "named entity" identification, co-reference annotation, and so on. This lecture will present a model of linguistic annotation which provides a simple framework for representing and manipulating complex, heterogeneous, multi-layered annotations. The model uses "annotation graphs": directed acyclic graphs with labels on the edges and time offsets on the nodes. The lecture will cover the formalism, the software infrastructure, and practical applications.
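As a rough illustration of the data structure described above, an annotation graph might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the software infrastructure covered in the lecture: the names `Arc`, `AnnotationGraph` and `build_example`, the `layer` field used to separate annotation tiers, and the offsets (taken here to be seconds) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Arc:
    """A labelled edge between two nodes (all field names are hypothetical)."""
    src: str
    dst: str
    layer: str   # assumed: which annotation tier the edge belongs to
    label: str   # the annotation itself, e.g. a word or a phrase category


@dataclass
class AnnotationGraph:
    # Maps node id -> time offset into the signal (assumed to be seconds).
    offsets: Dict[str, Optional[float]] = field(default_factory=dict)
    arcs: List[Arc] = field(default_factory=list)

    def add_node(self, node_id: str, offset: Optional[float] = None) -> None:
        self.offsets[node_id] = offset

    def add_arc(self, src: str, dst: str, layer: str, label: str) -> None:
        if src not in self.offsets or dst not in self.offsets:
            raise KeyError("both endpoints must be added as nodes first")
        self.arcs.append(Arc(src, dst, layer, label))

    def labels(self, layer: str) -> List[str]:
        """Return the labels on one layer, in insertion order."""
        return [a.label for a in self.arcs if a.layer == layer]

    def is_acyclic(self) -> bool:
        """Check the 'directed acyclic' property with Kahn's algorithm."""
        indegree = {n: 0 for n in self.offsets}
        for a in self.arcs:
            indegree[a.dst] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        visited = 0
        while ready:
            n = ready.pop()
            visited += 1
            for a in self.arcs:
                if a.src == n:
                    indegree[a.dst] -= 1
                    if indegree[a.dst] == 0:
                        ready.append(a.dst)
        # Every node is visited only if no cycle blocks the ordering.
        return visited == len(self.offsets)


def build_example() -> AnnotationGraph:
    """Two word edges and one phrase edge spanning the same stretch of signal."""
    g = AnnotationGraph()
    for node_id, offset in [("n0", 0.00), ("n1", 0.35), ("n2", 0.80)]:
        g.add_node(node_id, offset)
    g.add_arc("n0", "n1", "word", "the")
    g.add_arc("n1", "n2", "word", "cat")
    g.add_arc("n0", "n2", "phrase", "NP")
    return g
```

In this sketch the word layer and the phrase layer share nodes rather than nesting inside one another, which is one way heterogeneous, multi-layered annotations can coexist over the same recording.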
Steven Bird is Associate Professor of Computer Science and Software Engineering at the University of Melbourne, where he teaches human language technology and supervises several research students working in this area. His research focuses on formal and computational models for linguistic information, with application to human language technologies and to the description of the world's ~7,000 languages. Before coming to Melbourne University he did doctoral and post-doctoral research at the University of Edinburgh (1987-94). From 1995 to 1997 he conducted linguistic fieldwork on the languages of western Cameroon, published a dictionary, and helped develop several new writing systems. From 1998 to 2002 he was associate director of the Linguistic Data Consortium at the University of Pennsylvania, where he led an R&D team working on open-source software for linguistic annotation. [http://www.cs.mu.oz.au/~sb/]