
Australasian Language Technology Summer School

Australasian Language Technology Workshop

8-12 December 2003, University of Melbourne

Department of Computer Science and Software Engineering

Summer School Program

 

INTRODUCTORY COURSES
(four 1.5-hour lectures and lab sessions each)

I1: Practical NLP using Python: Trevor Cohn and Steven Bird, Melbourne
I2: Speech processing: David Grayden, Bionic Ear Institute, Melbourne
I3: Dialogue systems: Robert Dale, Macquarie and Dominique Estival, DSTO
I4: Information extraction and question answering: Diego Mollá, Macquarie

ADVANCED COURSES
(four 1.5-hour lectures and lab sessions each)

A1: Machine translation: Harold Somers, UMIST, UK
A2: Validation and evaluation in NLP and IR: David Powers, Flinders
A3: Statistical parsing: Mark Johnson, Brown University, USA
A4: SVMs and kernel methods in NLP: Jim Hogan, QUT

A series of free public lectures will be held during the same week:

www.cs.otago.ac.nz/research/ai/ALTW2003/prov-programme.pdf

Workshop programme available in pdf or html

Time                      Mon               Tue               Wed    Thu        Fri
8:30-10:00, 10:30-12:00   I1, A1            I1, A1            ALTW   I3, A3     I3, A3
1:30-2:30 (Lectures)      Dale (1),         Knott,            ALTW   Dale (2),  Paris, Wallis,
                          Baeza-Yates (1)   Baeza-Yates (2)          Baldwin    Cassidy/Bird
2:30-4:00, 4:30-6:00      I2, A2            I2, A2            ALTW   I4, A4     I4, A4

COURSE DETAILS

INTRODUCTORY COURSES

I1: Practical NLP using Python
Trevor Cohn and Steven Bird, Melbourne University

ABSTRACT:
The objective of this course is for students to understand the fundamentals of symbolic and statistical natural language processing, and to apply this understanding in writing small Python programs using the Natural Language Toolkit (nltk.sourceforge.net). Topics covered will include part-of-speech tagging, chunk parsing, parsing with context-free grammars, and annotated linguistic corpora.

BIO:
Steven Bird is Associate Professor of Computer Science and Software Engineering, and he teaches human language technology and supervises several research students working in this area. His research focuses on formal and computational models for linguistic information, with application to human language technologies and to the description of the world's ~7,000 languages. Before coming to Melbourne University he did doctoral and post-doctoral research at the University of Edinburgh (1987-94). From 1995-97 he conducted linguistic fieldwork on the languages of western Cameroon, published a dictionary, and helped develop several new writing systems. From 1998-2002 he was associate director of the Linguistic Data Consortium at the University of Pennsylvania, where he led an R&D team working on open-source software for linguistic annotation. http://www.cs.mu.oz.au/~sb/

Trevor Cohn is a PhD student in Computer Science and Software Engineering at Melbourne University. His research interests include word sense disambiguation and automatic text summarisation. After completing his undergraduate degree (BComm/BEng (Software; Hons)) and before commencing his candidature, he spent three years in industry as a Software Engineer, at both Ericsson Australia R&D labs and KESEM International. http://www.cs.mu.oz.au/~tacohn/index.php



I2: Speech processing:
David Grayden, Bionic Ear Institute, Melbourne

ABSTRACT
This course is an introduction to the speech signal and how it is processed by humans and by machines. We begin with the production of speech, the properties of the acoustic signal and how it is perceived by humans. Then we look at the methods of analysing the speech signal. Speech signal analysis and human perception are tied together by looking at speech coding, in particular perceptual coding of sound using MPEG-1 psychoacoustic models, such as MP3. We touch on data embedding and watermarking and then look at automatic speech recognition in some detail. Finally there is an introduction to speech synthesis and areas of ongoing speech processing research.

BIO
Dr David Grayden has been working as a Research Fellow at the Bionic Ear Institute in Melbourne since 1997. His main research involves examination of phoneme confusions made by people using cochlear implants, with a view to designing strategies that will improve perception by the users. He is currently developing and evaluating a number of advanced sound processing strategies. He is also involved in other research areas, including automatic speech recognition and speech enhancement using auditory models, auditory physiology, integration of auditory and visual input, and models of spike-timing dependent plasticity for adaptive learning of spatiotemporal patterns.
http://www.bionicear.org/people/graydend/



I3: Dialogue systems
Robert Dale, Macquarie University and Dominique Estival, DSTO

ABSTRACT
This practically-oriented course has two principal aims:

  • To provide an introduction to what is involved in building real spoken language dialog systems.
  • To give some practical experience in constructing dialog systems.

After a brief introduction to spoken language dialog systems (SLDSs) and the key elements involved in their development, students will use a dialog systems toolkit to build a simple dialog system. We will explore how the task of dialog design interacts with grammar and prompt writing, and look at how complex grammars can be developed. The course will end by looking at current standards such as VoiceXML and SALT, and discussing where dialog systems are headed in the future.

BIO
Professor Robert Dale is Director of the Centre for Language Technology at Macquarie University in Sydney, where he teaches on various aspects of language technology. After completing his PhD in Computational Linguistics at the University of Edinburgh in 1989, he taught in the Centre for Cognitive Science at Edinburgh, before taking up a position with Microsoft in Sydney in 1994. He was Director of the Microsoft Research Institute at Macquarie University (1996-1999). His research interests include intelligent text processing; natural language generation; spoken language dialog systems; and reference and anaphora. He is author or editor of five books and around 60 papers in various aspects of natural language processing, and is editor of the Journal of Computational Linguistics.
http://www.ics.mq.edu.au/~rdale/

Dominique Estival has been a Senior Research Scientist at DSTO since early 2002. After receiving her PhD in linguistics from the University of Pennsylvania in 1986, she started working as a computational linguist in industry: first in a machine translation company (Weidner, Chicago, USA; 1986-88) and then at Wang Laboratories (Boston, USA; 1988-89). She was a researcher at ISSCO (Geneva, Switzerland, 1989-1995) before coming to Australia to take up the position of lecturer in Computational Linguistics at the University of Melbourne (1995-1998). She joined Syrinx Speech Systems in 1999 to head the Natural Language Processing group and lead the NLP R&D project to develop a natural language telephone dialogue system. Her research interests have included computational modelling of language change, machine translation, grammar formalisms, grammar development and linguistic engineering, and spoken dialogue systems.
http://www.ics.mq.edu.au/~destival/



I4: Information extraction and question answering
Diego Mollá, Macquarie University

ABSTRACT:
In this course we will introduce two hot areas of Language Technology: information extraction and question answering. Both are key areas for tasks that require the recovery of specific information from text documents. Due to the current availability of increasingly large volumes of text stored in digital form (e.g. on the World Wide Web), an increasing number of organisations and companies are becoming interested in applications from these areas. Information Extraction (IE) systems populate databases with specific information extracted from text documents. IE systems typically operate in closed domains (e.g. news of terrorist attacks) and the type of information to be extracted is predetermined by the system administrator (e.g. identify the nature of the attack, the perpetrator, the time, the location, and the effect of the attack). In contrast, Question Answering (QA) systems return the answers to arbitrary questions asked in a human language by searching through the source documents. Now the type of information to be found is not predetermined and the source documents may belong either to closed domains (e.g. a computer manual) or to open domains (e.g. the World Wide Web). Both information extraction and question answering systems use an array of technologies that will be explored in this course. Topics covered include document retrieval, named-entity recognition, question classification, linguistic resources, and logical inference. These topics will be introduced and their application to information extraction and question answering will be explained.

BIO:
Diego Mollá is a lecturer in the Centre for Language Technology at Macquarie University in Sydney, Australia. His research focuses on bridging the gap between theoretical linguistics, especially semantics and logical forms, and practical natural language processing applications. His current projects center around AnswerFinder, a question-answering system. He received an MSc in speech and language processing and a PhD in the formal semantics of aspectual composition from the University of Edinburgh. He is currently secretary of the Australasian Language Technology Association. His teaching duties in Macquarie University's undergraduate Language Technology program include a 3rd-year unit in intelligent text processing and an Honours unit in question answering. http://www.ics.mq.edu.au/gen/person/diego.html



ADVANCED COURSES

A1: Machine Translation
Harold Somers, UMIST, UK

ABSTRACT
1. Introduction to MT
This first session will introduce the topic of MT by looking first at its history, then at some of the basic problems (and solutions), focusing on linguistic aspects of translation and the use of computers to address them. We will consider the use of fully automatic MT for assimilation purposes (translating into the user's language), compared to controlled language and/or computer-based aids for translators for dissemination (translating into a foreign language). Spoken-language MT will also be briefly mentioned.

2. Linguistic aspects of MT
In this session we will look more closely at the kinds of linguistic problems that MT has to face and will discuss ways in which MT programs work around these problems. We will distinguish monolingual problems (morphology, lexical ambiguity, syntactic ambiguity, pragmatic aspects) from bilingual problems of language contrast (lexical mismatches, structural divergence, typological differences).

3. Evaluation of MT
Evaluation of MT software is important to developers and users alike. In this session we will look at the many different features of MT that can be evaluated, and at suitable methods for conducting an evaluation.

4. Empirical approaches to MT
The latest research on MT follows the so-called "empiricist" approach, relying on large amounts of textual data from which linguistic "knowledge" is extracted and automatically used to produce translations on the basis of analogy. The two main variants of this approach (statistics-based and example-based MT) are explained and exemplified.

Recommended reading:
R. Dale, H. Moisl and H. Somers (eds) Handbook of natural language processing. New York (2000): Marcel Dekker. Chapters 13, 25.
D. Jurafsky and J. H. Martin. Speech and language processing. Upper Saddle River NJ (2000): Prentice Hall. Chapter 21.
R. Mitkov (ed) The Oxford handbook of computational linguistics. Oxford (2003): Oxford University Press. Chapters 27 and 28.
And if you're really hooked:
H. Somers (ed) Computers and translation: A translator's guide. Amsterdam (2003): Benjamins. Especially chapters 1-3,6,8,9,11,13-15.

BIO
Harold Somers is Professor of Language Engineering at UMIST (Manchester). With over 25 years' experience in the field both as a researcher and educator, he is editor of one of the field's premier journals (Machine Translation), and has written extensively on the subject. His latest publication "Computers and Translation" (John Benjamins, 2003) promises to become an influential and useful addition to the literature. http://www.ccl.umist.ac.uk/staff/harold/



A2: Validation and Evaluation in NLP and IR
David Powers, Flinders University

ABSTRACT
So you're developing a Natural Language system? You've developed a model and want to train it up and prove how good it is, but how? The development and training of a model can be undertaken in many ways, and may be theoretically driven or empirically derived. It may involve statistical learning or neural networks. It may use a supervised or an unsupervised paradigm. In all cases there are pitfalls in training and testing the system, and many approaches to validation and evaluation lead to invalid or misleading comparisons with other approaches.

The first step to setting up a model is to correctly sample the target corpus and provide the appropriate number of datasets for the chosen development paradigm.

The second step is to ensure that appropriate manipulations of the raw data are performed systematically to ensure reproducible results from the algorithms employed.

The third step is to ensure that the output distributions from the model match the probability distribution of the target corpus or application.

The fourth step is to ensure that appropriate evaluation techniques are used to determine how well your system is doing compared to chance.

This course will go through each of these stages and identify common mistakes and sneaky manipulations that lead to the publication of meaningless or misleading results.
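
As a small illustration of the fourth step, the sketch below compares raw accuracy with a chance-corrected score. Cohen's kappa is used here only as one familiar chance-corrected measure; the course description does not name a specific measure, and the function names and toy labels are hypothetical.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of items where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def cohen_kappa(gold, predicted):
    """Agreement corrected for the agreement expected by chance.

    Chance agreement is estimated from the label distributions of the
    gold standard and of the predictions (an assumption of this sketch).
    """
    n = len(gold)
    observed = accuracy(gold, predicted)
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    expected = sum(gold_counts[label] * pred_counts[label]
                   for label in gold_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one label on both sides
    return (observed - expected) / (1.0 - expected)


# Toy data: 90% of the gold labels are "O", so always guessing "O"
# looks accurate but carries no information.
gold = ["O"] * 90 + ["NAME"] * 10
always_o = ["O"] * 100

print(accuracy(gold, always_o))     # high raw accuracy
print(cohen_kappa(gold, always_o))  # chance-corrected score
```

On skewed data like this, a system that always outputs the majority label scores well on accuracy while its chance-corrected score shows that it is doing no better than guessing.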

BIO
David Powers has been working in the area of Machine Learning of Natural Language for over 25 years, and has published over 100 papers as well as a monograph and a number of proceedings in the area. Powers organized the first events in MLNL in 1991 and founded SIGNLL in 1993 and CoNLL in 1997.

Currently Powers is Head of the AI Lab at Flinders University and supervises a dozen projects relating to the learning of natural language and ontology, falling under two major research areas making use of a range of learning, analysis and data fusion techniques:

  • The robot baby and the intelligent room (commercialized by I2Net), including audiovisual speech/speaker recognition/location, spelling/grammar checking, transcription of Asian languages, brain/speech control of computers/devices.
  • Advanced web search and visualization (commercialized by YourAmigo), including search of the hidden web, syntactic and semantic tagging of web pages for more accurate search and ranking, and intuitive display of multidimensional data.

http://www.infoeng.flinders.edu.au/people/pages/powers_david/


A3: Probabilistic models and stochastic grammars
Mark Johnson, Brown University, USA

ABSTRACT
This course unites two different approaches to computational linguistics and natural language processing. On the one hand, there is considerable linguistic evidence that natural language possesses a rich hierarchical structure that is only indirectly reflected in the sequence of words or sounds that make up sentences. On the other hand, there are also reliable statistical regularities in the selection and ordering of words and phrases in natural languages. Stochastic grammars are capable of describing both aspects of natural languages, and are a key component of state-of-the-art technology in many areas of computational linguistics.

This course will cover the following topics:

  1. An introduction to language modeling and the noisy channel model. Applications of language modeling and parsing, including speech recognition, machine translation and information extraction and retrieval.
  2. Finite-state machines and hidden Markov models.
  3. Probabilistic context-free grammars (PCFGs). Estimation of PCFGs from visible and hidden data (maximum likelihood estimation, expectation maximization). Chart parsing and dynamic programming algorithms.
  4. Stochastic Unification-based Grammars and discriminative training.
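
To make the third topic more concrete, here is a minimal sketch of a probabilistic CKY (Viterbi) chart parser, one dynamic programming algorithm for PCFGs in Chomsky normal form. The toy grammar, its probabilities and the function name are invented for illustration and are not taken from the course materials.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form (hypothetical rules and probabilities).
# Probabilities of rules sharing a left-hand side sum to 1.
BINARY = {            # (left child, right child) -> [(parent, prob)]
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
    ("Det", "N"): [("NP", 0.6)],
}
LEXICAL = {           # word -> [(category, prob)]
    "dogs": [("NP", 0.4)],
    "chase": [("V", 1.0)],
    "the": [("Det", 1.0)],
    "cat": [("N", 1.0)],
}


def viterbi_cky(words, start="S"):
    """Return the probability of the most likely parse, or 0.0 if none."""
    n = len(words)
    # best[(i, j)][X] = best probability of category X spanning words[i:j]
    best = defaultdict(dict)
    for i, word in enumerate(words):
        for cat, prob in LEXICAL.get(word, []):
            best[(i, i + 1)][cat] = prob
    # Fill the chart bottom-up, from short spans to long ones.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the children
                for left, p_left in best[(i, k)].items():
                    for right, p_right in best[(k, j)].items():
                        for parent, p_rule in BINARY.get((left, right), []):
                            p = p_rule * p_left * p_right
                            if p > best[(i, j)].get(parent, 0.0):
                                best[(i, j)][parent] = p
    return best[(0, n)].get(start, 0.0)


print(viterbi_cky("dogs chase the cat".split()))
```

The probability of a parse is the product of the probabilities of the rules it uses; the chart stores only the best score for each category over each span, which is what keeps the search polynomial in sentence length.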

The course does not have any specific prerequisites, but mathematical and computer science background will be helpful. The ability to take derivatives and manipulate mathematical expressions at a first-year undergraduate level will enable students to follow the derivation of the formulas, and computer science experience in algorithms will enable students to understand, analyze and implement the various algorithms described in the course.

BIO
Mark Johnson is Professor of Cognitive & Linguistic Sciences and Computer Science at Brown University, and current president of the Association for Computational Linguistics. He has made significant contributions to research into computational processes involved in human language understanding, and is at the forefront of research in statistical natural language processing.
http://www.cog.brown.edu/~mj/



A4: SVMs and kernel methods in NLP
Jim Hogan, QUT

ABSTRACT
Support vector approaches have been around since the mid 1990s, initially as a binary classification technique, with later extensions to regression and multi-class classification. At their core is the idea of structural risk minimisation, a principled technique for selecting a model which minimises generalisation error. As a result of their success in controlling model capacity and of the availability of remarkably fast quadratic programming approaches to training, these techniques have been adopted widely and used across a variety of applications.

Within the SV framework, similarity between patterns is defined through the use of kernel functions, usually some kind of generalisation of the scalar product for real vectors. It is often possible to tailor kernel functions to a particular problem domain, with the use of string, syllable and tree-structure kernels particularly important in NLP. Moreover, for some classes of functions known as Mercer kernels, it is even possible to get the benefits of transforming to a higher dimensional feature space without ever leaving the original pattern space. This property is shared by the three most common approaches: the linear, polynomial and radial basis function kernels.
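
The following sketch writes out the three common kernels named above and checks, for one small case, that a kernel evaluated in the original pattern space equals a scalar product in a higher dimensional feature space. The function names, default parameters (`degree`, `c`, `gamma`) and sample vectors are illustrative assumptions, not part of the course materials.

```python
import math


def linear_kernel(x, y):
    """Ordinary scalar product of two real vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, c=1.0):
    """(x . y + c) ** degree."""
    return (linear_kernel(x, y) + c) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def quadratic_features(x):
    """Explicit feature map for the kernel (x . y) ** 2 on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 0.5)

# The kernel value computed in the original 2-D pattern space...
in_pattern_space = polynomial_kernel(x, y, degree=2, c=0.0)
# ...matches a scalar product in the explicit 3-D feature space.
in_feature_space = linear_kernel(quadratic_features(x), quadratic_features(y))
print(in_pattern_space, in_feature_space)
```

The two printed values agree up to floating-point rounding, although the first computation never constructs the three-dimensional feature vectors.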

This course begins with a detailed, but accessible, introduction to the theory of the SV approach, before considering in turn a variety of NLP applications and the kernels which underpin their success. These will include text mining, topic spotting, authorship attribution, tagging and specialised structural analysis in both NLP and bioinformatics. While much of our focus will be upon developments in specialised string kernels, we will also highlight the success of the 'vanilla' approaches, and the key role of scaling in ensuring adequate discrimination.

BIO
Jim Hogan is a senior lecturer in QUT's school of software engineering and data communications, where among other things he works on machine learning problems in bioinformatics (SVMs for location of regulatory regions), NLP (authorship and cohort analysis, spatial semantics) and vision (SVM face classification, Bayesian top-down visual attention).
http://sky.fit.qut.edu.au/~hogan/



Convenor
A/Prof Steven Bird
Dept of Computer Science
University of Melbourne

Local Arrangements
Cathy Bow
Dept of Computer Science
University of Melbourne

Conference Management
The University of Melbourne
Jen Westphal
Telephone: +61 (03) 8344 6107
Facsimile: +61 (03) 8344 6122
Email: westphal@unimelb.edu.au

Bronwen Hewitt
Telephone: +61 (03) 8344 6389
Facsimile: +61 (03) 8344 6122
Email: bhewitt@unimelb.edu.au
