Australasian Language Technology Workshop 2012
Invited talks at ALTA 2012
Chris Brockett (Microsoft Research)
Diverse Words, Shared Meanings: Statistical Machine Translation for Paraphrase, Grounding, and Intent
Can two different descriptions refer to the same event or action? Recognising that dissimilar strings are equivalent in meaning for some purpose is something that humans do rather well, but it is a task at which machines often fail. In the Natural Language Processing Group at Microsoft Research, we are attempting to address this challenge at sentence scale by generating semantically equivalent rewrites that can be used in applications ranging from authoring assistance to intent mapping for search or command and control. The Microsoft Translator paraphrase engine, developed in the NLP group, is a large-scale phrasal machine translation system that generates short sentential and phrasal paraphrases in English and has a public API that is available to researchers and developers. I will present the data extraction process, architecture, issues in generating diverse outputs, applications and possible future directions, and discuss the strengths and limitations of the statistical machine translation model as it relates to paraphrasing, how paraphrase is like machine translation, and how it differs in important respects. The statistical machine translation approach also has broad applications in capturing user intent in search, conversational understanding, and the grounding of language in objects and actions, all active areas of investigation in Microsoft Research.
Jen Hay (University of Canterbury)
Using a large annotated historical corpus to study word-specific effects in sound change
The Origins of New Zealand English Corpora (ONZE) at the University of Canterbury contain recordings spanning 150 years of New Zealand English. These have all been force-aligned at the phoneme level, and are stored with many layers of annotation, some of which have been automatically generated and some of which have been manually annotated. We interact with the corpus via our custom LaBB-CAT interface (LAnguage, Brain and Behaviour Corpus Analysis Tool). I will begin the talk by describing and demonstrating the corpus and its associated LaBB-CAT tool. I will then focus on one particular recent study which has used the corpus, and which aims to understand processes of sound change.
The combination of the time-depth of the ONZE collection and the degree of careful annotation it contains makes it an ideal data set for the study of mechanisms underlying sound change. In particular, we aim to address a question which has been the subject of long-standing debate in the sound-change literature: do sound changes proceed uniformly through the lexicon, or are there word-specific changes, with some words further ahead in the change than others? I describe a study which aimed to investigate this question by focusing on the mechanisms underpinning the New Zealand English front short vowel shift, which involves the vowels in words like bat, bet and bit. We automatically extracted formant values for over 100,000 tokens of words containing these vowels. We show that this data contains good evidence for word-specific effects in sound change, and argue that these are predicted by current models of speech production and perception, in combination with well-established psycholinguistic processes.