Speech Technology and Human Computer Interaction Workshop
March 27th, Informatics Forum, Edinburgh University, UK
09:00-09:20 Coffee
09:20-09:30 Opening Remarks (Steve, Matthew, Ben)
09:30-10:00 Per Ola Kristensson
Efficient and flexible
communication via natural user interfaces
10:00-10:30 Russell Beale
Speaking volumes: a design perspective
10:30-11:00 Coffee
11:00-11:30 Holly Branigan
Making the right impression -
The challenge of speech-based technology from a psycholinguistic perspective
11:30-12:00 Steve Renals (25 min with 5 min questions)
Spoken interaction technologies
12:00-12:30 Poster 'madness' session
12:30-13:00 Poster Session (manned)
13:00-14:00 Lunch
14:00-14:30 Poster Session (manned)
14:30-15:15 Interactivity Session (demos of technology being developed)
15:15-15:45 Coffee
15:45-17:30 Group Session (discussions/mapping of challenge areas)
19:00-23:00 Pub meeting in Holyrood 9A
Interlingual Map Task Corpus Collection
Presented by: Hayakawa Akira, Saturnino Luz, Nick Campbell
School of Computer Science and Statistics, School of Linguistic,
Speech and Communication Sciences, Trinity College Dublin
Although initially designed to support the investigation of linguistic
phenomena in spontaneous speech, the HCRC Map Task corpus has been the
focus of a variety of studies of communicative behaviour. The
simplicity of the task on the one hand, and the complexity of
phenomena it can elicit on the other, make the map task an ideal
object of study. Here we investigate a map task based on
speech-to-speech machine translated interaction where the instruction
giver and the instruction follower speak different languages.
Although there are numerous studies into task-oriented human
behaviour, this Interlingual Map Task is, to the best of our
knowledge, the first investigation of communicative behaviour in the
presence of three additional filters: Automatic Speech Recognition
(ASR), Machine Translation (MT) and Text To Speech (TTS) synthesis. We
are interested in understanding how such filters affect the
participants in terms of cognitive load, adaptation of communicative
acts to the technology, repair strategies, and other factors that
might have implications for the design of dialogue systems and, in
particular, speech-to-speech translation systems.
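As a rough sketch of this pipeline (the asr, translate and tts functions below are generic placeholders standing in for whatever engines are used, not the actual components of the corpus collection), each of the giver's utterances passes through the three filters in sequence before reaching the follower:

from typing import Callable

def make_relay(asr: Callable[[bytes], str],
               translate: Callable[[str], str],
               tts: Callable[[str], bytes]) -> Callable[[bytes], bytes]:
    # Compose the three filters into a single giver-to-follower channel.
    def relay(giver_audio: bytes) -> bytes:
        recognised = asr(giver_audio)       # filter 1: Automatic Speech Recognition
        translated = translate(recognised)  # filter 2: Machine Translation
        return tts(translated)              # filter 3: Text-To-Speech synthesis
    return relay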
Anthropomorphism and lexical alignment in human-computer
dialogue
Presented by: Benjamin Cowan
HCI Centre, University of Birmingham
Our interlocutors affect our linguistic behaviours in dialogue. A
common observation is that people tend to converge, or align,
linguistically in dialogue scenarios. This alignment is a key
component of natural and successful communication. Recent research
suggests that alignment at the lexical level can be influenced by our
judgments of the abilities of our interlocutors as effective
communication partners. The use of speech as an interaction modality
in mainstream computing is rising, yet little is known about the
influence interlocutor design may have on user judgements of partner
competence and therefore how design affects alignment behaviour in
these interactions.
The research presented uses a Wizard of Oz based referential
communication task to explore how interlocutor design in
human-computer dialogue in the form of voice anthropomorphism impacts
lexical alignment. The results show a strong lexical alignment effect
in spoken human-computer dialogue, yet there was no significant impact
of interlocutor ability judgment on alignment levels. This gives
support to the incorporation of lexical alignment in spoken dialogue
system user models as well as suggesting that lexical alignment in
human-computer dialogue may be influenced by priming rather than
considered interlocutor modelling.
Accounting for diverse opinions in the labelling of paralinguistic
databases
Presented by: Ailbhe Cullen
Electronic and Electrical Engineering, Trinity College Dublin
Emotion or affect recognition has become a key concern in the design
of more natural human-computer interfaces. As with any recognition
task, this necessitates large, well-labelled databases. The difficulty
is that unlike words or phones, which are clearly defined, emotions and
personality traits tend to be quite vague and subjective, and in the
absence of self-reported labels, it becomes a challenge to define
objective ground truth labels for a database.
Here we explore the problem of data selection for training. It is
generally accepted that the more examples of emotions in the training
data, the better the classifier performance will be. We show that in
fact ambiguous samples with poor inter-rater agreement, which are
common in databases of natural emotions, may be harmful in
training. By discarding samples with weak emotion intensity or poor
inter-rater agreement we achieve significant increases in performance
on two emotional speech databases. We then move on to explore a new
database, collected to examine charisma in political speech, and
assess the inter-rater agreement for the labelling of four speaker
traits - charisma, likeability, inspiration, and enthusiasm - with a
focus on frameworks for dealing with opposing but equally valid
labeller opinions.
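As a minimal sketch of this kind of data selection (illustrative only, not the authors' pipeline; the 0.7 agreement threshold is an arbitrary placeholder that would be tuned in practice), low-agreement samples can be filtered out before training:

from collections import Counter

def majority_agreement(rater_labels):
    # Fraction of raters who chose the most frequent label for one sample.
    counts = Counter(rater_labels)
    return counts.most_common(1)[0][1] / len(rater_labels)

def select_training_samples(dataset, min_agreement=0.7):
    # dataset: list of (features, [label from each rater]) pairs.
    # Keep only samples whose raters agree strongly enough, using the
    # majority label as the training target.
    selected = []
    for features, rater_labels in dataset:
        if majority_agreement(rater_labels) >= min_agreement:
            majority_label = Counter(rater_labels).most_common(1)[0][0]
            selected.append((features, majority_label))
    return selected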
Toward Conversational Speech Synthesis
Presented by: Rasmus Dall
Centre for Speech Technology Research, University of
Edinburgh
This poster explores how we may utilise spontaneous conversational
speech, and the phenomena it contains, to create speech synthesis
systems particularly suited to a conversational setting.
This includes the use of spontaneous speech as data, but also a style
transformation toward the spoken word. Such conversational systems
would be particularly well suited for conversational agents, robots
and other highly interactive speech devices.
Voice by Choice
Presented by: Alistair Edwards
Department of Computer Science, University of York
Despite great advances in speech synthesis technology, users of
augmentative and alternative communication (AAC) devices are usually
expected to use the same 1980s synthesizer. Voice by Choice is a
short film which highlights this problem by demonstrating the humour
that can arise when three people all share the same voice - at a speed
dating event. The film was one product of the Creative Speech
Technology research network (CreST).
Just Talking - Exploring Casual Conversation
Presented by: Emer Gilmartin
Speech Communication Lab, Trinity College Dublin
Spoken dialogue is situated, and its characteristics vary with the type
of interaction participants are engaged in. While types or genres of
written communication have been extensively described, categorization
of speech genres is not regarded as straightforward. Much attention
has been paid to ‘task-based’ or ‘transactional’ dialogue,
with dialogue system technology concentrating on this type of talk for
reasons of tractability. However, in real-life conversation there is
often no obvious short-term task to be accomplished through speech and
the dialogue is better described as ‘interactional’, aimed
towards building and maintaining social bonds. As an example of this
‘unmarked’ case of conversation, a tenant’s short chat about
the weather with the concierge of an apartment block is not intended
to transfer important meteorological data but rather to build a
relationship which may serve either of the participants in the
future. These ‘interactional’ dialogues, where the transfer of
verbal information is not the main goal, are not adequately modelled
in current frameworks.
Analysis of casual conversation at the syntactic, semantic, and
discourse levels has shown two major genres present – short
interactive sections (‘chat’) and longer, often narrative,
stretches (‘chunks’). Acoustic and prosodic analysis is
challenging as high levels of crosstalk in recordings of natural
multiparty interaction make automatic analysis difficult, particularly
around speaker changes. My current work involves detailed manual
annotation of natural multiparty conversations from the d64 and D-ANS
corpora, and the use of these annotations to investigate
• temporal (‘chronemic’) patterns in extensive (several-hour)
conversations, with particular focus on the distribution of intra- and
inter-speaker pauses and gaps in chat and chunk segments
• automaticity and the distribution of ‘disfluencies’ in
different genres of casual conversation
• the distribution and function of laughter in conversation
The study of different genres within casual or interactional social
talk, and of the contrasts between stretches of social and task-based
communication within the same interaction sessions, can help
discriminate which phenomena in one type of speech exchange system are
generalizable and which are situation or genre dependent. In addition
to extending our understanding of human communication, such knowledge
will be useful in human machine interaction design, particularly in
the field of ‘companion’ robots or relational agents, as the
very notion of a companion application entails understanding of social
spoken interaction.
Demo: The PARLANCE Mobile App for Interactive Search in English and
Mandarin
Presented by: Helen Hastie
Computer Science, Heriot-Watt University
We demonstrate a mobile application in English and Mandarin to test
and evaluate components of the PARLANCE
dialogue system for interactive search under real-world
conditions.
ALADIN: Voice interaction for people with disabilities.
Presented by: Jonathan Huyghe
Centre for User Experience Research (CUO), KU Leuven / iMinds
The ALADIN project aims to develop an assistive vocal interface for
people with physical impairments. Because people who experience
limited upper limb control also frequently suffer from speech
impairments, natural language speech recognition is often not an
option. Instead, ALADIN is a completely language-independent, learning
system, which is trained during its use to adapt to the individual
user.
In this demo, we show voice control within a virtual house, which
was developed for user tests to simulate use in home automation. We
also show the companion tablet application, which
supports users during the training period and acts as a fallback
interface. Those interested can also record their own voice commands
to use in the virtual house.
Speech Synthesis with Character
Presented by: Graham Leary
CereProc Ltd.
http://www.cereproc.com
Speech is the natural man-machine interface. With CereVoice any
software application can now talk with character. CereProc is an
enabling technology company, creating scalable voices that are both
characterful and easy to integrate and apply. This makes our voices
suitable for any application where speech output would actively
improve the system or assist those with impaired vision.
CereProc has developed a highly innovative speech
synthesis system. Our advanced voice engine, CereVoice, is equally
happy on embedded systems like smart phones and on traditional IT
platforms. CereProc's award winning breakthrough voice creation system
enables practically anyone to create a synthesised voice in a matter
of minutes. CereProc can easily turn the voice of your brand into the
voice of your application, website, or multimedia interface.
Quorate Speech Recognition and Analysis Suite
Presented by: Mike Lincoln, CTO Quorate
http://quoratetechnology.com
Quorate is a speech recognition and analysis suite which unlocks
information in recordings. In this demonstration we show how Quorate
can be used to extract information from a series of simulated Police
interview recordings. This allows the recordings to be browsed and
searched to identify common themes and locate relevant details, which
would otherwise be hidden.
17 ways to say yes
Presented by: Graham Pullin
DJCAD, University of Dundee
Exploring tone of voice in augmentative communication and designing
new interactions with speech synthesis.
SpeechCity: Conversational Interfaces for Urban
Environments
Presented by: Verena Rieser, Srini Janarthanam, Andy Taylor,
Yanchao Yu and Oliver Lemon
Interaction Lab, Heriot-Watt University
https://sites.google.com/site/speechcityapp/
We demonstrate a conversational interface that assists pedestrian
users in navigating and searching urban
environments. Locality-specific information is acquired from open
data sources, and can be accessed via intelligent interaction. We
therefore combine a variety of technologies, including Spoken
Dialogue Systems and geographical information systems (GIS), to
operate over a large spatial database.
In this demo, we present a system for tourist information within the
city of Edinburgh. We harvest points of interest from Wikipedia and
social networks, such as Foursquare, and we calculate walking
directions from Open Street Map (OSM). In contrast to existing
mobile applications, our Android agent is able to simultaneously
engage in multiple tasks, e.g. navigation and tourist information,
by using a multi-threaded dialogue manager.
To demonstrate the full functionality of the system, we simulate a
(user-specified) walking route, along which the system pushes relevant
information to the user.
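As a small illustration of the open-data harvesting idea (a sketch only, querying OpenStreetMap's public Overpass API rather than the Wikipedia and Foursquare sources used by SpeechCity), nearby points of interest can be retrieved like this:

import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def tourist_pois(lat, lon, radius_m=800):
    # Names of OpenStreetMap nodes tagged as tourist attractions near (lat, lon).
    query = (
        "[out:json];"
        f'node["tourism"="attraction"](around:{radius_m},{lat},{lon});'
        "out body;"
    )
    response = requests.post(OVERPASS_URL, data={"data": query}, timeout=30)
    response.raise_for_status()
    return [element["tags"]["name"]
            for element in response.json().get("elements", [])
            if "name" in element.get("tags", {})]

# Example: attractions within 800 m of the Informatics Forum, Edinburgh.
# print(tourist_pois(55.9445, -3.1875))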
Through the use of open data, the agent is easily portable and
extendable to new locations and domains. Future possible versions of
the systems include an Edinburgh Festival app, a tourist guide for
San Francisco and the Bay Area, and a conference system for the
SemDial'14 workshop (to be held at Heriot-Watt University in
September). Part of the demonstration is an initial market survey to
gather information about target markets and potential
customers.
Multilevel Auditory Displays for Mobile Eyes-Free Location-Based
Interaction
Presented by: Yolanda Vazquez-Alvarez
School of Computing Science, University of Glasgow
http://www.dcs.gla.ac.uk/~yolanda
We explored the use of multilevel auditory displays to enable
eyes-free mobile interaction with location-based information in a
conceptual art exhibition space. Multilevel auditory displays enable
user interaction with concentrated areas of information. In this study
we used a gallery-like space as an Audio Augmented Reality (AAR)
environment in which we tested a number of different multilevel
auditory displays. A deeper understanding of how novel auditory
displays can impact the user experience will allow designers to make
more informed decisions when designing eyes-free auditory
interfaces.