
Speech Technology and Human Computer Interaction Workshop

March 27th, Informatics Forum, Edinburgh University, UK


09:00-09:20 Coffee
09:20-09:30 Opening Remarks (Steve, Matthew, Ben)
09:30-10:00 Per Ola Kristensson
   Efficient and flexible communication via natural user interfaces
10:00-10:30 Russell Beale
   Speaking volumes: a design perspective
10:30-11:00 Coffee
11:00-11:30 Holly Branigan
   Making the right impression -
   The challenge of speech-based technology from a psycholinguistic perspective

11:30-12:00 Steve Renals (25 min with 5 min questions)
   Spoken interaction technologies
12:00-12:30 Poster 'madness' session
12:30-13:00 Poster Session (manned)
13:00-14:00 Lunch
14:00-14:30 Poster Session (manned)
14:30-15:15 Interactivity Session (demos of technology being developed)
15:15-15:45 Coffee
15:45-17:30 Group Session (discussions/mapping of challenge areas)
19:00-23:00 Pub meeting in Holyrood 9A


Poster and Demo Abstracts

Interlingual Map Task Corpus Collection
Presented by: Hayakawa Akira, Saturnino Luz, Nick Campbell
School of Computer Science and Statistics, School of Linguistic, Speech and Communication Sciences, Trinity College Dublin

Although initially designed to support the investigation of linguistic phenomena in spontaneous speech, the HCRC Map Task corpus has been the focus of a variety of studies of communicative behaviour. The simplicity of the task on the one hand, and the complexity of phenomena it can elicit on the other, make the map task an ideal object of study. Here we investigate a map task based on speech-to-speech machine-translated interaction, where the instruction giver and the instruction follower speak different languages. Although there are numerous studies into task-oriented human behaviour, this Interlingual Map Task is, to the best of our knowledge, the first investigation of communicative behaviour in the presence of three additional filters: Automatic Speech Recognition (ASR), Machine Translation (MT) and Text To Speech (TTS) synthesis. We are interested in understanding how such filters affect the participants in terms of cognitive load, adaptation of communicative acts to the technology, repair strategies, and other factors that might have implications for the design of dialogue systems and, in particular, speech-to-speech translation systems.

Anthropomorphism and lexical alignment in human-computer dialogue
Presented by: Benjamin Cowan
HCI Centre, University of Birmingham

Our interlocutors affect our linguistic behaviours in dialogue. A common observation is that people tend to converge, or align, linguistically in dialogue scenarios. This alignment is a key component of natural and successful communication. Recent research suggests that alignment at the lexical level can be influenced by our judgments of the abilities of our interlocutors as effective communication partners. The use of speech as an interaction modality in mainstream computing is rising, yet little is known about the influence interlocutor design may have on user judgments of partner competence and therefore how design affects alignment behaviour in these interactions. The research presented uses a Wizard of Oz-based referential communication task to explore how interlocutor design in human-computer dialogue, in the form of voice anthropomorphism, impacts lexical alignment. The results show a strong lexical alignment effect in spoken human-computer dialogue, yet there was no significant impact of interlocutor ability judgment on alignment levels. This gives support to the incorporation of lexical alignment in spoken dialogue system user models, as well as suggesting that lexical alignment in human-computer dialogue may be influenced by priming rather than considered interlocutor modelling.

Accounting for diverse opinions in the labelling of paralinguistic databases
Presented by: Ailbhe Cullen
Electronic and Electrical Engineering, Trinity College Dublin

Emotion or affect recognition has become a key concern in the design of more natural human-computer interfaces. As with any recognition task, this necessitates large, well-labelled databases. The difficulty is that, unlike words or phones, which are clearly defined, emotions and personality traits tend to be quite vague and subjective, and in the absence of self-reported labels, it becomes a challenge to define objective ground truth labels for a database. Here we explore the problem of data selection for training. It is generally accepted that the more examples of emotions in the training data, the better the classifier performance will be. We show that, in fact, ambiguous samples with poor inter-rater agreement, which are common in databases of natural emotions, may be harmful in training. By discarding samples with weak emotion intensity or poor inter-rater agreement we achieve significant increases in performance on two emotional speech databases. We then move on to explore a new database, collected to examine charisma in political speech, and assess the inter-rater agreement for the labelling of four speaker traits - charisma, likeability, inspiration, and enthusiasm - with a focus on frameworks for dealing with opposing but equally valid labeller opinions.

Toward Conversational Speech Synthesis
Presented by: Rasmus Dall
Centre For Speech Technology Research, University of Edinburgh

This poster explores how we may utilise spontaneous conversational speech and phenomena for creating speech synthesis systems particularly suited for a conversational setting. This includes the use of spontaneous speech as data, but also a style transformation toward the spoken word. Such conversational systems would be particularly well suited for conversational agents, robots and other highly interactive speech devices.

Voice by Choice
Presented by: Alistair Edwards
Department of Computer Science, University of York

Despite great advances in speech synthesis technology, users of augmentative and alternative communication (AAC) devices are usually expected to use the same 1980s-era synthesizer. Voice by Choice is a short film which highlights this problem by demonstrating the humour that can arise when three people all share the same voice - at a speed dating event. The film was one product of the Creative Speech Technology research network (CreST).

Just Talking - Exploring Casual Conversation
Presented by: Emer Gilmartin
Speech Communication Lab, Trinity College Dublin

Spoken dialogue is situated, and characteristics vary with the type of interaction participants are engaged in. While types or genres of written communication have been extensively described, categorization of speech genres is not regarded as straightforward. Much attention has been paid to ‘task-based’ or ‘transactional’ dialogue, with dialogue system technology concentrating on this type of talk for reasons of tractability. However, in real-life conversation there is often no obvious short-term task to be accomplished through speech and the dialogue is better described as ‘interactional’, aimed towards building and maintaining social bonds. As an example of this ‘unmarked’ case of conversation, a tenant’s short chat about the weather with the concierge of an apartment block is not intended to transfer important meteorological data but rather to build a relationship which may serve either of the participants in the future. These ‘interactional’ dialogues, where the transfer of verbal information is not the main goal, are not adequately modelled in current frameworks. Analysis of casual conversation at the syntactic, semantic, and discourse levels has shown two major genres present – short interactive sections (‘chat’) and longer, often narrative, stretches (‘chunks’). Acoustic and prosodic analysis is challenging as high levels of crosstalk in recordings of natural multiparty interaction make automatic analysis difficult, particularly around speaker changes. 
My current work involves detailed manual annotation of natural multiparty conversations from the d64 and D-ANS corpora, and the use of these annotations to investigate:
• temporal (‘chronemic’) patterns in extensive (several-hour) conversations, with particular focus on the distribution of intra- and inter-speaker pauses and gaps in chat and chunk segments
• automaticity and the distribution of ‘disfluencies’ in different genres of casual conversation
• the distribution and function of laughter in conversation
The study of different genres within casual or interactional social talk, and of the contrasts between stretches of social and task-based communication within the same interaction sessions, can help discriminate which phenomena in one type of speech exchange system are generalizable and which are situation or genre dependent. In addition to extending our understanding of human communication, such knowledge will be useful in human-machine interaction design, particularly in the field of ‘companion’ robots or relational agents, as the very notion of a companion application entails understanding of social spoken interaction.

Demo: The PARLANCE Mobile App for Interactive Search in English and Mandarin
Presented by: Helen Hastie
Computer Science, Heriot-Watt University

We demonstrate a mobile application in English and Mandarin to test and evaluate components of the PARLANCE dialogue system for interactive search under real-world conditions.

ALADIN: Voice interaction for people with disabilities
Presented by: Jonathan Huyghe
Centre for User Experience Research (CUO), KU Leuven / iMinds

The ALADIN project aims to develop an assistive vocal interface for people with physical impairments. Because people who experience limited upper limb control also frequently suffer from speech impairments, natural language speech recognition is often not an option. Instead, ALADIN is a completely language-independent, learning system, which is trained during its use to adapt to the individual user. In this demo, we demonstrate voice control within a virtual house, which was developed for user tests to simulate the use in home automation. We also show the companion tablet application, which supports users during the training period and acts as a fallback interface. Those interested can also record their own voice commands to use in the virtual house.

Speech Synthesis with Character
Presented by: Graham Leary
CereProc Ltd.
Speech is the natural man-machine interface. With CereVoice, any software application can now talk with character. CereProc is an enabling technology company, creating scalable voices that are both characterful and easy to integrate and apply. This makes our voices suitable for any application where speech output would actively improve the system or assist those with impaired vision. CereProc has developed the world's newest and most innovative speech synthesis system. Our advanced voice engine, CereVoice, is equally happy on embedded systems such as smartphones and on traditional IT platforms. CereProc's award-winning breakthrough voice creation system enables practically anyone to create a synthesised voice in a matter of minutes. CereProc can easily turn the voice of your brand into the voice of your application, website, or multimedia interface.

Quorate Speech Recognition and Analysis Suite
Presented by: Mike Lincoln, CTO Quorate
Quorate is a speech recognition and analysis suite which unlocks the information in recordings. In this demonstration we show how Quorate can be used to extract information from a series of simulated police interview recordings. This allows the recordings to be browsed and searched to identify common themes and locate relevant details which would otherwise be hidden.

17 ways to say yes
Presented by: Graham Pullin, DJCAD University of Dundee
Exploring tone of voice in augmentative communication and designing new interactions with speech synthesis.

SpeechCity: Conversational Interfaces for Urban Environments
Presented by: Verena Rieser, Srini Janarthanam, Andy Taylor, Yanchao Yu and Oliver Lemon
Interaction Lab, Heriot-Watt University
We demonstrate a conversational interface that assists pedestrian users in navigating and searching urban environments. Locality-specific information is acquired from open data sources, and can be accessed via intelligent interaction. We therefore combine a variety of technologies, including spoken dialogue systems and geographical information systems (GIS), to operate over a large spatial database. In this demo, we present a system for tourist information within the city of Edinburgh. We harvest points of interest from Wikipedia and social networks, such as Foursquare, and we calculate walking directions from OpenStreetMap (OSM). In contrast to existing mobile applications, our Android agent is able to simultaneously engage in multiple tasks, e.g. navigation and tourist information, by using a multi-threaded dialogue manager. To demonstrate the full functionality of the system, we simulate a (user-specified) walking route, along which the system pushes relevant information to the user. Through the use of open data, the agent is easily portable and extendable to new locations and domains. Possible future versions of the system include an Edinburgh Festival app, a tourist guide for San Francisco and the Bay Area, and a conference system for the SemDial'14 workshop (to be held at Heriot-Watt University in September). Part of the demonstration is an initial market survey to gather information about target markets and potential customers.

Multilevel Auditory Displays for Mobile Eyes-Free Location-Based Interaction
Presented by: Yolanda Vazquez-Alvarez
School of Computing Science, University of Glasgow
We explored the use of multilevel auditory displays to enable eyes-free mobile interaction with location-based information in a conceptual art exhibition space. Multilevel auditory displays enable user interaction with concentrated areas of information. In this study we used a gallery-like space as an Audio Augmented Reality (AAR) environment in which we tested a number of different multilevel auditory displays. A deeper understanding of how novel auditory displays can impact the user experience will allow designers to make more informed decisions when designing eyes-free auditory interfaces.