TOWARDS A UNIVERSAL SPEECH INTERFACE

Roni Rosenfeld, Xiaojin Zhu, Arthur Toth, Stefanie Shriver, Kevin Lenzo, Alan W Black

School of Computer Science

Carnegie Mellon University

{roni, zhuxj, atoth, sshriver, lenzo, awb}@cs.cmu.edu

ABSTRACT

We discuss our ongoing attempt to design and evaluate universal human-machine speech-based interfaces.  We describe one such initial design suitable for database retrieval applications, and discuss its implementation in a movie information application prototype.  Initial user studies provided encouraging results regarding the usability of the design and suggested some questions for further investigation.

1. INTRODUCTION

Speech recognition technology has made spoken interaction with machines feasible.  However, no suitable universal interaction paradigm has yet been proposed for humans to communicate effectively, efficiently and effortlessly by voice with machines.

On one hand, natural language applications have been demonstrated in narrow domains, but building such systems is data-, labor- and expertise-intensive.  Perhaps more importantly, unconstrained natural language severely strains recognition technology, and fails to delineate the functional limitations of the machine.  On the other hand, directed dialog systems using fixed menus are commercially viable for some applications, but are inefficient, rigid, and impose high cognitive demands.

The optimal paradigm, or style, for human-machine speech communication arguably lies somewhere in between these two extremes: more regular than natural language, yet more flexible than hierarchical menus.  The Universal Speech Interface (USI) project at CMU is designing and evaluating such styles.  In essence, we are trying to do for speech what Graffiti has done for mobile text entry.  A crucial aspect of the design is uniformity across applications.  In that regard, we are trying to do for speech what the Xerox/Macintosh revolution has done for GUIs.  As in the latter case, uniformity also means that toolkits can be used by application developers to facilitate compliance and dramatically reduce development time.  Another crucial aspect of the design is learnability: like Graffiti, our style must be learned in no more than a few minutes, then be immediately useful and transferable to all other applications.  For a more detailed discussion of the motivation behind the USI approach, see [1].  For current information on the USI project at Carnegie Mellon, see http://www.speech.cs.cmu.edu/usi

This paper discusses one such design, one that is most suitable for information retrieval from a database.  We chose to demonstrate the design in a prototype movie information application.  Our database is the one used by an existing natural language interface, the Carnegie Mellon MovieLine [2].  It contains information about movies and movie theaters in the Pittsburgh area and is updated weekly.  We chose this as our first application for three reasons: a database was readily available; interfacing with an information server allowed us to focus mainly on the design of the interface while still creating a fully functional system; and the existing MovieLine interface facilitates head-to-head comparisons of natural language and USI interactions.  We plan to implement and test USI systems for a variety of application types, and eventually distribute a development toolkit to allow others to create USI interfaces for their own systems.

2. A SAMPLE INTERACTION

The following sample interaction with the USI movie line will form the basis for a discussion of our interface design.

User (U) wants to know where Casablanca is playing:

1    U:  Movie is Casablanca, theaters are what?, go!

2    Movieline (M):  Two matches: Showcase East, Waterworks Cinema.

User would like to find a comedy in Squirrel Hill:

3    U:  Neighborhood is Squirrel Hill, now_what?

4    M:  Title is <dadada>, theater is <dadada>, genre is <dadada>, <ellsig>

5    U:  Genre is now_what?

6    M:  Comedy, drama, foreign, <ellsig>

7    U:  Comedy, titles are what?, go!

User inquires about the movie October Sky:

8    U:  Movie is October, go!

9    M:  <oksig> movie is <confsig> October

10   U:  October Sky, theaters are what?, go!

User wants to know what time Casablanca is showing at Waterworks Cinema:

11   U:  Title is Casablanca, theater is Watergate, scratch_that!

12   M:  Scratched.

13   U:  Movie is Casablanca, theater is Waterworks, ok?

14   M:  Okay.

15   U:  Times are what?, go!

16   M:  Six matches: 1:15, 2:45, 4:00 <ellsig>

17   U:  More

18   M:  5:45, 7:50, 10:00

3. INTERFACE DESIGN

3.1. Syntax

The USI uses as its basic utterance a series of phrases followed by a terminator keyword. Each phrase specifies a slot name and its value.  Thus, in line 1, "movie is Casablanca" is a phrase specifying "movie" as the slot and "Casablanca" as its value; "go!" is the terminator.

The use of slot+value phrases simplifies the work of the parser and conforms to natural speaking patterns.  Phrases are order-independent, and synonyms are permitted where appropriate for both slot names and values (e.g. "movie" and "title" in lines 1 and 3).  In our current implementation we restrict phrase syntax to "slotname/s is/are <value>", but in general the grammar for each slot type may be quite elaborate.  We use the Phoenix parser [3], developed at Carnegie Mellon, to define and parse the utterances.  Some grammar variations we have considered include allowing prepositions in slots (e.g. "at [theater] Showcase East, on [day] Tuesday").  For database applications, the USI uses "what?" to indicate the slots to be queried, as in line 1.

The use of terminators also eases the burden of processing: the ASR engine simply watches for one of the terminators, and upon finding one sends the preceding string as a completed structure to the parser.  From the user's point of view, a terminator allows them to take as much time as needed to formulate a query.  "Go!" is our basic terminator and signals that the user is ready to have their query executed.  We have also incorporated a fallback timeout, so that a sufficiently long pause is treated as an implicit terminator.
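To make the phrase+terminator structure concrete, the following minimal Python sketch splits an utterance at its terminator and reads each comma-separated phrase as a slot+value pair.  The terminator keywords are taken from this paper; everything else (the function name, the comma-splitting shortcut) is illustrative only and far simpler than the actual Phoenix grammar:

    # Minimal sketch only; not the actual USI parser or Phoenix grammar.
    TERMINATORS = {"go!", "ok?", "restate", "scratch_that"}

    def parse_utterance(text):
        """Split one USI utterance into constraint slots, queried slots, and terminator."""
        phrases = [p.strip() for p in text.lower().split(",")]
        terminator = phrases.pop() if phrases and phrases[-1] in TERMINATORS else None
        slots, queried = {}, []
        for phrase in phrases:
            words = phrase.split()
            if len(words) >= 3 and words[1] in ("is", "are"):
                slot, value = words[0], " ".join(words[2:])
                if value == "what?":
                    queried.append(slot)       # slot whose values are requested
                else:
                    slots[slot] = value        # slot constraint supplied by the user
        return slots, queried, terminator

    # parse_utterance("Movie is Casablanca, theaters are what?, go!")
    # -> ({'movie': 'casablanca'}, ['theaters'], 'go!')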

3.2. Vocabulary

The vocabulary of a USI-enabled application consists of two parts: a set of universal USI keywords, and an application-specific lexicon. The keywords are used to perform basic functions in all USI applications and are discussed individually in this paper. The lexicons are specified by developers of individual applications.

For the USI to be truly universal, it must use a small set of words that non-technical users will feel comfortable with. Therefore, we have attempted to restrict our list of keywords to simple, everyday words and phrases such as "ok" and "scratch that" rather than more technical terms like "enter" and "execute".

It is also essential to keep the number of USI keywords to a minimum, to reduce the perplexity of the language and the burden of learning it. We have tried to limit the number of essential keywords in the USI to 7-9.

The size of the application-specific lexicon is naturally determined by the functionality and complexity of each application, and will generally be quite a bit larger than the USI keyword set (the movie line lexicon includes 791 words; however, 58% of these are movie names).  To add flexibility, synonyms are allowed where appropriate, as noted above. Although this increases the size of the vocabulary, it actually reduces the burden on the user’s memory.
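A hypothetical sketch of how a developer-supplied lexicon might separate the universal USI keywords from application-specific slots, values and synonyms is shown below.  The structure and names are our own illustration, not the format of the planned USI toolkit:

    # Illustrative only; not the actual USI toolkit lexicon format.
    USI_KEYWORDS = ["go!", "ok?", "scratch_that", "rather", "restate",
                    "now_what?", "lead_me", "how_do_I", "explain", "more"]

    MOVIELINE_LEXICON = {
        "title":   {"synonyms": ["movie"], "values": None},   # open class: movie names
        "theater": {"synonyms": [],        "values": None},   # open class: theater names
        "genre":   {"synonyms": [],        "values": ["comedy", "drama", "foreign"]},
        "time":    {"synonyms": ["times"], "values": None},
    }

    def recognizer_vocabulary(lexicon):
        """Collect every word the recognizer and parser must know for one application."""
        words = set(USI_KEYWORDS)
        for slot, spec in lexicon.items():
            words.add(slot)
            words.update(spec["synonyms"])
            words.update(spec["values"] or [])
        return sorted(words)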

3.3. Help/Orientation

An essential component of any interface is a simple, effortless help function.  This is particularly important when the system has no visual component: the user must not only remember how to access the help, but must also retain in short-term memory, and act on, the information the help function provides.

We consider six types of help requests:

1. what the machine is/does;

2. local help while issuing a query;

3. how to use a keyword or command;

4. step-by-step help in issuing a query (a "wizard");

5. help finding the appropriate keyword or command for performing a task;

6. more information about something in the application.

We address the first situation by playing a short introduction at the beginning of each USI interaction.  This introduction also includes a short sample of dialog appropriate to the application, intended to instruct new users (or remind more experienced users) how to perform basic actions in the USI system.

The main mechanism for getting help in the USI is the keyword "now_what?", as shown in lines 3-7 of the sample dialogue. When a user says "now_what?", the system responds with a list of all the things that could come next at that point in the user’s query; the specific form and content of the list is determined by the context in which it is said.

In line 3, the user has asked "now_what?" at a phrase boundary, so the response is a list of all the phrases that could be used to continue the query. In line 5, the user has asked "now_what?" inside a phrase, at a point where they are expected to specify a value, so the USI responds with a list of possible values. 

Another principle of the USI design is that machine prompts should be phrased so as to entrain the user. Therefore the machine’s response in line 4 is structured in phrases just like the user is expected to use, rather than returning something like "ask about a title, theater, or genre." "Lexical entrainment" [4] such as this helps promote more efficient interaction with the machine with no added computational complexity (in fact it is probably often simpler than generating an appropriate, grammatical rephrasing like "at what time?"). As shown in line 6, the response to a mid-phrase "now_what?" also uses the exact words that the user is expected to say. In some cases however, the class of possible responses is too large, and a description of the response is given instead:

        U:  Location is now_what?

        M:  State the name of a neighborhood or city.

The <dadada> notation in line 4 indicates a fill-in-the-blank marker, and is currently implemented as a fast, low-stress "da-da-da."  The <ellsig> notation in lines 4, 6, and 17 is intended to indicate that the list continues beyond this. Since recitation of a very long list of items does not generally allow the user adequate time to process and retain each item, and because we want to encourage turns to be as brief as possible, USI lists are output in groups of three or four. The USI movie line currently implements the <ellsig> lexically, as the phrase "and more." Experiments have indicated that using audio signals or natural prosody to convey non-finality of lists is also effective, and we continue to explore this and other non-lexical alternatives [5].
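Such a chunked read-out is simple to produce.  The short sketch below (our own function name and grouping size) groups a result list into machine turns of three items, with the <ellsig> realized lexically as "and more" as in the current movie line:

    # Sketch of list output in groups of three; the user says "more" to continue.
    def speak_list(items, group_size=3):
        """Yield one machine turn per group of list items."""
        for start in range(0, len(items), group_size):
            group = items[start:start + group_size]
            suffix = ", and more" if start + group_size < len(items) else ""
            yield ", ".join(group) + suffix

    # for turn in speak_list(["1:15", "2:45", "4:00", "5:45", "7:50", "10:00"]):
    #     print(turn)
    # -> 1:15, 2:45, 4:00, and more
    # -> 5:45, 7:50, 10:00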

Since the user can say "now_what?" at any point and get help specific to that context, it addresses the second help situation.  It also covers the third situation, since the information it returns shows the user exactly how to use each item.

For the fourth type of help, a user could move through a query one step at a time by repeatedly asking "now_what?" As a shortcut for this process, the user could say "lead_me" and be guided through the query in essentially the same way. With "lead_me," control of the dialog rests with the system, so that a query segment is elicited from the user, and then the next prompt is given by the machine.  The user can of course resume control of the dialog at any time. (This keyword has not been implemented yet.)

At the very beginning of an interaction, saying "now_what?" will result in a list of all possible phrases; this could help the user in the fifth help situation who knows what they want to do but is not sure how to do it. A more efficient solution is a simple keyword search. If a user has something in mind that they want to do, they can simply say "how_do_I <do something>."  Each application will include an index of words that might be associated with each of its main functions. The "how_do_I" function will search through this index to find items corresponding to the words in the user's utterance string and will report the matches back to the user as a list in a manner similar to the response to "now_what?"
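One way such a keyword index lookup might work is sketched below; the index entries and function names are invented for the example and are not taken from the movie line implementation:

    # Illustrative "how_do_I" lookup; index contents are made up.
    HOW_DO_I_INDEX = {
        "ask for show times": {"time", "times", "when", "showing"},
        "ask for theaters":   {"theater", "where", "playing", "near"},
        "search by genre":    {"genre", "comedy", "drama", "kind"},
    }

    def how_do_i(utterance):
        """Return the functions whose index words overlap the user's utterance."""
        words = set(utterance.lower().split())
        return [task for task, index in HOW_DO_I_INDEX.items() if words & index]

    # how_do_i("how do I find out when something is showing")
    # -> ['ask for show times']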

The sixth type of help is handled with the keyword "explain." A user can say "explain <USI keyword>" or "explain <application term>," and the machine will respond with a brief, USI- or developer-specified description of what that item does or represents.

3.4. Errors

Our initial design includes mechanisms for alerting users to errors and also for helping users avoid errors in the first place. An example of the first case is shown in lines 8-10 of the sample dialogue. The user has intended to ask about the movie October Sky but instead has only said "October." However, as far as the system knows, there is no movie called October and therefore it cannot occur as a value for the movie slot, so it signals an error.

In general, a USI error can result from a failed parse (which could be due to a recognition error or to an ill-formed query, as above), invalid data (e.g. "February thirty-first"), or possibly as a result of a low confidence score from the ASR component.

We handle errors by conveying to the user which part of the query was understood, and in which part the error occurred.  In line 9 of the sample dialog, the <oksig> indicates that the system understood "movie is," and the <confsig> indicates that the system did not understand "October." The part of the query that was understood correctly is retained by the system, and the user can correct and continue their query from the point of the error. Currently, our design is deliberately left-to-right, so the processing stops as soon as an error is encountered.

The current version of the USI movie line implements the <oksig> and the <confsig> lexically, so that the actual error message for the above situation would be "I understood ‘movie is,’ but I didn’t understand ‘October.’" Experiments with non-lexical signals have indicated that simply repeating "movie is October?", where October is spoken with a rising, stressed, "confused" prosody is also a reasonable error alert for users, although it is not simple to implement [5]. We plan to conduct further user tests of noises and other non-lexical signals for their effectiveness as error alerts.
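The left-to-right error handling and its lexical realization can be sketched as follows.  The function and its validity check are our own stand-ins for the real grammar and database checks, not the system's actual code:

    # Illustrative sketch of left-to-right error processing with a lexical alert.
    def process_phrases(phrases, valid_value, state):
        """Fill `state` slot by slot; stop and alert at the first bad value."""
        for slot, value in phrases:
            if valid_value(slot, value):
                state[slot] = value       # understood part of the query is retained
            else:
                return "I understood '%s is', but I didn't understand '%s'." % (slot, value)
        return "Okay."

    # state = {}
    # process_phrases([("movie", "october")], lambda slot, value: False, state)
    # -> "I understood 'movie is', but I didn't understand 'october'."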

Another error strategy is shown in line 11 of the sample interaction. Here, the user recognizes that they have misspoken a word and uses the terminator "scratch_that" to clear the query and start over.  In addition, the keyword "rather" allows the user to make a correction without starting over:

        U:  Title is Casablanca, theater is Watergate, rather, theater is Waterworks, go!

The USI also includes two other confirmation terminators.  "Ok?", as shown in line 13, directs the system to parse the current utterance and respond with an "okay" if there are no parsing or data errors.  "Restate" does the same thing, except that the machine responds with a listing of all the slot+value pairs parsed since the user's last "restate," so that the user can be sure the slots have been filled correctly.
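Taken together, the correction and confirmation terminators amount to a small set of operations on the accumulated query state.  The class below is a hypothetical sketch of those operations; the class and method names are ours, not the USI engine's actual implementation:

    # Hypothetical sketch of query-state operations behind the terminators.
    class QueryState:
        def __init__(self):
            self.slots = {}               # slot -> value accumulated so far
            self.since_restate = []       # phrases parsed since the last "restate"

        def fill(self, slot, value):
            self.slots[slot] = value
            self.since_restate.append((slot, value))

        def rather(self):
            """Drop the most recent phrase so the next one can replace it."""
            if self.since_restate:
                slot, _ = self.since_restate.pop()
                self.slots.pop(slot, None)

        def scratch_that(self):
            """Clear the whole query so the user can start over."""
            self.slots.clear()
            self.since_restate.clear()
            return "Scratched."

        def ok(self, errors=None):
            """Confirm with a bare "okay" when there are no parsing or data errors."""
            return "Okay." if not errors else errors

        def restate(self):
            """Read back every slot+value pair parsed since the user's last restate."""
            readback = ", ".join("%s is %s" % (s, v) for s, v in self.since_restate)
            self.since_restate = []
            return readback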

4. SYSTEM ARCHITECTURE

Our implementation is modular, with the various components residing on multiple machines spanning two platforms (Linux and Windows NT).  The dialog manager consists of an application-independent USI engine and an application-specific domain manager.   The two interact via a USI API.  The USI engine calls on the Phoenix parser, and the domain manager interacts with a commercial database package.  These components together constitute a standalone text-based version of the system, which can be developed and tested independently of the ASR, synthesis, and telephony control.
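The USI API itself is not specified in this paper; the sketch below is only a guess at its general shape, with the application-independent engine calling into an application-specific domain manager.  All names are assumptions:

    # Hypothetical sketch of the engine/domain-manager boundary; not the real USI API.
    from abc import ABC, abstractmethod

    class DomainManager(ABC):
        """Application-specific side of the USI API (here, the movie line)."""

        @abstractmethod
        def slots(self):
            """Slot names the application understands, with their synonyms."""

        @abstractmethod
        def valid_value(self, slot, value):
            """Check a value against the application's grammar and database."""

        @abstractmethod
        def query(self, constraints, requested_slots):
            """Run the database lookup and return the matching values."""

    class MovieLineDomain(DomainManager):
        def slots(self):
            return {"title": ["movie"], "theater": [], "genre": [], "time": ["times"]}

        def valid_value(self, slot, value):
            return True                   # would consult the weekly movie database

        def query(self, constraints, requested_slots):
            return []                     # would return e.g. matching theaters or times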

Recognition is performed by CMU’s Sphinx-II engine [6], using acoustic models developed for the Communicator testbed [7].  For speech synthesis, we recorded a voice for unit-selection based limited-domain synthesis using the Festival system [8].  All the components are integrated using a VB framework borrowed from the CMU MovieLine, and a socket interface where needed.

Finally, new movies must be added to the application at least weekly.  For each such movie, one must update the database, the grammar, the language model, the pronunciation lexicon and the synthesis database.  To reduce costs and errors in development and maintenance, we are automating this process.

5. PRELIMINARY USER STUDIES

We conducted preliminary user studies to gauge how well new users understood the basic concepts of the interface. Fifteen subjects were asked to listen to a 100-second recorded introduction and sample dialog for the movie line application.  They were then asked to call the system and use it to get answers to five questions such as "Find the first showing of Chicken Run after 2:00 at the Galleria." In addition, before listening to the introductory recording, half the subjects were given approximately two minutes of personal instruction covering USI basics such as phrases, terminators, and the format of error messages. All users were asked to return three days later, listen to the introductory recording again, and use the USI movie line to answer a different set of five questions.

In general, users assimilated the interaction style quite well. Ten subjects issued a correctly formed USI query on the first or second try; an additional three users issued correct queries within five to seven tries. Only two users had critical problems formulating a query; after some additional help from the experimenter they were able to answer most of the questions (one with the aid of the USI basics "cheat sheet" which was used in the personal instruction sessions). All participants used "scratch_that" and "more" at least once. We found that only a small number of participants used "now_what?"; the rest guessed the necessary slot names, usually successfully.  This is likely to be the case with intuitive, self-suggesting slot names, but "now_what?" may still be useful in other cases.

Our user tests also provided support for the need for synonyms in the USI vocabulary. On the first day of testing, 11 out of 15 subjects used "movie" instead of  "title" in their queries – even though "title" was the phrase presented in all the introductory material.

Two of the issues we had anticipated as possible problems did indeed surface in the user tests: error correction and "go!" We found that many users had difficulty correcting errors at the appropriate location. Currently, our system expects the user to correct the problem at the point of the error and move on, but our testing showed that many users simply started the entire query over again, or at least restarted it at a phrase boundary. This inevitably led to further errors, since the slot+value structure of the query was disturbed.

While some users overused "go!" by adding it to other terminators like "now_what" and "more," almost all users failed at least once to say "go!" to send their query to the system.  This is not unlike the situation with novice computer users, who often forget to hit "Enter."  Although we believe that, as in the latter case, this is a habit that is easily acquired, we plan to experiment with shorter and/or user-adjustable timeouts and possibly eliminate "go!" from the set of terminators altogether. 

Another finding that deserves further study is that some users tried to answer more complex questions with multiple "what?" phrases in a single query. We would like to allow this functionality, but we have yet to determine the best way to present the resulting matrix of information.

6. FUTURE WORK

We plan to conduct more user studies to inform our future designs. In addition to addressing the issues noted in section 5, we hope to investigate when and how confirmation should be used, how learnable new USI applications are for those who have used the USI movie line, and how to introduce users to more advanced USI features. We also plan to run side-by-side user studies comparing the USI movie line interface with the CMU Communicator’s natural language interface.

7. ACKNOWLEDGEMENTS

We are grateful to Rita Singh and Ricky Houghton for help with acoustic modeling issues, and to Alex Rudnicky for much appreciated advice.  This research was sponsored in part by the Space and Naval Warfare Systems Center, San Diego, under Grant No. N66001-99-1-8905. The content of the information in this publication does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred.

8. REFERENCES

[1] Ronald Rosenfeld, Dan Olsen and Alexander Rudnicky, "A Universal Human-Machine Speech Interface," Technical Report CMU-CS-00-114, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, March 2000.

[2] Constantinides, P., Hansma, S., Tchou, C. and Rudnicky, A., "A schema-based approach to dialog control", ICSLP 1998.

[3] Ward, W. "The CMU Air Travel Information Service: Understanding Spontaneous Speech," Proceedings of the DARPA Speech and Language Workshop. 1990.

[4] Boyce, S., Karis, D., Mané, A., and Yankelovich, N. "User Interface Design Challenges," SIGCHI Bulletin Vol. 30 (2) p. 30-34. 1998.

[5] Shriver, S., Black, A., and Rosenfeld, R. "Audio Signals in Speech Interfaces," ICSLP 2000.

[6] Huang, X.D., Alleva, F., Hon, H.W., Hwang, M.Y., Lee, K.F. and Rosenfeld, R. "The SPHINX-II Speech Recognition System: An Overview," Computer Speech and Language Vol. 7 (2) p. 137-148. 1993.

[7] Rudnicky, A., Thayer, E., Constantinides, P., Tchou, C., Shern, R., Lenzo, K., Xu, W. and Oh, A., "Creating natural dialogs in the Carnegie Mellon Communicator system," Proc. Eurospeech, 1999, 4, 1531-1534.

[8] Black, A., Taylor, P. and Caley, R. The Festival Speech Synthesis System. http://www.cstr.ed.ac.uk/projects/festival.html 1998.