WAMI 2.0 Javascript API

Authors: Ian McGraw, Alex Gruenstein, Chris Varenhorst, Andrew Sutherland



The WAMI Javascript API allows you to add speech capabilities to any web page with just a few lines of code.

Example Applications

Probably the easiest way to learn is by example.  Here are links to several self-explanatory examples.  You can try any of them, then right-click to view source, and save them as a starting point for writing your own application.


Getting Started

To use the Javascript API, you need a developer key associated with each web server (or path on the web server) on which you plan to use it.  You can use the API without obtaining one, but WAMI will annoy you with a message each time your page loads.

  1. Sign up for a WAMI developer account.  This lets you use both the WAMI Portal and the WAMI Javascript API.
  2. Log in to your account with the password you receive by e-mail.
  3. Use the tool to generate a key for your domain.  e.g. http://www.foo.com/

Adding the API to your page

Once you've generated a key for your web page, you simply add the following <script> tag inside the <head> of your HTML page:


<script src="http://wami.csail.mit.edu/wami-2.0.js"></script>

We do not recommend downloading the JS itself, but instead linking to it just as above.

Creating a Wami.App

In your onload handler, construct a new Wami.App as follows.  It's probably easier to just look at the examples, but you can use this section for a reference of all the options.


// A typical Wami.App might specify these parameters 
// (additional options are described below)
var options = {
    guiID : 'gui-div',
    devKey : 'YOUR_KEY',
    grammar : {...},
    onReady : function() {...},
    onRecognition : function(result) {...}
}

var wamiApp = new Wami.App(options);



guiID -- The parent element under which the default GUI will be placed.  If you do not specify this, you are expected to provide your own GUI.  Note that the footprint for the default GUI is not fixed for desktop browsers because Flash will insert a security settings panel if microphone permissions have not yet been granted.  To deal with the security panel separately see the onSecurity handler.

grammar -- This object tells the speech recognizer what to "listen" for. You can specify it in the options, and/or update it dynamically through the setGrammar method described later. Currently we support three languages: English (en-us), Mandarin Chinese (zh), and Japanese (ja). For English, all words in the grammars should be lowercase. One nifty feature of WAMI is that you can ask for incremental recognition results, and they will come back to you as the user is talking!

The main type of grammar WAMI supports, shown below, allows you to specify typical sentences you might expect your users to say in a compact way.  Read about the Java Speech Grammar Format (JSGF) to see how to specify a grammar of the jsgf type.

var grammar = {
    language : "en-us",
    grammar : "#JSGF V1.0;  ....  ",
    type : "jsgf",
    aggregate : true,
    incremental : false
};


Tags are supported in JSGFs to help you embed semantics into your grammar:

<show> = (show {[command=show]} <animal>)+;
<animal> = dog {[animal=1]} | cat {[animal=2]};

If you have specified aggregate to be true in your grammar, your recognition results will come back with additional JSON key-values based on the semantic tags found in the recognition result.  Since a tag can occur multiple times due to loops in the grammar, a special split key (which defaults to "command") determines how the key-values are broken up.  You can change the default split tag by adding a split : 'mykey' to the grammar.  Concrete examples are given below. 
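As a rough illustration of how tagged rules fit into a grammar object, the sketch below generates the <animal> rule from a word-to-value map and embeds it in a jsgf grammar with aggregate turned on. The helper name makeTaggedRule, the grammar name Animals, and the public modifier on <show> are assumptions made for this example; they are not part of the WAMI API.

```javascript
// Hypothetical helper: build a tagged JSGF rule from a word-to-value map.
function makeTaggedRule(ruleName, key, values) {
    var alternatives = [];
    for (var word in values) {
        if (values.hasOwnProperty(word)) {
            // English grammar words should be lowercase.
            alternatives.push(word.toLowerCase() +
                              " {[" + key + "=" + values[word] + "]}");
        }
    }
    return "<" + ruleName + "> = " + alternatives.join(" | ") + ";";
}

var animalRule = makeTaggedRule("animal", "animal", { dog : 1, cat : 2 });
// animalRule is "<animal> = dog {[animal=1]} | cat {[animal=2]};"

// Grammar object in the shape described above ("Animals" is a made-up name).
var taggedGrammar = {
    language : "en-us",
    grammar : "#JSGF V1.0;\n" +
              "grammar Animals;\n" +
              "public <show> = (show {[command=show]} <animal>)+;\n" +
              animalRule + "\n",
    type : "jsgf",
    aggregate : true
};
```

Generating rules this way is only worthwhile when the word list comes from data (for example, items currently shown on the page); for a fixed vocabulary, writing the JSGF by hand is simpler.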

Weights are supported in the JSGF grammar.  Weights are treated as counts, so the following two rules are equivalent:

<animal> = /2/ dog | /4/ cat;
<animal> = /4/ dog | /8/ cat;
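One way to see why the two rules above are equivalent: if weights are counts, only their ratios matter. The sketch below assumes the counts of a rule's alternatives are normalized into relative frequencies; that normalization step and the helper name normalizeWeights are illustrative assumptions, not a description of the recognizer's internals.

```javascript
// Hypothetical helper: turn per-alternative counts into relative frequencies.
function normalizeWeights(counts) {
    var total = 0;
    var word;
    for (word in counts) {
        if (counts.hasOwnProperty(word)) {
            total += counts[word];
        }
    }
    var shares = {};
    for (word in counts) {
        if (counts.hasOwnProperty(word)) {
            shares[word] = counts[word] / total;
        }
    }
    return shares;
}

var first = normalizeWeights({ dog : 2, cat : 4 });   // /2/ dog | /4/ cat
var second = normalizeWeights({ dog : 4, cat : 8 });  // /4/ dog | /8/ cat
// In both cases "cat" gets twice the share of "dog".
```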

Another grammar type we support is a corpus grammar.  For this grammar, the text of the grammar key is simply a list of new-line separated sentences.  An n-gram is dynamically compiled on the server to construct a recognition "domain" corresponding roughly to the sentences provided.

var grammar = {
    language : "en-us",
    grammar : "sentence one\n" + "sentence two\n",
    type : "corpus"
};
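If your sentences live in an array, a small helper can assemble the corpus text for you. This is a minimal sketch; the helper name makeCorpusGrammar and its trimming and lowercasing behavior are choices made for this example, not something the API provides.

```javascript
// Hypothetical helper: build a corpus grammar object from an array of sentences.
function makeCorpusGrammar(sentences) {
    var lines = [];
    for (var i = 0; i < sentences.length; i++) {
        // Trim whitespace and lowercase (English grammars should be lowercase).
        var s = sentences[i].replace(/^\s+|\s+$/g, '').toLowerCase();
        if (s.length > 0) {
            lines.push(s);   // skip empty entries
        }
    }
    return {
        language : "en-us",
        grammar : lines.join("\n") + "\n",
        type : "corpus"
    };
}

var corpusGrammar = makeCorpusGrammar(["Show me a dog", "  show me a cat ", ""]);
// corpusGrammar.grammar is "show me a dog\nshow me a cat\n"
```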

We will soon provide the ability to upload grammars, each of which will then correspond to a cached ID.

var grammar = {
    language : "en-us",
    grammar : "ABC123CACHEID",
    type : "cached"
};


onReady -- The function you specify as your onReady handler will be called once the Wami.App has finished initializing. This means that microphone security settings have been taken care of and recording can begin. There is an optional argument (sessionid) to the function, which uniquely identifies the WAMI session.
 
onRecognition -- This handler will be called every time a new recognition result comes back to the browser.  The result argument is in a format described below.

Wami.App Methods

wamiApp.record(); -- Start recording and sending audio to the recognizer.

wamiApp.play(url); -- Play an audio file from a URL.  The audio file must be a .WAV.

wamiApp.stop(); -- Stop playing or recording.

wamiApp.playback(); -- Play back the last audio your app recorded.

wamiApp.setGrammar(grammar);  -- Change the grammar.  A second optional parameter, incremental, will configure the recognizer to send or stop sending incremental results accordingly.

Advanced options for the constructor

onEvent -- This handler listens to events over the course of the WAMI application's lifetime.  The application is always in one of three states, and one of six events can occur in each state.


onEvent = function(state, event, data) {
    if (state == Wami.state.XXX &&
        event == Wami.event.YYY) {
        // do something
    }
};

// possible states
Wami.state.IDLE
Wami.state.PLAY
Wami.state.RECORD

// possible events
Wami.event.STARTING
Wami.event.STARTED
Wami.event.ALIVE
Wami.event.STOPPING
Wami.event.STOPPED
Wami.event.ERROR

The first event this handler receives is IDLE/STARTING, fired as the application starts up.  Once the application has initialized, the IDLE/STARTED event will fire, followed by IDLE/ALIVE.  IDLE/ALIVE is the default condition of the application when nothing else is going on.

If the application begins recording, RECORD/STARTING will occur, followed by RECORD/STARTED.  There may be a short delay between the two while a connection is made for audio streaming.  When the recording is stopped (e.g. via wamiApp.stop()), the RECORD/STOPPING and RECORD/STOPPED events will fire.  A similar sequence of events occurs for Wami.state.PLAY.

If anything goes wrong while the application is in a particular state, the Wami.event.ERROR event will fire.
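To make the state/event pairs concrete, here is a sketch of an onEvent handler that keeps a simple status label up to date. The WamiStub object stands in for the real Wami.state and Wami.event constants so the sketch can run on its own; its string values, the factory name makeStatusTracker, and the status labels are all assumptions for illustration.

```javascript
// Stand-in for the constants the real library exposes as Wami.state.* and
// Wami.event.*; the string values here are assumptions for illustration only.
var WamiStub = {
    state : { IDLE : 'IDLE', PLAY : 'PLAY', RECORD : 'RECORD' },
    event : { STARTING : 'STARTING', STARTED : 'STARTED', ALIVE : 'ALIVE',
              STOPPING : 'STOPPING', STOPPED : 'STOPPED', ERROR : 'ERROR' }
};

// Hypothetical factory: returns an object whose onEvent handler tracks status.
function makeStatusTracker(W) {
    var tracker = { status : 'initializing' };
    tracker.onEvent = function(state, event, data) {
        if (event == W.event.ERROR) {
            tracker.status = 'error';
        } else if (state == W.state.IDLE && event == W.event.ALIVE) {
            tracker.status = 'ready';
        } else if (state == W.state.RECORD && event == W.event.STARTING) {
            tracker.status = 'connecting';   // audio connection being made
        } else if (state == W.state.RECORD && event == W.event.STARTED) {
            tracker.status = 'recording';
        } else if (state == W.state.PLAY && event == W.event.STARTED) {
            tracker.status = 'playing';
        } else if (event == W.event.STOPPED) {
            tracker.status = 'ready';
        }
        return tracker.status;
    };
    return tracker;
}

// Replay the start-up and recording sequences described above.
var tracker = makeStatusTracker(WamiStub);
tracker.onEvent(WamiStub.state.IDLE, WamiStub.event.STARTING);
tracker.onEvent(WamiStub.state.IDLE, WamiStub.event.STARTED);
tracker.onEvent(WamiStub.state.IDLE, WamiStub.event.ALIVE);
var afterStartup = tracker.status;
tracker.onEvent(WamiStub.state.RECORD, WamiStub.event.STARTING);
tracker.onEvent(WamiStub.state.RECORD, WamiStub.event.STARTED);
var whileRecording = tracker.status;
tracker.onEvent(WamiStub.state.RECORD, WamiStub.event.STOPPING);
tracker.onEvent(WamiStub.state.RECORD, WamiStub.event.STOPPED);
```

In a real page you would presumably call makeStatusTracker(Wami) and pass tracker.onEvent as the onEvent option.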

onTimeout -- Sessions with the recognizer can't last forever.  After a few minutes of inactivity the session times out and recording will no longer send audio to the recognizer.  Notify your users to refresh the page with this handler.

onError -- If an error occurs (perhaps due to a poorly specified grammar), this handler is called with one or two arguments: type, message.

onSecurity -- This handler gives you complete control over the placement of the security DIV.  It is called when the browser is denied OR granted microphone access.  Currently the user is required to click "remember" for the security settings to be valid, so you must ensure that this happens.

onSecurity : function() {
    var security = wamiApp.settings(Wami.settings.MICROPHONE);
    if (!security.granted() || !security.remembered()) {
        security.show('SecurityDiv');
    }
}


incremental -- If true, the onRecognition handler will get called not just with settled recognition results, but also with incremental recognition results as the speaker is talking.

environment --  Some 3rd party Javascript APIs (e.g. the Google Maps API) use the same AJAX techniques WAMI does, and so their operation can at times conflict.  To fix this, simply host an additional HTML file in the same domain, and specify its location in this environment option.  The HTML file must contain the following:

<html>
<head>
<script>
buildScript = function(src) {
    var el = document.createElement('script');
    el.src = src;
    document.getElementsByTagName('head')[0].appendChild(el);
}
</script>
</head>
</html>


Wami.App Context Methods

var c = result.context("expected transcript"); -- To help us out (and perhaps to improve your recognition results), create context objects when you have a good idea of what the user has spoken and report them to us so we can use them for our training algorithms. If the context begins with #JSGF, it is assumed to be a grammar. If you have a way of tracking individual users at your site, it would be great if you could provide the optional second parameter speakerid, which is just any ID uniquely identifying that user.

wamiApp.report([context1, context2, {...}]); -- You can batch the contexts you create from your recognition results into an array and report them at your convenience.  You can also report arbitrary objects for logging, but it's unlikely they will be used for anything.

wamiApp.setDefaultContext(c); -- If you come across a situation where multiple utterances have roughly the same context (but where the recognition grammar is much broader), you can specify a default context to be associated with each utterance.  Suppose for instance you were creating a flight reservation system.  If you display a set of flights, you expect your user to choose one of those flights.  If they say something outside of this context, it's likely a mis-recognition; if it's inside this context, the recognizer probably got it right.

Speech Recognition Results Format

Speech recognition results are returned as a JSON structure, which is simple to handle in Javascript.

Case 1: I just want the words the user said, after he finishes speaking

If you want a single speech recognition result at the end of each user utterance, with no semantic interpretation, you'll need the following settings:

grammar.incremental == false
grammar.aggregate == false;

Then, in your onRecognition handler, you will receive a result object with the following properties:

result.settled() == true;
result.uttID() == 0;    // zero-indexed
result.count() >= 1;    // # hypotheses
result.text();          // top hypothesis
result.text(i);         // ith hypothesis

// E.g. to alert the best guess of what was said:
onRecognition : function(result) {
    alert(result.text());
}
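If you want more than the top hypothesis, you can walk the whole list with count() and text(i). The sketch below uses only the result methods listed above; the helper collectHypotheses and the mockResult object are made up for this example so it can be tried outside the browser.

```javascript
// Hypothetical helper: collect the N-best list from a settled result.
function collectHypotheses(result) {
    var hypotheses = [];
    if (!result.settled()) {
        return hypotheses;               // nothing final to report yet
    }
    for (var i = 0; i < result.count(); i++) {
        hypotheses.push(result.text(i));
    }
    return hypotheses;
}

// Mock result for trying the helper; a real result object is passed
// by WAMI to your onRecognition handler.
function mockResult(texts, settled) {
    return {
        settled : function() { return settled; },
        uttID : function() { return 0; },
        count : function() { return texts.length; },
        text : function(i) { return texts[i || 0]; }
    };
}

var nbest = collectHypotheses(mockResult(["call home", "call rome"], true));
// nbest is ["call home", "call rome"]
```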

Case 2: I just want the words the user said, but I want to get results as she speaks

This capability is referred to as incremental speech recognition: as the user speaks, we'll periodically tell you our best guess about what she has said so far.  To turn on incremental speech recognition results, specify it in your options as follows:

options.incremental == true
grammar.aggregate == false

Imagine the user says "Hello wami nice to meet you".  Your onRecognition handler will then be called multiple times with successively updated results.

// Result 0
result.settled() == false;
result.uttID() == 0;    // remains the same throughout
result.text() == "hello";

// Result 1
result.settled() == false;
result.text() == "hello wami nice";

// Result 2
result.settled() == false;
result.text() == "hello wami nice to meet";

// Result 3
result.settled() == false;
result.text() == "hello wami nice to meet you";

// Result 4
result.settled() == true;
result.text(0) == "hello wami nice to meet you";
result.text(1) == "hello wami nice to meet you";
result.text(2) == "hello wami it is nice to meet you";

Note that the full N-best list is only available for the settled recognition result (as indicated by the settled() method).
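A handler for incremental results typically keeps the interim text separate from finished utterances. Here is a minimal sketch of that pattern; makeTranscript and mockIncremental are hypothetical names, and the mock only imitates the settled() and text() methods described above.

```javascript
// Hypothetical factory: returns an object whose onRecognition handler
// separates interim text from finished utterances.
function makeTranscript() {
    var transcript = { interim : '', finished : [] };
    transcript.onRecognition = function(result) {
        if (result.settled()) {
            transcript.finished.push(result.text());
            transcript.interim = '';
        } else {
            transcript.interim = result.text();   // overwrite the last guess
        }
    };
    return transcript;
}

// Mock results standing in for the objects WAMI passes to the handler.
function mockIncremental(text, settled) {
    return {
        settled : function() { return settled; },
        text : function() { return text; }
    };
}

var transcript = makeTranscript();
transcript.onRecognition(mockIncremental("hello", false));
transcript.onRecognition(mockIncremental("hello wami nice", false));
var midUtterance = transcript.interim;   // "hello wami nice"
transcript.onRecognition(mockIncremental("hello wami nice to meet you", true));
// transcript.finished is now ["hello wami nice to meet you"]
```

On a page, you might show the interim text in a lighter color and replace it with the finished text once the result settles.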

Case 3: I don't care what words a person said, just what the person meant.

In the tic-tac-toe example application, you can say "put an X in square 5" or "put an X in cell 5". We don't care whether you say "square" or "cell".  The JSGF grammars allow you to embed semantic tags to make this easy.  Here is the complete grammar:


#JSGF V1.0; 
grammar TicTacToe;

public <top> = (<command> [and])+ ;

<command> = <put> | <erase> ;

<put> = put {[command=put]} (<mark>+ (in ([<cellname>] <cell>)+))+ ;

<erase> = erase {[command=erase]} ([<cellname>] <cell>)+ ;

<mark> = an x {[mark=x]} | an oh {[mark=o]} ;

<cellname> = cell | box | square ;

<cell> = one {[cell=1]}
       | two {[cell=2]}
       | three {[cell=3]}
       | four {[cell=4]}
       | five {[cell=5]}
       | six {[cell=6]}
       | seven {[cell=7]}
       | eight {[cell=8]}
       | nine {[cell=9]}
       ;


Notice the curly braces, which contain simple semantic tags inside of them like [cell=5].  These special tags will be returned as output with the speech recognition result, like this:


put [command=put] an x [mark=x] in square 5 [cell=5]


The WAMI aggregator makes it easy to extract these key/value sequences.  You turn on the aggregator with the following settings:


options.incremental = false;

grammar.aggregate = true;

Then, your settled speech recognition result will look like this:


result.settled() == true
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5]"
result.text(1) == "put [command=put] an x [mark=x] in square 6 [cell=6]"
result.get("command") == "put"
result.get("mark") == "x"
result.get("cell") == "5"
result.get("cell") == "6"

What's going on? WAMI has pulled out the key/value pairs for you and put them directly in the recognition result. A single utterance can actually lead to multiple aggregated results. This is controlled by the split property of a jsgf grammar:


options.incremental == false
grammar.aggregate == true
grammar.split == "command"


The split key will be used to aggregate together multiple commands in the same utterance.  For example, imagine the user says:

put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an oh [mark=o] in square 6 [cell=6]


You'll actually receive TWO recognition results.  First, you'll receive one for the first command, then you'll receive one for the second command.  This is because the split is "command" (actually this is the default anyway), so whenever the aggregator sees a "command" key, it will split the recognition result into multiple, aggregated results.
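To illustrate the splitting idea, the sketch below scans the [key=value] tags of a tagged result string in order and starts a new aggregate each time the split key appears. It is an illustrative re-implementation written for this page, not WAMI's own aggregator code; the function name splitAggregates is made up, and the sketch assumes keys and values are simple word characters.

```javascript
// Illustrative only, NOT WAMI's implementation: group [key=value] tags into
// aggregates, starting a new aggregate whenever the split key is seen.
function splitAggregates(taggedText, splitKey) {
    var tagPattern = /\[(\w+)=(\w+)\]/g;
    var aggregates = [];
    var current = null;
    var match;
    while ((match = tagPattern.exec(taggedText)) !== null) {
        var key = match[1];
        var value = match[2];
        if (key == splitKey || current === null) {
            current = {};
            aggregates.push(current);
        }
        current[key] = value;
    }
    return aggregates;
}

var tagged = "put [command=put] an x [mark=x] in square 5 [cell=5] " +
             "and put [command=put] an oh [mark=o] in square 6 [cell=6]";
var commands = splitAggregates(tagged, "command");
// commands[0] is { command : "put", mark : "x", cell : "5" }
// commands[1] is { command : "put", mark : "o", cell : "6" }
```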

Case 4: Putting it together: incremental recognition results with the aggregator

The combination of incremental speech recognition with the aggregator can be very powerful.  The tic-tac-toe application is a great example of using both together.  It is configured like this:

options.incremental == true 
grammar.aggregate == true
grammar.split == "command"


With both turned on, you'll receive incremental recognition results as the user speaks, which will have key-value pairs you can access with the aggregator.

But, most interestingly, by using the split key, you can get "finalized" aggregated commands even before the user finishes speaking.  In Tic Tac Toe, for example, suppose the user is speaking and says the following (without letting go of the hold-to-talk button):

put [command=put] an x [mark=x] in square 5 [cell=5] ...


Your onRecognition handler will then receive results like these:

// Result 1
result.settled() == false
result.text() == "put [command=put] an x [mark=x] in square"
result.get("command") == "put"
result.get("mark") == "x"
result.get("cell") == null
result.partial() == true


// Result 2
result.settled() == false
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5]"
result.get("command") == "put"
result.get("mark") == "x"
result.get("cell") == "5"
result.partial() == true


Notice that partial() is still true in the second result. This indicates that the user still has not finalized this command, either by releasing the button or by saying a new command. If they let go of the button, a similar recognition result in which partial() is false will occur.  However, they could also continue to speak the next command (without letting go).


put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an oh [mark=o] in cell 6 [cell=6]


The previous command will become finalized by virtue of the fact that we've seen a new "command" key, so here too you'll receive an aggregate for which partial() is false, but this time it is due to the presence of a new command:

// Result 3
result.settled() == false
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an oh [mark=o]"
result.get("command") == "put"
result.get("mark") == "x"
result.get("cell") == "5"
result.partial() == false


The utterance still is not over (note settled() is still false), but now a new command aggregate can begin.  Subsequent recognition results are shown below.  


// Result 4
result.settled() == false
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an"
result.get("command") == "put"
result.get("mark") == null
result.get("cell") == null
result.partial() == true


// Result 5
result.settled() == false
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an oh [mark=o]"
result.get("command") == "put"
result.get("mark") == "o"
result.get("cell") == null
result.partial() == true


// Result 6
result.settled() == false
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an oh [mark=o] in cell 6 [cell=6]"
result.get("command") == "put"
result.get("mark") == "o"
result.get("cell") == "6"
result.partial() == true


// Result 7 (recording has stopped)
result.settled() == true
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an oh [mark=o] in cell 6 [cell=6]"
result.get("command") == "put"
result.get("mark") == "o"
result.get("cell") == "6"
result.partial() == false


In tic-tac-toe, this distinction between partial() being true or false is first used to determine whether cell 5 should be highlighted (partial() == true), or whether the X should actually be put in the cell (partial() == false).  The same is then done for O.  It's a nice way to show that you have understood what the user said, but you haven't actually done it yet.  It also makes your Wami.App feel much more responsive.
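The highlight-then-commit pattern can be sketched as a handler that previews a move while partial() is true and commits it once partial() is false. This is not the tic-tac-toe application's actual source; makeBoard, its highlighted and marks fields, and the mockAggregate object are assumptions made for this example, and the mock only imitates the get() and partial() methods shown above.

```javascript
// Hypothetical board model: "highlighted" is a preview, "marks" are committed.
function makeBoard() {
    var board = { highlighted : null, marks : {} };
    board.onRecognition = function(result) {
        var command = result.get("command");
        var mark = result.get("mark");
        var cell = result.get("cell");
        if (command != "put" || mark == null || cell == null) {
            return;                       // not enough information yet
        }
        if (result.partial()) {
            board.highlighted = cell;     // preview only
        } else {
            board.marks[cell] = mark;     // command is finalized: commit it
            board.highlighted = null;
        }
    };
    return board;
}

// Mock aggregate standing in for a WAMI result object.
function mockAggregate(values, partial) {
    return {
        get : function(key) {
            return values.hasOwnProperty(key) ? values[key] : null;
        },
        partial : function() { return partial; }
    };
}

var board = makeBoard();
board.onRecognition(mockAggregate({ command : "put", mark : "x" }, true));
board.onRecognition(mockAggregate({ command : "put", mark : "x", cell : "5" }, true));
var preview = board.highlighted;          // "5"
board.onRecognition(mockAggregate({ command : "put", mark : "x", cell : "5" }, false));
// board.marks is now { "5" : "x" } and nothing is highlighted
```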