WAMI 2.0 Javascript API
Authors: Ian McGraw, Alex Gruenstein, Chris Varenhorst, Andrew Sutherland
Once you've generated a key for your web page, you simply add the following <script> tag inside the <head> of your HTML page:
<script src=http://wami.csail.mit.edu/wami-2.0.js></script>
We do not recommend downloading the JS itself, but instead linking to it just as above.
In your onload handler, construct a new Wami.App as follows. It's probably easier to just look at the examples, but you can use this section for a reference of all the options.
// A typical Wami.App might specify these parameters
// (additional options are described below)
var options = {
    guiID : 'gui-div',
    devKey : 'YOUR_KEY',
    grammar : {...},
    onReady : function() {...},
    onRecognition : function(result) {...}
};
var wamiApp = new Wami.App(options);
guiID -- The parent element under which the default GUI will be placed. If you do not specify this, you are expected to provide your own GUI. Note that the footprint for the default GUI is not fixed for desktop browsers because Flash will insert a security settings panel if microphone permissions have not yet been granted. To deal with the security panel separately see the onSecurity handler.
grammar -- This object tells the speech recognizer what to "listen" for. You can specify it in the options, and/or update it dynamically through the setGrammar method described later. Currently we support three languages: English (en-us), Mandarin Chinese (zh), Japanese (ja). For English, all words in the grammars should be lowercase. One nifty feature of WAMI is that you can ask for incremental recognition results, and they will come back to you as the user is talking!
The main type of grammar WAMI supports lets you specify, in a compact way, the typical sentences you expect your users to say. Read about the Java Speech Grammar Format (JSGF) to see how to specify a grammar of the jsgf type.
var grammar = {
    language : "en-us",
    grammar : "#JSGF V1.0; .... ",
    type : "jsgf",
    aggregate : true,
    incremental : false
}
Tags are supported in JSGFs to help you embed semantics into your grammar:
<show> = (show {[command=show]} <animal>)+;
<animal> = dog {[animal=1]} | cat {[animal=2]};
If you have specified aggregate to be true in your grammar, your recognition results will come back with additional JSON key-values based on the semantic tags found in the recognition result. Since a tag can occur multiple times due to loops in the grammar, a special split key (which defaults to "command") determines how the key-values are broken up. You can change the default split tag by adding a split : 'mykey' to the grammar. Concrete examples are given below.
Weights are supported in the JSGF grammar. Weights are treated as counts, so the following two rules are equivalent:
<animal> = /2/ dog | /4/ cat;
<animal> = /4/ dog | /8/ cat;
Another grammar type we support is a corpus grammar. For this grammar, the text of the grammar key is simply a list of newline-separated sentences. An n-gram is dynamically compiled on the server to construct a recognition "domain" corresponding roughly to the sentences provided.
var grammar = {
    language : "en-us",
    grammar : "sentence one\n" + "sentence two\n",
    type : "corpus"
}
We will soon be providing the ability to upload grammars, which will then correspond to a cached ID.
var grammar = {
    language : "en-us",
    grammar : "ABC123CACHEID",
    type : "cached"
}
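As an illustration of the corpus format described above, here is a minimal sketch of a helper that turns an array of sentences into a corpus grammar object. The function name makeCorpusGrammar is hypothetical and not part of the WAMI API; it only assumes the newline-separated format and the lowercase rule for English stated earlier.

```javascript
// Hypothetical helper (not part of WAMI): builds a corpus grammar object
// from an array of example sentences.
function makeCorpusGrammar(sentences, language) {
    var lines = [];
    for (var i = 0; i < sentences.length; i++) {
        // Collapse whitespace so each sentence stays on a single line.
        var s = String(sentences[i]).replace(/\s+/g, ' ').trim();
        if (s.length === 0) {
            continue;   // skip empty entries
        }
        // For English, all words in the grammar should be lowercase.
        if (language === 'en-us') {
            s = s.toLowerCase();
        }
        lines.push(s);
    }
    return {
        language : language,
        // One sentence per line, each terminated by a newline.
        grammar : lines.length > 0 ? lines.join('\n') + '\n' : '',
        type : 'corpus'
    };
}

var corpusGrammar = makeCorpusGrammar(['Sentence one', '  sentence   two '], 'en-us');
// corpusGrammar.grammar is "sentence one\nsentence two\n"
```

The resulting object could then be passed as the grammar option or to setGrammar.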
onReady -- The function you specify as your onReady handler will be called once the Wami.App has finished initializing. This means that microphone security settings have been taken care of and recording can begin. There is an optional argument (sessionid) to the function, which uniquely identifies the WAMI session.
onRecognition -- This handler will be called every time a new recognition result comes back to the browser. The result argument is in a format described below.
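Because recording can only begin once onReady has fired, one possible pattern is to ignore talk-button presses until then. The sketch below is only an illustration: makeTalkControl and its methods are hypothetical names, and the only WAMI calls it relies on are the record() and stop() methods of Wami.App documented in this reference.

```javascript
// Hypothetical helper (not part of WAMI): a hold-to-talk control that
// ignores presses until the onReady handler has been called.
function makeTalkControl() {
    var app = null;        // set once the Wami.App has been constructed
    var ready = false;     // becomes true when onReady fires
    var sessionId = null;  // optional session id passed to onReady
    return {
        attach : function(wamiApp) { app = wamiApp; },
        onReady : function(sessionid) {
            ready = true;
            sessionId = (sessionid === undefined) ? null : sessionid;
        },
        // Call when the talk button is pressed; returns true if recording started.
        press : function() {
            if (!ready || !app) { return false; }
            app.record();
            return true;
        },
        // Call when the talk button is released; returns true if stop() was sent.
        release : function() {
            if (!ready || !app) { return false; }
            app.stop();
            return true;
        },
        sessionId : function() { return sessionId; }
    };
}

// Possible wiring in a page (left commented out because it needs the WAMI script):
// var control = makeTalkControl();
// options.onReady = control.onReady;
// control.attach(new Wami.App(options));
```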
Wami.App Methods
wamiApp.record(); -- Start recording and sending audio to the recognizer.
wamiApp.play(url); -- Play an audio file from a URL. The audio file must be a .WAV.
wamiApp.stop(); -- Stop playing or recording.
wamiApp.playback(); -- Play back the last audio your app recorded.
wamiApp.setGrammar(grammar); -- Change the grammar. A second optional parameter, incremental, will configure the recognizer to send or stop sending incremental results accordingly.
Advanced options for the constructor
onEvent -- This handler listens to events over the course of the WAMI application's lifetime. The application is always in one of three states, and one of six events can occur in each state.
onEvent = function(state, event, data) {
    if (state == Wami.state.XXX && event == Wami.event.YYY) {
        // do something
    }
}
// possible states
Wami.state.IDLE
Wami.state.PLAY
Wami.state.RECORD
// possible events
Wami.event.STARTING
Wami.event.STARTED
Wami.event.ALIVE
Wami.event.STOPPING
Wami.event.STOPPED
Wami.event.ERROR
The first event this handler receives is IDLE/STARTING, fired while the application is starting up. Once the application has initialized, the IDLE/STARTED event will fire, followed by IDLE/ALIVE. IDLE/ALIVE is the default condition of the application when nothing else is going on.
If the application begins recording, RECORD/STARTING will occur, followed by RECORD/STARTED. There may be a short delay between the two while a connection is made for audio streaming. When the recording is stopped (e.g. via wamiApp.stop()), the RECORD/STOPPING and RECORD/STOPPED events will fire. A similar sequence of events would occur for Wami.state.PLAY.
If anything goes wrong while the application is in a particular state, the Wami.event.ERROR event will fire.
onTimeout -- Sessions with the recognizer can't last forever. After a few minutes of inactivity the session times out and recording will no longer send audio to the recognizer. Use this handler to notify your users to refresh the page.
onError -- If an error occurs (perhaps due to a poorly specified grammar), this handler is called with one or two arguments: type, message.
onSecurity -- This handler gives you complete control over the placement of the security DIV. It is called when the browser is denied OR granted microphone access. Currently the user is required to click "remember" for the security settings to be valid, so you must ensure that this happens.
onSecurity : function() {
    var security = wamiApp.settings(Wami.settings.MICROPHONE);
    if (!security.granted() || !security.remembered()) {
        security.show('SecurityDiv');
    }
}
incremental -- If true, the onRecognition handler will be called not just with settled recognition results, but also with incremental recognition results as the speaker is talking.
environment -- Some third-party Javascript APIs (e.g. the Google Maps API) use the same AJAX techniques WAMI does, so their operation can at times conflict. To fix this, host an additional HTML file in the same domain, and specify its location in this environment option. The HTML file must contain the following:
<html>
<head>
<script>
buildScript = function(src) {
var el = document.createElement('script');
el.src = src;
document.getElementsByTagName('head')[0].appendChild(el);
}
</script>
</head>
</html>
Wami.App Context Methods
var c = result.context("expected transcript"); -- To help us out (and perhaps to improve your recognition results), create context objects when you have a good idea of what the user has spoken and report them to us so we can use them for our training algorithms. If the context begins with #JSGF, it is assumed to be a grammar. If you have a way of tracking individual users at your site, it would be great if you could provide the optional second parameter speakerid, which is just any ID uniquely identifying that user.
wamiApp.report([context1, context2, {...}]); -- You can batch the contexts you create from your recognition results into an array and report them at your convenience. You can also report arbitrary objects for logging, but it's unlikely they will be used for anything.
wamiApp.setDefaultContext(c); -- If you come across a situation where multiple utterances have roughly the same context (but where the recognition grammar is much broader), you can specify a default context to be associated with each utterance. Suppose, for instance, you were creating a flight reservation system. If you display a set of flights, you expect your user to choose one of those flights. If they say something outside of this context, it's likely a misrecognition; if it's inside this context, the recognizer probably got it right.
Speech Recognition Results Format
Speech recognition results are returned as a JSON structure, which is simple to handle in Javascript.
Case 1: I just want the words the user said, after he finishes speaking
If you want a single speech recognition result at the end of each user utterance, with no semantic interpretation, you'll need the following settings:
grammar.incremental == false
grammar.aggregate == false
Then, in your onRecognition handler, you will receive a result object with the following properties:
result.settled() == true;
result.uttID() == 0; // zero-indexed
result.count() >= 1; // # hypotheses
result.text(); // top hypothesis
result.text(i); // ith hypothesis
// E.g. to alert the best guess of what was said:
onRecognition : function(result) {
    alert(result.text());
}
Case 2: I just want the words the user said, but I want to get results as she speaks
This capability is referred to as incremental speech recognition: as the user speaks, we'll periodically tell you our best guess about what the user has said so far. To turn on incremental speech recognition results, specify it in your options as follows:
options.incremental == true
grammar.aggregate == false
Imagine the user says "Hello wami nice to meet you". Your onRecognition handler will then be called multiple times with successively updated results.
// Result 0
result.settled() == false;
result.uttID() == 0; // remains the same throughout
result.text() == "hello";
// Result 1
result.settled() == false;
result.text() == "hello wami nice";
// Result 2
result.settled() == false;
result.text() == "hello wami nice to meet";
// Result 3
result.settled() == false;
result.text() == "hello wami nice to meet you";
// Result 4
result.settled() == true;
result.text(0) == "hello wami nice to meet you";
result.text(1) == "hello wami nice to meet you";
result.text(2) == "hello wami it is nice to meet you";
Note that the full N-best list is only available for the settled recognition result (as indicated by the settled() method).
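One way to consume such a sequence is sketched below: interim results update a running caption, and the N-best list is read only once the result is settled. The helper name makeIncrementalHandler and the two callbacks are hypothetical; the sketch uses only the settled(), count(), and text() methods described above.

```javascript
// Hypothetical helper (not part of WAMI): builds an onRecognition handler
// that forwards interim text to showInterim and the final N-best list
// to showFinal.
function makeIncrementalHandler(showInterim, showFinal) {
    return function(result) {
        if (!result.settled()) {
            // Interim result: only the current best guess is used.
            showInterim(result.text());
            return;
        }
        // Settled result: the full N-best list is available.
        var hypotheses = [];
        for (var i = 0; i < result.count(); i++) {
            hypotheses.push(result.text(i));
        }
        showFinal(hypotheses);
    };
}

// Possible wiring in a page; showCaption and showChoices stand for your own
// functions that update the DOM:
// options.incremental = true;
// options.onRecognition = makeIncrementalHandler(showCaption, showChoices);
```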
Case 3: I don't care what words a person said, just what the person meant.
In the tic-tac-toe example application, you can say "put an X in square 5" or "put an X in cell 5". We don't care whether you say "square" or "cell". The JSGF grammars allow you to embed semantic tags to make this easy. Here is the complete grammar:
#JSGF V1.0;
grammar TicTacToe;
public <top> = (<command> [and])+ ;
<command> = <put> | <erase> ;
<put> = put {[command=put]} (<mark>+ (in ([<cellname>] <cell>)+))+ ;
<erase> = erase {[command=erase]} ([<cellname>] <cell>)+ ;
<mark> = an x {[mark=x]} | an oh {[mark=o]} ;
<cellname> = cell | box | square ;
<cell> = one {[cell=1]}
| two {[cell=2]}
| three {[cell=3]}
| four {[cell=4]}
| five {[cell=5]}
| six {[cell=6]}
| seven {[cell=7]}
| eight {[cell=8]}
| nine {[cell=9]}
;
Notice the curly braces, which contain simple semantic tags inside of them like [cell=5]. These special tags will be returned as output with the speech recognition result, like this:
put [command=put] an x [mark=x] in square 5 [cell=5]
The WAMI aggregator makes it easy to extract these key/value sequences. You turn on the aggregator with the following settings:
options.incremental = false;
grammar.aggregate = true;
Then, your settled speech recognition result will look like this:
result.settled() == true
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5]"
result.text(1) == "put [command=put] an x [mark=x] in square 6 [cell=6]"
result.get("command") == "put"
result.get("mark") == "x"
result.get("cell") == "5"
result.get("cell") == "6"
What's going on? WAMI has pulled out the key/value pairs for you and put them directly in the recognition result. A single utterance can, actually, lead to multiple aggregated results. This is controlled by the split property of a jsgf grammar:
options.incremental == false
grammar.aggregate == true
grammar.split == "command"
put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an oh [mark=o] in square 6 [cell=6]
You'll actually receive TWO recognition results. First, you'll receive one for the first command, then you'll receive one for the second command. This is because the split is "command" (actually this is the default anyway), so whenever the aggregator sees a "command" key, it will split the recognition result into multiple, aggregated results.
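To make the splitting behavior concrete, here is a rough sketch of how tagged text could be grouped into aggregates on a split key. This is only an illustration of the idea, not WAMI's actual aggregator, and the function name splitAggregates is hypothetical. In particular, it assumes that a repeated non-split key simply overwrites the earlier value, which may differ from what WAMI does.

```javascript
// Illustrative only (not WAMI's implementation): groups the [key=value]
// tags found in a tagged recognition string into one object per split key.
function splitAggregates(taggedText, splitKey) {
    splitKey = splitKey || 'command';       // "command" is the default split
    var tagPattern = /\[(\w+)=([^\]]*)\]/g; // matches tags such as [cell=5]
    var aggregates = [];
    var current = null;
    var match;
    while ((match = tagPattern.exec(taggedText)) !== null) {
        var key = match[1];
        var value = match[2];
        // Start a new aggregate at each split key (or at the very first tag).
        if (key === splitKey || current === null) {
            current = {};
            aggregates.push(current);
        }
        // Assumption: a repeated key overwrites the earlier value.
        current[key] = value;
    }
    return aggregates;
}

var aggregates = splitAggregates(
    'put [command=put] an x [mark=x] in square 5 [cell=5] and ' +
    'put [command=put] an oh [mark=o] in square 6 [cell=6]'
);
// aggregates.length is 2: one object per "command" tag
```

Each object plays the role of one aggregated result: the first holds command "put", mark "x", cell "5", and the second holds command "put", mark "o", cell "6".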
Case 4: Putting it together: incremental recognition results with the aggregator
The combination of incremental speech recognition with the aggregator can be very powerful. The tic-tac-toe application is a great example of using both together. It is configured like this:
options.incremental == true
grammar.aggregate == true
grammar.split == "command"
In Tic Tac Toe, for example, suppose the user is speaking and says the following (without letting go of the hold-to-talk button):
put [command=put] an x [mark=x] in square 5 [cell=5] ...
// Result 1
result.settled() == false
result.text() == "put [command=put] an x [mark=x] in square"
result.get("command") == "put"
result.get("mark") == "x"
result.get("cell") == null
result.partial() == true
// Result 2
result.settled() == false
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5]"
result.get("command") == "put"
result.get("mark") == "x"
result.get("cell") == "5"
result.partial() == true
Notice that partial() is still true in the second result. This indicates that the user has not yet finalized this command, either by releasing the button or by saying a new command. If they let go of the button, a similar recognition result in which partial() is false will occur. However, they could also continue to speak the next command (without letting go).
put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an oh [mark=o] in cell 6 [cell=6]
// Result 3
result.settled() == false
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an oh [mark=o]"
result.get("command") == "put"
result.get("mark") == "x"
result.get("cell") == "5"
result.partial() == false
The utterance still is not over (note settled() is still false), but now a new command aggregate can begin. Subsequent recognition results are shown below.
// Result 4
result.settled() == false
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an"
result.get("command") == "put"
result.get("mark") == null
result.get("cell") == null
result.partial() == true
// Result 5
result.settled() == false
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an oh [mark=o]"
result.get("command") == "put"
result.get("mark") == "o"
result.get("cell") == null
result.partial() == true
// Result 6
result.settled() == false
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an oh [mark=o] in cell 6 [cell=6]"
result.get("command") == "put"
result.get("mark") == "o"
result.get("cell") == "6"
result.partial() == true
// Result 7 (recording has stopped)
result.settled() == true
result.text() == "put [command=put] an x [mark=x] in square 5 [cell=5] and put [command=put] an oh [mark=o] in cell 6 [cell=6]"
result.get("command") == "put"
result.get("mark") == "o"
result.get("cell") == "6"
result.partial() == false
In tic-tac-toe, this distinction between partial being true or false is first used to determine whether cell 5 should be highlighted (if partial()=true), or whether the X should actually be put in the cell (partial()=false). The same is then done for O. It's a nice way to show that you have understood what the user said, but you haven't actually done it yet. It also makes your Wami.App feel much more responsive.
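The highlight-then-commit behavior described above could be written roughly as follows. This is a sketch rather than the tic-tac-toe application's actual code: makeBoardHandler and the board object with its highlight and place methods are hypothetical stand-ins for your own UI code, and only "put" commands are handled. The sketch relies only on the get() and partial() methods shown in the results above.

```javascript
// Hypothetical helper (not part of WAMI): turns aggregated recognition
// results into board updates.
function makeBoardHandler(board) {
    return function(result) {
        var mark = result.get('mark');
        var cell = result.get('cell');
        // Wait until the aggregate names both a mark and a cell.
        if (result.get('command') !== 'put' || mark == null || cell == null) {
            return;
        }
        if (result.partial()) {
            board.highlight(cell, mark);   // the command may still change
        } else {
            board.place(cell, mark);       // the command is final
        }
    };
}

// A trivial in-memory board, used only to demonstrate the handler.
var board = {
    highlighted : {},
    placed : {},
    highlight : function(cell, mark) { this.highlighted[cell] = mark; },
    place : function(cell, mark) {
        delete this.highlighted[cell];
        this.placed[cell] = mark;
    }
};
var onRecognition = makeBoardHandler(board);
```

With options.incremental and grammar.aggregate both enabled, passing onRecognition in the options would highlight a cell while partial() is true and place the mark once partial() becomes false.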