14 January 2013
A good understanding of JavaScript and jQuery as well as familiarity with Google's Chrome Canary browser will help you make the most of this article.
Intermediate
There are plenty of cutting-edge open web technologies to go around these days, but only two have the potential to enable users to interact with the web using only their voices. The APIs in question are Speech Input and Web Speech, both of which are experimental, draft W3C proposals designed to unlock the power of voice on the web by exposing speech synthesis and recognition capabilities to web developers. With these APIs, developers can create voice-driven web searches, speech-based site interaction paradigms, and even spoken responses from the browser. These capabilities represent a giant leap for the web, and in addition to providing convenience benefits to all users, they have the potential to take web accessibility to a whole new level.
In this tutorial, you'll learn more about these APIs, and you'll get a firsthand look at the features of each that are already finding their way into the Chrome and Chrome Canary browsers. As I walk through each API, I'll show you how to turn an everyday, interactive Twitter search page into a voice-enabled application. By the end of this article, you'll have a better understanding of how speech services are evolving in the browser, and where you can learn more about these exciting advancements.
If you're familiar with the W3C and the web's standardization process, you know that it often takes considerable behind-the-scenes work to get new standards and specifications supported consistently in all browsers. Speech is no different. Even though much of what I cover in this article is new and cutting-edge in terms of browser support, the foundations of in-browser speech were laid more than two years ago with the formation of the W3C HTML Speech Incubator Group. Since that time, the community has seen the Speech Input API, a final report and recommendations from the Speech Incubator Group, the formation of the Speech API Community Group, and finally, the publication of the Web Speech API Specification.
In late 2010, shortly after the Speech Incubator Group was formed, Google submitted the Speech Input API Specification for consideration. This spec centered on the addition of a speech attribute to the HTML input element. When present, a supporting browser could use this attribute to provide a voice entry option to users.
In early 2011, Google added support for the Speech Input API to Chrome, with the x-webkit-speech vendor-prefixed attribute. Adding this attribute to any text input field causes Chrome to add a microphone icon to that field (see Figure 1).

When a user clicks the icon, a popup window appears, prompting for a spoken search query (see Figure 2). When the user speaks, Chrome takes the input from the microphone and passes it off to Google's own speech recognition web service. That service analyzes the byte stream and passes back a JSON object with a list of possible matches. Chrome then takes the best match, pops it into the search box and triggers a search.

This implementation of voice search in Chrome prompted a great deal of interest when it was first released, and you can still see it in action today by visiting Google.com from any version of Chrome. More valuable than buzz and interest, however, was the feedback from developers that Google got after implementing the Speech Input API in their browser. According to public archives at lists.w3.org, the key feedback that Google received was that developers wanted a JavaScript-based API for speech and finer-grained control over the presentation of speech input controls. Essentially, developers told Google that the Speech Input API was a nice start, but that a better proposal was needed.
As a follow-up to that developer feedback, and in response to the final report published by the W3C Speech Incubator group, Google revised and expanded upon their original spec and presented the Speech JavaScript API Specification in late 2011. Rather than propose HTML elements and attributes for consideration, as Speech Input had, this spec was focused solely on new JavaScript APIs for speech recognition.
In the spring of 2012, the Speech API W3C Community Group was formed to produce a JavaScript Speech API that addressed many of the use cases identified by the W3C Speech Incubator group's final report. The starting point for their work was Google's Speech JavaScript API, which was renamed the Web Speech API.
The Web Speech API consists of three main feature areas: speech recognition, speech synthesis, and custom grammars.
At the time of this writing, the SpeechRecognition object is the only major feature of the spec implemented in any browser.
The Web Speech API Specification was finalized in October of 2012. The Speech API Community group is now collecting member commitments and, once done, will present the spec to the W3C for Working Group consideration. If accepted, the Web Speech API will enter the formal standardization process, which could mean implementation from multiple browsers in the near future.
As with Speech Input, you don't have to wait for the W3C to try out the Web Speech API. Thankfully, Google has already implemented some of the Web Speech API in Chrome Canary, a nightly, yet stable, build of the Google Chrome browser. These APIs are now also available in Chrome Beta.
In this article, I'll demonstrate the approaches of both APIs so that you can get a feel for how speech in the browser is evolving.
For the main demo in this article, I'm going to add speech capabilities to a Twitter search page on a fictional conference and speaker site I named Faceplant (see Figure 3). This page, which I call "The Buzz Page", enables the user to search Twitter's public timeline and displays the matching tweets (along with each user's profile photo) in a paged list. For the main functionality of this page, I'm using several features of Kendo UI, including the ListView, Pager, and DataSource, to connect to the Twitter API. To keep this tutorial focused on the Speech API itself, I'm not going to spend much time covering how Kendo UI is being used in this app, but you are free to learn more by downloading the source for this article, or grabbing the demo from GitHub.

As I mentioned earlier, the Speech Input API is exposed through a new speech attribute on the HTML input element. To enable speech input on my search box, I need only to add the attribute, as well as any vendor-prefixed variants:
<input id="query" type="search" class="k-input k-textbox"
x-webkit-speech speech />
Just as with experimental CSS properties, it's a good idea to include the spec-defined attribute, speech, along with the experimental version Chrome currently supports. Once I've added this attribute, I can view my fancy new speech-enabled search box in any version of Chrome (see Figure 4).
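Because the attribute is experimental, it can be worth checking for support before relying on the microphone icon being there. The following is a minimal sketch, not part of the demo source. It assumes that a supporting browser exposes either a `speech` property or an `onwebkitspeechchange` handler on input elements (an assumption based on the prefixed event name used in this article), and `supportsSpeechInput` is a hypothetical helper name.

```javascript
// Hypothetical helper: returns true when the given input element appears to
// support speech input. The property names checked here are assumptions.
function supportsSpeechInput(input) {
    return 'speech' in input || 'onwebkitspeechchange' in input;
}

// In a browser, pass a real element:
// supportsSpeechInput(document.createElement('input'));
```

A page could use the result to hide speech-related hints when the attribute would simply be ignored.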

When I click on the microphone icon, I'll get a popup prompting me for speech input, just like the one shown in Figure 2. When Chrome has data from the microphone, it calls a web service to turn the audio into text. After obtaining the best match from the service and placing that value in the search input, the browser fires the speechchange event. To trigger a search based on that event, I'll need to listen for it.
var query = $('#query'),
    ds;
query.on('webkitspeechchange speechchange', function() {
    ds.read();
});
ds = new kendo.data.DataSource({
    transport: {
        read: {
            url: "http://search.twitter.com/search.json",
            dataType: "jsonp",
            data: {
                q: function() {
                    return query.val();
                }
            }
        }
    },
    schema: {
        data: "results",
        total: "results_per_page"
    },
    pageSize: 5
});
In the code above, I obtain a reference to my search box from jQuery. Then, I use jQuery's on method to listen for the prefixed and non-prefixed versions of the speechchange event. When that event fires, I call the read() method of my Kendo UI DataSource object, which is also shown in the code above. The key things to pay attention to in the DataSource are the url property, which specifies a remote endpoint to fetch JSON data, and the data.q property, which pulls the current value of the query search box each time the DataSource is refreshed, as it does when the read() method is called. Thus, if I speak the phrase "html5" and the Speech Input API adds that phrase to the search box, calling ds.read() will pull data from http://search.twitter.com/search.json?q=html5. Both the ListView and Pager controls in Figure 4 are bound to my DataSource, so when I get tweets about HTML5 back from the Twitter API, those controls will be automatically refreshed (see Figure 5).
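To make the role of the function-valued data.q property more concrete, here is a small illustration of the general idea: parameters supplied as functions are evaluated each time a request is built, so the URL always reflects the current search text. This is only a sketch and not Kendo UI's actual implementation; `buildRequestUrl` and `currentQuery` are hypothetical names, and the callback parameter that a JSONP request would add is ignored.

```javascript
// Illustration only: this is not how Kendo UI is implemented internally.
// Parameters given as functions are evaluated at request time.
function buildRequestUrl(url, data) {
    var pairs = Object.keys(data).map(function (key) {
        var value = typeof data[key] === 'function' ? data[key]() : data[key];
        return encodeURIComponent(key) + '=' + encodeURIComponent(value);
    });
    return pairs.length ? url + '?' + pairs.join('&') : url;
}

var currentQuery = 'html5'; // stands in for query.val()
var data = { q: function () { return currentQuery; } };

var first = buildRequestUrl('http://search.twitter.com/search.json', data);

// Simulate the search box changing before the next read()
currentQuery = 'web speech';
var second = buildRequestUrl('http://search.twitter.com/search.json', data);
```

Because the function is called again for every request, `first` and `second` carry different q values even though the same `data` object is reused.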

The Speech Input API is the easiest way to get started with in-browser speech. However, with the introduction of the Web Speech API, there's been little advancement of this API in a while. It's still built into Chrome, and the Chromium dev team has even been working to unify the base code for both of the speech APIs, so I don't see Speech Input going away any time soon. But since much of the recent standards work has been focused on the Web Speech API, the rest of the article will explore its nascent features.
To experiment with implemented features of the Web Speech API, you'll need to install Chrome Canary, the nightly version of the Chrome browser, or Chrome Beta. Canary can run side-by-side with Chrome stable, so you're not replacing your browser, just adding another one. That's not the case with Chrome Beta.
At the time of this writing, Chrome Canary and Chrome Beta enable the Web Speech API by default (rather than behind a flag), but only the speech recognition portions of the spec have been implemented. Speech synthesis and custom grammars have not made their way into Canary yet, but I'm hoping that they will soon.
Using the same Twitter search page as my example, I start working with Web Speech by constructing a new SpeechRecognition object.
var SpeechRecognition = window.SpeechRecognition ||
                        window.webkitSpeechRecognition ||
                        window.mozSpeechRecognition ||
                        window.oSpeechRecognition ||
                        window.msSpeechRecognition;
if (SpeechRecognition) {
    var recognition = new SpeechRecognition();
    recognition.maxAlternatives = 5;
}
Since Web Speech is implemented in Chrome with a vendor prefix, I do the responsible thing and future-proof my SpeechRecognition object. At the moment, only webkitSpeechRecognition is implemented, but when another browser implementation comes along, or the prefix is removed, I'll have built-in support ready to go. This is, of course, assuming the API doesn't change, which is always a real possibility with new API proposals.
Once I've got my SpeechRecognition object, I create a new instance via its constructor, and then use the maxAlternatives property to instruct the API to return no more than five alternative matches for each speech result. Now, I'm ready to register a couple of events and kick things off.
The latest version of the Web Speech API defines eleven event handlers related to speech capture and recognition. This includes events that fire when the API starts and stops, when speech is detected, when speech results are ready, and when an error occurs. Here are some example handlers:
var speak = $('#speak');
recognition.onaudiostart = function() {
    speak.val("Speak now...");
};
recognition.onnomatch = function() {
    speak.val("Try again please...");
};
recognition.onerror = function() {
    speak.val("Error. Try Again...");
};
The speak variable holds a reference to the Click To Speak button in Figure 3. As a quick form of visual feedback, I change the text of this button when I have a message for the user. First, I watch for the audiostart event, which will fire once the speech recognition engine is ready to receive input. The Speech API is permission-based, as are all APIs that require access to external devices like microphones and webcams. As such, when the user triggers speech capture, a permissions window will pop up in the browser (see Figure 6). It's only after the user allows the microphone to be used that speech capture is initiated, and the audiostart event can be used to inform the user that recording has started.

The other two events I've captured here are the nomatch and error events. As their names indicate, these events are triggered when the speech service cannot find a reasonable match or when an error occurs, respectively. Error types are defined in the SpeechRecognitionError object that is passed via the error event.
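The generic "Error. Try Again..." message can be refined by inspecting the error code carried by the event. The sketch below assumes the error codes listed in the Web Speech API draft (such as "no-speech", "audio-capture", "not-allowed", and "network") and that the code is exposed on the event's `error` property. The helper name `messageForError` and the message texts are my own and not part of the demo.

```javascript
// Messages for a few error codes defined in the Web Speech API draft.
// The wording of each message is made up for this example.
var ERROR_MESSAGES = {
    'no-speech': 'No speech detected. Try again...',
    'audio-capture': 'No microphone found.',
    'not-allowed': 'Microphone access was denied.',
    'network': 'Network error. Try again...'
};

// Hypothetical helper: maps an error event to a user-facing message,
// falling back to the generic message for unknown or missing codes.
function messageForError(event) {
    var code = event && event.error;
    return Object.prototype.hasOwnProperty.call(ERROR_MESSAGES, code)
        ? ERROR_MESSAGES[code]
        : 'Error. Try Again...';
}

// Possible usage, with recognition and speak as defined in this article:
// recognition.onerror = function (event) {
//     speak.val(messageForError(event));
// };
```

Keeping the lookup in a plain function makes it easy to exercise without a microphone or a live recognition session.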
The next event I want to capture is the result event, where all the magic happens.
recognition.onresult = function(event) { // SpeechRecognitionEvent
    if (event.results.length > 0) {
        // Alternatives are ordered by confidence level, highest-confidence item first
        var results = event.results[0], // SpeechRecognitionResult
            topResult = results[0]; // SpeechRecognitionAlternative
        if (topResult.confidence > 0.5) {
            speechSearch(results, topResult);
        } else {
            speak.val("Try again please...");
        }
    }
};
The first thing I do is make sure the event has the data I need. The result event passes in a SpeechRecognitionEvent object (see Figure 7), which contains a SpeechRecognitionResultList object in the results property. Before doing anything, I need to make sure this list has at least one item.

If it does, I continue and grab the first result, which is a SpeechRecognitionResult object. By default, the Speech API performs single-shot recognition operations, meaning that recording is stopped after the first voice capture. In this case, the SpeechRecognitionResultList will have only one item. On the other hand, when continuous capture is enabled, which is covered in the next section, the list will contain an entry for every result captured during the current session.
The SpeechRecognitionResult object contains a collection of SpeechRecognitionAlternative objects, which represent a list of possible matches returned by the speech web service (see Figure 7). Each alternative contains the text of the possible match (transcript) and a number from 0 to 1 representing how confident the service was that the transcript matches the spoken recording (confidence). The alternatives are ordered by confidence, so the best match will always be the first.
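Since a result is an indexed, array-like collection rather than a true array, it can be handy to copy its alternatives into plain objects before working with them. This is a minimal sketch using a mock result with made-up values. `toAlternatives` is a hypothetical helper, and it relies only on the length, transcript, and confidence members described above.

```javascript
// Copies the alternatives of a SpeechRecognitionResult-like object into a
// plain array of { transcript, confidence } objects, preserving their order.
function toAlternatives(result) {
    var alternatives = [],
        i;
    for (i = 0; i < result.length; i++) {
        alternatives.push({
            transcript: result[i].transcript,
            confidence: result[i].confidence
        });
    }
    return alternatives;
}

// Mock result standing in for event.results[0]; the values are made up.
var mockResult = {
    length: 2,
    0: { transcript: 'html5', confidence: 0.92 },
    1: { transcript: 'html 5', confidence: 0 }
};

var alternatives = toAlternatives(mockResult);
var bestMatch = alternatives[0]; // first entry is the highest-confidence match
```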
If the first item has a confidence value greater than 0.5, or 50%, I call the speechSearch method:
function speechSearch(results, topResult) {
    var alts = "",
        i,
        len,
        tmpl;
    // Apply the transcript to the speech search input and
    // trigger its click event
    query.val(topResult.transcript);
    search.click();
    speak.val("Click to Speak");
    // Display a list of remaining results in "Did you mean?" style
    tmpl = kendo.template($("#altTemplate").html());
    for (i = 1, len = results.length; i < len; i++) {
        alts += tmpl({
            alternative: results[i].transcript,
            confidence: Math.floor(results[i].confidence*100)
        });
    }
    // Show the alternative results list
    list.append(alts).fadeIn("slow");
    // When an alternative result is clicked, apply its value
    // to the search input and trigger the click event
    list.find('a').on('click', function(e) {
        e.stopPropagation();
        var id = e.currentTarget.id;
        query.val(id);
        search.click();
    });
}
There are two important pieces to highlight in this method. First is the main search, which places the transcript value of the top result into the search input field (the same one used for the Speech Input API example) and then triggers the click event on the Search button. The Search button will refresh the Kendo DataSource, which searches Twitter using my spoken query.
Once I've triggered a search with the top speech result, I'd like to display the remaining alternatives to the user, and allow them to click on any of these to trigger a search for that term instead of the one I've chosen by default. So I loop through each remaining alternative and apply its transcript and confidence values to a Kendo Template, which will display my alternatives in a list with a link for each alternative. Finally, I bind each link to a callback which, when clicked, will perform the search for that term instead. With just a bit of work, I now have a voice-controlled search that displays alternatives to the user. In Figure 8, below, the 0% values next to the alternatives indicate that Google did not have confidence these words matched the spoken text.

In the previous example, I used single-shot capture to record the user's voice once, and I worked with the result. Also available in the Web Speech API is continuous capture, which will keep the user's microphone on and recording, and will process results and fire the result event at regular intervals. For my Twitter search example, I can use continuous mode to enable the user to trigger new searches or to page through the results list using their voice.
Continuous mode can be enabled via the continuous property, which I set when the user clicks the Enable Continuous Mode button.
var continuous = $('#continuousMode'),
    continuousMode = false;
continuous.on('click', function() {
    if (continuousMode) {
        recognition.stop();
        recognition.continuous = false;
        continuous.val("Enable Continuous Mode");
        continuousMode = false;
    } else {
        recognition.continuous = true;
        recognition.start();
        continuous.val("Disable Continuous Mode");
        continuousMode = true;
    }
});
In order to process a continuous stream of results, I'll need to modify my result callback (omitted here for brevity) to call a new speechInteract function and provide the latest entry in the SpeechRecognitionResultList collection.
function logSpeechCommand(msg) { … }
function speechInteract(results, topResult) {
    var commandWords = topResult.transcript.trim().split(" "),
        len = commandWords.length,
        firstWord = commandWords[0],
        lastWord = commandWords[len-1],
        pager,
        pageCommands,
        currentPage,
        term;
    if (firstWord === "search" && len > 1) {
        term = commandWords.slice(1).join(" ");
        logSpeechCommand("Requested search for " + term);
        query.val(term);
        search.click();
    } else { // User is attempting to page through results
        if (lastWord === "page") {
            pager = tweetPager.data("kendoPager");
            currentPage = pager.page();
            pageCommands = {
                first: function() {
                    pager.page(1);
                },
                next: function() {
                    if (currentPage !== pager.totalPages()) {
                        pager.page(currentPage + 1);
                    }
                },
                previous: function() {
                    if (currentPage !== 1) {
                        pager.page(currentPage - 1);
                    }
                },
                last: function() {
                    pager.page(pager.totalPages());
                }
            };
            if (commandWords[0] in pageCommands) {
                logSpeechCommand("Requested pager move to " +
                    commandWords[0] + " page");
                pageCommands[commandWords[0]]();
            } else {
                logSpeechCommand("Did not recognize page command");
            }
        }
    }
}
For speech interaction, I start by placing each word in the transcript into an array. If the first word is "search," I know the user wants to perform a voice search, so I'll trigger a search using the remaining words in the transcript. If, on the other hand, the last word of the transcript is "page," I know the user is attempting to page through the results list. If the word before "page" is "first," "next," "previous," or "last," I can tell the Kendo Pager control to update its current page accordingly. To provide a bit more feedback to the user, I display a message each time a command is received from the continuous stream (see Figure 9).
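The modified result callback is not shown in this article, so the following is only a guess at its shape rather than the code from the demo. It sketches how the most recent entry in the results list might be picked out in continuous mode. `latestTopAlternative` is a hypothetical helper, and the wiring in the trailing comment assumes the recognition, continuousMode, and speechInteract names used earlier.

```javascript
// Hypothetical helper: returns the top alternative of the most recent result
// in a SpeechRecognitionEvent-like object, or null when nothing is available.
function latestTopAlternative(event) {
    var results = event.results,
        latest;
    if (!results || results.length === 0) {
        return null;
    }
    latest = results[results.length - 1];
    return latest.length > 0 ? latest[0] : null;
}

// Mock event with two captured results; the values are made up.
var mockEvent = {
    results: [
        [{ transcript: 'search html5', confidence: 0.9 }],
        [{ transcript: 'next page', confidence: 0.8 }]
    ]
};
var latestCommand = latestTopAlternative(mockEvent);

// Possible wiring in the page:
// recognition.onresult = function (event) {
//     var top = latestTopAlternative(event);
//     if (continuousMode && top) {
//         speechInteract(event.results[event.results.length - 1], top);
//     }
// };
```

Reading from the end of the list matters in continuous mode because, as described earlier, the list holds an entry for every result captured during the session.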

In-browser speech recognition might be cutting edge, but it's not just a parlor trick. Its applications go far beyond enabling voice-triggered web searches. As I showed in the second example, which I encourage you to try for yourself in the downloadable source, the Web Speech API is a powerful tool that can be used to open up new avenues for user interaction with the browser. With web speech, users will be able to use their voices for interactions like:
Beyond these, the Web Speech API also defines compelling use cases for text-to-speech output (language translation, for instance) as well as the ability to define custom grammars for tightly-defined speech interaction scenarios.
Web Speech certainly has the potential to benefit all web users, but nowhere is the potential greater than for disabled users. Beyond all of the reasons and features I've listed above, I believe that the greatest promise of Web Speech is making the web even more accessible, in a way that screen readers cannot replicate today. That's a big promise for Web Speech to deliver, but I believe the W3C and browser implementers are up to the task.
This article covered in-browser speech, starting with a short history of speech API specifications. It focused on two APIs, Speech Input and Web Speech, and walked through an example of how you can use both of these in your applications today with the Google Chrome and Chrome Canary browsers. For Web Speech, the examples showed using both one-shot speech requests and continuous speech capture.
Web Speech is definitely cutting edge, but it has the potential to be a fast-tracked specification in the near future. Since the technology is implemented in Chrome today, I strongly encourage you to check it out for yourself. And since the spec is new, there's plenty of time to jump into the Speech API Community Group to ask questions, make suggestions, and add your voice to the standardization process.
To learn more, see the following resources:
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License. Permissions beyond the scope of this license, pertaining to the examples of code included within this work are available at Adobe.