Enhancing User Experience With The Web Speech API

It’s an exciting time for web APIs, and one to watch out for is the Web Speech API. It enables websites and web apps not only to speak to you, but to listen, too. It’s still early days, but this functionality is set to open a whole array of use cases. I’d say that’s pretty awesome.

In this article, we’ll look at the technology and its proposed usage, as well as some great examples of how it can be used to enhance the user experience.

Image credit: Sebastian Schöld

Disclaimer: This technology is pretty cutting-edge, and the specification is currently with the W3C as an “unofficial editor’s draft” (as of 6 June 2014). The likelihood that usage will differ slightly from the code snippets in this article is high. Checking the specification and testing thoroughly before releasing code are always wise.

Speech Synthesis

The API comes in two parts. To start, let’s look at speech synthesis, the part that speaks to you. If your website has some textual content (body copy, form inputs, alt text and so on), you could run a few functions and the device would speak the words to the user.

Let’s look at some of the code needed to make this happen. First, you would create a new instance of the SpeechSynthesisUtterance interface. Next, you would specify the text to be spoken. Finally, you would add this instance to the speech queue, which tells the browser what to speak and when.

Below I have wrapped all of this in a function for us to call, named speak, with the text we want spoken as a parameter.

function speak(textToSpeak) {
   // Create a new instance of SpeechSynthesisUtterance
   var newUtterance = new SpeechSynthesisUtterance();

   // Set the text
   newUtterance.text = textToSpeak;

   // Add this text to the utterance queue
   window.speechSynthesis.speak(newUtterance);
}

All we need to do now is call this function and pass in some words to be spoken:

speak('Welcome to Smashing Magazine');

More functionality is available. Through window.speechSynthesis, you can pause, resume and cancel the queue, and on each SpeechSynthesisUtterance you can set the language, rate and voice. Starting, pausing, resuming or finishing an utterance fires an event that you can hook into, and speechSynthesis fires a voiceschanged event when the list of available voices changes. Plenty to play around with!
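
As a rough sketch of those options, the helper below builds an utterance with a language, a rate and an end-of-speech callback. The name createUtterance, the options object and its defaults are hypothetical and not part of the API; the constructor can be passed in as an optional argument only so that the function can be tried outside a browser.

```javascript
// Hypothetical helper: builds a configured utterance.
// `UtteranceCtor` is optional; it defaults to the browser's constructor.
function createUtterance(text, options, UtteranceCtor) {
  var Ctor = UtteranceCtor ||
    (typeof SpeechSynthesisUtterance !== 'undefined' ? SpeechSynthesisUtterance : null);

  if (!Ctor) {
    return null; // Speech synthesis is not available here
  }

  options = options || {};
  var utterance = new Ctor();
  utterance.text = text;
  utterance.lang = options.lang || 'en-GB'; // language tag (assumed default)
  utterance.rate = options.rate || 1;       // 1 is the normal speaking rate
  utterance.onend = options.onEnd || null;  // called when the utterance finishes
  return utterance;
}

// In a supporting browser:
// var utterance = createUtterance('Welcome to Smashing Magazine', { rate: 0.8 });
// window.speechSynthesis.speak(utterance);
// window.speechSynthesis.pause();  // pause the queue
// window.speechSynthesis.resume(); // carry on speaking
// window.speechSynthesis.cancel(); // empty the queue
```

Returning null when the constructor is missing lets calling code fall back quietly in browsers without speech synthesis.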

At the moment, speech synthesis is supported only in Chrome and Safari (both on desktop and mobile devices). Also, the voices available to you via the API largely depend on the operating system. Google has its own set of default voices for Chrome, available on Mac OS X, Windows and Ubuntu. On Mac OS X, however, the system voices are also available in Chrome and are thus the same as those in Safari on OS X. You can easily see which voices are available by calling speechSynthesis.getVoices() in the Developer Tools console.
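
A minimal sketch of listing the voices from a script rather than the console. The helper name listVoiceNames is hypothetical, and the synth parameter exists only so the function can be tried without a browser. Some browsers fill the list asynchronously, so an empty array on the first call is possible; the voiceschanged event signals when to ask again.

```javascript
// Hypothetical helper: returns "name (lang)" for each available voice.
function listVoiceNames(synth) {
  synth = synth || (typeof speechSynthesis !== 'undefined' ? speechSynthesis : null);

  if (!synth) {
    return []; // Speech synthesis is not available here
  }

  return synth.getVoices().map(function (voice) {
    return voice.name + ' (' + voice.lang + ')';
  });
}

// In a supporting browser:
// console.log(listVoiceNames());
// window.speechSynthesis.onvoiceschanged = function () {
//   console.log(listVoiceNames());
// };
```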


Tip: If you’re on OS X, check out the voice “Zarvox.”

Speech Recognition

The other part of the Web Speech API is speech recognition, which enables the user to speak into the device’s microphone and have their speech recognized by the website or web app.

Let’s run through some code. This time, we’ll create a new instance of the SpeechRecognition interface. Because this part is supported only in Chrome, we’ll have to include the webkit prefix.

var newRecognition = new webkitSpeechRecognition();
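
Since support is limited, it can help to check for the constructor before using it. The sketch below looks for an unprefixed SpeechRecognition first and falls back to the webkit-prefixed one. The name getRecognitionConstructor is a hypothetical helper, the scope argument is only there so it can be tried outside a browser, and the language value is an assumption.

```javascript
// Hypothetical helper: finds a speech recognition constructor, if any.
function getRecognitionConstructor(scope) {
  scope = scope || (typeof window !== 'undefined' ? window : {});
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

var Recognition = getRecognitionConstructor();

if (Recognition) {
  var recognition = new Recognition();
  recognition.lang = 'en-GB'; // assumed language for this sketch
} else {
  console.log('Speech recognition is not supported in this browser.');
}
```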

SpeechRecognition comes with quite a few attributes. One that we are likely to change is continuous, whose default state of false means that the browser will stop listening after a break in speech. If you want your website or web app to keep listening, then set the attribute to true:

newRecognition.continuous = true;

To start and stop speech recognition, call the start() and stop() methods:

// start recognition
newRecognition.start();

// stop recognition
newRecognition.stop();

Again, we can hook into plenty of events, such as soundstart, speechstart, result and error. I have prepared a demo that shows how to access the detected words from the result event. The code goes on to match the spoken words against some simple navigation, activating the appropriate link if a match is found.
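
A rough sketch of that idea, not the demo’s actual code: read the latest transcript from a result event and look each word up in a navigation map. The navigation object and both function names are hypothetical.

```javascript
// Hypothetical navigation map: spoken word -> URL fragment
var navigation = { home: '#home', about: '#about', contact: '#contact' };

// Pull the most recent transcript out of a result event
function latestTranscript(event) {
  var result = event.results[event.results.length - 1];
  return result[0].transcript.trim().toLowerCase();
}

// Return the target for the first spoken word found in the map, or null
function matchNavigation(transcript, nav) {
  var words = transcript.split(/\s+/);
  for (var i = 0; i < words.length; i++) {
    if (Object.prototype.hasOwnProperty.call(nav, words[i])) {
      return nav[words[i]];
    }
  }
  return null;
}

// Wire it up only where recognition exists
if (typeof webkitSpeechRecognition !== 'undefined') {
  var navRecognition = new webkitSpeechRecognition();
  navRecognition.onresult = function (event) {
    var target = matchNavigation(latestTranscript(event), navigation);
    if (target) {
      window.location.hash = target;
    }
  };
  navRecognition.onerror = function (event) {
    console.log('Recognition error: ' + event.error);
  };
}
```

Matching on whole words keeps the lookup simple; a real page would probably need to handle phrases and near-misses as well.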

Uses

Dictation

At the moment, the most common use of the Speech API is as a dictation or reading mechanism. That is, the user speaks into the mic and the device converts the speech into text (as demoed by Chrome’s development team), or the user passes in text to be read out by the device.
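
For dictation, the result event carries a list of results, each flagged as final or still interim. Here is a minimal sketch of collecting only the finalized text. The helper appendFinalText is hypothetical, and the usage comments assume the newRecognition object from earlier with interimResults switched on.

```javascript
// Hypothetical helper: appends newly finalized transcripts to `existingText`.
function appendFinalText(existingText, event) {
  var text = existingText;

  // resultIndex marks the first result that changed in this event
  for (var i = event.resultIndex; i < event.results.length; i++) {
    if (event.results[i].isFinal) {
      text += event.results[i][0].transcript;
    }
  }

  return text;
}

// In a supporting browser:
// var dictated = '';
// newRecognition.continuous = true;
// newRecognition.interimResults = true;
// newRecognition.onresult = function (event) {
//   dictated = appendFinalText(dictated, event);
// };
```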

Having a device speak out some information definitely has its advantages. Imagine your mirror telling you what the weather will be like first thing in the morning.

Plenty of car manufacturers have installed text-to-speech capabilities over the last couple of years. Imagine, in the not-too-distant future, your browser’s reading list being read out to you as you drive.

Voice Control

Dictation could easily be turned into voice control, as we saw with the recognition demo above, which could be modified to allow for navigation around a website. Add it to web-enabled TVs and we might just be living in the 2015 of Back to the Future 2.

I’m fortunate to work with some very talented colleagues, one of whom created a tennis scoring app. I was delighted to find that he could control the app with his voice, speaking the score out loud as he was playing a game.

Translation

Translation would look very different when done in real time. Someone could converse in one language, and another person’s device would speak out what is being said in their own language. Hook that up to a Bluetooth earpiece and eat your heart out, Arthur Dent. We’re getting a little closer to each person having their own Babel fish.

Limitations

Offline capability needs more consideration. As it stands, Chrome sends the recorded audio to its servers and pings back the result. Thus, an Internet connection is needed for it to work — not ideal.
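
Because recognition depends on a network round trip, it is worth handling failure explicitly. The sketch below maps a few of the error codes that the specification defines for the error event to messages. The function name and the message wording are made up for illustration.

```javascript
// Hypothetical helper: turns a recognition error code into a message.
function describeRecognitionError(code) {
  var messages = {
    'network': 'A network connection is needed for speech recognition.',
    'no-speech': 'No speech was detected. Please try again.',
    'not-allowed': 'Permission to use the microphone was denied.',
    'audio-capture': 'No microphone could be found.'
  };

  if (Object.prototype.hasOwnProperty.call(messages, code)) {
    return messages[code];
  }
  return 'Speech recognition failed (' + code + ').';
}

// In a supporting browser:
// newRecognition.onerror = function (event) {
//   console.log(describeRecognitionError(event.error));
// };
```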

Conclusion

Nevertheless, it is still exciting, and the technology is opening up. I look forward to the day when looking for the remote is a thing of the past, and I can just tell the TV to stream the latest Sin City movie.

Would we actually use the web for this? Why not? It’s already universal. You can take the web and its speech wherever you go.

I have met some resistance when talking about this API. People either can’t see a need for it with the web, or they would feel uncomfortable talking to their device — both valid views. However, I hope I have inspired you to at least give it a go and think about it the next time you are building something. Start welcoming speech: It might be just what you’re listening for.


Ruth John has been wireframing, designing and coding for over a decade. She also tweets and blogs a bit. You can often find her chatting about web development, building apps and how an extra div is not the answer to your styling problems. Either that or the lesser-known Thundercats characters.

  1. 1

    Evandro Guedes

    December 5, 2014 5:07 pm

    Great article!

  2. 2

    Cool article – it will be interesting to see useful, practical usages of this tech in the coming months/years. The need to be online suggests there are still questions around this tech being adopted en masse on the move.

    The Xbox One speech recognition is already pretty good at the remoteless TV experience… in case that helps, by the way! :)


  3. 4

    Jan Skovgaard

    December 5, 2014 9:23 pm

    Nice article, Ruth – I think that there are many perspectives on the speech API, especially when using voice recognition for dictation. For instance, a speech-enabled rich text editor would increase the speed of content generation in a CMS.

    I know that in some Danish institutions speech to text is already being used as an alternative to using the keyboard for writing notes and documents. It increases productivity indeed and I can’t wait to give the API a spin :)

  4. 5

    What the heck happened to web standards?

  5. 6

    The web will be all audio and widgets in the not too distant future. It’s the only way to make it work across the proliferation of all the devices we come across. Scrolling web pages are just not good enough to cut it on your car’s dashboard, Google Glass and smart TVs. We need to be able to listen and glance. I’m finishing work on a next-generation platform that incorporates web speech like in the movie Her. Drop me a line if you want beta access.

  6. 7

    Glauber Ramos

    December 8, 2014 3:23 pm

    There is a nice web component for this API.

  7. 8

    Thanks so much for this post. I have an upcoming project that this ties in to really well!

  9. 10


    This is so futuristic! …like…
    …1997! Microsoft Speech API (SAPI), Microsoft Agent, Tweedy, Robot, Merlin, etc…

  10. 12

    Regarding the translation part – I’ve written a simple app that does exactly that (although it’s rather “almost” real-time :)) – Of course you need 2 people to try it out and currently it works only in Chrome, but hopefully Firefox will have web speech recognition pretty soon as well (they’re already working on it!).

  11. 13

    I saw a great domain name for sale that could probably be put to nice use as a voice/speech API. It is at:

    Wish I had money to purchase. I am sure the $1 opening bid is not going to meet that reserve. Who knows, maybe I’ll bid.

  12. 14

    Great article !

  13. 15

    I love the idea of this. XBOX already has something in place, so I’m not sure televisions are too far off. I’d love to see it applied in an actual gaming experience. It certainly seems like a logical progression as interactive applications become more rich.

  14. 16

    I would like to hear your comments on using this API to increase productivity in a variety of applications, for example:
    a) Remote control of equipment through the use of voice detected commands
    b) Enhancing marketing personalization and instigation of virality. Could I increase the conversion on a product like a pdf converter or drag more social likes if they were voice controlled?

    Thank you

  15. 17

    I created a drop in module that implements this API for use in text s:

