
Experimenting With speechSynthesis

I’ve been thinking a lot about speech for the last few years. In fact, it’s been a major focus in several of my talks of late, including my well-received Smashing Conference talk “Designing the Conversation” [1]. As such, I’ve been keenly interested in the development of the Web Speech API [2].

If you’re unfamiliar, this API gives you (the developer) the ability to voice-enable your website in two directions: listening to your users via the SpeechRecognition interface [3] and talking back to them via the SpeechSynthesis interface [4]. All of this is done via a JavaScript API, making it easy to test for support. This testability makes it an excellent candidate for progressive enhancement, but more on that in a moment.

A lot of my interest stems from my own personal desire to experiment with new ways of interacting with the web. I’m also a big fan of podcasts and love listening to great content while I’m driving and in other situations where my eyes are required elsewhere or are simply too tired to read. The Web Speech API opens up a whole range of opportunities to create incredibly useful and natural user interactions by being able to listen for and respond with natural language:

– Hey Instapaper, start reading from my queue!

– Sure thing, Aaron…

The possibilities created by this relatively simple API set are truly staggering. There are applications in accessibility, Internet of Things, automotive, government, the list goes on and on. Taking it a step further, imagine combining this tech with real-time translation APIs (which also recently began to appear). All of a sudden, we can open up the web to millions of people who struggle with literacy or find themselves in need of services in a country where they don’t read or speak the language. This. Changes. Everything.

But back to the Web Speech API. As I said, I’d been keeping tabs on the specification for a while, checked out several of the demos and such, but hadn’t made the time to play yet. Then Dave Rupert finally spurred me to action with a single tweet:

Tweet by Dave Rupert [5]

Within an hour or so, I’d gotten a basic implementation together for my blog [6] that would enable users to listen to a blog post [7] rather than read it. A few hours later, I had added more features, but it wasn’t all wine and roses, and I ended up having to back some functionality out of the widget to improve its stability. But I’m getting ahead of myself.

I’ve decided to hit the pause button for a few days to write up what I’ve learned and what I still don’t fully understand in the hope that we can begin to hash out some best practices for using this awesome feature. Maybe we can even come up with some ways to improve it.

Hello, World

So far, my explorations into the Web Speech API have been wholly in the realm of speech synthesis. Getting to “Hello world” is relatively straightforward and merely involves creating a new SpeechSynthesisUtterance (which is what you want to say) and then passing that to the speechSynthesis object’s speak() method:

var to_speak = new SpeechSynthesisUtterance('Hello world!');
window.speechSynthesis.speak(to_speak);

Not all browsers support this API, although most modern ones do [8]. That being said, to avoid throwing errors, we should wrap the whole thing in a simple conditional that tests for the feature’s existence before using it:

if ( 'speechSynthesis' in window ) {
  var to_speak = new SpeechSynthesisUtterance('Hello world!');
  window.speechSynthesis.speak(to_speak);
}

See the Pen Experimenting with `speechSynthesis`, example 1 [9] by Aaron Gustafson (@aarongustafson) [10] on CodePen [11].

Once you’ve got a basic example working, there’s quite a bit of tuning you can do. For instance, you can tweak the reading speed by adjusting the SpeechSynthesisUtterance object’s rate property. It accepts values from 0.1 to 10. I find 1.4 to be a pretty comfortable speed; anything over 3 just sounds like noise to me.

See the Pen Experimenting with `speechSynthesis`, example 1 [12] by Aaron Gustafson (@aarongustafson) [13] on CodePen [14].

You can also tune things such as the pitch [15] and the volume [16] of the voice, even the language being spoken [17] and the voice itself [18]. I’m a big fan of defaults in most things, so I’ll let you explore those options on your own time. For the purpose of my experiment, I opted to change the default rate to 1.4, and that was about it.
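As a rough sketch of how those knobs fit together, consider a small helper that applies only the options you pass. The `tuneUtterance` name is made up for illustration; the commented ranges reflect the MDN documentation for each property.

```javascript
// Hypothetical helper: copies only the options you pass onto the
// utterance, leaving everything else at the synthesizer's default.
function tuneUtterance( utterance, options ) {
  if ( options.rate   !== undefined ) utterance.rate   = options.rate;   // 0.1–10, default 1
  if ( options.pitch  !== undefined ) utterance.pitch  = options.pitch;  // 0–2, default 1
  if ( options.volume !== undefined ) utterance.volume = options.volume; // 0–1, default 1
  if ( options.lang  ) { utterance.lang  = options.lang; }  // a BCP 47 tag, e.g. 'en-US'
  if ( options.voice ) { utterance.voice = options.voice; } // a SpeechSynthesisVoice object
  return utterance;
}

// In the browser (guarded so the helper above stays testable elsewhere):
if ( typeof window !== 'undefined' && 'speechSynthesis' in window ) {
  window.speechSynthesis.speak(
    tuneUtterance( new SpeechSynthesisUtterance('Hello world!'), { rate: 1.4 } )
  );
}
```

Because the helper only writes the properties you supply, everything else falls back to the synthesizer’s defaults, which keeps with the spirit of sticking to defaults where possible.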

A Basic Implementation: Play And Pause

When I began working with this code on my own website, I was keen to provide four controls for my readers:

  • play
  • pause
  • increase reading speed
  • decrease reading speed

The first two were relatively easy. The latter two caused problems, which I’ll discuss shortly.

To kick things off, I parroted the code Dave had tweeted:

var to_speak = new SpeechSynthesisUtterance(
  document.querySelector('main').textContent
);
window.speechSynthesis.speak(to_speak);

This code grabs the text content (textContent) of the main element and converts it into a SpeechSynthesisUtterance. It then triggers the synthesizer to speak that content. Simple enough.

Of course, I didn’t want the content to begin reading immediately, so I set about building a user interface to control it. I did so in JavaScript, within the feature-detection conditional, rather than in HTML, because I did not want the interface to appear if the feature was not available (or if JavaScript failed for some reason). That would be frustrating for users.

I created the buttons and assigned some event handlers to wire up the functionality. My first pass looked something like this:

var $buttons = document.createElement('p'),
    $button = document.createElement('button'),
    $play = $button.cloneNode(),
    $pause = $button.cloneNode(),
    paused = false,
    to_speak;

if ( 'speechSynthesis' in window ) {

  // content to speak
  to_speak = new SpeechSynthesisUtterance(
    document.querySelector('main').textContent
  );

  // set the rate a little faster than 1x
  to_speak.rate = 1.4;

  // event handlers
  to_speak.onpause = function(){
    paused = true;
  };

  // button events
  function play() {
    if ( paused ) {
      paused = false;
      window.speechSynthesis.resume();
    } else {
      window.speechSynthesis.speak( to_speak );
    }
  }
  function pause() {
    window.speechSynthesis.pause();
  }

  // play button
  $play.innerText = 'Play';
  $play.addEventListener( 'click', play, false );
  $buttons.appendChild( $play );
  
  // pause button
  $pause.innerText = 'Pause';
  $pause.addEventListener( 'click', pause, false );
  $buttons.appendChild( $pause );

} else {

  // sad panda
  $buttons.innerText = 'Unfortunately your browser doesn’t support this feature.';

}

document.body.appendChild( $buttons );

This code creates a play button and a pause button and appends them to the document. It also assigns the corresponding event handlers. As you’d expect, the play button calls speechSynthesis.speak(), as we saw earlier, but because pause is also in play, I set it up to either speak the selected text or resume speaking — using speechSynthesis.resume() — if the speech is paused. The pause button controls that by triggering speechSynthesis.pause(). I tracked the state of the speech engine using the boolean variable paused. You can kick the tires of this code over on CodePen [19].

I want to (ahem) pause for a moment to tuck into the speak() command, because it’s easy to misunderstand. At first blush, you might think it causes the supplied SpeechSynthesisUtterance to be read aloud from the beginning, which is why I’d want to resume() after pausing. That is true, but it’s only part of it. The speech synthesis interface actually maintains a queue for content to be spoken. Calling speak() pushes a new SpeechSynthesisUtterance to that queue and causes the synthesizer to start speaking that content if it’s not already speaking. If it’s in the process of reading something already, the new content takes its spot at the back of the queue and patiently waits its turn. If you want to see this in action, check out my fork of the reading speed demo [20].

If you want to clear the queue entirely at any time, you can call speechSynthesis.cancel(). When testing speech synthesis with long-form content, having this at the ready in the browser’s console is handy.
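To make the queue semantics concrete, here’s a sketch using a stand-in object in place of window.speechSynthesis. The fake synthesizer and its exposed queue array are purely illustrative; the real interface keeps its queue private.

```javascript
// Illustrative stand-in for window.speechSynthesis; the real object
// keeps its queue private, but this one exposes it so we can watch it.
var fakeSynth = {
  queue: [],
  speak: function( utterance ) {
    // speak() appends to the queue; it never interrupts what's
    // already being read
    this.queue.push( utterance.text );
  },
  cancel: function() {
    // cancel() stops the current utterance and empties the queue
    this.queue = [];
  }
};

fakeSynth.speak( { text: 'First in line.' } );
fakeSynth.speak( { text: 'Second, read after the first finishes.' } );
// fakeSynth.queue now holds both strings, in order

fakeSynth.cancel();
// fakeSynth.queue is now empty
```

The takeaway: two back-to-back speak() calls don’t talk over each other; the second utterance waits for the first to finish, and cancel() is the only way to flush everything at once.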

Taking It Further: Adjusting Reading Speed

As I mentioned, I also wanted to give users control over the reading speed used by the speech synthesizer. We can tune this using the rate property on a SpeechSynthesisUtterance object. That’s fantastic, but you can’t (currently, at least) adjust the rate of a SpeechSynthesisUtterance once the synthesizer starts playing it — not even while it’s paused. I don’t know enough about the inner workings of speech synthesizers to know whether this is simply an oversight in the interface or a hard limitation of the synthesizers themselves, but it did force me to find a creative way around this limitation.

I experimented with a bunch of different approaches to this and eventually settled on one that works reasonably well, despite the fact that it feels like overkill. But I’m getting ahead of myself again.

Every SpeechSynthesisUtterance object offers a handful of events you can plug into to do various things. As you’d expect, onpause [21] fires when the speech is paused, onend [22] fires when the synthesizer has finished reading, and so on. The SpeechSynthesisEvent [23] object passed to each of these includes information about what’s going on with the synthesizer, such as the position of the virtual cursor (charIndex [24]), the length of time since the current SpeechSynthesisUtterance started being read (elapsedTime [25]), and a reference to the SpeechSynthesisUtterance itself (utterance [26]).

Originally, my plan to allow for real-time reading-speed adjustment was to capture the virtual cursor position via a pause event so that I could stop and start a new recording at the new speed. When the user adjusted the reading speed, I would pause the synthesizer, grab the charIndex, backtrack in the text to the previous space, slice from there to the end of the string to collect the remainder of what should be read, clear the queue, and start the synthesizer again with the remainder of the content. That would have worked, and it should have been reliable, but Chrome kept giving me a charIndex of 0, and in Edge it was always undefined. Firefox tracked charIndex perfectly. I’ve filed a bug for Chromium [27] and one for Edge [28], too.

Thankfully, another event, onboundary [29], fires whenever a word or sentence boundary is reached. It’s a little noisier, programmatically speaking, than onpause because the event fires so often, but it reliably tracked the position of the virtual cursor in every browser that supports speech synthesis, which is what I needed.

Here’s the tracking code:

var progress_index = 0;

to_speak.onboundary = function( e ) {
  if ( e.name == 'word' ) {
    progress_index = e.charIndex;
  }
};

Once I was set up to track the cursor, I added a numeric input to the UI to allow users to change the speed:

var $speed = document.createElement('p'),
    $speed_label = document.createElement('label'),
    $speed_value = document.createElement('input');

// label the field
$speed_label.innerText = 'Speed';
$speed_label.htmlFor = 'speed_value';
$speed.appendChild( $speed_label );

// insert the form control
$speed_value.type = 'number';
$speed_value.id = 'speed_value';
$speed_value.min = '0.1';
$speed_value.max = '10';
$speed_value.step = '0.1';
$speed_value.value = Math.round( to_speak.rate * 10 ) / 10;
$speed.appendChild( $speed_value );

document.body.appendChild($speed);

Then, I added an event listener to track when it changes and to update the speech synthesizer:

function adjustSpeed() {
  // cancel the original utterance
  window.speechSynthesis.cancel();
  
  // find the previous space
  var previous_space = to_speak.text.lastIndexOf( ' ', progress_index );
  
  // get the remains of the original string
  to_speak.text = to_speak.text.slice( previous_space );
  
  // math to 1 decimal place
  var speed = Math.round( $speed_value.value * 10 ) / 10;
  
  // adjust the rate
  if ( speed > 10 ) {
    speed = 10;
  } else if ( speed < 0.1 ) {
    speed = 0.1;
  }
  to_speak.rate = speed;

  // return to speaking
  window.speechSynthesis.speak( to_speak );
}

$speed_value.addEventListener( 'change', adjustSpeed, false );

This works reasonably well, but ultimately I decided that I was not a huge fan of the experience, nor was I convinced it was really necessary, so this functionality remains commented out in my website’s source code [30]. You can make up your mind after seeing it in action over on CodePen [31].

Taking It Further: Tweaking What’s Read

At the top of every blog post, just after the title, I include quite a bit of metadata about the post, including things like the publication date, tags for the post, comment and webmention counts, and so on. I wanted to selectively control which content from that collection is read because only some of it is really relevant in that context. To keep the configuration out of the JavaScript and in the declarative markup where it belongs, I opted to have the JavaScript look for a specific class name, “dont-read”, and exclude those elements from the content that would be read. To make it work, however, I needed to revisit how I was collecting the content to be read in the first place.

You may recall that I’m using the textContent property to extract the content:

var to_speak = new SpeechSynthesisUtterance(
  document.querySelector('main').textContent
);

That’s all well and good when you want to grab everything, but if you want to be more selective, you’re better off moving the content into memory so that you can manipulate it without causing repaints and such.

var $content = document.querySelector('main').cloneNode(true);

With a clone of main in memory, I can begin the process of winnowing it down to only the stuff I want:

var to_speak = new SpeechSynthesisUtterance(),
    $content = document.querySelector('main').cloneNode(true),
    $skip = $content.querySelectorAll('.dont-read');

// don’t read
Array.prototype.forEach.call( $skip, function( $el ){
  $el.innerHTML = '';
});

to_speak.text = $content.textContent;

Here, I’ve separated the creation of the SpeechSynthesisUtterance to make the code a little clearer. Then, I’ve cloned the main element ($content) and built a nodeList of elements that I want to be ignored ($skip). I’ve then looped over the nodeList — borrowing Array’s handy forEach method — and set the contents of each to an empty string, effectively removing them from the content. At the end, I’ve set the text property to the cloned main element’s textContent. Because all of this is done to the cloned main, the page remains unaffected.

Done and done.

Taking It Further: Synthetic Pacing Tweaks

Sadly, the value of a SpeechSynthesisUtterance can only be text. If you pipe in HTML, it will read the tag names and slashes. That’s why most of the demos use an input to collect what you want read or rely on textContent to extract text from the page. The reason this saddens me is that it means you lose complete control over the pacing of the content.
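To see the problem, imagine stripping the markup yourself before handing the string over. The regex below is a deliberately naive sketch for illustration; in a real page, the cloneNode/textContent approach used earlier in this article is the robust way to get plain text out of HTML.

```javascript
// Naive illustration only: strip tags so the synthesizer never sees
// them; otherwise "<p>" would be read aloud as "less than p greater than".
function toSpeakableText( markup ) {
  return markup
    .replace( /<[^>]*>/g, '' )  // drop the tags themselves
    .replace( /\s+/g, ' ' )     // collapse leftover whitespace
    .trim();
}

toSpeakableText( '<p>Hello, <em>world</em>!</p>' );
// → 'Hello, world!'
```

A regex will mangle edge cases (comments, CDATA, angle brackets in attributes), which is exactly why letting the browser parse the markup and reading back textContent is preferable.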

But not all is lost. Speech synthesizers are pretty awesome at recognizing the effect that punctuation should have on intonation and pacing. To go back to the first example I shared, consider the difference when you drop a comma between “hello” and “world”:

if ( 'speechSynthesis' in window ) {
  var to_speak = new SpeechSynthesisUtterance('Hello, world!');
  window.speechSynthesis.speak(to_speak);
} 

See the Pen Experimenting with `speechSynthesis`, example 2 [32] by Aaron Gustafson (@aarongustafson) [33] on CodePen [34].


Here’s the original again, just so you can compare:

See the Pen Experimenting with `speechSynthesis`, example 1 [35] by Aaron Gustafson (@aarongustafson) [36] on CodePen [37].

With this in mind, I decided to tweak the pacing of the spoken prose by artificially inserting commas into the specific elements that follow the pattern I just showed for hiding content:

var $pause_before = $content.querySelectorAll(
  'h2, h3, h4, h5, h6, p, li, dt, blockquote, pre, figure, footer'
);

// synthetic pauses
Array.prototype.forEach.call( $pause_before, function( $el ){
  $el.innerHTML = ' , ' + $el.innerHTML;
});

While I was doing this, I also noticed some issues with certain elements running into the content around them. Most notably, this was happening with pre elements. To mitigate that, I used the same approach to swap carriage returns, line breaks and such for spaces:

var $space = $content.querySelectorAll('pre');

// spacing out content
Array.prototype.forEach.call( $space, function( $el ){
  $el.innerHTML = ' ' + $el.innerHTML.replace(/[\r\n\t]/g, ' ') + ' ';
});

With those tweaks in place, I’ve been incredibly happy with the listening experience. If you’d like to see all of this code in context, head over to my GitHub repository [38]. The code you use to drop the UI into the page will likely need to be different from what I did, but the rest of the code should be plug-and-play.

Is speechSynthesis Ready For Production?

As it stands right now, the Web Speech API has not become a standard and isn’t even on a standards track [39]. It’s an experimental API, and some of the details of the specification remain in flux. For instance, the elapsedTime property of a SpeechSynthesisEvent originally tracked milliseconds and then switched to seconds. If you were doing math that relied on that number to do something else in the interface, you might get wildly different experiences in Chrome (which still uses milliseconds) and Edge (which uses seconds).
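Until that discrepancy is resolved, one defensive option is to normalize the value yourself. This is a guess-based sketch: the 1,000 threshold is an assumption of mine, not anything from the spec, and it would misclassify genuinely enormous second counts.

```javascript
// Hypothetical normalizer (assumption: spoken content rarely runs for
// over a thousand seconds, so larger values are treated as milliseconds).
function elapsedSeconds( event ) {
  var t = event.elapsedTime;
  return ( t > 1000 ) ? t / 1000 : t;
}

elapsedSeconds( { elapsedTime: 2500 } ); // → 2.5 (treated as milliseconds)
elapsedSeconds( { elapsedTime: 2.5 } );  // → 2.5 (already seconds)
```

A heuristic like this is brittle by nature; the real fix is for implementations to converge on the specified unit.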

If I were granted one wish for this specification (apart from standardization), it would be real-time speed, pitch and volume adjustment. I can understand the need to restart things to get the text read in another voice, but the others feel like they should be manipulable in real time. Then again, I don’t know anything about the inner workings of speech synthesizers, so that might not be technically possible.

In terms of actual browser implementations, basic speech synthesis like I’ve covered here is pretty solid in browsers that support the API [40]. As I mentioned, Chrome and Edge currently fail to accurately report the virtual cursor position when speech synthesis is paused, but I don’t think that’s a deal-breaker. What is problematic is how unstable things get when you start to combine features such as real-time reading-speed adjustments, pausing and such. Often, the synthesizer just stops working and refuses to start up again. If you’d like to see that happen, take a look at a demo I set up [41]. Chances are that this issue would go away if the API allowed for real-time manipulation of properties such as rate because you wouldn’t have to cancel() and restart the synthesizer with each adjustment.

Long story short, if you’re looking at this as a progressive enhancement for a content-heavy website and only want the most basic features, you should be good to go. If you want to get fancy, you might be disappointed or have to come up with more clever coding acrobatics than I’ve mustered.

Want To Learn More?

As with most things on the web, I learned a ton by viewing other people’s source, demos and such — and the documentation, naturally. Here are some of my favorites (some of which I linked to in context):

  • The Web Speech API specification [42]
  • MDN’s Web Speech API documentation [43]
  • Browser support data on Can I Use [44]
  • Google Developers’ introduction to the Speech Synthesis API [45]
  • Microsoft Edge’s speech synthesis demo [46]

(rb, yk, il, al)

Footnotes

  1. https://vimeo.com/184234783
  2. https://developer.mozilla.org/docs/Web/API/Web_Speech_API
  3. https://developer.mozilla.org/docs/Web/API/SpeechRecognition
  4. https://developer.mozilla.org/docs/Web/API/SpeechSynthesis
  5. https://twitter.com/davatron5000/status/818493871961341953
  6. https://www.aaron-gustafson.com/notebook/
  7. https://www.aaron-gustafson.com/notebook/insert-clickbait-headline-about-progressive-enhancement-here/
  8. http://caniuse.com/#feat=speech-synthesis
  9. http://codepen.io/aarongustafson/pen/qRZgzx/
  10. http://codepen.io/aarongustafson
  11. http://codepen.io
  12. http://codepen.io/aarongustafson/pen/qRZgzx/
  13. http://codepen.io/aarongustafson
  14. http://codepen.io
  15. https://developer.mozilla.org/docs/Web/API/SpeechSynthesisUtterance/pitch
  16. https://developer.mozilla.org/docs/Web/API/SpeechSynthesisUtterance/volume
  17. https://developer.mozilla.org/docs/Web/API/SpeechSynthesisUtterance/lang
  18. https://developer.mozilla.org/docs/Web/API/SpeechSynthesisUtterance/voice
  19. http://codepen.io/aarongustafson/pen/ygOrNj
  20. http://s.codepen.io/aarongustafson/pen/ggryNo/
  21. https://developer.mozilla.org/docs/Web/API/SpeechSynthesisUtterance/onpause
  22. https://developer.mozilla.org/docs/Web/API/SpeechSynthesisUtterance/onend
  23. https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesisEvent
  24. https://developer.mozilla.org/docs/Web/API/SpeechSynthesisEvent/charIndex
  25. https://developer.mozilla.org/docs/Web/API/SpeechSynthesisEvent/elapsedTime
  26. https://developer.mozilla.org/docs/Web/API/SpeechSynthesisEvent/utterance
  27. https://bugs.chromium.org/p/chromium/issues/detail?id=681026
  28. https://twitter.com/aarongustafson/status/819944910308646913
  29. https://developer.mozilla.org/docs/Web/API/SpeechSynthesisUtterance/onboundary
  30. https://github.com/aarongustafson/aarongustafson.github.io/blob/source/source/_javascript/post/speak.js
  31. https://codepen.io/aarongustafson/pen/vgKpMg?editors=0010
  32. https://codepen.io/aarongustafson/pen/dNMrYq/
  33. http://codepen.io/aarongustafson
  34. http://codepen.io
  35. http://codepen.io/aarongustafson/pen/qRZgzx/
  36. http://codepen.io/aarongustafson
  37. http://codepen.io
  38. https://github.com/aarongustafson/aarongustafson.github.io/blob/source/source/_javascript/post/speak.js
  39. https://dvcs.w3.org/hg/speech-api/raw-file/tip/webspeechapi.html#status
  40. http://caniuse.com/#feat=speech-synthesis
  41. https://codepen.io/aarongustafson/pen/ZLOxLe
  42. https://dvcs.w3.org/hg/speech-api/raw-file/tip/webspeechapi.html
  43. https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API
  44. http://caniuse.com/#feat=speech-synthesis
  45. https://developers.google.com/web/updates/2014/01/Web-apps-that-talk-Introduction-to-the-Speech-Synthesis-API
  46. https://developer.microsoft.com/en-us/microsoft-edge/testdrive/demos/speechsynthesis/


As would be expected from a former manager of the Web Standards Project, Aaron Gustafson is passionate about web standards and accessibility. He has been working on the Web for two decades now and is a web standards advocate at Microsoft, working closely with their browser team. He writes about whatever’s on his mind at aaron-gustafson.com.

  1. Exciting stuff, thanks for the great article!

     I would recommend always defining the language. In your demo, my browser tried to speak the English text with a German accent, which turns out to be even worse than a real German accent. :) Once set to ‘en-US’, the speech quality was surprisingly good (tested in Chrome).

     • Excellent point! I assumed it would automatically pick up the page’s language code rather than the browser’s native one. Seems like something the spec should include… I’ll do some research and see if I missed that or if the spec is lacking.

     • Looks like the default lang should be the doc language. Which browser & OS were you in? And do you think it’s an issue of the language or the voice? I’d like to do a reduced test case and file bugs as necessary if there’s an implementation defect.

       • I was using Chrome on Windows 10. German is the primary language configured for my browser.

         Thanks for the note about the language of the document! That’s probably more elegant than defining it in JavaScript. I believe the document in your CodePen snippet (http://codepen.io/aarongustafson/pen/ygOrNj) does not have a language defined. That’s probably the cause of the issue.

       • What if you have a blockquote in a foreign language, attributed with lang="fr-FR" for example? I couldn’t figure out how to get this working correctly, which would mean changing the voice while reading.
  2. Nice write-up.

     I had the same interest in the SpeechSynthesis API as well and ended up writing a jQuery plugin called Articulate.js. My introductory article detailing its uses can be found over at CSS-Tricks.

     Your long-story-short summation is spot on. At the very least, all of this is fun to play with.
  3. Thanks for the write-up, Aaron.

     I’m very interested in the accessibility uses (and possible issues) this will cause. Hopefully it gets standardised and we can start to see it being used to enhance the experience for those who need it.

     I am wondering whether we’re going to start seeing UX sound designers become a thing if we can start using browser speech and other sounds.
  4. There are a few small issues I found in the Web Speech API’s speech synthesis (other than that, it’s pretty amazing):

     1) If you use any voice other than the default/native one, you might run into text-to-speech stopping mid-read. As a workaround, you can split your text on the full-stop (end-of-sentence) character and create a new utterance per sentence. For more details, have a look here:
     https://bugs.chromium.org/p/chromium/issues/detail?id=369472&can=2&start=0&num=100&q=Web%20Speech%20stops&colspec=ID%20Pri%20M%20Stars%20ReleaseBlock%20Component%20Status%20Owner%20Summary%20OS%20Modified&groupby=&sort=

     2) Some methods might not be available even in the latest versions of browsers. For example, Safari does not implement speechSynthesis.onvoiceschanged, which is typically used to establish when voices are available and speechSynthesis.getVoices() can be run. As a workaround, you simply don’t call this method in Safari and assume that getVoices() can be called safely.
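That workaround might be sketched like this. The helper name `whenVoicesReady` is made up for illustration, and it assumes any synthesizer lacking onvoiceschanged already has its voice list loaded:

```javascript
// Hypothetical helper: wait for voiceschanged where the event exists,
// otherwise assume getVoices() is safe to call right away (Safari case).
function whenVoicesReady( synth, callback ) {
  if ( 'onvoiceschanged' in synth ) {
    synth.onvoiceschanged = function() {
      callback( synth.getVoices() );
    };
  } else {
    callback( synth.getVoices() );
  }
}

// In the browser you would call:
// whenVoicesReady( window.speechSynthesis, function( voices ) { … } );
```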
     Actually, I got so interested in the speech synthesis Web API that I decided to build a product based on this technology.

     Here you can see the early results of my work:
     http://guidely.appbucket.eu/guide/128/location/select

     To understand the idea, please go to:
     http://blog.guidely.appbucket.eu/about/
  5. I made a little file-tagging Electron app using the Web Speech recognition API. Basically, you could scroll through a list of images or PDFs with the down-arrow key and add some tags for the file to a database with your voice. A heck of a lot easier than typing in a bunch of tags. I was pretty impressed with the accuracy, especially with more commonly appearing text like dates. Sometimes it fell short, but not often.

     Interesting article, thanks a lot.
