Speech Recognition
Overview
Aculab Cloud uses Google Speech-to-Text, a multilingual natural language speech recogniser powered by machine learning. In practice, this means you tell it what language it will be hearing and it will do its best to transcribe whatever is said to it. You can optionally provide hint information to adapt the recogniser to words or phrases that are more likely to come up in your application.
In combination with Text To Speech (TTS), our Speech Recognition allows your application to present a natural, conversational interface to the user. This gives you flexibility in driving the conversation, including the use of AI-driven chatbots.
Our Speech Recognition is available for REST applications only (those that use the REST API). It is used in the Get Input, Play, Run Speech Menu, Start Transcription and Stop Transcription actions.
When setting up an application to use speech recognition you need to specify the maximum number of audio streams in its service configuration. See Stream usage.
When using speech recognition, there are several configuration options that, with careful consideration, will improve the resulting transcription, depending on the application and the language being spoken. While we can give guidance here on best practice for setting these options, applications will generally benefit a great deal from experimentation, particularly in setting Word hints and Priority or in using Class tokens.
Languages
Currently, our speech recognition supports over 140 languages and language variants. For the up-to-date list, see Speech Recognition Languages.
Models
Google Speech-to-Text defines a number of models that have been trained on millions of examples of audio from specific sources, for example telephony or video. Recognition accuracy can be improved by using the specialised model that relates to the kind of audio data being analysed.
For live telephone calls, the telephony and telephony_short models are best tuned to the 8kHz-sampled audio data and may produce more accurate transcription results than the latest_long, latest_short or default models.
The telephony_short model is optimised for audio where the spoken responses are single-word or very short phrases.
Premium models
Google have made premium models available for some languages, for specific use cases (e.g. medical_conversation). These models have been optimised to recognise audio data from these specific use cases more accurately. See Speech Recognition Languages for the premium models available in your language.
These are charged at a higher rate than the standard models.
Speech adaptation
Word hints
When starting speech recognition, in addition to specifying the language, the application may optionally provide Word Hints (a set of words or phrases) that guide the recognition towards what is more likely to be said. This reflects the fact that very few conversations are open-ended - the application generally has some prior knowledge of the speech it is expecting to receive.
For example, when asking the caller to pick a colour, including "aquamarine" in the word hints will make the recogniser more likely to transcribe that than "aqua marine".
Similarly, without speech adaptation, the response "I would like to leave a callback" may be transcribed as "... callback" or "... call back". Specifying "callback" as a word hint can help to ensure it is transcribed consistently.
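Taking the two examples above, the word hints themselves are simply a list of words and short phrases, for example:
["aquamarine", "callback"]
How the list is supplied depends on the action being used; see the relevant action reference for the parameter details.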
Please see Google's Speech Adaptation documentation for further information.
Priority
Each word or phrase specified as a word hint can also be given a priority value, which gives it more weight during recognition and can improve accuracy. This has a practical range of 0 - 2000 and sets the boost property (a priority of 2000 sets a boost value of 20).
Setting a higher priority for a word hint can increase the likelihood that the phrase is recognised, but may increase the possibility of other audio being inaccurately recognised as that phrase. Experimenting with different values is recommended to find the best setting for your application.
Setting a higher priority for a word hint applies that priority to the whole phrase. It can improve recognition accuracy if each word hint phrase is kept short (one or two words) so that minor variations in how something is phrased can be accommodated.
For example, giving a higher priority to the whole phrase "my strong preference is for an appointment at the Bedford walk-in clinic" improves the recognition of that whole phrase, word-for-word. Separating the hints into "my preference", "preference", "strong preference", "an appointment", "Bedford walk-in clinic" improves the likelihood that variations in the response are still recognised accurately.
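As a sketch of that clinic example, each short hint phrase might be paired with its own priority. The structure below is illustrative, not the exact parameter format; see the action reference for how priorities are actually supplied:
[
    {"phrase": "my preference", "priority": 500},
    {"phrase": "strong preference", "priority": 500},
    {"phrase": "an appointment", "priority": 1000},
    {"phrase": "Bedford walk-in clinic", "priority": 1500}
]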
Class tokens
Within Word Hints you can provide one or more names of pre-built class tokens that guide the recognition to expect and sensibly format specific classes of phrase, e.g. a digit sequence, a phone number, a day of the month, or a monetary value.
Each language supports its own list of Class tokens.
The class token name is prefixed by a "$" character. For example, to assist the transcription of a telephone number and transcribe it in the format that your language expects, add the "$FULLPHONENUM" class token as a word hint, if your language supports it. For en-US this may result in the number being transcribed in the format "xxx-xxx-xxxx".
You can also embed class tokens in a word hint phrase. For example, you can include an address number in a word hint:
["my address is $ADDRESSNUM"]
However, it is good practice to cater for alternative phrasings and always to include the class token on its own:
["my address is $ADDRESSNUM", "$ADDRESSNUM"]
If you use an invalid or malformed class token, the recognition ignores the token without triggering an error but still uses the rest of the phrase for context.
Our Speech Recognition is not grammar-based speech recognition. So, for example, you can't constrain its output to be four digits, a time or a date. However, armed with word_hints, class tokens and some post-processing of the transcription, it can allow very natural, expressive dialogues and is well matched to the increasingly human-like conversations of modern AI chatbots.
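For instance, a minimal post-processing sketch in Python (assuming the transcription arrives as a plain text string such as "my PIN is 1234") might check for exactly four digits like this:
import re

def extract_four_digits(transcription):
    # Keep only the digits, so "1 2 3 4" and "1234" are treated the same
    digits = re.sub(r"\D", "", transcription)
    # Accept the response only if exactly four digits were spoken
    return digits if len(digits) == 4 else None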
Stream usage
The service configuration for a REST application has an advanced setting for the number of speech recognition streams it can use. This needs to be set to the maximum number of streams that might be used simultaneously, during the lifetime of the application.
For example, an application that runs Start Transcription with the transcription mode set to separate and then runs Run Speech Menu will need to be configured with three speech recognition streams: two will be in use by the transcription (one for each audio direction), and the Run Speech Menu action requires an additional one when it runs.
Use cases
Conversations
In most applications, the main action used to drive a conversation with the user is Get Input. This allows you to play a file or TTS prompt, then receive a transcription of the user's response passed to your next_page. An example interaction might be:
- Prompt: "What would you like to do?"
- Response: "Pay a bill."
The Run Speech Menu action, being somewhat more restricted, is ideal for menu-driven applications. Here, a file or TTS prompt is played, and the user's response, which must be one of a set of specified words or short phrases, is passed to the selected next_page. An example interaction here might be:
- Prompt: "Would you like to speak to Sales, Marketing or Support?"
- Response: "Support"
Play with selective barge-in
The Play action seems at first sight an odd place to feature Speech Recognition. However, consider the case where the user is listening to a long recorded voicemail. They may say a small number of things to stop it, for example "Next", "Again" or "Delete", but, with the voicemail being long, there is always the chance of the Speech Recognition transcribing some background speech. The Play action therefore allows the application to specify whether barge-in on speech is allowed and, if so, whether it is restricted to specific supplied phrases.
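A hedged sketch of that voicemail playback, with barge-in allowed only for the three expected phrases (illustrative field names; see the Play action reference for the actual barge-in options):
play_sketch = {
    "file_to_play": "voicemail_0042.wav",            # the long recorded voicemail
    "barge_in_on_speech": True,                      # allow speech to interrupt playback
    "barge_in_phrases": ["Next", "Again", "Delete"]  # only these phrases stop the playback
}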
Live transcription
The Start Transcription and Stop Transcription actions allow your application to receive a live transcription, sent to its chosen page, of the speech on any combination of the inbound and outbound audio streams, all performed outside the ongoing IVR call flow. These actions would typically be used to allow the application to be aware of, and react to, the content of human-to-human conversations. For example, a section of the agent or receptionist's screen may update to display a 'Book an appointment' button if the caller mentions they would like one. Alternatively, the manager's screen may update to flag that an agent is involved in a particularly difficult conversation.
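For example, a hedged sketch of starting a separate-mode transcription that reports results to a page of your choosing (illustrative field names; see the Start Transcription action reference for the actual parameters):
start_transcription_sketch = {
    "transcription_mode": "separate",        # one recognition stream per audio direction
    "language": "en-GB",
    "result_page": "/transcription_results"  # page that receives the live transcription
}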
Live translation
The Connect action can be configured to include an AI Translator in the conversation. The translator will use Text To Speech to say translations of the speech recognised from each user to both parties.
Charging
On a trial account you can start using Speech Recognition straight away.
For other accounts, our Speech Recognition is charged per recognition, per minute, with 15-second granularity. So, for example:
- A Get Input which listens for 12 seconds will be charged for 15 seconds.
- A Start Transcription for separate outbound and inbound audio which listens for 3 minutes 20 seconds will be charged for 7 minutes (each of the two transcriptions is charged for 3 minutes and 30 seconds).
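The rounding in the examples above can be sketched as:
import math

def billed_seconds(listening_seconds, granularity=15):
    # Each recognition is rounded up to the next 15-second boundary
    return math.ceil(listening_seconds / granularity) * granularity

billed_seconds(12)               # 15 seconds for the Get Input example
2 * billed_seconds(3 * 60 + 20)  # 420 seconds (7 minutes) for the separate-mode transcription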
You can obtain detailed charge information for a specific call using the Application Status web service. You can obtain detailed charge information for calls over a period of time using the Managing Reports web services. When using transcription in Separate mode there will be two corresponding entries in the Feature Data Record (FDR), one for each direction.