
Call Control Parameters

Call control parameters are general-purpose parameters that can modify a call's behavior, including ASR/STT & TTS configurations.

Note

Automatic Speech Recognition (ASR) and Speech-to-Text (STT) are two terms that refer to the same technology. Both involve converting spoken language into written text by analyzing and interpreting audio input. The terms are used interchangeably, describing the same function—transforming speech into readable, actionable text.

There are two ways to define the Call Control Parameters: Node Level and Channel Level.

Node Level Call Control

The call control section is available in the Entity, Message, and Confirmation nodes under IVR Properties > Advanced Controls. Learn more.

Channel Level Call Control

For information on configuring the Call Control Parameters at the channel level, refer to Define the Call Control Parameters.

Supported Speech Engines

Kore.ai supports the following third-party service providers for ASR/STT and TTS. Learn more.

| Speech Engine | ASR Name | TTS Name | Supported Environment |
|---|---|---|---|
| Microsoft Azure | microsoft | microsoft | On-Premise, Cloud |
| Google | google | google | On-Premise, Cloud |
| Nvidia (Riva) | nvidia | nvidia | On-Premise |
| Amazon (AWS) | aws | polly | Cloud |
| Deepgram | deepgram | Not Supported | Cloud |
| ElevenLabs | Not Supported | elevenlabs | Cloud |
| Whisper | Not Supported | whisper | Cloud |
| AmiVoice | amivoice | | Cloud |

Common ASR Parameters

| Parameter | Type | Supported STT/TTS | Description | Examples |
|---|---|---|---|---|
| alternativeLanguages | Array of objects | Google, Microsoft, Deepgram | An array of alternative languages that the speaker may be using. Based on the user utterance, the transcript is returned in one of the selected languages. | alternativeLanguages = [ { "language": "de-DE", "voiceName": "de-DE-KatjaNeural" }, { "language": "fr-FR", "voiceName": "fr-FR-DeniseNeural" } ] |
| hints (with phrase-level hintsBoost) | Array of objects | Google, Nvidia | Lists phrases or words passed to the speech-to-text service as "hints" to improve recognition accuracy. For example, "weather" and "whether" have the same pronunciation, so a hint of ['weather'] makes the service take "weather" as input. A boost factor can be specified at the phrase level (Kore VG key = hints). Place this array in the Grammar section of the bot builder. | "hints" = [ {"phrase": "benign", "boost": 50}, {"phrase": "malignant", "boost": 10}, {"phrase": "biopsy", "boost": 20} ] |
| hints (with a separate hintsBoost) | Array of strings plus Number | Google, Microsoft, Nvidia | Same as above, but a single boost factor is applied to all hint phrases. | "hints": ["benign", "malignant", "biopsy"], "hintsBoost": 50 |
| sttMinConfidence | Number (0.1 to 0.9) | All | If set and the transcript generated by the ASR falls below this confidence threshold, the Voice Gateway disregards the input and plays the timeout prompt. This ensures that only sufficiently accurate speech recognition results are processed, improving the quality of the interaction. | sttMinConfidence = 0.5 — any ASR transcript with a confidence score below 0.5 is ignored, and the system plays the timeout prompt. |
| sttDisablePunctuation | Boolean | Google, Microsoft | Controls whether the ASR adds punctuation to the user transcript (for example, periods, commas, and question marks); by default, punctuation is added. true: remove the punctuation. false: add the punctuation. | sttDisablePunctuation = true |
| vadEnable | Boolean | All | If true, delays connecting to the cloud recognizer until speech is detected. | |
| vadVoiceMS | Number (ms) | All | If VAD is enabled, the number of milliseconds of speech required before connecting to the cloud recognizer. | |
| vadMode | Number (0–3) | All | If VAD is enabled, governs the sensitivity of the voice activity detector; the value must be between 0 and 3 inclusive, and lower numbers mean more sensitive. | |
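
Several of the common ASR parameters above can be passed together in a single call control object. The following is a minimal, illustrative sketch; the confidence threshold, VAD timings, and hint values are examples only, not recommendations:

{
  "sttMinConfidence": 0.5,
  "alternativeLanguages": [
    { "language": "de-DE", "voiceName": "de-DE-KatjaNeural" }
  ],
  "hints": ["benign", "malignant", "biopsy"],
  "hintsBoost": 50,
  "sttDisablePunctuation": true,
  "vadEnable": true,
  "vadVoiceMS": 250,
  "vadMode": 2
}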

Microsoft ASR

| Parameter | Type | Description |
|---|---|---|
| azureSpeechSegmentationSilenceTimeoutMs | Number | A timeout that can be set between phrases. It is similar to Continuous ASR; the only difference is that Continuous ASR is handled by the Voice Gateway, while Azure speech segmentation is handled by the Azure ASR itself, so accuracy is higher compared to Continuous ASR. More info. |
| sttEndpointID | String | Custom service endpoint to connect to, instead of the hosted Microsoft regional endpoint. |
| azurePostProcessing | String | Improves the final transcript, for example through text normalization (adjusting punctuation, casing, etc.) or custom handling specific to the needs of the application. |
| azureSpeechRecognitionMode | String (enum: AtStart, Continuous) | "AtStart": starts recognizing speech as soon as audio input is detected and stops when the speaker finishes; suitable for short, one-time speech recognition tasks. "Continuous": continuously listens to and transcribes speech; ideal for longer audio streams or uninterrupted speech sessions such as meetings or dictation. Example: azureSpeechRecognitionMode = "Continuous" |
| profanityOption | String (enum) | Masks profane words in the transcript. It has three values: masked, removed, or raw. Default: raw. Example: profanityOption = "masked" |
| initialSpeechTimeoutMs | Number (ms) | Initial speech timeout in milliseconds. |
| requestSnr | Boolean | Request signal-to-noise information. |
| outputFormat | String | simple or detailed. Default: simple. |
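
A minimal sketch of a Microsoft (Azure) ASR configuration using the parameters above; the provider and language keys follow the Voice Gateway properties described later in this document, and all values are illustrative:

{
  "sttProvider": "microsoft",
  "sttLanguage": "en-US",
  "azureSpeechRecognitionMode": "Continuous",
  "azureSpeechSegmentationSilenceTimeoutMs": 1000,
  "profanityOption": "masked",
  "outputFormat": "simple"
}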

Google ASR

| Parameter | Type | Description |
|---|---|---|
| sttProfanityFilter | Boolean | A profanity filter that provides a few options for dealing with profane words in the transcription. Default: false |
| singleUtterance | Boolean | If true, returns only a single utterance/transcript. |
| sttModel | String | Speech recognition model to use. Default: phone_call |
| sttEnhancedModel | Boolean | Use the enhanced model. |
| words | Boolean | Enable word offsets. |
| diarization | Boolean | Enable speaker diarization. |
| diarizationMinSpeakers | Number | Set the minimum speaker count. |
| diarizationMaxSpeakers | Number | Set the maximum speaker count. |
| interactionType | String | Set the interaction type: discussion, presentation, phone_call, voicemail, professionally_produced, voice_search, voice_command, dictation |
| naicsCode | Number | Set an industry NAICS code that is relevant to the speech. |
| googleServiceVersion | String | v1 or v2. Specifies the version of Google's ASR API in use to ensure compatibility. |
| googleRecognizerId | String | Identifies the specific speech recognition model for processing the input. |
| googleSpeechStartTimeoutMs | Number | Sets the time (in milliseconds) to wait for the speaker to start speaking before timing out. |
| googleSpeechEndTimeoutMs | Number | Defines how long to wait (in milliseconds) for silence before determining the end of speech. |
| googleEnableVoiceActivityEvents | Boolean | Enables detection of when the user starts or stops speaking during recognition. |
| googleTranscriptNormalization | Array | Adjusts the transcript to make it more readable, applying corrections like punctuation and casing. |
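
A minimal, illustrative sketch of a Google ASR configuration combining the parameters above (values are examples only):

{
  "sttProvider": "google",
  "sttLanguage": "en-US",
  "sttModel": "phone_call",
  "sttEnhancedModel": true,
  "sttProfanityFilter": false,
  "googleSpeechStartTimeoutMs": 5000,
  "googleSpeechEndTimeoutMs": 1000
}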

AWS ASR

| Parameter | Type | Description |
|---|---|---|
| awsAccessKey | String | The AWS access key for authenticating requests. |
| awsSecretKey | String | The corresponding secret key, used with the access key for AWS service authentication. |
| awsSecurityToken | String | A temporary security token (optional) for requests that use AWS Security Token Service (STS). |
| awsRegion | String | Specifies the AWS region where the service requests will be sent (for example, us-west-2, eu-central-1). |
| awsVocabularyFilterName | String | The name of the vocabulary filter used to filter certain words or phrases during transcription. |
| awsVocabularyFilterMethod | String (enum: "remove", "mask", "tag") | Specifies how words in the vocabulary filter are handled: "remove" completely removes the word from the transcription; "mask" masks the word (for example, replaces it with asterisks); "tag" adds tags to identify the filtered word. |
| awsLanguageModelName | String | The name of a custom language model to apply during transcription for better accuracy with domain-specific language. |
| awsPiiEntityTypes | Array | A list of PII (Personally Identifiable Information) entity types to be detected (for example, ["NAME", "EMAIL", "SSN"]). This helps the system identify and protect sensitive information during transcription. |
| awsPiiIdentifyEntities | Boolean | A flag that indicates whether to identify and highlight PII entities within the transcribed text. If true, PII entities are detected and processed according to the configuration. |
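
A minimal sketch of an AWS ASR configuration; the region and vocabulary filter name are illustrative placeholders, and credentials are omitted from this sketch:

{
  "sttProvider": "aws",
  "sttLanguage": "en-US",
  "awsRegion": "us-west-2",
  "awsVocabularyFilterName": "my-filter",
  "awsVocabularyFilterMethod": "mask",
  "awsPiiEntityTypes": ["NAME", "EMAIL", "SSN"],
  "awsPiiIdentifyEntities": true
}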

Nvidia ASR

| Parameter | Type | Description |
|---|---|---|
| nvidiaRivaUri | String | gRPC endpoint (ip:port) on which Nvidia Riva is listening. |
| nvidiaMaxAlternatives | Number | The number of alternatives to return. |
| nvidiaProfanityFilter | Boolean | Indicates whether to remove profanity from the transcript. |
| nvidiaWordTimeOffsets | Boolean | Indicates whether to provide word-level detail. |
| nvidiaVerbatimTranscripts | Boolean | Indicates whether to provide verbatim transcripts. |
| nvidiaCustomConfiguration | Object | An object of key-value pairs that can be sent to Nvidia for custom configuration. |
| nvidiaPunctuation | Boolean | Indicates whether to provide punctuation in the transcripts. |
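
A minimal sketch of an Nvidia Riva ASR configuration; the gRPC endpoint is a placeholder and all values are illustrative:

{
  "sttProvider": "nvidia",
  "sttLanguage": "en-US",
  "nvidiaRivaUri": "10.0.0.12:50051",
  "nvidiaMaxAlternatives": 1,
  "nvidiaPunctuation": true,
  "nvidiaProfanityFilter": false
}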

Deepgram ASR

| Parameter | Type | Description |
|---|---|---|
| deepgramApiKey | String | Deepgram API key to authenticate with (overrides the setting in the Kore VG portal). |
| deepgramTier | String | Deepgram tier you would like to use ('enhanced', 'base'). |
| sttModel | String | Deepgram model used to process the submitted audio ('general', 'meeting', 'phonecall', 'voicemail', 'finance', 'conversationalai', 'video', 'custom'). Example: nova-2-phonecall |
| deepgramCustomModel | String | Id of the custom model. |
| deepgramVersion | String | Deepgram version of the model used. |
| deepgramPunctuate | Boolean | Indicates whether to add punctuation and capitalization to the transcript. |
| deepgramProfanityFilter | Boolean | Indicates whether to remove profanity from the transcript. |
| deepgramRedact | String (enum: 'pci', 'numbers', 'true', 'ssn') | Whether to redact information from transcripts. |
| deepgramDiarize | Boolean | Whether to assign a speaker to each word in the transcript. |
| deepgramDiarizeVersion | String | If set to '2021-07-14.0', the legacy diarization feature is used. |
| deepgramNer | Boolean | |
| deepgramMultichannel | Boolean | Indicates whether to transcribe each audio channel independently. |
| deepgramAlternatives | Number | The number of alternative transcripts to return. |
| deepgramNumerals | Boolean | Indicates whether to convert numbers from written format (for example, one) to numerical format (for example, 1). |
| deepgramSearch | Array | An array of terms or phrases to search for in the submitted audio. |
| deepgramReplace | Array | An array of terms or phrases to search for in the submitted audio and replace. |
| deepgramKeywords | Array | An array of keywords to which the model should pay particular attention, boosting or suppressing them to help it understand the context. |
| deepgramEndpointing | Boolean or Number | The number of milliseconds of silence Deepgram uses to determine whether a speaker has finished saying a word or phrase. The value must be either a number of milliseconds or 'false' to disable the feature entirely. Note: Deepgram's default endpointing value is 10 milliseconds. You can set a higher value to require more silence before a final transcript is returned, but a value of 1000 (one second) or less is suggested, as strange behaviors have been observed with higher values. If you wish to allow more time for pauses during a conversation before returning a transcript, use the deepgramUtteranceEndMs parameter instead. |
| deepgramVadTurnoff | Number | |
| deepgramTag | String | A tag to associate with the request. Tags appear in usage reports. |
| deepgramUtteranceEndMs | Number | Configures the ASR to detect the end of speech in live-streaming audio. |
| deepgramShortUtterance | Boolean | Causes a transcript to be returned as soon as the Deepgram is_final property is set. This should only be used in scenarios where you are expecting a very short confirmation or a directed command and you want minimal latency. |
| deepgramSmartFormatting | Boolean | Indicates whether to enable Deepgram's Smart Format feature, which applies additional formatting to transcripts to optimize them for human readability. Smart Format capabilities vary between models. When Smart Format is turned on, Deepgram always applies the best-available formatting for your chosen combination of model, model option, and language. |
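
A minimal sketch of a Deepgram ASR configuration using a few of the parameters above; the model name and values are illustrative:

{
  "sttProvider": "deepgram",
  "sttLanguage": "en-US",
  "sttModel": "nova-2-phonecall",
  "deepgramPunctuate": true,
  "deepgramSmartFormatting": true,
  "deepgramEndpointing": 500,
  "deepgramKeywords": ["benign", "malignant", "biopsy"]
}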

Common TTS Parameters

| Parameter | Type | Supported STT/TTS | Description | Examples |
|---|---|---|---|---|
| disableTtsCache | Boolean | All | Controls whether cached audio is reused when the same statement or word is found, instead of calling the TTS engine again. | |
| ttsEnhancedVoice | String | AWS | Amazon Polly has four voice engines that convert input text into lifelike speech: Generative, Long-form, Neural, and Standard. Use this parameter to select the engine for an Amazon Polly voice. | "standard", "neural", "generative", "long-form" |
| ttsGender | String | Google | MALE, FEMALE, NEUTRAL | |
| ttsLoop | Number / String | All | Controls the repeated playback of a TTS-generated message. When ttsLoop is enabled, the specified TTS message is played multiple times in a loop, which is useful when you want to ensure the message is heard clearly or when the user might need more time to process the information. | ttsLoop = 2 — the text is played twice |
| earlyMedia | Boolean | All | Controls the playback of audio prompts or messages before the call is fully connected. This feature is typically employed in telecommunication systems, allowing messages to be played while the call is still in the "early" phase, that is, before the recipient answers the call. | |
| ttsOptions | Object | PlayHT, Deepgram, ElevenLabs, Whisper | Used to fine-tune the TTS (see TTS Options in Kore VG below). | |
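
A minimal sketch combining the common TTS parameters above with an AWS (Polly) voice; the voice name and values are illustrative only:

{
  "ttsProvider": "aws",
  "ttsLanguage": "en-US",
  "voiceName": "Joanna",
  "ttsEnhancedVoice": "neural",
  "ttsLoop": 2,
  "earlyMedia": false
}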

TTS Options in Kore VG

Kore VG now supports a ttsOptions parameter that allows bot developers to customize Text-to-Speech (TTS) messages by passing dynamic objects tailored to the specific TTS provider. Depending on the provider, these options can be used to fine-tune aspects like voice settings, speed, and other properties.

Note

Each TTS provider will have its own set of customizable parameters. For more detailed information on the parameters they support, refer to their official websites.

Structure of ttsOptions

The ttsOptions object contains provider-specific settings in a key-value format. Below are examples of different TTS providers:

ElevenLabs

  • optimize_streaming_latency: Adjusts the latency during streaming.
  • voice_settings: Includes various voice customization options like stability, similarity_boost, and use_speaker_boost. Learn more.
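
A sketch of ttsOptions for ElevenLabs built only from the options listed above; the values are illustrative and should be verified against ElevenLabs' documentation:

ttsOptions = {
   "optimize_streaming_latency": 3,
   "voice_settings": {
      "stability": 0.5,
      "similarity_boost": 0.75,
      "use_speaker_boost": true
   }
}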

PlayHT

  • quality: Sets the quality of the audio output.
  • speed: Controls the playback speed.
  • emotion, voice_guidance, style_guidance, and text_guidance: Allow further customization of the voice's emotional tone and style. Learn more.
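
A sketch of ttsOptions for PlayHT using the options listed above; the values are illustrative and should be verified against PlayHT's documentation:

ttsOptions = {
   "quality": "high",
   "speed": 1.0,
   "emotion": "female_happy",
   "voice_guidance": 3,
   "style_guidance": 20,
   "text_guidance": 1.5
}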

Deepgram

Apart from generic parameters like ttsLanguage and voiceName, which are common across most TTS engines, Deepgram offers a few additional parameters that enhance customization:

  • encoding (string): You can specify the desired encoding format for the output audio file, such as mp3 or wav.
  • model (enum): Defines the AI model to be used for synthesizing the text into speech. The default model is aura-asteria-en, optimized for natural-sounding English voice output.
  • sample_rate (string): This enables you to set the sample rate of the audio output, offering control over the quality and clarity of the sound produced.
  • container: Allows you to specify the desired file format wrapper for the output audio generated through text-to-speech synthesis.

These parameters provide additional flexibility for developers to fine-tune the audio output to meet their specific needs. All these parameters will be set inside ttsOptions. Learn more.
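
Putting these together, a ttsOptions sketch for Deepgram might look like the following; the encoding, sample rate, and container combination is illustrative and should be verified against Deepgram's documentation:

ttsOptions = {
   "model": "aura-asteria-en",
   "encoding": "linear16",
   "sample_rate": "16000",
   "container": "wav"
}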

AWS

Apart from generic parameters like ttsLanguage and voiceName, which are common across most TTS engines, AWS offers a few additional parameters that enhance customization, such as ttsEnhancedVoice, also known as the engine.

Amazon Polly has four voice engines that convert input text into lifelike speech: "standard", "neural", "generative", and "long-form".

ttsEnhancedVoice = “neural”

Open AI (Whisper)

Apart from generic parameters like ttsLanguage and voiceName, which are common across most TTS engines, Whisper offers a few additional parameters that enhance customization, such as model.

For real-time applications, the standard tts-1 model provides the lowest latency but at a lower quality than the tts-1-hd model. Due to how the audio is generated, tts-1 is likely to generate more static content in certain situations than tts-1-hd. In some cases, the audio may not have noticeable differences depending on your listening device and the person.

ttsOptions = {
   "model": "tts-1"
}

Primary and Fallback ASR/TTS

ASR/TTS fallback functionality can be implemented at various levels within the system, such as the application level, the experience flow level, or the call control parameter level. This mechanism ensures that if there is an error or failure with the primary ASR (Automatic Speech Recognition) or TTS (Text-to-Speech) service, the system automatically switches to a secondary, or fallback, ASR/TTS configuration. The fallback prevents interruptions in the service and ensures a seamless user experience, regardless of issues with the primary configuration. For optimal performance, it is advised to configure the fallback with the same vendor in a different region/label.

Configure Primary and Fallback ASR/TTS

Location 1 - Global Setting

In SmartAssist: Configurations > System Setup > Language & Speech > Voice Preferences > Show Advanced Settings.

Location 2 - Call Control Parameters

In SmartAssist: Automation > Select bot > Conversational Skills > Dialog Tasks > Select Dialog Task > Select the node you want to configure > IVR Properties > Advanced Controls > Call Control Parameters.

Location 3 - Experience Flows

In SmartAssist: Configurations > Experience Flows > Update/New Experience Flow > Speech Recognition Engine (ASR/TTS) > Show Advanced Settings.

Location 4 - Start Node in Experience Flow

Note

  • This feature is available only in ‘SmartAssist’ and is not yet implemented in ‘XO11’. It will be implemented in upcoming releases.
  • For now, you can add Primary & Fallback ASR/TTS from the same vendor only.
    • Example: If you have selected the ‘Microsoft Azure Speech Services’ vendor as the ASR, you can enter a label name from the Microsoft vendor itself, such as ‘my_azure-US’.
    • You can configure the label name in the Primary ASR/TTS configuration and the Fallback ASR/TTS configuration under Show Advanced Settings.
    • The Fallback ASR/TTS configuration should not be the same as the Primary ASR/TTS configuration.
    • Both the Primary and Fallback ASR/TTS configurations should be available in SAVG Speech Services; otherwise, you will not be able to configure them in SmartAssist.
    • The Credential Status of the speech services configured in SAVG should be verified. If the credential status is Failed, ASR/TTS conversations will fail.
  • In call control parameters:
    • You can configure the fallback for different vendors, but for optimal performance it is advised to configure the fallback with the same vendor in a different region.
    • Call control parameters do not validate duplicate values for the Primary and Fallback configurations, so pay close attention to spelling mistakes (see the sketch after this list). Learn more.
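
A minimal sketch of a call-control-level fallback configuration, assuming the Microsoft vendor and illustrative label names; a fuller example appears under Labels and Fallback Providers below:

{
  "sttProvider": "microsoft",
  "sttLanguage": "en-US",
  "sttLabel": "my_azure-US",
  "sttFallbackProvider": "microsoft",
  "sttFallbackLanguage": "en-US",
  "sttFallbackLabel": "my_azure_Europe",
  "ttsProvider": "microsoft",
  "ttsLanguage": "en-US",
  "voiceName": "en-US-AmberNeural",
  "ttsLabel": "my_azure-US",
  "ttsFallbackProvider": "microsoft",
  "ttsFallbackLanguage": "en-US",
  "ttsFallbackLabel": "my_azure_Europe"
}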

Voice Gateway Properties

Provider related parameters

Speech-to-text and text-to-speech services interface with the user in a selected language (for example, English US, English UK, or German). Text-to-speech services also use a selected voice (for example, female or male) to speak to the user. Speech-to-text is used by the recognizer and text-to-speech by the synthesizer: sttProvider (google, microsoft) selects the recognizer, and ttsProvider (google, microsoft, aws) selects the synthesizer. For example:

{
   "sttProvider": "google",
   "sttLanguage": "en-IN",
   "ttsProvider": "google",
   "ttsLanguage": "en-IN",
   "voiceName": "en-IN-Wavenet-A"
}

To apply the parameters below, the STT engine must always be used as the recognizer; otherwise, the default that was set at the bot level or in the Kore VG/SmartAssist application is applied.

Note: Provider properties are applied at the session level.

| Parameter | Type | Supported STT/TTS | Description | Examples |
|---|---|---|---|---|
| sttProvider | String | All | Sets the speech-to-text engine. At any stage of the call, the bot can dynamically change the speech provider (speech-to-text or text-to-speech) of the call. The provider change can be made for the entire call duration (starting from the current text/audio that is played by the bot). | sttProvider = "google" |
| sttLanguage | String | All | Sets the STT language for recognizing the user's voice; the transcript is returned according to sttLanguage. Defines the language of the bot conversation (for example, "en-ZA" for South African English) used by the speech-to-text service. | sttLanguage = "zh-CN" — all transcripts are returned in Chinese. sttLanguage = "en-US" |
| ttsProvider | String | All | Similar to sttProvider. | ttsProvider = "microsoft" |
| ttsLanguage | String | All | Similar to sttLanguage; required to set the TTS language. | ttsLanguage = "en-US" |
| voiceName | String | All | Mandatory for text-to-speech conversion and used only for the bot's TTS responses. The voice name must correctly align with ttsLanguage. | { "ttsProvider": "microsoft", "ttsLanguage": "en-AU", "voiceName": "en-AU-NatashaNeural" } |
| enableSpeechInput | Boolean | All | If false, only DTMF input is allowed. Defaults to true and can be used in Entity nodes. Do not use this in the channel override script; it is meant to be used only through the call control parameters. | enableSpeechInput = false |

Labels and Fallback Providers

Label: Assign/create a label only if you need to create multiple speech services from the same vendor. Then, use the label in your application to specify which service to use.

To configure a label:

1) Add a speech service inside the Speech tab.
2) Select a provider and add a label with a unique name.
3) Use the same label in the call control parameters.
4) At the node where you use the fallback call control parameters, the primary recognizer and synthesizer must also be passed.

Example: sttProvider = "google", sttLanguage = "en-US", sttLabel = "google-stt-2"

STT example:
"sttProvider": "microsoft",
"sttLabel": "my_azure-US",
"sttLanguage": "en-US"

TTS example:
"ttsProvider": "microsoft",
"ttsLanguage": "en-US",
"voiceName": "en-US-AmberNeural",
"ttsLabel": "my_azure-US"

Fallback example:
"sttProvider": "microsoft", "sttLabel": "my_azure-US", "sttLanguage": "en-US",
"ttsProvider": "microsoft", "ttsLanguage": "en-US", "voiceName": "en-US-AmberNeural", "ttsLabel": "my_azure-US",
"sttFallbackProvider": "microsoft", "sttFallbackLanguage": "en-US", "sttFallbackLabel": "my_azure_Europe",
"ttsFallbackProvider": "microsoft", "ttsFallbackLanguage": "en-US", "ttsFallbackLabel": "my_azure_Europe", "ttsFallbackVoiceName": "en-US-AmberNeural"

Note: At the node where you use the fallback call control parameters, the primary recognizer and synthesizer must also be passed. The best practice is to keep the same ASR engine in the fallback with a different label. If the current provider fails, Kore VG picks the fallback provider. Similarly, a fallback can be added for the TTS provider.
Note: Fallback properties will be applied at the session level.
| Parameter | Type | Description |
|---|---|---|
| sttLabel | String | Uniquely identifies the ASR engine in Kore VG. |
| sttFallbackLabel | String | If fallback is enabled in Kore VG at the application level, then in case of any error the ASR switches to the fallback configuration. It is recommended to have a fallback to the same vendor in a different region. |
| sttFallbackProvider | String | Fallback provider details. |
| sttFallbackLanguage | String | Fallback language details. |
| ttsLabel | String | Uniquely identifies the TTS engine in Kore VG. |
| ttsFallbackLabel | String | Fallback label details. |
| ttsFallbackProvider | String | Fallback provider details. |
| ttsFallbackLanguage | String | Fallback language details. |
| ttsFallbackVoice | String | Fallback voice details. |

Continuous ASR

Continuous ASR (automatic speech recognition) is a feature that allows speech-to-text (STT) recognition to be tuned for the collection of things like phone numbers, customer identifiers, and other strings of digits or characters, which, when spoken, often have pauses between utterances.
Note: For Microsoft only, Azure provides an ASR property that works the same way as Continuous ASR: azureSpeechSegmentationSilenceTimeoutMs (see Microsoft ASR above). Because silence is detected directly by the ASR engine instead of the Voice Gateway detecting and merging the responses, it is more accurate than Continuous ASR. Learn more.
Note: Continuous ASR / azureSpeechSegmentationSilenceTimeoutMs is applied at the session level. It remains active throughout the call, and the developer can adjust the value at different nodes based on the requirement.

| Parameter | Type | Supported STT/TTS | Description | Examples |
|---|---|---|---|---|
| continuousASRTimeoutInMS | Number (ms) | All | The duration of silence, in milliseconds, to wait after a transcript is received from the STT vendor before returning the result. If another transcript is received before this timeout elapses, the transcripts are combined and recognition continues. The combined transcripts are returned once the timeout between utterances exceeds this value. | continuousASRTimeoutInMS = 5000 for 5 seconds |
| continuousASRDigits | DTMF digit (for example, * or #) | All | A DTMF key which, if entered, also terminates the gather operation and immediately returns the collected results. | continuousASRDigits = "#" |
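
For example, a sketch of Continuous ASR tuning through call control parameters (values are illustrative):

{
  "continuousASRTimeoutInMS": 3000,
  "continuousASRDigits": "#"
}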

Barge-IN

The Barge-In feature controls Kore VG behavior in scenarios where the user starts speaking or dials DTMF digits while the bot is playing its response to the user. In other words, the user interrupts ("barges-in") the bot.
Note: Barge-in is applied at the node level.

| Parameter | Type | Supported STT/TTS | Description |
|---|---|---|---|
| listenDuringPrompt | Boolean (true or false) | All | If false, does not listen for user speech until the bot has finished playing its response to the user. Defaults to true. Similar to barge-in. |
| bargeInMinWordCount | Number | All | If barge-in is true, the prompt is only killed when this many words are spoken. Defaults to 1. |
| bargeInOnDTMF | Boolean | All | If true, DTMF input is enabled and audio playback is killed when the caller enters DTMF; the caller can then speak their utterance. |
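
For example, a sketch of barge-in tuning through call control parameters (values are illustrative):

{
  "listenDuringPrompt": true,
  "bargeInMinWordCount": 2,
  "bargeInOnDTMF": true
}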

Timeout related parameters

Note: All timeout parameters are applied at the node level.

| Parameter | Type | Supported STT/TTS | Description | Examples |
|---|---|---|---|---|
| userNoInputTimeoutMS | Number (ms; 1 sec = 1000) | All | Defines the maximum time (in milliseconds) that Kore VG waits for input from the user. If userNoInputTimeoutMS = 0, Kore VG waits indefinitely for user input. | userNoInputTimeoutMS = 20000 |
| dtmfCollectInterDigitTimeoutMS | Number (ms) | All | Defines the timeout that Kore VG waits for the user to press another digit before it sends all the digits to the bot. | |
| dtmfCollectSubmitDigit | Number | All | Defines a special DTMF "submit" digit that, when received from the user, causes Kore VG to immediately send all the collected digits to the bot (as a DTMF message), without waiting for the timeout to expire or for the maximum number of expected digits. | |
| dtmfCollectMaxDigits | Number | All | The maximum number of DTMF digits expected to gather. | If dtmfCollectMaxDigits = 5 and the user enters 1234567, the bot takes only 12345. |
| dtmfCollectminDigits | Number | All | The minimum number of DTMF digits expected to gather. Defaults to 1. | |
| dtmfCollectnumDigits | Number | All | The exact number of DTMF digits expected to gather. | |
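
For example, a sketch of timeout and DTMF collection tuning through call control parameters (values are illustrative):

{
  "userNoInputTimeoutMS": 20000,
  "dtmfCollectInterDigitTimeoutMS": 3000,
  "dtmfCollectminDigits": 4,
  "dtmfCollectMaxDigits": 10
}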