
Streaming Responses

Introduction

Streaming responses in large language models (LLMs) enable real-time, incremental output generation. Instead of waiting for the entire response to be computed, the model begins transmitting pieces of the output as they become available. This reduces perceived latency and gives interactions the immediacy of real-time communication. Streaming is particularly beneficial in applications requiring dynamic updates, such as conversational AI, speech-to-text systems, and real-time content-generation tools.
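To make the mechanics concrete, the minimal sketch below consumes a streamed chat completion with the OpenAI Python SDK and prints tokens as they arrive. The model name and prompt are illustrative placeholders, not platform defaults.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Request a streamed completion: chunks arrive as they are generated,
# instead of one response after the full generation finishes.
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Tell me a 50-word joke."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small delta of the output text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```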

Current Capabilities

Streaming is supported on the voice-based Kore Voice Gateway channel for GenAI features, including the Agent Node, with PlayHT as the supported TTS engine. Additionally, we provide seamless integration with models from OpenAI and Azure OpenAI. Our Custom Prompt capability enables integration with other LLMs, allowing businesses to use non-system models by defining their own prompts, provided the LLM supports streaming.

Benefits of Streaming

  • Real-Time Output: Generates and displays text instantly, reducing wait times.
  • Lower Latency: Speeds up response time, enhancing user experience.
  • Improved Interaction Flow: Partial outputs support iterative writing and brainstorming.
  • Optimized for Live Applications: Enhances real-time chat, speech-to-text, and code autocompletion.

Use Case

Streaming responses unlock significant benefits across various industries, enhancing real-time interactions and delivering incremental updates that improve user experience and operational efficiency.

  • Healthcare: Streaming comprehensive summaries of patient history or medical guidelines; delivering in-depth analysis of clinical studies or medical research in real time.
  • Finance: Streaming detailed breakdowns of investment portfolios or market analysis; providing incremental summaries of compliance documents and regulations.
  • E-commerce: Streaming extensive side-by-side comparisons of products for informed decision-making.
  • Education: Delivering detailed outlines or summaries of academic courses or study materials.
  • Legal: Streaming detailed explanations of legal precedents and their relevance to current cases; providing incremental analysis and feedback on lengthy legal contracts.
  • Customer Support: Streaming detailed troubleshooting steps for intricate customer issues or technical problems.
  • Human Resources: Streaming detailed explanations of HR policies or benefit packages for employees.
  • Marketing: Streaming in-depth analysis of marketing campaigns and their ROI.

Enable Streaming for Agent Node

Select Default Streaming Prompt

The XO Platform provides default-streaming prompts in addition to default (non-streaming) prompts for the Agent Node for the OpenAI and Azure OpenAI models.

Go to Generative AI Tools > GenAI Features > Dynamic Conversations and select the default-streaming prompt for Agent Node. You can also create custom streaming prompts for this model.

Create Custom Streaming Prompts

To create a custom streaming prompt, see How to add Prompts and Requests, and enable the streaming response toggle.

Note

  • When the toggle is enabled, add the required stream parameter to the custom prompt so the model recognizes streaming. For example, "stream": true for OpenAI and Azure OpenAI; see the sketch after this note.
  • The saved prompt will appear with a stream tag in the prompts library.
  • Enabling streaming disables the “Exit Scenario”, “Virtual Assistant Response”, “Collected Entities”, and “Tool Call Request” (for Agent Node) fields.
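As a hedged illustration of the stream parameter above, here is what the request body of a custom streaming prompt might look like for an OpenAI or Azure OpenAI chat completions endpoint. The model name and the {{context.userInput}} variable are hypothetical placeholders, not platform-defined names.

```python
import json

# Hypothetical request body for a custom streaming prompt targeting an
# OpenAI or Azure OpenAI chat completions endpoint. The "stream": true
# flag is what makes the model return tokens incrementally.
custom_prompt_body = {
    "model": "gpt-4o-mini",  # placeholder model name
    "messages": [
        {"role": "system", "content": "You are a helpful voice agent."},
        # "{{context.userInput}}" is a hypothetical placeholder for the
        # variable the platform substitutes with the user's utterance.
        {"role": "user", "content": "{{context.userInput}}"},
    ],
    "stream": True,  # serializes to "stream": true in the JSON payload
}

print(json.dumps(custom_prompt_body, indent=2))
```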

Configure Kore Voice Gateway

Streaming is currently supported only by the Kore Voice Gateway channel. To configure it, see Configure Kore Voice Gateway.

Benchmarking

Task           | Mode          | Input Tokens | Output Tokens | Time Taken (s) | Reduction (%)
---------------|---------------|--------------|---------------|----------------|------------------------------
Agent Node     | Non-streaming | 777          | 90            | 2.59           | Output: -30%, Time: -83%
Agent Node     | Streaming     | 676          | 62            | 0.44           |
50-word Joke   | Non-streaming | 95           | 54            | 2.4            | Output: +10%, Time: -80%
50-word Joke   | Streaming     | 68           | 60            | 0.47           |
500-word Joke  | Non-streaming | 95           | 595           | 22.39          | Output: +10%, Time: -98%
500-word Joke  | Streaming     | 68           | 649           | 0.41           |
500-word Joke  | Non-streaming | 68           | 642           | 30.11          | Output: -0.05%, Time: -97%
500-word Joke  | Streaming     | 68           | 641           | 0.88           |
500-word Story | Non-streaming | 68           | 616           | 16.86          | Output: +2.27%, Time: -97.5%
500-word Story | Streaming     | 68           | 630           | 0.44           |
500-word Essay | Non-streaming | 70           | 687           | 22.23          | Output: +1.46%, Time: -97.15%
500-word Essay | Streaming     | 70           | 697           | 0.63           |

Efficiency in Streaming Mode:

  • Output < 100 Tokens: Time reduction of 80%-85%.
  • Output 100-600 Tokens: Time reduction of 97%-98%.
  • Output > 600 Tokens: Time reduction of 98%-99%.

Key Insights:

  • Streaming mode delivers increasingly large time savings as output size grows.
  • Minimal impact on output quality (≤2.5% variance) ensures task reliability.
  • Significant time savings make streaming ideal for long-form content generation and real-time use cases.

Note

The benchmarking results above were produced on our system under specific scenarios. Performance may vary depending on your unique environment and use cases. We recommend conducting your own due diligence before enabling streaming. These benchmarks are provided solely for reference and do not guarantee similar outcomes in all situations.

Analytics

As part of the XO platform's Analytics module updates, enhancements have been introduced to better track and differentiate streaming and non-streaming responses. In the LLM Usage Logs main screen, a new column called "TTFT" (Time to First Token) has been added, applicable only to streaming responses. Additionally, the Detailed Log page now includes a Response Type field to indicate whether a response is streaming or non-streaming, offering more precise insights into response behavior.

  • TTFT (Time to First Token): Reflects the time taken for the first token to appear in the response. For the final chunk of a response volley, TTFT is blank, as no further messages are sent to the user.

  • Response Duration: Indicates the time taken by the LLM from generating the first chunk to the final chunk.

This detailed analytics data can be found under Usage Logs in the platform.
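For intuition about how these two metrics relate, the sketch below times a streamed completion client-side. It mirrors the definitions above but is an illustrative assumption, not the platform's own instrumentation; the model name and prompt are placeholders.

```python
import time

from openai import OpenAI

client = OpenAI()

start = time.monotonic()
first_token_at = last_token_at = None

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        now = time.monotonic()
        if first_token_at is None:
            first_token_at = now  # first visible token marks the TTFT
        last_token_at = now

if first_token_at is not None:
    print(f"TTFT: {first_token_at - start:.2f}s")
    print(f"Response duration: {last_token_at - first_token_at:.2f}s")
```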


Considerations for Streaming

While streaming enhances real-time interactions and user experience, certain limitations should be considered.

  • Post-processing operations are not possible, as they require the complete response, which conflicts with the nature of streaming.
  • Guardrails are not supported in streaming mode, as content moderation typically requires full-context evaluation, which is incompatible with token-by-token streaming.
  • The effectiveness of voice-based streaming also depends on the TTS engine’s support for bi-directional streaming, limiting compatibility to specific engines such as PlayHT and Deepgram.
  • When BotKit is enabled, interception of streamed response messages is unsupported due to the real-time delivery process.

Note

The quality of streaming responses depends heavily on the prompts you write, and since LLMs are subject to hallucination, it is essential to perform your own due diligence. Ensure that your prompts are accurate and aligned with the desired output for reliable performance.