Streaming Responses¶
Introduction¶
Streaming responses in large language models (LLMs) enable real-time, incremental output generation. Instead of waiting for the entire response to be computed, the model starts transmitting pieces of the output as they become available. This approach not only reduces latency and enhances user interaction but also fosters a sense of immediate connection, mirroring real-time communication. Streaming is particularly beneficial in applications requiring dynamic updates, such as conversational AI, speech-to-text systems, and real-time content-generation tools.
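The platform handles streaming internally; purely as an illustration of the concept, the sketch below shows how streamed output is consumed incrementally using the OpenAI Python SDK. The model name and prompt are placeholders, and this is not the XO Platform's implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Request a streamed completion: chunks arrive as they are generated
# instead of one complete response at the end.
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small fragment of the response (or none, for
    # control chunks such as the final one with a finish reason).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```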
Current Capabilities¶
We support the voice-based channel Kore Voice Gateway for GenAI features, including Agent Node, with PlayHT as the supported TTS engine. Additionally, we provide seamless integration with models from OpenAI and Azure OpenAI. Our Custom Prompt capability enables integration with other LLMs, allowing businesses to use non-system models by defining their own prompts, provided the LLM supports streaming.
Benefits of Streaming¶
- Real-Time Output: Generates and displays text instantly, reducing wait times.
- Lower Latency: Speeds up response time, enhancing user experience.
- Improved Interaction Flow: Partial outputs support iterative writing and brainstorming.
- Optimized for Live Applications: Enhances real-time chat, speech-to-text, and code autocompletion.
Use Case¶
Streaming responses unlock significant benefits across various industries, enhancing real-time interactions and delivering incremental updates that improve user experience and operational efficiency.
| Domain | Use Cases |
|---|---|
| Healthcare | Streaming comprehensive summaries of patient history or medical guidelines. Delivering in-depth analysis of clinical studies or medical research in real time. |
| Finance | Streaming detailed breakdowns of investment portfolios or market analysis. Providing incremental summaries of compliance documents and regulations. |
| E-commerce | Streaming extensive side-by-side comparisons of products for informed decision-making. |
| Education | Delivering detailed outlines or summaries of academic courses or study materials. |
| Legal | Streaming detailed explanations of legal precedents and their relevance to current cases. Providing incremental analysis and feedback on lengthy legal contracts. |
| Customer Support | Streaming detailed troubleshooting steps for intricate customer issues or technical problems. |
| Human Resources | Streaming detailed explanations of HR policies or benefit packages for employees. |
| Marketing | Streaming in-depth analysis of marketing campaigns and their ROI. |
Enable Streaming for Agent Node¶
Select Default Streaming Prompt¶
The XO Platform provides Default-Streaming prompts, in addition to the default (non-streaming) prompts, for the Agent Node on the OpenAI and Azure OpenAI models.
Go to Generative AI Tools > GenAI Features > Dynamic Conversations and select the default-streaming prompt for the Agent Node. You can also create custom streaming prompts for these models.
Create Custom Streaming Prompts¶
To create a custom streaming prompt, follow the steps in How to add Prompts and Requests and enable the streaming response toggle.
Note
- When enabled, add the required stream parameter to the custom prompt so the model recognizes streaming; for example, "stream": true for OpenAI and Azure OpenAI (see the illustrative snippet after this note).
- The saved prompt will appear with a stream tag in the prompts library.
- Enabling streaming disables the “Exit Scenario”, “Virtual Assistant Response”, “Collected Entities”, and “Tool Call Request” (for Agent Node) fields.
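The exact body of a custom prompt depends on the API of the model you integrate. As a hypothetical illustration only, an OpenAI-style Chat Completions request body with streaming enabled might look like the following; the model name and the {{User_Input}} placeholder are assumptions for this sketch, not platform-defined variables.

```python
# Illustrative only: a hypothetical request body for a custom streaming prompt.
# Field names follow the OpenAI/Azure OpenAI Chat Completions API; adapt the
# body to whatever your target LLM expects.
custom_prompt_body = {
    "model": "gpt-4o-mini",  # assumed model name
    "messages": [
        {"role": "system", "content": "You are a helpful voice agent."},
        {"role": "user", "content": "{{User_Input}}"},  # hypothetical placeholder
    ],
    "stream": True,  # the stream parameter required for the model to stream
}
```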
Configure Kore Voice Gateway¶
Streaming is currently supported only by the Kore Voice Gateway channel. To configure it, see Configure Kore Voice Gateway.
Benchmarking¶
| Task | Mode | Input Tokens | Output Tokens | Time Taken (s) | Reduction (%) |
|---|---|---|---|---|---|
| Agent Node | Non-streaming | 777 | 90 | 2.59 | Output: -30%, Time: -83% |
| | Streaming | 676 | 62 | 0.44 | |
| 50-word Joke | Non-streaming | 95 | 54 | 2.4 | Output: +10%, Time: -80% |
| | Streaming | 68 | 60 | 0.47 | |
| 500-word Joke | Non-streaming | 95 | 595 | 22.39 | Output: +10%, Time: -98% |
| | Streaming | 68 | 649 | 0.41 | |
| 500-word Joke | Non-streaming | 68 | 642 | 30.11 | Output: -0.05%, Time: -97% |
| | Streaming | 68 | 641 | 0.88 | |
| 500-word Story | Non-streaming | 68 | 616 | 16.86 | Output: +2.27%, Time: -97.5% |
| | Streaming | 68 | 630 | 0.44 | |
| 500-word Essay | Non-streaming | 70 | 687 | 22.23 | Output: +1.46%, Time: -97.15% |
| | Streaming | 70 | 697 | 0.63 | |
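The table does not state how the Reduction (%) column is derived; the figures are consistent with a simple relative difference between the streaming and non-streaming rows, as this sketch of the Agent Node row shows.

```python
# Assumed derivation of the Reduction (%) column (Agent Node row):
# change = (streaming_value - non_streaming_value) / non_streaming_value * 100
output_change = (62 - 90) / 90 * 100        # ≈ -31%, reported as -30%
time_change = (0.44 - 2.59) / 2.59 * 100    # ≈ -83%
print(f"Output: {output_change:.0f}%, Time: {time_change:.0f}%")
```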
Efficiency in Streaming Mode:
- Output < 100 Tokens: Time reduction of 80%-85%.
- Output 100-600 Tokens: Time reduction of 97%-98%.
- Output > 600 Tokens: Time reduction of 98%-99%.
Key Insights:
- Streaming mode delivers increasingly large time savings as output size grows.
- Minimal impact on output quality (≤2.5% variance) ensures task reliability.
- Significant time savings make streaming ideal for long-form content generation and real-time use cases.
Note
These benchmarking results were obtained on our system under specific scenarios. Performance may vary depending on your unique environment and use cases. We recommend conducting your own due diligence before enabling streaming. These benchmarks are provided solely for reference and do not guarantee similar outcomes in all situations.
Analytics¶
As part of the XO platform's Analytics module updates, enhancements have been introduced to better track and differentiate streaming and non-streaming responses. In the LLM Usage Logs main screen, a new column called "TTFT" (Time to First Token) has been added, applicable only to streaming responses. Additionally, the Detailed Log page now includes a Response Type field to indicate whether a response is streaming or non-streaming, offering more precise insights into response behavior.
- TTFT (Time to First Token): Reflects the time taken for the first token to appear in the response. For the final chunk of a response volley, TTFT is blank, as no further messages are sent to the user.
- Response Duration: Indicates the time taken by the LLM from generating the first chunk to the final chunk. A minimal sketch of measuring both metrics appears after this list.
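The platform reports these metrics in the Usage Logs automatically. For reference only, an equivalent client-side measurement could be approximated as below; this is a sketch using the OpenAI Python SDK with an assumed model name, not the platform's implementation.

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

start = time.monotonic()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.monotonic()  # first visible token arrived
end = time.monotonic()

if first_token_at is not None:
    ttft = first_token_at - start              # Time to First Token
    response_duration = end - first_token_at   # first chunk to final chunk
    print(f"TTFT: {ttft:.2f}s, Response Duration: {response_duration:.2f}s")
```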
This detailed analytics data can be found under Usage Logs in the platform.
Considerations for Streaming¶
While streaming enhances real-time interactions and user experience, certain limitations should be considered.
- Post-processing operations are not possible, as they require the complete response, which conflicts with the nature of streaming.
- Guardrails are not supported in streaming mode, as content moderation typically requires full-context evaluation, which is incompatible with token-by-token streaming.
- The effectiveness of voice-based streaming also depends on the TTS engine’s support for bi-directional streaming, limiting compatibility to specific engines such as PlayHT and Deepgram.
- When BotKit is enabled, interception of streamed response messages is unsupported due to the real-time delivery process.
Note
The quality of streaming responses is highly dependent on the prompts written, and since LLMs are subject to hallucination, it is essential to perform your own due diligence. Ensure that the prompts are accurate and aligned with the desired output for reliable performance.