XO GPT: Answer Generation Model¶
Introduction¶
Large Language Models (LLMs) have transformed the landscape of natural language processing (NLP), particularly in the development of Question and Answer (Q&A) systems. By utilizing the Retrieval Augmented Generation (RAG) approach, these models combine the power of vast textual data with real-time retrieval of relevant information, enabling them to generate accurate and contextually rich answers to user queries. The RAG technique enhances the model's ability to deliver precise responses, especially in scenarios where up-to-date or domain-specific information is crucial. This hybrid approach improves reliability and relevance, making LLM-based Q&A systems highly effective across various applications, from customer service to research assistance.
Implementing RAG-driven Q&A systems also presents challenges, such as ensuring the quality of retrieved content and managing potential biases. Kore.ai's XO GPT Model for Answer Generation is a fine-tuned Large Language Model (LLM) that addresses the key challenges faced when pre-trained LLMs are used out-of-the-box.
Challenges with Commercial Models¶
- Latency: The time consumed by the commercial LLMs to process and return a response can be significant, especially when dealing with high volumes of requests or real-time applications. This impacts the user experience.
- Cost: Commercial models often charge per request, and the cost rises with high-scale usage. This makes costs difficult to manage, especially for large-scale deployments.
- Data Governance: Sending user queries to external models raises data privacy and security concerns. This is crucial in industries that involve sensitive or proprietary information.
- Lack of Customization: Commercial models are not tailored to specific use cases or industries, leading to less accurate or relevant responses.
- Limited Control: There is minimal control over the internal workings of commercial models, making it difficult to correct or refine their behavior when they generate incorrect or undesirable outputs.
- Compliance and Regulatory Constraints: Certain industries have stringent compliance and regulatory requirements that may not be fully supported by commercial LLM providers, complicating their use in those sectors.
RAG Framework and XO GPT Integration¶
The XO GPT Model is a key component within our RAG framework. It is activated right after the retrieval phase, taking the retrieved data and generating accurate, contextually relevant answers so that the overall system delivers high-quality responses.
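For orientation, the snippet below is a minimal, self-contained sketch of a RAG flow in which answer generation runs immediately after retrieval. The toy retriever and all function names are illustrative placeholders and do not reflect XO Platform internals.

```python
# Minimal, illustrative sketch of a RAG flow where answer generation runs
# right after retrieval. All names are hypothetical placeholders.
from typing import List

def retrieve_chunks(chunks: List[str], query: str, top_k: int = 2) -> List[str]:
    """Toy lexical retriever: rank chunks by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:top_k]

def generate_answer(query: str, retrieved: List[str]) -> str:
    """Placeholder for the answer-generation call: the model receives the
    query plus the retrieved chunks and must answer from that context only."""
    context = "\n".join(retrieved)
    return f"[answer grounded in]\n{context}\n[for question] {query}"

chunks = ["A home loan is a secured loan obtained to purchase a property.",
          "EMI includes both principal and interest components."]
query = "What is a home loan?"
print(generate_answer(query, retrieve_chunks(chunks, query)))
```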
Key Assumptions¶
The following are a few key assumptions made for the XO GPT Answer Generation Model:
- The answer generation is assumed to be based only on text-based data chunks, excluding non-textual content like images or videos.
- The model processes questions that the XO GPT User Query Rephrasing Model has first rephrased for improved clarity and relevance.
- The model assumes that the data chunks it retrieves are accurate and relevant, and its performance is evaluated accordingly.
- The model focuses on generating answers based solely on the text, so the responses may not include information within links or images.
Benefits of XO GPT Answer Generation Model¶
The XO GPT Answer Generation model offers several potential advantages for businesses seeking to enhance their Q&A capabilities:
- **Consistent and Accurate**: XO GPT employs retrieval-augmented generation techniques to deliver contextually relevant and precise answers, enhancing the reliability and effectiveness of responses in Q&A applications. Detailed benchmarking results, including latency and accuracy metrics compared to other models, can be found here.
- **Cost-Effective Performance**: For customers in the Enterprise Tier, XO GPT completely eliminates the commercial models' usage costs. The following illustration uses GPT-4 family models (note: actual costs could vary based on token usage); a short, illustrative cost-calculation sketch also follows this benefits list. For instance, with an average of 10 input tokens per user question and 10,000 daily Q&A pairs, where each answer averages 80 tokens, the cost comparison between models is as follows:
| Model Name | Input Cost / MTok | Output Cost / MTok | Total Cost / Annum |
|---|---|---|---|
| GPT-4 | $30 | $60 | $427,050 |
| GPT-4 Turbo | $10 | $30 | $158,775 |
| GPT-4o Mini | $0.15 | $0.60 | $2,628 |
- **Enhanced Data Security and Safety**: Our model is designed to safeguard information by ensuring that no client or user data is utilized for model retraining. Our systems are robust enough to handle both client and user data securely.
Guardrails: XO GPT uses several key safety measures to ensure responsible and secure interactions:
- Content Moderation: Detects and blocks harmful or inappropriate content.
- Behavioral Guidelines: Maintains professionalism and appropriateness in responses.
- Response Oversight: Monitors and addresses flagged or potentially problematic interactions.
- Input Validation: Ensures inputs are appropriate and comply with usage guidelines.
- Usage Controls: Applies limits to prevent misuse and support responsible operation.
AI Safety Measures: XO GPT incorporates essential safety protocols to prevent harmful behaviors and maintain ethical standards:
- Ethical Guidelines: Strict protocols ensure AI decisions align with ethical standards.
- Bias Monitoring: Regular checks to prevent bias and ensure fairness in responses.
- Transparency: Clear, understandable responses to promote trust and accountability.
- Continuous Improvement: Ongoing updates to enhance safety and incorporate feedback.
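The snippet below is a minimal sketch of how an annual cost estimate like the one in the table above can be derived from per-million-token rates. The rates and token counts are illustrative inputs only, not authoritative pricing; real deployments also send system instructions and retrieved chunks, which add input tokens beyond the question itself.

```python
# Illustrative annual-cost estimate for a commercial LLM in a Q&A workload.
# Rates and token counts are example inputs, not authoritative pricing; real
# prompts also include system instructions and retrieved chunks, which add
# to the input-token count per query.

def annual_cost(input_rate_per_mtok: float, output_rate_per_mtok: float,
                input_tokens_per_query: int, output_tokens_per_query: int,
                queries_per_day: int, days: int = 365) -> float:
    input_mtok = input_tokens_per_query * queries_per_day * days / 1_000_000
    output_mtok = output_tokens_per_query * queries_per_day * days / 1_000_000
    return input_mtok * input_rate_per_mtok + output_mtok * output_rate_per_mtok

# Example: 10 question tokens and 80 answer tokens per query, 10,000 queries per day,
# at hypothetical rates of $0.15 / MTok (input) and $0.60 / MTok (output).
print(f"${annual_cost(0.15, 0.60, 10, 80, 10_000):,.2f}")
```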
Note
The exact performance, features, and language support may vary based on specific implementations and use cases. We recommend thorough testing in your specific environment to assess the model's suitability for your needs.
Use Cases¶
The use cases of an Answer Generation / Q&A model, particularly one leveraging Retrieval-Augmented Generation (RAG), span various domains, each benefiting from the model’s ability to provide accurate, contextually relevant answers.
| Domain | Use Cases |
|---|---|
| Healthcare | Answering patient queries based on medical records and guidelines |
| | Assisting doctors with evidence-based diagnosis recommendations |
| | Providing drug information and potential interactions |
| Finance | Answering customer questions about loan eligibility and application status |
| | Providing personalized financial advice based on user profiles |
| | Assisting with tax-related queries and calculations |
| E-commerce | Recommending products based on user queries and past purchases |
| | Answering questions about shipping, returns, and order status |
| | Providing detailed product information and comparisons |
| Education | Assisting students with homework and study questions |
| | Providing explanations for complex concepts in various subjects |
| | Recommending study materials and resources based on curriculum needs |
| Legal | Answering legal queries based on case law and statutes |
| | Assisting with contract review and interpretation |
| | Providing information on legal rights and procedures |
| Customer Support | Automating responses to frequently asked questions |
| | Assisting with troubleshooting common issues |
| | Providing personalized support based on customer history |
| Human Resources | Answering employee questions about policies and benefits |
| | Assisting with recruitment by matching job descriptions to candidate profiles |
| | Providing guidance on performance evaluations and career development |
Sample Outputs¶
The following section presents a few examples of the answers generated by the XO GPT Answer Generation model based on the shortlisted chunks:
Ingested Content (Chunks)
- ChunkId: chk-1 Content: A home loan is a secured loan that is obtained to purchase a property by offering the property itself as collateral. The calculation of a home loan involves several factors, including the loan amount, interest rate, tenure, and the borrower’s financial profile.
- ChunkId: chk-2 Content: The principal amount is the initial loan amount sanctioned by the lender. The interest rate can be either fixed or floating. Fixed interest rates remain constant throughout the loan tenure, whereas floating rates fluctuate with changes in market rates.
- ChunkId: chk-3 Content: The tenure of the loan is the period over which the loan is to be repaid. It typically ranges from 5 to 30 years. Longer tenures cause lower monthly installments but higher interest payments over the life of the loan.
- ChunkId: chk-4 Content: The Equated Monthly Installment (EMI) is the amount a borrower needs to pay every month. It includes both the principal and interest components. The EMI can be calculated using the formula: EMI = [P x R x (1+R)^N] / [(1+R)^N-1], where P is the principal loan amount, R is the monthly interest rate, and N is the number of monthly installments.
- ChunkId: chk-5 Content: Lenders also consider the borrower’s financial profile, including income, credit score, and existing liabilities, before sanctioning the loan. A good credit score and stable income can help you get a lower interest rate. Borrowers should use online home loan calculators to get an estimate of their EMI and plan their finances accordingly.
Generated Responses
Question:
How do you compute EMI?
XO GPT Model Generated Responses:
The EMI is calculated using the formula: EMI = [P x R x (1+R)^N] / [(1+R)^N-1], where P is the principal loan amount, R is the monthly interest rate, and N is the number of monthly installments [chk-4].
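As a quick illustration of the formula cited in this answer, the snippet below computes an EMI for example figures; the loan amount, rate, and tenure are chosen for illustration only and do not come from the ingested content.

```python
# Worked example of the EMI formula: EMI = [P x R x (1+R)^N] / [(1+R)^N - 1]
# The loan figures below are illustrative, not taken from the ingested content.
P = 5_000_000          # principal loan amount
annual_rate = 0.09     # 9% annual interest rate
R = annual_rate / 12   # monthly interest rate
N = 20 * 12            # 20-year tenure, in monthly installments

emi = (P * R * (1 + R) ** N) / ((1 + R) ** N - 1)
print(f"EMI: {emi:,.2f} per month")   # roughly 44,986 per month for these inputs
```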
Question:
What is a home loan? What is photosynthesis?
XO GPT Model Generated Responses:
I found only the answer to one question 'What is a home loan?': A home loan is a secured loan that is obtained to purchase a property by offering the property itself as collateral. The calculation of a home loan involves several factors, including the loan amount, interest rate, tenure, and the borrower’s financial profile [chk-1]. The remaining questions cannot be answered from the given context.
XO GPT - Model Building Process¶
The model-building process consists of several key stages that form the backbone of AI system development. To learn more, see Model Building Process.
Model Benchmarks¶
This section highlights the features, updates, and changes that vary between different versions of the Answer Generation Model. It provides version-specific information that can help identify what is unique to each version.
The following table summarizes the versions covered in this document:
| Model Version | Accuracy | Tokens/sec (TPS) | Latency (secs) | Benchmark Comparison | Test Data & Results |
|---|---|---|---|---|---|
| Version 1.0 | 94% | 20 | 1.36 | Benchmark summary | Test data and results spreadsheet |
| Version 2.0 | 96% | 54 | 1.03 | Benchmark summary | Test data and results spreadsheet |
Version 2.0¶
Model Choice¶
We evaluate various community models that are suitable for the objective of response generation and fine-tune them with our proprietary data described in the previous section. One or more candidate models are used throughout the training and evaluation phase, and the model that performs best in terms of accuracy, safety, latency, etc., is deployed. We continue to evaluate models as part of ongoing improvements and may choose a different base model in newer versions. Currently, we are using Llama 3.1 8B Instruct as the base model for fine-tuning and deployment.
| Base Model | Developer | Language | Release Date | Status | Knowledge Cutoff |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | Meta | Multi-lingual | July, 2024 | Static | December 2023 |
Fine-tuning Parameters¶
| Parameters | Description | Value |
|---|---|---|
| Load in 4-bit Precision | Loads the model weights in 4-bit precision to reduce memory usage. | True |
| Use Double Quantization | Applies double quantization to further reduce memory usage. | True |
| 4-bit Quantization Type | Type of quantization used for 4-bit precision. | nf4 |
| Computation Data Type | The data type used for computation with 4-bit quantized weights. | torch.float16 |
| LoRA Rank | The rank of the low-rank decomposition in LoRA. | 32 |
| LoRA Alpha | LoRA scaling factor. | 16 |
| LoRA Dropout Rate | Dropout rate for LoRA layers to prevent overfitting. | 0.05 |
| Bias Term Inclusion | Specifies whether to add bias terms in the LoRA layers. | - |
| Task Type | The type of task for which LoRA is applied. | CAUSAL_LM |
| Targeted Model Modules | Specific layers in the model where LoRA is applied. | ['k_proj', 'q_proj', 'v_proj', 'o_proj'] |
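For illustration, the parameters above map onto the Hugging Face transformers and peft libraries roughly as follows. This is a sketch of a QLoRA-style setup rather than the exact training script, and the model identifier is an assumption.

```python
# Illustrative mapping of the fine-tuning parameters above onto
# Hugging Face transformers + peft (QLoRA-style setup). Sketch only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # load weights in 4-bit precision
    bnb_4bit_use_double_quant=True,          # double quantization
    bnb_4bit_quant_type="nf4",               # 4-bit quantization type
    bnb_4bit_compute_dtype=torch.float16,    # computation data type
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",      # assumed Hugging Face model ID
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=32,                                    # LoRA rank
    lora_alpha=16,                           # LoRA scaling factor
    lora_dropout=0.05,                       # LoRA dropout rate
    bias="none",                             # bias handling (table leaves this unspecified)
    task_type="CAUSAL_LM",                   # causal language modeling
    target_modules=["k_proj", "q_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)   # attach LoRA adapters for training
```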
General Parameters¶
As a pre-requisite, an Agent AI License is required, and the model is hosted on infrastructure with 2 A10 GPUs. Some of the other general fine-tuning parameters include the following:
| Parameters | Description | Value |
|---|---|---|
| Learning Rate | Controls how quickly or slowly the model converges to the loss minimum. | 2e-4 (0.0002) |
| Batch Size | The number of examples the model learns from at once. | 2 |
| Epochs | Number of times the model sees the entire training data. | 4 |
| Warm-up Steps | Gradual start for the learning rate to help the model stabilize early on. | – |
| Max Sequence Length | The maximum length of input data the model can handle. | 32768 |
| Early Stopping | Stops training if the model stops improving to prevent overfitting. | – |
| Optimizer | Algorithm that adjusts the model's learning. | paged_adamw_8bit |
| Layer-wise LR Decay | Uses different learning rates for different parts of the model to improve stability. | – |
| Learning Rate Scheduler | Adjusts the learning rate during training to improve performance. | – |
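The general parameters above can be expressed, for illustration, with transformers.TrainingArguments roughly as shown below. Settings marked "–" in the table (warm-up, early stopping, layer-wise LR decay, scheduler) are left at library defaults in this sketch, and the output directory and fp16 flag are assumptions.

```python
# Illustrative training-argument setup for the values listed above.
# Unspecified settings are left at library defaults; this is a sketch,
# not the exact training script.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xo-gpt-answer-gen",          # hypothetical output directory
    learning_rate=2e-4,                      # learning rate
    per_device_train_batch_size=2,           # batch size
    num_train_epochs=4,                      # epochs
    optim="paged_adamw_8bit",                # optimizer
    fp16=True,                               # assumed, to match the float16 compute dtype
)

# The 32,768-token max sequence length is typically passed to the supervised
# fine-tuning trainer (e.g., trl's SFTTrainer) rather than TrainingArguments.
```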
AWQ Model Quantization¶
To optimize the performance and efficiency of the model, we apply AWQ (Activation-aware Weight Quantization), a technique designed to significantly reduce memory and computational requirements while maintaining accuracy. Below is a detailed breakdown of the quantization process and parameters used with AWQ for the Llama 3.1 8B Instruct model.
| Parameters | Description | Value |
|---|---|---|
| Zero Point | Indicates whether to include a zero point in the quantization scheme for better weight representation. | True |
| Quantization Group Size | The size of the weight groups for quantization, balancing compression and computational overhead. | 128 |
| Weight Precision | The number of bits used to represent weights in the quantization process. | 4 |
| Quantization Version | The AWQ variant, optimized for GEMM (General Matrix Multiplication) operations. | "GEMM" |
| Computation Data Type | The data type used for computation during inference with quantized weights. | torch.float16 |
| Model Loading | Configuration to load the model with reduced CPU memory usage for efficient deployment. | {"low_cpu_mem_usage": True} |
| Tokenizer Loading | Trust configuration for loading the tokenizer with remote-code compatibility. | trust_remote_code=True |
| Quantization Benefits | AWQ enables a lower memory footprint and faster inference without significant loss of accuracy. | - |
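For illustration, the AWQ settings above correspond roughly to the following AutoAWQ quantization sketch; the checkpoint path is a placeholder and the exact script may differ.

```python
# Illustrative AWQ quantization sketch using the AutoAWQ library.
# The model path is a placeholder for the fine-tuned checkpoint.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/finetuned-llama-3.1-8b-instruct"   # hypothetical checkpoint

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

quant_config = {
    "zero_point": True,       # include a zero point in the quantization scheme
    "q_group_size": 128,      # quantization group size
    "w_bit": 4,               # weight precision (bits)
    "version": "GEMM",        # AWQ variant optimized for GEMM kernels
}

model.quantize(tokenizer, quant_config=quant_config)      # calibrate and quantize
model.save_quantized(model_path + "-awq")                 # save the quantized weights
tokenizer.save_pretrained(model_path + "-awq")
```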
Benchmarks Summary¶
To compare and contrast the performance of the fine-tuned model, we have considered the following other models:
- Mistral 7B v2: An open-source large language model with 7 billion parameters, known for its strong performance despite its relatively small size.
- Llama 3.1 8B: A powerful open-source large language model with 8 billion parameters, known for its strong performance across various tasks, including multilingual dialogue, text generation, and understanding.
- GPT-4o mini: A compact large language model developed by OpenAI, known for its strong capabilities across a wide range of tasks at a low cost.
- Claude 3.5 Sonnet: Part of the Claude 3.5 model family by Anthropic, designed to balance intelligence and speed for various tasks.
By leveraging its strengths in performance, latency, and responsible AI principles, XO GPT is well-positioned as a high-performing language model. The Test Data and Results V2.0 report provides a deeper dive into the evaluation process and results.
Version 1.0¶
Model Choice¶
We evaluate various community models that are suitable for the objective of response generation and fine-tune them with our proprietary data described in the previous section. One or more candidate models are used throughout the training and evaluation phase, and the model that performs best in terms of accuracy, safety, latency, etc., is deployed. We continue to evaluate models as part of ongoing improvements and may choose a different base model in newer versions. Currently, we are using Mistral 7B Instruct v0.2 as one of the base models for fine-tuning and deployment.
| Base Model | Developer | Language | Release Date | Status | Knowledge Cutoff |
|---|---|---|---|---|---|
| Mistral 7B Instruct v0.2 | Mistral AI | Multi-lingual | September, 2024 | Static | September, 2024 |
Fine-tuning Parameters¶
| Parameters | Description | Value |
|---|---|---|
| Load in 4-bit Precision | Loads the model weights in 4-bit precision to reduce memory usage. | True |
| Use Double Quantization | Applies double quantization to further reduce memory usage. | True |
| 4-bit Quantization Type | Type of quantization used for 4-bit precision. | nf4 |
| Computation Data Type | The data type used for computation with 4-bit quantized weights. | torch.float16 |
| LoRA Rank | The rank of the low-rank decomposition in LoRA. | 32 |
| LoRA Alpha (Scaling Factor) | LoRA scaling factor. | 16 |
| LoRA Dropout Rate | Dropout rate for LoRA layers to prevent overfitting. | 0.05 |
| Bias Term Inclusion | Specifies whether to add bias terms in the LoRA layers. | - |
| Task Type | The type of task for which LoRA is applied. | CAUSAL_LM |
| Targeted Model Modules | Specific layers in the model where LoRA is applied. | ['k_proj', 'q_proj', 'v_proj', 'o_proj'] |
General Parameters¶
As a pre-requisite, an Agent AI License is required, and the model is hosted on infrastructure with 2 A10 GPUs. Some of the other general fine-tuning parameters include the following:
| Parameters | Description | Value |
|---|---|---|
| Learning Rate | Controls how quickly or slowly the model converges to the loss minimum. | 1e-3 (0.001) |
| Batch Size | The number of examples the model learns from at once. | 1 |
| Epochs | Number of times the model sees the entire training data. | 3 |
| Warm-up Steps | Gradual start for the learning rate to help the model stabilize early on. | – |
| Max Sequence Length | The maximum length of input data the model can handle. | 32768 |
| Early Stopping | Stops training if the model stops improving to prevent overfitting. | – |
| Optimizer | Algorithm that adjusts the model's learning. | paged_adamw_8bit |
| Layer-wise LR Decay | Uses different learning rates for different parts of the model to improve stability. | – |
| Learning Rate Scheduler | Adjusts the learning rate during training to improve performance. | – |
Benchmarks Summary¶
To compare and contrast the performance of the fine-tuned model, we have considered the following other models:
- Llama 3.1 8B: A powerful open-source large language model with 8 billion parameters, known for its strong performance across various tasks, including multilingual dialogue, text generation, and understanding.
- GPT-4o mini: A compact large language model developed by OpenAI, known for its strong capabilities across a wide range of tasks at a low cost.
- Claude 3.5 Sonnet: Part of the Claude 3.5 model family by Anthropic, designed to balance intelligence and speed for various tasks.
By leveraging its strengths in performance, latency, and responsible AI principles, XO GPT is well-positioned as a high-performing language model. The Test Data and Results V1.0 report provides a deeper dive into the evaluation process and results.