Best Practices for Data Ingestion¶
The most crucial step in building a RAG solution is to ensure the correct data is ingested and available in the system. Search AI supports various data sources, including:
- Files
- Web pages
- Third-party applications via connectors
Supported content types and formats vary across these sources.
Files¶
Search AI supports document ingestion in the following formats:
✅ PDF, DOCX, PPTX, and TXT
Refer to the guidelines below for the optimum ingestion of content from documents.
Document Quality & Structure¶
For optimal processing and search accuracy, documents should:
- Be system-generated (digitally created) rather than scanned or handwritten.
- Maintain a consistent layout across all pages.
- Be unencrypted and free of password protection to enable processing.
- Use shorter, well-structured content to improve search precision.
- Avoid unnecessary elements like headers, footers, and metadata that do not contribute to content relevance.
Formatting & Layout Considerations¶
- Single-column documents provide the highest retrieval accuracy.
- Multi-column layouts may reduce search precision and require fine-tuning.
- Documents should have clear section headers and logical content organization for improved readability and search effectiveness.
Content & File Restrictions¶
To prevent data loss and ensure optimal retrieval:
- Avoid Compressed PDFs, which can cause data distortion.
- Avoid Multi-page tables, as they are challenging to process accurately.
- Avoid Inconsistent formatting, such as switching between single-column and multi-column layouts within the same document.
- Avoid Scanned or heavily formatted files, which may not be processed effectively.
Media & Table Handling¶
Search AI supports image and table extraction using a layout-aware model, but only text is extracted by default. To improve retrieval accuracy:
- Provide text descriptions for any key information in images.
- Include contextual summaries before or after tables and images.
- Label images and tables with meaningful titles for easier indexing and searchability.
- Update extraction strategies when working with documents containing significant visual elements.
Webpages/HTML Content¶
Optimizing Web Page Structure¶
- Follow schema.org rules to standardize metadata and improve content extraction.
- If schema.org is not used, ensure consistent logic is applied for structuring headings, subheadings, and content. This enables using the document workbench for custom processing of content.
- Websites using standard HTML tags (like, <h1>, <h2>, <p>, <img>, <table>) provide the best results.
Handling Non-Standard Web Content¶
- Custom CSS-based structures may cause processing issues and require fine-tuning.
- Override default processing using Document Workbench to define custom extraction rules for non-standard layouts.
Connector Integrations¶
Relevance & Filtering¶
- Ingest only relevant data to align with the specific use case.
- Utilize advanced filters to ingest valuable content for search and retrieval selectively.
- Avoid pulling in entire datasets from a connector, as unnecessary content can introduce noise and require additional fine-tuning.
Performance Considerations¶
- Limit ingestion frequency to avoid excessive data processing and system overload.
- Monitor ingestion logs and adjust configurations based on search performance and user feedback.
- Use incremental updates instead of full re-ingestion to optimize system efficiency.