Best Practices for Data Ingestion¶

The most crucial step in building a retrieval-augmented generation (RAG) solution is ensuring that the correct data is ingested and available in the system. Search AI supports various data sources, including:

Files
Web pages
Third-party applications via connectors

Supported content types and formats vary across these sources.

Files¶

Search AI supports document ingestion in the following formats:

✅ PDF, DOCX, PPTX, and TXT

Follow the guidelines below for optimal ingestion of document content.

Document Quality & Structure¶

For optimal processing and search accuracy, documents should:

Be system-generated (digitally created) rather than scanned or handwritten.
Maintain a consistent layout across all pages.
Be unencrypted and free of password protection to enable processing.
Use concise, well-structured content to improve search precision.
Avoid unnecessary elements like headers, footers, and metadata that do not contribute to content relevance.
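
Some of these checks can be automated before upload. The following is a minimal sketch, not part of Search AI: the function name `preflight_issues` is hypothetical, and the encryption test is a byte-level heuristic (it looks for an `/Encrypt` entry in the PDF) rather than a reliable parser-based check.

```python
from pathlib import Path

# Formats listed in this guide as supported for file ingestion.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".txt"}


def preflight_issues(path: Path) -> list[str]:
    """Return likely ingestion problems for one file (empty list if none found)."""
    issues = []
    suffix = path.suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        issues.append(f"unsupported format: {suffix or '(no extension)'}")
        return issues
    data = path.read_bytes()
    if not data:
        issues.append("file is empty")
    # Heuristic only: encrypted PDFs normally reference an /Encrypt dictionary.
    # A dedicated PDF library would give a more reliable answer.
    if suffix == ".pdf" and b"/Encrypt" in data:
        issues.append("PDF appears to be encrypted or password protected")
    return issues
```

Running a check like this over a folder before ingestion gives a quick list of files to convert or unlock first. It does not detect scanned or handwritten content.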

Formatting & Layout Considerations¶

Single-column documents provide the highest retrieval accuracy.
Multi-column layouts may reduce search precision and require fine-tuning.
Documents should have clear section headers and logical content organization for improved readability and search effectiveness.

Content & File Restrictions¶

To prevent data loss and ensure optimal retrieval:

Avoid compressed PDFs, which can cause data distortion.
Avoid multi-page tables, as they are challenging to process accurately.
Avoid inconsistent formatting, such as switching between single-column and multi-column layouts within the same document.
Avoid scanned or heavily formatted files, which may not be processed effectively.

Media & Table Handling¶

Search AI supports image and table extraction using a layout-aware model, but only text is extracted by default. To improve retrieval accuracy:

Provide text descriptions for any key information in images.
Include contextual summaries before or after tables and images.
Label images and tables with meaningful titles for easier indexing and searchability.
Switch from the default text-only extraction to a layout-aware extraction strategy when working with documents containing significant visual elements.
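
As an illustration of the titling and summary advice, the sketch below renders a table as plain text framed by a title and a one-sentence summary, so the key fact survives even if only text is extracted. The helper name `table_with_context` and the sample data are made up for this example.

```python
def table_with_context(title, headers, rows, summary):
    """Render a table as plain text, preceded by a title and a summary sentence."""
    lines = [f"Table: {title}", summary, ""]
    lines.append(" | ".join(headers))
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)


# Illustrative data only.
text = table_with_context(
    title="Plan limits",
    headers=["Plan", "Max users"],
    rows=[["Basic", 5], ["Pro", 50]],
    summary="The Pro plan allows ten times as many users as the Basic plan.",
)
print(text)
```

The summary sentence restates the table's main point in prose, which gives the retriever a text passage to match against.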

Web Pages/HTML Content¶

Optimizing Web Page Structure¶

Follow the schema.org vocabulary to standardize metadata and improve content extraction.
If schema.org is not used, structure headings, subheadings, and content consistently across pages. Consistent structure makes it possible to use Document Workbench for custom content processing.
Websites that use standard HTML tags (such as <h1>, <h2>, <p>, <img>, and <table>) produce the best results.
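
To see why standard tags help, consider how little logic is needed to split such a page into sections. The sketch below uses Python's standard `html.parser` module to group paragraph text under the nearest preceding heading. It illustrates the general idea only and is not how Search AI processes pages; the names `SectionExtractor` and `extract_sections` are hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionExtractor(HTMLParser):
    """Group paragraph text under the most recent heading."""

    def __init__(self):
        super().__init__()
        self.sections = []  # list of {"heading": str, "text": [str, ...]}
        self._current_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS or tag == "p":
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return  # closing tag of an inline element such as <b>
        text = " ".join("".join(self._buffer).split())
        if tag in HEADING_TAGS:
            self.sections.append({"heading": text, "text": []})
        elif text:
            if not self.sections:
                # Paragraph before any heading: file it under an empty heading.
                self.sections.append({"heading": "", "text": []})
            self.sections[-1]["text"].append(text)
        self._current_tag = None


def extract_sections(html: str):
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    return parser.sections
```

A page built from styled `<div>` elements gives this kind of extractor nothing to key on, which is why such pages usually need custom rules.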

Handling Non-Standard Web Content¶

Custom CSS-based structures may cause processing issues and require fine-tuning.
Override default processing using Document Workbench to define custom extraction rules for non-standard layouts.
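
Conceptually, a custom extraction rule maps a site's own markup to a content role. The sketch below shows that idea with a hypothetical rule table keyed on CSS class names. It is not Document Workbench's rule format or API; the class names, `CLASS_RULES`, and `ClassRuleExtractor` are all invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical rule table: CSS class -> content role. Class names are made up.
CLASS_RULES = {"title-xl": "heading", "body-copy": "paragraph"}

# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "source", "wbr"}


class ClassRuleExtractor(HTMLParser):
    """Collect (role, text) pairs for elements whose class matches a rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.items = []
        self._role = None
        self._depth = 0
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._role is not None:
            self._depth += 1  # nested element inside a matched one
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.rules:
                self._role = self.rules[name]
                self._depth = 1
                self._buffer = []
                break

    def handle_data(self, data):
        if self._role is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._role is None:
            return
        self._depth -= 1
        if self._depth == 0:
            text = " ".join("".join(self._buffer).split())
            self.items.append((self._role, text))
            self._role = None
```

Elements whose classes match no rule are ignored, which is also a simple way to drop promotional or navigational blocks.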

Connector Integrations¶

Relevance & Filtering¶

Ingest only the data that is relevant to your specific use case.
Use advanced filters to selectively ingest content that is valuable for search and retrieval.
Avoid pulling in entire datasets from a connector, as unnecessary content can introduce noise and require additional fine-tuning.
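
The filtering logic can be pictured as a predicate applied to each record before it is ingested. The sketch below assumes records are plain dictionaries with `type`, `updated`, and `labels` fields. These field names, the function `select_for_ingestion`, and the sample criteria are assumptions for illustration and do not reflect any particular connector's schema or the filter options Search AI exposes.

```python
from datetime import date


def select_for_ingestion(records, allowed_types, updated_since, required_label=None):
    """Keep only records that match the content type, recency, and label criteria."""
    selected = []
    for record in records:
        if record["type"] not in allowed_types:
            continue  # e.g. skip comments or attachments
        if record["updated"] < updated_since:
            continue  # skip stale content
        if required_label is not None and required_label not in record["labels"]:
            continue  # skip content not tagged for this use case
        selected.append(record)
    return selected
```

Applying the narrowest criteria that still cover the use case keeps the index small and reduces the noise that would otherwise need tuning later.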

Performance Considerations¶

Limit ingestion frequency to avoid excessive data processing and system overload.
Monitor ingestion logs and adjust configurations based on search performance and user feedback.
Use incremental updates instead of full re-ingestion to optimize system efficiency.
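
One common way to implement incremental updates is to compare content hashes between runs and touch only what changed. The sketch below shows that approach under stated assumptions: documents are available as `{doc_id: text}`, and the hashes from the previous run were stored as `{doc_id: hash}`. The function names are hypothetical, and Search AI's own incremental sync may work differently.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_update(previous_hashes, current_docs):
    """Work out which documents need re-ingestion and which should be removed.

    previous_hashes: {doc_id: hash} saved from the last run
    current_docs:    {doc_id: text} as fetched now
    Returns (to_upsert, to_delete, new_hashes).
    """
    new_hashes = {doc_id: content_hash(text) for doc_id, text in current_docs.items()}
    # New or changed documents: hash is missing or differs from the stored one.
    to_upsert = sorted(d for d, h in new_hashes.items() if previous_hashes.get(d) != h)
    # Documents that no longer exist at the source.
    to_delete = sorted(d for d in previous_hashes if d not in new_hashes)
    return to_upsert, to_delete, new_hashes
```

Unchanged documents are skipped entirely, so the processing cost of each run scales with the amount of change rather than the size of the source.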
