Skip to content

Best Practices for Data Ingestion

The most crucial step in building a RAG solution is to ensure the correct data is ingested and available in the system. Search AI supports various data sources, including:

  • Files
  • Web pages
  • Third-party applications via connectors

Supported content types and formats vary across these sources.

Files

Search AI supports document ingestion in the following formats:

PDF, DOCX, PPTX, XLSX and TXT

Refer to the guidelines below for the optimum ingestion of content from documents.

Document Quality & Structure

For optimal processing and search accuracy, documents should:

  • Be system-generated (digitally created) rather than scanned or handwritten.
  • Maintain a consistent layout across all pages.
  • Be unencrypted and free of password protection to enable processing.
  • Use shorter, well-structured content to improve search precision.
  • Avoid unnecessary elements like headers, footers, and metadata that do not contribute to content relevance.

Formatting & Layout Considerations

  • Single-column documents provide the highest retrieval accuracy.
  • Multi-column layouts may reduce search precision and require fine-tuning.
  • Documents should have clear section headers and logical content organization for improved readability and search effectiveness.

Content & File Restrictions

To prevent data loss and ensure optimal retrieval:

  • Avoid Compressed PDFs, which can cause data distortion.
  • Avoid Multi-page tables, as they are challenging to process accurately.
  • Avoid Inconsistent formatting, such as switching between single-column and multi-column layouts within the same document.
  • Avoid Scanned or heavily formatted files, which may not be processed effectively.

Media & Table Handling

Search AI supports image and table extraction using a layout-aware model, but only text is extracted by default. To improve retrieval accuracy:

  • Provide text descriptions for any key information in images.
  • Include contextual summaries before or after tables and images.
  • Label images and tables with meaningful titles for easier indexing and searchability.
  • Update extraction strategies when working with documents containing significant visual elements.

Handling Spreadsheets

Search AI enables extraction from xlsx files via Markdown Extraction strategy. Refer to this for more details and best practices on the structure and layout of the spreadsheets for effective extraction.

Webpages/HTML Content

Optimizing Web Page Structure

  • Follow schema.org rules to standardize metadata and improve content extraction.
  • If schema.org is not used, ensure consistent logic is applied for structuring headings, subheadings, and content. This enables using the document workbench for custom processing of content.
  • Websites using standard HTML tags (like, <h1>, <h2>, <p>, <img>, <table>) provide the best results.

Handling Non-Standard Web Content

  • Custom CSS-based structures may cause processing issues and require fine-tuning.
  • Override default processing using Document Workbench to define custom extraction rules for non-standard layouts.

Connector Integrations

Relevance & Filtering

  • Ingest only relevant data to align with the specific use case.
  • Utilize advanced filters to ingest valuable content for search and retrieval selectively.
  • Avoid pulling in entire datasets from a connector, as unnecessary content can introduce noise and require additional fine-tuning.

Performance Considerations

  • Limit ingestion frequency to avoid excessive data processing and system overload.
  • Monitor ingestion logs and adjust configurations based on search performance and user feedback.
  • Use incremental updates instead of full re-ingestion to optimize system efficiency.