Unified Schema for Connector Content
Search AI utilizes a Unified Schema to standardize data ingestion from diverse content sources, including enterprise applications, files, and webpages. This schema defines a consistent structure that allows data from different formats and systems to be interpreted and utilized uniformly for search operations.
When content is ingested via connectors, data from various fields across different applications is automatically mapped to the most relevant fields in the unified schema. This ensures that Search AI maintains a consistent representation of content, regardless of the source.
The Unified Schema has a predefined set of fields, also referred to as Document Fields, that store the content and metadata of ingested items. Users can override the default mappings using the Field Mapping option in the connector configuration. The schema can also be extended with custom fields.
The following are the default fields of the Unified Schema.
Note
Some of the fields in the list are system fields and can't be updated.
Document Fields | Description | Is System Field |
access_level | Defines the visibility or permission level associated with the document. | No |
archived_at | Timestamp indicating when the document or record was archived. | No |
assignee | Identifier of the user or entity responsible for the document, task, or record. | No |
assignee_email | Email address of the user assigned to the document. | No |
assignee_name | Display name of the assignee. | No |
blockedAcl | A list of users or groups explicitly restricted from accessing the document. | No |
branch | Represents the branch, version, or division of content, particularly in systems that support branching (for example, code repositories and knowledge bases). | No |
category | Classification label used to group similar documents or content types. | No |
channel_id | Unique identifier for the communication channel from where the document originates. | No |
checksum | A unique hash value generated for the document content. | No |
chunkType | Type of chunk. | Yes |
closedOn | Timestamp indicating when the item (for example, issue, task, or conversation) was closed. | No |
comment_count | Total number of comments associated with the item. | No |
comments | List or collection of user comments related to the item. | No |
commit_id | Unique identifier of the commit associated with the item. | No |
company_id | Unique identifier for the company or organization. | No |
company_name | Name of the company associated with the record. | No |
contact_id | Unique identifier for the contact person. | No |
contact_name | Name of the contact person. | No |
content | Main textual or structured content of the record (for example, body of a document, note, or comment). | No |
contentId | Unique identifier of the content entity. | No |
conversation_id | Unique identifier of the conversation or thread. | No |
createdBy | User ID or name of the person who created the item. | No |
createdOn | Timestamp when the item was created. | No |
deleted_at | Timestamp when the item was deleted (if soft-deleted). | No |
doc_created_by | Identifier or name of the user who created the document. | No |
doc_created_by_email | Email address of the document creator. | No |
doc_created_by_id | Unique ID of the document creator. | No |
doc_created_by_name | Full name of the document creator. | No |
doc_created_on | Timestamp when the document was created. | No |
doc_id | Unique identifier of the document. | No |
doc_path | File path or storage path of the document. | No |
doc_source_type | Type of source from which the document was ingested. | No |
doc_updated_by | Identifier or name of the user who last updated the document. | No |
doc_updated_by_email | Email address of the user who updated the document. | No |
doc_updated_by_id | Unique ID of the user who last updated the document. | No |
doc_updated_on | Timestamp when the document was last updated. | No |
downvote_count | Number of downvotes received by the item (for example, post, comment, or answer). | No |
due_date | The due date or deadline associated with the task or item. | No |
extractionMethod | Method used to extract data from the source. | Yes |
extractionStrategy | Strategy or approach followed for data extraction. | Yes |
file_content | Actual text or encoded content of the file. | No |
file_image_url | URL to the preview image of the file. | No |
file_preview | Short summary or visual preview of the file content. | No |
file_title | Title or display name of the file. | No |
file_url | Direct URL link to access or download the file. | No |
html | Raw HTML version of the document or page content. | No |
issueType | Type or category of issue. | No |
keywords | List of keywords or tags extracted or assigned to the content. | No |
labels | Labels or classifications applied to the item. | No |
language | Language in which the content is written. | No |
lastSyncAt | Timestamp of the most recent synchronization with the source system. | No |
location | Physical or virtual location associated with the record. | No |
mentioned_users | List of users mentioned or tagged within the content. | No |
message_type | Type of message. | No |
mime_type | MIME type of the file or document. | No |
object_created_by_email | Email address of the user who created the object. | No |
object_created_by_id | ID of the user who created the object. | No |
object_created_by_name | Name of the user who created the object. | No |
object_created_on | Timestamp when the object was created. | No |
object_type | Type of object. | No |
organization_id | Unique identifier for the organization. | No |
organization_name | Name of the organization associated with the record. | No |
owner_email | Email address of the item owner or assignee. | No |
owner_id | Unique ID of the item owner or assignee. | No |
owner_name | Full name of the item owner or assignee. | No |
page_body | Text content or body of an HTML page. | No |
page_count | Number of pages in the document from which the content is ingested. | No |
page_html | Page content in HTML format. | No |
page_image_url | URL for the page image or thumbnail. | No |
page_preview | Short preview of the page content. | No |
page_title | Title of the page. | No |
page_number | Page number of the content. | No |
page_url | URL of the page or web resource. | No |
parent_url | URL of the parent document or source from which this page is derived. | No |
parent_name | Name of the parent entity. | No |
priority | Priority level of the item. | No |
project_description | Description or summary of the project. | No |
project_id | Unique identifier for the project. | No |
project_name | Name of the project. | No |
project_owner_email | Email address of the project owner. | No |
project_owner_id | ID of the project owner. | No |
project_owner_name | Name of the project owner. | No |
project_status | Current status of the project. | No |
projectName | Name of the project. | No |
published_at | Timestamp when the item or content was published. | No |
reporter | Identifier or name of the person who reported the issue. | No |
reporter_email | Email address of the reporter. | No |
reporter_name | Full name of the reporter. | No |
repository_id | Unique ID of the code or content repository. | No |
repository_name | Name of the repository. | No |
resource_type | Type of resource. | No |
share_count | Number of times the item has been shared. | No |
size | File size or data volume. | No |
sprint | Sprint or iteration to which the item belongs. | No |
status | Current status of the item. | No |
sys_file_type | System-defined file type classification. | Yes |
sys_racl | Role-based Access Control List defining permissions for the resource. | No |
sourceType | Type of content source: web crawl, file upload, or connector. | No |
sys_source_name | Name of the system or connector from which the item originated. | Yes |
tags | Tags associated with the record for categorization or search. | No |
thread_id | Unique identifier of the thread or discussion chain. | No |
title | Title or name of the item. | No |
updatedBy | Identifier or name of the user who last updated the record. | No |
updatedOn | Timestamp when the record was last updated. | No |
url | Link to access the resource or item. | No |
upvote_count | Number of upvotes received by the item. | No |
view_count | Number of times the item has been viewed. | No |
visibility | Access level of the item. | No |
workspace_id | Unique identifier for the workspace or environment. | No |
workspace_name | Name of the workspace associated with the item. | No |
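The system-field restriction in the table can be sketched in Python. This is only an illustration of the rule, not part of Search AI: the set below lists the fields marked "Yes" in the table, and `split_updates` is a hypothetical helper name.

```python
# Fields marked "Yes" in the "Is System Field" column of the table above.
SYSTEM_FIELDS = frozenset({
    "chunkType",
    "extractionMethod",
    "extractionStrategy",
    "sys_file_type",
    "sys_source_name",
})


def split_updates(updates: dict) -> tuple:
    """Separate permitted field updates from attempts to change system fields.

    Returns (allowed_updates, rejected_field_names).
    """
    allowed = {k: v for k, v in updates.items() if k not in SYSTEM_FIELDS}
    rejected = sorted(k for k in updates if k in SYSTEM_FIELDS)
    return allowed, rejected


allowed, rejected = split_updates({
    "title": "Quarterly report",
    "chunkType": "text",        # system field: cannot be updated
    "sys_source_name": "demo",  # system field: cannot be updated
})
print(allowed)
print(rejected)
```

A check like this could run before a mapping is saved, so that a script never silently targets a field it is not allowed to change.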
Custom Fields in Schema
Search AI allows the extension of the Unified Schema by adding up to 50 custom fields, enabling users to include additional data from third-party applications as searchable content. This flexibility ensures that unique business requirements and specialized metadata can be accommodated seamlessly.
Custom fields can also be used in the workbench, where users can map any value to them. For example, users can send the ingested content to an LLM, ask it to summarize the content, and then store the summary in a custom field.
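The summarization example can be sketched in Python as follows. Everything here is an assumption made for illustration: `summarize` is a placeholder that truncates text instead of calling an LLM, `add_summary` is a hypothetical helper, and `cfs1` stands for whichever string-type custom field was created for the summary.

```python
def summarize(text: str, max_words: int = 12) -> str:
    """Placeholder for an LLM call: keeps only the first max_words words."""
    words = text.split()
    summary = " ".join(words[:max_words])
    return summary + ("..." if len(words) > max_words else "")


def add_summary(document: dict, custom_field: str = "cfs1") -> dict:
    """Return a copy of the document with a summary stored in a custom field."""
    enriched = dict(document)  # leave the original document untouched
    enriched[custom_field] = summarize(document.get("content", ""))
    return enriched


doc = {"title": "Release notes", "content": "Fixed login errors and improved sync speed."}
print(add_summary(doc)["cfs1"])
```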
Adding a New Field
To add a new custom field:
- Click the Manage Schema button on the Manage Content page in the connector.
- Click the +New Field button.
- Enter the following fields:
  - Display Name - The user-friendly name for the field (shown in the UI only).
  - Data Type - The type of value the field holds. This can be a string or an array.
  - Field Name - The technical name of the field. This name is used to reference the field in document workbench scripts and in the post-processor script for field mapping in connectors. For array-type fields, use cfa1 to cfa5; for string-type fields, use cfs1 to cfs45.
  - Description - A brief description of the intended use of the field.
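The Field Name rule lends itself to a quick check. The Python sketch below assumes the string range runs from cfs1 through cfs45 (which, with cfa1 to cfa5, adds up to the 50-field limit); `is_valid_custom_field_name` is a hypothetical helper, not a Search AI function.

```python
import re

# Naming ranges from the step above: cfa1..cfa5 for arrays, cfs1..cfs45 for strings.
_LIMITS = {"array": ("cfa", 5), "string": ("cfs", 45)}


def is_valid_custom_field_name(field_name: str, data_type: str) -> bool:
    """Return True if field_name falls in the naming range for data_type."""
    if data_type not in _LIMITS:
        return False
    prefix, max_index = _LIMITS[data_type]
    # Require the prefix followed by a positive integer without leading zeros.
    match = re.fullmatch(rf"{prefix}([1-9]\d*)", field_name)
    return bool(match) and int(match.group(1)) <= max_index


print(is_valid_custom_field_name("cfs45", "string"))
print(is_valid_custom_field_name("cfa6", "array"))
```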
Field Mapping
By default, the fields ingested from a connector are automatically mapped to the most appropriate fields in the unified schema. You can customize this mapping to meet specific business requirements.
For example, assume an organization uses a Google Drive connector to ingest documents into Search AI. By default, the Google Drive field createdTime is mapped to the unified schema field createdOn. Suppose the organization also wants to display the last modifying user in search results. To achieve this, the field mapping can be updated to include the Google Drive field lastModifyingUser.displayName, mapping it to the unified schema field updatedBy.
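The Google Drive example can be sketched in Python to show how a default mapping and a custom mapping combine. This models the idea only and is not the connector's script syntax: the helper names are hypothetical, and the sample record values are invented for illustration.

```python
# Default mapping named in the text: Google Drive `createdTime` -> `createdOn`.
DEFAULT_MAPPING = {"createdOn": "createdTime"}
# Custom mapping from the example: `lastModifyingUser.displayName` -> `updatedBy`.
CUSTOM_MAPPING = {"updatedBy": "lastModifyingUser.displayName"}


def get_path(record: dict, dotted_path: str):
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    node = record
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def apply_mapping(record: dict, mapping: dict) -> dict:
    """Build a unified-schema document from {target_field: source_path} pairs."""
    return {target: get_path(record, source) for target, source in mapping.items()}


# Hypothetical Google Drive record; the values are invented for illustration.
drive_file = {
    "createdTime": "2025-01-15T09:30:00Z",
    "lastModifyingUser": {"displayName": "Jane Doe"},
}
# Entries in the custom mapping take precedence when both define the same target.
document = apply_mapping(drive_file, {**DEFAULT_MAPPING, **CUSTOM_MAPPING})
print(document)
```

Keeping the mapping as data rather than hard-coded assignments makes it easy to see which unified schema field each source field feeds.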
Implementing Field Mapping
After an initial sync with a connector, you can view the payload of the response and use it to map the fields as required with the post-processor script.
- Go to the Field Mapping tab under Manage Content.
- The source payload shows the actual response from the connector. The mandatory fields required by Search AI are listed in the right pane.
- Use the source payload and post-processor scripts to map fields from the source applications to the fields of the unified schema. A default script is presented for each connector, which shows how the fields are mapped for the connector by default.
For instance, if the source payload is as follows and you need to map the createdAt field to the doc_created_on field in the unified schema, add a corresponding mapping line to the script.
Source Payload
{
  "incidents": {
    "title": "I : System Outages duplicates ----",
    "content": "System Outages , Impact Start Date : 2025-04-04T12:14:32.419Z, Impact End Date : Mon May 12 2025 10:12:25 GMT+0000 (Coordinated Universal Time), Responders : User : John Doe , Actions : ",
    "type": "incident",
    "id": "79d68c5a-762f-4c0a-b412-49a6d75b92b0",
    "tinyId": "5",
    "status": "open",
    "labels": [
      "System Outages"
    ],
    "createdAt": "2025-04-04T12:14:32.419Z",
    "updatedAt": "2025-04-04T12:14:49.526Z",
    "priority": "P3",
    "responders": "User: John Doe, ",
    "actions": [],
    "impactStartDate": "2025-04-04T12:14:32.419Z",
    "impactEndDate": "2025-05-12T10:12:25.985Z"
  }
}
Script Updates
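As a rough illustration of the createdAt mapping, the Python sketch below reads createdAt from the incidents object of the source payload and writes it to doc_created_on. It models the logic only: the actual post-processor script syntax depends on the connector and is not shown here, and `map_incident` is a hypothetical helper name.

```python
def map_incident(source_payload: dict) -> dict:
    """Map the incident's createdAt value to the unified field doc_created_on."""
    incident = source_payload["incidents"]
    return {"doc_created_on": incident["createdAt"]}


# Trimmed copy of the source payload shown above.
payload = {
    "incidents": {
        "title": "I : System Outages duplicates ----",
        "createdAt": "2025-04-04T12:14:32.419Z",
        "updatedAt": "2025-04-04T12:14:49.526Z",
    }
}
print(map_incident(payload))
```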
If a connector supports multiple objects, the source payload displays a concatenated set of fields for all those objects. When mapping fields from two or more supported objects to custom fields, create a separate custom field for each object. Even though the records for these objects are distinct, the field mapping section is currently set up to configure them together.
For instance, if a connector supports incidents and alerts, and the titles of these are to be assigned to custom fields, use separate custom fields.
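The incidents and alerts case might look like the following Python sketch. The assignment of cfs1 to incident titles and cfs2 to alert titles is an arbitrary choice for illustration, the shape of the alerts object is assumed to mirror incidents, and `map_titles` is a hypothetical helper.

```python
# Hypothetical assignment: one custom field per object type.
OBJECT_TITLE_FIELDS = {"incidents": "cfs1", "alerts": "cfs2"}


def map_titles(source_payload: dict) -> dict:
    """Copy each object's title into the custom field reserved for that object."""
    mapped = {}
    for object_name, custom_field in OBJECT_TITLE_FIELDS.items():
        obj = source_payload.get(object_name)
        # A record carries only one object type, so missing objects are skipped.
        if isinstance(obj, dict) and "title" in obj:
            mapped[custom_field] = obj["title"]
    return mapped


combined_payload = {
    "incidents": {"title": "I : System Outages duplicates ----"},
    "alerts": {"title": "Example alert title"},  # invented sample value
}
print(map_titles(combined_payload))
```

Because each object writes to its own field, an incident record never overwrites or borrows a value intended for an alert record.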