Version: 2.0

Indexing Documents

Index new documents and manage existing ones while keeping data sprawl and metadata inconsistencies in check. This guide is useful when you are building scalable search solutions or automating content governance. It shows how to:

Create structured or core documents with custom metadata
Index documents from text content or upload files
List and filter documents in a corpus
Retrieve, update, and delete documents by ID
Summarize content using LLM-powered tools

Prerequisites

This guide assumes you have a corpus called my-docs. If you haven't created a corpus yet, follow the Quick Start guide to set up your first corpus.

Create a structured document

Create and index a structured document into your corpus to make it searchable. Structured documents are organized into sections, each with optional titles and metadata, making them ideal for contracts, reports, or other organized content.

The documents.create method corresponds to the HTTP POST /v2/corpora/{corpus_key}/documents endpoint.

Key Parameters:

id (string, required): Unique identifier for the document within the corpus
type (string, required): Must be "structured" for section-based documents
sections (array, required): List of document sections with text content
metadata (object, optional): Document-level metadata for filtering

Section Parameters:

title (string, optional): Section heading or title
text (string, required): The actual content text for this section
metadata (object, optional): Section-level metadata for fine-grained filtering

Use structured documents for organized content like employee handbooks, policies, or technical manuals where clear section organization improves searchability.
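
The parameter lists above can be turned into a request body as in the following minimal sketch. It does not call the SDK; it only builds and validates the JSON body described by the Key Parameters and Section Parameters lists. The helper name build_structured_document and the sample values are illustrative assumptions, not part of the SDK.

```python
import json

def build_structured_document(doc_id, sections, metadata=None):
    """Return a JSON-serializable body for a structured document."""
    if not doc_id:
        raise ValueError("id is required")
    if not sections:
        raise ValueError("at least one section is required")
    body = {"id": doc_id, "type": "structured", "sections": []}
    for section in sections:
        if not section.get("text"):
            raise ValueError("each section needs non-empty text")
        item = {"text": section["text"]}
        # title and metadata are optional, so include them only when given
        if section.get("title"):
            item["title"] = section["title"]
        if section.get("metadata"):
            item["metadata"] = section["metadata"]
        body["sections"].append(item)
    if metadata:
        body["metadata"] = metadata
    return body

# Sample values below are made up for illustration.
handbook = build_structured_document(
    "employee-handbook-2024",
    sections=[
        {"title": "Vacation policy",
         "text": "Employees accrue 1.5 days per month.",
         "metadata": {"topic": "benefits"}},
        {"text": "Contact HR with any questions."},
    ],
    metadata={"department": "hr", "year": 2024},
)
print(json.dumps(handbook, indent=2))
```

Validating the required fields locally catches the most common causes of a 400 response before any request is sent.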

Error Handling:

400 Bad Request: Invalid document structure or parameters
403 Forbidden: Insufficient permissions - ensure API key has indexing rights
404 Not Found: Corpus doesn't exist
409 Conflict: Document with the same ID already exists
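
One way to surface these statuses to users is a small lookup table, sketched below. How the status code reaches your code (an exception attribute, a response object, and so on) depends on the client you use, so the sketch takes the integer status as input. The names INDEXING_ERROR_HINTS and explain_indexing_error are hypothetical.

```python
# Hints mirror the error list above; the wording of each hint is illustrative.
INDEXING_ERROR_HINTS = {
    400: "Invalid document structure or parameters: check id, type, and sections.",
    403: "Insufficient permissions: confirm the API key has indexing rights.",
    404: "Corpus not found: check the corpus key or create the corpus first.",
    409: "A document with this ID already exists: delete it first or use a new ID.",
}

def explain_indexing_error(status_code):
    """Map an HTTP status code from the indexing endpoint to a readable hint."""
    return INDEXING_ERROR_HINTS.get(
        status_code, f"Unexpected HTTP status {status_code}."
    )

print(explain_indexing_error(409))
```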

Create a core document

Create and index a core document using document parts. Core documents are more flexible than structured documents and work well for unstructured content like support articles, FAQs, or knowledge base entries.

Key Differences from Structured Documents:

Uses document_parts instead of sections
Parts don't have titles, only text content and optional metadata
Better suited for unstructured or semi-structured content

Use Core Documents When:

Content doesn't have clear section structure
You want maximum flexibility in document organization
Working with imported content from various sources
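
The differences listed above show up directly in the request body. The sketch below builds a core document body with document_parts and no titles. It does not call the SDK. The type value "core" is an assumption made by analogy with "structured", and the helper name build_core_document and the sample values are illustrative.

```python
def build_core_document(doc_id, parts, metadata=None):
    """Return a JSON-serializable body for a core document."""
    if not doc_id:
        raise ValueError("id is required")
    if not parts:
        raise ValueError("at least one document part is required")
    # Assumption: the type discriminator for core documents is "core".
    body = {"id": doc_id, "type": "core", "document_parts": []}
    for part in parts:
        if not part.get("text"):
            raise ValueError("each part needs non-empty text")
        item = {"text": part["text"]}  # parts have no title field
        if part.get("metadata"):
            item["metadata"] = part["metadata"]
        body["document_parts"].append(item)
    if metadata:
        body["metadata"] = metadata
    return body

# Sample values below are made up for illustration.
faq = build_core_document(
    "faq-password-reset",
    parts=[
        {"text": "To reset your password, open Settings and choose Security.",
         "metadata": {"source": "help-center"}},
        {"text": "Reset links expire after 24 hours."},
    ],
    metadata={"category": "faq"},
)
print(faq["document_parts"][0]["text"])
```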

Direct updates to document content are not supported. To replace a document, delete it with client.documents.delete() and then index it again. Re-indexing with the same ID and different content returns a 409 error.
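
The delete-then-re-index sequence can be wrapped in one helper, as in this sketch. To keep it runnable without network access, it uses an in-memory stand-in instead of the real client. InMemoryDocuments, DocumentExists, DocumentNotFound, and replace_document are all hypothetical names, and the exceptions raised by the real SDK may differ.

```python
class DocumentExists(Exception):
    """Hypothetical stand-in for a 409 Conflict error."""

class DocumentNotFound(Exception):
    """Hypothetical stand-in for a 404 Not Found error."""

class InMemoryDocuments:
    """Stand-in for a document client, used only to make the sketch runnable."""

    def __init__(self):
        self._docs = {}

    def create(self, corpus_key, document):
        key = (corpus_key, document["id"])
        if key in self._docs:
            raise DocumentExists(document["id"])
        self._docs[key] = document

    def delete(self, corpus_key, document_id):
        if (corpus_key, document_id) not in self._docs:
            raise DocumentNotFound(document_id)
        del self._docs[(corpus_key, document_id)]

    def get(self, corpus_key, document_id):
        return self._docs[(corpus_key, document_id)]

def replace_document(documents, corpus_key, document):
    """Delete any existing document with the same ID, then index the new one."""
    try:
        documents.delete(corpus_key, document["id"])
    except DocumentNotFound:
        pass  # nothing to delete, so this is a first-time index
    documents.create(corpus_key, document)

documents = InMemoryDocuments()
v1 = {"id": "faq-1", "type": "core", "document_parts": [{"text": "Old answer."}]}
v2 = {"id": "faq-1", "type": "core", "document_parts": [{"text": "New answer."}]}
documents.create("my-docs", v1)
replace_document(documents, "my-docs", v2)
print(documents.get("my-docs", "faq-1")["document_parts"][0]["text"])
```

Because the two steps are separate requests, the document is absent from the corpus for a short time between the delete and the create.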

Error Handling:

400 Bad Request: Invalid document structure or parameters
403 Forbidden: Insufficient permissions - ensure API key has indexing rights
404 Not Found: Corpus doesn't exist
409 Conflict: Document with the same ID already exists with different content
413 Payload Too Large: Document exceeds size limit

List documents in a corpus

List the documents in a corpus, optionally filtered by metadata, and page through the results.

The documents.list method corresponds to the HTTP GET /v2/corpora/{corpus_key}/documents endpoint. For more details on request and response parameters, see the List Documents REST API.

Parameters:

corpus_key (string, required): Unique identifier for the corpus
limit (int, optional): Maximum number of documents to return per page (default: 10)
metadata_filter (string, optional): Filter expression for document metadata
page_key (string, optional): Token to fetch the next page of results

Returns: Iterator of Document objects (containing id and metadata, but not full content).

Use metadata filters to find specific document types or categories. The method returns paginated results for efficient handling of large document collections.
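
The SDK method already returns an iterator, so paging is normally handled for you. If you call the endpoint directly, the page_key loop looks roughly like the sketch below. The fetch_page callable and the response shape (a dict with "documents" and "page_key") are assumptions for illustration, and the fake backend ignores metadata_filter.

```python
def iter_documents(fetch_page, limit=10, metadata_filter=None):
    """Yield documents across pages, following page_key until none is returned."""
    page_key = None
    while True:
        page = fetch_page(
            limit=limit, metadata_filter=metadata_filter, page_key=page_key
        )
        yield from page["documents"]
        page_key = page.get("page_key")
        if not page_key:
            break

# Fake backend with 25 documents, used only to make the sketch runnable.
ALL_DOCS = [{"id": f"doc-{i}", "metadata": {}} for i in range(25)]

def fake_fetch_page(limit, metadata_filter, page_key):
    start = int(page_key) if page_key else 0
    next_start = start + limit
    return {
        "documents": ALL_DOCS[start:next_start],
        "page_key": str(next_start) if next_start < len(ALL_DOCS) else None,
    }

ids = [doc["id"] for doc in iter_documents(fake_fetch_page, limit=10)]
print(len(ids))
```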

Get a document by ID

Retrieve a single document by its unique ID, for example to inspect it or display its content.

The documents.get method corresponds to the HTTP GET /v2/corpora/{corpus_key}/documents/{document_id} endpoint.

Parameters:

corpus_key (string, required): Unique identifier of the corpus
document_id (string, required): Unique identifier of the document

Returns: Document object with full text content and metadata.

Use this method when you need to retrieve the complete document content, not just the metadata returned by the list operation.
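
When you call the endpoint directly rather than through the SDK, document IDs that contain slashes or spaces need percent-encoding before they are placed in the URL path. The sketch below builds the path shown above using only the standard library. The base URL is an assumption, and document_url is a hypothetical helper.

```python
from urllib.parse import quote

BASE_URL = "https://api.vectara.io"  # assumption: adjust to your deployment

def document_url(corpus_key, document_id, base_url=BASE_URL):
    """Build the URL for a single document, percent-encoding both path segments."""
    return (
        f"{base_url}/v2/corpora/{quote(corpus_key, safe='')}"
        f"/documents/{quote(document_id, safe='')}"
    )

# The ID below is made up to show how a slash and a space are encoded.
url = document_url("my-docs", "reports/2024 Q1.pdf")
print(url)
```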

Update document metadata

Update a document's metadata fields, for example to tag or categorize it or to track its status.

The documents.update method corresponds to the HTTP PATCH /v2/corpora/{corpus_key}/documents/{document_id} endpoint.

Parameters:

corpus_key (string, required): Unique identifier of the corpus
document_id (string, required): Unique identifier of the document
metadata (object, required): New metadata to merge with existing metadata

The update operation merges the provided metadata with existing metadata, allowing you to add new fields or modify existing ones without losing other data.
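
The merge behavior can be pictured with plain dictionaries, as in this sketch. It models a shallow merge in which new keys are added and matching keys are overwritten. How nested objects are merged is not stated here, so treat the shallow behavior as an assumption. The function merge_metadata is illustrative and runs locally; it is not the service's implementation.

```python
def merge_metadata(existing, updates):
    """Return a new dict with updates applied on top of existing metadata."""
    merged = dict(existing)   # copy so the original is left untouched
    merged.update(updates)    # add new keys, overwrite matching ones
    return merged

existing = {"status": "draft", "department": "hr"}
updates = {"status": "approved", "reviewed_by": "legal"}
merged = merge_metadata(existing, updates)
print(merged)
```

Here department survives the update because it is not mentioned in updates, while status is overwritten.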

Delete a document

Permanently remove a document from a corpus, for example as part of data cleanup or lifecycle management.

The documents.delete method corresponds to the HTTP DELETE /v2/corpora/{corpus_key}/documents/{document_id} endpoint.

Parameters:

corpus_key (string, required): Unique identifier of the corpus
document_id (string, required): Unique identifier of the document to delete

Caution: Deletion is permanent and cannot be undone. Ensure you have backups if the document might be needed later.
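
One way to follow that advice is to save a copy of the document before deleting it. The sketch below takes the get and delete operations as plain callables so that it runs without the SDK. In real use you would pass thin wrappers around documents.get and documents.delete. The helper name backup_then_delete and the file naming scheme are assumptions.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import quote

def backup_then_delete(get_document, delete_document, corpus_key, document_id, backup_dir):
    """Write the document to a JSON file, then delete it. Returns the backup path."""
    document = get_document(corpus_key, document_id)
    # Percent-encode the ID so slashes cannot escape the backup directory.
    file_name = f"{quote(corpus_key, safe='')}__{quote(document_id, safe='')}.json"
    backup_path = Path(backup_dir) / file_name
    backup_path.write_text(json.dumps(document, indent=2), encoding="utf-8")
    # Delete only after the backup has been written successfully.
    delete_document(corpus_key, document_id)
    return backup_path

# In-memory stand-in for a corpus, used only to make the sketch runnable.
store = {("my-docs", "old-policy"): {"id": "old-policy", "metadata": {"status": "retired"}}}
backup_dir = tempfile.mkdtemp()
path = backup_then_delete(
    get_document=lambda corpus, doc: store[(corpus, doc)],
    delete_document=lambda corpus, doc: store.pop((corpus, doc)),
    corpus_key="my-docs",
    document_id="old-policy",
    backup_dir=backup_dir,
)
print(path.name)
```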

Summarize a document

Generate LLM-powered summaries for specific documents in your corpus. Use this for content previews, search snippets, or generative UI applications.

The documents.summarize method corresponds to the HTTP POST /v2/corpora/{corpus_key}/documents/{document_id}/summarize endpoint.

Parameters:

corpus_key (string, required): Unique identifier of the corpus
document_id (string, required): Unique identifier of the document
llm_name (string, optional): LLM model to use for summarization
prompt_template (string, optional): Custom prompt with $document_content placeholder

Returns: Summary response object with the generated summary text.

Use custom prompt templates to tailor summaries for specific use cases like customer support, technical documentation, or content previews.
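
A custom prompt can be checked and previewed locally before it is sent, as sketched below. The request field names follow the parameter list above. The placeholder check and the string.Template preview are local conveniences only: the service performs the real substitution, and the exact template syntax it accepts is not covered here. The helper build_summarize_request is hypothetical.

```python
from string import Template

def build_summarize_request(prompt_template=None, llm_name=None):
    """Return the optional request fields for a summarize call."""
    body = {}
    if llm_name:
        body["llm_name"] = llm_name
    if prompt_template is not None:
        # Local guard: a template without the placeholder would ignore the document.
        if "$document_content" not in prompt_template:
            raise ValueError("prompt_template must contain $document_content")
        body["prompt_template"] = prompt_template
    return body

support_prompt = (
    "Summarize the following document in two sentences for a support agent:\n"
    "$document_content"
)
request_body = build_summarize_request(prompt_template=support_prompt)

# Local preview only: shows what the prompt looks like once content is filled in.
preview = Template(support_prompt).safe_substitute(
    document_content="Refunds are processed within 5 business days."
)
print(preview)
```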

Workflow: Create corpus and index document

This example demonstrates the fundamental two-step workflow for establishing a new knowledge base in Vectara.

Corpus creation: The first step creates a new corpus with a unique identifier (key) and human-readable name. The corpus acts as a namespace for your documents and defines important characteristics like metadata schemas, filter attributes, and access controls. The example includes error handling for the common case where the corpus already exists.
Document ingestion: The second step uploads and indexes a structured document into the corpus. The document is parsed into searchable sections, with each section containing both text content and optional metadata. Vectara processes the content automatically, making it immediately queryable through the search API.
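
The two steps, including the "already exists" case, can be sketched as follows. To stay runnable offline, the sketch uses an in-memory stand-in rather than the real client. FakeClient, CorpusAlreadyExists, and create_corpus_and_index are hypothetical names. With the SDK, the equivalent calls would be client.corpora.create() and documents.create, and the exception type to catch may differ.

```python
class CorpusAlreadyExists(Exception):
    """Hypothetical stand-in for the error raised when a corpus key is taken."""

class FakeClient:
    """In-memory stand-in for the API client, used only for illustration."""

    def __init__(self):
        self.corpora = {}
        self.documents = {}

    def create_corpus(self, key, name):
        if key in self.corpora:
            raise CorpusAlreadyExists(key)
        self.corpora[key] = name
        self.documents[key] = {}

    def create_document(self, corpus_key, document):
        self.documents[corpus_key][document["id"]] = document

def create_corpus_and_index(client, key, name, document):
    """Step 1: create the corpus (or reuse it). Step 2: index the document."""
    try:
        client.create_corpus(key, name)
        status = "created"
    except CorpusAlreadyExists:
        status = "already exists"  # safe to continue with the existing corpus
    client.create_document(key, document)
    return status

client = FakeClient()
doc_a = {"id": "policy-1", "type": "structured",
         "sections": [{"title": "Scope", "text": "Applies to all staff."}]}
doc_b = {"id": "policy-2", "type": "structured",
         "sections": [{"text": "Applies to contractors."}]}
first = create_corpus_and_index(client, "my-docs", "My Docs", doc_a)
second = create_corpus_and_index(client, "my-docs", "My Docs", doc_b)
print(first, second)
```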

Best Practices

Descriptive naming: Use meaningful corpus keys and names that clearly identify the content domain and purpose.
Consistent metadata: Establish a uniform metadata schema across all documents within a corpus to enable effective filtering.
Robust error handling: Implement comprehensive logic that handles both creation failures and "already exists" scenarios gracefully.
Verification steps: Confirm corpus creation success before attempting document indexing to avoid orphaned content.
Resource management: Consider using unique corpus keys for testing to avoid conflicts with existing resources.
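
For the resource management point, a short random suffix is one simple way to get unique corpus keys in tests. The sketch below uses only the standard library. The prefix and key format are assumptions, so check which characters and lengths your account allows in corpus keys.

```python
import uuid

def make_test_corpus_key(prefix="my-docs-test"):
    """Return a corpus key with a random 8-character hex suffix."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"

key_one = make_test_corpus_key()
key_two = make_test_corpus_key()
print(key_one)
```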

Next steps

After understanding document management and indexing, you can:

Query documents: Use client.query() to search across document content with the Query guide.
Upload files: Use client.upload.file() to index PDFs, DOCX, and other file formats with the Upload Files guide.
Manage corpora: Create and configure corpora with client.corpora.create() using the Corpora guide.
Batch operations: Process multiple documents efficiently for large-scale content management.
Advanced filtering: Leverage metadata for sophisticated document organization.
