Indexing Documents
Create, query, and maintain the documents in a corpus while keeping data sprawl and metadata inconsistencies in check. This guide covers both indexing new documents and managing existing ones, which makes it a good starting point for building scalable search solutions or automating content governance.
- Create structured or core documents with custom metadata
- Index documents from text content or upload files
- List and filter documents in a corpus
- Retrieve, update, and delete documents by ID
- Summarize content using LLM-powered tools
This guide assumes you have a corpus called my-docs. If you haven't created a corpus yet, follow the Quick Start guide to set up your first corpus.
Create a structured document
Create and index a structured document into your corpus to make it searchable. Structured documents are organized into sections, each with optional titles and metadata, making them ideal for contracts, reports, or other organized content.
The documents.create method corresponds to the HTTP POST /v2/corpora/{corpus_key}/documents endpoint.
Key Parameters:
- id (string, required): Unique identifier for the document within the corpus
- type (string, required): Must be "structured" for section-based documents
- sections (array, required): List of document sections with text content
- metadata (object, optional): Document-level metadata for filtering
Section Parameters:
- title (string, optional): Section heading or title
- text (string, required): The actual content text for this section
- metadata (object, optional): Section-level metadata for fine-grained filtering
Use structured documents for organized content like employee handbooks, policies, or technical manuals where clear section organization improves searchability.
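The parameter lists above map onto a JSON request body. As a minimal sketch that uses only the Python standard library (not the SDK), the snippet below assembles a structured document and the matching POST request without sending it. The base URL https://api.vectara.io, the x-api-key header, the helper names, and the sample document are assumptions made for illustration; check them against the REST API reference before relying on them.

```python
import json
import urllib.request

BASE_URL = "https://api.vectara.io"  # assumption: public API host


def build_structured_document(doc_id, sections, metadata=None):
    """Return a request body for a structured document.

    `sections` is a list of dicts with a required "text" key and
    optional "title" and "metadata" keys.
    """
    for index, section in enumerate(sections):
        if not section.get("text"):
            raise ValueError(f"section {index} is missing required 'text'")
    body = {"id": doc_id, "type": "structured", "sections": sections}
    if metadata:
        body["metadata"] = metadata
    return body


def build_create_request(corpus_key, document, api_key):
    """Build (but do not send) the POST request that indexes `document`."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v2/corpora/{corpus_key}/documents",
        data=json.dumps(document).encode("utf-8"),
        # Assumption: the API key travels in an x-api-key header.
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


handbook = build_structured_document(
    "employee-handbook-2024",  # hypothetical document ID
    sections=[
        {"title": "Leave policy", "text": "Employees accrue leave monthly.",
         "metadata": {"topic": "leave"}},
        {"text": "Remote work requires manager approval."},
    ],
    metadata={"department": "hr"},
)
request = build_create_request("my-docs", handbook, api_key="YOUR_API_KEY")
# Send with urllib.request.urlopen(request) once a real API key is in place.
```

Validating the required text field locally catches the most common cause of a 400 response before any network call is made.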
Error Handling:
- 400 Bad Request: Invalid document structure or parameters
- 403 Forbidden: Insufficient permissions - ensure API key has indexing rights
- 404 Not Found: Corpus doesn't exist
- 409 Conflict: Document with the same ID already exists
Create a core document
Create and index a core document using document parts. Core documents are more flexible than structured documents and work well for unstructured content like support articles, FAQs, or knowledge base entries.
Key Differences from Structured Documents:
- Uses document_parts instead of sections
- Parts don't have titles, only text content and optional metadata
- Better suited for unstructured or semi-structured content
Use Core Documents When:
- Content doesn't have clear section structure
- You want maximum flexibility in document organization
- Working with imported content from various sources
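To make the difference from structured documents concrete, here is a hedged sketch of a core document body built as a plain Python dict. The helper name and sample content are hypothetical, and the "core" value for the type field is an assumption that mirrors the "structured" value described earlier; confirm it against the REST API reference.

```python
def build_core_document(doc_id, parts, metadata=None):
    """Return a request body for a core document.

    `parts` may mix plain strings and dicts with a required "text" key
    and an optional "metadata" key. Parts carry no titles.
    """
    document_parts = []
    for index, part in enumerate(parts):
        if isinstance(part, str):
            part = {"text": part}
        if not part.get("text"):
            raise ValueError(f"part {index} is missing required 'text'")
        if "title" in part:
            raise ValueError(f"part {index} has a title; use a structured document")
        document_parts.append(part)
    # Assumption: "core" is the type value that pairs with document_parts.
    body = {"id": doc_id, "type": "core", "document_parts": document_parts}
    if metadata:
        body["metadata"] = metadata
    return body


faq = build_core_document(
    "faq-password-reset",  # hypothetical document ID
    parts=[
        "To reset your password, open Settings and choose Security.",
        {"text": "Reset links expire after 24 hours.",
         "metadata": {"audience": "end-user"}},
    ],
    metadata={"source": "support-portal"},
)
```

Accepting bare strings keeps imports from loosely structured sources short, while the title check flags content that would be better served by a structured document.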
To update or overwrite the document, you must delete it using client.documents.delete() and then re-index it, as direct updates to content are not supported. Attempting to re-index with the same ID and different content will result in a 409 error.
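The delete-then-re-index rule can be sketched as a small helper. This is an illustration under stated assumptions, not SDK code: send is a caller-supplied transport returning an HTTP status code, and FakeCorpus is an in-memory stand-in whose 201 and 204 success codes are placeholders rather than documented values.

```python
from urllib.parse import quote, unquote


def replace_document(send, corpus_key, document):
    """Delete any existing copy of `document`, then index the new version.

    `send(method, path, body)` is a caller-supplied transport that
    returns the HTTP status code of the response.
    """
    collection = f"/v2/corpora/{corpus_key}/documents"
    doc_path = f"{collection}/{quote(document['id'], safe='')}"

    delete_status = send("DELETE", doc_path, None)
    # A 404 only means there was nothing to delete, which is fine here.
    if delete_status != 404 and not 200 <= delete_status < 300:
        raise RuntimeError(f"delete failed with HTTP {delete_status}")

    create_status = send("POST", collection, document)
    if not 200 <= create_status < 300:
        raise RuntimeError(f"re-index failed with HTTP {create_status}")
    return create_status


class FakeCorpus:
    """In-memory stand-in for the API, used only to exercise the flow."""

    def __init__(self):
        self.docs = {}

    def send(self, method, path, body):
        if method == "DELETE":
            doc_id = unquote(path.rsplit("/", 1)[-1])
            return 204 if self.docs.pop(doc_id, None) is not None else 404
        existing = self.docs.get(body["id"])
        if existing is not None and existing != body:
            return 409  # same ID, different content
        self.docs[body["id"]] = body
        return 201


corpus = FakeCorpus()
v1 = {"id": "faq-1", "type": "core", "document_parts": [{"text": "Old answer."}]}
v2 = {"id": "faq-1", "type": "core", "document_parts": [{"text": "New answer."}]}
corpus.send("POST", "/v2/corpora/my-docs/documents", v1)
conflict_status = corpus.send("POST", "/v2/corpora/my-docs/documents", v2)
replace_status = replace_document(corpus.send, "my-docs", v2)
```

Between the delete and the re-index the document is briefly absent from the corpus, so queries issued in that window will not see it.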
Error Handling:
- 400 Bad Request: Invalid document structure or parameters
- 403 Forbidden: Insufficient permissions - ensure API key has indexing rights
- 404 Not Found: Corpus doesn't exist
- 409 Conflict: Document with the same ID already exists with different content
- 413 Payload Too Large: Document exceeds size limit
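One way to act on these status codes is a lookup table that turns a response code into an actionable hint. The function name and the wording of the hints are illustrative choices, not part of the SDK.

```python
# Hints for the status codes listed above for document creation.
CREATE_ERROR_HINTS = {
    400: "Invalid document structure or parameters; validate the request body.",
    403: "Insufficient permissions; check that the API key has indexing rights.",
    404: "Corpus not found; check the corpus key.",
    409: "A document with this ID already exists with different content; "
         "delete it before re-indexing.",
    413: "Document exceeds the size limit; consider splitting it.",
}


def explain_create_error(status):
    """Return None for a 2xx status, otherwise a human-readable hint."""
    if 200 <= status < 300:
        return None
    return CREATE_ERROR_HINTS.get(status, f"Unexpected HTTP status {status}.")


for code in (201, 409, 500):
    print(code, explain_create_error(code))
```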
List documents in a corpus
List the documents in a corpus, optionally narrowed by a metadata filter, and page through the results.
The documents.list method corresponds to the HTTP GET /v2/corpora/{corpus_key}/documents endpoint. For more details on request and response parameters, see the List Documents REST API.
Parameters:
- corpus_key (string, required): Unique identifier for the corpus
- limit (int, optional): Maximum number of documents to return per page (default: 10)
- metadata_filter (string, optional): Filter expression for document metadata
- page_key (string, optional): Token to fetch the next page of results
Returns: Iterator of Document objects (containing id and metadata, but not full content).
Use metadata filters to find specific document types or categories. The method returns paginated results for efficient handling of large document collections.
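The pagination loop behind such an iterator can be sketched with the standard library alone. Several details here are assumptions: the base URL, the response shape (a documents array plus a metadata.page_key token), and the filter expression syntax. fetch_json is a caller-supplied function that performs the GET and returns parsed JSON; fake_fetch replaces it so the loop can run offline.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://api.vectara.io"  # assumption: public API host


def build_list_url(corpus_key, limit=10, metadata_filter=None, page_key=None):
    """Build the GET URL for one page of the document listing."""
    params = {"limit": limit}
    if metadata_filter:
        params["metadata_filter"] = metadata_filter
    if page_key:
        params["page_key"] = page_key
    return f"{BASE_URL}/v2/corpora/{corpus_key}/documents?{urlencode(params)}"


def iter_documents(fetch_json, corpus_key, limit=10, metadata_filter=None):
    """Yield documents page by page until no page_key is returned."""
    page_key = None
    while True:
        page = fetch_json(build_list_url(corpus_key, limit, metadata_filter, page_key))
        yield from page.get("documents", [])
        # Assumption: the next-page token lives under metadata.page_key.
        page_key = (page.get("metadata") or {}).get("page_key")
        if not page_key:
            break


def fake_fetch(url):
    """Offline stand-in that serves two canned pages."""
    query = parse_qs(urlparse(url).query)
    if query.get("page_key") == ["page-2"]:
        return {"documents": [{"id": "doc-3"}], "metadata": {}}
    return {"documents": [{"id": "doc-1"}, {"id": "doc-2"}],
            "metadata": {"page_key": "page-2"}}


hr_filter = "doc.department = 'hr'"  # illustrative filter expression
ids = [doc["id"] for doc in iter_documents(fake_fetch, "my-docs", 2, hr_filter)]
print(ids)
```

Because iter_documents is a generator, callers can stop early without fetching the remaining pages.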
Get a document by ID
Access specific documents efficiently by their unique IDs, enabling detailed inspection or display within your corpus.
The documents.get method corresponds to the HTTP GET /v2/corpora/{corpus_key}/documents/{document_id} endpoint.
Parameters:
- corpus_key (string, required): Unique identifier of the corpus
- document_id (string, required): Unique identifier of the document
Returns: Document object with full text content and metadata.
Use this method when you need to retrieve the complete document content, not just the metadata returned by the list operation.
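As a rough sketch of the underlying HTTP call, the helper below builds the document URL and percent-encodes the ID, which matters when IDs contain slashes or spaces. The base URL, the x-api-key header, and the function names are assumptions; get_document is defined but not called here because it needs network access and a real key.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.vectara.io"  # assumption: public API host


def build_document_url(corpus_key, document_id):
    """Return the URL of a single document, with the ID percent-encoded."""
    return (f"{BASE_URL}/v2/corpora/{quote(corpus_key, safe='')}"
            f"/documents/{quote(document_id, safe='')}")


def get_document(corpus_key, document_id, api_key):
    """Fetch one document and return the parsed JSON response."""
    request = urllib.request.Request(
        build_document_url(corpus_key, document_id),
        # Assumption: the API key travels in an x-api-key header.
        headers={"Accept": "application/json", "x-api-key": api_key},
        method="GET",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


print(build_document_url("my-docs", "reports/2024 Q1"))
```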
Update document metadata
Enhance document management by updating metadata fields, perfect for tagging, categorization, and maintaining document status.
The documents.update method corresponds to the HTTP PATCH /v2/corpora/{corpus_key}/documents/{document_id} endpoint.
Parameters:
- corpus_key (string, required): Unique identifier of the corpus
- document_id (string, required): Unique identifier of the document
- metadata (object, required): New metadata to merge with existing metadata
The update operation merges the provided metadata with existing metadata, allowing you to add new fields or modify existing ones without losing other data.
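The merge behavior can be modeled locally to predict what a document's metadata will look like after an update. This sketch assumes a shallow merge of top-level keys, which the text above does not spell out, so verify how nested objects are handled before depending on it. The request builder, base URL, and x-api-key header are likewise assumptions.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.vectara.io"  # assumption: public API host


def merge_metadata(existing, updates):
    """Model the merge: updated keys win, untouched keys are kept.

    Assumption: the merge is shallow (top-level keys only).
    """
    merged = dict(existing)
    merged.update(updates)
    return merged


def build_update_request(corpus_key, document_id, metadata, api_key):
    """Build (but do not send) the PATCH request carrying new metadata."""
    return urllib.request.Request(
        url=(f"{BASE_URL}/v2/corpora/{corpus_key}"
             f"/documents/{quote(document_id, safe='')}"),
        data=json.dumps({"metadata": metadata}).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="PATCH",
    )


existing = {"status": "draft", "owner": "legal"}
updates = {"status": "approved", "reviewed_by": "kim"}
merged = merge_metadata(existing, updates)
patch = build_update_request("my-docs", "contract-42", updates, "YOUR_API_KEY")
print(merged)
```

Only the changed and new fields need to be sent; owner survives the update even though it is absent from the request.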
Delete a document
Manage your corpus effectively by permanently removing documents, supporting data cleanup and lifecycle management.
The documents.delete method corresponds to the HTTP DELETE /v2/corpora/{corpus_key}/documents/{document_id} endpoint.
Parameters:
- corpus_key (string, required): Unique identifier of the corpus
- document_id (string, required): Unique identifier of the document to delete
Deletion is permanent and cannot be undone. Ensure you have backups if the document might be needed later.
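Since deletion cannot be undone, one defensive pattern is to save a copy of the document before removing it. The helper below is a sketch: get_document and delete_document are caller-supplied callables (for example thin wrappers around the SDK or REST calls), and the backup file naming is an arbitrary choice. The demo uses an in-memory dict in place of a real corpus.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import quote


def delete_with_backup(get_document, delete_document, corpus_key,
                       document_id, backup_dir):
    """Write the document to a JSON file, then delete it.

    The delete only runs if the fetch and the file write both succeed.
    """
    document = get_document(corpus_key, document_id)
    # Percent-encode the ID so slashes cannot escape the backup directory.
    filename = f"{quote(corpus_key, safe='')}__{quote(document_id, safe='')}.json"
    backup_path = Path(backup_dir) / filename
    backup_path.write_text(json.dumps(document, indent=2), encoding="utf-8")
    delete_document(corpus_key, document_id)
    return backup_path


# Offline demo: a dict stands in for the corpus.
store = {"old-policy": {"id": "old-policy", "metadata": {"status": "retired"}}}
with tempfile.TemporaryDirectory() as tmp:
    path = delete_with_backup(
        lambda corpus, doc_id: store[doc_id],
        lambda corpus, doc_id: store.pop(doc_id),
        "my-docs", "old-policy", tmp,
    )
    restored = json.loads(path.read_text(encoding="utf-8"))
```

The restored dict can later be re-indexed with the create call if the document turns out to be needed.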
Summarize a document
Generate LLM-powered summaries for specific documents in your corpus. Use this for content previews, search snippets, or generative UI applications.
The documents.summarize method corresponds to the HTTP POST /v2/corpora/{corpus_key}/documents/{document_id}/summarize endpoint.
Parameters:
- corpus_key (string, required): Unique identifier of the corpus
- document_id (string, required): Unique identifier of the document
- llm_name (string, optional): LLM model to use for summarization
- prompt_template (string, optional): Custom prompt with a $document_content placeholder
Returns: Summary response object with the generated summary text.
Use custom prompt templates to tailor summaries for specific use cases like customer support, technical documentation, or content previews.
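A request body for a tailored summary might be assembled as follows. This is a sketch: the helper name and the sample prompt are hypothetical, and the local check that the template contains $document_content is a defensive choice of this example rather than a documented server rule.

```python
PLACEHOLDER = "$document_content"


def build_summarize_body(llm_name=None, prompt_template=None):
    """Return the JSON body for a summarize call; both fields are optional."""
    body = {}
    if llm_name:
        body["llm_name"] = llm_name
    if prompt_template is not None:
        # Local guard: a template without the placeholder would never
        # see the document text.
        if PLACEHOLDER not in prompt_template:
            raise ValueError(f"prompt_template must contain {PLACEHOLDER}")
        body["prompt_template"] = prompt_template
    return body


support_prompt = (
    "Summarize the following article in three bullet points "
    "for a customer support agent:\n\n$document_content"
)
summary_body = build_summarize_body(prompt_template=support_prompt)
default_body = build_summarize_body()  # no overrides: empty body
```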
Workflow: Create corpus and index document
This example demonstrates the fundamental two-step workflow for establishing a new knowledge base in Vectara.
- Corpus creation: The first step creates a new corpus with a unique identifier (key) and human-readable name. The corpus acts as a namespace for your documents and defines important characteristics like metadata schemas, filter attributes, and access controls. The example includes error handling for the common case where the corpus already exists.
- Document ingestion: The second step uploads and indexes a structured document into the corpus. The document is parsed into searchable sections, with each section containing both text content and optional metadata. Vectara processes the content automatically, making it immediately queryable through the search API.
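The two steps can be sketched as one function that tolerates an existing corpus. Assumptions are flagged in the code: post is a caller-supplied transport returning an HTTP status code, corpus creation is taken to be a POST to /v2/corpora that answers 409 when the key is already in use, and FakeApi is an in-memory stand-in so the flow can run offline.

```python
def create_corpus_and_index(post, corpus_key, corpus_name, document):
    """Create the corpus if needed, then index one document into it.

    `post(path, body)` returns an HTTP status code. Returns True when the
    corpus was newly created and False when it already existed.
    """
    # Assumption: corpora are created with POST /v2/corpora.
    status = post("/v2/corpora", {"key": corpus_key, "name": corpus_name})
    if status == 409:
        corpus_created = False  # already exists; reuse it
    elif 200 <= status < 300:
        corpus_created = True
    else:
        raise RuntimeError(f"corpus creation failed with HTTP {status}")

    status = post(f"/v2/corpora/{corpus_key}/documents", document)
    if not 200 <= status < 300:
        raise RuntimeError(f"document indexing failed with HTTP {status}")
    return corpus_created


class FakeApi:
    """In-memory stand-in for the API, used only to exercise the flow."""

    def __init__(self):
        self.corpora = set()
        self.documents = {}

    def post(self, path, body):
        if path == "/v2/corpora":
            if body["key"] in self.corpora:
                return 409
            self.corpora.add(body["key"])
            return 201
        corpus_key = path.split("/")[3]
        if corpus_key not in self.corpora:
            return 404
        self.documents[(corpus_key, body["id"])] = body
        return 201


api = FakeApi()
doc_a = {"id": "guide-1", "type": "structured", "sections": [{"text": "Intro."}]}
doc_b = {"id": "guide-2", "type": "structured", "sections": [{"text": "Setup."}]}
first_run = create_corpus_and_index(api.post, "my-docs", "My Docs", doc_a)
second_run = create_corpus_and_index(api.post, "my-docs", "My Docs", doc_b)
```

Treating the "already exists" response as success makes the workflow safe to rerun, while any other failure stops before a document is sent to a corpus that may not exist.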
Best Practices
- Descriptive naming: Use meaningful corpus keys and names that clearly identify the content domain and purpose.
- Consistent metadata: Establish a uniform metadata schema across all documents within a corpus to enable effective filtering.
- Robust error handling: Implement comprehensive logic that handles both creation failures and "already exists" scenarios gracefully.
- Verification steps: Confirm corpus creation success before attempting document indexing to avoid orphaned content.
- Resource management: Consider using unique corpus keys for testing to avoid conflicts with existing resources.
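Two of these practices are easy to automate. The sketch below generates a unique corpus key for test runs and reports which agreed metadata fields a document is missing before it is indexed. The helper names, the key format, and the required field set are hypothetical examples, not platform requirements.

```python
import uuid

# Hypothetical schema: fields every document in this corpus should carry.
REQUIRED_METADATA = {"department", "doc_type", "status"}


def make_test_corpus_key(prefix="test-docs"):
    """Return a corpus key that is unlikely to collide with existing ones."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


def missing_metadata_fields(document, required=REQUIRED_METADATA):
    """Return the required metadata fields absent from `document`, sorted."""
    present = set(document.get("metadata") or {})
    return sorted(set(required) - present)


draft = {"id": "policy-7", "metadata": {"department": "hr", "status": "draft"}}
print(make_test_corpus_key())
print(missing_metadata_fields(draft))
```

Running the metadata check before indexing keeps filter expressions reliable, since a filter on a field that some documents lack will silently skip them.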
Next steps
After understanding document management and indexing, you can:
- Query documents: Use client.query() to search across document content with the Query guide
- Upload files: Use client.upload.file() to index PDFs, DOCX, and other file formats with the Upload Files guide
- Manage corpora: Create and configure corpora with client.corpora.create() using the Corpora guide
- Batch operations: Process multiple documents efficiently for large-scale content management
- Advanced filtering: Leverage metadata for sophisticated document organization