You can customize the geai Ingestion Provider using the following parameters.

Important concepts

  • Processing strategy defines how the document content is extracted (text-only, high-resolution OCR, or page-by-page LLM interpretation).
  • Chunking strategy defines how the extracted content is split into fragments (chunks) for indexing and semantic search.
  • These are independent configurations. Processing happens first, chunking happens afterwards.

Processing Parameters (Content Extraction)

Parameter Description
strategy Determines how the document is processed before chunking. Options:

- auto (default): Text-based processing. Extracts machine-readable text only. Fastest and lowest cost. Recommended for standard PDFs, TXT, DOCX, etc. without complex visual structures.

- hi_res: High-resolution processing. Requires the model parameter. Uses a visual-capable model to better interpret complex layouts, images, and tables. More expensive than auto but recommended for PDFs containing images, scanned sections, or structured tables. See Documents with images hi_res strategy.

- llm: Processes each PDF page independently using a multimodal LLM. Most accurate for complex visual documents but also the most expensive and slowest option. Suitable for highly visual or poorly structured PDFs. (2)
model Specifies the multimodal model used when visual interpretation is required (hi_res or llm). Format: provider/modelname.

Must support visual input (see visual support).

Examples: openai/gpt-4o, openai/gpt-4o-mini, anthropic/claude-3-5-sonnet-20240620, vertex_ai/gemini-1.5-flash.
imagePrompt Custom prompt used when interpreting images embedded in documents. Overrides the default visual interpretation instructions.
scannedPrompt Custom prompt used when the entire page is treated as an image (e.g., scanned PDFs).
tablePrompt Custom prompt used to interpret detected tables. Overrides the default table-processing instructions.
logoProcess Determines whether detected logos are processed by the visual model.

- false (default): Logos are ignored.
- true: Logos are analyzed and described (e.g., extracting brand meaning or context).
dpi Defines the DPI (Dots Per Inch) used when rendering document pages as images for visual processing. Default: 200. Higher DPI may improve accuracy but increases processing cost and time.
password Password used to unlock protected PDF files.
startPage First page to process. Default: 1. Useful when indexing only a subset of a document.
endPage Last page to process. If 0, processing continues to the end of the document. Useful for partial ingestion.
outputFormat Output format: plainText (default), html, markdown.
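
The constraints in the table above can be checked before a document is submitted. The following is a minimal sketch, not the platform's API: `build_processing_params` is a hypothetical helper, and how the resulting dictionary is sent to the service is not shown here. It also assumes that llm needs a model just as hi_res does, which the model row suggests but does not state outright.

```python
import re

VISUAL_STRATEGIES = {"hi_res", "llm"}
STRATEGIES = {"auto"} | VISUAL_STRATEGIES
OUTPUT_FORMATS = {"plainText", "html", "markdown"}
# provider/modelname, e.g. "openai/gpt-4o"
MODEL_PATTERN = re.compile(r"^[^/\s]+/\S+$")


def build_processing_params(strategy="auto", model=None, start_page=1,
                            end_page=0, output_format="plainText", dpi=200):
    """Return a dict of processing parameters after basic sanity checks.

    Hypothetical helper: the keys mirror the documented parameter names,
    but the payload shape itself is an assumption.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown outputFormat: {output_format!r}")
    # hi_res requires a model; requiring one for llm too is an assumption.
    if strategy in VISUAL_STRATEGIES:
        if not model or not MODEL_PATTERN.match(model):
            raise ValueError("visual strategies need a model in provider/modelname form")
    if start_page < 1:
        raise ValueError("startPage must be 1 or greater")
    # endPage = 0 means "process to the end of the document".
    if end_page != 0 and end_page < start_page:
        raise ValueError("endPage must be 0 or not less than startPage")

    params = {
        "strategy": strategy,
        "startPage": start_page,
        "endPage": end_page,
        "outputFormat": output_format,
        "dpi": dpi,
    }
    if model:
        params["model"] = model
    return params


if __name__ == "__main__":
    print(build_processing_params("hi_res", model="openai/gpt-4o",
                                  start_page=2, end_page=5))
```

Failing early on a missing model or an inverted page range avoids paying for a visual-model run that would be rejected or would process the wrong pages.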

Chunking Parameters

Chunking determines how extracted content is split into fragments stored in the vector index.

Warning: Changing chunking parameters requires re-ingesting all documents in the assistant index.
Parameter Description
chunkStrategy(1) Defines how content is fragmented into chunks. Options:

- (empty) (default): Standard size-based chunking using chunkSize and chunkOverlap.

- byLayoutType: Attempts to preserve logical layout blocks (e.g., large tables or structured sections) as single chunks. Less strict about size limits. Recommended for documents containing tables or structured visual content to avoid splitting related data across multiple chunks.
chunkSize Target chunk size in characters. Default: 1000 characters (recommended semantic “sweet spot” for similarity search).

Actual chunk size may vary slightly depending on structure and layout preservation rules.
chunkOverlap Number of overlapping characters between consecutive chunks. Helps preserve context across boundaries.
structure Defines assumptions about structured/tabular documents.

- (empty) (default): No special structure assumed.
- table: Assumes tabular structure. Applicable only to csv and xls* formats. See table structure ingestion strategy.
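
To make the interaction between chunkSize and chunkOverlap concrete, here is an illustrative sliding-window splitter. It is a sketch of generic size-based chunking, not the platform's actual splitter, which may also respect structure and layout rules. The function name `chunk_text` and the overlap default of 100 are assumptions made for the example; this page does not state a default for chunkOverlap.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into character windows that share chunk_overlap characters.

    Illustrative only: real chunk sizes on the platform may vary slightly.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(piece)
```

With a chunk size of 4 and an overlap of 1, each window starts 3 characters after the previous one, so the last character of one chunk is repeated as the first character of the next.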

File-Type Considerations

Different file types may be processed differently internally:

  • TXT / plain text → Standard size-based chunking.
  • Markdown → Currently not automatically chunked by heading hierarchy. If hierarchical chunking is required, clients must upload a custom pre-structured file.
  • CSV / XLSX → Can use structure=table.
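
One way to apply these rules on the client side is to choose chunking parameters from the file extension. The sketch below assumes a hypothetical helper, `chunking_params_for`, and treats structure=table as opt-in for csv and xls* files only; whether you want it for a given spreadsheet remains a judgment call.

```python
from pathlib import PurePath


def chunking_params_for(filename, chunk_size=1000):
    """Pick chunking parameters from the file extension (hypothetical helper)."""
    ext = PurePath(filename).suffix.lower()
    params = {"chunkSize": chunk_size}
    # structure=table applies only to csv and xls* formats.
    if ext == ".csv" or ext.startswith(".xls"):
        params["structure"] = "table"
    return params


if __name__ == "__main__":
    print(chunking_params_for("sales.xlsx"))
    print(chunking_params_for("notes.md"))
```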

To inspect how a document was processed internally:
- Use the “View” option in the console.
- Check the “Chunks” link to see the actual fragmentation result.

Custom Document Configuration

You can upload a custom document configuration file (a JSON file with the .custom extension) to explicitly define how content should be structured or chunked.

This can be done:
- Via API
- Via Console → Add documents

This is especially useful when:
- You need a chunking scheme that the platform does not provide.
- A JSON file requires strict preservation of its fields.

For more details, see .custom File Format.

Media (Video & Audio) Parameters

Parameter Description
dialogue Indicates whether the media contains spoken dialogue.

- 1: Media contains speech → transcription is performed.
- 0: No speech → visual summarization is used.
mediaPrompt Prompt used to describe or summarize visual frames when dialogue = 0.
frameSamplingRate Interval (in seconds) between extracted frames. Example: 2 = one frame every 2 seconds. Default: 5.
merge Number of consecutive transcript lines to merge. Set 0 to merge all lines into a single block.
whisperModel Whisper model variant for audio transcription: tiny (default), small, medium, large.
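
The merge and frameSamplingRate semantics can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's implementation: `merge_transcript` and `frame_timestamps` are hypothetical names, joining merged lines with a space is an assumption, and so is sampling a frame at second 0.

```python
def merge_transcript(lines, merge):
    """Group consecutive transcript lines; merge=0 joins everything into one block."""
    if merge < 0:
        raise ValueError("merge must be 0 or greater")
    if not lines:
        return []
    if merge == 0:
        return [" ".join(lines)]
    return [" ".join(lines[i:i + merge]) for i in range(0, len(lines), merge)]


def frame_timestamps(duration_seconds, frame_sampling_rate=5):
    """Return the seconds at which frames would be sampled (assumes a frame at 0)."""
    if frame_sampling_rate <= 0:
        raise ValueError("frame_sampling_rate must be positive")
    return list(range(0, int(duration_seconds) + 1, frame_sampling_rate))


if __name__ == "__main__":
    print(merge_transcript(["a", "b", "c", "d", "e"], 2))
    print(frame_timestamps(10, 5))
```

A smaller frameSamplingRate yields more frames for the same duration, so it trades processing cost for finer visual coverage.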

(1) Specific to RAG Assistants.
(2) Available since the 2026-01 release.

Last update: 2026 | © Globant S.A. All rights reserved.