The geai ingestion option supports a wide range of file types for intelligent content extraction and transformation, including PDFs, images, and multimedia formats. Depending on the type of document, different strategies are applied to maximize the retrieval of useful information, including text, tables, images, and audio transcripts.
PDF files are processed differently depending on their content:
- Scanned PDF: each page is treated as an image and, by default, processed with the hi_res strategy and the scannedPrompt parameter. One LLM call is therefore executed per page to interpret the image and extract as much text as possible.
- Standard PDF: documents containing text, images, and tables are processed with the selected strategy parameter, in combination with the imagePrompt/tablePrompt parameters.
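The two PDF cases above can be sketched as a small routing function. This is only an illustration of the documented behavior, not the provider's actual implementation: `PdfPlan`, `plan_pdf_processing`, and the `is_scanned` flag are hypothetical names, and the strategy value passed in the usage line is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PdfPlan:
    strategy: str              # extraction strategy applied to the document
    prompts: List[str]         # names of the prompt parameters that are consulted
    llm_calls: Optional[int]   # known only for scanned PDFs (one call per page)


def plan_pdf_processing(is_scanned: bool, page_count: int, strategy: str) -> PdfPlan:
    """Hypothetical helper mirroring the two PDF cases described above."""
    if is_scanned:
        # Scanned PDF: every page is an image, so hi_res and scannedPrompt
        # are used by default, with one LLM call per page.
        return PdfPlan("hi_res", ["scannedPrompt"], page_count)
    # Standard PDF: the selected strategy applies, together with
    # imagePrompt/tablePrompt for embedded images and tables.
    return PdfPlan(strategy, ["imagePrompt", "tablePrompt"], None)


# "selected_strategy" is a placeholder, not a documented strategy value.
print(plan_pdf_processing(is_scanned=True, page_count=12, strategy="selected_strategy"))
```

Because a scanned PDF triggers one LLM call per page, the page count is a rough estimate of the cost of ingesting it.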
Media files (audio and video among the supported RAG File Formats) are processed with automated transcription or image frame processing, depending on the configuration.
Check geai Ingestion Provider Parameters.
The imagePrompt default is as follows:
You are an assistant tasked with extracting all text from images in the image text language.
These images are pages from a PDF document. Extract and transcribe all visible text in the image,
maintaining the structure and layout as much as possible. Include any headers, footers, and page numbers.
Be thorough and don't miss any text, no matter how small or where it's positioned in the image.
The tablePrompt default is as follows:
You are an assistant tasked with extracting all text from images in the image text language.
These images are pages from a PDF document. Extract and transcribe all visible text in the image,
maintaining the structure and layout as much as possible. Include any headers, footers, and page numbers.
Be thorough and don't miss any text, no matter how small or where it's positioned in the image.
The scannedPrompt default is as follows:
You are an assistant tasked with extracting all text from images in the image text language.
These images are pages from a PDF document. Extract and transcribe all visible text in the image,
maintaining the structure and layout as much as possible. Include any headers, footers, and page numbers.
Be thorough and don't miss any text, no matter how small or where it's positioned in the image.
The mediaPrompt default is as follows:
Generate a descriptive caption of this video frame optimized for search and retrieval.
Use keywords and phrases that capture the visual content, scene type, objects, people, and potential actions or events depicted.
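A minimal sketch of how custom prompts might be prepared before they are sent with an ingestion request. Only the four parameter names come from this page. `build_prompt_overrides` is a hypothetical helper, and the sketch assumes that any prompt left out of the request falls back to the default shown above.

```python
from typing import Dict

# The four prompt parameter names documented in this section.
PROMPT_PARAMETERS = ("imagePrompt", "tablePrompt", "scannedPrompt", "mediaPrompt")


def build_prompt_overrides(overrides: Dict[str, str]) -> Dict[str, str]:
    """Validate prompt overrides; omitted prompts are assumed to keep their defaults."""
    unknown = sorted(set(overrides) - set(PROMPT_PARAMETERS))
    if unknown:
        raise ValueError(f"Unknown prompt parameter(s): {', '.join(unknown)}")
    # Drop blank values so that the default prompt is used instead.
    return {name: text.strip() for name, text in overrides.items() if text.strip()}


params = build_prompt_overrides({
    "tablePrompt": "Extract every table as Markdown, preserving row and column order.",
    "mediaPrompt": "   ",  # blank, so it is dropped and the default applies
})
print(params)
```

Validating the names up front catches typos such as `tableprompt`, which might otherwise go unnoticed and leave the default prompt in effect.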
Check geai Ingestion Provider Samples.