Play 15
Multi-Modal DocProc
Medium · 🔧 Skeleton
Process documents with text + images using GPT-4o multi-modal vision.
GPT-4o's vision capability processes documents that combine text, images, charts, and tables. Document Intelligence handles OCR and layout, then GPT-4o interprets visual elements such as graphs, stamps, and signatures. Output is structured JSON; multi-page documents are handled with page-level processing.
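As a minimal sketch of the page-level step, the helper below assembles a multi-modal chat payload for one page: the OCR text from Document Intelligence plus the page image, with the model instructed to return JSON for a given field schema. The function name and `schema_fields` parameter are hypothetical; the `image_url` content shape follows the OpenAI chat-completions API that Azure OpenAI exposes for gpt-4o.

```python
import base64

def build_page_messages(ocr_text: str, image_bytes: bytes,
                        schema_fields: list[str]) -> list[dict]:
    """Build a multi-modal chat payload for one document page.

    Combines OCR text with the rendered page image and asks the model
    to return JSON restricted to the given schema fields.
    """
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    system = (
        "You extract structured data from documents. "
        "Return only JSON with these fields: " + ", ".join(schema_fields) + "."
    )
    user_content = [
        # OCR text gives the model reliable character-level content
        {"type": "text", "text": "OCR text for this page:\n" + ocr_text},
        # The page image lets it read charts, stamps, and signatures
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ]
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_content},
    ]
```

The returned list can be passed as `messages` to a gpt-4o chat-completions call, one call per page, with the per-page JSON results merged afterwards.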
Architecture Pattern
Multi-modal extraction: images + text + tables → structured JSON
Azure Services
- Azure OpenAI (gpt-4o vision)
- Document Intelligence
- Blob Storage
- Cosmos DB
- Azure Functions
DevKit (.github Agentic OS)
- agent.md — multimodal processor persona
- instructions.md — image handling guide
- mcp/index.js — image validation tools
- plugins/ — image processor, table recognizer, extractor
TuneKit (AI Config)
- config/openai.json — gpt-4o, vision prompts
- config/extraction.json — field schemas, image handling rules
- config/guardrails.json — PII in images
- evaluation/ — extraction accuracy per doc type
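The evaluation folder's per-doc-type accuracy check can be sketched as a small scoring function; the record shape (`doc_type`, `expected`, `extracted`) is an assumption for illustration, not a format the play prescribes.

```python
def extraction_accuracy(results: list[dict]) -> dict[str, float]:
    """Compute field-level extraction accuracy per document type.

    Each result looks like:
        {"doc_type": str, "expected": dict, "extracted": dict}
    A field counts as correct when the extracted value equals the
    expected (gold) value exactly.
    """
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for r in results:
        dt = r["doc_type"]
        for field, expected in r["expected"].items():
            total[dt] = total.get(dt, 0) + 1
            if r["extracted"].get(field) == expected:
                correct[dt] = correct.get(dt, 0) + 1
    return {dt: correct.get(dt, 0) / total[dt] for dt in total}
```

Running this over a labeled evaluation set per document type (invoices, contracts, forms) gives the accuracy signal used to tune prompts and schemas.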
Tuning Parameters
- Image prompts
- Extraction schemas
- Confidence thresholds
- Page processing order
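Of these, the confidence threshold is the most mechanical to apply. A minimal sketch, assuming each extracted field carries a `{"value": ..., "confidence": ...}` shape (an illustrative convention, not a format defined by the play): fields below the threshold are routed to human review rather than written to Cosmos DB.

```python
def apply_confidence_threshold(fields: dict[str, dict],
                               threshold: float = 0.8):
    """Split extracted fields into accepted values and ones needing review.

    Each field value is a dict like {"value": ..., "confidence": float}.
    Returns (accepted, review): accepted maps field -> value; review keeps
    the full low-confidence entries for a human-in-the-loop queue.
    """
    accepted: dict = {}
    review: dict = {}
    for name, item in fields.items():
        if item.get("confidence", 0.0) >= threshold:
            accepted[name] = item["value"]
        else:
            review[name] = item
    return accepted, review
```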
Estimated Cost
- Dev/Test: $120–280/mo
- Production: $1.5K–4K/mo