How We Built a RAG Document Assistant That Cut Proposal Time by 85%
One of our enterprise clients — a construction technology company operating across the Middle East — was spending two full working days preparing each client proposal. Their team manually reviewed technical specifications, extracted requirements, matched them against their own capability documents, and wrote tailored responses. We built a Retrieval-Augmented Generation (RAG) system that automated the research and drafting phase, cutting proposal time to under three hours. This is the architecture behind that system and the key decisions that made it work in production.
The Problem: Five Years of Knowledge Trapped in Files
The client had accumulated five years of proposal history, technical specifications, product datasheets, and case studies — totalling over 4,000 pages across PDFs, Word documents, and Excel files. The institutional knowledge existed but was inaccessible at the speed needed to stay competitive.
- Manual search: Finding relevant past proposals for a new RFP took 2-3 hours alone, and relied entirely on one senior employee's memory of what existed and where
- Inconsistency: Different team members produced proposals with varying quality, terminology, and technical depth depending on their individual experience level
- Scalability ceiling: The team could handle 3-4 proposals per week at maximum capacity. Winning more contracts required either significant hiring or a fundamentally different approach
- Compliance gaps: Critical technical requirements buried in 80-page RFP documents were sometimes missed in manual review, resulting in scope gaps discovered late in the sales process
The RAG Architecture We Built
The system uses a retrieval-augmented generation pipeline: when the user submits an RFP, relevant document sections are retrieved from a vector store and provided as context to the language model. The model generates structured responses grounded in the retrieved content — rather than hallucinating from general training data.
- Document ingestion: PDF, DOCX, and XLSX files processed via Apache Tika for text extraction, with custom post-processing to handle tables, headers, and embedded metadata (extraction sketch after this list)
- Chunking strategy: Semantic chunking using paragraph boundaries rather than fixed token counts. Chunks average 300-500 tokens with 50-token overlap, preserving context that fixed-size chunking destroys (chunking sketch after this list)
- Embedding model: OpenAI text-embedding-3-large (3072 dimensions) — chosen for its strong performance on technical domain text and consistent cross-lingual quality for Arabic and English documents
- Vector store: pgvector extension on PostgreSQL — it integrates with the client's existing PostgreSQL infrastructure, keeps ACID guarantees, and avoids a separate managed vector database service (query sketch after this list)
- Hybrid search: BM25 keyword search combined with cosine similarity vector search, merged via Reciprocal Rank Fusion (RRF). Hybrid consistently outperforms pure semantic search on domain-specific technical queries (merge sketch after this list)
- LLM: Anthropic Claude 3.5 Sonnet for generation — selected for its 200k token context window, strong instruction-following on structured tasks, and consistent JSON output via tool use
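To make the ingestion step concrete, here is a minimal sketch using the Apache Tika Python bindings (the `tika` package launches a local Tika server on first call). The helper name is ours, and the custom post-processing for tables and headers is omitted; treat this as an illustration, not the production pipeline.

```python
# Sketch: raw text + metadata extraction via Apache Tika.
# The table/header post-processing described above would run on `content`.
from tika import parser

def extract_text(path: str) -> tuple[str, dict]:
    """Extract text and metadata from a PDF, DOCX, or XLSX file."""
    parsed = parser.from_file(path)
    return (parsed.get("content") or "", parsed.get("metadata") or {})
```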
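The chunking step can be sketched as greedy packing of paragraphs under a token budget. We assume tiktoken for token counting here; the function shape and the greedy strategy are illustrative rather than the client's exact code.

```python
# Sketch: semantic chunking on paragraph boundaries with a 50-token overlap.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_by_paragraphs(text: str, max_tokens: int = 500,
                        overlap_tokens: int = 50) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        para_len = len(enc.encode(para))
        # Close the chunk once adding this paragraph would exceed the budget.
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the last ~50 tokens forward so context that straddles a
            # chunk boundary is not lost.
            tail = enc.decode(enc.encode(chunks[-1])[-overlap_tokens:])
            current, current_len = [tail], overlap_tokens
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```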
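The retrieval leg then reduces to one embedding call and one SQL query. The table and column names (`rfp_chunks`, `embedding`, `content`) are assumptions for illustration; `<=>` is pgvector's cosine-distance operator.

```python
# Sketch: embed the query, then rank chunks by cosine distance in pgvector.
import numpy as np
import psycopg
from openai import OpenAI
from pgvector.psycopg import register_vector

client = OpenAI()

def semantic_search(conn: psycopg.Connection, query: str, k: int = 20):
    register_vector(conn)  # lets psycopg send numpy arrays as vector values
    resp = client.embeddings.create(model="text-embedding-3-large", input=query)
    qvec = np.array(resp.data[0].embedding)  # 3072 dimensions
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, content, embedding <=> %s AS distance "
            "FROM rfp_chunks ORDER BY distance LIMIT %s",
            (qvec, k),
        )
        return cur.fetchall()
```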
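The RRF merge itself is only a few lines: each chunk's fused score is the sum of reciprocal ranks across the BM25 list and the vector list. The k = 60 constant is the conventional default from the original RRF paper, not a value the system specifies.

```python
# Sketch: Reciprocal Rank Fusion over two (or more) ranked lists of chunk ids.
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Something like `rrf_merge([bm25_ids, vector_ids])[:20]` then yields the candidate set handed to the re-ranking step described in the next section.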
Key Engineering Decisions That Made the Difference
Several architectural choices significantly improved output quality. None of them is obvious up front, and each required A/B testing to validate. We document them here because they are the decisions most teams get wrong when building RAG systems.
- Re-ranking before generation: After hybrid search returns 20 candidate chunks, a cross-encoder re-ranker (ms-marco-MiniLM-L-6-v2) re-scores them for contextual relevance. This single step improved answer accuracy by 23% on our internal evaluation set (re-ranking sketch after this list)
- Citation tracking: Every generated sentence is linked to its source document with a page or section reference. This lets the proposal writer verify any claim before sending, and it gave the client the confidence to trust and use the output (data-model sketch after this list)
- Structured output via tool use: Instead of free-form text generation, we used Claude's tool use feature to output structured JSON — executive summary, technical requirements list, proposed solution section, and pricing indicators — making downstream formatting fully deterministic (tool-schema sketch after this list)
- Map-reduce for long RFPs: For 100+ page documents, a single-pass approach exceeds context limits and produces incoherent output. We implemented a map-reduce pattern — summarize each section independently, then synthesize across summaries — maintaining coherence across the full document (sketched after this list)
- Prompt versioning in database: All prompts are stored with version history and can be updated without code deployments. This allowed the client's team to tune prompts themselves after handover, and gave us the ability to A/B test prompt variants against a golden evaluation set (schema sketch after this list)
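A minimal sketch of the re-ranking step, using the sentence-transformers CrossEncoder wrapper around the checkpoint named above; the `top_k` cutoff and function shape are illustrative.

```python
# Sketch: cross-encoder re-ranking of the hybrid-search candidates.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # The cross-encoder scores each (query, chunk) pair jointly, which is
    # slower than bi-encoder retrieval but far more precise.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```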
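Citation tracking is as much a data-model decision as a prompting one. A record like the following per generated sentence is enough to support verification; the field names are our illustration.

```python
# Sketch: the per-sentence provenance record behind citation tracking.
from dataclasses import dataclass

@dataclass
class Citation:
    sentence: str      # generated sentence as it appears in the draft
    source_doc: str    # file the supporting chunk was extracted from
    page: int | None   # page or section reference for manual verification
    chunk_id: str      # vector-store id of the retrieved chunk
```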
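The structured-output pattern looks roughly like this: define one tool whose input schema is the proposal structure, then pin `tool_choice` to it so the model cannot reply with free-form text. The schema fields mirror the sections named above, but the exact schema and model string are illustrative assumptions.

```python
# Sketch: forcing structured JSON from Claude via tool use.
import anthropic

client = anthropic.Anthropic()

PROPOSAL_TOOL = {
    "name": "draft_proposal",
    "description": "Return the drafted proposal sections as structured JSON.",
    "input_schema": {
        "type": "object",
        "properties": {
            "executive_summary": {"type": "string"},
            "technical_requirements": {"type": "array", "items": {"type": "string"}},
            "proposed_solution": {"type": "string"},
            "pricing_indicators": {"type": "string"},
        },
        "required": ["executive_summary", "technical_requirements", "proposed_solution"],
    },
}

def draft_sections(context: str) -> dict:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        tools=[PROPOSAL_TOOL],
        # Forcing the tool guarantees a tool_use block instead of prose.
        tool_choice={"type": "tool", "name": "draft_proposal"},
        messages=[{"role": "user", "content": context}],
    )
    return response.content[0].input  # a dict shaped by input_schema
```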
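The map-reduce pattern for long RFPs can be sketched as two passes, assuming a `complete()` helper that wraps a single Claude call; the prompts here are illustrative.

```python
# Sketch: map-reduce summarization for 100+ page RFPs.
import anthropic

client = anthropic.Anthropic()

def complete(prompt: str) -> str:
    resp = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def summarize_rfp(sections: list[str]) -> str:
    # Map: each section is summarized independently, so no single call
    # comes near the context limit.
    summaries = [
        complete(f"List every technical requirement in this RFP section:\n\n{s}")
        for s in sections
    ]
    # Reduce: synthesize across the per-section summaries so requirements
    # and cross-references survive into one coherent brief.
    joined = "\n\n".join(f"Section {i + 1}:\n{s}" for i, s in enumerate(summaries))
    return complete(
        "Combine these section summaries into one coherent requirements "
        f"brief, preserving every requirement:\n\n{joined}"
    )
```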
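Finally, a sketch of database-backed prompt versioning; the schema and helper are assumptions, but they show the idea: prompts are rows, so updating one is an INSERT rather than a deployment.

```python
# Sketch: versioned prompts in PostgreSQL, fetched at request time.
import psycopg

SCHEMA = """
CREATE TABLE IF NOT EXISTS prompts (
    name       text NOT NULL,
    version    integer NOT NULL,
    template   text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (name, version)
);
"""

def latest_prompt(conn: psycopg.Connection, name: str) -> str:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT template FROM prompts WHERE name = %s "
            "ORDER BY version DESC LIMIT 1",
            (name,),
        )
        row = cur.fetchone()
        if row is None:
            raise KeyError(f"no prompt named {name!r}")
        return row[0]
```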
Results and What We Learned
After three weeks of development and two weeks of calibration with real proposals, these were the measured outcomes at the six-week mark.
- Proposal preparation time: reduced from 2 working days to under 3 hours — an 85% reduction in the time-to-send metric
- Output volume: increased from 3-4 proposals per week to 12+ per week using the same team size, with no reduction in quality
- Accuracy: 94% of generated technical sections required no substantive edits, measured across 60 consecutive proposals in the calibration period
- Knowledge democratization: new team members with no prior product knowledge could produce senior-quality proposals within their first week
- Biggest lesson: the single highest-impact improvement was the re-ranking step, not the choice of LLM. A better retriever consistently beats a larger model. Invest in retrieval quality first
- Second lesson: structured output (tool use / JSON mode) is non-negotiable for production AI systems. Free-form text generation cannot be reliably parsed, formatted, or integrated into downstream workflows