An overview of building internal RAG systems for business — architecture, tooling, costs, and the decisions that make or break a production RAG deployment.
Building internal RAG systems has gone from cutting-edge research in 2023 to a standard mid-tier business AI project in 2026. The architecture is well-understood. The tooling is mature. And yet roughly half the internal RAG projects we see at Waymouth Tech stall before reaching production. This guide walks through what a production-grade internal RAG implementation actually looks like — architecture, tooling, costs, and the decisions that determine whether you ship.
RAG — Retrieval-Augmented Generation — is a pattern, not a product. The shape is simple:
Underneath the simplicity is a surprising number of decisions, and those decisions determine whether your RAG system answers reliably or hallucinates plausibly.
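To make the pattern concrete, here is a minimal sketch of the retrieve, assemble, generate loop. It is illustrative only: the bag-of-words `embed` function stands in for a real embedding model, `generate` is a hypothetical placeholder for the LLM call, and every function name here is an assumption rather than part of any particular framework.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the retrieved chunks and the question into one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


def generate(prompt: str) -> str:
    """Hypothetical placeholder for the LLM call; swap in a real client here."""
    return f"[model response to {len(prompt)} prompt characters]"


def answer(query: str, chunks: list[str]) -> str:
    """Retrieve, assemble, generate: the whole pattern in one call."""
    return generate(build_prompt(query, retrieve(query, chunks)))
```

Each function in this sketch corresponds to a decision point: how text is embedded, how candidates are ranked, how much context goes into the prompt, and which model generates the answer.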
The case for internal RAG is straightforward: generic AI assistants do not know your business. They have not read your SOPs, your customer history, your contracts, your engineering decisions. RAG closes that gap without trying to retrain the model.
Common use cases that justify the build:
Where you should not build RAG: when an off-the-shelf assistant grounded in your M365 or Google Workspace tenant already does the job. See our Microsoft Copilot implementation guide — Copilot is a RAG system over your tenant, ready-made.
A useful production-grade internal RAG system has six layers:
Sources include SharePoint, Confluence, Notion, file shares, databases, ticketing systems, CRMs and code. Each source needs a connector, an extraction pipeline, and a normalisation step.
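As a rough sketch of the normalisation step, the snippet below maps content from any connector into one common record. The `Document` fields and the function names are assumptions chosen for illustration; a real pipeline would likely carry more metadata (timestamps, owners, permissions).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical common record that every connector emits."""
    source: str  # illustrative label such as "confluence" or "sharepoint"
    uri: str
    title: str
    text: str


def normalise_text(raw: str) -> str:
    """Collapse whitespace runs and trim, so every source yields comparable text."""
    return re.sub(r"\s+", " ", raw).strip()


def to_document(source: str, uri: str, title: str, raw_text: str) -> Document:
    """Build a normalised Document from whatever the extraction step produced."""
    return Document(
        source=source,
        uri=uri,
        title=normalise_text(title),
        text=normalise_text(raw_text),
    )
```

Keeping one record shape across sources means the downstream chunking and embedding code does not need to know where a document came from.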
Key decisions:
Each chunk is converted to a vector using an embedding model. Decisions:
Where the embeddings live. See our deeper guide on vector databases explained for business. The shortlist:
For most Australian mid-market builds, pgvector or Qdrant on AU-region infrastructure is a sensible starting point.
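To show what a vector store does at its core, here is a minimal in-memory sketch: store vectors against chunk ids, then return the closest matches by cosine similarity. The class and method names are hypothetical, and the brute-force scan is an assumption made for clarity; pgvector, Qdrant and similar products add indexing, persistence and filtering on top of this idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical toy store: brute-force search over every stored vector."""

    def __init__(self):
        self._rows = []  # list of (chunk_id, vector) pairs

    def add(self, chunk_id, vector):
        self._rows.append((chunk_id, list(vector)))

    def search(self, query_vector, k=3):
        """Return up to k (chunk_id, score) pairs, best match first."""
        scored = [
            (cosine_similarity(query_vector, vector), chunk_id)
            for chunk_id, vector in self._rows
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(chunk_id, score) for score, chunk_id in scored[:k]]
```

A linear scan like this is fine for a prototype with a few thousand chunks; beyond that, the indexing a dedicated vector database provides starts to matter.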
The runtime layer that takes a query, retrieves the right chunks, and assembles the prompt. Where the real engineering work lives:
The LLM call itself. Decisions:
The layer most teams skip and most projects regret skipping:
Across dozens of RAG projects, the same handful of decisions separate success from stall.
The single biggest predictor of success is starting narrow. One knowledge domain. One user group. One use case. Teams that try to build a "company-wide knowledge assistant" in their first project almost always stall. Teams that ship a "tenant onboarding question answerer" in eight weeks and then iterate are the ones that reach production.
Internal RAG often touches sensitive data. Get permissions right from day one:
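One way to enforce this, sketched below, is to carry each chunk's allowed groups as metadata and filter retrieval results against the caller's groups before anything reaches the prompt. This approach, the `Chunk` fields and the group names are assumptions for illustration, not a prescription from any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record carrying the groups allowed to read it."""
    chunk_id: str
    text: str
    allowed_groups: frozenset


def filter_by_permission(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]
```

Filtering before prompt assembly, rather than asking the model to withhold content, means a chunk the user cannot read never enters the context window.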
A RAG system without a baseline eval set will degrade quietly. Build the eval set during scoping, not after launch.
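A baseline eval can start very small. The sketch below assumes an eval set of questions paired with the chunk ids a correct answer should draw on, and measures how often the retriever surfaces at least one of them. The dictionary keys, the `retrieve` callable and the choice of hit rate as the metric are all illustrative assumptions.

```python
def hit_rate(eval_set, retrieve, k=3):
    """Fraction of questions where an expected chunk id appears in the top-k results.

    eval_set: list of {"question": str, "expected_ids": list of str}
    retrieve: callable taking a question and returning a ranked list of chunk ids
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for case in eval_set:
        retrieved = retrieve(case["question"])[:k]
        if set(retrieved) & set(case["expected_ids"]):
            hits += 1
    return hits / len(eval_set)
```

Re-running the same set after every change to chunking, embeddings or prompts turns quiet degradation into a number that visibly moves.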
End-to-end RAG latency is the sum of embedding, retrieval, re-ranking, and generation. Aim for end-to-end latency under three seconds for interactive use. Streaming the response improves the perceived experience even when total latency is unchanged.
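Because the total is a sum of stages, it helps to time each stage separately. The helper below is a minimal sketch under that assumption: the class name and stage labels are hypothetical, and the three-second default simply mirrors the target mentioned above.

```python
import time
from contextlib import contextmanager


class LatencyBudget:
    """Hypothetical helper that accumulates per-stage wall-clock time."""

    def __init__(self, budget_seconds=3.0):
        self.budget_seconds = budget_seconds
        self.stages = {}  # stage name -> accumulated seconds

    @contextmanager
    def stage(self, name):
        """Time the enclosed block and add it to the named stage."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    def total(self):
        """End-to-end latency: the sum of all recorded stages."""
        return sum(self.stages.values())

    def within_budget(self):
        return self.total() < self.budget_seconds
```

Wrapping each step in `with budget.stage("retrieval"):` and logging `budget.stages` per request shows which stage to optimise first when the total drifts past the target.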
For a mid-sized internal RAG (10,000–100,000 documents, 100–500 active users) running on managed infrastructure in 2026:
For tactics on controlling the inference component, see our LLM API cost management guide.
There are increasingly credible vendor options for internal RAG, including Glean, Microsoft Copilot with custom connectors, Amazon Q Business, Vectara, and others. Buy when:
Build when:
Most Australian mid-market businesses we work with end up with a hybrid — a vendor RAG for general knowledge and a custom RAG for the one or two domains that matter most.
Pick one narrow domain. Build a prototype in 4–8 weeks. Measure quality against a real eval set. If it holds up, harden it for production. Resist the temptation to build the everything-assistant first.
FAQ
Retrieval-Augmented Generation (RAG) is a pattern where a language model retrieves relevant content from your own data sources before answering a question. It lets the model reason over your proprietary knowledge without retraining.
A useful internal RAG prototype takes 4–8 weeks. A production-grade system with proper access controls, evaluation, and observability typically takes 3–6 months depending on data complexity.
Fine-tuning is usually not needed. Frontier models with good retrieval and prompting outperform fine-tuned smaller models for most internal RAG use cases. Fine-tuning is a later-stage optimisation, not a starting point.
Inference costs typically dominate at AUD 500–10,000 per month for mid-sized internal use. Vector database, embeddings, and infrastructure usually add another 20–40% on top depending on data volume.
LangChain and LlamaIndex are both viable. LlamaIndex tends to be cleaner for retrieval-heavy systems. LangChain is broader but heavier. Increasingly teams roll their own thin orchestration layer rather than commit to either framework.
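As a worked example of the cost answer above, the arithmetic below applies the 20–40% overhead to the AUD 500–10,000 inference range. The function name is hypothetical, and pairing the lowest inference figure with the lowest overhead (and highest with highest) is an assumption made to bracket the range.

```python
def monthly_total_range(inference_low, inference_high, overhead_pct_low, overhead_pct_high):
    """Return (low, high) monthly totals in whole AUD: inference plus overhead."""
    low = inference_low * (100 + overhead_pct_low) // 100
    high = inference_high * (100 + overhead_pct_high) // 100
    return low, high


low, high = monthly_total_range(500, 10_000, 20, 40)
print(f"Estimated monthly total: AUD {low:,} to AUD {high:,}")
```

Under these assumptions the all-in monthly figure lands between AUD 600 and AUD 14,000, which is why inference is the first place to look when controlling spend.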
Waymouth Tech · Melbourne, Australia