RAG and AI Agents: The Architecture Behind Intelligent AI Systems

· 10 min read
Hieu Nguyen
Senior Software Engineer at OCB

LLMs can reason brilliantly — but they're frozen in time. They don't know what happened last week. They've never seen your company's private database. And when they don't know something, they might confidently make something up.

RAG and AI Agents are the two architectural patterns that solve this. One gives the LLM relevant context. The other gives it the ability to act. Together, they form the backbone of every serious AI product built in 2025.

📖 References: Original RAG Paper (2020) · ReAct Paper · Weaviate on Agentic RAG


Part 1: RAG — Giving the LLM Memory

The Problem RAG Solves

Imagine you're building a customer support bot for a Vietnamese bank. You have 10,000 pages of product documentation, internal policies, and transaction FAQs. The LLM has never seen any of it — it was trained on generic internet data from a year ago.

You could fine-tune the model, but that's expensive, slow to update, and doesn't handle the next update to the policy PDF. RAG is the practical answer: augment the LLM's input with the exact documents it needs, at query time.

Core idea: Instead of baking knowledge into the model's weights, retrieve it on demand and inject it into the prompt.

How RAG Works

RAG has two distinct pipelines — one runs offline, one runs at query time.

The Indexing Pipeline (Offline)

This runs once (and re-runs whenever your data changes):

Raw Documents (PDFs, DB rows, Confluence pages, etc.)


[1] Parse & Clean


[2] Chunk
(split into ~500-token segments with overlap)


[3] Embed
(each chunk → a vector via embedding model)


[4] Store in Vector Database
(Pinecone, Weaviate, Chroma, pgvector...)
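The four indexing steps can be sketched in a few lines of Python. The `embed` function here is a toy hash-based stand-in for a real embedding model, and the "vector database" is a plain in-memory list; both are illustrative assumptions, not a production setup.

```python
import hashlib
import math

def embed(text: str, dims: int = 8) -> list[float]:
    """Toy stand-in for a real embedding model: hashes words into a
    fixed-size unit vector. Illustrative only."""
    vec = [0.0] * dims
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Step 2: fixed-size chunking with overlap (whitespace tokens)."""
    tokens = text.split()
    step = size - overlap
    return [" ".join(tokens[i:i + size])
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# Step 4: the "vector database" is just (vector, chunk) pairs here.
index: list[tuple[list[float], str]] = []

def ingest(document: str) -> None:
    """Steps 2-4 for one already-parsed document."""
    for c in chunk(document):
        index.append((embed(c), c))
```

In production, `embed` would be a model call and `index` a real vector store; the pipeline shape stays the same.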

The Query Pipeline (Real-time)

This runs every time a user asks a question:

User Question: "What is OCB's overdraft fee?"


[1] Embed the question (same embedding model)


[2] Vector Search → retrieve Top-K most similar chunks


[3] Build augmented prompt:
"Answer using this context: [chunk1] [chunk2]...
Question: What is OCB's overdraft fee?"


[4] LLM generates grounded, cited answer
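The query pipeline's steps 1-3 can be sketched as follows. The vectors and chunks are hand-made for illustration; a real system would call the same embedding model used at indexing time and query a vector database instead.

```python
import math

# Toy index of (vector, chunk) pairs, as the indexing pipeline would produce.
# Vectors are hand-made for illustration.
index = [
    ([0.9, 0.1, 0.0], "Overdraft fees are 0.05%/day on the overdrawn amount."),
    ([0.1, 0.9, 0.0], "Savings accounts earn 4.7% annual interest."),
    ([0.0, 0.2, 0.9], "Branch hours are 8:00-17:00, Monday to Friday."),
]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec: list[float], k: int = 2) -> list[str]:
    """Step 2: Top-K most similar chunks by cosine similarity."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [chunk for _, chunk in ranked[:k]]

def build_prompt(question: str, query_vec: list[float]) -> str:
    """Step 3: inject the retrieved chunks into the prompt."""
    context = "\n".join(retrieve(query_vec))
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

# A query vector close to the overdraft chunk pulls that chunk first.
prompt = build_prompt("What is the overdraft fee?", [1.0, 0.0, 0.0])
```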

Chunking Strategy Matters

The quality of your RAG system is almost entirely determined by how you chunk. Getting this wrong is the #1 source of irrelevant retrieved context.

Strategy             How It Works                                       Best For
Fixed-size           Split every N tokens                               Simple text, uniform docs
Sentence-based       Split on sentence boundaries                       Conversational content
Recursive character  Split by \n\n, then \n, then spaces                Mixed documents
Semantic             Split when topic changes                           Long-form technical docs
Parent-child         Store small chunks, retrieve with parent context   Complex hierarchical docs

Tip: Use a chunk overlap of 10–20% (e.g., a 100-token overlap on 512-token chunks). This prevents answers from being cut in half at a chunk boundary.

Vector Search vs. Hybrid Search

Basic RAG uses vector (semantic) search — cosine similarity between the query vector and chunk vectors. This finds semantically related content even if the words don't match exactly.

Hybrid search combines vector search with BM25 (keyword/lexical search). It's better for:

  • Proper nouns (names, product codes, account numbers)
  • Exact-match queries ("OCB-OVERDRAFT-FEE-2024")
  • Short, specific queries that semantic search handles poorly
Score = α × semantic_score + (1-α) × bm25_score
(typically α = 0.5–0.7)
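As a one-function sketch, the blend looks like this; it assumes both scores have already been normalized to [0, 1], which in practice you must do yourself since vector and BM25 scores live on different scales:

```python
def hybrid_score(semantic: float, bm25: float, alpha: float = 0.6) -> float:
    """Blend normalized semantic and lexical scores.
    alpha weights the semantic side (typically 0.5-0.7)."""
    return alpha * semantic + (1 - alpha) * bm25

# An exact product-code match can score low semantically but high on BM25,
# and still win overall (values illustrative):
doc_a = hybrid_score(semantic=0.35, bm25=0.95)  # exact keyword hit
doc_b = hybrid_score(semantic=0.60, bm25=0.10)  # vague semantic match
```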

Reranking: The Often-Skipped Step

After retrieval, add a reranker model to reorder your Top-K results. Rerankers (like Cohere Rerank or a cross-encoder) are slower but much more accurate than the embedding similarity alone — they read both the query and each chunk together to score actual relevance.

Retrieved: [chunk3, chunk17, chunk8, chunk2, chunk41] (by vector similarity)
↓ Reranker
Reranked: [chunk17, chunk3, chunk41, chunk8, chunk2] (by true relevance)
↓ Take Top-3
Final context: [chunk17, chunk3, chunk41]
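The rerank-then-truncate step can be sketched as below. The `overlap_score` function is a deliberately crude stand-in for a real reranker model (such as Cohere Rerank or a cross-encoder), which would read the query and chunk together and return a learned relevance score:

```python
def rerank(query: str, chunks: list[str], score, top_n: int = 3) -> list[str]:
    """Reorder retrieved chunks by score(query, chunk), keep the Top-N."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_n]

def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: shared-word count. Illustration only; a real
    cross-encoder is far more accurate."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

candidates = [
    "Branch opening hours vary by city.",
    "The overdraft fee is 0.05% per day.",
    "Credit card annual fee is 300,000 VND.",
]
best = rerank("What is the overdraft fee?", candidates, overlap_score, top_n=1)
```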

Part 2: AI Agents — Giving the LLM Agency

What Is an AI Agent?

A chatbot answers a question. An agent completes a task.

The difference is autonomy and tool use. An agent can:

  • Break a complex task into sub-steps
  • Decide which tools to call (search, database, API, calculator...)
  • React to results and adjust its plan
  • Loop until the task is done

The foundational pattern behind almost every production agent is ReAct: Reason + Act.

The ReAct Loop

ReAct forces the LLM to make its reasoning visible before acting. Each iteration has three phases:

ReAct Loop

User: "What's the total revenue of our top 3 clients in Q1 2026?"

Iteration 1:

  ┌────────────────────┐
  │ THOUGHT: I need    │
  │ to find top 3      │  ← LLM reasons
  │ clients first.     │
  └─────────┬──────────┘
            │
  ┌─────────▼──────────┐
  │ ACTION: call       │
  │ query_db("SELECT   │  ← LLM calls a tool
  │ top 3 clients")    │
  └─────────┬──────────┘
            │
  ┌─────────▼──────────┐
  │ OBSERVATION:       │
  │ [Vietcombank,      │  ← tool returns result
  │ Techcombank, MB]   │
  └────────────────────┘

Iteration 2:

  THOUGHT: Now I need revenue for each
  ACTION: query_db("Q1 revenue WHERE client IN...")
  OBSERVATION: [120B, 95B, 88B VND]

FINAL ANSWER: "Total Q1 revenue from top 3 clients: 303 billion VND."
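The loop above fits in a dozen lines of Python. The `llm` callable, tool wiring, and scripted responses below are hypothetical stand-ins; real frameworks add prompt formatting and structured tool-call parsing on top of the same skeleton:

```python
def react_loop(task: str, llm, tools: dict, max_steps: int = 10):
    """Minimal ReAct skeleton: ask the LLM for the next move, run the tool,
    feed the observation back, repeat until it answers or we hit the cap."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm("\n".join(transcript))            # THOUGHT + next move
        if "answer" in step:                         # model decided it's done
            return step["answer"]
        tool = tools[step["action"]]                 # ACTION: pick a tool
        observation = tool(step["input"])            # run it
        transcript.append(f"Observation: {observation}")  # OBSERVATION
    raise RuntimeError("Agent hit max_steps without finishing")

# Scripted "LLM" replaying the two iterations from the diagram above.
script = iter([
    {"action": "query_db", "input": "top 3 clients"},
    {"action": "query_db", "input": "Q1 revenue for those clients"},
    {"answer": "Total Q1 revenue from top 3 clients: 303 billion VND."},
])
fake_llm = lambda prompt: next(script)
fake_db = {"top 3 clients": ["Vietcombank", "Techcombank", "MB"],
           "Q1 revenue for those clients": [120, 95, 88]}
result = react_loop("Total Q1 revenue of top 3 clients?", fake_llm,
                    {"query_db": fake_db.get})
```

Note the loop is bounded by `max_steps` from the start; this is the same safeguard discussed in the production section below.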

Every major agent framework (LangChain, LangGraph, AutoGen, Claude's tool_use) implements this loop — they just differ in how they manage state and orchestrate multiple agents.

Anatomy of an Agent

┌──────────────────────────────────────────┐
│                 AI Agent                 │
│                                          │
│  ┌──────────┐    ┌─────────────────────┐ │
│  │  Memory  │    │       Planner       │ │
│  │          │    │  (LLM with ReAct    │ │
│  │ - Short  │◄──►│  or Plan-and-Exec)  │ │
│  │   term   │    └──────────┬──────────┘ │
│  │   (ctx)  │               │            │
│  │ - Long   │               ▼            │
│  │   term   │    ┌─────────────────────┐ │
│  │   (RAG)  │    │    Tool Executor    │ │
│  └──────────┘    │                     │ │
│                  │  • search_web()     │ │
│                  │  • query_database() │ │
│                  │  • call_api()       │ │
│                  │  • run_code()       │ │
│                  │  • send_email()     │ │
│                  └─────────────────────┘ │
└──────────────────────────────────────────┘

Key components:

  • Planner: The LLM itself — decides what to do next based on the task and current state.
  • Memory: Short-term (the context window), long-term (a vector DB or conversation store).
  • Tools: The agent's hands — anything with a well-defined input/output schema (your MCP server from the previous post is a perfect tool definition!).
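A tool definition can be as simple as a function plus a declared input schema the LLM can read. The `Tool` shape and field names below are illustrative, not any particular framework's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    input_schema: dict          # JSON-Schema-style argument description
    run: Callable[..., object]

# Hypothetical banking tool with a hard-coded result for illustration.
get_balance = Tool(
    name="get_account_balance",
    description="Return the current balance for a customer account.",
    input_schema={
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
    run=lambda account_id: {"account_id": account_id, "balance_vnd": 12_500_000},
)

def validate_args(tool: Tool, args: dict) -> None:
    """Minimal schema check: reject calls missing required arguments
    before anything executes."""
    for key in tool.input_schema.get("required", []):
        if key not in args:
            raise ValueError(f"{tool.name}: missing required argument {key!r}")

args = {"account_id": "OCB-001"}
validate_args(get_balance, args)
result = get_balance.run(**args)
```

The description and schema are what the LLM actually "sees" when deciding which tool to call, which is why vague schemas cripple agents.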

Production Safeguards You Cannot Skip

Agents that loop are dangerous without guardrails:

❌ Bad: while true { agent.run() }

✅ Good:
maxSteps := 10
for step := 0; step < maxSteps; step++ {
    result := agent.step()
    if result.isDone {
        break
    }
}
// Force-stop if max steps reached: log and alert

Essential safeguards:

  1. Max iteration limit — prevent infinite loops burning API budget
  2. Tool input validation — reject malformed arguments before execution
  3. Output verification — parse and validate tool results before passing back to the LLM
  4. Human-in-the-loop — gate destructive actions (delete, send, deploy) on human approval
  5. Cost tracking — set per-agent token budgets (Paperclip does this automatically — see this post)
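Safeguard 5 can be sketched as a small budget tracker that every LLM call passes through. The 4-characters-per-token heuristic and the numbers are illustrative assumptions; in production you'd use the token counts your provider returns:

```python
class TokenBudget:
    """Per-agent token budget: charge every call, fail fast when exceeded."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, prompt: str, completion: str) -> None:
        # Rough heuristic: ~4 characters per token for English text.
        self.used += (len(prompt) + len(completion)) // 4
        if self.used > self.max_tokens:
            raise RuntimeError(
                f"Token budget exceeded: {self.used}/{self.max_tokens}")

budget = TokenBudget(max_tokens=50)
budget.charge("short prompt", "short answer")   # fine, well under budget
```

Wiring `charge` into the agent loop turns a runaway agent into a clean, loggable failure instead of a surprise invoice.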

Part 3: RAG + Agents = Agentic RAG

Why Combine Them?

Simple RAG has a fundamental limitation: it retrieves once, then generates. If the first retrieval misses, the answer is wrong. There's no recovery.

In an Agentic RAG system, retrieval becomes one tool among many. The agent decides when to search, which knowledge base to query, evaluates if the results are good enough, and retries with a rephrased query if not.

Traditional RAG:

  User Query
      │
      ▼
  Retrieve (once)
      │
      ▼
  Generate answer

Agentic RAG:

  User Query
      │
      ▼
  ┌──── Agent Planner ────┐
  │  Thought: need data   │
  └───────────┬───────────┘
              │
   ┌──────────▼───────────┐
   │ Is this a simple     │
   │ Q&A or complex task? │
   └───┬──────────────┬───┘
       │              │
  Simple Q&A     Complex Task
       │              │
       ▼              ▼
  Vector Search   Plan sub-tasks:
  (get_context)    1. Search DB
       │           2. Call API
   ┌───▼────────┐  3. Calculate
   │ Sufficient │  4. Synthesize
   │ context?   │
   └─┬───────┬──┘
  Yes│       │No
     ▼       └──► Rephrase → Retry
  Generate
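The retry branch is the part simple RAG lacks, and it can be sketched in a few lines. Here `search`, `is_sufficient`, and `rephrase` are hypothetical stand-ins for a vector search call, an LLM grader, and an LLM query rewrite:

```python
def agentic_retrieve(query: str, search, is_sufficient, rephrase,
                     max_attempts: int = 3):
    """Retrieve, judge sufficiency, rephrase and retry. Bounded attempts."""
    for _ in range(max_attempts):
        chunks = search(query)
        if is_sufficient(chunks):
            return chunks
        query = rephrase(query)   # try again with a rewritten query
    return []                     # give up; the caller decides the fallback

# Toy setup: the first phrasing misses the knowledge base, the rewrite hits.
kb = {"overdraft penalty": [], "overdraft fee": ["Fee is 0.05%/day."]}
hits = agentic_retrieve(
    "overdraft penalty",
    search=lambda q: kb.get(q, []),
    is_sufficient=lambda chunks: len(chunks) > 0,
    rephrase=lambda q: "overdraft fee",
)
```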

Real-World Example: Fintech Support Agent

At a bank, a "smart support agent" needs to handle mixed queries:

Query                                            What the agent does
"What's my account balance?"                     Calls get_account_balance API tool
"Why was I charged an overdraft fee?"            Searches policy KB and calls get_transaction_history API
"Is this fee correct per your policy?"           Retrieves policy from RAG and compares against the transaction data
"Dispute this charge and email me confirmation"  Multi-step: validate → submit dispute API → send email tool

The first two queries are single-tool. The last two require an agent loop — RAG alone can't handle them.
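The dispute query is also where a human-in-the-loop gate belongs, since it ends in a real-world action. A minimal sketch, with illustrative tool names:

```python
# Tools that change the world require explicit approval before execution.
DESTRUCTIVE_TOOLS = {"submit_dispute", "send_email", "delete_record"}

def execute(tool_name: str, args: dict, run, approved: bool = False):
    """Run a tool, but hold destructive ones until a human approves."""
    if tool_name in DESTRUCTIVE_TOOLS and not approved:
        return {"status": "pending_approval", "tool": tool_name, "args": args}
    return {"status": "done", "result": run(**args)}

# The agent proposes the dispute; nothing runs until a human confirms.
proposal = execute("submit_dispute", {"txn_id": "T-42"},
                   run=lambda txn_id: txn_id)
confirmed = execute("submit_dispute", {"txn_id": "T-42"},
                    run=lambda txn_id: txn_id, approved=True)
```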

Architecture of a Production Agentic RAG System

                  ┌─────────────────────────────────┐
                  │           Agentic RAG           │
                  │                                 │
User Query ──────►│  ┌──────────────────────────┐   │
                  │  │    Intent Classifier     │   │
                  │  │  (route to right path)   │   │
                  │  └────────────┬─────────────┘   │
                  │               │                 │
                  │    ┌──────────▼──────────┐      │
                  │    │   Agent Planner     │◄──┐  │
                  │    │   (ReAct LLM)       │   │  │
                  │    └──────────┬──────────┘   │  │
                  │               │              │  │
                  │   ┌───────────▼───────────┐  │  │
                  │   │     Tool Executor     │  │  │
                  │   │                       │  │  │
                  │   │  ┌─────────────────┐  │  │  │
                  │   │  │ RAG Tool        │──┼──┘  │
                  │   │  │ (vector search) │  │ Obs.│
                  │   │  ├─────────────────┤  │     │
                  │   │  │ API Tools       │  │     │
                  │   │  │ (DB, REST, etc) │  │     │
                  │   │  ├─────────────────┤  │     │
                  │   │  │ Compute Tools   │  │     │
                  │   │  │ (calculator,    │  │     │
                  │   │  │  code runner)   │  │     │
                  │   │  └─────────────────┘  │     │
                  │   └───────────────────────┘     │
                  └─────────────────────────────────┘

Key design decision: The RAG knowledge base is a tool, not the first step. The agent calls it only when it decides retrieval will help.


RAG vs Agent vs Agentic RAG

                    RAG                           Agent                           Agentic RAG
Control flow        Linear, deterministic         Dynamic, iterative              Dynamic with structured retrieval
Hallucination risk  Low (grounded)                Medium (depends on tools)       Low (retrieval + reasoning)
Latency             Low (~1-2s)                   High (multiple LLM calls)       High
Cost                Low                           High                            Highest
Best for            Q&A, document lookup          Multi-step task execution       Complex Q&A + task automation
Transparency        High (show retrieved chunks)  Medium (show reasoning trace)   Medium-high
When to use         Static knowledge base Q&A     Workflow automation             Enterprise AI assistants

Common Pitfalls

In RAG:

  • Chunk size too large — the relevant sentence gets buried in irrelevant context → Bad answers
  • No reranker — Top-K by similarity ≠ Top-K by relevance → Missed answers
  • Not cleaning data — HTML tags, headers, footers pollute chunks → Noisy retrieval
  • Different embedding models for index and query — you must always embed both with the same model

In Agents:

  • No max steps — Agents can loop indefinitely, burning tokens and money
  • Tool schemas too vague — The LLM can't call tools it doesn't understand
  • No error handling — Tool failures crash the agent instead of being recovered
  • Context too large — Stuffing the full conversation history hits token limits fast

In Agentic RAG:

  • RAG is not a fallback — Design retrieval as a deliberate tool, not a catch-all
  • No evaluation pipeline — Without offline testing, you won't know when retrieval degrades
  • Human-in-the-loop missing — Agents that take real actions (email, payment, delete) must be gated

Interview Tips

These topics appear constantly in senior/staff-level AI engineering interviews:

"How would you reduce hallucinations in an LLM system?" Talk about RAG (grounding), rerankers, confidence thresholds, and citation-based generation.

"Design a document Q&A system for 1 million employee records." Start with chunking + vector DB, add hybrid search, add reranker, discuss metadata filtering (e.g., by department).

"How do you prevent an agent from running forever?" Max iteration limits, step budgets, tool call whitelists, HITL gates on destructive actions.

"What's the difference between RAG and fine-tuning?" RAG = external memory (updatable, no training cost, auditable). Fine-tuning = baked-in knowledge (stale, expensive, higher recall for domain-specific tasks). Use both for best results.


Key Takeaways

  1. RAG grounds the LLM in facts by retrieving relevant chunks at query time. It's the first-line defense against hallucination for knowledge-heavy applications.

  2. Chunking and reranking are the two most impactful RAG optimizations. Most teams spend too long on prompt engineering and too little time here.

  3. Agents extend LLMs with action and iteration using the ReAct loop: Thought → Action → Observation → repeat. Every serious AI workflow eventually becomes an agent.

  4. Agentic RAG treats retrieval as a tool the agent can call one or many times. This enables multi-hop reasoning, self-correction, and mixing retrieval with live API calls.

  5. Always add production safeguards: max iteration limits, tool input validation, cost budgets, and human-in-the-loop for destructive actions.

  6. Know when NOT to use agents: If a simple two-step pipeline (retrieve → generate) solves the problem, use it. Agents add latency, cost, and non-determinism. Reach for them only when you need multi-step reasoning or real-world action.


Thanks for reading! RAG and agents are not competing ideas — they're complementary layers. Start with RAG to ground your LLM. Add an agent loop when your tasks become too complex for a single retrieval. That progression is the most reliable path to production-grade AI systems. 🚀

If you're building agents that need to call external services, check the MCP post — it's the cleanest way to expose tools to your agent.