RAG and AI Agents: The Architecture Behind Intelligent AI Systems

· 10 min read
Hieu Nguyen
Senior Software Engineer at OCB

LLMs can reason brilliantly — but they're frozen in time. They don't know what happened last week. They've never seen your company's private database. And when they don't know something, they might confidently make something up.

RAG and AI Agents are the two architectural patterns that solve this. One gives the LLM relevant context. The other gives it the ability to act. Together, they form the backbone of every serious AI product built in 2025.

📖 References: Original RAG Paper (2020) · ReAct Paper · Weaviate on Agentic RAG


Part 1: RAG — Giving the LLM Memory

The Problem RAG Solves

Imagine you're building a customer support bot for a Vietnamese bank. You have 10,000 pages of product documentation, internal policies, and transaction FAQs. The LLM has never seen any of it — it was trained on generic internet data from a year ago.

You could fine-tune the model, but that's expensive, slow to update, and doesn't handle the next update to the policy PDF. RAG is the practical answer: augment the LLM's input with the exact documents it needs, at query time.

Core idea: Instead of baking knowledge into the model's weights, retrieve it on demand and inject it into the prompt.

How RAG Works

RAG has two distinct pipelines — one runs offline, one runs at query time.

The Indexing Pipeline (Offline)

This runs once (and re-runs whenever your data changes):

Raw Documents (PDFs, DB rows, Confluence pages, etc.)


[1] Parse & Clean


[2] Chunk
(split into ~500-token segments with overlap)


[3] Embed
(each chunk → a vector via embedding model)


[4] Store in Vector Database
(Pinecone, Weaviate, Chroma, pgvector...)
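The four indexing steps can be sketched in a few lines of Python. The `embed` function here is a toy hash-based stand-in for a real embedding model, and the "vector database" is a plain in-memory list; both are illustrative assumptions, not a production setup.

```python
import hashlib
import math

def embed(text: str, dims: int = 8) -> list[float]:
    """Toy stand-in for a real embedding model: hashes words into a
    fixed-size unit vector. Illustrative only."""
    vec = [0.0] * dims
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Step 2: fixed-size chunking with overlap (whitespace tokens)."""
    tokens = text.split()
    step = size - overlap
    return [" ".join(tokens[i:i + size])
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# Step 4: the "vector database" is just (vector, chunk) pairs here.
index: list[tuple[list[float], str]] = []

def ingest(document: str) -> None:
    """Steps 2-4 for one already-parsed document."""
    for c in chunk(document):
        index.append((embed(c), c))
```

In production, `embed` would be a model call and `index` a real vector store; the pipeline shape stays the same.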

The Query Pipeline (Real-time)

This runs every time a user asks a question:

User Question: "What is OCB's overdraft fee?"


[1] Embed the question (same embedding model)


[2] Vector Search → retrieve Top-K most similar chunks


[3] Build augmented prompt:
"Answer using this context: [chunk1] [chunk2]...
Question: What is OCB's overdraft fee?"


[4] LLM generates grounded, cited answer
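The query pipeline's steps 1-3 can be sketched as follows. The vectors and chunks are hand-made for illustration; a real system would call the same embedding model used at indexing time and query a vector database instead.

```python
import math

# Toy index of (vector, chunk) pairs, as the indexing pipeline would produce.
# Vectors are hand-made for illustration.
index = [
    ([0.9, 0.1, 0.0], "Overdraft fees are 0.05%/day on the overdrawn amount."),
    ([0.1, 0.9, 0.0], "Savings accounts earn 4.7% annual interest."),
    ([0.0, 0.2, 0.9], "Branch hours are 8:00-17:00, Monday to Friday."),
]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec: list[float], k: int = 2) -> list[str]:
    """Step 2: Top-K most similar chunks by cosine similarity."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [chunk for _, chunk in ranked[:k]]

def build_prompt(question: str, query_vec: list[float]) -> str:
    """Step 3: inject the retrieved chunks into the prompt."""
    context = "\n".join(retrieve(query_vec))
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

# A query vector close to the overdraft chunk pulls that chunk first.
prompt = build_prompt("What is the overdraft fee?", [1.0, 0.0, 0.0])
```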

Chunking Strategy Matters

The quality of your RAG system is almost entirely determined by how you chunk. Getting this wrong is the #1 source of irrelevant retrieved context.

Strategy             How It Works                                       Best For
Fixed-size           Split every N tokens                               Simple text, uniform docs
Sentence-based       Split on sentence boundaries                       Conversational content
Recursive character  Split by \n\n, then \n, then spaces                Mixed documents
Semantic             Split when topic changes                           Long-form technical docs
Parent-child         Store small chunks, retrieve with parent context   Complex hierarchical docs

Tip: Use a chunk overlap of 10–20% (e.g., a 100-token overlap on 512-token chunks). This prevents answers from being cut in half at a chunk boundary.

Vector Search vs. Hybrid Search

Basic RAG uses vector (semantic) search — cosine similarity between the query vector and chunk vectors. This finds semantically related content even if the words don't match exactly.

Hybrid search combines vector search with BM25 (keyword/lexical search). It's better for:

  • Proper nouns (names, product codes, account numbers)
  • Exact-match queries ("OCB-OVERDRAFT-FEE-2024")
  • Short, specific queries that semantic search handles poorly
Score = α × semantic_score + (1-α) × bm25_score
(typically α = 0.5–0.7)
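As a one-function sketch, the blend looks like this; it assumes both scores have already been normalized to [0, 1], which in practice you must do yourself since vector and BM25 scores live on different scales:

```python
def hybrid_score(semantic: float, bm25: float, alpha: float = 0.6) -> float:
    """Blend normalized semantic and lexical scores.
    alpha weights the semantic side (typically 0.5-0.7)."""
    return alpha * semantic + (1 - alpha) * bm25

# An exact product-code match can score low semantically but high on BM25,
# and still win overall (values illustrative):
doc_a = hybrid_score(semantic=0.35, bm25=0.95)  # exact keyword hit
doc_b = hybrid_score(semantic=0.60, bm25=0.10)  # vague semantic match
```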

Reranking: The Often-Skipped Step

After retrieval, add a reranker model to reorder your Top-K results. Rerankers (like Cohere Rerank or a cross-encoder) are slower but much more accurate than the embedding similarity alone — they read both the query and each chunk together to score actual relevance.

Retrieved: [chunk3, chunk17, chunk8, chunk2, chunk41] (by vector similarity)
↓ Reranker
Reranked: [chunk17, chunk3, chunk41, chunk8, chunk2] (by true relevance)
↓ Take Top-3
Final context: [chunk17, chunk3, chunk41]
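The rerank-then-truncate step can be sketched as below. The `overlap_score` function is a deliberately crude stand-in for a real reranker model (such as Cohere Rerank or a cross-encoder), which would read the query and chunk together and return a learned relevance score:

```python
def rerank(query: str, chunks: list[str], score, top_n: int = 3) -> list[str]:
    """Reorder retrieved chunks by score(query, chunk), keep the Top-N."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_n]

def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: shared-word count. Illustration only; a real
    cross-encoder is far more accurate."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

candidates = [
    "Branch opening hours vary by city.",
    "The overdraft fee is 0.05% per day.",
    "Credit card annual fee is 300,000 VND.",
]
best = rerank("What is the overdraft fee?", candidates, overlap_score, top_n=1)
```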

Part 2: AI Agents — Giving the LLM Agency

What Is an AI Agent?

A chatbot answers a question. An agent completes a task.

The difference is autonomy and tool use. An agent can:

  • Break a complex task into sub-steps
  • Decide which tools to call (search, database, API, calculator...)
  • React to results and adjust its plan
  • Loop until the task is done

The foundational pattern behind almost every production agent is ReAct: Reason + Act.

The ReAct Loop

ReAct forces the LLM to make its reasoning visible before acting. Each iteration has three phases:

ReAct Loop

User: "What's the total revenue of our top 3 clients in Q1 2026?"

Iteration 1:

  ┌────────────────────┐
  │ THOUGHT: I need    │
  │ to find top 3      │  ← LLM reasons
  │ clients first.     │
  └─────────┬──────────┘
            │
  ┌─────────▼──────────┐
  │ ACTION: call       │
  │ query_db("SELECT   │  ← LLM calls a tool
  │ top 3 clients")    │
  └─────────┬──────────┘
            │
  ┌─────────▼──────────┐
  │ OBSERVATION:       │
  │ [Vietcombank,      │  ← tool returns result
  │ Techcombank, MB]   │
  └────────────────────┘

Iteration 2:

  THOUGHT: Now I need revenue for each
  ACTION: query_db("Q1 revenue WHERE client IN...")
  OBSERVATION: [120B, 95B, 88B VND]

FINAL ANSWER: "Total Q1 revenue from top 3 clients: 303 billion VND."
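The loop above fits in a dozen lines of Python. The `llm` callable, tool wiring, and scripted responses below are hypothetical stand-ins; real frameworks add prompt formatting and structured tool-call parsing on top of the same skeleton:

```python
def react_loop(task: str, llm, tools: dict, max_steps: int = 10):
    """Minimal ReAct skeleton: ask the LLM for the next move, run the tool,
    feed the observation back, repeat until it answers or we hit the cap."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm("\n".join(transcript))            # THOUGHT + next move
        if "answer" in step:                         # model decided it's done
            return step["answer"]
        tool = tools[step["action"]]                 # ACTION: pick a tool
        observation = tool(step["input"])            # run it
        transcript.append(f"Observation: {observation}")  # OBSERVATION
    raise RuntimeError("Agent hit max_steps without finishing")

# Scripted "LLM" replaying the two iterations from the diagram above.
script = iter([
    {"action": "query_db", "input": "top 3 clients"},
    {"action": "query_db", "input": "Q1 revenue for those clients"},
    {"answer": "Total Q1 revenue from top 3 clients: 303 billion VND."},
])
fake_llm = lambda prompt: next(script)
fake_db = {"top 3 clients": ["Vietcombank", "Techcombank", "MB"],
           "Q1 revenue for those clients": [120, 95, 88]}
result = react_loop("Total Q1 revenue of top 3 clients?", fake_llm,
                    {"query_db": fake_db.get})
```

Note the loop is bounded by `max_steps` from the start; this is the same safeguard discussed in the production section below.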

Every major agent framework (LangChain, LangGraph, AutoGen, Claude's tool_use) implements this loop — they just differ in how they manage state and orchestrate multiple agents.

Anatomy of an Agent

┌──────────────────────────────────────────┐
│                 AI Agent                 │
│                                          │
│  ┌──────────┐    ┌─────────────────────┐ │
│  │  Memory  │    │       Planner       │ │
│  │          │    │  (LLM with ReAct    │ │
│  │ - Short  │◄──►│  or Plan-and-Exec)  │ │
│  │   term   │    └──────────┬──────────┘ │
│  │   (ctx)  │               │            │
│  │ - Long   │               ▼            │
│  │   term   │    ┌─────────────────────┐ │
│  │   (RAG)  │    │    Tool Executor    │ │
│  └──────────┘    │                     │ │
│                  │  • search_web()     │ │
│                  │  • query_database() │ │
│                  │  • call_api()       │ │
│                  │  • run_code()       │ │
│                  │  • send_email()     │ │
│                  └─────────────────────┘ │
└──────────────────────────────────────────┘

Key components:

  • Planner: The LLM itself — decides what to do next based on the task and current state.
  • Memory: Short-term (the context window), long-term (a vector DB or conversation store).
  • Tools: The agent's hands — anything with a well-defined input/output schema (your MCP server from the previous post is a perfect tool definition!).
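A tool definition can be as simple as a function plus a declared input schema the LLM can read. The `Tool` shape and field names below are illustrative, not any particular framework's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    input_schema: dict          # JSON-Schema-style argument description
    run: Callable[..., object]

# Hypothetical banking tool with a hard-coded result for illustration.
get_balance = Tool(
    name="get_account_balance",
    description="Return the current balance for a customer account.",
    input_schema={
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
    run=lambda account_id: {"account_id": account_id, "balance_vnd": 12_500_000},
)

def validate_args(tool: Tool, args: dict) -> None:
    """Minimal schema check: reject calls missing required arguments
    before anything executes."""
    for key in tool.input_schema.get("required", []):
        if key not in args:
            raise ValueError(f"{tool.name}: missing required argument {key!r}")

args = {"account_id": "OCB-001"}
validate_args(get_balance, args)
result = get_balance.run(**args)
```

The description and schema are what the LLM actually "sees" when deciding which tool to call, which is why vague schemas cripple agents.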

Production Safeguards You Cannot Skip

Agents that loop are dangerous without guardrails:

❌ Bad: while true { agent.run() }

✅ Good:
maxSteps := 10
for step := 0; step < maxSteps; step++ {
    result := agent.step()
    if result.isDone {
        break
    }
}
// Force-stop if max steps reached: log and alert

Essential safeguards:

  1. Max iteration limit — prevent infinite loops burning API budget
  2. Tool input validation — reject malformed arguments before execution
  3. Output verification — parse and validate tool results before passing back to the LLM
  4. Human-in-the-loop — gate destructive actions (delete, send, deploy) on human approval
  5. Cost tracking — set per-agent token budgets (Paperclip does this automatically — see this post)
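Safeguard 5 can be sketched as a small budget tracker that every LLM call passes through. The 4-characters-per-token heuristic and the numbers are illustrative assumptions; in production you'd use the token counts your provider returns:

```python
class TokenBudget:
    """Per-agent token budget: charge every call, fail fast when exceeded."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, prompt: str, completion: str) -> None:
        # Rough heuristic: ~4 characters per token for English text.
        self.used += (len(prompt) + len(completion)) // 4
        if self.used > self.max_tokens:
            raise RuntimeError(
                f"Token budget exceeded: {self.used}/{self.max_tokens}")

budget = TokenBudget(max_tokens=50)
budget.charge("short prompt", "short answer")   # fine, well under budget
```

Wiring `charge` into the agent loop turns a runaway agent into a clean, loggable failure instead of a surprise invoice.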

Part 3: RAG + Agents = Agentic RAG

Why Combine Them?

Simple RAG has a fundamental limitation: it retrieves once, then generates. If the first retrieval misses, the answer is wrong. There's no recovery.

In an Agentic RAG system, retrieval becomes one tool among many. The agent decides when to search, which knowledge base to query, evaluates if the results are good enough, and retries with a rephrased query if not.

Traditional RAG:

  User Query
      │
      ▼
  Retrieve (once)
      │
      ▼
  Generate answer

Agentic RAG:

  User Query
      │
      ▼
  ┌──── Agent Planner ────┐
  │  Thought: need data   │
  └───────────┬───────────┘
              │
   ┌──────────▼───────────┐
   │ Is this a simple     │
   │ Q&A or complex task? │
   └───┬──────────────┬───┘
       │              │
  Simple Q&A     Complex Task
       │              │
       ▼              ▼
  Vector Search   Plan sub-tasks:
  (get_context)    1. Search DB
       │           2. Call API
   ┌───▼────────┐  3. Calculate
   │ Sufficient │  4. Synthesize
   │ context?   │
   └─┬───────┬──┘
  Yes│       │No
     ▼       └──► Rephrase → Retry
  Generate
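The retry branch is the part simple RAG lacks, and it can be sketched in a few lines. Here `search`, `is_sufficient`, and `rephrase` are hypothetical stand-ins for a vector search call, an LLM grader, and an LLM query rewrite:

```python
def agentic_retrieve(query: str, search, is_sufficient, rephrase,
                     max_attempts: int = 3):
    """Retrieve, judge sufficiency, rephrase and retry. Bounded attempts."""
    for _ in range(max_attempts):
        chunks = search(query)
        if is_sufficient(chunks):
            return chunks
        query = rephrase(query)   # try again with a rewritten query
    return []                     # give up; the caller decides the fallback

# Toy setup: the first phrasing misses the knowledge base, the rewrite hits.
kb = {"overdraft penalty": [], "overdraft fee": ["Fee is 0.05%/day."]}
hits = agentic_retrieve(
    "overdraft penalty",
    search=lambda q: kb.get(q, []),
    is_sufficient=lambda chunks: len(chunks) > 0,
    rephrase=lambda q: "overdraft fee",
)
```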

Real-World Example: Fintech Support Agent

At a bank, a "smart support agent" needs to handle mixed queries:

Query                                            What the agent does
"What's my account balance?"                     Calls get_account_balance API tool
"Why was I charged an overdraft fee?"            Searches policy KB and calls get_transaction_history API
"Is this fee correct per your policy?"           Retrieves policy from RAG and compares against the transaction data
"Dispute this charge and email me confirmation"  Multi-step: validate → submit dispute API → send email tool

The first two queries are single-tool. The last two require an agent loop — RAG alone can't handle them.
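The dispute query is also where a human-in-the-loop gate belongs, since it ends in a real-world action. A minimal sketch, with illustrative tool names:

```python
# Tools that change the world require explicit approval before execution.
DESTRUCTIVE_TOOLS = {"submit_dispute", "send_email", "delete_record"}

def execute(tool_name: str, args: dict, run, approved: bool = False):
    """Run a tool, but hold destructive ones until a human approves."""
    if tool_name in DESTRUCTIVE_TOOLS and not approved:
        return {"status": "pending_approval", "tool": tool_name, "args": args}
    return {"status": "done", "result": run(**args)}

# The agent proposes the dispute; nothing runs until a human confirms.
proposal = execute("submit_dispute", {"txn_id": "T-42"},
                   run=lambda txn_id: txn_id)
confirmed = execute("submit_dispute", {"txn_id": "T-42"},
                    run=lambda txn_id: txn_id, approved=True)
```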

Architecture of a Production Agentic RAG System

                  ┌─────────────────────────────────┐
                  │           Agentic RAG           │
                  │                                 │
User Query ──────►│  ┌──────────────────────────┐   │
                  │  │    Intent Classifier     │   │
                  │  │  (route to right path)   │   │
                  │  └────────────┬─────────────┘   │
                  │               │                 │
                  │    ┌──────────▼──────────┐      │
                  │    │   Agent Planner     │◄──┐  │
                  │    │   (ReAct LLM)       │   │  │
                  │    └──────────┬──────────┘   │  │
                  │               │              │  │
                  │   ┌───────────▼───────────┐  │  │
                  │   │     Tool Executor     │  │  │
                  │   │                       │  │  │
                  │   │  ┌─────────────────┐  │  │  │
                  │   │  │ RAG Tool        │──┼──┘  │
                  │   │  │ (vector search) │  │ Obs.│
                  │   │  ├─────────────────┤  │     │
                  │   │  │ API Tools       │  │     │
                  │   │  │ (DB, REST, etc) │  │     │
                  │   │  ├─────────────────┤  │     │
                  │   │  │ Compute Tools   │  │     │
                  │   │  │ (calculator,    │  │     │
                  │   │  │  code runner)   │  │     │
                  │   │  └─────────────────┘  │     │
                  │   └───────────────────────┘     │
                  └─────────────────────────────────┘

Key design decision: The RAG knowledge base is a tool, not the first step. The agent calls it only when it decides retrieval will help.


RAG vs Agent vs Agentic RAG

                    RAG                           Agent                           Agentic RAG
Control flow        Linear, deterministic         Dynamic, iterative              Dynamic with structured retrieval
Hallucination risk  Low (grounded)                Medium (depends on tools)       Low (retrieval + reasoning)
Latency             Low (~1-2s)                   High (multiple LLM calls)       High
Cost                Low                           High                            Highest
Best for            Q&A, document lookup          Multi-step task execution       Complex Q&A + task automation
Transparency        High (show retrieved chunks)  Medium (show reasoning trace)   Medium-high
When to use         Static knowledge base Q&A     Workflow automation             Enterprise AI assistants

Common Pitfalls

In RAG:

  • Chunk size too large — the relevant sentence gets buried in irrelevant context → Bad answers
  • No reranker — Top-K by similarity ≠ Top-K by relevance → Missed answers
  • Not cleaning data — HTML tags, headers, footers pollute chunks → Noisy retrieval
  • Different embedding models for index and query — you must always embed both with the same model

In Agents:

  • No max steps — Agents can loop indefinitely, burning tokens and money
  • Tool schemas too vague — The LLM can't call tools it doesn't understand
  • No error handling — Tool failures crash the agent instead of being recovered
  • Context too large — Stuffing the full conversation history hits token limits fast

In Agentic RAG:

  • RAG is not a fallback — Design retrieval as a deliberate tool, not a catch-all
  • No evaluation pipeline — Without offline testing, you won't know when retrieval degrades
  • Human-in-the-loop missing — Agents that take real actions (email, payment, delete) must be gated

Interview Tips

These topics appear constantly in senior/staff-level AI engineering interviews:

"How would you reduce hallucinations in an LLM system?" Talk about RAG (grounding), rerankers, confidence thresholds, and citation-based generation.

"Design a document Q&A system for 1 million employee records." Start with chunking + vector DB, add hybrid search, add reranker, discuss metadata filtering (e.g., by department).

"How do you prevent an agent from running forever?" Max iteration limits, step budgets, tool call whitelists, HITL gates on destructive actions.

"What's the difference between RAG and fine-tuning?" RAG = external memory (updatable, no training cost, auditable). Fine-tuning = baked-in knowledge (stale, expensive, higher recall for domain-specific tasks). Use both for best results.


Key Takeaways

  1. RAG grounds the LLM in facts by retrieving relevant chunks at query time. It's the first-line defense against hallucination for knowledge-heavy applications.

  2. Chunking and reranking are the two most impactful RAG optimizations. Most teams spend too long on prompt engineering and too little time here.

  3. Agents extend LLMs with action and iteration using the ReAct loop: Thought → Action → Observation → repeat. Every serious AI workflow eventually becomes an agent.

  4. Agentic RAG treats retrieval as a tool the agent can call one or many times. This enables multi-hop reasoning, self-correction, and mixing retrieval with live API calls.

  5. Always add production safeguards: max iteration limits, tool input validation, cost budgets, and human-in-the-loop for destructive actions.

  6. Know when NOT to use agents: If a simple two-step pipeline (retrieve → generate) solves the problem, use it. Agents add latency, cost, and non-determinism. Reach for them only when you need multi-step reasoning or real-world action.


Thanks for reading! RAG and agents are not competing ideas — they're complementary layers. Start with RAG to ground your LLM. Add an agent loop when your tasks become too complex for a single retrieval. That progression is the most reliable path to production-grade AI systems. 🚀

If you're building agents that need to call external services, check the MCP post — it's the cleanest way to expose tools to your agent.