Prompt Engineering vs Context Engineering vs Harness Engineering: The AI Stack Every Developer Must Know
Most developers learn to write better prompts. The engineers building production AI systems are doing something far more powerful — they're engineering the entire environment around the model.
In 2023, everyone was talking about "prompt engineering." Write the right magic words and the AI would do anything. By 2025, teams shipping real AI products discovered the uncomfortable truth: a perfect prompt in a broken system is still a broken system.
This is a hard lesson I've watched unfold across the industry. Teams would spend weeks crafting the perfect system prompt, only to find their agent hallucinating in production because it was getting bad data from retrieval, losing track of state across turns, or running on infrastructure that had no observability. The prompt was fine — everything around it wasn't.
This post breaks down the three-layer engineering stack that underpins every serious LLM application: Prompt Engineering, Context Engineering, and Harness Engineering. Understand all three and you'll understand why some AI products feel magical and others feel broken.
📖 References: Anthropic's guide on prompting • Weaviate on Context Engineering • Firecrawl's deep dive on Context Engineering
The Three Layers
Think of it as layers of abstraction, from narrow to broad:
┌──────────────────────────────────────────────┐
│ HARNESS ENGINEERING                          │ ← The system around the model
│ Execution | Guardrails | Observability       │
│ ┌───────────────────────────────────────┐    │
│ │ CONTEXT ENGINEERING                   │    │ ← What the model knows
│ │ Memory | RAG | Tools | History        │    │
│ │ ┌─────────────────────────────────┐   │    │
│ │ │ PROMPT ENGINEERING              │   │    │ ← What you tell the model
│ │ │ Instructions | Examples | CoT   │   │    │
│ │ └─────────────────────────────────┘   │    │
│ └───────────────────────────────────────┘    │
└──────────────────────────────────────────────┘
Each layer builds on the one inside it. You can't skip layers. A brilliant harness won't save a broken prompt, and a perfect prompt won't survive a poisoned context.
Layer 1: Prompt Engineering
What It Is
Prompt engineering is the craft of writing effective instructions for an LLM. It's the innermost layer — what you directly write to tell the model what to do.
It answers: "How do I ask this question?"
Core Techniques
Zero-Shot Prompting — Just ask. Works well for simple, well-defined tasks:
You are a senior Java developer. Review the following code for bugs.
[code here]
Few-Shot Prompting — Provide examples of the expected input/output format:
Classify the sentiment of each customer review as POSITIVE, NEUTRAL, or NEGATIVE.
Review: "Delivery was fast and the product works great!"
Sentiment: POSITIVE
Review: "It arrived on time but the packaging was damaged."
Sentiment: NEUTRAL
Review: "Completely broken, total waste of money."
Sentiment: NEGATIVE
Review: "The app crashes every time I open it."
Sentiment:
Why this works: The model infers the pattern from examples rather than relying on your ability to describe every edge case in instructions. In production, use dynamic few-shot — retrieve the most relevant examples from a database based on the current input rather than hardcoding them.
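Dynamic few-shot selection can be sketched in a few lines. This is a minimal illustration: similarity here is crude token overlap, where a production system would rank stored examples by embedding similarity against the incoming input. The example bank, function names, and the k=2 choice are all illustrative.

```python
# Illustrative sketch of dynamic few-shot: rank a bank of labeled examples
# by similarity to the incoming input and splice the top K into the prompt.
EXAMPLE_BANK = [
    {"review": "Delivery was fast and the product works great!", "label": "POSITIVE"},
    {"review": "It arrived on time but the packaging was damaged.", "label": "NEUTRAL"},
    {"review": "Completely broken, total waste of money.", "label": "NEGATIVE"},
    {"review": "The UI freezes whenever I upload a photo.", "label": "NEGATIVE"},
]

def similarity(a, b):
    # Token-overlap (Jaccard) stand-in for real embedding similarity
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def build_few_shot_prompt(user_input, k=2):
    ranked = sorted(EXAMPLE_BANK,
                    key=lambda ex: similarity(user_input, ex["review"]),
                    reverse=True)
    shots = "\n\n".join(f'Review: "{ex["review"]}"\nSentiment: {ex["label"]}'
                        for ex in ranked[:k])
    return (f"Classify the sentiment as POSITIVE, NEUTRAL, or NEGATIVE.\n\n"
            f"{shots}\n\nReview: \"{user_input}\"\nSentiment:")

prompt = build_few_shot_prompt("The app crashes every time I open it.")
```

With a crash-related input, the bank's UI-freeze example outranks the delivery-related ones, so the prompt carries the most relevant demonstrations instead of a fixed set.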
Chain-of-Thought (CoT) — Force the model to reason step-by-step before answering:
You are a software architect. A user wants to migrate a monolith to microservices in 6 months.
Think through this step by step:
1. What are the key risks?
2. What should be the first services to extract?
3. What does a realistic milestone plan look like?
After thinking through each step, give your final recommendation.
This dramatically improves performance on complex reasoning tasks. The model "shows its work," making errors easier to spot and debug.
Structured Output — Essential for production. Always tell the model the exact format you need:
Analyze this Java exception stack trace and respond ONLY with valid JSON in this exact schema:
{
  "root_cause": "string",
  "likely_file": "string",
  "suggested_fix": "string",
  "severity": "LOW|MEDIUM|HIGH|CRITICAL"
}
Stack trace:
[stack trace here]
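Asking for a schema is only half the job: the consuming code should verify the reply before trusting it. A minimal validator for the schema above might look like this, where the `reply` string is a stand-in for an actual model response:

```python
# Validate a model's structured reply before it reaches downstream code.
import json

REQUIRED_FIELDS = {"root_cause", "likely_file", "suggested_fix", "severity"}
ALLOWED_SEVERITIES = {"LOW", "MEDIUM", "HIGH", "CRITICAL"}

def parse_analysis(raw):
    """Parse and validate the model's JSON reply; raise ValueError on any violation."""
    data = json.loads(raw)  # raises on malformed JSON
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["severity"] not in ALLOWED_SEVERITIES:
        raise ValueError(f"invalid severity: {data['severity']}")
    return data

reply = ('{"root_cause": "NullPointerException on uninitialized DAO", '
         '"likely_file": "UserService.java", '
         '"suggested_fix": "Inject the DAO via constructor", '
         '"severity": "HIGH"}')
result = parse_analysis(reply)
```

A failed parse here becomes a signal for the harness layer (retry with corrective feedback) rather than a silent downstream crash.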
Prompt Engineering Best Practices
| Practice | Why It Matters |
|---|---|
| Be specific about role and task | "You are a senior Java developer" performs better than "you are a helpful assistant" |
| Specify the output format explicitly | Prevents free-form responses that break downstream parsing |
| Put the most important instructions first and last | Models pay more attention to the beginning and end of the prompt |
| Version-control your prompts | Treat prompts like code — a 2-word change can degrade performance by 20% |
| Test against a regression suite | Build "eval" datasets of known inputs/outputs and run them on every prompt change |
Where Prompt Engineering Falls Short
Prompt engineering is powerful but brittle and stateless. It breaks when:
- The task is multi-step — A single prompt can't maintain state across 20 tool calls
- The model needs external data — A prompt can't pull from your live database
- You need reliability at scale — A prompt that works 90% of the time means 10% failure in production
This is where Context Engineering comes in.
Layer 2: Context Engineering
What It Is
Context engineering is the discipline of designing, curating, and managing everything that goes into the model's context window at inference time.
It answers: "What does the model need to know to succeed?"
The context window is the model's working memory — it's all the model can "see" at any given moment. Context engineering is the science of filling that window with high-signal, precisely ordered information, and nothing else.
Context Window = System Prompt
               + Retrieved Documents (RAG)
               + Conversation History
               + Tool Call Results
               + User Message
               + (ideally) Nothing Else
Key insight: A perfect prompt still fails if it's buried under 50,000 tokens of irrelevant retrieved documents. Context engineering ensures the prompt is working in a clean, high-quality information environment.
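One way to make that concrete is a context assembler with a hard token budget that drops the lowest-priority sections first. This is a sketch under loud assumptions: tokens are approximated by whitespace splitting (a real system would use the model's tokenizer), and the section names, priorities, and budget are illustrative.

```python
# Sketch of budgeted context assembly: keep sections by priority,
# then re-emit survivors in their original prompt order.
def count_tokens(text):
    return len(text.split())  # crude stand-in for a real tokenizer

def assemble_context(sections, budget):
    """sections: list of (name, priority, text); lower priority number = keep first."""
    kept, used = [], 0
    for name, _, text in sorted(sections, key=lambda s: s[1]):
        tokens = count_tokens(text)
        if used + tokens <= budget:
            kept.append((name, text))
            used += tokens
    # Re-emit in the original order so the prompt layout stays stable
    order = {name: i for i, (name, _, _) in enumerate(sections)}
    kept.sort(key=lambda kt: order[kt[0]])
    return "\n\n".join(text for _, text in kept)

sections = [
    ("system_prompt", 0, "You are a payment support specialist."),
    ("retrieved_docs", 2, "Policy: refunds within 24 hours ..."),
    ("history", 3, "User previously asked about a failed top-up."),
    ("user_message", 1, "My payment failed but my balance was deducted."),
]
context = assemble_context(sections, budget=40)
```

Under a tight budget the retrieved documents and history are the first to go, while the system prompt and current user message always survive, which is exactly the priority order you want under pressure.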
The "Context Rot" Problem
Research from multiple labs has documented a consistent pattern: LLM performance degrades as context size grows, even before hitting token limits. Two well-known failure modes:
- Lost in the Middle — Models are better at using information placed at the beginning or end of the context window. Information buried in the middle is frequently ignored.
- Context Rot — As noise accumulates (irrelevant retrieved chunks, verbose tool outputs, redundant history), the model's attention is diluted across irrelevant tokens, degrading output quality.
The implication: more context ≠ better output. The goal is the smallest possible set of high-signal tokens.
The Four Context Failure Modes
| Failure Mode | What Happens | Example |
|---|---|---|
| Context Poisoning | Wrong info included → wrong reasoning | RAG retrieves a stale document with incorrect pricing |
| Context Distraction | Critical info buried in noise → missed | 40 irrelevant docs, 1 relevant one → model ignores the relevant one |
| Context Confusion | Contradictory info → inconsistent behavior | Two documents disagree on the refund policy |
| Context Overflow | Too many tokens → truncation or degradation | 200k context fills up → critical earlier turns dropped |
The Memory Architecture
For agents that maintain state across turns or sessions, you need a three-tier memory system:
┌─────────────────────────────────────────────────────┐
│                 MEMORY ARCHITECTURE                 │
│                                                     │
│  Working Memory     Long-Term Memory    External    │
│  (Context Window)   (Vector Store)      (Tools)     │
│  ┌──────────────┐   ┌──────────────┐   ┌─────────┐  │
│  │Recent turns  │   │Conversation  │   │Database │  │
│  │Tool outputs  │ ←→│summaries     │ ←→│APIs     │  │
│  │Retrieved docs│   │User prefs    │   │Code exec│  │
│  │Current task  │   │Past decisions│   │Search   │  │
│  └──────────────┘   └──────────────┘   └─────────┘  │
└─────────────────────────────────────────────────────┘
Short-term (Working) Memory is the context window itself — fast, expensive, limited. You actively manage what goes in.
Long-term Memory is an external vector database. The agent writes important information here (key decisions, user preferences, summarized history) and retrieves it as needed.
External Tools give the agent access to real-time data that can't be preloaded — live database queries, API calls, code execution.
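The first two tiers can be sketched together: a bounded working memory that evicts old turns, plus a long-term store the agent writes durable facts into and queries later. This is illustrative only: the long-term store here is an in-process list searched by keyword overlap, standing in for a real vector database, and all names are hypothetical.

```python
# Two-tier agent memory sketch: bounded working memory + searchable long-term store.
from collections import deque

class AgentMemory:
    def __init__(self, working_size=4):
        self.working = deque(maxlen=working_size)  # recent turns only
        self.long_term = []                        # stand-in for a vector store

    def add_turn(self, role, text):
        self.working.append({"role": role, "text": text})

    def remember(self, kind, text):
        self.long_term.append({"kind": kind, "text": text})

    def recall(self, query, k=2):
        # Keyword-overlap ranking as a stand-in for embedding search
        terms = set(query.lower().split())
        scored = sorted(self.long_term,
                        key=lambda m: len(terms & set(m["text"].lower().split())),
                        reverse=True)
        return [m["text"] for m in scored[:k]]

mem = AgentMemory(working_size=2)
mem.add_turn("user", "My payment failed")
mem.add_turn("assistant", "Checking the transaction")
mem.add_turn("user", "Any update?")          # oldest turn is evicted automatically
mem.remember("preference", "user prefers refunds to original card")
hits = mem.recall("refund to card")
```

The key design point: working memory forgets by default, so anything worth keeping must be explicitly written to the long-term store before it scrolls off.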
RAG Done Right
Basic RAG: embed query → retrieve top-K chunks → stuff into prompt.
Production RAG requires tuning the entire pipeline:
Query → [Pre-processing] → Retrieval → [Re-ranking] → [Compression] → Context
               ↓                                            ↓
       Query expansion                          Remove redundant tokens
       Hypothetical answers                     Summarize long chunks
       Keyword extraction                       Merge overlapping results
Chunking strategy is critical and underestimated:
- Too small → each chunk lacks semantic context, retrieval is noisy
- Too large → irrelevant content is included, context is diluted
- Best practice: hierarchical chunking — store both sentence-level and paragraph-level chunks, retrieve at the right granularity for the task
Hybrid retrieval (vector similarity + BM25 keyword search) consistently outperforms either approach alone for recall on diverse query types.
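A common way to combine the two rankings is Reciprocal Rank Fusion (RRF): each document's score is the sum of 1/(k + rank) across the rankings it appears in. The two input lists below are illustrative stand-ins for real BM25 and embedding-search results.

```python
# Reciprocal Rank Fusion: merge keyword and vector rankings by summed
# reciprocal rank, so documents that rank well in both float to the top.
def rrf_merge(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_refund_policy", "doc_faq", "doc_pricing"]       # BM25 order
vector_hits = ["doc_refund_policy", "doc_pricing", "doc_onboarding"]  # embedding order
merged = rrf_merge([keyword_hits, vector_hits])
```

Because RRF only needs ranks, not raw scores, it sidesteps the problem of calibrating BM25 scores against cosine similarities; the conventional k=60 dampens the influence of any single list.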
Context Engineering Best Practices
- Build the context assembly pipeline separately from your prompt. Treat it as a first-class engineering concern with its own tests.
- Place critical information at the boundaries. System prompt at the top, most important retrieved documents first, key user instruction last.
- Summarize rather than truncate. When history grows long, maintain a rolling summary rather than cutting off old turns.
- Monitor token usage and retrieval quality. Track precision/recall of your retrieval, average tokens per turn, and hallucination rate as the context grows.
- Implement TTL on memories. Stale context (like an old user preference or superseded decision) actively harms model performance. Expire it.
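The TTL practice in the last bullet is a small amount of code. A sketch, with illustrative TTL values and field names: each memory carries a creation timestamp, and anything older than its kind's time-to-live is pruned before retrieval.

```python
# TTL-based memory expiry sketch: drop stale memories before retrieval.
import time

# Illustrative TTLs: preferences live 30 days, decisions 7, everything else 1.
TTL_SECONDS = {"preference": 30 * 86400, "decision": 7 * 86400}

def prune_expired(memories, now=None):
    now = now if now is not None else time.time()
    return [m for m in memories
            if now - m["created_at"] <= TTL_SECONDS.get(m["kind"], 86400)]

now = time.time()
memories = [
    {"kind": "preference", "text": "prefers email receipts", "created_at": now - 86400},
    {"kind": "decision", "text": "use gateway A", "created_at": now - 10 * 86400},  # past its TTL
]
fresh = prune_expired(memories, now=now)
```

The stale routing decision is dropped while the recent preference survives, so a superseded choice can no longer poison a future context window.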
Layer 3: Harness Engineering
What It Is
Harness engineering is the infrastructure layer — everything that surrounds the model to make it behave reliably in production.
If the LLM is the engine, the harness is the car: the chassis, brakes, steering, and safety systems that turn raw power into a controlled, predictable vehicle.
It answers: "How do we make the model reliably do the right thing, every time, at scale?"
┌─────────────────────── HARNESS ───────────────────────┐
│                                                       │
│  Feedforward Controls        Feedback Controls        │
│  (before model acts)         (after model acts)       │
│  ┌──────────────────┐        ┌──────────────────────┐ │
│  │ Guardrails       │        │ Output validators    │ │
│  │ Task routing     │        │ Self-correction loops│ │
│  │ Tool whitelists  │        │ Human-in-the-loop    │ │
│  │ Context policies │        │ Error recovery       │ │
│  └──────────────────┘        └──────────────────────┘ │
│                                                       │
│  Observability               State Management         │
│  ┌──────────────────┐        ┌──────────────────────┐ │
│  │ Trace every call │        │ Execution state      │ │
│  │ Token cost logs  │        │ Task queuing         │ │
│  │ Latency p99      │        │ Retry/resume logic   │ │
│  │ Eval dashboards  │        │ Checkpointing        │ │
│  └──────────────────┘        └──────────────────────┘ │
└───────────────────────────────────────────────────────┘
Why Raw LLMs Break in Production
A raw LLM has no guardrails, no memory, no error recovery, and no observability. In production, this means:
- Non-determinism — The same input with `temperature=0.7` gives different outputs. Mission-critical systems can't tolerate this.
- Hallucination — The model confidently generates plausible but incorrect answers. Without output validation, these reach users.
- State loss — A stateless API call means the model forgets context between turns (or between user sessions).
- Unbounded execution — An agent calling tools in a loop with no guardrails can rack up massive API bills or cause irreversible side effects (deleting files, sending emails).
Harness engineering is the discipline that solves all of these.
Core Components
1. Feedforward Controls (Before the Model Acts)
User Request
↓
[Input Validation] ← Reject malformed, malicious, or off-topic inputs
↓
[Task Router] ← Route to the right specialized agent or sub-chain
↓
[Tool Authorization] ← Check RBAC: can this agent call this tool?
↓
[Context Policy] ← Apply rate limits, token budget, data access scope
↓
LLM Call
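The pipeline above maps naturally onto a chain of guard functions, each of which either passes the request along or raises. A hedged sketch: the ACL table, the keyword router, and the size limit are all illustrative stand-ins for real policy config and a classifier-based router.

```python
# Feedforward guard chain sketch: each guard passes the request through or raises.
TOOL_ACL = {"payment-dispute": {"get_transaction", "get_balance"}}  # illustrative RBAC table
MAX_INPUT_CHARS = 2000

def validate_input(request):
    if not request["message"].strip() or len(request["message"]) > MAX_INPUT_CHARS:
        raise ValueError("rejected: empty or oversized input")
    return request

def route_task(request):
    # Keyword routing as a stand-in for a classifier-based router
    msg = request["message"].lower()
    request["agent"] = "payment-dispute" if "payment" in msg else "general"
    return request

def authorize_tools(request):
    request["allowed_tools"] = TOOL_ACL.get(request["agent"], set())
    return request

def run_feedforward(request):
    for guard in (validate_input, route_task, authorize_tools):
        request = guard(request)
    return request

prepared = run_feedforward({"message": "My payment failed but I was charged"})
```

Because each guard is an independent function, new policies (rate limits, data-access scoping) slot into the chain without touching the LLM call itself.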
2. Feedback Controls (After the Model Acts)
LLM Output
↓
[Output Validator] ← Schema check, hallucination scorer, toxicity filter
↓
[Self-Correction Loop] ← If validation fails, retry with corrective feedback
↓
[Human Checkpoint] ← Pause for human review if confidence is low
↓
[Side Effect Executor] ← Apply the action (write to DB, send API call, etc.)
↓
[Feedback Logger] ← Log result for eval and fine-tuning
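The validator plus self-correction steps can be sketched as a retry loop that feeds the validation error back into the prompt. The `fake_model` function below is a deliberate stand-in for a real LLM call, scripted so the first attempt fails validation and the corrected retry succeeds.

```python
# Feedback-loop sketch: validate output, retry with corrective feedback on failure.
import json

def fake_model(prompt):
    # Stand-in for an LLM: first attempt returns prose, corrected retry returns JSON.
    if "Correction:" in prompt:
        return '{"likely_cause": "network timeout", "recommended_action": "REFUND"}'
    return "The cause is probably a timeout."

def validate(raw):
    try:
        data = json.loads(raw)
    except ValueError:
        return False, "Output must be valid JSON."
    if "recommended_action" not in data:
        return False, "Missing field: recommended_action."
    return True, ""

def generate_with_feedback(prompt, max_retries=2):
    for _ in range(max_retries + 1):
        raw = fake_model(prompt)
        ok, error = validate(raw)
        if ok:
            return json.loads(raw)
        prompt += f"\nCorrection: {error} Respond again."
    raise RuntimeError("validation failed after retries")

result = generate_with_feedback("Diagnose the failed payment as JSON.")
```

The important property is the bounded retry count: the loop either converges to a valid output or fails loudly, rather than letting malformed output leak downstream.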
3. Observability
You can't manage what you can't measure. Every production AI system needs:
# Example: Structured trace logging for every LLM call
{
  "trace_id": "txn-20240401-abc123",
  "timestamp": "2024-04-01T15:32:00Z",
  "model": "claude-3-5-sonnet",
  "agent_id": "payment-support-v2",
  "input_tokens": 4821,
  "output_tokens": 312,
  "latency_ms": 2340,
  "cost_usd": 0.0156,
  "tools_called": ["get_transaction", "lookup_policy"],
  "retrieval_chunks": 8,
  "retrieval_precision": 0.85,
  "output_validation": "PASS",
  "hallucination_score": 0.12,
  "user_rating": null
}
Track these metrics in aggregate to catch:
- Latency regressions (new model version is slower?)
- Cost spikes (a prompt change increased average token count?)
- Quality drift (hallucination rate increasing over time?)
4. State Management for Long-Running Agents
Modern agents run tasks that take minutes or hours — searching codebases, iterating on tests, investigating incidents. They need:
- Checkpointing — Save state periodically so a failure doesn't restart from zero
- Task queuing — Manage concurrent agent tasks without resource contention
- Retry/resume logic — Handle tool failures gracefully (API timeouts, transient errors)
- Execution budget — Hard limits on turns, tokens, and wall-clock time to prevent runaway agents
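The execution-budget bullet is worth making concrete, since it is the simplest defense against runaway agents. A sketch, with illustrative limits and a pretend per-step token cost: the budget is charged before every step and raises the moment any cap is crossed.

```python
# Execution-budget sketch: hard caps on turns, tokens, and wall-clock time.
import time

class BudgetExceeded(Exception):
    pass

class ExecutionBudget:
    def __init__(self, max_turns, max_tokens, max_seconds):
        self.max_turns, self.max_tokens, self.max_seconds = max_turns, max_tokens, max_seconds
        self.turns = self.tokens = 0
        self.started = time.monotonic()

    def charge(self, tokens):
        self.turns += 1
        self.tokens += tokens
        if self.turns > self.max_turns:
            raise BudgetExceeded("turn limit hit")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token limit hit")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock limit hit")

budget = ExecutionBudget(max_turns=3, max_tokens=10_000, max_seconds=60)
halted = False
try:
    for step in range(10):            # a runaway loop the budget must stop
        budget.charge(tokens=500)     # pretend each step costs 500 tokens
except BudgetExceeded:
    halted = True
```

In a real agent, `BudgetExceeded` would trigger checkpointing and a graceful handoff rather than a silent crash, but the hard stop itself is the non-negotiable part.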
Self-Correction Pattern
One of the most powerful harness patterns is the Critic Loop — where a second model (or a second pass of the same model) evaluates and corrects the first output:
Input
↓
Generator Model → Draft Output
↓
Critic Model → "Missing error handling. Revise to include try-catch."
↓
Generator Model → Revised Output
↓
Critic Model → "APPROVED"
↓
Final Output
This pattern is particularly effective for code generation, where a critic checks correctness, security, and style before the output is used.
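The loop above fits in a dozen lines once the two roles are functions. This sketch uses scripted stand-ins for both models (the generator drafts a divide function, the critic demands error handling); only the loop structure is the point, and the round limit of 3 is illustrative.

```python
# Critic-loop sketch: generate, critique, and revise until approved or out of rounds.
def generator(task, feedback=None):
    # Stand-in for an LLM: revises its draft when given the critic's note.
    if feedback:
        return ("def divide(a, b):\n"
                "    try:\n"
                "        return a / b\n"
                "    except ZeroDivisionError:\n"
                "        return None")
    return "def divide(a, b):\n    return a / b"

def critic(code):
    # Stand-in for a reviewing model checking one specific property.
    if "except" not in code:
        return "Missing error handling. Revise to handle division by zero."
    return "APPROVED"

def critic_loop(task, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        draft = generator(task, feedback)
        verdict = critic(draft)
        if verdict == "APPROVED":
            return draft
        feedback = verdict
    raise RuntimeError("critic never approved")

final = critic_loop("write a safe divide function")
```

As with the self-correction loop, the bounded round count matters: an unbounded generator-critic exchange is itself a runaway-agent risk the harness must cap.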
How They Work Together: A Real-World Example
Let's trace through a payment dispute agent for an e-wallet — a real-world scenario I've encountered in fintech.
Scenario: A user messages: "My payment to Grab failed but my balance was deducted. Help!"
1. HARNESS (Feedforward)
├─ Input validation: message is on-topic, not malicious
├─ Task router: routes to "payment-dispute" specialized agent
└─ Tool authorization: agent can call get_transaction, get_balance (READ only, not REFUND yet)
2. CONTEXT ENGINEERING
├─ Retrieval: pull last 5 transactions, account status, dispute policy doc
├─ History: load last 2 turns of this conversation from long-term memory
└─ Assembly: system prompt + policy doc + transaction data + history + user message
= ~3,200 tokens (precisely curated, not 50k of "everything")
3. PROMPT ENGINEERING
├─ Role: "You are a payment support specialist at an e-wallet"
├─ CoT: "First identify the transaction, then check if it was actually deducted..."
└─ Output format: JSON with `investigation_steps`, `likely_cause`, `recommended_action`
4. LLM CALL → Response: {"likely_cause": "network timeout during settlement", "recommended_action": "REFUND"}
5. HARNESS (Feedback)
├─ Output validator: JSON schema check ✅, refund requires human approval flag ⚠️
├─ Human checkpoint: ticket created for agent supervisor
├─ Logger: full trace saved with token count, latency, retrieved docs
└─ User response: "I've investigated your payment — it appears to be a network timeout. A refund request has been submitted and will be reviewed within 24 hours."
Every layer contributed to that outcome. Remove one and the system degrades or breaks.
Where Developers Are in 2026
The industry has shifted from demos to production-grade systems. The role of software engineers is evolving:
| Era | What Engineers Did | What Broke |
|---|---|---|
| 2023: Prompt Era | Wrote clever prompts | Brittle, unreliable, unscalable |
| 2024: RAG Era | Added retrieval to prompts | Context poisoning, poor chunking |
| 2025: Agent Era | Built multi-step agents | State loss, runaway execution, no observability |
| 2026: Systems Era | Engineer all three layers as a system | (This is where we are now) |
Today, the engineers building the best AI products are thinking in systems. They're not asking "how do I prompt better?" — they're asking "how do I build an infrastructure that makes failure less likely, catches it when it happens, and learns from it over time?"
Key Takeaways
- Prompt Engineering is the innermost layer — necessary but not sufficient. Write clear instructions, use CoT for complex tasks, enforce structured output, and version-control your prompts.
- Context Engineering is the critical middle layer most teams underinvest in. The goal is the smallest, highest-signal context window possible — not the biggest.
- Harness Engineering is the outermost layer that makes AI systems production-ready: guardrails, observability, self-correction, and state management.
- Context rot is real — performance degrades as context grows, not just when it overflows. Curate aggressively, summarize instead of truncate, and monitor retrieval quality.
- The four context failure modes — poisoning, distraction, confusion, overflow — are the most common sources of production hallucinations. Engineer against all four.
- Every LLM call needs a trace. Latency, token cost, retrieval precision, hallucination score — you can't debug a system you can't observe.
The good news: these aren't magic. They're engineering disciplines, and engineers can learn them. The teams winning at AI in production aren't the ones with the best prompts — they're the ones who built the best systems around their models.
Thanks for reading! If you found this useful, check out my Clean Architecture guide for how to structure the codebase around these systems, and the Saga Pattern post for managing multi-step distributed workflows — the same concepts apply elegantly to agentic pipelines. 🚀