Multi-Agent Simulation Platform
Autonomous AI Agents on Azure — Architecture Deep Dive
The Engineering Challenge
Most multi-agent systems are glorified prompt chains. One LLM calls another, context gets copy-pasted between turns, and the whole thing forgets everything after a page refresh. I wanted agents that remember.
The challenge: build a platform where 3–8 autonomous AI agents maintain persistent semantic memory across extended multi-phase interaction sessions. Each agent independently decides what to say, recalls relevant context from earlier phases, and — critically — detects when a user contradicts something they said 20 turns ago.
Not a demo. Not a prompt chain. A production-grade multi-agent system with memory, reasoning, and infrastructure that deploys to Azure with a single command.
Azure Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│                         Azure Container Apps                        │
│                              (azd up)                               │
│  ┌──────────────┐  ┌──────────────────────┐  ┌──────────────────┐   │
│  │  nginx SPA   │  │   FastAPI Backend    │  │      Qdrant      │   │
│  │  (Frontend)  │──│                      │──│   (Vector DB)    │   │
│  │   Port 80    │  │  Orchestration Layer │  │    Port 6333     │   │
│  └──────────────┘  │  Memory Manager      │  │   Named Volume   │   │
│                    │  Claim Extraction    │  └──────────────────┘   │
│                    │  Governance Layer    │                         │
│                    └──────────┬───────────┘                         │
│                               │                                     │
│              ┌────────────────┼────────────────┐                    │
│              ▼                ▼                ▼                    │
│  ┌───────────────┐  ┌──────────────┐  ┌──────────────────┐          │
│  │ Azure OpenAI  │  │  Azure AI    │  │  text-embedding  │          │
│  │   Service     │  │   Foundry    │  │    -3-small      │          │
│  │               │  │              │  │   (1536 dims)    │          │
│  │  GPT-4o       │  │  Agent       │  └──────────────────┘          │
│  │  GPT-5-Nano   │  │  Framework   │                                │
│  │  DeepSeek-V3  │  │  Threads     │                                │
│  │  Phi-4-mini   │  │  Tools       │                                │
│  └───────────────┘  └──────────────┘                                │
└─────────────────────────────────────────────────────────────────────┘
Azure OpenAI Service
Four models deployed with purpose-driven routing. Each agent gets the right model for its cognitive requirements.
- GPT-4o — Complex reasoning and response generation
- GPT-5-Nano — Fast classification and decision-making
- DeepSeek-V3 — Long-form generation tasks
- Phi-4-mini — Lightweight claim extraction (~150 tokens)
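Purpose-driven routing can be as simple as a lookup table. A minimal sketch, assuming a task-kind string per agent call (the `ROUTING` mapping and `pick_model` helper are illustrative, not the platform's actual code):

```python
# Hypothetical task-kind -> deployment-name routing table.
ROUTING = {
    "reasoning": "gpt-4o",        # complex reasoning, response generation
    "classification": "gpt-5-nano",  # fast decisions
    "longform": "deepseek-v3",    # long-form generation
    "extraction": "phi-4-mini",   # lightweight claim extraction
}

def pick_model(task_kind: str) -> str:
    """Return the deployment for a task, defaulting to the reasoning model."""
    return ROUTING.get(task_kind, "gpt-4o")
```

Keeping the table in one place means a model swap is a one-line change, which is the whole point of the governed-client design described below.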
Azure AI Foundry
Microsoft Agent Framework provides persistent agents with per-session threads and tool call observability.
Runs in local Docker containers for development and deploys to production via Azure Container Apps.
Azure Container Apps
Three Docker Compose services deploy identically to cloud via azd up.
Zero architecture changes between local development and Azure deployment. Same containers, same networking, same volumes.
GovernedOpenAIClient — Single Gateway
Every LLM call in the platform flows through a single governed client. This is the architectural choke point by design — one place to add logging, one place to swap models, one place to enforce budgets.
Traffic Governance
- Per-session request budgets with graceful exhaustion
- Token tracking across all model calls
- Model routing based on agent requirements
Resilience
- Backoff retry: 5s → 10s → 15s (max 3 attempts)
- Graceful fallback responses on rate limit exhaustion
- Zero 500 errors reach the user — ever
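The retry-then-fallback behaviour can be sketched in a few lines. This is a minimal illustration of the schedule described above, not the actual `GovernedOpenAIClient`; the exception handling and parameter names are assumptions (a real client would catch the SDK's specific rate-limit error):

```python
import asyncio

async def governed_call(call, fallback, delays=(5, 10, 15), sleep=asyncio.sleep):
    """Retry an async LLM call on failure with the 5s -> 10s -> 15s schedule;
    after the schedule is exhausted, return a graceful fallback instead of
    letting a 500 reach the user."""
    for delay in delays:
        try:
            return await call()
        except Exception:  # in practice: catch the SDK's rate-limit error only
            await sleep(delay)
    return fallback
```

Because every agent goes through this one function, the fallback policy and the backoff schedule live in exactly one place.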
“In production, you need exactly one place to add logging, one place to swap models. If every agent calls Azure OpenAI directly, you've lost control.”
Memory Architecture — Stanford Generative Agents + Qdrant
┌─────────────┐ ┌──────────────┐ ┌─────────────┐ ┌──────────────┐
│ Observation │───▶│ Embed │───▶│ Store │───▶│ Recall │
│ │ │ │ │ │ │ │
│ User turn │ │ text-embed- │ │ Qdrant │ │ Top-20 │
│ Agent turn │ │ ding-3-small │ │ collection │ │ candidates │
│ Self-obs │ │ (1536 dims) │ │ + metadata │ │ reranked by │
│ │ │ │ │ │ │ Stanford │
└─────────────┘ └──────────────┘ └─────────────┘ │ three-factor │
                                                 └──────────────┘

Stanford Three-Factor Retrieval
score = 0.5 × recency + 3.0 × relevance + 2.0 × importance

- Relevance (3.0) — cosine similarity; semantic match is the primary retrieval signal
- Importance (2.0) — ensures inconsistencies (importance=1.0) always surface in recall
- Recency (0.5) — low weight; early-phase assertions must remain accessible later
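The weighted sum translates directly into a reranking function. A minimal sketch, assuming each factor is already normalised to [0, 1] (the `rerank` helper and candidate-tuple layout are illustrative):

```python
W_RECENCY, W_RELEVANCE, W_IMPORTANCE = 0.5, 3.0, 2.0  # weights from the formula

def stanford_score(recency: float, relevance: float, importance: float) -> float:
    """Three-factor retrieval score; inputs assumed normalised to [0, 1]."""
    return W_RECENCY * recency + W_RELEVANCE * relevance + W_IMPORTANCE * importance

def rerank(candidates):
    """candidates: list of (memory_text, recency, relevance, importance).
    Returns the list best-first, as applied to the top-20 Qdrant hits."""
    return sorted(candidates, key=lambda c: stanford_score(*c[1:]), reverse=True)
```

The weighting is why a max-importance inconsistency recorded twenty turns ago can still outrank yesterday's small talk: importance contributes up to 2.0 while recency contributes at most 0.5.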
Single-Collection Design
One Qdrant collection (memories) with per-agent payload filters. Scales cleanly without creating N collections per session. Agent isolation enforced at the storage layer via session ID + agent ID filtering.
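Agent isolation in a single collection comes down to a payload filter on every query. A sketch of the filter in Qdrant's REST shape, assuming `session_id` and `agent_id` payload fields (the field names are illustrative, not confirmed from the codebase):

```python
def memory_filter(session_id: str, agent_id: str) -> dict:
    """Qdrant payload filter (REST `must` shape) scoping a vector search
    to one agent's memories within one session."""
    return {
        "must": [
            {"key": "session_id", "match": {"value": session_id}},
            {"key": "agent_id", "match": {"value": agent_id}},
        ]
    }
```

Attaching this filter to every search is what makes one shared `memories` collection safe: no query can cross a session or agent boundary, and there is no per-session collection churn to manage.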
“I used ChromaDB at GovHack 2024. For persistent multi-agent memory with session scoping, Qdrant was the better tool — metadata filtering, persistent collections, and proper payload indexing out of the box.”
Cross-Phase Inconsistency Detection
Phi-4-mini extracts structured claims from natural language (~150 tokens)
Claims upserted into ClaimTable with topic, value, phase, turn
New claims checked against all prior claims on the same topic
Agent surfaces inconsistency in character, referencing original value
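The claim-table check at the heart of steps 2–3 above can be sketched with a dataclass and a filter. This is a simplified same-topic comparison; the real system scores severity with an LLM and gates it by difficulty tier, so exact string inequality here is an illustrative stand-in:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    topic: str
    value: str
    phase: int
    turn: int

def find_inconsistencies(new: Claim, table: list[Claim]) -> list[Claim]:
    """Return prior claims on the same topic whose value differs from the
    new claim — the candidates an agent may surface in character."""
    return [c for c in table if c.topic == new.topic and c.value != new.value]
```

Each hit carries its original phase and turn, which is what lets an agent say "in phase 1 you told us $2M" rather than just "you contradicted yourself".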
Lightweight Extraction
Phi-4-mini handles claim extraction at ~150 tokens per call. No need for GPT-4o on a structured extraction task — use the smallest model that gets the job done.
Difficulty-Gated Sensitivity
Inconsistency thresholds vary by difficulty tier. Easy mode catches only major discrepancies. Hard mode flags even minor deviations — the system adapts its scrutiny to the scenario.
“Most multi-agent systems don't check for consistency. They just... generate. This platform remembers everything you said, and it will call you on it.”
5-Stage Async Orchestration Pipeline
┌─────────────────────────────────────────────────────────────────┐
│ User Message Received │
└──────────────────────────┬──────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────────┐
│ Stage 1: Claim Extraction │
│ Phi-4-mini extracts (topic, value, phase) → upsert ClaimTable │
└──────────────────────────┬──────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────────┐
│ Stage 2: Parallel Inconsistency Checks │
│ One check per new claim vs. all prior claims on same topic │
│ asyncio.gather — severity gated by difficulty tier │
└──────────────────────────┬──────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────────┐
│ Stage 3: Parallel Memory Recording │
│ Current turn context → Qdrant for each active agent │
└──────────────────────────┬──────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────────┐
│ Stage 4: Parallel Agent Responses ┌────────────────┐ │
│ Top-5 Stanford-scored memories recalled │ Stage 5: │ │
│ Inconsistency injected into prompt ◀───▶│ Self-Observe + │ │
│ All agents fire via asyncio.gather │ Advisory Tip │ │
│ ContextVar isolation per agent └────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

asyncio.gather Parallelism
Stages are strictly sequenced where dependencies exist — claims must be extracted before inconsistency checks — but maximally parallel within each stage. 3+ simultaneous LLM calls complete in under 5 seconds.
ContextVar Isolation
Each async task runs in its own context. Session state, memory streams, and claim tables are task-scoped — preventing cross-agent data leakage during parallel execution.
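The isolation property is standard `contextvars` behaviour: `asyncio` copies the current context into each task, so a value set inside one task never leaks into a sibling. A minimal sketch (agent names are illustrative):

```python
import asyncio
from contextvars import ContextVar

current_agent: ContextVar[str] = ContextVar("current_agent")

async def agent_turn(agent_id: str) -> str:
    current_agent.set(agent_id)   # visible only inside this task's context
    await asyncio.sleep(0)        # yield so the other agents interleave
    return current_agent.get()    # still this agent's id — no cross-task leakage

async def run_phase():
    # gather() wraps each coroutine in a task with its own copied context
    return await asyncio.gather(*(agent_turn(a) for a in ("agent-1", "agent-2", "agent-3")))
```

This is why session state can be task-scoped without threading a context object through every function signature: any helper deep in the call stack can read `current_agent.get()` and see only its own agent.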
“Without context isolation, Agent 3's state leaks into Agent 5's reasoning. ContextVar gives you task-scoped state without passing context objects through every function signature.”
Agent Architecture
Three Core Methods
respond() — Generates the agent's turn output. Routes through Agent Framework when enabled, falls back to direct Azure OpenAI calls. Can invoke registered tools mid-turn.
decide() — Autonomous decision-making before each turn. Returns a structured AgentAction: SPEAK, WAIT, COMMIT, EXIT, or INVITE_COLLAB. Not scripted branching — LLM-evaluated.
generate_tells() — Produces non-verbal behavioural signals (expressions, micro-reactions) in parallel with the main response — masking latency with useful output.
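The structured `AgentAction` returned by `decide()` is the contract that makes an LLM-evaluated decision safe to act on. A sketch of the enum and a defensive parse step (the string values and the WAIT fallback are assumptions, not confirmed from the codebase):

```python
from enum import Enum

class AgentAction(Enum):
    SPEAK = "speak"
    WAIT = "wait"
    COMMIT = "commit"
    EXIT = "exit"
    INVITE_COLLAB = "invite_collab"

def parse_action(raw: str) -> AgentAction:
    """Coerce the LLM's free-text decision into a valid action;
    anything unexpected degrades to WAIT rather than crashing the turn."""
    try:
        return AgentAction(raw.strip().lower())
    except ValueError:
        return AgentAction.WAIT
```

Constraining the decision to a closed enum is what separates autonomous behaviour from scripted branching: the model chooses, but only from actions the orchestrator knows how to execute.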
Agent Tools
| Tool | Type | Purpose |
|---|---|---|
| recall_memory | Base | Semantic search over the agent's own memory stream |
| record_observation | Base | Write a new observation to persistent memory |
| detect_inconsistency | Base | Check a user assertion against the claim table |
| specialist tool | Specialist | Domain-specific analysis unique to each agent archetype |
| 4 diagnostic tools | Advisory | Structured feedback and contextual guidance for the user |
Data-Driven Agent Configuration
Markdown skill files define personality, domain knowledge, and tier behaviour. New agent archetypes require zero code changes — drop in a new skill file and the platform picks it up.
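Zero-code archetypes imply a discovery step at startup. A minimal sketch, assuming one markdown file per archetype with the filename stem as its name (the directory layout and naming convention are assumptions):

```python
from pathlib import Path

def load_skills(skills_dir: str) -> dict[str, str]:
    """Discover agent archetypes from markdown skill files.
    Dropping a new .md file into the directory adds an archetype;
    no Python changes, no redeploy."""
    return {
        p.stem: p.read_text(encoding="utf-8")
        for p in sorted(Path(skills_dir).glob("*.md"))
    }
```

The loaded markdown becomes the agent's system-prompt material, which is what keeps personality, knowledge, and tier rules in the config layer rather than in code.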
“If you need to redeploy to change agent behaviour, your architecture is wrong. Skill files are the config layer — personality, knowledge, and rules live in markdown, not Python.”
Key Design Decisions
Multi-Model Routing
Not every task needs GPT-4o. Purpose-driven model selection keeps costs down and latency low without sacrificing quality where it matters.
- GPT-4o — Complex reasoning
- GPT-5-Nano — Classification & decisions
- DeepSeek-V3 — Long-form generation
- Phi-4-mini — Structured extraction
Qdrant over ChromaDB
ChromaDB works for prototypes. For production multi-agent memory with session scoping and metadata filtering, Qdrant offers:
- Session-scoped payload filtering
- Persistent named collections
- Per-agent isolation at query time
- Docker-native with volume persistence
asyncio over Celery
This is real-time orchestration, not a job queue. Agents need to respond within a single HTTP request cycle.
- Sub-5s latency for parallel LLM calls
- ContextVar isolation per async task
- No message broker dependency
- Native Python 3.12 async
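The latency argument is easy to demonstrate: `asyncio.gather` makes concurrent calls complete in roughly the time of the slowest one, not the sum. A self-contained sketch with sleeps standing in for Azure OpenAI round-trips:

```python
import asyncio

async def fake_llm_call(delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for one model round-trip
    return f"done after {delay}s"

async def parallel_turn():
    # Three concurrent "LLM calls" finish in ~max(delays), not sum(delays) —
    # the same shape as Stage 4's all-agents-at-once fan-out.
    return await asyncio.gather(*(fake_llm_call(d) for d in (0.1, 0.1, 0.1)))
```

A Celery worker pool could do the same fan-out, but only by adding a broker, result backend, and serialization hop to something that must finish inside one HTTP request.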
Feature-Flagged Agent Framework
The platform runs fully without an Azure subscription. Agent Framework integration is toggled with a single flag.
- USE_AGENT_FRAMEWORK=true — Full Foundry integration
- USE_AGENT_FRAMEWORK=false — Direct Azure OpenAI calls
- Graceful degradation, no code changes
- AF adds observability, not core logic
Infrastructure
nginx SPA
Single-file frontend. No build step, no bundler.
FastAPI
Orchestration, memory, claims, governance.
Qdrant
Vector DB with named volume for persistence.
$ docker compose up   # Local development — 3 containers
$ azd up              # Azure deployment — zero architecture changes
“Zero-to-deployed in one command. That's not marketing — that's the actual developer experience. Same containers, same networking, same volumes. Local and cloud are architecturally identical.”
“Most people build multi-agent demos. I built a multi-agent system — with persistent memory, inconsistency detection, and production deployment. The architecture matters more than the demo.”