
Multi-Agent Simulation Platform

Autonomous AI Agents on Azure — Architecture Deep Dive

AI Hackathon 2026 · Azure · Solo Developer

The Engineering Challenge

Most multi-agent systems are glorified prompt chains. One LLM calls another, context gets copy-pasted between turns, and the whole thing forgets everything after a page refresh. I wanted agents that remember.

The challenge: build a platform where 3–8 autonomous AI agents maintain persistent semantic memory across extended multi-phase interaction sessions. Each agent independently decides what to say, recalls relevant context from earlier phases, and — critically — detects when a user contradicts something they said 20 turns ago.

Not a demo. Not a prompt chain. A production-grade multi-agent system with memory, reasoning, and infrastructure that deploys to Azure with a single command.

3–8 Autonomous Agents · 17 Agent Tools · 5 Pipeline Stages · <5s Turn Latency

Azure Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                        Azure Container Apps                         │
│                            (azd up)                                 │
│  ┌──────────────┐  ┌──────────────────────┐  ┌──────────────────┐   │
│  │   nginx SPA  │  │   FastAPI Backend    │  │      Qdrant      │   │
│  │  (Frontend)  │──│                      │──│   (Vector DB)    │   │
│  │   Port 80    │  │  Orchestration Layer │  │    Port 6333     │   │
│  └──────────────┘  │  Memory Manager      │  │   Named Volume   │   │
│                    │  Claim Extraction    │  └──────────────────┘   │
│                    │  Governance Layer    │                         │
│                    └──────────┬───────────┘                         │
│                               │                                     │
│          ┌────────────────────┼────────────────┐                    │
│          ▼                    ▼                ▼                    │
│  ┌───────────────┐  ┌──────────────┐  ┌──────────────────┐          │
│  │ Azure OpenAI  │  │  Azure AI    │  │  text-embedding  │          │
│  │   Service     │  │  Foundry     │  │  -3-small        │          │
│  │               │  │              │  │  (1536 dims)     │          │
│  │  GPT-4o       │  │  Agent       │  └──────────────────┘          │
│  │  GPT-5-Nano   │  │  Framework   │                                │
│  │  DeepSeek-V3  │  │  Threads     │                                │
│  │  Phi-4-mini   │  │  Tools       │                                │
│  └───────────────┘  └──────────────┘                                │
└─────────────────────────────────────────────────────────────────────┘

Azure OpenAI Service

Four models deployed with purpose-driven routing. Each agent gets the right model for its cognitive requirements.

  • GPT-4o — Complex reasoning and response generation
  • GPT-5-Nano — Fast classification and decision-making
  • DeepSeek-V3 — Long-form generation tasks
  • Phi-4-mini — Lightweight claim extraction (~150 tokens)
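As a sketch, the routing above reduces to a task-to-deployment map. The deployment strings and task labels below are illustrative, not the platform's actual identifiers:

```python
# Hypothetical task-to-deployment routing table; labels are assumptions.
TASK_MODEL_MAP = {
    "reasoning": "gpt-4o",            # complex reasoning, response generation
    "classification": "gpt-5-nano",   # fast classification and decisions
    "long_form": "deepseek-v3",       # long-form generation tasks
    "extraction": "phi-4-mini",       # lightweight structured claim extraction
}

def route_model(task_type: str) -> str:
    """Return the deployment for a task, defaulting to the reasoning model."""
    return TASK_MODEL_MAP.get(task_type, "gpt-4o")
```

Centralising the map means adding a fifth model is a one-line change rather than a hunt through every agent.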

Azure AI Foundry

Microsoft Agent Framework provides persistent agents with per-session threads and tool call observability.

Runs in local Docker containers for development and deploys unchanged to Azure Container Apps for production.

Azure Container Apps

Three Docker Compose services deploy identically to cloud via azd up.

Zero architecture changes between local development and Azure deployment. Same containers, same networking, same volumes.

GovernedOpenAIClient — Single Gateway

Every LLM call in the platform flows through a single governed client. This is the architectural choke point by design — one place to add logging, one place to swap models, one place to enforce budgets.

Traffic Governance

  • Per-session request budgets with graceful exhaustion
  • Token tracking across all model calls
  • Model routing based on agent requirements

Resilience

  • Staggered backoff retries: 5s → 10s → 15s (max 3 attempts)
  • Graceful fallback responses on rate limit exhaustion
  • Zero 500 errors reach the user — ever
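A minimal sketch of the governed gateway, assuming an injected async LLM callable; the budget size, fallback strings, and class internals here are illustrative, while the 5s/10s/15s backoff and three-attempt cap come from the list above:

```python
import asyncio

class GovernedOpenAIClient:
    """Illustrative sketch (not the actual implementation): per-session
    request budgets plus staggered retries behind one choke point."""

    def __init__(self, llm_call, budget: int = 50):
        self._llm_call = llm_call        # async callable: the real Azure OpenAI call
        self._remaining = budget         # per-session request budget
        self._backoffs = [5, 10, 15]     # seconds between attempts (max 3 attempts)

    async def complete(self, prompt: str) -> str:
        if self._remaining <= 0:
            # Graceful exhaustion: a canned reply, never an error.
            return "[budget exhausted: fallback response]"
        self._remaining -= 1
        for attempt, delay in enumerate(self._backoffs):
            try:
                return await self._llm_call(prompt)
            except Exception:
                if attempt == len(self._backoffs) - 1:
                    # Final attempt failed: still no 500 reaches the user.
                    return "[rate limited: fallback response]"
                await asyncio.sleep(delay)

async def _fake_llm(prompt: str) -> str:
    return f"echo: {prompt}"

client = GovernedOpenAIClient(_fake_llm, budget=1)
first = asyncio.run(client.complete("hello"))
second = asyncio.run(client.complete("again"))   # budget spent → fallback
```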

“In production, you need exactly one place to add logging, one place to swap models. If every agent calls Azure OpenAI directly, you've lost control.”

Memory Architecture — Stanford Generative Agents + Qdrant

┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────────┐
│ Observation │───▶│   Embed      │───▶│   Store     │───▶│   Recall     │
│             │    │              │    │             │    │              │
│ User turn   │    │ text-embed-  │    │ Qdrant      │    │ Top-20       │
│ Agent turn  │    │ ding-3-small │    │ collection  │    │ candidates   │
│ Self-obs    │    │ (1536 dims)  │    │ + metadata  │    │ reranked by  │
│             │    │              │    │             │    │ Stanford     │
└─────────────┘    └──────────────┘    └─────────────┘    │ three-factor │
                                                          └──────────────┘

Stanford Three-Factor Retrieval

score = 0.5 × recency + 3.0 × relevance + 2.0 × importance

  • Relevance (×3.0) — Cosine similarity; semantic match is the primary retrieval signal
  • Importance (×2.0) — Ensures inconsistencies (importance=1.0) always surface in recall
  • Recency (×0.5) — Low weight; early-phase assertions must remain accessible later
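The formula can be written directly; the exponential recency decay and its half-life below are assumptions for illustration, while the three weights come from the score above:

```python
import math

def three_factor_score(similarity: float, importance: float,
                       age_seconds: float, half_life: float = 3600.0) -> float:
    """Stanford-style combined score; weights match the formula above."""
    recency = math.exp(-age_seconds / half_life)   # 1.0 when fresh, decays with age
    return 0.5 * recency + 3.0 * similarity + 2.0 * importance

# A flagged inconsistency (importance=1.0) outranks fresher small talk,
# precisely because recency carries the lowest weight.
fresh_chitchat = three_factor_score(similarity=0.3, importance=0.2, age_seconds=0)
old_inconsistency = three_factor_score(similarity=0.6, importance=1.0, age_seconds=7200)
```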

Single-Collection Design

One Qdrant collection (memories) with per-agent payload filters. Scales cleanly without creating N collections per session. Agent isolation enforced at the storage layer via session ID + agent ID filtering.
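As a sketch, the scoped recall query might look like the following Qdrant REST search body; the payload keys `session_id` and `agent_id` are assumed names for the filters described above:

```python
def memory_search_body(query_vector: list[float], session_id: str,
                       agent_id: str, limit: int = 20) -> dict:
    """Build a search request body for the single `memories` collection."""
    return {
        "vector": query_vector,      # 1536-dim embedding of the query text
        "limit": limit,              # top-20 candidates before Stanford reranking
        "with_payload": True,
        "filter": {                  # agent isolation at the storage layer
            "must": [
                {"key": "session_id", "match": {"value": session_id}},
                {"key": "agent_id", "match": {"value": agent_id}},
            ]
        },
    }

body = memory_search_body([0.0] * 1536, session_id="s-42", agent_id="agent-3")
```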

“I used ChromaDB at GovHack 2024. For persistent multi-agent memory with session scoping, Qdrant was the better tool — metadata filtering, persistent collections, and proper payload indexing out of the box.”

Cross-Phase Inconsistency Detection

EXTRACT

Phi-4-mini extracts structured claims from natural language (~150 tokens)

STORE

Claims upserted into ClaimTable with topic, value, phase, turn

COMPARE

New claims checked against all prior claims on the same topic

ALERT

Agent surfaces inconsistency in character, referencing original value
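The STORE and COMPARE steps can be sketched with a toy ClaimTable. The dataclass fields mirror the (topic, value, phase, turn) tuple above; the conflict rule is simplified to an exact-value mismatch, which is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    topic: str
    value: str
    phase: int
    turn: int

class ClaimTable:
    """Toy claim store: upsert a new claim, report prior conflicts on its topic."""

    def __init__(self):
        self._claims: list[Claim] = []

    def upsert_and_check(self, new: Claim) -> list[Claim]:
        conflicts = [c for c in self._claims
                     if c.topic == new.topic and c.value != new.value]
        self._claims.append(new)
        return conflicts

table = ClaimTable()
table.upsert_and_check(Claim("budget", "$50k", phase=1, turn=3))
conflicts = table.upsert_and_check(Claim("budget", "$80k", phase=3, turn=23))
```

Any returned conflicts carry their original phase and turn, which is what lets the agent reference the original value in character.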

Lightweight Extraction

Phi-4-mini handles claim extraction at ~150 tokens per call. No need for GPT-4o on a structured extraction task — use the smallest model that gets the job done.

Difficulty-Gated Sensitivity

Inconsistency thresholds vary by difficulty tier. Easy mode catches only major discrepancies. Hard mode flags even minor deviations — the system adapts its scrutiny to the scenario.
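A minimal sketch of tier-gated flagging. The tier names come from the paragraph above, but the numeric thresholds and the idea of a [0, 1] severity score are invented for illustration:

```python
# Hypothetical thresholds: lower threshold = more scrutiny.
SEVERITY_THRESHOLDS = {"easy": 0.7, "medium": 0.4, "hard": 0.1}

def should_flag(severity: float, tier: str) -> bool:
    """Flag an inconsistency only when it clears the tier's threshold."""
    return severity >= SEVERITY_THRESHOLDS[tier]

# A moderate deviation slips past easy mode but is flagged in hard mode.
flags = {tier: should_flag(0.5, tier) for tier in ("easy", "medium", "hard")}
```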

“Most multi-agent systems don't check for consistency. They just... generate. This platform remembers everything you said, and it will call you on it.”

5-Stage Async Orchestration Pipeline

  ┌────────────────────────────────────────────────────────────────┐
  │                     User Message Received                      │
  └──────────────────────────┬─────────────────────────────────────┘
                             ▼
  ┌────────────────────────────────────────────────────────────────┐
  │  Stage 1: Claim Extraction                                     │
  │  Phi-4-mini extracts (topic, value, phase) → upsert ClaimTable │
  └──────────────────────────┬─────────────────────────────────────┘
                             ▼
  ┌────────────────────────────────────────────────────────────────┐
  │  Stage 2: Parallel Inconsistency Checks                        │
  │  One check per new claim vs. all prior claims on same topic    │
  │  asyncio.gather — severity gated by difficulty tier            │
  └──────────────────────────┬─────────────────────────────────────┘
                             ▼
  ┌────────────────────────────────────────────────────────────────┐
  │  Stage 3: Parallel Memory Recording                            │
  │  Current turn context → Qdrant for each active agent           │
  └──────────────────────────┬─────────────────────────────────────┘
                             ▼
  ┌────────────────────────────────────────────────────────────────┐
  │  Stage 4: Parallel Agent Responses          ┌────────────────┐ │
  │  Top-5 Stanford-scored memories recalled    │ Stage 5:       │ │
  │  Inconsistency injected into prompt    ◀───▶│ Self-Observe + │ │
  │  All agents fire via asyncio.gather         │ Advisory Tip   │ │
  │  ContextVar isolation per agent             └────────────────┘ │
  └────────────────────────────────────────────────────────────────┘

asyncio.gather Parallelism

Stages are strictly sequenced where dependencies exist — claims must be extracted before inconsistency checks — but maximally parallel within each stage. 3+ simultaneous LLM calls complete in under 5 seconds.
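The within-stage parallelism can be sketched like this, with a stub standing in for the real LLM call:

```python
import asyncio

async def agent_respond(agent_id: int, context: str) -> str:
    await asyncio.sleep(0.01)            # stands in for a multi-second LLM call
    return f"agent-{agent_id}: reply to {context!r}"

async def stage_parallel_responses(agent_ids: list[int], context: str) -> list[str]:
    # All agents fire at once; the stage takes as long as the slowest call,
    # not the sum of all calls.
    return await asyncio.gather(*(agent_respond(a, context) for a in agent_ids))

replies = asyncio.run(stage_parallel_responses([1, 2, 3], "user turn"))
```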

ContextVar Isolation

Each async task runs in its own context. Session state, memory streams, and claim tables are task-scoped — preventing cross-agent data leakage during parallel execution.

“Without context isolation, Agent 3's state leaks into Agent 5's reasoning. ContextVar gives you task-scoped state without passing context objects through every function signature.”
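The isolation pattern can be sketched with `contextvars` from the standard library; each `asyncio` task created by `gather` gets its own copy of the context, so a `set()` in one agent's task never leaks into another's:

```python
import asyncio
from contextvars import ContextVar

current_agent: ContextVar[str] = ContextVar("current_agent")

def deep_in_the_call_stack() -> str:
    # No context parameter threaded through: reads the calling task's own value.
    return current_agent.get()

async def run_agent(agent_id: str) -> str:
    current_agent.set(agent_id)          # visible only inside this task
    await asyncio.sleep(0)               # yield so tasks interleave
    return deep_in_the_call_stack()

async def main() -> list[str]:
    return await asyncio.gather(*(run_agent(a) for a in ("agent-3", "agent-5")))

results = asyncio.run(main())            # each task saw only its own agent id
```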

Agent Architecture

Three Core Methods

respond()

Generates the agent's turn output. Routes through Agent Framework when enabled, falls back to direct Azure OpenAI calls. Can invoke registered tools mid-turn.

decide()

Autonomous decision-making before each turn. Returns a structured AgentAction: SPEAK, WAIT, COMMIT, EXIT, or INVITE_COLLAB. Not scripted branching — LLM-evaluated.

generate_tells()

Produces non-verbal behavioural signals (expressions, micro-reactions) in parallel with the main response — masking latency with useful output.
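The decision contract of decide() can be sketched as an enum plus a tolerant parser; the enum members come from the list above, while `parse_action` and its WAIT default are assumed helpers, not the platform's actual code:

```python
from enum import Enum

class AgentAction(Enum):
    SPEAK = "speak"
    WAIT = "wait"
    COMMIT = "commit"
    EXIT = "exit"
    INVITE_COLLAB = "invite_collab"

def parse_action(llm_output: str) -> AgentAction:
    """Map the model's free-text classification onto the enum,
    defaulting to WAIT when the output is unrecognisable."""
    token = llm_output.strip().lower()
    return next((a for a in AgentAction if a.value == token), AgentAction.WAIT)
```

Returning a structured value rather than raw text is what makes the decision LLM-evaluated without being LLM-shaped downstream.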

Agent Tools

Tool                   Type        Purpose
─────────────────────  ──────────  ────────────────────────────────────────────────────────
recall_memory          Base        Semantic search over the agent's own memory stream
record_observation     Base        Write a new observation to persistent memory
detect_inconsistency   Base        Check a user assertion against the claim table
specialist tool        Specialist  Domain-specific analysis unique to each agent archetype
4 diagnostic tools     Advisory    Structured feedback and contextual guidance for the user

Data-Driven Agent Configuration

Markdown skill files define personality, domain knowledge, and tier behaviour. New agent archetypes require zero code changes — drop in a new skill file and the platform picks it up.

“If you need to redeploy to change agent behaviour, your architecture is wrong. Skill files are the config layer — personality, knowledge, and rules live in markdown, not Python.”
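A sketch of skill-file loading under an assumed layout where each `## Section` heading in the markdown becomes a config field (the actual skill-file schema is not shown in this article):

```python
def load_skill_file(text: str) -> dict[str, str]:
    """Parse '## Section' headings into a {section: body} config mapping."""
    sections: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip().lower()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return {name: body.strip() for name, body in sections.items()}

skill = load_skill_file(
    "## Personality\nBlunt, detail-obsessed auditor.\n"
    "## Tier Rules\nHard mode: flag even minor deviations."
)
```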

Key Design Decisions

Multi-Model Routing

Not every task needs GPT-4o. Purpose-driven model selection keeps costs down and latency low without sacrificing quality where it matters.

  • GPT-4o — Complex reasoning
  • GPT-5-Nano — Classification & decisions
  • DeepSeek-V3 — Long-form generation
  • Phi-4-mini — Structured extraction

Qdrant over ChromaDB

ChromaDB works for prototypes. For production multi-agent memory with session scoping and metadata filtering, Qdrant is the stronger fit:

  • Session-scoped payload filtering
  • Persistent named collections
  • Per-agent isolation at query time
  • Docker-native with volume persistence

asyncio over Celery

This is real-time orchestration, not a job queue. Agents need to respond within a single HTTP request cycle.

  • Sub-5s latency for parallel LLM calls
  • ContextVar isolation per async task
  • No message broker dependency
  • Native Python 3.12 async

Feature-Flagged Agent Framework

The platform runs fully without Azure AI Foundry. Agent Framework integration is toggled with a single flag.

  • USE_AGENT_FRAMEWORK=true — Full Foundry integration
  • USE_AGENT_FRAMEWORK=false — Direct Azure OpenAI calls
  • Graceful degradation, no code changes
  • AF adds observability, not core logic
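A minimal sketch of the flag check, using the same environment variable named above; the returned backend labels are illustrative stand-ins for the two client paths:

```python
import os

def make_llm_backend() -> str:
    """Pick the integration path from the USE_AGENT_FRAMEWORK env flag."""
    if os.getenv("USE_AGENT_FRAMEWORK", "false").lower() == "true":
        return "agent-framework"     # full Foundry integration
    return "direct-openai"           # direct Azure OpenAI calls

os.environ["USE_AGENT_FRAMEWORK"] = "false"
backend = make_llm_backend()         # graceful degradation, no code changes
```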

Infrastructure

nginx SPA

Single-file frontend. No build step, no bundler.

FastAPI

Orchestration, memory, claims, governance.

Qdrant

Vector DB with named volume for persistence.

$ docker compose up     # Local development — 3 containers
$ azd up                # Azure deployment — zero architecture changes

“Zero-to-deployed in one command. That's not marketing — that's the actual developer experience. Same containers, same networking, same volumes. Local and cloud are architecturally identical.”

“Most people build multi-agent demos. I built a multi-agent system — with persistent memory, inconsistency detection, and production deployment. The architecture matters more than the demo.”
Vin Ralh