
Multi-Agent Simulation Platform

Autonomous AI Agents on Azure — Architecture Deep Dive

AI Hackathon 2026 · Azure · Solo Developer
Under active development — hackathon submission March 2026
Dev Sandhu — HD-2D pixel art agent (character portrait)

The Engineering Challenge

I wanted to build a multi-agent system where agents actually remember what happened — across turns, across phases, across an entire session. Not just passing context between LLM calls, but genuine semantic memory backed by vector storage and retrieval.

I built the whole platform locally first using Microsoft Foundry containers with smaller models like Phi-4-mini — fast iteration, no cloud costs during development. Once the architecture held up, I deployed the same containers to Azure Container Apps with larger models for the hackathon demo.

The result: 3–8 autonomous agents maintaining persistent semantic memory, each independently deciding what to say, recalling relevant context from earlier phases, and detecting when a user contradicts something they said 20 turns ago — all running through a 5-stage async pipeline.

  • 3–8 Autonomous Agents
  • 20 Agent Tools
  • 5 Pipeline Stages
  • <5s Turn Latency

Azure Architecture Overview


Azure OpenAI Service

6+ models deployed with purpose-driven routing. Each task type routes to the optimal model for cost, latency, and quality.

  • GPT-5-Nano — Primary agent reasoning and decisions
  • DeepSeek-V3.1 — Long-form content generation
  • Phi-4-mini — Lightweight claim extraction (~150 tokens)
  • Phi-4-reasoning — Structured evaluation tasks
  • Mistral-small — Fast classification
  • Grok-3-mini — Advisory analysis
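Purpose-driven routing like this usually reduces to a small lookup table. A minimal sketch, assuming a task-type-to-deployment map (the task categories and the `resolve_model` helper are illustrative, not the platform's actual config):

```python
# Hypothetical routing table: task types map to deployment names, so
# swapping a model for a task is a one-line config change, not a code change.
TASK_MODEL_ROUTES = {
    "agent_reasoning":   "gpt-5-nano",
    "long_form":         "deepseek-v3.1",
    "claim_extraction":  "phi-4-mini",
    "evaluation":        "phi-4-reasoning",
    "classification":    "mistral-small",
    "advisory_analysis": "grok-3-mini",
}

def resolve_model(task_type: str, default: str = "gpt-5-nano") -> str:
    """Return the deployment name for a task type, falling back to the default."""
    return TASK_MODEL_ROUTES.get(task_type, default)
```

Centralising the table means the governed client, not the agents, decides which deployment serves each call.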

Azure AI Foundry

Microsoft Agent Framework provides persistent agents with per-session threads and tool call observability.

Runs in local Docker containers for development and deploys unchanged to Azure Container Apps for production.

Azure Container Apps

Four Docker Compose services deploy identically to cloud via azd up.

Zero architecture changes between local development and Azure deployment. Same containers, same networking, same volumes.

MCP Server

Separate microservice implementing Model Context Protocol with 5 analysis tools and PDF report generation.

Generates session summaries, consistency analysis, improvement plans, feedback reports, and exportable PDF documents via REST API.

GovernedOpenAIClient — Single Gateway

Every LLM call in the platform flows through a single governed client. This is the architectural choke point by design — one place to add logging, one place to swap models, one place to enforce budgets.

Traffic Governance

  • Per-session request budgets with graceful exhaustion
  • Token tracking across all model calls
  • Model routing based on agent requirements

Resilience

  • Retry with escalating backoff: 5s → 10s → 15s (max 3 attempts)
  • Graceful fallback responses on rate limit exhaustion
  • Zero 500 errors reach the user — ever
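The retry policy above can be sketched in a few lines. A minimal version, assuming the 5s/10s/15s schedule and a caller-supplied fallback string (`call_with_retry` and its parameters are illustrative, not the platform's actual API):

```python
import asyncio

# Sketch of the retry policy: delays escalate 5s, 10s, 15s across at most
# 3 attempts, then a graceful fallback is returned instead of an error.
# `call` is any async LLM invocation; `base_delay` is exposed for testing.
async def call_with_retry(call, fallback: str, max_attempts: int = 3,
                          base_delay: float = 5.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except Exception:
            if attempt == max_attempts:
                return fallback               # never let a 500 reach the user
            await asyncio.sleep(base_delay * attempt)  # 5s, 10s, 15s by default
```

The key property is the last line of the except branch: exhaustion degrades to a fallback response rather than propagating the exception to the HTTP layer.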

Dual API Paths

Dynamically routes between Azure OpenAI's Responses API (primary) and Chat Completions (fallback for non-OpenAI Foundry models). One client handles both API surfaces transparently.

Azure RAI Content Safety

Resolved real Azure AI jailbreak-filter false positives by redesigning prompt language and building slimmer message construction — practical production experience with Azure content safety at scale.

“In production, you need exactly one place to add logging, one place to swap models. If every agent calls Azure OpenAI directly, you've lost control.”

Memory Architecture — Stanford Generative Agents + Qdrant


Stanford Three-Factor Retrieval

score = 0.5 × recency + 3.0 × relevance + 2.0 × importance
  • Relevance (weight 3.0) — Cosine similarity; semantic match is the primary retrieval signal
  • Importance (weight 2.0) — Ensures inconsistencies (importance=1.0) always surface in recall
  • Recency (weight 0.5) — Low weight; early-phase assertions must remain accessible later
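The scoring function is a direct weighted sum. A minimal sketch with the weights above, assuming all three inputs are pre-normalised to [0, 1]:

```python
# Stanford three-factor retrieval score: 0.5*recency + 3.0*relevance
# + 2.0*importance. Inputs are assumed normalised to [0, 1].
WEIGHTS = {"recency": 0.5, "relevance": 3.0, "importance": 2.0}

def retrieval_score(recency: float, relevance: float, importance: float) -> float:
    return (WEIGHTS["recency"] * recency
            + WEIGHTS["relevance"] * relevance
            + WEIGHTS["importance"] * importance)
```

With these weights, an old inconsistency (recency 0.0, importance 1.0) outscores a fresh low-importance memory at the same relevance, which is exactly the behaviour the low recency weight is buying.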

Single-Collection Design

One Qdrant collection (memories) with per-agent payload filters. Scales cleanly without creating N collections per session. Agent isolation enforced at the storage layer via session ID + agent ID filtering.
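The per-agent isolation reduces to a payload filter on every query. A sketch of that filter in Qdrant's JSON/REST form (the payload field names `session_id` and `agent_id` are assumptions about the schema):

```python
# Builds the payload filter applied to every query against the single
# `memories` collection: results must match both the session and the agent.
def agent_memory_filter(session_id: str, agent_id: str) -> dict:
    return {
        "must": [
            {"key": "session_id", "match": {"value": session_id}},
            {"key": "agent_id",   "match": {"value": agent_id}},
        ]
    }
```

Because isolation lives in the filter rather than in collection topology, adding a session or an agent never creates new collections, only new payload values.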

“I used ChromaDB at GovHack 2024. For persistent multi-agent memory with session scoping, Qdrant was the better tool — metadata filtering, persistent collections, and proper payload indexing out of the box.”

Cross-Phase Inconsistency Detection

1. EXTRACT — Phi-4-mini extracts structured claims from natural language (~150 tokens)
2. STORE — Claims upserted into the ClaimTable with topic, value, phase, turn
3. COMPARE — New claims checked against all prior claims on the same topic
4. ALERT — Agent surfaces the inconsistency in character, referencing the original value

Lightweight Extraction

Phi-4-mini handles claim extraction at ~150 tokens per call. No need for GPT-4o on a structured extraction task — use the smallest model that gets the job done.

Difficulty-Gated Sensitivity

Inconsistency thresholds vary by difficulty tier. Easy mode catches only major discrepancies. Hard mode flags even minor deviations — the system adapts its scrutiny to the scenario.
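The COMPARE step plus tier gating can be sketched together. Assuming a numeric claim shape and illustrative threshold values (the real comparison is richer than a relative-delta check):

```python
from dataclasses import dataclass

# Illustrative claim record matching the STORE step: topic, value, phase, turn.
@dataclass
class Claim:
    topic: str
    value: float
    phase: str
    turn: int

# Difficulty-gated sensitivity: Easy flags only large discrepancies,
# Hard flags even minor deviations. Threshold values are assumptions.
DISCREPANCY_THRESHOLDS = {"easy": 0.50, "medium": 0.20, "hard": 0.05}

def find_inconsistencies(new: Claim, prior: list[Claim], tier: str) -> list[Claim]:
    threshold = DISCREPANCY_THRESHOLDS[tier]
    conflicts = []
    for old in prior:
        if old.topic != new.topic or old.value == 0:
            continue
        if abs(new.value - old.value) / abs(old.value) > threshold:
            conflicts.append(old)   # the agent references this original value
    return conflicts
```

The same new claim can pass silently on Easy and trigger an in-character callout on Hard, purely from the threshold table.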

“The part I'm most proud of is the consistency layer. The system remembers everything you said — and if you contradict yourself twenty turns later, it notices.”

5-Stage Async Orchestration Pipeline


asyncio.gather Parallelism

Stages are strictly sequenced where dependencies exist — claims must be extracted before inconsistency checks — but maximally parallel within each stage. 3+ simultaneous LLM calls complete in under 5 seconds.
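The within-stage parallelism is plain `asyncio.gather`. A minimal sketch, with `asyncio.sleep` standing in for LLM calls (agent names and delays are illustrative):

```python
import asyncio

# One pipeline stage: independent agent calls run concurrently via gather,
# while the stages themselves remain strictly sequential.
async def run_stage(agent_calls):
    return await asyncio.gather(*(call() for call in agent_calls))

async def demo():
    async def agent(name: str, delay: float) -> str:
        await asyncio.sleep(delay)       # stands in for an LLM round trip
        return f"{name}: done"
    # three simultaneous calls take roughly max(delay), not the sum
    return await run_stage([
        lambda: agent("analyst", 0.02),
        lambda: agent("skeptic", 0.01),
        lambda: agent("advisor", 0.03),
    ])
```

`gather` preserves input order in its results, so the stage can map responses back to agents without extra bookkeeping.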

ContextVar Isolation

Each async task runs in its own context. Session state, memory streams, and claim tables are task-scoped — preventing cross-agent data leakage during parallel execution.

“Without context isolation, Agent 3's state leaks into Agent 5's reasoning. ContextVar gives you task-scoped state without passing context objects through every function signature.”
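A small sketch of the `ContextVar` pattern: each task created by `gather` gets its own copy of the context, so a value set inside one agent's task is invisible to the others (names here are illustrative):

```python
import asyncio
from contextvars import ContextVar

# Task-scoped state: no context object threaded through function signatures.
current_agent: ContextVar[str] = ContextVar("current_agent")

def deep_in_the_pipeline() -> str:
    # Reads the calling task's value, no parameter needed.
    return current_agent.get()

async def agent_task(agent_id: str) -> str:
    current_agent.set(agent_id)          # visible only to this task's context
    await asyncio.sleep(0)               # yield so tasks interleave
    return deep_in_the_pipeline()

async def run_all():
    return await asyncio.gather(*(agent_task(f"agent-{i}") for i in range(3)))
```

Even with three interleaved tasks, each one reads back its own agent ID, which is precisely the leak-prevention property described above.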

Multi-Phase Session Orchestration

Beyond the per-turn pipeline, sessions progress through a structured arc of 7 distinct phases. A while-loop in advance_phase() skips tier-inappropriate phases automatically — tier rules live in a data dict, not scattered conditionals.

Phase            | Handler      | Description
Onboarding       | Advisory     | Advisory agent welcomes user, learns context
Scouting         | 1-on-1       | User interviews individual agents before the main interaction
Preparation      | 1-on-1       | Advisory agent briefs user for the selected panel
Main Interaction | Multi-agent  | All selected agents respond in parallel each turn
Deliberation     | Auto-trigger | Agents confer privately (tier-conditional)
Resolution       | Multi-agent  | Interactive resolution phase (tier-conditional)
Review           | Advisory     | Advisory agent reviews key moments, delivers feedback

Tier-Conditional Skipping

3 difficulty tiers produce meaningfully different session experiences. Phase rules live in a data dict — advance_phase() loops until it finds a valid phase for the current tier.

Tier-Differentiated Behaviour

Each agent has tier-specific behaviour defined in markdown skill files. Easy mode is forgiving; Hard mode demands precision. Same agents, different personalities per tier — no code changes.

Phase Fork by Tier

Easy mode culminates in a group deliberation. Medium and Hard modes skip deliberation and enter an interactive resolution phase instead. The session arc adapts to the difficulty.

“The per-turn pipeline handles what happens within a turn. The phase state machine handles what happens across the session. Both are data-driven, both are tier-aware, and neither requires code changes to reconfigure.”

Agent Architecture

Three Core Methods

respond()

Generates the agent's turn output. Routes through Agent Framework when enabled, falls back to direct Azure OpenAI calls. Can invoke registered tools mid-turn.

decide()

Autonomous decision-making before each turn. Returns a structured AgentAction: SPEAK, WAIT, COMMIT, EXIT, or INVITE_COLLAB. Not scripted branching — LLM-evaluated.

generate_tells()

Produces non-verbal behavioural signals (expressions, micro-reactions) in parallel with the main response — masking latency with useful output.
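The structured output of `decide()` can be sketched as a typed decision. The five action values come from the text; the surrounding fields and the `parse_decision` helper are illustrative assumptions:

```python
from enum import Enum
from dataclasses import dataclass

# The five actions an agent can return from decide(), per the text above.
class AgentAction(Enum):
    SPEAK = "speak"
    WAIT = "wait"
    COMMIT = "commit"
    EXIT = "exit"
    INVITE_COLLAB = "invite_collab"

@dataclass
class AgentDecision:
    action: AgentAction
    interest: float        # assumed 0.0-1.0, feeds speaker resolution
    reason: str            # model's short rationale, kept for observability

def parse_decision(payload: dict) -> AgentDecision:
    """Parse the LLM's structured output into a typed decision."""
    return AgentDecision(
        action=AgentAction(payload["action"]),
        interest=float(payload["interest"]),
        reason=payload.get("reason", ""),
    )
```

Parsing into an enum up front means an unexpected action string fails loudly at the boundary rather than silently downstream.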

Agent Tools (20 total)

Tool               | Type       | Purpose
3 base tools       | Base       | Memory recall, observation recording, contradiction detection — available to all agents
8 specialist tools | Specialist | One per agent archetype — domain-specific analysis unique to each agent's expertise
4 advisory tools   | Advisory   | Structured feedback, scoring frameworks, and contextual guidance for the user
5 MCP tools        | MCP        | Session analysis, consistency reports, improvement plans, feedback summaries, PDF export

Data-Driven Agent Configuration

Markdown skill files define personality, domain knowledge, and tier behaviour. New agent archetypes require zero code changes — drop in a new skill file and the platform picks it up.

Speaker Resolution

Up to 8 agents decide independently each turn (SPEAK/WAIT/etc.), then a resolution algorithm picks 2 speakers based on interest level and turns since each agent last spoke — with force-speak after 3 silent turns. Genuine multi-agent coordination.
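The resolution pass can be sketched as a small scoring function. Assuming each decision carries the agent's action, interest level, and silent-turn count (the ranking details are illustrative):

```python
# Speaker resolution: agents forced by the 3-silent-turn rule go first,
# then willing speakers ranked by interest, capped at max_speakers.
def resolve_speakers(decisions: dict[str, dict], max_speakers: int = 2) -> list[str]:
    """decisions: agent_id -> {"action": str, "interest": float, "silent_turns": int}"""
    forced = [a for a, d in decisions.items() if d["silent_turns"] >= 3]
    willing = [a for a, d in decisions.items()
               if d["action"] == "speak" and a not in forced]
    # Rank willing speakers by interest, breaking ties toward the quieter agent.
    willing.sort(key=lambda a: (decisions[a]["interest"],
                                decisions[a]["silent_turns"]), reverse=True)
    return (forced + willing)[:max_speakers]
```

Note that force-speak overrides even a WAIT decision, which is what keeps a disengaged agent from disappearing from the conversation.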

Heuristic Exit Logic

Per-agent personality-driven autonomous exit decisions. Each agent has unique exit thresholds — one exits at low interest after a major inconsistency; another exits after a failed chemistry test. Not scripted; emergent from agent state.

Topic Normalisation

50+ aliases mapping claim variations to canonical topics. Ensures “revenue”, “annual revenue”, and “yearly income” all resolve to the same claim topic for consistent contradiction detection.
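A minimal sketch of the alias map, seeded with the example from the text (the non-revenue entries are hypothetical additions for illustration):

```python
# Alias map: claim-topic variants resolve to one canonical key so that
# contradiction detection compares like with like.
TOPIC_ALIASES = {
    "revenue": "revenue",
    "annual revenue": "revenue",
    "yearly income": "revenue",
    "headcount": "team_size",       # hypothetical extra entries
    "team size": "team_size",
}

def normalise_topic(raw: str) -> str:
    key = raw.strip().lower()
    return TOPIC_ALIASES.get(key, key)   # unknown topics pass through as-is
```

Lowercasing before lookup keeps the table small; unknown topics passing through unchanged means a missing alias degrades to a new topic rather than a crash.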

“If you need to redeploy to change agent behaviour, your architecture is wrong. Skill files are the config layer — personality, knowledge, and rules live in markdown, not Python.”

Key Design Decisions

Multi-Model Routing

6+ models deployed with purpose-driven routing. Each task type gets the optimal model for cost, latency, and quality.

  • GPT-5-Nano — Primary reasoning & decisions
  • DeepSeek-V3.1 — Long-form generation
  • Phi-4-mini / Phi-4-reasoning — Extraction & evaluation
  • Mistral-small / Grok-3-mini — Classification & analysis

Qdrant over ChromaDB

ChromaDB works for prototypes. For production multi-agent memory with session scoping and metadata filtering:

  • Session-scoped payload filtering
  • Persistent named collections
  • Per-agent isolation at query time
  • Docker-native with volume persistence

asyncio over Celery

This is real-time orchestration, not a job queue. Agents need to respond within a single HTTP request cycle.

  • Sub-5s latency for parallel LLM calls
  • ContextVar isolation per async task
  • No message broker dependency
  • Native Python 3.12 async

Feature-Flagged Agent Framework

The platform runs fully without an Azure subscription. Agent Framework integration is toggled with a single flag.

  • USE_AGENT_FRAMEWORK=true — Full Foundry integration
  • USE_AGENT_FRAMEWORK=false — Direct Azure OpenAI calls
  • Graceful degradation, no code changes
  • AF adds observability, not core logic

Infrastructure

  • nginx SPA — Single-file frontend. No build step, no bundler.
  • FastAPI — Orchestration, memory, claims, governance.
  • Qdrant — Vector DB with named volume for persistence.
  • MCP Server — 5 analysis tools, PDF reports, REST API.

$ docker compose up     # Local development — 4 containers
$ azd up                # Azure deployment — zero architecture changes

“Zero-to-deployed in one command. That's not marketing — that's the actual developer experience. Same containers, same networking, same volumes. Local and cloud are architecturally identical.”

DevSecOps CI/CD

Here's the thing about hackathon projects — most of them deploy from someone's laptop. I wanted this one to ship like a real product. Infrastructure as code, automated security scanning, no stored credentials, no portal clicking. The kind of pipeline you'd actually trust in production.

Infrastructure as Code — Bicep

7 Bicep modules define everything: Container Apps Environment, ACR, Key Vault with Managed Identity RBAC, Azure Files for vector DB persistence, Application Insights, and a budget alert. Subscription-scoped deployment creates its own resource group with deterministic naming. No portal clicking, no manual resource creation.

Zero-Delta Local-to-Cloud

The same 4 Docker containers that run via docker compose up deploy to Azure Container Apps via azd up. nginx reverse-proxies API, MCP, and static assets identically in both environments. The frontend bakes assets into the container at build time — no volume mounts in cloud, no missing files on deployment.

Dockerfile Hardening

Multi-stage builds strip build tooling from runtime images. All Python services run as non-root. Healthchecks use Python stdlib — no curl installed. nginx adds security headers (X-Frame-Options, X-Content-Type-Options, X-XSS-Protection, Referrer-Policy) and gzip compression.

Secrets & Identity

Key Vault holds API keys with RBAC-scoped access — the API container's system-assigned Managed Identity gets Key Vault Secrets User, nothing else. Secrets never appear in environment variables, config files, or container images.

GitHub Actions Pipeline

Two workflows. The PR gate runs 541+ unit tests on every pull request — nothing merges without passing. The main branch workflow is the interesting one:

1. Trivy Scan — Filesystem security scan
2. Unit Tests — 541+ tests, mocked LLMs
3. ACR Build — Remote AMD64 builds
4. azd deploy — OIDC federation

OIDC federation means no stored credentials in GitHub — no service principal secrets to rotate. Images build remotely in ACR on AMD64, solving the ARM Mac to cloud architecture mismatch without cross-compilation.

$ azd up    # 11 Azure resources provisioned, 4 images built,
            # security scanned, identity-bound, serving traffic

“I've seen too many hackathon projects that work on the presenter's laptop and nowhere else. This one deploys from a GitHub Actions workflow with security scanning, OIDC federation, and zero stored secrets. That's not over-engineering — that's how you show the platform actually works.”

Testing & Production Maturity

  • 541+ Unit Tests
  • 28 Integration Tests
  • 8 Agent Skill Files
  • 50+ Topic Aliases

I ended up with four levels of testing — each one catches things the others miss.

Level 1 — Unit Tests, Stubbed LLMs

541+ unit tests with mocked LLM calls. Fast, deterministic, run on every commit. Pytest markers separate these from slower integration runs. Covers memory operations, claim extraction, pipeline sequencing, and agent decision logic.

Level 2 — Integration Tests, Real LLMs

28 integration tests running specific scenarios against real Azure endpoints. These validate actual model behaviour — claim extraction accuracy, inconsistency detection sensitivity, and agent response quality under real latency conditions.

Level 3 — Manual Frontend, LLM-Suggested Test Data

I drive the browser manually, but the LLM suggests what data to enter — edge cases, contradictions, phase transitions. The AI knows what scenarios stress the system; I observe how the frontend handles them.

Level 4 — Simulator, LLM Drives the Frontend

A dedicated playtest agent drives the frontend end-to-end. I watch the results and note issues to feed back. Test scenarios accumulate into a growing list for repeat invocation — this becomes the input for eval runs.

Azure RAI Content Safety

Solved Azure AI content-filter false positives by redesigning prompt language (avoiding trigger phrases like “MUST”, “ABSOLUTE RULE”) and building slim message construction for Agent Framework threads.

“That's not comprehensive — it doesn't cover platform issues — but it works really well for my scenarios. Each level catches things the others miss, and the simulator scenarios keep growing into a proper eval set.”

“I wanted agents that remember, detect contradictions, and deploy to production with a single command. That's what this is — persistent memory, inconsistency detection, and real infrastructure.”
Vin Ralh