AI & iOS · 26 June 2026

How I Built a Secure AI Relay for an iOS App

Architectural considerations for deploying an iOS app to production with LLM support.

I wanted to add two AI-powered features to Spokenword, my iOS app: rhetoric analysis and gentle critique. I also wanted to harvest reusable patterns for subsequent apps.

The idea was simple. A poet taps a button, their poem is analysed for classical rhetorical techniques such as anaphora, tricolon and volta, and the results appear as inline annotations they can accept or dismiss. Simple enough, except for three problems.

API keys cannot live on the device.

Ship an Anthropic or Azure key inside your app binary and someone will extract it within hours. You wake up to a five-figure bill.

Poetry is sensitive.

Slam poets write about trauma, violence, identity and mental health. A blunt content filter would block half the poems at a competitive slam. You still need some safety layer to satisfy App Store review and your own conscience.

Cost is unpredictable.

Each analysis costs fractions of a cent, but without limits a single abusive user could run up hundreds of dollars in API calls before I noticed.

A poem detail view in Spokenword with Perform, Rehearse and AI Coach buttons

The AI Coach lives on each poem, beside Perform and Rehearse.

The answer was a thin relay server that sits between the app and the AI provider, handling authentication, safety, rate limiting and credit accounting. The app never sees the API key. The AI provider never sees who wrote the poem.

The relay answers the infrastructure questions: keys, safety and abuse. It leaves the harder one open, which is the one I find most interesting. Which model should it call? Good results from a premier frontier model are easy. My early analyses ran through Claude Opus and the feedback was excellent, the kind a poet would stop and read twice. But Opus is too expensive to sit behind a one-time app, whether I am the provider paying the bill or a poet running pass after pass over their own work. So the real question became how much I could squeeze out of small models. Could a cheap model perform well enough to be worth having, at a fraction of the price? That question runs through the rest of this post, and the eval pipeline is where I answer it.

The Architecture

This is where I took advantage of a deliberate move by the platform companies. Over the last couple of years, Cloudflare and Vercel have reshaped their edge networks to support AI workloads directly: serverless functions, key-value stores, SQLite at the edge, and gateways that sit in front of the model providers. Work that used to need a backend team is now a handful of managed primitives a solo developer can wire together.

The relay runs on Cloudflare Workers, serverless functions deployed to the edge. No server to maintain, no containers to patch, no uptime to watch. It scales to zero when nobody is using it and absorbs bursts without configuration.

Cloudflare was the deliberate default. I had already built Futurescapes on the same stack, so Workers, KV and D1 were known quantities, which meant the novelty budget for this project could go into the eval pipeline and the safety policy, not into learning a new platform. The trade-off is committing further to one vendor's edge ecosystem, but for a solo developer shipping a one-time purchase, reduced cognitive load and a single bill beat best-of-breed every time.


The relay sits on the edge. The app sends a poem and a token. The Worker authenticates, rate-limits, checks credit, builds the prompt, then forwards to the model and the safety classifier.

Defence in Depth

The relay trusts nothing. Every request passes through nine sequential checks before it reaches the AI provider. Fail any one and the request is rejected before a cent is spent.

| # | Check | What It Does |
| --- | --- | --- |
| 1 | Route check | Only POST /analyse and POST /critique are served. Everything else returns 404. |
| 2 | Body size | Anything over 64 KB returns 413, checked before the body is read. |
| 3 | Authentication | 256-bit shared secret, compared in constant time over SHA-256 digests to defeat timing side-channels. |
| 4 | Rate limiting | Dual-axis: a short rolling window per user, a longer one per IP. The IP is SHA-256 hashed before it touches storage. |
| 5 | Field validation | Required fields must be present and non-empty. |
| 6 | Poem length | Over 10,000 characters returns 400. That is roughly 200 lines, longer than any slam poem. |
| 7 | Provider whitelist | Only "azure" is accepted. Single provider by design. |
| 8 | Model whitelist | Only "DeepSeek-V3.2" is accepted, which blocks expensive model injection. |
| 9 | CORS blocked | No Access-Control headers. OPTIONS returns 403, so a browser cannot call this endpoint at all. |
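A minimal sketch of how the stateless checks in the table might be chained, assuming a simplified request shape. The `IncomingRequest` type, the `preflight` name, the 400 status for missing fields and whitelist failures, and handling OPTIONS ahead of the route check are all assumptions made for illustration; the post gives the checks, not the code.

```typescript
// Hypothetical, simplified view of a request. A real Worker would read
// these values from the Request object and its headers.
interface IncomingRequest {
  method: string;
  path: string;
  contentLength: number; // bytes, as declared by the Content-Length header
  body?: { poem?: string; provider?: string; model?: string };
}

type Verdict = { ok: true } | { ok: false; status: number; reason: string };

const ROUTES = new Set(["/analyse", "/critique"]);
const MAX_BODY_BYTES = 64 * 1024;
const MAX_POEM_CHARS = 10000;
const ALLOWED_PROVIDERS = new Set(["azure"]);
const ALLOWED_MODELS = new Set(["DeepSeek-V3.2"]);

function reject(status: number, reason: string): Verdict {
  return { ok: false, status, reason };
}

function preflight(req: IncomingRequest): Verdict {
  // CORS: browser pre-flight requests are refused (assumed to run first,
  // otherwise the route check would answer 404 instead of 403).
  if (req.method === "OPTIONS") return reject(403, "cors blocked");
  // Route check: only the two POST endpoints exist.
  if (req.method !== "POST" || !ROUTES.has(req.path)) return reject(404, "not found");
  // Body size: decided from the declared length, before reading the body.
  if (req.contentLength > MAX_BODY_BYTES) return reject(413, "body too large");
  // Authentication and rate limiting would sit here; they need state and
  // secrets, so they are left out of this sketch.
  const { poem, provider, model } = req.body ?? {};
  if (!poem || !provider || !model) return reject(400, "missing field");
  if (poem.length > MAX_POEM_CHARS) return reject(400, "poem too long");
  if (!ALLOWED_PROVIDERS.has(provider)) return reject(400, "provider not allowed");
  if (!ALLOWED_MODELS.has(model)) return reject(400, "model not allowed");
  return { ok: true };
}
```

Every branch returns before any provider call is made, which is the property the table describes: a failed check costs nothing.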

Why constant-time authentication?

A naive string comparison returns the moment it finds a mismatched character. An attacker can time the responses to different guesses, and a slower response means more leading characters matched. The relay instead hashes the presented and expected tokens with SHA-256 and compares the two fixed-length digests in constant time, so the comparison takes the same time regardless of how many characters match. A small thing. It matters.
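A sketch of that comparison, assuming the standard Web Crypto API (`crypto.subtle.digest`), which is available in Workers and in recent Node versions. The function names are hypothetical and this is not the relay's actual code.

```typescript
// SHA-256 of a UTF-8 string, via the Web Crypto API.
async function sha256(text: string): Promise<Uint8Array> {
  const data = new TextEncoder().encode(text);
  return new Uint8Array(await crypto.subtle.digest("SHA-256", data));
}

// Compare two byte arrays without exiting early: every byte is visited,
// and the differences are OR-ed into a single accumulator.
function equalBytes(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false; // digests are always 32 bytes
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a[i] ^ b[i];
  }
  return diff === 0;
}

// True when the presented token matches the expected shared secret.
async function tokenMatches(presented: string, expected: string): Promise<boolean> {
  const [p, e] = await Promise.all([sha256(presented), sha256(expected)]);
  return equalBytes(p, e);
}
```

Hashing first means both inputs to the loop are always 32 bytes, so the length of the presented token does not leak either. The Workers runtime also documents a non-standard `crypto.subtle.timingSafeEqual`, which could stand in for the hand-written loop there.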

Why hash the IP before storing it?

If someone ever dumps the KV namespace, they get a3f2b1... instead of xxx.xxx.xxx.xxx. No plaintext IPs in storage, ever. And the dual-axis limiter earns its keep: the per-user cap survives IP rotation on VPNs and mobile networks, while the loose per-IP cap copes with carrier-grade NAT, where thousands of users share one address.
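To make the dual-axis idea concrete, here is a sketch with an in-memory `Map` standing in for the KV namespace. The window sizes and caps are invented for illustration (the post does not state the real ones), and `hashIp` and `allowRequest` are hypothetical names.

```typescript
interface Limit { windowMs: number; max: number }

// Illustrative numbers only: a tight cap per user, a loose one per IP.
const PER_USER: Limit = { windowMs: 60000, max: 5 };
const PER_IP: Limit = { windowMs: 3600000, max: 200 };

// Stand-in for KV: key -> timestamps (ms) of recent accepted requests.
const hits = new Map<string, number[]>();

// Hex SHA-256 of the IP, so the raw address never reaches storage.
async function hashIp(ip: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(ip));
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, "0")).join("");
}

// Timestamps for this key that still fall inside the rolling window.
function recentHits(key: string, limit: Limit, now: number): number[] {
  return (hits.get(key) ?? []).filter((t) => now - t < limit.windowMs);
}

// A request is accepted only if BOTH axes are under their caps.
function allowRequest(userId: string, ipHash: string, now: number): boolean {
  const userKey = `user:${userId}`;
  const ipKey = `ip:${ipHash}`;
  const userHits = recentHits(userKey, PER_USER, now);
  const ipHits = recentHits(ipKey, PER_IP, now);
  if (userHits.length >= PER_USER.max || ipHits.length >= PER_IP.max) return false;
  hits.set(userKey, [...userHits, now]);
  hits.set(ipKey, [...ipHits, now]);
  return true;
}
```

A user who rotates IPs still hits the per-user cap, and a second user behind the same NAT address is not punished for the first one's traffic. One design point worth weighing: an unsalted hash of an IPv4 address can be reversed by enumerating the address space, so a keyed hash with a secret may be a stronger choice if the stored values matter.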

If a secret is stolen?

Rate limits cap the damage to cents per hour. Rotate the secret and the attacker is locked out immediately.

I did not want to take my own word for any of this. I ran the relay through Zapper, my own security pipeline, which puts deterministic scanners over the code, DAST, dependency and static analysis among them, maps every finding to recognised frameworks, and proposes concrete defence-in-depth fixes I can work through. On top of that I had Claude Opus carry out independent security reviews of the same code. The deterministic scanners catch what a model overlooks, and the model reasons about what a scanner cannot express. Running both, rather than trusting either alone, is what built my confidence in the security of the relay.

That is the lesson Zapper exists to enforce: no single control, and no single reviewer, covers every angle. I wrote about building it in “Defence in Depth Can't Be Delegated to an LLM”.

The Guardrails Problem

This is the part I thought about most carefully. Slam poetry regularly addresses racial violence, police brutality, mental health crises, sexual identity and war. Block that content and the app is useless for its audience. Moderate nothing and you will not pass App Store review, and you have no defence against genuine abuse.

The answer is flag-and-log, not block. Almost everything passes through to the AI provider, but every request runs through a safety classifier via Cloudflare AI Gateway, and the classification is logged. Each hazard category gets one of three dispositions.

Flag

Categories common in legitimate poetry: war, identity, mental health, protest. Logged for monitoring. The request proceeds normally.

Block

A small set of categories that are never legitimate poetry. The request is rejected before it reaches the AI, and the app points the poet to a local on-device model instead.

Ignore

Categories not relevant to poetry analysis. Skipped, to keep the signal clean.

At the Cloudflare layer, a flagged poem is analysed normally and the poet sees nothing of it. The flag exists only in the Cloudflare dashboard for monitoring, an audit trail I can show Apple if asked. The classifier costs fractions of a cent per evaluation, effectively free.
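The three dispositions can be sketched as a lookup from classifier category to action. The category names below are placeholders, not the real policy or the classifier's actual labels, and treating an unknown category as "flag" is an assumption.

```typescript
type Disposition = "flag" | "block" | "ignore";

// Hypothetical category names, for illustration only.
const POLICY = new Map<string, Disposition>([
  ["war", "flag"],
  ["identity", "flag"],
  ["mental_health", "flag"],
  ["protest", "flag"],
  ["never_legitimate_example", "block"],
  ["irrelevant_example", "ignore"],
]);

interface Decision {
  proceed: boolean;  // false when any detected category is blocked
  logged: string[];  // categories written to the audit trail
}

function decide(detected: string[]): Decision {
  const logged: string[] = [];
  let proceed = true;
  for (const category of detected) {
    // Assumption: a category missing from the policy is flagged, not dropped.
    const disposition = POLICY.get(category) ?? "flag";
    if (disposition === "ignore") continue;
    logged.push(category);
    if (disposition === "block") proceed = false;
  }
  return { proceed, logged };
}
```

For example, `decide(["war", "irrelevant_example"])` proceeds and logs only `war`, while any blocked category stops the request before the provider is called.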

That is the policy I set. It is not the only one in the path. The model runs on Microsoft Foundry, and Azure enforces its own default guardrail, Microsoft.DefaultV2, on every prompt and every response. It is stricter than mine, and I do not control it.

| Risk type | Checked at | Action |
| --- | --- | --- |
| Jailbreak | User input | Block |
| Hate (medium) | Input + output | Block |
| Self-harm (medium) | Input + output | Block |
| Sexual (medium) | Input + output | Block |
| Violence (medium) | Input + output | Block |
| Protected material, code | Output | Annotate |
| Protected material, text | Output | Block |

Azure's default guardrail, applied to the DeepSeek-V3.2 deployment.

On paper that looks like a problem. The Azure default blocks hate, self-harm, sexual content and violence at a medium threshold, the very themes slam poetry lives in. In practice it has not been. When I evaluated sensitive poems across those topics, the model layer handled them without tripping the filter. A poem treats grief or violence differently from the material the threshold is built to catch, and it tends to score below it. The Azure guardrail sits there as a backstop, not a gate.

That is defence in depth working as intended: a permissive policy I set at the edge for a creative tool, and a conservative default the provider enforces at the model, two independent layers from two vendors. On the rare occasion Azure does refuse a request, the poet gets a clear message pointing them to the on-device model, which runs with no network and no content restrictions. Catching the genuinely unacceptable content at the Cloudflare edge earns its keep here too: fewer requests reach the Azure filter, and a request shut down at the edge is one fewer chance of tripping the provider and putting the deployment's standing at risk.

Flag-and-log is also what threads the needle for App Store review. Apple wants to see that you have content moderation. They do not want to see you blocking poems about depression or protest. My edge policy does both, and the audit trail is there if anyone asks.

“The app never sees the API key. The provider never sees who wrote the poem.”

Credit Accounting Without a Subscription

I did not want a subscription. Spokenword is a one-time purchase. The AI Coach is framed as an experimental bonus: 10 free uses included with the app, shared across both lenses, rhetoric analysis and critique.

The credit system runs on Cloudflare D1, serverless SQLite. The principle is that credits are only deducted on success. Timeout? No deduction. Server error? No deduction. Content blocked by the safety filter? No deduction. The user only pays when they get something useful back.
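The deduct-on-success rule, sketched with an in-memory `Map` in place of the D1 table. The `Outcome` shape and the function names are hypothetical; only the rule itself and the figure of 10 free credits come from the post.

```typescript
type Outcome =
  | { kind: "success"; result: string }
  | { kind: "timeout" }
  | { kind: "server_error" }
  | { kind: "blocked" };

const FREE_CREDITS = 10;

// Stand-in for the D1 table: device UUID -> remaining credits.
const balances = new Map<string, number>();

// First sight of a device grants the free allowance.
function balanceFor(deviceId: string): number {
  const existing = balances.get(deviceId);
  if (existing !== undefined) return existing;
  balances.set(deviceId, FREE_CREDITS);
  return FREE_CREDITS;
}

// Check the balance, call the model, and decrement only on success.
async function analyseWithCredits(
  deviceId: string,
  callModel: () => Promise<Outcome>
): Promise<Outcome | { kind: "no_credit" }> {
  if (balanceFor(deviceId) <= 0) return { kind: "no_credit" };
  const outcome = await callModel();
  if (outcome.kind === "success") {
    balances.set(deviceId, balanceFor(deviceId) - 1);
  }
  return outcome;
}
```

Against a real database the decrement would more likely be a single conditional update (decrement where the balance is still positive), so two concurrent requests cannot both spend the last credit. That detail is an assumption about a sensible design, not something the post describes.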

The Try AI Coach intro screen, explaining 10 free uses, no subscription, and that poem text is the only thing sent off-device

The free-trial framing, and the off-device line, stated up front.


First launch grants 10 credits against a device UUID. Each analysis checks the balance, calls the model, and only decrements on a successful response.

Identity without accounts

Each install generates a UUID and stores it in the Keychain as an iCloud-synchronisable item. No email, no password, no sign-in flow. Because the item syncs, the same UUID follows the user across reinstalls and across devices on the same Apple ID. It is the only identifier the relay ever sees.

Because there are no accounts, there is no admin console either. For operational management I wrote a small set of scripts that deterministically manage credit usage and reset a device's UUID, so I can inspect a balance, grant or clear credits, or retire a test identity without a dashboard or hand-edited SQL.

Why I am in no hurry to charge

I could not solve this on the device. I tested Gemma 2B and 4B on the phone extensively, and the results were not good enough: the analysis was substandard, and on a typical lower-powered phone it was not fast enough to ship. That is what pushed the AI Coach to the cloud, and what gives it a real, if tiny, per-use cost.

So there is a genuine bill behind every analysis, and I have still kept the AI Coach to ten free uses and resisted turning it into a paid credit pack. Part of that is timing. At WWDC 2026 Apple made serious commitments to on-device and remote Apple Intelligence, with third-party model fallback options now including Google's models alongside ChatGPT. If capable local and server AI arrives as a platform feature apps can call, a paid-credit pack I build now could be redundant within a year. I would rather keep this a small, free experiment and see what the platform gives me than charge poets for infrastructure that might soon come for free.

How the AI Actually Analyses the Poem

The relay does not forward the poem verbatim. It prepends line numbers and wraps the text in a tuned prompt that tells the model exactly what to look for and how to format the response.

1: You memorised every word of the play
2: and never once stepped on the stage
3: stood in the wings for six months
4: holding a costume that fit you perfectly.
5:
6: This poem is for the understudies,
7: the ones who were ready,
8: the ones the audience never needed,
9: the ones who showed up anyway.

Line numbers matter. Without them the model says “in the second stanza” and the app has to guess which lines that means. With them it says “lines 7 to 9” and the highlighting is precise.
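Producing that numbered form takes only a few lines. This sketch assumes the `N: text` format shown in the example above, with blank lines keeping their number so stanza breaks stay visible; `numberLines` is a hypothetical name.

```typescript
// Prefix every line with its 1-based number. Blank lines keep a bare
// "N:" so the numbering matches what the app highlights.
function numberLines(poem: string): string {
  return poem
    .split(/\r?\n/)
    .map((line, i) => (line.trim() === "" ? `${i + 1}:` : `${i + 1}: ${line}`))
    .join("\n");
}
```

Numbering before any chunking means every detection comes back in the poem's own coordinates, whichever part of the text the model was shown.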

AI Coach rhetoric results: Parallelism and Tricolon detected on specific line numbers, each with a confidence score and Save or Dismiss controls

Each detection cites its lines and a confidence score. The poet saves or dismisses.

For long poems, over 40 lines, the relay splits the text into overlapping chunks at stanza boundaries, sends them in parallel, and merges the results, de-duplicating detections that appear in the overlap zone. The chunker is a hand-translation from the Python evaluation pipeline that validated the model's accuracy, and a layer of replay-based parity tests ensures the two implementations never drift apart.
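A sketch of stanza-boundary chunking with overlap, under stated assumptions: the overlap is one whole stanza repeated at the start of the next chunk, a stanza is never split even if it exceeds the limit, and detections carry absolute line numbers so duplicates from the overlap can be matched exactly. The real chunker's overlap size and merge rule are not given in the post, and all names here are hypothetical.

```typescript
interface Stanza { start: number; end: number } // 1-based, inclusive
interface Detection { technique: string; startLine: number; endLine: number }

const MAX_LINES = 40;

// Group consecutive non-blank lines into stanzas.
function findStanzas(lines: string[]): Stanza[] {
  const stanzas: Stanza[] = [];
  let start = -1;
  lines.forEach((line, i) => {
    const blank = line.trim() === "";
    if (!blank && start === -1) start = i + 1;
    if (blank && start !== -1) {
      stanzas.push({ start, end: i }); // line i (1-based) was the last non-blank
      start = -1;
    }
  });
  if (start !== -1) stanzas.push({ start, end: lines.length });
  return stanzas;
}

// Number of lines a chunk covers, including blank lines between stanzas.
function span(chunk: Stanza[]): number {
  return chunk.length === 0 ? 0 : chunk[chunk.length - 1].end - chunk[0].start + 1;
}

// Pack stanzas into chunks; each new chunk repeats the previous chunk's
// last stanza so techniques straddling the boundary are still seen whole.
function chunkStanzas(stanzas: Stanza[], maxLines: number = MAX_LINES): Stanza[][] {
  const chunks: Stanza[][] = [];
  let current: Stanza[] = [];
  for (const stanza of stanzas) {
    const candidate = [...current, stanza];
    if (current.length > 0 && span(candidate) > maxLines) {
      chunks.push(current);
      current = [current[current.length - 1], stanza];
    } else {
      current = candidate;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// Merge per-chunk results, dropping exact duplicates from the overlap.
function mergeDetections(perChunk: Detection[][]): Detection[] {
  const seen = new Map<string, Detection>();
  for (const detections of perChunk) {
    for (const d of detections) {
      const key = `${d.technique}:${d.startLine}-${d.endLine}`;
      if (!seen.has(key)) seen.set(key, d);
    }
  }
  return Array.from(seen.values()).sort((a, b) => a.startLine - b.startLine);
}
```

Each chunk would be sent to the model in parallel and the per-chunk results passed to `mergeDetections`. Exact-match de-duplication is the simplest possible rule; a production merge may also need to reconcile detections that overlap without matching line for line.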

A Learn More popover explaining the Parallelism technique inside the AI Coach

The same screen teaches as it analyses. Tap a technique and a short explanation appears, so a poet who has never heard the word “parallelism” can still understand why the model flagged those three lines. The detection is the hook. The definition is the lesson.

There is a line I was careful not to cross. The AI analyses; it does not create. It detects the techniques a poet has already used and explains them. It never writes a line, rewrites a line, or hands the poet words. The craft stays the poet's. To keep that honest in the interface, every AI-detected technique carries a sparkle marker, so it is never mistaken for something the poet wrote or annotated themselves.

The Eval Pipeline

Before shipping any of this I needed to know whether the AI actually finds the right techniques, or whether it hallucinates patterns that are not there. So I built a 64-poem evaluation corpus with hand-annotated ground truth, every technique marked by a human, line by line. Then I ran 45 evaluation runs across 11 models, from frontier cloud models down to small ones that fit on the phone, measuring precision, recall and F1. The shortlist:
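For reference, the three metrics can be computed from the model's detections and the hand annotations as below. The matching rule, exact agreement on technique and line range, is an assumption; the post does not say how a detection was counted as correct, and a real pipeline may allow partial overlap.

```typescript
interface Label { technique: string; startLine: number; endLine: number }
interface Scores { precision: number; recall: number; f1: number }

const keyOf = (l: Label): string => `${l.technique}:${l.startLine}-${l.endLine}`;

// precision = correct detections / all detections
// recall    = correct detections / all annotated techniques
// f1        = harmonic mean of the two
function score(predicted: Label[], truth: Label[]): Scores {
  const truthKeys = new Set(truth.map(keyOf));
  const predictedKeys = new Set(predicted.map(keyOf));
  let truePositives = 0;
  predictedKeys.forEach((k) => {
    if (truthKeys.has(k)) truePositives++;
  });
  const precision = predictedKeys.size === 0 ? 0 : truePositives / predictedKeys.size;
  const recall = truthKeys.size === 0 ? 0 : truePositives / truthKeys.size;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```

Low precision here means the model hallucinated patterns that are not in the annotations; low recall means it missed techniques the human marked.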

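| Model | Runs on | Accuracy | Cost | Sensitive poems |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 | cloud | Highest | $$$$ | passes |
| Claude Haiku 4.5 | cloud | High | $$$ | 1 blocked |
| DeepSeek V3.1 (chosen) | cloud | High | $$ | none blocked |
| GPT-4o-mini | cloud | Medium | $ | passes |
| Gemma 4 E4B | on-device | Medium | Free | passes |
| Gemma 4 E2B | on-device | Low | Free | passes |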

A selection from the 11 models tested, ranked on the 64-poem corpus. Cost is relative, most to least. GPT-4o-mini was a 10-poem pilot.

The frontier models were the most accurate. Claude Sonnet led on accuracy and recall, but it costs many times what DeepSeek does, too much to sit behind a one-time app and run on every analysis. At the other end, the on-device Gemma models were free but weaker, and, as I found earlier, not fast enough on a typical phone. Several budget options ruled themselves out: GPT-4o-mini only ever ran as a pilot, Phi-4 Mini returned nothing, Mistral emptied out on sensitive poems, one model rejected the prompts outright, and another kept returning malformed JSON.

DeepSeek sat in the sweet spot. It matched Claude Haiku on accuracy at a fraction of the cost, and, the detail that mattered most for an app full of poems about trauma, violence and identity, it never content-filtered a sensitive poem. None blocked across the whole corpus, where Haiku blocked one and the budget models refused or emptied out. Good-enough accuracy, near-free, and no moralising about the poetry. I ran the selection on DeepSeek V3.1 and later moved to V3.2, which the relay uses today.

With the model settled, the rest of the accuracy came from the pipeline. The detector-only setup won on the metric that users actually feel: higher recall, which reads as “the AI found the techniques”. It also costs half as much and runs faster, with no second verification call.

The evaluation also killed an idea that sounded good on paper. A two-stage pipeline, detect then verify each detection with a second call, was a measured no-op across 12 runs: the verifier added latency and cost with no measurable change in accuracy.

Nine techniques survived: anaphora, epistrophe, anadiplosis, parallelism, tricolon, asyndeton, antithesis, rhetorical question and volta. Each detection surfaces as “Suggested” in the app, and the poet confirms or dismisses. A noisy precision boundary does not degrade trust when the human has the final say.

And at a fraction of a cent per poem, this is the answer to the question I started with. A small model, kept honest by a human in the loop, can deliver enough value to be worth having, at a price a one-time app can carry. The frontier model was never the hard part. Getting most of the way there for a fraction of the cost was.

Two Lenses on the Same Poem

Rhetoric detection is one lens. The second is critique. Where the rhetoric lens returns tagged techniques on numbered lines, the critique lens returns prose: what lands, what a line is reaching for, where the poem might breathe differently when read aloud. Both run through the same relay, draw from the same pool of 10 free uses, and treat the poet as the editor, not the edited.

The critique lens returning prose feedback on a poem, on iPhone

Critique on iPhone.

The AI Coach critique lens on iPad, poem on the left and prose feedback on the right

The same critique, in split view on iPad.

The Stack

| Layer | Technology | Why |
| --- | --- | --- |
| App | SwiftUI + @Observable | Native iOS, zero dependencies. |
| Relay | Cloudflare Workers (TypeScript) | Serverless edge, scales to zero, no idle cost. |
| Rate limiting | Cloudflare KV | Key-value store with TTL, fail-open on outage. |
| Credit accounting | Cloudflare D1 | Serverless SQLite, no database to provision. |
| Content safety | Cloudflare AI Gateway + Llama Guard 3 | Flag-and-log, fractions of a cent per evaluation. |
| AI model | DeepSeek V3.2 via Microsoft Foundry | Best accuracy for the cost, and it never content-filtered a sensitive poem in the eval. |
| Receipt verification | Apple JWS (jose + @peculiar/x509) | Pinned to Apple Root CA G3. |
| Secrets | wrangler secret put | Never in code, never in config files. |

Lessons Learnt

What I have built is a reusable pattern: a mobile app that reaches large models through a thin, hardened web layer in between, never directly. The device holds no keys. The relay does everything that has to be trusted, authentication, rate limiting, safety, credit accounting, and the routing to whichever provider sits behind it. None of that is specific to Spokenword or to poetry. Any app that needs to put a model behind a button can adopt the same shape, harden it once, and reuse it across builds and deployments.

Measure before you tune.

Build the eval pipeline first. Every decision should be backed by data, not intuition.

Audit the ground truth.

Your benchmark is only as good as your labels. The model is probably better than your scores suggest.

Simple beats clever.

Detector-only beat detect-and-verify. Temperature 0 beat temperature 0.1. Line numbers beat few-shot examples. The boring approach usually wins.

Look beyond the frontier models.

The best frontier model is the easy answer, and usually the wrong one for a feature you ship at scale. A cheaper model, measured against your own eval and lifted with pipeline work, may get close enough at a fraction of the cost and latency. Save the frontier model for the jobs that genuinely need it.

Skip the heavyweight database.

You may not need a managed database from a major cloud vendor, or the bill that comes with it. Edge key-value stores and serverless SQLite (here, Cloudflare KV and D1) handle auth state, rate limits and credit accounting for cents a month, and scale to zero when nobody is using them.

The AI Coach ships inside Spokenword. Ten free analyses, no subscription, and an architecture where the key never leaves the edge and the poem never carries a name.

Part of a series on Spokenword. Start with “I'm a slam poet. I built an app.” and “I wasn't taught rhetoric at school. So I built it into my app.”

Vinod Ralh is an Enterprise Architect, currently shipping production AI systems.

#SpokenWord #AI #CloudflareWorkers #EdgeComputing #iOS #IndieApp #LLM #AppSecurity