luke.geek.nz Blog

Running LiteLLM on AKS with azd and Bicep

Fri, 12 Jun 2026 02:18:34 GMT

I've been spending time with LiteLLM and wanted to see how far I could take it as a self-hosted LLM gateway on Azure Kubernetes Service. The goal was simple: build a deployment that you can spin up with a single azd up command, with all the production bits - private networking, Redis caching, PostgreSQL for spend tracking, and a proper ingress with automatic TLS.

Turns out it works pretty well. Here's what I built and what I found.

What LiteLLM does

LiteLLM is an open-source proxy that sits between your applications and the LLM providers they call. You hit a single OpenAI-compatible endpoint, and LiteLLM routes to whatever backend you've configured - Azure OpenAI, Anthropic, OpenAI, or any of the 100+ providers it supports.

The useful bit for me was having a single place to manage authentication, rate limiting, caching, and spend tracking across all the models our team uses. Rather than distributing API keys for every provider, the team gets a virtual key from LiteLLM and I control which models they can access and what their budget is.

The deployment

I wanted to deploy this on AKS with everything managed through Infrastructure as Code. I used the Azure Developer CLI (azd) with Bicep for the infrastructure, with deployment lifecycle handled by PowerShell hooks.

Note: The full source is at lukemurraynz/LiteLLM.AKSGateway if you want to skip the walkthrough and just deploy it.

Infrastructure

The Bicep template provisions a greenfield environment in a single resource group:

Resource	Purpose
AKS cluster	Standard tier, system + user node pools, Azure CNI Overlay
Azure Container Registry	Stores the LiteLLM Docker image
PostgreSQL Flexible Server	Spend tracking, virtual key storage, user management
Azure Managed Redis	Distributed caching and rate limiting across replicas
Key Vault	Managed by the proxy for credential storage
Azure OpenAI	GPT-4o model deployment
VNet with 3 subnets	AKS nodes, private endpoints, and a dedicated ingress subnet
NAT Gateway	Outbound connectivity for the AKS cluster
Private DNS zones	Private endpoint resolution for all PaaS services

Every data service (PostgreSQL, Redis, ACR, Key Vault) is configured with private endpoints. No public IPs on the data plane. The AKS cluster uses Azure CNI Overlay with Azure Network Policy for pod-level segmentation.

Network design

The VNet is split into three subnets. The CIDR blocks are defined in infra/core/networking.bicep:

Subnet	CIDR	Purpose
`snet-aks`	`10.30.0.0/23`	AKS node pool (system + user)
`snet-pe`	`10.30.2.0/24`	Private endpoints for PostgreSQL, Redis, ACR, Key Vault
`snet-ingress`	`10.30.3.0/24`	Reserved for the NGINX ingress controller
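
As a quick sanity check on the table above, here is a minimal Python sketch that confirms the three CIDR blocks do not overlap and shows how many addresses each subnet leaves after the five that Azure reserves per subnet. The function names are illustrative and are not taken from the repository. With Azure CNI Overlay only the nodes draw addresses from snet-aks, as pods get theirs from a separate overlay range.

```python
import ipaddress

# CIDR blocks copied from the subnet table above.
SUBNETS = {
    "snet-aks": "10.30.0.0/23",
    "snet-pe": "10.30.2.0/24",
    "snet-ingress": "10.30.3.0/24",
}

# Azure reserves 5 addresses in every subnet.
AZURE_RESERVED = 5


def usable_addresses(cidr):
    """Addresses left for resources once Azure's reserved ones are removed."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED


def overlapping_pairs(subnets):
    """Return every pair of subnet names whose CIDR blocks overlap."""
    names = sorted(subnets)
    nets = {name: ipaddress.ip_network(subnets[name]) for name in names}
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


for name, cidr in SUBNETS.items():
    print(name, cidr, usable_addresses(cidr))
print("overlaps:", overlapping_pairs(SUBNETS))
```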

Outbound connectivity uses a NAT Gateway attached to the AKS subnet via a dedicated public IP (for example, pip-nat-vnet-). The AKS cluster uses userAssignedNATGateway as its outbound type, which avoids the SNAT port exhaustion that can happen with loadBalancer outbound type at scale.

Private endpoint DNS resolution uses four Azure Private DNS zones, each linked to the VNet:

privatelink.postgres.database.azure.com
privatelink.redis.azure.net
privatelink.azurecr.io
privatelink.vaultcore.azure.net

These zones are created in the Bicep template and linked automatically. Pods resolve private endpoint IPs through the CoreDNS split-DNS patch (more on that later).

The public ingress path is separate: the NGINX ingress controller gets a public LoadBalancer IP, and cert-manager handles the Let's Encrypt TLS certificate through the HTTP-01 challenge. The DNS A record is managed through the sync-public-dns.ps1 hook, or created manually if you prefer.

The `azd` lifecycle

azd up runs through several stages, each with a PowerShell hook:

Preprovision - generates random secrets for the PostgreSQL admin password, LiteLLM master key, and salt key. Also installs kustomize via Winget if it is not present.
Provision - deploys the Bicep template to Azure. This takes about 10-15 minutes and creates the full environment.
Postprovision - gets the AKS credentials, patches CoreDNS with a split-DNS configuration (Google DNS for public resolution, Azure DNS for private endpoint resolution), deploys the Kubernetes manifests via kustomize, and sets up the ingress controller.
Postdeploy - refreshes the Kubernetes secret with the latest connection strings and keys, restarts the deployment, and syncs the public DNS A record.

One thing I hit early on - the Bicep output aksClusterName was captured by azd as an environment variable, but azure.yaml was referencing ${AZURE_AKS_CLUSTER_NAME} which did not exist. A quick fix to ${aksClusterName} and the deployment ran clean.

The LiteLLM config

The proxy is configured through a ConfigMap that kustomize applies to the cluster. The config file defines the models, authentication, caching, and routing.

model_list:
  - model_name: azure-gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: os.environ/AZURE_OPENAI_ENDPOINT
      api_key: os.environ/AZURE_OPENAI_KEY
      api_version: "2024-10-21"

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  database_connection_pool_limit: 10
  proxy_batch_write_at: 60
  allow_requests_on_db_unavailable: true

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: os.environ/REDIS_PORT
    password: os.environ/REDIS_PASSWORD
    ssl: true

Adding OpenCode Zen and Go models

Once the Azure OpenAI model was working, I also wanted to see whether LiteLLM could front the OpenCode models I use day to day. The answer is yes. OpenCode Zen exposes an OpenAI-compatible endpoint at https://opencode.ai/zen/v1, and OpenCode Go exposes one at https://opencode.ai/zen/go/v1 for the OpenAI-compatible Go models.

That let me put Azure OpenAI, OpenCode Zen, and OpenCode Go behind the same LiteLLM endpoint:

Model group	Example models	Upstream base URL
Azure OpenAI	`azure-gpt-4o`	`os.environ/AZURE_OPENAI_ENDPOINT`
OpenCode Zen	`big-pickle`, `deepseek-v4-flash-free`, `mimo-v2.5-free`	`https://opencode.ai/zen/v1`
OpenCode Go (OpenAI-compatible)	`glm-5.1`, `kimi-k2.6`, `deepseek-v4-pro`	`https://opencode.ai/zen/go/v1`
OpenCode Go (Anthropic-compatible)	`minimax-m2.5`, `qwen3.7-plus`	`https://opencode.ai/zen/go`

The last row is the small gotcha. For the Anthropic-compatible Go models, LiteLLM's Anthropic provider appends /v1/messages automatically. If you set the base URL to https://opencode.ai/zen/go/v1, LiteLLM sends the request to /v1/v1/messages, and OpenCode quite correctly returns a 404. Setting the base URL to https://opencode.ai/zen/go fixes it.
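
The doubled path is easy to see with a simplified model of the behaviour described above. This is only a sketch of the "append /v1/messages to the base URL" rule. It is not LiteLLM's actual code, and the function name is made up for illustration.

```python
def anthropic_messages_url(api_base):
    """Simplified model: the provider appends /v1/messages to the base URL."""
    return api_base.rstrip("/") + "/v1/messages"


# Base URL that already ends in /v1 -> the path segment is doubled.
print(anthropic_messages_url("https://opencode.ai/zen/go/v1"))

# Base URL without /v1 -> the path comes out as intended.
print(anthropic_messages_url("https://opencode.ai/zen/go"))
```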

The single proxy now exposes 19 models. A few of the Go subscription models currently return a GoUsageLimitError because my Go monthly limit is exhausted, but that still proved the routing path was correct. The free Zen models, including big-pickle, are available through the same proxy key.

The UI is useful once the proxy is running. I recorded this walkthrough after logging in (the login step is deliberately not captured so the master key never appears in the recording). It shows the live dashboard moving through the current configuration: Models + Endpoints, MCP Servers, and Virtual Keys.

I also tested the virtual key path from the API. The flow below creates a scoped key, uses it for a chat completion, and records the follow-up issue I saw when trying to delete it.

Production settings from the LiteLLM docs

The LiteLLM production best practices page had a few things worth picking up.

The Redis config was the one that surprised me most - using host/port/password separately rather than a redis_url string is measurably faster (about 80 RPS more throughput, according to their benchmarks). I had originally used the URL format and switched it after reading that.

The database settings are less surprising but worth noting. proxy_batch_write_at: 60 batches spend updates every 60 seconds rather than writing on every request, which makes a meaningful difference to PostgreSQL write load. I paired that with database_connection_pool_limit: 10 - with 3 replicas and 4 workers each that's 10 x 3 x 4 = 120 total connections, which should sit comfortably inside the server's connection limit. It is worth checking max_connections for your own tier, because stock PostgreSQL defaults to 100.
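
The connection arithmetic is worth re-running whenever the replica count changes. A minimal sketch, assuming every worker process opens its own pool of database_connection_pool_limit connections (the function name is illustrative):

```python
def total_db_connections(pool_limit, replicas, workers_per_replica):
    """Worst-case connection count if every worker fills its own pool."""
    return pool_limit * replicas * workers_per_replica


# The figures quoted above: pool limit 10, 3 replicas, 4 workers each.
print(total_db_connections(10, 3, 4))

# The same pool limit if the deployment scales out to 10 replicas.
print(total_db_connections(10, 10, 4))
```

The second figure is the one to compare against the server's max_connections if an autoscaler is allowed to add replicas.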

The other two I'd put in any production gateway: allow_requests_on_db_unavailable: true keeps the proxy serving requests even if PostgreSQL is momentarily unreachable (useful when you're in a private VNet and the database briefly hiccups during a scale event), and LITELLM_MODE: "PRODUCTION" disables the load_dotenv() call that would otherwise look for a .env file inside the container at startup.

Testing multi-replica behaviour

One of the main reasons to use AKS is that you can run multiple replicas for high availability. I wanted to verify that the Redis-backed caching works across pods.

I split the capture into three shorter GIFs so each one shows a specific test. The values in these captures are from the live AKS deployment, with API keys redacted.

First, the proxy health and model catalogue:

Then the completion and cache checks:

I tested a few things through the public endpoint:

Test	Result
Readiness	`{"status":"healthy","db":"connected"}`
Model count	`19` models
Azure OpenAI call	`azure-gpt-4o`, `0.153s`, response `Hello!`
OpenCode Zen call	`big-pickle`, `0.155s`, response `4`, `150` reasoning tokens
Redis cache	`0.448s` miss, `0.132s` hit

Then I ran a cross-pod cache test against two different LiteLLM pods:

Pod A: litellm-proxy-9b9d6dffd-p2qqg on aks-userpool-12527619-vmss000001
Pod B: litellm-proxy-9b9d6dffd-qjfkr on aks-userpool-12527619-vmss000000

Pod A cache miss: 1.040s
Pod B cache hit:  0.636s

Pretty basic test case, but it proves the important bit: the response was cached by one replica and then served by another replica from Azure Managed Redis. That is the behaviour I wanted before trusting horizontal scale-out.

I also tested a shorter prompt through the public ingress:

Cache miss: 0.448s
Cache hit:  0.132s
Response:   Blue green

The NGINX ingress controller distributes requests across the pods transparently, and the Redis cache serves cached responses regardless of which pod handles the request.

AKS configuration for LiteLLM

A few settings in k8s/litellm-deployment.yaml are worth calling out.

Rolling updates and pod lifecycle

The deployment uses maxUnavailable: 0 and maxSurge: 1, so Kubernetes never drops below the desired replica count during a rollout. A new pod starts, passes its readiness probe, gets added to the service, and only then does an old pod get terminated.

The readiness probe hits /health/readiness with a 30-second initial delay. LiteLLM won't pass that probe until Prisma has finished running migrate deploy, which matters on first deploy because the schema migrations have to complete before the pod is ready for traffic. The liveness probe is separate - /health/liveliness with a 60-second initial delay and a 15-second period, so an unresponsive pod is restarted after three consecutive failures, roughly 45 seconds of failed checks.

Graceful shutdown uses terminationGracePeriodSeconds: 620 and a 5-second preStop sleep. The grace period is deliberately longer than LiteLLM's 600-second request timeout so in-flight requests can finish. The preStop sleep gives the load balancer a moment to deregister the pod before SIGTERM lands.

terminationGracePeriodSeconds: 620
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]
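
The relationship between those numbers can be checked with a few lines of Python. This is a sketch rather than anything from the repository. It relies on the fact that Kubernetes starts the grace-period countdown before the preStop hook runs, so the sleep eats into the same 620 seconds.

```python
def shutdown_slack_seconds(grace_period, request_timeout, pre_stop_sleep):
    """Seconds left over after the preStop sleep plus one full-length request."""
    return grace_period - (pre_stop_sleep + request_timeout)


# Values from the deployment: 620s grace, 600s request timeout, 5s preStop sleep.
slack = shutdown_slack_seconds(620, 600, 5)
print(slack)

# A negative result would mean SIGKILL could land before a slow request finishes.
print(shutdown_slack_seconds(600, 600, 5))
```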

Autoscaling and disruption budget

The HPA targets 60% CPU and 80% memory, scaling between 2 and 10 replicas. The PodDisruptionBudget sets minAvailable: 1, so kubectl drain during node maintenance can't terminate the last running pod. Worth having, especially in a two-node user pool.
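
For intuition on how those targets translate into replica counts, here is a rough Python sketch of the standard HPA scaling rule, desired = ceil(current x currentUtilisation / targetUtilisation), evaluated per metric with the larger result winning. It ignores the controller's tolerance band and stabilisation window. The function names are illustrative.

```python
import math

MIN_REPLICAS = 2
MAX_REPLICAS = 10


def desired_replicas(current, utilisation_pct, target_pct):
    """Core HPA rule for one metric, clamped to the configured bounds."""
    raw = math.ceil(current * utilisation_pct / target_pct)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, raw))


def hpa_decision(current, cpu_pct, mem_pct):
    """With several metrics the HPA uses the largest proposed replica count."""
    return max(
        desired_replicas(current, cpu_pct, 60),  # 60% CPU target
        desired_replicas(current, mem_pct, 80),  # 80% memory target
    )


# Two replicas running at 90% CPU and 40% memory.
print(hpa_decision(2, 90, 40))
```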

Read-only root filesystem

The container runs with readOnlyRootFilesystem: true, runAsNonRoot: true, and all capabilities dropped. LiteLLM needs a few writable directories - Prisma writes binaries to a cache directory, migrations state needs somewhere to live, and the UI needs writable paths for assets and logos. I used emptyDir volumes at /app/cache, /app/migrations, /app/var/litellm/ui, /app/var/litellm/assets, and /tmp.

The LITELLM_NON_ROOT=true environment variable adjusts the default UI paths to point into /app/var/litellm rather than trying to write into the container root.

Some things I learned

The CoreDNS split-DNS thing caught me early. AKS pods resolve DNS through CoreDNS, which by default forwards everything to the node's resolver - Azure DNS at 168.63.129.16. That works fine for private DNS zones (PostgreSQL, Redis, ACR all resolve correctly through private endpoints), but it breaks for public internet lookups. cert-manager's Let's Encrypt HTTP-01 challenge needs to resolve public domains, and with the default config it can't. The fix is a CoreDNS ConfigMap patch that sends public queries to 8.8.8.8 while keeping Azure private zones on 168.63.129.16 - the postprovision hook applies this automatically, but it's worth understanding why it's there.
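
The forwarding decision can be modelled in a few lines of Python. This is only a sketch of the routing rule, not CoreDNS configuration. The post does not show which suffixes the real patch matches, so reusing the four private DNS zone names here is an assumption.

```python
# Assumption: the split is keyed on the four private DNS zones listed earlier.
PRIVATE_ZONES = (
    "privatelink.postgres.database.azure.com",
    "privatelink.redis.azure.net",
    "privatelink.azurecr.io",
    "privatelink.vaultcore.azure.net",
)

AZURE_DNS = "168.63.129.16"
PUBLIC_DNS = "8.8.8.8"


def pick_upstream(name):
    """Choose the upstream resolver for a query name by zone suffix."""
    host = name.rstrip(".").lower()
    for zone in PRIVATE_ZONES:
        if host == zone or host.endswith("." + zone):
            return AZURE_DNS
    return PUBLIC_DNS


# Hypothetical private endpoint name -> Azure DNS.
print(pick_upstream("mydb.privatelink.postgres.database.azure.com"))

# Public name that cert-manager needs to reach -> public resolver.
print(pick_upstream("acme-v02.api.letsencrypt.org"))
```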

Related to that: the first deploy I did, cert-manager started the Let's Encrypt flow before the DNS A record had fully propagated. The challenge timed out and left a stale cm-acme-http-solver pod sitting in the namespace. Deleting the Certificate and CertificateRequest objects forced a fresh attempt once the DNS was actually ready.

Kubernetes secrets don't hot-reload - worth remembering. When I added the OpenCode Go API key I updated the secret in the cluster, but the running pods still had the old environment because envFrom secrets are only loaded at pod startup. A force delete of the pods fixed it.

The virtual key deletion path still needs a closer look. Creating a scoped key and using it for completions worked fine, but the delete step surfaced a Redis cluster MOVED error during auth cache invalidation. The key disappeared from the list so it seems functionally gone, but I wouldn't call that clean. Left it as a follow-up rather than pretending it didn't happen.

Operational checklist

Once the proxy is deployed, here is how I validate it is working:

# AKS cluster state
az aks show -g rg-llm -n aks-llm --query provisioningState -o tsv
kubectl get nodes
kubectl get pods -n litellm -o wide
kubectl get hpa -n litellm

# LiteLLM health and model catalogue
curl https://litellm.headinthecloud.co.nz/health/readiness
curl https://litellm.headinthecloud.co.nz/models -H "Authorization: Bearer "

# Test a completion
curl -X POST https://litellm.headinthecloud.co.nz/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{"model":"azure-gpt-4o","messages":[{"role":"user","content":"hello"}]}'

# Test Redis cache by repeating the same request
curl -X POST https://litellm.headinthecloud.co.nz/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{"model":"azure-gpt-4o","messages":[{"role":"user","content":"hello"}]}'

# Check Prometheus metrics
(Invoke-WebRequest https://litellm.headinthecloud.co.nz/metrics).Content | Select-String "litellm_"

# Check ingress TLS certificate
kubectl get certificate -n litellm -o wide
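
The completion check can also be scripted without curl. The sketch below builds the same POST to /chat/completions using only the Python standard library. The function name and the example host are placeholders, and the send step is left as a comment because it needs network access and a valid key.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Placeholder host and key, for illustration only.
req = build_chat_request("https://proxy.example", "sk-test", "azure-gpt-4o", "hello")
print(req.get_method(), req.full_url)

# To send it for real:
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Sending the same request twice and timing both calls is a simple way to reproduce the cache miss and hit comparison.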

The test suite in scripts/test-litellm.ps1 runs 26 checks across health, models, keys, chat, spend, metrics, and security. It cleans up generated test keys at the end.

./scripts/test-litellm.ps1 -BaseUrl https://litellm.headinthecloud.co.nz -MasterKey 

Cost and cleanup

This is not a free deployment. AKS Standard tier with two D4s_v3 node pools, Azure Managed Redis Balanced_B0, and PostgreSQL Standard_B2ms runs roughly $400-$500 per month in Australia East. The Azure OpenAI gpt-4o deployment adds pay-per-token cost.

To tear down the whole environment:

azd down --purge

The predown hook deletes the litellm Kubernetes namespace before the resource group removal starts, so there are no dangling load balancer resources. If you want to keep data for later, export the PostgreSQL database first and store the LITELLM_SALT_KEY somewhere safe - you will need it to decrypt stored credentials when you restore.

Wrapping up

Hopefully this gives you a starting point for running LiteLLM on AKS. The azd template handles the full deployment lifecycle, and the production configuration covers caching, rate limiting, spend tracking, and high availability out of the box.

I had a lot of fun setting this up. The mind boggles at what you could do with the MCP Gateway features on top of this - wiring up Microsoft Learn documentation tools or GitHub MCP servers through the same proxy. Something for the backlog.


From cloud adoption to value realisation

Sun, 07 Jun 2026 07:54:38 GMT

A lot of Azure programmes can answer one question pretty quickly: what did we deploy?

Landing zone. Done.
Workloads migrated. Done.
Monitoring enabled. Done.
Tags and budgets configured. Done.
Security baseline applied. Done.

Those are all useful things, and I am not downplaying them. They are part of getting cloud adoption right. But they are not the whole story.

The harder question is usually this one:

What is measurably better because we adopted Azure?

That is where the conversation shifts from cloud adoption to value realisation.

Adoption is not the finish line

The Microsoft Cloud Adoption Framework gives us a really useful way to think about the cloud journey. Strategy, Plan, Ready, Adopt, Govern, Secure, and Manage all have a place, and they are not only technical activities.

That matters because cloud adoption is not just moving workloads or deploying services. It is integrating cloud into the way the organisation works.

The trap I see is that teams often treat the Adopt phase as the finish line.

A workload is migrated.
A platform is available.
A dashboard exists.
A new service is live.

Then the project closes, and everyone moves to the next thing.

That is where value can quietly leak out of the programme. The technical delivery might be successful, but the outcome is left assumed rather than managed.

A simple way I think about it is:

Cloud adoption asks	Value realisation asks
Did we move or build the thing?	Did the organisation get better because of it?
Is the platform ready?	Can teams use it safely and consistently?
Is monitoring enabled?	Are decisions improving from the signals?
Do we have cost visibility?	Is spend changing behaviour?
Did we migrate the workload?	Did we retire cost, risk, or friction?

Cloud adoption creates the conditions for value. It does not automatically realise it.

The familiar pattern

This is a pattern I have seen a few times, in different shapes.

A business case gets approved. The platform team does the right thing and builds the Azure foundation properly. Workloads move. Governance is applied. Azure Monitor starts collecting useful data. Microsoft Defender for Cloud is switched on. Cost Management reports are available.

The project is technically successful.

Then, six months later, someone asks:

What did we get for the investment?

And the answer is harder than it should be.

Not because Azure failed. Usually the platform did exactly what it was asked to do.

The gap is that nobody kept ownership of the value after go-live. The original benefit owner moved on, the review cadence became operational only, and the measures that justified the work were not carried into the run state.

The team can show that the workload is healthy. They can show that it is patched, monitored, backed up, and policy-compliant. But they cannot easily show whether the business outcome improved.

That is the bit worth fixing.

Start with the outcome, not the service

This is where Azure roadmaps need a bit of discipline.

If the roadmap item says:

Migrate application X to Azure.

That is not wrong, but it is incomplete.

A better version is closer to:

Migrate application X to Azure so that recovery time improves, platform risk reduces, the team can make use of additional functionality, and the legacy hosting contract can be retired by Q4.

That second version gives you more to work with. It names the technical move, but it also names the expected value.

Some examples:

Azure activity	Better value framing
Deploy an Azure landing zone	Teams can launch compliant workloads faster, with inherited security and governance controls
Migrate virtual machines	Legacy infrastructure risk reduces, recovery improves, and old hosting costs can be retired
Build a Fabric data platform	Decision latency reduces because trusted data is available at the right cadence
Roll out Copilot or Azure AI	A named workflow improves, with human review, quality thresholds, and support ownership
Enable Azure Monitor	Teams can act before users are impacted, not only after an incident is raised

The technical activity still matters. It just needs to sit inside a value story.

Four questions before calling an Azure roadmap done

These are the questions I like to ask before treating an Azure roadmap item as ready.

What business outcome should change?

Be specific.

Not "modernise infrastructure".
Not "improve reporting".
Not "adopt AI".

Better outcomes sound like:

reduce time to recover from a priority incident
reduce manual reporting effort for the finance team
improve release confidence for a customer-facing workload
reduce audit preparation effort
improve cost visibility by product, team, or service
reduce dependency on unsupported infrastructure

If the outcome is hard to name, the roadmap item may still be a technical task rather than a value item.

That is fine, but call it what it is.

Who owns the value after go-live?

This is a big one.

The delivery owner is not always the value owner.

A platform team might own the Azure landing zone. An application team might own the workload. But the value might belong to operations, finance, risk, customer service, or a product owner.

If nobody owns the value after go-live, the value probably will not be reviewed.

For each roadmap item, I want to know:

Role	Question
Delivery owner	Who gets it live?
Operational owner	Who runs it after go-live?
Value owner	Who proves whether it was worth doing?
Decision owner	Who can change course if the measure does not move?

Those roles might be the same person in a small organisation. In larger environments, they often are not.

What should move first?

Lagging measures are useful, but they can arrive too late.

If the only measure is annual cost reduction, customer satisfaction, or revenue growth, you might wait months before you know whether the change is working.

A leading indicator gives you an earlier signal.

For example:

Outcome	Early signal
Faster release cycle	Deployment frequency or lead time changes within 30-60 days
Better reliability	Incident volume, alert quality, or mean time to restore starts improving
Better reporting	Manual report preparation time reduces
Better cost ownership	Teams review cost by workload or product each month
Better adoption	Repeat usage grows after the first training wave

Azure gives us a lot of telemetry. Azure Monitor, Application Insights, Log Analytics, Cost Management, Defender for Cloud, and Azure Policy can all provide useful signals.

The trick is connecting those signals to the outcome the organisation cares about.

What gets retired?

One of the cleanest ways to realise value is to stop paying for, supporting, or working around something.

Cloud programmes sometimes add the new thing but keep the old thing alive.

New platform, old process.
New dashboard, old spreadsheet.
New cloud environment, old hosting contract.
New automation, old manual approval path.
New governance model, old exception process.

That is how cost and complexity grow together. Someone once told me:

"Cloud is not where you work, it's how you work."

For each Azure roadmap item, ask:

What do we stop, retire, reduce, or simplify if this works?

That could be:

a legacy server or hosting contract
a manual reporting process
an old integration path
duplicate tooling
a support burden
a recurring security exception
a governance meeting that no longer adds value

If nothing gets retired or simplified, the organisation might still choose to proceed, but it should do so with eyes open.

A few field learnings

None of these are tied to a specific customer. They are patterns I have seen often enough that I now look for them early.

A landing zone without a first workload is hard to explain

Azure landing zones are important. I am a big fan of doing the foundation properly: identity, network, management groups, policy, logging, security baseline, subscription structure, and deployment patterns.

But a landing zone by itself can be a hard thing for a business sponsor to value.

The value becomes clearer when it is tied to the first workload or product team that uses it. To keep things aligned, I generally push for an answer to the question: what is the business strategy, and how can the platform help support it?

Instead of saying:

We deployed the landing zone.

Try to get to:

The first product team can now deploy into a governed subscription without a custom security review every time.

That moves the conversation from foundation delivered to friction removed.

Cost visibility is not the same as cost ownership

I have seen teams build good Azure Cost Management views, dashboards, tags, and budgets, then wonder why spend behaviour does not change.

The missing part is usually ownership.

If a team can see cost but cannot change architecture, scale settings, licensing, retention, or product priority, then visibility becomes reporting rather than management.

The FinOps conversation needs at least three groups in the room:

Group	What they bring
Finance	Budget, forecast, and commercial discipline
Engineering or platform	Technical options and trade-offs
Business or product owner	Value judgement and prioritisation

If one of those is missing, the conversation usually turns into either "cut spend" or "explain spend", neither of which is enough. I do feel Azure Service Groups will help here - they provide a way to group and govern resources across subscriptions by business context, making it easier to align cost visibility with the workloads and products that own the spend. I have started including them in all deliverables for that reason.

Monitoring has to produce a decision

Azure Monitor, Application Insights, and Log Analytics can give you a lot of useful telemetry.

The trap is building dashboards that nobody uses to make a decision.

For each dashboard or workbook, I like to ask:

who looks at this?
how often?
what decision can they make from it?
what action happens when the number is red?
what gets ignored because it is noise?

If the answer is "we might need it someday", the dashboard is probably an archive, not an operational tool.

AI pilots need a boring operating model

AI pilots are exciting, but the value usually comes from the boring bits around them.

Who owns the workflow? Who reviews the output? What quality threshold is acceptable? What happens when the model is wrong? Who pays for run cost if usage grows? Who supports users after the demo?

Those questions are less flashy than the demo, but they are usually what decides whether the pilot becomes useful.

The old thing has a habit of surviving

One of the easiest value leaks to miss is the old thing staying alive.

The old report. The old server. The old approval process. The old spreadsheet. The old vendor contract. The old incident workaround.

Sometimes there is a good reason. Maybe the migration is phased, the risk is too high, or the replacement has not proven itself yet.

But if the old thing survives because nobody owns the retirement decision, the business case quietly weakens.

I now try to make retirement part of the roadmap item, not an afterthought.

Value needs a review cadence

A value measure without a review cadence is mostly decoration.

The cadence does not need to be heavy. In fact, it is better if it isn't.

A few examples:

Cadence	Useful for
Monthly service review	Operational health, incidents, support, adoption, cost anomalies
Quarterly value review	Outcome progress, roadmap adjustment, benefit tracking
FinOps iteration	Spend, optimisation, forecasting, unit economics
Well-Architected review	Workload risk, trade-offs, quality improvement
Product or platform steering	Priorities, ownership, funding, and stop/change/continue decisions

This is where the Cloud Adoption Framework and Well-Architected Framework complement each other nicely.

CAF helps shape the adoption and operating model. Well-Architected helps you keep improving workload quality across reliability, security, cost optimisation, operational excellence, and performance efficiency.

FinOps adds another important lens: cost decisions need to be connected to business value, not just lower spend.

To be frank, that is where some cloud cost conversations go sideways. Saving money is good, but the better question is whether the spend is producing enough value for the workload, product, or service it supports.

A simple value realisation loop

The model I keep coming back to is a six-step loop:

Define the business outcome - name what should improve, not just what gets built
Identify the Azure capability - the service, pattern, or platform that enables it
Assign ownership - delivery, operational, and value owners, not just a project lead
Set the leading indicator - the earliest signal that the outcome is moving
Review on cadence - a scheduled checkpoint where the measure is actually looked at
Stop, change, or continue - a decision trigger so the work stays accountable

Nothing clever there, but it forces the right conversation.

If an item cannot make it through that loop, it is probably not ready to be called a strategic roadmap item yet. It might still be necessary technical work, but be clear that this is what it is.

A practical example

Take a simple migration example.

Weak framing:

Migrate the claims application to Azure.

Better framing:

Migrate the claims application to Azure so that recovery time improves, unsupported infrastructure risk is removed, deployment lead time reduces, and the legacy hosting arrangement can be retired.

The value realisation loop might look like this:

Element	Example
Outcome	Claims platform is more resilient and cheaper to operate
Azure capability	Azure landing zone, workload migration, Azure Monitor, Azure Backup, policy baseline
Operating owner	Application owner plus platform operations
Leading indicator	Restore test completes within RTO target within 30 days; deployment lead time drops from days to hours within 60 days; support tickets down 20% by month 3
Review cadence	Monthly service review for 3 months post go-live, then quarterly
Stop/change/continue	Continue if recovery and deployment measures improve; revisit scope if legacy hosting contract cannot be confirmed for retirement by end of Q3

The point is not the table itself - it is that the team now has something to look at together at the monthly review, rather than only tracking whether the workload is green.

Where this fits for Azure teams

For Azure architects and platform teams, this does not mean turning every technical task into a business case.

Some work is foundational. Some work is hygiene. Some work is risk reduction. Not everything needs a dramatic value story.

But the bigger the investment, the more important this becomes.

If you are asking for funding, executive attention, delivery capacity, migration downtime, security exceptions, or behaviour change, then value ownership matters.

At minimum, each major Azure roadmap item (I would do this as an Epic in Agile) should have:

a named outcome
a value owner
one leading indicator
one lagging measure if available
a review cadence
a clear statement of what gets retired, reduced, or simplified
a stop/change/continue trigger

That is enough to move the conversation from "we adopted Azure" to "Azure helped us improve something that mattered."

Final thoughts

The best Azure roadmap is not the one with the most services on it.

It is the one where every platform decision traces to an outcome, every outcome has an owner, every owner has a measure, and every measure is reviewed after go-live.

Adoption gets you into Azure. Value realisation proves Azure was worth adopting.

Hopefully, this helps you look at your next Azure roadmap with a slightly sharper lens.

Before approving the next wave, ask:

What will be measurably better 90 days after this goes live, and who owns proving it?


OMO Teams: Multi-agent project delivery with ARB gates

Mon, 01 Jun 2026 02:28:10 GMT

I've spent the last year building AI agent workflows for Azure projects, and I kept running into the same problem. The agents were useful in isolation - writing code, reviewing PRs, checking security - but there was no structure connecting them. No governance. No audit trail. No one could tell me who approved what and why.

So I built some teams, using Oh My OpenAgent's Team Mode on top of the open-source OpenCode harness.

The idea is simple: five phases, each with a dedicated team of AI agents, and an Architecture Review Board (ARB) gate between them. The gate has real voters, real quorum rules, and a real escalation path when things deadlock. Every decision gets committed as a Markdown file - essentially governance as code.

And because I believe in eating your own dog food, I used OMO Teams to build the OMO Teams Quickstart. This post walks through what happened.

The five-phase model

The lifecycle covers everything from intake to production. Each phase has a different voter set because the decisions at each gate are different.

Phase	Voters	What they're signing off
Phase 0 - Intake	Product Owner, Cloud Economics	Is the business case sound? Is the budget real?
Phase 1 - Architecture	Principal Architect, Security Lead, Product Owner	Are the ADRs correct? Is the threat model complete?
Phase 2 - Build	Principal Architect, Product Owner	Does the code match the architecture?
Phase 3 - Validate	Security Lead, Product Owner, Cloud Economics	Do the tests pass? Are there open security findings?
Phase 4 - Prod	Product Owner, Security Lead, Cloud Economics	Are the runbooks ready? Has DR been tested?

The voters are defined in a YAML config file that gets fed into a Python tally script. The script reads individual vote files, checks quorum, and writes an immutable outcome. No dashboards, no ticket systems, no approval workflows in a SaaS tool. Just Markdown and a Python script.
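To make the "immutable outcome" idea concrete, here is a minimal sketch of the three steps: read vote files, check who has not voted, and write the outcome exactly once. The `vote:` line format, the file layout, and all function names are assumptions for illustration, not the Quickstart's actual tally script.

```python
from pathlib import Path


def read_votes(votes_dir: Path) -> dict:
    """Map voter name (file stem) to the value of its 'vote:' line (hypothetical format)."""
    votes = {}
    for vote_file in sorted(votes_dir.glob("*.md")):
        for line in vote_file.read_text(encoding="utf-8").splitlines():
            if line.lower().startswith("vote:"):
                votes[vote_file.stem] = line.split(":", 1)[1].strip().lower()
                break
    return votes


def missing_voters(expected: set, votes: dict) -> set:
    """Voters named in the config that have not written a vote file yet."""
    return expected - set(votes)


def write_outcome_once(path: Path, outcome: str) -> bool:
    """Create the outcome file; refuse to overwrite one that already exists."""
    try:
        # Mode "x" raises FileExistsError instead of truncating an existing file.
        with open(path, "x", encoding="utf-8") as handle:
            handle.write(f"outcome: {outcome}\n")
    except FileExistsError:
        return False
    return True
```

Opening the file in exclusive-create mode is one simple way to keep a recorded outcome from being silently replaced; committing the file to Git then gives the history.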

How the teams are wired

Before walking through what happened, it helps to understand how these teams actually get their capabilities.

Each team is a JSON config file in .omo/teams/. A team has a lead (always the atlas subagent type, which acts as the phase coordinator) and a list of members. Every member has two things: an always_load skill list and a conditional_load list.

always_load gives the member its baseline capabilities - skills that load regardless of the project. conditional_load is where it gets interesting: the orchestrator scans the ADR files for keywords and loads additional skills only when they match. A backend architect working on a Cosmos DB project gets cosmos-db-nosql-patterns loaded automatically. One working on Postgres gets postgresql-npgsql instead. The member prompts stay generic; the skills carry the domain depth.

"always_load": ["context-map", "adr-management", "azure-container-apps"],
"conditional_load": [
  { "if_adr_contains": ["networking", "vnet", "private endpoint"],
    "load": ["private-networking", "azure-network-security-perimeter"] },
  { "if_adr_contains": ["managed identity"], "load": ["identity-managed-identity"] }
]
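The keyword matching could be sketched roughly like this. The config keys come from the snippet above; the `resolve_skills` function and the case-insensitive substring match are assumptions about how an orchestrator might apply them, not the actual implementation.

```python
def resolve_skills(member: dict, adr_text: str) -> list:
    """Return always_load skills plus conditional skills whose keywords appear in the ADR text."""
    text = adr_text.lower()
    skills = list(member.get("always_load", []))
    for rule in member.get("conditional_load", []):
        # Assumption: a rule fires when any one of its keywords is a substring of the ADR text.
        if any(keyword.lower() in text for keyword in rule["if_adr_contains"]):
            for skill in rule["load"]:
                if skill not in skills:
                    skills.append(skill)
    return skills


member = {
    "always_load": ["context-map", "adr-management", "azure-container-apps"],
    "conditional_load": [
        {"if_adr_contains": ["networking", "vnet", "private endpoint"],
         "load": ["private-networking", "azure-network-security-perimeter"]},
        {"if_adr_contains": ["managed identity"], "load": ["identity-managed-identity"]},
    ],
}

print(resolve_skills(member, "ADR-003: use managed identity for all Azure resource authentication"))
# ['context-map', 'adr-management', 'azure-container-apps', 'identity-managed-identity']
```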

Members also consume load_instructions - paths to instruction files (coding standards, writing style guides, data sovereignty rules) that get prepended to the member's prompt before it runs. A backend builder picks up C# and ASP.NET Core standards automatically. An infra builder picks up Bicep conventions. The knowledge lives in the instruction files, not duplicated across every member prompt.

Members can also declare trivial_project_skip: true. For a simple stateless API like LinkSnap, the orchestrator reads a trivial-project-mode marker and skips members that aren't relevant - the data architect, agent architect, frontend builder, and MCP server builder all sat out. That's not a limitation; it's the teams adapting to project scope.

The orchestrator detects the current phase by reading .sisyphus/state/ - a set of marker files that gate progression. No state variable in memory, no database. Just files on disk. phase1-gate-approved.md exists → Phase 2 can start. It doesn't exist yet → ARB Gate 1 must run first.
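A minimal sketch of file-based phase detection, assuming every gate follows the same `phaseN-gate-approved.md` naming as the Phase 1 marker mentioned above (the post only names that one file, so the pattern for the other phases is a guess):

```python
from pathlib import Path
from typing import Optional


def current_phase(state_dir: Path, phase_count: int = 5) -> Optional[int]:
    """Return the first phase whose gate marker is missing, or None when every gate is approved."""
    for phase in range(phase_count):
        # Assumed naming pattern; only phase1-gate-approved.md is confirmed by the post.
        if not (state_dir / f"phase{phase}-gate-approved.md").exists():
            return phase
    return None
```

Because the state is just files, restarting the orchestrator or inspecting progress by hand needs nothing more than a directory listing.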

The project: LinkSnap

The application itself is deliberately boring. It's a URL shortener API built with FastAPI, deployed on Azure Container Apps, with Cosmos DB for NoSQL. About 200 lines of Python and 60 lines of Bicep. I wanted the product to be simple enough that nobody would mistake it for the interesting part. The interesting part is the process.

Phase 0: intake

Phase 0 is a single-member team: the product-owner agent, running on a writing category model with quick as fallback.

The product-owner carries a substantial always_load skill set:

create-prd - structures the project requirements document
business-case-investment-justification - formalises the financial case
stakeholder-map - identifies who needs to be in the room
risk-register - Likelihood × Impact matrix
agent-economics - estimates token spend across all five phases before a single line of code is written
identify-assumptions, pre-mortem, prioritize-features, value-proposition, naming-strategist, user-stories

The agent-economics skill matters: running it at intake means the ARB gate voters know what the agent budget is before they approve anything. It's not just infrastructure cost - it's a budget for the agents doing the work.

The output gate requires six documents to exist before the ARB will even convene: problem statement, success metrics, budget constraints, compliance scope, risk register, and stakeholder map. If any are missing, the gate blocks.

For LinkSnap, the product-owner ran a discovery session on the project brief. The output was a risk register with seven items, a cost estimate showing $60/month, and an agent economics budget of around $18 in total token spend across all five phases.

The risk register is worth a closer look because it shows the kind of thing that usually gets missed until it becomes a problem.

R01 - Team has no production experience with ACA secrets
      Likelihood: High, Impact: High
      Mitigation: Spike on managed identity + Key Vault in Phase 1

R06 - No DR plan for single-region ACA deployment
      Likelihood: Low, Impact: High
      Acceptance: Document in runbook as known gap

R01 turned out to be prescient. More on that later.
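For reference, a Likelihood × Impact score for entries like R01 and R06 can be computed as below. The three-point numeric scale is an assumption; the post only gives the qualitative ratings.

```python
# Assumed numeric scale; the risk register itself only records Low/Medium/High.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}


def risk_score(likelihood: str, impact: str) -> int:
    """Multiply the two ratings to get a single sortable score."""
    return LEVELS[likelihood] * LEVELS[impact]


register = {"R01": ("High", "High"), "R06": ("Low", "High")}
ranked = sorted(register, key=lambda risk_id: risk_score(*register[risk_id]), reverse=True)
print(ranked)  # ['R01', 'R06']
```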

Gate 0 to 1 passed unanimously. Product Owner approved the business case. Cloud Economics signed off on the $60 monthly estimate with a 20% variance clause. The Quickstart had a green light.

Phase 1: architecture

Phase 1 is the most complex team: six members running in three sequential waves, each wave gated by artifacts from the previous one.

Wave 1 - backend-arch and infra-arch run in parallel. Neither needs the other's output to start.

The backend-arch member uses a quick model and always loads context-map, adr-management, dotnet-backend-patterns, typespec-api-design, observability-monitoring, and azure-well-architected-assessment. Its job is to produce the backend API ADR (REST style, Problem Details, versioning), an async/event strategy ADR, an OpenAPI contract, and an observability design. For LinkSnap, cosmos-db-nosql-patterns loaded conditionally because the word "cosmos" appeared in the project brief.

The infra-arch member always loads azure-container-apps, azd-deployment, azure-deployment-preflight, finops, and azure-well-architected-assessment. It produces the hosting platform ADR with a cost estimate, a secret management ADR, a deployment toolchain ADR, and a DR strategy ADR. For LinkSnap, identity-managed-identity loaded conditionally when the "key vault" and "secret" keywords matched.

Wave 2 - data-arch and agent-arch require at least two ADRs to already exist before they run. They self-report missing prerequisites by writing a *-waiting.md file and stopping. The orchestrator's pre-flight scan catches these waiting files at the start of each invocation and runs the blocker first. Both members have trivial_project_skip: true - they sat out for LinkSnap entirely.
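The self-reporting pattern might look something like this sketch. The `ADR-*.md` glob, the directory arguments, and both function names are illustrative assumptions; only the `*-waiting.md` convention comes from the description above.

```python
from pathlib import Path


def check_prerequisites(member: str, adr_dir: Path, state_dir: Path, min_adrs: int) -> bool:
    """Return True if the member may run; otherwise write '<member>-waiting.md' and return False."""
    adr_count = len(list(adr_dir.glob("ADR-*.md")))
    waiting = state_dir / f"{member}-waiting.md"
    if adr_count < min_adrs:
        waiting.write_text(f"{member} needs {min_adrs} ADRs, found {adr_count}\n", encoding="utf-8")
        return False
    waiting.unlink(missing_ok=True)  # clear a stale marker once prerequisites are met
    return True


def preflight_blockers(state_dir: Path) -> list:
    """List members that left a waiting marker, so the orchestrator can run their blockers first."""
    suffix = "-waiting.md"
    return sorted(path.name[: -len(suffix)] for path in state_dir.glob(f"*{suffix}"))
```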

Wave 3 - security-arch and cloud-economics need at least four ADRs plus a data model before they start.

The security-arch member always loads threat-modelling, api-security-review, and risk-register. It writes a STRIDE threat model, an OWASP security review, and updates the risk register with any high or critical findings. For agentic projects it picks up owasp-agentic and agent-governance-toolkit conditionally.

The cloud-economics member always loads finops, cost-optimization, agent-economics, and azure-cost-calculator. It produces the Azure cost estimate and an agent economics phase report. The exit gate requires at least six ADRs before the ARB can convene.

For LinkSnap: three ADRs got written. ACA hosting (ADR-001), Cosmos DB partition key on tenant_id (ADR-002), and managed identity for all Azure resource authentication (ADR-003). The backend-arch member reviewed the ADRs and confirmed the Cosmos partition key choice was correct. The infra-arch member approved the hosting and secret management approach.

Then the security ARB voter flagged a conditional.

The managed identity ADR said "use managed identity" but didn't say how local development would authenticate. The security lead's point was fair: without a documented local auth path, someone would inevitably drop a connection string into a .env file and commit it. Two mandatory revisions came out of it: document the DefaultAzureCredential fallback path, and decide whether Cosmos DB would use a private endpoint or public access with IP firewall.

Voter	Vote	Why
Principal Architect	Approved	ADRs are solid
Security Lead	Conditional	Need local auth path and network decision
Product Owner	Approved	Conditions are manageable

The gate passed as CONDITIONAL. Phase 2 couldn't start until both revisions were committed. It took about an hour to resolve. That's the whole point of a gate - catch it before it becomes code.

Phase 2: build

Phase 2 has six members and runs in two waves. The first wave produces contracts; the second wave builds everything.

Wave 1 - contracts only. The backend member writes the OpenAPI spec to api-contracts/openapi.yaml, a contract summary, and a backend-contracts-ready.md state marker, then stops. The mcp-server member (if not skipped) writes tool schemas to tool-contracts/tools.json and stops. Wave 2 can't start until the contracts exist - the frontend and AI engineer members check for their contract files and write a *-waiting.md file if they're missing.

For LinkSnap, both mcp-server and frontend had trivial_project_skip: true, so Wave 1 was just the backend producing its OpenAPI contract.

Wave 2 - full implementation. Two members ran for LinkSnap (four were skipped):

The backend member uses a deep category model - the highest compute tier in the team. Its always_load includes dotnet-backend-patterns, observability-monitoring, and code-review. The identity-managed-identity and azure-role-selector skills loaded conditionally from the managed identity ADR. It built the FastAPI app with three endpoints (/health, POST /links, GET /links/{short_code}), managed identity auth to Cosmos DB, and a container image pinned to a specific Python 3.12-slim digest. It then ran code-review on its own output before declaring done.

The infra member always loads azd-deployment, azure-container-apps, azure-defaults, code-review, and azure-deployment-preflight. The identity-managed-identity and azure-role-selector skills loaded from the managed identity and RBAC ADR keywords. It built the Bicep template: Container Apps environment, Cosmos DB in serverless mode, ACR registry, and the RBAC assignment wiring the ACA managed identity to Cosmos read/write. No connection strings anywhere. It also ran code-review on its own output.

The data-engineer and ai-engineer members both had trivial_project_skip: true for a stateless API demo.

Code review caught a hardcoded ACR URL in the CI pipeline. Security scan flagged a base image with no SHA pin - fixed before merge. Both issues got resolved in the same PR. That's the advantage of having agents that review as you build rather than finding these things in a production incident post-mortem.
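A check for the missing SHA pin is easy to approximate. This is a simplified sketch, not the scanner used in the Quickstart: it only looks at `FROM` lines and will also flag references to earlier build stages in multi-stage Dockerfiles.

```python
import re

# Captures the image reference on each FROM line, skipping an optional --platform flag.
FROM_LINE = re.compile(r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)", re.IGNORECASE | re.MULTILINE)


def unpinned_images(dockerfile_text: str) -> list:
    """Return base images that are referenced without an @sha256 digest."""
    return [
        image
        for image in FROM_LINE.findall(dockerfile_text)
        if "@sha256:" not in image and image.lower() != "scratch"
    ]


print(unpinned_images("FROM python:3.12-slim\nRUN pip install fastapi\n"))
# ['python:3.12-slim']
```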

Gate 2 to 3 passed clean. Code matched the ADRs, tests passed, CI pipeline was green.

Phase 3: validate

Phase 3 has three members, all running without trivial project skips.

The integration member always loads code-review and azure-well-architected-assessment. It verifies integration points match the ADRs and runs a WAF assessment. For Cosmos-based projects it picks up cosmos-db-nosql-patterns conditionally. Results go to evidence/phase3/integration/test-results.md.

The qa member has no always_load skills - it picks everything up conditionally. Performance testing? azure-performance-resilience-validation. E2E tests? playwright-testing. Deployment pipeline? github-actions-ci-cd. For LinkSnap the load was light: just functional test verification. Results go to evidence/phase3/qa/test-results.md.

The security-engineering member always loads threat-modelling and api-security-review. It reads the threat model from Phase 1, implements all mitigations, checks for committed secrets, and addresses OWASP Top 10. For agentic projects it picks up owasp-agentic conditionally. Results go to evidence/phase3/security/hardening-evidence.md.

The exit gate requires all three evidence files to exist before the ARB convenes.

For LinkSnap: load test ran with 50 concurrent users against a staging revision. Average response time was 180ms for writes and 45ms for reads. ACA scaled from one to two replicas during the test. Cosmos DB serverless handled the load without throttling.

Trivy scan found zero critical or high vulnerabilities. pip-audit found zero known advisories in the dependency tree. The two accepted gaps from the threat model - no auth and no rate limiting - were documented in the runbook as known v1 limitations.

Security Lead approved. Product Owner approved. Cloud Economics confirmed no cost overrun.

Phase 4: production

Phase 4 has three members. Entry requires the Phase 3 evidence directory to exist.

The reliability member always loads observability-monitoring. It picks up dr-design, chaos-engineering, azure-sre-agent, and azure-safe-deployment-practices conditionally based on ADR keywords. It writes SLOs to reliability/slos.md and produces dashboards, alerts, and DR validation evidence.

The runbooks member uses a writing category model and always loads docs-style. It picks up azure-troubleshooting, release-notes, post-mortem, and adr-management conditionally. It also consumes writing-style and markdown instruction files from load_instructions, so the runbook prose matches the project's documentation standards. Outputs: deployment runbook and incident response guide.

The cloud-economics member always loads finops, cost-optimization, agent-economics, and azure-cost-calculator. It writes the final variance report comparing actuals against the Phase 1 estimates, and a phase-4 agent economics summary.

The exit gate requires SLOs, the deployment runbook, and the incident response guide to all exist before the final ARB convenes.

For LinkSnap: the runbook covers health checks, log queries, scaling commands, escalation contacts, and the known gaps. DR is documented as a single-region best-effort arrangement with manual redeployment as the recovery path. Realistic for a Quickstart. Wouldn't fly for a production customer workload, but it's honest about what it is.

The final gate passed unanimously. Five phases, five gates, one shipped application.

The ARB team

Every gate invokes the same arb-review team. It has four members.

Janus is the chair. Janus always loads governance-gate, azure-well-architected-assessment, pressure-test, and risk-register. Janus runs the evidence inventory using the governance-gate checklist, writes a WAF assessment, assembles the gate checklist, collects the three votes, and tallies the outcome. The quorum rule: 4/5 = APPROVED, 3/5 = CONDITIONAL (mandatory revisions), 2+/5 = REJECTED (phase must be redone). If the ARB deadlocks, Janus writes escalated.md and surfaces to a human arbiter.
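The rule is stated out of five while each gate seats two or three voters, so any code has to pick an interpretation. The sketch below assumes the thresholds are approval shares (at least 4/5 approving is APPROVED, at least 3/5 is CONDITIONAL, anything lower is REJECTED). That reading is an assumption, and escalation on deadlock is not modelled. Under it, Gate 1's two approvals out of three votes come out as CONDITIONAL, which matches what happened.

```python
from fractions import Fraction


def gate_outcome(votes: list) -> str:
    """Classify a gate by the share of 'approved' votes (assumed reading of the 4/5 and 3/5 rule)."""
    if not votes:
        raise ValueError("no votes cast")
    share = Fraction(sum(vote == "approved" for vote in votes), len(votes))
    if share >= Fraction(4, 5):
        return "APPROVED"
    if share >= Fraction(3, 5):
        return "CONDITIONAL"
    return "REJECTED"


print(gate_outcome(["approved", "conditional", "approved"]))  # CONDITIONAL
print(gate_outcome(["approved", "approved", "approved"]))     # APPROVED
```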

The three voters are product-owner, security, and cloud-economics - each reading their relevant artifacts and writing a vote file to arb/{phase}-gate/votes/.

The product-owner voter always loads pressure-test and votes on alignment with business intent, success metrics, budget, and scope.

The security voter always loads threat-modelling and votes on whether threats are mitigated or accepted, OWASP findings addressed, and compliance met. For agentic projects it picks up owasp-agentic, responsible-ai-operating-model, and agent-governance-toolkit conditionally.

The cloud-economics voter always loads cost-optimization, agent-economics, and azure-cost-calculator and votes on whether cost is within budget, agent economics are within baseline, and variance is acceptable.

Every vote file, every checklist, and every outcome is committed to .sisyphus/knowledge/arb/{phase}-gate/. The audit trail is the repository.

What the gates caught

The CONDITIONAL at Gate 1 was the most valuable outcome of the entire process. The security ARB voter's two revisions forced a decision that would have been painful to retrofit. The local dev auth path is the kind of thing that gets deferred until someone commits a key. The network perimeter decision is the kind of thing that gets deferred until a compliance audit finds a public Cosmos endpoint.

The earlier you catch these, the cheaper they are to fix. That's not a new insight. What's new is that the ARB gate made it structural rather than depending on someone happening to ask the right question in a meeting.

What the economics looked like

One thing I tracked across all five phases was the agent token budget. The agent-economics skill, loaded by both the Phase 0 product-owner and the Phase 1 cloud-economics member, allocated 510,000 tokens across the lifecycle, split by model tier. Phase 1 (architecture) used the heaviest models (Opus for lead, Sonnet for members). Phase 2 (build) had the highest raw token count because of all the code generation and review loops - the backend member runs on the deep category, which is the costliest tier.

The actuals came in under budget. Phase 0 used 14,200 tokens against a 20,000 budget. Phase 1 used 72,000 against 80,000. Phase 2 was the tightest at 185,000 against 200,000 - the code review loops chewed up more iterations than expected. The total across all five phases was about 460,000 tokens, which at current API pricing works out to roughly $16. The cheapest governance process I've ever run.
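The arithmetic behind those figures, using only the numbers quoted above (the total and the dollar figure are the post's rounded values, so the blended rate is approximate):

```python
def utilisation(used: int, budget: int) -> float:
    """Fraction of the token budget actually consumed."""
    return used / budget


phases = {"Phase 0": (14_200, 20_000), "Phase 1": (72_000, 80_000), "Phase 2": (185_000, 200_000)}
for phase, (used, budget) in phases.items():
    print(f"{phase}: {utilisation(used, budget):.1%} of budget")
# Phase 0: 71.0% of budget
# Phase 1: 90.0% of budget
# Phase 2: 92.5% of budget

total_used, total_budget, total_cost = 460_000, 510_000, 16.0
print(f"Total: {utilisation(total_used, total_budget):.1%} of budget")          # Total: 90.2% of budget
print(f"Blended rate: ${total_cost / total_used * 1_000_000:.2f} per million")  # Blended rate: $34.78 per million
```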

Lessons learned

Three things stood out from running this end to end.

First, the conditional gate was the most valuable outcome. If every gate had passed unanimously, I'd be writing a different post - one about how governance is easy when nothing goes wrong. The security ARB voter's conditional at Gate 1 proved the system works. It caught something real, produced a concrete revision list, and the team resolved it in an hour.

Second, the skill-loading design means the teams adapt to the project. For LinkSnap, four of the six Phase 2 members were skipped entirely via trivial project mode, and the remaining two picked up exactly the skills they needed (Cosmos patterns, managed identity) from the ADR keyword matching. A more complex project with AI agents, a frontend, and a data layer would activate all six members and load the relevant skills for each domain. The same config file, different capability surface.

Third, the artifact trail is the killer feature. Every approval, every condition, every rejection rationale is committed as a Markdown file. If someone asks six months from now who approved the Cosmos partition key decision, the answer is in .sisyphus/knowledge/adrs/ADR-002-cosmos-partition-key.md with a signed-off vote from the Principal Architect. No digging through Teams messages or email threads.

The Quickstart

Quick link

Browse the full implementation in the OMO Teams Quickstart repository.

The full Quickstart repo is at lukemurraynz/omo-teams-quickstart if you want to see all the artifacts. Every ADR, every vote file, every gate outcome, every risk register entry, and the full application code. The README walks through how to deploy it yourself.

If you're running AI agents on Azure projects and wondering where the governance is, this is a pattern you can adapt. Start with the five-phase model, pick your voters, and run your first gate.


Agentic Operations Lakehouse: Drasi & Microsoft Framework

Wed, 27 May 2026 09:41:53 GMT

Hospital operations run on a web of concurrent signals. Theatre lists change throughout the day. PACU bays fill and empty. Sterile tray queues build up. Discharge blockers cascade into bed shortages. None of these individually defines a risk (it's the combination that matters), and the window to act is often under an hour.

A traditional response is a coordinator checking spreadsheets, chasing phone calls, and making judgement calls with incomplete information. A common way to apply AI to this kind of scenario would be to route the problem through a chat assistant and hope the prompt captures enough context. In this kind of operational workflow, that is not enough on its own: the system needs an audit trail, grounding in historical outcomes, and a clear boundary between what it can decide autonomously and what needs a human to approve.

I wanted to see if a different approach was feasible:

One where AI agents can produce evidence-backed recommendations grounded in historical patterns, high-impact actions always require human approval, every decision is recorded for audit and replay, and the detection logic is deterministic and testable (not buried in a prompt).

This post covers the Proof of Technology I built to validate an Agentic Operations Lakehouse style pattern (and to be frank, it was a good chance for some fun, tying these technologies together).

Three Azure/hero technologies each own a distinct part of the problem:

Drasi for live risk detection
Microsoft Agent Framework for governed agent reasoning
Microsoft Fabric for operational memory

To do this, we will use a healthcare scenario built on entirely synthetic data (no real patient data, clinical records, or live hospital systems at any point), and the full source is in the AgenticLakehousePoT repository.

The short version. Drasi runs continuous queries that detect risk signals deterministically (in my scenario it's multiple tables in Azure Postgres and Event Hub sources, but it could be from multiple different sources). Microsoft Agent Framework runs a 14-stage workflow (5 LLM calls, 9 deterministic stages) that reasons about the event and produces a recommendation. In the source implementation, the LLM cannot write directly to storage or bypass the action routing table, and every decision is recorded in Fabric for audit. High-impact actions require human approval.

note

The full implementation is in the AgenticLakehousePoT repository on GitHub. Everything in this post deploys from that repo with azd up. Feel free to fork, review etc.

The architecture in one sentence

Full system architecture: Drasi detects risks from PostgreSQL and Event Hubs; the 14-stage MAF workflow reasons over them using Microsoft Foundry agents; Microsoft Fabric stores every outcome; the React operator portal surfaces recommendations and approvals to role-selected operators.

Live walkthrough of the operator portal: risk events detected by Drasi appearing as recommendations, with role-based access and action approval flow visible in the UI.

Drasi detects that a risk exists. Microsoft Agent Framework reasons about it and produces a recommendation. Fabric stores the evidence. Humans approve or reject. The workflow records everything. It is straightforward when you break it down, but the interesting part is how these three pieces fit together (and what each one is not allowed to do).

Temporal ordering of the full pipeline: synthetic signals trigger Drasi detection, which kicks off the MAF workflow, which reads and writes Fabric context, ultimately serving the operator portal. Self-messages show internal processing; dashed arrows are replies.

The split that actually matters

The design principle that sets this apart from "put everything in a prompt" is the agentic/deterministic boundary. In this implementation, there are fourteen workflow stages: five invoke a Foundry LLM agent, the other nine are deterministic (schema validation, database queries, KQL reads, static routing lookups, API calls, and Fabric writes).

All 14 stages in execution order. Navy fill = LLM-backed (Azure AI Foundry agent). White fill = fully deterministic. Stage 7 (ApplySafetyPolicy) runs both layers.

The LLM handles the parts that need contextual reasoning (classifying a risk, routing to the right specialist, producing a recommendation for a specific operator role, checking whether a risk is still live). The deterministic code handles the parts that need repeatability and safety enforcement (validation, state queries, the action routing table, SLA policy resolution, and Fabric writes).

If an agent returns unexpected output, the deterministic stages either catch it (the decisionDrivers validation), ignore it (the routing lookup blocks unknown actions), or record it for audit without acting on it. The LLM cannot bypass the routing table or write to Fabric directly.
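The "routing lookup blocks unknown actions" idea can be sketched as a plain dictionary lookup. The PoT's worker is C#; this Python version is only an illustration, and the action names, the `recommendedAction` field, and the status strings are all hypothetical.

```python
# Hypothetical routing table: action names and approval flags are made up for illustration.
ROUTING_TABLE = {
    "notify-bed-manager": {"requires_approval": False},
    "reschedule-theatre-list": {"requires_approval": True},
}


def route_action(agent_output: dict) -> dict:
    """Route only actions present in the static table; block anything else the agent proposes."""
    action = agent_output.get("recommendedAction")
    entry = ROUTING_TABLE.get(action) if isinstance(action, str) else None
    if entry is None:
        # Unknown or malformed actions are never executed, only reported.
        return {"status": "blocked", "action": action, "reason": "not in routing table"}
    status = "pending-approval" if entry["requires_approval"] else "auto-routed"
    return {"status": status, "action": action}
```

The point of the design is that the set of possible side effects is fixed in code, so a surprising model output can at worst produce a blocked record.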

Each component's explicit boundary: green = owned responsibility, grey = explicitly excluded. When something goes wrong, you know exactly which component to investigate.

Drasi keeps detection honest

First design question: who decides that a risk exists? The answer is not the AI agent. It is Drasi.

Drasi is an open-source project from Microsoft and a CNCF (Cloud Native Computing Foundation) Sandbox project that runs continuous queries over live operational state. When source state changes (a new theatre entry, a PACU bay becoming unavailable, a discharge flag being set), Drasi re-evaluates its queries and emits structured change events. I wrote one query per risk type. Each query defines the signal combination that constitutes a risk. The bed capacity query, for example:

apiVersion: v1
kind: ContinuousQuery
name: healthcare-bed-capacity-risk
spec:
  mode: query
  queryLanguage: Cypher
  sources:
    subscriptions:
      - id: aol-operational-postgres
        nodes:
          - sourceLabel: surgical_cases
          - sourceLabel: ward_bed_forecasts
  query: >
    MATCH (c:surgical_cases)
    MATCH (w:ward_bed_forecasts)
    WHERE
      c.ScenarioRunId = w.ScenarioRunId
      AND c.CorrelationId = w.CorrelationId
      AND w.StateValue = 'blocked'
    RETURN
      c.Id AS workItemId,
      'bed-capacity-risk' AS riskType,
      'high' AS riskLevel,
      'Post-op bed forecast indicates blocked capacity' AS observedFact,
      'human-approval-required' AS approvalRequirement

It watches two PostgreSQL tables (surgical_cases, ward_bed_forecasts), matches them on a correlation ID, and when a ward forecast flips to blocked it emits a structured risk event. The output feeds directly into the MAF (Microsoft Agent Framework) workflow as the observedFacts you see in the code samples.
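To make the matching logic easier to reason about, here is the same predicate written as a plain Python function over two lists of rows. This is not how Drasi evaluates the query (Drasi re-evaluates incrementally as sources change); it is a one-shot illustration of what the query returns, using the column names from the query above.

```python
def bed_capacity_risks(surgical_cases: list, ward_bed_forecasts: list) -> list:
    """One-shot equivalent of the continuous query: join on run and correlation IDs, keep blocked wards."""
    return [
        {
            "workItemId": case["Id"],
            "riskType": "bed-capacity-risk",
            "riskLevel": "high",
            "observedFact": "Post-op bed forecast indicates blocked capacity",
            "approvalRequirement": "human-approval-required",
        }
        for case in surgical_cases
        for forecast in ward_bed_forecasts
        if case["ScenarioRunId"] == forecast["ScenarioRunId"]
        and case["CorrelationId"] == forecast["CorrelationId"]
        and forecast["StateValue"] == "blocked"
    ]
```

A function like this is handy as an oracle in tests: feed the same synthetic rows to both and compare the result sets.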

Drasi continuous query containers deployed on Azure Kubernetes Service, listing running pods across the cluster namespace.

Detection is testable (a Drasi query is a declarative expression you can write unit tests against and replay historical events through; if detection logic were inside an agent prompt, testing it would mean evaluating LLM outputs). Detection is observable (Drasi emits structured events with correlation IDs, observed facts, and lifecycle state. When an operator asks "why was this risk flagged?", the answer comes from structured output, not from reconstructing what a model was thinking). And detection is separated from recommendation (if the recommendation is wrong, you can tell whether the agent misread the situation or simply gave bad advice). Drasi hands the MAF workflow a structured, verifiable risk event, and that separation is the important design boundary.

The workflow is the product, not the chat

Once Drasi emits a risk event, the MAF workflow runs. For this pattern, a single LLM call is the wrong boundary because waiting for human approval, checkpointing state for restart, and separating contextual reasoning from structural routing need explicit workflow state. The 14-stage workflow makes each concern an explicit, auditable stage with a clear input, output, and responsibility.

Role-aware recommendations

The bed manager wanted discharge-blocker advice. The theatre coordinator wanted case-sequencing language. I had to map the risk type to a role before the recommendation prompt made sense to its reader. The fix was a lookup that derives the primary operator role from the risk type and injects it into the context:

private static (string Role, string Context) MapRoleFromRiskType(string? riskType) => riskType switch {
    "bed-capacity-risk" or "post-op-discharge-coordination-risk" =>
        ("Bed Manager",
         "Responsible for ward capacity and patient discharge flow..."),
    "pacu-throughput-risk" or "theatre-turnover-risk" =>
        ("Theatre Coordinator",
         "Responsible for theatre list execution and perioperative flow..."),
    _ => ("Operational Manager", "...")
};

Before I added this, the operatorGuidance field read like generic advice. After adding the role name and one sentence of role context, it started using domain vocabulary. A single string addition to the prompt context.

The prompts themselves are versioned artefacts stored in Azure App Configuration, not hardcoded in the worker. They're loaded at startup by PromptLibrary.LoadFromAppConfigurationAsync with a version label and cached with a configurable TTL so the 14-stage workflow doesn't hit App Configuration on every stage call. Updating an agent's instructions means updating an App Configuration key-value pair, not redeploying the service.
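The caching behaviour described here is a standard time-to-live cache. The sketch below shows the idea in Python with a hypothetical `loader` callable standing in for the App Configuration read; it is not the PoT's PromptLibrary, which is C#.

```python
import time
from typing import Callable


class TtlPromptCache:
    """Cache prompt text per key and reload it once the entry is older than the TTL."""

    def __init__(self, loader: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader      # hypothetical stand-in for the configuration store read
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so tests do not have to sleep
        self._entries = {}         # key -> (loaded_at, value)

    def get(self, key: str) -> str:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

With a TTL in place, a prompt change in the store is picked up within one TTL window without a redeploy, and the stages in between share the cached copy.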

Handling transient Foundry failures

A number of the early PACU throughput risk runs hit an incomplete status from the Foundry agent on the first attempt (a transient infrastructure failure with no error detail). The initial code threw immediately, which restarted the entire 14-stage workflow from scratch. The fix was an internal retry within RunAgentAsync on incomplete status, before escalating to the workflow-level retry:

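const int MaxAttempts = 2;
for (int attempt = 1; attempt <= MaxAttempts; attempt++) {
    // Each attempt gets a fresh thread so a retry never reuses state from the failed run.
    PersistentAgentThread thread = await agentsClient.Threads.CreateThreadAsync(...);
    try {
        // (Run creation and polling are elided here; they produce `run` and `result`.)
        if (run.Status != RunStatus.Completed) {
            bool isIncomplete = string.Equals(run.Status.ToString(), "incomplete", ...);
            // Retry once in place on "incomplete"; any other status, or a second failure, throws.
            if (isIncomplete && attempt < MaxAttempts) { continue; }
            throw new InvalidOperationException(...);
        }
        return result;
    }
    finally {
        // Always clean up the thread, whether the attempt succeeded, retried, or threw.
        await agentsClient.Threads.DeleteThreadAsync(thread.Id, ...);
    }
}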

The workflow session stays alive, there is no stale-progress UX, and the workflow-level retry stays as a safety net for other failure modes.

Fabric as operational memory

Every record the workflow produces is written to Fabric. Every record the next run needs is read from Fabric. That bidirectional relationship is what makes recommendations grounded rather than speculative. When the agent generates a recommendation, it can see how many times this risk type has occurred historically, what the typical escalation latency looks like, and which actions have been effective. The Fabric context is fed directly into the generation prompt as structured data.

The Fabric workspace hosting the Eventhouse that stores risk events, recommendations, and approval records (the operational memory layer).

The operational memory lives in a Fabric Eventhouse. Risk events stream in from Drasi, the MAF workflow writes recommendation records directly, and the operator portal reads the current state through KQL queries. The Eventhouse schema was designed around three core tables: risk events, recommendation records, and action outcomes.

The KQL schema was the first thing I built. I changed it three times early in the PoT (Proof of Technology), and each change required updating the write path, read path, and API contract simultaneously, so I froze it and worked around the constraints instead. After that it stayed stable.

KQL query showing risk event lifecycle data aggregated over a seven-day window. Answers the question "how many risks did I detect, and what happened to each one?"

This query is the one I reached for most when testing. It tells you at a glance whether the pipeline is healthy - if new risk events are arriving, if recommendations are being produced, and if they're reaching the operator portal.

Querying the top 100 recommendation records in Eventhouse. Each record captures the full decision chain: risk event, agent recommendation, safety policy evaluation, and human approval outcome.

The recommendation records are the audit trail. Every decision (agentic and human) is captured in a single row you can trace from the risk event through to the outcome. This was important to me from the start - if someone asks "what happened with this risk?", there is one place to look.

Fabric insights surfaced in the operator portal, giving operators real-time visibility into the health and throughput of the operational memory layer.

A design decision that surprised me: the frontend polls for recommendations and gets nothing until stage 13 (seven stages after the recommendation is generated at stage 6). The recommendation lives in workflow state from stage 6. Fabric only sees it after stage 13, when the complete record (including approval decisions and policy evaluations) is written. This is by design, but operators will see "checking again" for 30-60 seconds. I had to make this explicit in the frontend (Fluent on React).

Three layers of safety

Every recommended action is classified before it reaches the operator portal:

Layer 1: deterministic routing table. Layer 2: LLM policy evaluation (parallel). Layer 3: deterministic post-approval gate. Only Layer 2 invokes a Foundry agent.

Layer 1 - deterministic routing. ActionRoutingSteps.Route() classifies each action against hardcoded lookup sets: SafeActions, ApprovalRequiredActions, and a blocked bucket. This runs before any LLM evaluation. An unknown action goes to blocked, regardless of what the agent recommended.

Layer 2 - LLM policy evaluation. EvaluateSafetyPolicyAsync calls the Foundry agent for each action in parallel to produce contextual rationale. The LLM provides the explanation, the routing table provides the enforced classification.

Layer 3 - deterministic final gate (post-approval). After the human approves, SafetyPolicyEngine.Evaluate() runs again with freshness signals (no LLM). If the risk has gone stale, no approved actions execute.

Class	Actions	Behaviour
Safe-automated	create-risk-board-entry, send-role-notification, pacu-throughput-coordination	Recorded immediately
Approval-required	theatre-case-resequencing, alternate-ward-placement, overtime-approval, duty-manager-escalation	Approval request to Duty Manager
Blocked (never-allowed)	surgery-cancellation, clinical-prioritisation, any unknown action	Blocked before LLM evaluation
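A minimal Python sketch of the deterministic layers, using the action names from the table above. The function names and return labels are illustrative rather than the repository's C# API, and the extra re-check of blocked actions in the final gate is my assumption.

```python
from typing import List

SAFE_ACTIONS = frozenset({
    "create-risk-board-entry",
    "send-role-notification",
    "pacu-throughput-coordination",
})
APPROVAL_REQUIRED_ACTIONS = frozenset({
    "theatre-case-resequencing",
    "alternate-ward-placement",
    "overtime-approval",
    "duty-manager-escalation",
})


def route(action: str) -> str:
    """Layer 1: deterministic classification; anything unlisted is blocked."""
    if action in SAFE_ACTIONS:
        return "safe-automated"
    if action in APPROVAL_REQUIRED_ACTIONS:
        return "approval-required"
    return "blocked"


def final_gate(approved_actions: List[str], risk_is_stale: bool) -> List[str]:
    """Layer 3: post-approval check with no LLM; a stale risk executes nothing."""
    if risk_is_stale:
        return []
    # Assumption: the gate also re-checks that nothing blocked slipped through.
    return [a for a in approved_actions if route(a) != "blocked"]
```

The default return value is what enforces the policy: a new or misspelled action name falls through to blocked without anyone having to list it.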

Operator view of a recommendation technical detail blade, showing the risk event summary, agent-generated recommendation, applied safety policy classification, and the approval action buttons for the Duty Manager role.

What I learned

These were not obvious up front. I hit each of them the first time I ran a full scenario end-to-end.

The incomplete status from Foundry is a retry problem, not a workflow problem. I initially threw immediately on the first incomplete return, which restarted all 14 stages from scratch. The fix was an internal retry loop inside RunAgentAsync — operators never saw the stall once I hid it behind the call-level retry. The workflow-level retry is still there as a safety net for everything else.

The recommendation is not visible in Fabric until stage 13, even though stage 6 generates it. This surprised me the first time I watched the operator portal — the screen looked like it was doing nothing for 30-60 seconds after a risk appeared. I had to add a stage-start event so operators see "running..." instead of a frozen UI. Make this explicit in the portal from the start.

The KQL schema was the first thing I built, and it only became stable once I froze it. I changed it three times, and each change required updating the write path, read path, and API contract simultaneously. So I froze it and worked around the constraints. That was the right call: locking the schema early kept those contracts stable through the rest of the PoT.

The reusable pattern

The healthcare scenario is one instance of a general pattern. The same architecture applies anywhere live signals create operational risk, humans need evidence-backed recommendations, and high-impact actions need approval.

The same architecture pattern maps across healthcare, manufacturing, logistics, and field service. Each industry has its own work item, capacity constraint, and risk event, but the detection/reasoning/approval/memory structure stays the same.

The repository includes three scenario packs (healthcare, manufacturing, and a manufacturing stub). The stub is the quickest way to understand the pattern - it implements IScenarioPack with inline comments mapping each healthcare concept to its manufacturing equivalent. Adding a new industry means implementing that same interface: define risk types, actions, roles, and synthetic data rules. The cross-industry mapping document covers utilities and emergency management as worked examples.

Open source

The full implementation is on GitHub under MIT licence in the AgenticLakehousePoT repository. It includes all five microservices, both full scenario packs, Fabric workspace item definitions, Drasi continuous query definitions, the React/Fluent UI operator portal, and eight test projects covering domain logic, integration, safety policy, and agent evaluation (AgentEval for .NET/MAF).

Clone the repo, run the deployment script, and try it with your own risk types. The manufacturing stub is a good starting point - walk through the inline comments and you can see how the full detection-to-approval flow maps to a different industry. If you build on this pattern for a new industry, I would like to hear about it.

Running Azure SRE Agent for AKS and Drasi Operations

Fri, 08 May 2026 07:57:10 GMT

I have been spending time with Azure SRE Agent and wanted to see how far I could take it beyond the "click around the portal" experience.

The goal was simple: build a public, repeatable blueprint that deploys an Azure SRE Agent for AKS and Drasi operations with:

infrastructure deployed through Azure Developer CLI
custom SRE subagents
skills and runbooks
Azure Monitor response plans
scheduled health checks
MCP connectors for Microsoft Learn and Drasi docs
fault-injection tests for AKS and Drasi failure modes

The result is an Azure SRE Agent with support for Drasi on AKS that can be deployed with azd up using an AVM-style (Azure Verified Modules) Bicep module and PowerShell.

Why I Built This

Drasi is a good workload for this pattern because it sits right on the boundary between application runtime and platform reliability.

When a Drasi query is stale or a source is not delivering changes, the root cause might be Drasi itself.

But it might also be:

an AKS scheduling problem
a missing metrics API
a broken admission webhook
a node under pressure
a stopped cluster
a DCR or DCRA problem
a private-cluster operations path issue.

That is where SRE Agent becomes interesting. I had a lot of fun setting this up, and the mind boggles at what this can do!

The Azure SRE Agent can receive an incident from Azure Monitor, route it to a specialist agent, collect evidence, reason through likely causes, and either propose or execute a remediation depending on the response plan mode.

The trick is giving it enough structure that it does not treat every symptom as 'restart the app' and instead works through appropriate troubleshooting and evidence-gathering steps.

What this Blueprint Deploys

note

The full source is at lukemurraynz/drasi-aks-sre-agent on GitHub. Everything in this post deploys from that repo with azd up.

The repository deploys the resources for the SRE Agent with Bicep and wires the agent configuration through a post-provision script.

drasi-aks-sre-agent/
├── infra/
│   ├── main.bicep
│   ├── drasi-sre-agent.bicep
│   └── drasi-sre-agent-rbac.bicep
├── avm/
│   └── res/app/agent/main.bicep
├── scripts/
│   └── setup-sre-agent.ps1
├── sre-config/
│   ├── agents/
│   ├── skills/
│   ├── response-plans/
│   ├── scheduled-tasks/
│   └── testing/
└── azure.yaml

At a high level, azd up gives you:

Microsoft.App/agents Azure SRE Agent
managed identity for resource operations
Application Insights
Log Analytics workspace integration
Azure Monitor incident platform
Azure Monitor, Log Analytics, Application Insights, Microsoft Learn, and Drasi docs connectors
response plans for AKS and Drasi incidents
scheduled health probes and daily resilience summaries
scoped RBAC for the Drasi resource group and AKS cluster

Agent Design

I split the agent capability into four custom agents:

Agent	Purpose
`drasi-incident-triage`	First responder. Classifies the incident and routes by failure phase.
`aks-platform-diagnostics`	Handles AKS, node, networking, autoscaler, metrics, admission, and upgrade issues.
`drasi-runtime-diagnostics`	Handles Drasi sources, continuous queries, reactions, Dapr, Redis, Mongo, and Drasi rollout issues.
`drasi-remediation-review`	Reviews proposed fixes for evidence, risk, rollback, and validation.

I did not want the Drasi runtime agent to debug a cluster-wide scheduling issue. I also did not want the AKS agent deleting Drasi resources when a query or source isn't working.

So the response plans route by failure phase first:

Failure phase	Prefer this route
Pod creation fails	Admission webhook, workload identity, policy, or API server
Pod is pending	Scheduler, node capacity, autoscaler, subnet, or quota
HPA/KEDA is blind	Metrics API or external metrics API
Broad `kubectl` and controller timeouts	API server, konnectivity, node/network health
Only Drasi resources are unhealthy after source/query changes	Drasi lifecycle diagnostics
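The "platform phases first" ordering in the table can be sketched as a small lookup. The phase labels below are hypothetical names I made up for the table rows; only the agent names come from this post.

```python
from typing import Set

# Hypothetical labels for the first four rows of the table above.
PLATFORM_PHASES = (
    "pod-creation-failed",
    "pod-pending",
    "autoscaling-metrics-blind",
    "broad-kubectl-timeouts",
)
# Hypothetical label for the last row.
DRASI_PHASE = "drasi-only-unhealthy"


def pick_agent(observed: Set[str]) -> str:
    """Prefer platform diagnostics whenever any platform phase is present."""
    for phase in PLATFORM_PHASES:
        if phase in observed:
            return "aks-platform-diagnostics"
    if DRASI_PHASE in observed:
        return "drasi-runtime-diagnostics"
    return "drasi-incident-triage"  # nothing conclusive: stay with triage
```

The Drasi specialist is only chosen when no platform-level phase is present, which matches the intent of routing by failure phase before product.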

Built-In Skills Still Matter

One thing I tested was whether custom skills replaced the built-in skills.

They should not.

For Azure Kubernetes Service (AKS), the built-in aks_general skill is still useful for generic Kubernetes operations. The custom aks-platform-diagnostics skill I added contains the environment-specific context: Drasi details, known false-positive patterns, and our route-specific evidence bundles.

The setup script only upserts custom skills and agents. It does not overwrite the built-in SRE Agent skills.

That distinction matters because future platform improvements should continue to flow through the built-in skill set.

Custom Skills

Skills are the runbooks that tell each agent what to collect, what to query, and how to reason before proposing a fix.

I wrote three custom skills for this blueprint:

Skill	Attached to	Evidence bundle
`aks-platform-diagnostics`	`aks-platform-diagnostics` agent	Node status, pod events, admission webhook health, metrics API availability, konnectivity tunnel state, SNAT stats
`drasi-runtime-diagnostics`	`drasi-runtime-diagnostics` agent	Drasi source and query status, Dapr sidecar health, Redis and Mongo connectivity, resource-provider logs
`drasi-remediation-review`	`drasi-remediation-review` agent	Evidence completeness checklist, risk classification, rollback path verification, validation steps

The setup script applies them on every azd up without touching the built-in skills.

I kept each evidence bundle deliberately narrow. The Drasi runtime skill, for example, always checks source status before looking at any continuous query — because a stale-looking query usually has a source connection problem behind it. If I left that ordering to the model, it would take longer and sometimes go the wrong way.

Connector Lesson: Connected Does Not Always Mean Enabled

The first issue I hit was with the Microsoft Learn and Drasi docs MCP connectors.

The connector status was healthy, but the tools were not active for the agent. In the portal, they showed up as connected but with zero active tools.

warning

A healthy connector status does not mean the tools are active for your agent. Always verify the tool assignment in the portal, not just the connector health indicator.

The fix was to configure both the connector metadata and the agent tool assignment:

Enable-AgentTools -ToolNames @(
  'microsoft-learn_microsoft_docs_search',
  'microsoft-learn_microsoft_code_sample_search',
  'microsoft-learn_microsoft_docs_fetch',
  'drasi-docs_fetch_docs_documentation',
  'drasi-docs_search_docs_documentation',
  'drasi-docs_search_docs_code',
  'drasi-docs_fetch_generic_url_content'
)

After that, the agent had access to current Microsoft documentation and live Drasi docs during investigations.
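If you want to catch this in automation rather than by eye, a check along these lines could work. This is a Python sketch over a hypothetical data shape (`name`, `status`, `activeTools`); it does not reflect the actual SRE Agent API response format.

```python
from typing import Dict, List


def connectors_without_active_tools(connectors: List[Dict]) -> List[str]:
    """Return connectors that report Connected but expose no active tools."""
    return [
        c["name"]
        for c in connectors
        if c["status"] == "Connected" and len(c["activeTools"]) == 0
    ]


# Hypothetical snapshot: one connector is healthy but has nothing enabled.
sample = [
    {"name": "microsoft-learn", "status": "Connected", "activeTools": []},
    {"name": "drasi-docs", "status": "Connected",
     "activeTools": ["drasi-docs_search_docs_documentation"]},
]
flagged = connectors_without_active_tools(sample)
```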

Response Plans

The repo includes direct routes for common Azure Kubernetes Service (AKS) and Drasi incidents.

For Azure Kubernetes Service (AKS):

cluster stopped
CoreDNS unavailable
node pressure
image pull failures
pod scheduling failures
storage mount failures
Dapr system faults
Cilium/network faults
Azure Monitor agent faults
admission webhook failures
autoscaler stuck or capped
metrics API unavailable
SNAT port exhaustion
API server overload
konnectivity tunnel faults
AKS upgrade blockers
namespace or PVC stuck terminating

For Drasi:

platform fault
source unavailable
query staleness
reaction unavailable
Redis/Mongo/Dapr state store faults
partial upgrade or failed rollback
source bootstrap race
source dependency break

Most routes stay in Review mode. One route is intentionally Autonomous:

{
    "id": "aks-cluster-stopped",
    "handlingAgent": "aks-platform-diagnostics",
    "agentMode": "autonomous"
}

If the cluster is stopped, the agent is allowed to start the same AKS cluster, provided you grant its User Assigned managed identity permission on the resource. Otherwise, you can have the route notify you through email or Teams, and you can elevate the permissions yourself (as long as you have access to do so):

az aks start -g <resource-group> -n <aks-cluster-name>

That is a bounded, reversible-enough action for my use case. It does not authorize node-pool scale-out, upgrades, networking changes, add-on changes, or cluster recreation.

warning

Autonomy should be route-specific. Do not make the entire agent autonomous just because a single remediation is safe enough for your environment.
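One way to keep that boundary honest is a pre-deployment check over the response plan definitions. The Python sketch below assumes the plans are available as a JSON array with the `id` and `agentMode` fields shown earlier; the `review` mode value and the two extra route IDs are hypothetical.

```python
import json
from typing import List

# Only routes listed here may run without human approval.
AUTONOMOUS_ALLOWLIST = {"aks-cluster-stopped"}


def unexpected_autonomous_routes(plans_json: str) -> List[str]:
    """Return IDs of autonomous routes that are not on the allowlist."""
    plans = json.loads(plans_json)
    return sorted(
        p["id"]
        for p in plans
        if p.get("agentMode", "").lower() == "autonomous"
        and p["id"] not in AUTONOMOUS_ALLOWLIST
    )


# Hypothetical plan set: the last route should not be autonomous.
sample_plans = json.dumps([
    {"id": "aks-cluster-stopped", "handlingAgent": "aks-platform-diagnostics",
     "agentMode": "autonomous"},
    {"id": "aks-node-pressure", "handlingAgent": "aks-platform-diagnostics",
     "agentMode": "review"},
    {"id": "drasi-query-staleness", "handlingAgent": "drasi-runtime-diagnostics",
     "agentMode": "autonomous"},
])
violations = unexpected_autonomous_routes(sample_plans)
```

Failing a pipeline when `violations` is non-empty makes every new autonomous route an explicit, reviewed change.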

The alert then changed to Acknowledged, and the agent output a Kepner-Tregoe problem management table (i.e., IS vs IS NOT).

You can also look at the Trace of the process to see the steps it took, which helps when improving the agents and their skill calling.

Session Insights

Every incident creates an investigation session that you can open in the portal. I found these worth going back and reading properly after each test run.

Each session shows you:

the triggering alert and incident metadata
which response plan and subagent handled the route
every tool call made during the investigation (Log Analytics queries, kubectl commands, Azure REST calls, MCP doc lookups)
the evidence collected and how the agent reasoned about it
the proposed or executed remediation
a Kepner-Tregoe IS / IS NOT table where the agent produced one

That last part is worth calling out. It is not just tidy output — it forces the agent to be explicit about what is not broken, which is often as useful as knowing what is.

Because the blueprint wires in Application Insights as a connector, you can query the agent's own telemetry directly:

dependencies
| where cloud_RoleName == "sre-agent"
| where timestamp > ago(1h)
| project timestamp, name, duration, success
| order by timestamp desc

That helps surface slow tool calls or failed skill invocations that the session view does not always make obvious.

After a real incident, I would go through the session and:

Check which tools fired and in what order.
Look for tool calls that did not make it into the reasoning — wasted round-trips.
Look for places where the agent guessed at evidence rather than retrieved it.
Update the skill to tighten the evidence bundle for that route.

tip

Sessions are the fastest way to improve your agent over time. One review after a real incident is worth more than ten synthetic tests.

The Trace view shows the order of skill calls and subagent handoffs. If a route touched three agents before finding the right one, the triage logic in drasi-incident-triage needs to be adjusted.
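As a rough illustration of that review heuristic, the Python sketch below counts how many agents a route passed through before the final one. It assumes you have exported the trace as an ordered list of agent names, which is a hypothetical format; the threshold of three comes from the paragraph above.

```python
from typing import List


def agents_touched(trace: List[str]) -> List[str]:
    """Collapse consecutive repeats to get the ordered list of handoffs."""
    touched: List[str] = []
    for agent in trace:
        if not touched or touched[-1] != agent:
            touched.append(agent)
    return touched


def needs_triage_review(trace: List[str], max_before_final: int = 3) -> bool:
    """Flag routes that passed through too many agents before the last one."""
    agents_before_final = len(agents_touched(trace)) - 1
    return agents_before_final >= max_before_final


direct = ["drasi-incident-triage", "aks-platform-diagnostics"]
wandering = [
    "drasi-incident-triage",
    "drasi-incident-triage",
    "drasi-runtime-diagnostics",
    "aks-platform-diagnostics",
    "drasi-remediation-review",
]
```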

Scheduled Tasks

Azure SRE Agent scheduled tasks are useful for proactive reliability checks. The Microsoft docs describe them as scheduled natural-language checks that create a conversation thread, query data sources, reason about findings, and return an actionable summary.

This blueprint adds:

Task	Purpose
`drasi-health-probe-15m`	Recurring AKS and Drasi health probe
`drasi-daily-resilience-report`	Daily operational risk and resilience summary

The 15-minute task checks the cluster power state before trying any Kubernetes command. If the cluster is stopped, it reports that directly and avoids wasting time on failed kubectl calls.
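The ordering in that probe is simple enough to sketch. In this Python illustration both callbacks are hypothetical stand-ins: one for a power-state lookup, the other for the Kubernetes checks.

```python
from typing import Callable, Dict


def health_probe(get_power_state: Callable[[], str],
                 run_kubectl_checks: Callable[[], str]) -> Dict[str, str]:
    """Check the cluster power state first; skip kubectl when it is not running."""
    state = get_power_state()
    if state != "Running":
        return {"powerState": state, "kubernetesChecks": "skipped"}
    return {"powerState": state, "kubernetesChecks": run_kubectl_checks()}


# Fakes that record whether the Kubernetes checks were attempted.
kubectl_calls = []


def fake_checks() -> str:
    kubectl_calls.append("get nodes")
    return "healthy"


stopped = health_probe(lambda: "Stopped", fake_checks)
running = health_probe(lambda: "Running", fake_checks)
```

Only the second probe reaches the Kubernetes checks, so a stopped cluster produces one clear finding instead of a run of timeouts.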

The daily report is more architectural: recurring risks, noisy components, failed remediations, and follow-up work.

You could also use scheduled tasks for cost analysis reporting, configuration drift checks, and more.

Fault Injection

I wanted this to be testable without breaking a shared AKS cluster, so the repo includes a fault-injection matrix and synthetic route validation.

For destructive or noisy cases, use synthetic alerts:

az monitor metrics alert create \
  --resource-group <resource-group> \
  --name sre-e2e-aks-admission-webhook-failure \
  --scopes <aks-cluster-resource-id> \
  --description "Synthetic route validation. Expected route: aks-admission-webhook-failure" \
  --severity 3 \
  --evaluation-frequency 1m \
  --window-size 5m \
  --condition "avg kube_node_status_allocatable_cpu_cores > 0" \
  --action <sre-agent-action-group-resource-id> \
  --auto-mitigate false

That alert intentionally fires without damaging the cluster. The important part is the route ID in the alert name and description.

The Bicep also supports this with an opt-in flag:

param deploySyntheticRouteValidationAlerts bool = false

Keep it off by default. Turn it on only for validation windows.

danger

Always-firing synthetic alerts that run continuously will trigger autonomous or review-mode agent runs, burning through tokens and tools. Deploy them, validate them, then delete or disable them.
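Because the route ID lives in both the alert name and the description, a validation run can check that the two agree before comparing against the route the agent actually took. This Python sketch assumes the `sre-e2e-` name prefix and the "Expected route:" phrase from the example alert; the function itself is illustrative.

```python
import re
from typing import Optional

EXPECTED_ROUTE = re.compile(r"Expected route:\s*([a-z0-9-]+)")
NAME_PREFIX = "sre-e2e-"


def expected_route(alert_name: str, description: str) -> Optional[str]:
    """Return the route ID only if the name and description agree on it."""
    match = EXPECTED_ROUTE.search(description)
    if match is None:
        return None
    route_id = match.group(1)
    if alert_name != NAME_PREFIX + route_id:
        return None  # name and description disagree: treat as misconfigured
    return route_id


route_id = expected_route(
    "sre-e2e-aks-admission-webhook-failure",
    "Synthetic route validation. Expected route: aks-admission-webhook-failure",
)
mismatch = expected_route(
    "sre-e2e-aks-node-pressure",
    "Synthetic route validation. Expected route: aks-admission-webhook-failure",
)
```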

Real Finding: Container Insights Was Broken

One useful outcome from testing was that the SRE Agent surfaced a real platform issue.

The AKS monitoring add-on was enabled, and ama-logs pods were running, but Log Analytics had no recent rows in:

KubePodInventory
ContainerLogV2
Heartbeat
InsightsMetrics

The ama-logs pod logs showed DCR parsing errors, and there were no Data Collection Rules or DCR associations.

That is a perfect example of why you need platform routes before application routes. If Drasi looks unhealthy but your AKS telemetry pipeline is broken, the first incident is not "restart Drasi". It is "fix monitoring".

I added a baseline alert for that:

KubePodInventory
| where TimeGenerated > ago(30m)
| summarize CurrentRows=count()
| where CurrentRows == 0

This routes to:

aks-monitoring-agent-fault

The SRE Agent correctly diagnosed the missing DCR/DCRA path and proposed re-onboarding Container Insights. That is a sensible fix, but it changes AKS monitoring configuration, so the remediation review skill keeps it as a human approval path.

Drasi Example: Source and Query Issues

Drasi has its own failure modes that are not generic Kubernetes failures.

One route in the blueprint handles a documented lifecycle case: creating a Source and then immediately creating a dependent Continuous Query before the Source has connected cleanly.

The response plan is:

drasi-source-bootstrap-race

The correct remediation is not to restart the cluster. It is:

Confirm the Source is healthy.
Inspect the Continuous Query status and resource-provider logs.
Delete and recreate only the affected Continuous Query if the bootstrap failed.

That is the kind of domain-specific behavior that belongs in a Drasi runtime skill, not a generic AKS skill.
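The ordering of those checks can be expressed as a small decision function. This is a Python sketch with hypothetical return labels, not the skill's actual format; it only encodes "source first, then recreate just the affected query".

```python
def bootstrap_race_remediation(source_healthy: bool,
                               query_bootstrap_failed: bool) -> str:
    """Decide the next step for a suspected source bootstrap race."""
    if not source_healthy:
        # Recreating the query cannot help while the source is down.
        return "fix-source-first"
    if query_bootstrap_failed:
        return "recreate-affected-query"
    return "no-action"
```

Nothing in the decision touches the cluster, which keeps this route well away from platform-level actions.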

The Deployment Flow

To deploy, the flow is:

git clone https://github.com/lukemurraynz/drasi-aks-sre-agent.git
cd drasi-aks-sre-agent

azd auth login
az login

azd env new drasi-sre-dev
azd env set DRASI_RESOURCE_GROUP_NAME <drasi-resource-group>
azd env set DRASI_AKS_CLUSTER_NAME <aks-cluster-name>
azd env set DRASI_LOG_ANALYTICS_WORKSPACE_NAME <workspace-name>
azd env set AZURE_RESOURCE_GROUP <agent-resource-group>
azd env set AZURE_SRE_AGENT_NAME <agent-name>

azd up

Refer to my previous blog article Deploy Drasi Faster with the Azure Developer CLI Extension if you want to get Drasi running on AKS using an AZD extension.

The first run provisions the agent and then applies the data-plane configuration:

custom agents
skills
response plans
scheduled tasks
MCP tool enablement

The reason for the post-provision step is pragmatic: not every SRE Agent object is cleanly portable through ARM in every tenant yet, so the repo uses Bicep for infrastructure and the SRE Agent data-plane API for operational content.

Lessons Learned

A few things stood out.

1. Route by failure phase before the product

Creation-time failures usually mean admission, workload identity, policy, or API-server health.
Pending-time failures usually mean scheduling, capacity, subnet, or autoscaler.
Metrics blindness usually means the metrics API or the monitoring pipeline.

Only after those are clean should the Drasi specialist take over.

2. Autonomous should be boring

Starting a stopped AKS cluster is boring enough for my environment.

Recreating Container Insights, changing DCRs, scaling node pools, changing webhooks, deleting finalizers, or modifying networking is not.

Those remain approval-gated.

3. Synthetic alerts are useful, but dangerous if left on

Always-firing metric alerts are great for response-plan validation.

They are terrible as a permanent baseline.

Deploy them behind a flag, run the validation, capture the evidence, and delete them.

4. "Connected" is not the same as "usable."

MCP connectors can be connected and remain healthy even when their tools are not active for the agent.

Check the actual tool assignment, not just connector health.

5. Observability needs its own alert

If Container Insights stops sending inventory, many AKS alerts become blind.

That is a reliability incident in its own right.

Where This Fits in Well-Architected

From a Well-Architected Reliability perspective, this is about reducing detection and diagnosis time without blindly increasing the risk of automation.

From an Operational Excellence perspective, it gives you:

version-controlled runbooks
repeatable deployment
consistent incident routing
explicit approval boundaries
scheduled operational review
post-incident feedback loops

From a Cost Optimization perspective, it also matters because noisy autonomous agents can quickly burn through tokens and tools. Route narrowly, scope tools, and keep high-impact flows in Review until you have real evidence.

Final Thoughts

Azure SRE Agent is most useful when you treat it like an operational platform, not a chatbot.

The value comes from the structure around it:

focused agents
route-specific response plans
current documentation tools
scoped RBAC
review-mode safety gates
scheduled checks
fault-injection evidence

For AKS and Drasi, that structure matters even more because the symptoms overlap. A Drasi issue can look like a Kubernetes issue, and a Kubernetes issue can make Drasi look broken.

That is exactly the kind of ambiguity SRE Agents can help with, as long as we give them the right guardrails. Hopefully this gives you enough of a view and scaffold to fit your own purposes.

Deploy Drasi Faster with the Azure Developer CLI Extension

Wed, 15 Apr 2026 06:24:12 GMT

I have deployed Drasi enough times now to know exactly where the pain shows up: too much manual scaffolding, inconsistent post-provision steps, and "it worked in one environment but not the other" cluster setup drift.

So I built a custom Azure Developer CLI (azd) extension called azure.drasi to standardize that workflow end-to-end.

It gives you a clean, repeatable way to:

Scaffold Drasi projects from templates
Validate config before touching infrastructure
Provision AKS + supporting Azure resources in one flow
Deploy sources, queries, middleware, and reactions in dependency order
Operate and troubleshoot Drasi workloads with native azd commands

Why I Built This

Drasi deployments are not just "deploy app and move on". You normally need to coordinate:

Azure Kubernetes Service (AKS) configuration (including Workload Identity)
Namespace/runtime setup
Managed identity + Key Vault + diagnostics plumbing
Correct deployment order for Drasi components

This is exactly the kind of process that becomes fragile if left to handwritten, ad hoc scripts per repo.

The extension wraps those moving parts into a consistent set of AZD commands, so your Drasi workloads feel like any other azd project lifecycle.

What the Extension Covers

The current azure.drasi extension supports:

Project scaffolding templates (blank, blank-terraform, event-hub-routing, postgresql-source)
Offline validation of Drasi config before deployment
Infrastructure provisioning for AKS, Key Vault, UAMI, and Log Analytics
Ordered Drasi component deployment with health checks
Operations commands for status, logs, and diagnostics
Safe teardown and runtime upgrade actions

Supported Template Matrix

Template	Best for	Typical use case
`blank`	Starting from scratch	Build a custom Drasi topology with your own sources/queries/reactions
`blank-terraform`	Infra-first teams	Use Terraform-based provisioning workflows with Drasi project scaffolding
`event-hub-routing`	Streaming/event routing	Ingest from Event Hubs and route/filter events with Drasi queries
`postgresql-source`	Relational CDC demos/POCs	Capture PostgreSQL changes and validate end-to-end Drasi flow quickly

These templates are starting points, not rigid blueprints. Before you run azd drasi provision, you can modify infrastructure settings (for example VM sizes/SKUs, PostgreSQL sizing, networking, and environment parameters) to fit your subscription limits, region availability, and production standards.

Installation

Install the extension from my GitHub Releases registry:

azd extension source add -n drasi-lukemurray-azdext -t url -l "https://github.com/lukemurraynz/azd.extensions.drasi/releases/latest/download/registry.json"
azd extension install azure.drasi -s drasi-lukemurray-azdext

Verify:

azd drasi --help
azd drasi version

You can upgrade the extension to the latest version from my repo using:

azd extension upgrade azure.drasi

Quick Start (First Run)

This is the fast path from an empty folder to deployed Drasi components:

mkdir my-drasi-app && cd my-drasi-app
azd init --minimal -force
azd drasi init --template postgresql-source
azd env new drasienv
azd drasi validate --strict
azd auth login
az login
azd drasi provision
azd drasi deploy
azd drasi status

Cost note: azd drasi provision can create billable resources (especially AKS and Log Analytics). Use a dedicated dev/test subscription or budget guardrails for experimentation. The figures below are examples only, to give a rough sense of cost. Azure Developer CLI makes it easy to tear down and redeploy entire environments, which helps keep these costs down.

The postgresql-source template baseline (SKUs as defined in the Bicep: 2× Standard_D2s_v5 AKS nodes, Standard_B1ms PostgreSQL, Standard NAT Gateway + Public IP) — estimated USD, pay-as-you-go, 24 h/day:

newzealandnorth

Resource	SKU	1 day	7 days	30 days
AKS nodes ×2	Standard_D2s_v5	$6.05	$42.34	$181.44
PostgreSQL compute	Standard_B1ms (Burstable)	$0.66	$4.59	$19.66
NAT Gateway	Standard	$1.08	$7.56	$32.40
Public IP	Standard Static	$0.12	$0.84	$3.60
Total		$7.90	$55.32	$237.10

Key Vault (Standard) and Log Analytics are consumption-based: Key Vault is negligible for dev use; Log Analytics adds $3.51/GB (NZ North) above the 5 GB/month free allowance. VNet and managed identities are free.
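If you want to scale the table to a different number of days, the arithmetic is straightforward. This Python sketch derives a daily rate from the 30-day column above; it covers only the four fixed-rate rows and ignores the consumption-based services.

```python
# 30-day figures (USD) copied from the table above.
THIRTY_DAY_USD = {
    "AKS nodes x2 (Standard_D2s_v5)": 181.44,
    "PostgreSQL compute (Standard_B1ms)": 19.66,
    "NAT Gateway (Standard)": 32.40,
    "Public IP (Standard Static)": 3.60,
}


def total_for_days(days: int) -> float:
    """Scale the 30-day total to an arbitrary number of 24-hour days."""
    daily = sum(THIRTY_DAY_USD.values()) / 30
    return round(daily * days, 2)


one_day = total_for_days(1)       # 7.9
one_week = total_for_days(7)      # 55.32
one_month = total_for_days(30)    # 237.1
```

Adding up the rounded rows in the 1-day column gives $7.91 rather than $7.90, so the totals row appears to be computed from unrounded rates, as this sketch does.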

Region note: If a SKU/offer is restricted in your default location, set a supported region before provisioning. For example:

azd env set AZURE_LOCATION australiaeast
azd drasi provision

This flow is intentionally opinionated: validate early, provision once, then deploy in a known order.
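"Deploy in a known order" can be as simple as sorting components by kind. The Python sketch below assumes the order listed earlier (sources, queries, middleware, reactions). Apart from `continuousquery`, the kind strings and all component names are hypothetical, and the extension's real ordering logic may be more involved.

```python
from typing import Dict, List

# Assumed order, following "sources, queries, middleware, and reactions".
KIND_ORDER = ["source", "continuousquery", "middleware", "reaction"]


def deployment_order(components: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sort components by kind; sorted() is stable, so ties keep input order."""
    rank = {kind: index for index, kind in enumerate(KIND_ORDER)}
    return sorted(components, key=lambda component: rank[component["kind"]])


# Hypothetical components, deliberately listed out of order.
components = [
    {"kind": "reaction", "name": "notify-ops"},
    {"kind": "continuousquery", "name": "order-changes"},
    {"kind": "source", "name": "orders-db"},
]
ordered_names = [c["name"] for c in deployment_order(components)]
```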

Common Scenarios

These are the scenarios I hit most often when building demos and internal proofs-of-concept.

1. Scaffold and Start with a Known Pattern

When you want to get moving quickly with a real source/reaction shape, start from a template:

azd drasi init --template event-hub-routing
azd drasi validate

This avoids copy/paste YAML drift and gives you a repeatable baseline across contributors.

2. Validate in CI Before Provision/Deploy

If you want fast feedback on pull requests:

azd drasi validate --strict

Because validation runs offline, you can fail quickly without needing cluster access.
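To illustrate what an offline check can catch, here is a Python sketch of reference validation over a hypothetical config shape (`sources`, `queries`, and `reactions` lists with `name` fields). This is not the extension's actual schema or rule set, just the kind of check that needs no cluster.

```python
from typing import Dict, List


def validate_offline(config: Dict) -> List[str]:
    """Return reference errors that can be found without any cluster access."""
    errors: List[str] = []
    sources = {s["name"] for s in config.get("sources", [])}
    queries = {q["name"] for q in config.get("queries", [])}
    for query in config.get("queries", []):
        for source in query.get("sources", []):
            if source not in sources:
                errors.append(
                    f"query '{query['name']}' references unknown source '{source}'")
    for reaction in config.get("reactions", []):
        for query_name in reaction.get("queries", []):
            if query_name not in queries:
                errors.append(
                    f"reaction '{reaction['name']}' references unknown query '{query_name}'")
    return errors


# Hypothetical config with one dangling reference.
config = {
    "sources": [{"name": "orders-db"}],
    "queries": [{"name": "order-changes", "sources": ["orders-db"]}],
    "reactions": [{"name": "notify-ops", "queries": ["order-change"]}],
}
errors = validate_offline(config)
```

A non-empty list is enough to fail a pull request before anything is provisioned.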

3. Dry-Run Before a Live Deploy

Useful when you want confidence in component changes:

azd drasi deploy --dry-run

Think of this as your safety rail before touching a shared environment.

4. Multi-Environment Deployments

Use overlays and environment targeting for dev/stage/prod separation:

azd drasi provision --environment dev
azd drasi deploy --environment dev

azd drasi provision --environment prod
azd drasi deploy --environment prod

This is where the extension helps prevent "prod got dev settings" moments.

5. Operate and Troubleshoot a Running Deployment

azd drasi status
azd drasi status --kind continuousquery --output json
azd drasi logs --kind continuousquery --component order-changes
azd drasi diagnose

The diagnose command is especially useful when something is failing across auth, cluster connectivity, or runtime dependencies.

6. Teardown with Guardrails

# Components only
azd drasi teardown --force

# Components + infrastructure
azd drasi teardown --force --infrastructure

Cleanup note: If infrastructure remains provisioned, AKS and Log Analytics can continue incurring cost. Use azd drasi teardown --force --infrastructure (or azd down when applicable) to clean up fully.

This is force-gated by design so you are less likely to accidentally wipe an environment.

A normal azd down also works.

Day-2 Operations Notes

Some practical notes after using this in repeated demo cycles:

Prefer --environment consistently, even in dev, so context switching is explicit.
Use --output json in automation jobs where you need machine-readable state.
Keep secrets in Key Vault references and out of repo config.
Use validate --strict as a pre-deploy gate in CI.

Gotchas I Found

Kube context confusion still happens. If your local context points at the wrong cluster, operations commands can surprise you. Prefer explicit environment targeting where possible.

Validation is not a replacement for live diagnostics. validate catches config-level issues early, but connectivity/auth/runtime checks still belong to diagnose on a live target.

Teardown is intentionally friction-filled. You must use --force, and that is a good thing.

Who This Is For

This extension is useful if you:

Deploy Drasi repeatedly across multiple environments
Want a reusable bootstrap path for sources/queries/reactions
Need cleaner team handover (same commands, same flow)
Prefer AZD-native workflows over custom one-off scripts

If you only run one tiny local experiment once, this may feel like overkill. For anything beyond that, consistency pays for itself quickly.

Wrapping Up

The main goal of azure.drasi is simple: remove the repetitive plumbing and make Drasi delivery predictable.

Instead of rebuilding the same script stack every time, you can use one AZD extension workflow to scaffold, validate, provision, deploy, operate, and clean up.

I will add more walkthrough GIFs and scenario demos over time, but the extension is already usable today for practical Drasi workflows.

Code: lukemurraynz/azd.extensions.drasi

If you try azure.drasi, I’d love your feedback:

Issues: Report bugs or request features

Remove Build-Time Environment Variables with Azure App Configuration with Front Door for Static Web Apps

Sat, 04 Apr 2026 04:11:36 GMT

Today, we are going to look at a preview feature that solves one of the most common pain points in SPA (single page application) or Static Web App deployments - build-time environment variable injection - using Azure App Configuration with Azure Front Door.

If you have ever had to rebuild a React or Vue app just because the API URL changed between staging and production, this one is for you.

info

This article walks through a proof of concept using preview SDKs. The pattern is production-applicable, but the Azure Front Door integration for App Configuration is currently in public preview. SDK versions and APIs may change before GA.

The Problem Everyone Has Hit

Every Vite, React, Next.js, or Vue developer knows this pattern:

# Build stage - config is compiled INTO the JavaScript
ARG VITE_API_URL
ENV VITE_API_URL=$VITE_API_URL
RUN npm run build

Vite replaces import.meta.env.VITE_API_URL with the literal string value at build time. The output JavaScript file contains "https://api-staging.example.com" as a hardcoded constant. To point at production, you rebuild the entire application.
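To make the substitution concrete, here is a minimal sketch that mimics the effect of a build-time replacement. It is not Vite's implementation: inlineEnv is a hypothetical helper that does a plain text replace, which is enough to show why the URL ends up as a literal in the output.

```typescript
// Hypothetical helper: mimics build-time substitution with a plain text replace.
// Vite's real implementation differs; this only illustrates the effect.
function inlineEnv(source: string, env: Record<string, string>): string {
  return source.replace(/import\.meta\.env\.(\w+)/g, (match, name: string) =>
    Object.prototype.hasOwnProperty.call(env, name)
      ? JSON.stringify(env[name])
      : match,
  );
}

const source = "fetch(import.meta.env.VITE_API_URL + '/weather');";

// The "build" for staging bakes the staging hostname into the output text.
const stagingBundle = inlineEnv(source, {
  VITE_API_URL: "https://api-staging.example.com",
});
```

After this step the variable reference is gone from stagingBundle, so the only way to point the same code at production is to run the build again with a different value.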

This causes real problems:

One build per environment - staging, UAT, production each need their own Docker image or pipeline run
Leaked URLs - a staging API hostname baked into a production bundle is a common incident
CI/CD coupling - your frontend pipeline needs to know infrastructure details at build time
No runtime changes - updating a feature flag or API version requires a full rebuild and redeploy

Because of this issue, I developed my own Copilot skill dedicated entirely to diagnosing ERR_NAME_NOT_RESOLVED errors caused by incorrect build-time URLs. The fact that this needs its own troubleshooting guide tells you something about how often it goes wrong.

What Changed

In late 2025, Azure App Configuration added Azure Front Door integration. The idea is straightforward: serve your configuration through a CDN endpoint that browsers can call directly, without authentication.

The architecture shift looks like this:

Before (build-time injection):

Build Pipeline → injects VITE_API_URL → npm run build → baked into JS bundle
                                                              ↓
                                              One artifact per environment

After (runtime fetch via CDN):

npm run build → single artifact (no config baked in)
                         ↓
Browser loads app → JS calls Front Door CDN endpoint (HTTPS GET, no auth)
                         ↓
Front Door → (managed identity) → App Configuration store → returns JSON
                         ↓
App receives { "ApiUrl": "https://api-prod.example.com", "Theme": "dark" }

The built JavaScript bundle is identical across dev, staging, and production. Configuration arrives as an HTTP response at runtime, not as compiled constants.

Runtime config and feature flags are delivered at request time via Front Door, not compiled into the bundle.

Why Front Door? Can I Just Use App Configuration Directly?

This is the first question I had. Azure App Configuration already has a JavaScript SDK (@azure/app-configuration). Why add Front Door in the middle?

The answer is authentication. App Configuration requires credentials to access - either a connection string or a Microsoft Entra ID token. An SPA running in a browser cannot securely hold either of these. You cannot embed a connection string in JavaScript that ships to the client. And you cannot run DefaultAzureCredential in a browser - there is no managed identity context.

Front Door solves this by acting as an authentication proxy:

	App Configuration Direct	App Configuration + Front Door
Client auth required	Yes (connection string or Entra token)	No (unauthenticated HTTPS GET)
Works in browser/SPA	No (cannot hold secrets)	Yes
Works server-side	Yes (managed identity)	Yes (but overkill)
CDN caching	No	Yes (global edge, DDoS protection)
Scoped exposure	N/A (full access with credentials)	Yes (only configured key filters served)
Feature flags	Yes	Yes
Cost	App Config only	App Config + Front Door Standard/Premium

The rule is simple: server-side apps (APIs, Functions, background workers) use App Configuration directly with managed identity. Client-side apps (SPAs, mobile) that cannot hold secrets use App Configuration through Front Door.

This is not a replacement for server-side App Configuration. It is the missing piece for browser-based clients that previously had no safe way to consume runtime configuration.

Does This Work on Azure Static Web Apps?

Yes. This is one of the strongest use cases.

Azure Static Web Apps serves pre-built static files from a global CDN. There is no server-side runtime to inject environment variables at request time. Today, if you need a different config per environment (staging vs production), you either:

Rebuild the app per environment with different VITE_* build args
Use a workaround like a /config.json file served from the API backend
Use Static Web Apps environment variables injected at build time (same rebuild problem)

With App Configuration + Front Door, none of this is needed. The built JavaScript makes an HTTPS fetch() call to the Front Door CDN endpoint when the app loads. It works the same way whether the app is hosted on Static Web Apps, Blob Storage with a CDN, or Nginx in a container. The hosting platform does not matter because the config fetch is a standard browser HTTP request.

In this demo, accessing via the Front Door endpoint is the intended path; the direct Static Web App hostname is intentionally not the runtime-config path.

The deployment flow becomes:

GitHub Actions → npm run build → deploy to Static Web App (once)
                                        ↓
              The same artifact serves staging AND production
              Config values differ per App Configuration store/labels

No rebuild per environment. No pipeline secrets leaking into static assets.

The Scenario

To demonstrate this, I built a simple weather dashboard SPA. If you want the full deployable implementation (Vite app + Bicep + azd workflows), the companion repository is here: lukemurraynz/appconfig-frontdoor-spa-demo.

The dashboard has three settings that traditionally would be build-time environment variables:

Setting	Purpose	Traditional Approach
WeatherDashboard:ApiUrl	Backend API endpoint	VITE_API_URL build arg
WeatherDashboard:RefreshIntervalSeconds	Data refresh frequency	Hardcoded or VITE_REFRESH_INTERVAL
WeatherDashboard:Theme	UI theme (light/dark)	VITE_THEME or CSS variable

It also has a feature flag - WeatherDashboard.ExtendedForecast - that toggles an extended forecast section on and off without a code change or redeploy. This is the kind of thing you would normally hardcode or gate behind a build-time flag.

With App Configuration + Front Door, all three settings and the feature flag become runtime-fetched values that can be changed in the Azure portal without touching the deployed application.

Setting Up the Azure Resources

You need two Azure resources: an App Configuration store and an Azure Front Door profile.

Step 1: Create the App Configuration Store

az appconfig create \
  --name appconfig-weather-demo \
  --resource-group rg-appconfig-demo \
  --location australiaeast \
  --sku Standard

note

The Free tier works for testing, but Standard is required for production workloads (replicas, Private Link, higher request limits).

Step 2: Add Configuration Values

az appconfig kv set --name appconfig-weather-demo \
  --key "WeatherDashboard:ApiUrl" \
  --value "https://api.open-meteo.com/v1/forecast" -y

az appconfig kv set --name appconfig-weather-demo \
  --key "WeatherDashboard:RefreshIntervalSeconds" \
  --value "300" -y

az appconfig kv set --name appconfig-weather-demo \
  --key "WeatherDashboard:Theme" \
  --value "light" -y

I am using the Open-Meteo API here because it is free, requires no API key, and returns real weather data. This keeps the demo self-contained with no additional service dependencies.

Add a Feature Flag

az appconfig feature set --name appconfig-weather-demo \
  --feature "WeatherDashboard.ExtendedForecast" \
  --description "Show extended 3-day forecast section" -y

az appconfig feature enable --name appconfig-weather-demo \
  --feature "WeatherDashboard.ExtendedForecast" -y

Feature flags in App Configuration are stored as key-values with a reserved prefix (.appconfig.featureflag/). When you configure the Front Door endpoint, the Key of feature flag filter field controls which flags are exposed. Set it to WeatherDashboard.* to match our flag.
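As a rough illustration of how the reserved prefix and the filter relate, the sketch below builds the stored key for a flag and checks a flag name against a filter. Both helpers are hypothetical, the matching only handles an exact name or a single trailing *, and it assumes the filter is compared against the flag name without the prefix.

```typescript
// Reserved prefix quoted in the text above.
const FEATURE_FLAG_PREFIX = ".appconfig.featureflag/";

// Hypothetical helper: builds the stored key for a feature flag name.
function featureFlagKey(flagName: string): string {
  return FEATURE_FLAG_PREFIX + flagName;
}

// Simplified matching: an exact name, or a prefix followed by a trailing "*".
// Assumption: other filter syntax (lists, escapes) is deliberately ignored.
function matchesFilter(name: string, filter: string): boolean {
  return filter.endsWith("*")
    ? name.startsWith(filter.slice(0, -1))
    : name === filter;
}

const flag = "WeatherDashboard.ExtendedForecast";
const storedKey = featureFlagKey(flag);
const exposed = matchesFilter(flag, "WeatherDashboard.*");
const hidden = matchesFilter("Billing.NewCheckout", "WeatherDashboard.*");
```

Here exposed is true and hidden is false, which is the behaviour you want from a scoped filter: only flags under the WeatherDashboard. prefix are served.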

Step 3: Connect Azure Front Door

In the Azure portal:

Navigate to your App Configuration store
Under Settings, select Azure Front Door (preview)
Select Create new profile
Configure:
- Profile name: afd-weather-config
- Pricing tier: Standard
- Endpoint name: weather-config
- Origin host name: select your App Configuration store
- Identity type: System-assigned managed identity
- Cache Duration: 10 minutes
- Key filter: WeatherDashboard:*
- Feature flag filter: WeatherDashboard.*
Select Create & Connect

The portal automatically assigns the App Configuration Data Reader role to the managed identity.

warning

The key filter you configure on the Front Door endpoint must exactly match the selector in your application code. If your app requests WeatherDashboard:* but Front Door is configured for Weather:*, the request will be rejected. This is the most common setup mistake.
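One way to catch this before it reaches a browser is a small pre-flight check that compares the selectors in your code with the filters you configured. This is a sketch under assumptions: findUnservedSelectors is a hypothetical helper, and the Front Door filter list is copied by hand from the portal rather than read from any API.

```typescript
// Hypothetical pre-flight check; the Front Door filters are supplied by hand.
function findUnservedSelectors(
  appSelectors: string[],
  frontDoorFilters: string[],
): string[] {
  const allowed = new Set(frontDoorFilters);
  // Exact string comparison on purpose: "Weather:*" does not cover
  // "WeatherDashboard:*", even though it looks like a broader prefix.
  return appSelectors.filter((selector) => !allowed.has(selector));
}

const matching = findUnservedSelectors(
  ["WeatherDashboard:*"],
  ["WeatherDashboard:*"],
);
const mismatched = findUnservedSelectors(["WeatherDashboard:*"], ["Weather:*"]);
```

An empty result means every selector has an identical filter on the endpoint. Anything returned is a selector that would be rejected at runtime.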

After creation, note your Front Door endpoint URL from the Existing endpoints table. It looks like: https://weather-config-xxxxxxxxx.z01.azurefd.net

What This Looks Like in IaC (from my demo repo)

The demo also codifies the App Configuration-to-Front Door relationship in Bicep, so it is reproducible across environments. I had to reverse engineer the ARM template here: App Configuration integration with Azure Front Door.

1. App Configuration resource linked to Front Door profile (infra/main.bicep):

resource appConfig 'Microsoft.AppConfiguration/configurationStores@2025-06-01-preview' = {
  name: appConfigName
  location: location
  sku: {
    name: 'standard'
  }
  properties: {
    azureFrontDoor: {
      resourceId: frontDoorProfileRef.id
    }
  }
}

2. AFD managed identity auth scope for App Configuration origin (infra/modules/frontdoor-environment.bicep):

resource configOriginGroup 'Microsoft.Cdn/profiles/originGroups@2025-06-01' = {
  parent: frontDoorProfile
  name: configOriginGroupName
  properties: {
    authentication: {
      type: 'SystemAssignedIdentity'
      scope: 'https://appconfig.azure.com/.default'
    }
  }
}

That scope value is the AFD token audience for App Configuration. Combined with App Configuration Data Reader role assignment, Front Door can fetch config on behalf of the browser while keeping credentials out of client code.

This is the live outcome: runtime values and feature flags can differ by environment without rebuilding the SPA.

If you want to deploy exactly this setup, use the repo's azd up flow and scripts documented in the demo README.

Building the Weather Dashboard

The demo is a vanilla TypeScript application built with Vite. No framework dependencies beyond what I needed to demonstrate the pattern.

Project Setup

npm create vite@latest weather-dashboard -- --template vanilla-ts
cd weather-dashboard
npm install
npm install @azure/app-configuration-provider@2.3.0-preview.1
npm install @microsoft/feature-management

The Configuration Loader

Create src/config.ts:

import { loadFromAzureFrontDoor } from "@azure/app-configuration-provider";
import {
  FeatureManager,
  ConfigurationMapFeatureFlagProvider,
} from "@microsoft/feature-management";

export interface AppConfig {
  apiUrl: string;
  refreshIntervalSeconds: number;
  theme: "light" | "dark";
  featureManager: FeatureManager;
}

const AFD_ENDPOINT =
  import.meta.env.VITE_AFD_ENDPOINT ??
  "https://weather-config-xxxxxxxxx.z01.azurefd.net";

export async function loadConfig(): Promise<AppConfig> {
  const settingsMap = await loadFromAzureFrontDoor(AFD_ENDPOINT, {
    selectors: [{ keyFilter: "WeatherDashboard:*" }],
    featureFlagOptions: { enabled: true },
    refreshOptions: {
      enabled: true,
      refreshIntervalInMs: 60_000,
    },
  });

  const featureManager = new FeatureManager(
    new ConfigurationMapFeatureFlagProvider(settingsMap),
  );

  return {
    apiUrl:
      settingsMap.get("WeatherDashboard:ApiUrl") ??
      "https://api.open-meteo.com/v1/forecast",
    refreshIntervalSeconds: parseInt(
      settingsMap.get("WeatherDashboard:RefreshIntervalSeconds") ?? "300",
      10,
    ),
    theme:
      (settingsMap.get("WeatherDashboard:Theme") as "light" | "dark") ??
      "light",
    featureManager,
  };
}

Two things to notice:

featureFlagOptions: { enabled: true } tells the provider to load feature flags alongside key-values. Feature flags use the reserved .appconfig.featureflag/ prefix, which the provider handles automatically.
ConfigurationMapFeatureFlagProvider wraps the settings map so FeatureManager can evaluate flags. You then use featureManager.isEnabled("WeatherDashboard.ExtendedForecast") anywhere in your app.
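Because these values can now be edited in the portal at any time, it is worth validating them rather than casting. The loader above would pass a typo such as "solarized" straight through as a theme, and a non-numeric interval would become NaN. A minimal sketch of stricter parsing follows. The helper names are hypothetical and not part of the SDK, and a plain Map stands in for the provider's settings map so the example runs on its own.

```typescript
type Theme = "light" | "dark";

// Falls back to the default when the value is missing, non-numeric, or not positive.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Narrows an arbitrary string to the Theme union instead of casting it.
function parseTheme(raw: string | undefined, fallback: Theme): Theme {
  return raw === "light" || raw === "dark" ? raw : fallback;
}

// Stand-in for the provider's settings map, with two deliberately bad values.
const settings = new Map<string, string>([
  ["WeatherDashboard:RefreshIntervalSeconds", "five minutes"],
  ["WeatherDashboard:Theme", "solarized"],
]);

const refreshIntervalSeconds = parsePositiveInt(
  settings.get("WeatherDashboard:RefreshIntervalSeconds"),
  300,
);
const theme = parseTheme(settings.get("WeatherDashboard:Theme"), "light");
```

Both bad values fall back to the defaults (300 and light), so a portal typo degrades to known-good behaviour instead of breaking the refresh timer.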

The only "baked in" value is the Front Door endpoint URL itself. This URL is stable per environment and rarely changes, unlike API endpoints, feature flags, and display settings. You could also inject it as a single build arg or serve it from a /config.json on the same host.

The feature flag evaluation happens at runtime on every refresh cycle. Toggle WeatherDashboard.ExtendedForecast on or off in the Azure portal, and the extended forecast section appears or disappears on the next refresh - no rebuild, no redeploy.

Running It

Open the deployed website. You should see:

A brief "Loading configuration from Azure Front Door..." message
The weather card populated with real Auckland weather data
A footer showing the config source: Config loaded at runtime via CDN | API: https://api.open-meteo.com/v1/forecast | Refresh: 300s | Theme: light

Now go to the Azure portal and try two things:

Change WeatherDashboard:Theme from light to dark - the app switches themes on the next refresh
Disable the WeatherDashboard.ExtendedForecast feature flag - the 3-day forecast section disappears

Both changes take effect without a rebuild or redeploy. The status bar shows the feature flag state so you can confirm it is working.

The Docker Build - One Artifact, Every Environment

Here is where the value becomes concrete. The Dockerfile no longer needs environment-specific build args:

FROM node:22-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM nginx:alpine
COPY --from=build /app/dist /usr/share/nginx/html
EXPOSE 80

No ARG VITE_API_URL. No ENV VITE_API_URL. The same image runs in dev, staging, and production.

The only environment-specific value is the Front Door endpoint URL, which you can inject via a single environment variable or serve from a static /config.json on the same origin. Everything else - API URLs, refresh intervals, themes, feature flags - comes from App Configuration through Front Door at runtime.
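If you go the /config.json route, the lookup order can be sketched as below. This is an illustration only: the afdEndpoint field name and the resolveAfdEndpoint helper are assumptions, not something the demo repo is known to use.

```typescript
// Hypothetical shape for a /config.json served next to the app.
interface RuntimeConfigFile {
  afdEndpoint?: string;
}

// Picks the endpoint in priority order: runtime file, build-time value, default.
function resolveAfdEndpoint(
  runtimeFile: RuntimeConfigFile | undefined,
  buildTimeValue: string | undefined,
  fallback: string,
): string {
  const candidate = runtimeFile?.afdEndpoint ?? buildTimeValue ?? fallback;
  if (!candidate.startsWith("https://")) {
    throw new Error(`Front Door endpoint must use HTTPS: ${candidate}`);
  }
  // Drop trailing slashes so later path joins stay predictable.
  return candidate.replace(/\/+$/, "");
}

const fallback = "https://weather-config-xxxxxxxxx.z01.azurefd.net";
const fromFile = resolveAfdEndpoint(
  { afdEndpoint: "https://config.example.com/" },
  undefined,
  fallback,
);
const fromDefault = resolveAfdEndpoint(undefined, undefined, fallback);
```

Keeping this in one function means the rest of the app never needs to know where the endpoint came from.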

One artifact, multiple environments: shared Front Door profile, separate endpoints/stores, isolated runtime config.

Security Considerations

The Front Door endpoint is unauthenticated. Any browser (or curl) can hit it. This is the same threat model as any public CDN asset.

What is safe to serve through this channel:

UI themes and display strings
Public API base URLs (these are already visible in your JS bundle today)
Feature flags for non-sensitive features
Version numbers and refresh intervals

What should never go through this channel:

API keys, tokens, or connection strings
Internal service URLs that reveal infrastructure
Business-critical pricing or logic config that competitors should not see

Sensitive configuration stays server-side with managed identity authentication. The Front Door channel is for config that is already effectively public in your shipped JavaScript bundle.
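A cheap guard is to scan key names under the public prefix for anything that looks like a secret before you expose the filter. The sketch below is a heuristic, not a security control: the pattern and the findSuspiciousKeys helper are assumptions you would tune to your own naming conventions.

```typescript
// Heuristic only: key names that suggest a secret.
// Assumption: adjust the pattern for your own naming conventions.
const SENSITIVE_NAME =
  /(secret|password|token|api[-_]?key|connectionstring)/i;

// Returns the keys whose names match the heuristic.
function findSuspiciousKeys(keys: string[]): string[] {
  return keys.filter((key) => SENSITIVE_NAME.test(key));
}

const suspicious = findSuspiciousKeys([
  "WeatherDashboard:ApiUrl",
  "WeatherDashboard:Theme",
  "WeatherDashboard:ApiKey",
  "WeatherDashboard:Db:ConnectionString",
]);
```

In this example the last two keys are flagged. A name-based check cannot see secret values stored under innocent-looking keys, so treat it as a prompt for review rather than proof that a prefix is safe to publish.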

Gotchas I Found

Filter matching is character-exact. The keyFilter in your JavaScript must match the filter configured on the Front Door endpoint character-for-character. If the code uses WeatherDashboard:* and Front Door is configured with WeatherDashboard* (no colon), the request is rejected with no useful error message.

No sentinel key refresh. Unlike server-side App Configuration, you cannot use a sentinel key to trigger refresh. The SDK uses "monitor all selected keys" mode, which checks all keys for changes on the refresh interval.

Cache TTL matters. Front Door caches responses. If you set a 10-minute cache TTL, config changes take up to 10 minutes to reach clients. Setting it too low increases origin requests and risks throttling your App Configuration store.
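The trade-off can be put into rough numbers. The sketch below uses a deliberately simplified model, which is an assumption rather than documented Front Door behaviour: a change becomes visible once the cached response expires and the client's next refresh poll runs, and each edge location fetches a given cached response from the origin at most once per TTL window.

```typescript
// Simplified model (assumption): worst case is a full cache TTL followed by
// a full client refresh interval.
function worstCasePropagationSeconds(
  cacheTtlSeconds: number,
  clientRefreshSeconds: number,
): number {
  return cacheTtlSeconds + clientRefreshSeconds;
}

// Rough upper bound on origin fetches per hour, per edge location, for one
// cached response: at most one fetch per TTL window.
function maxOriginFetchesPerHour(cacheTtlSeconds: number): number {
  return Math.ceil(3600 / cacheTtlSeconds);
}

// Values from this walkthrough: 10-minute cache, 60-second client refresh.
const delaySeconds = worstCasePropagationSeconds(10 * 60, 60);
const fetchesAtTenMinutes = maxOriginFetchesPerHour(10 * 60);
const fetchesAtThirtySeconds = maxOriginFetchesPerHour(30);
```

With the settings used in this walkthrough, the model gives a worst case of 660 seconds (11 minutes) for a change to appear. Dropping the TTL from 10 minutes to 30 seconds raises the per-edge ceiling from 6 to 120 origin fetches an hour, which is where the throttling risk comes from.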

Language support is limited. As of April 2026, only JavaScript (@azure/app-configuration-provider v2.3.0-preview) and .NET (Microsoft.Extensions.Configuration.AzureAppConfiguration v8.5.0-preview) have Front Door support. Java, Python, and Go are listed as "work in progress."

When to Use This Pattern

This pattern makes sense when:

You deploy the same SPA to multiple environments and are tired of rebuilding per environment
You want to change feature flags or display settings without a CI/CD run
Your SPA currently uses VITE_* or NEXT_PUBLIC_* build args for configuration that changes between environments
You need CDN-level performance for config delivery (global latency, DDoS protection)

It is less suited for:

Server-rendered applications (use server-side App Configuration with managed identity instead)
Apps with only one or two config values that genuinely never change
Configurations containing secrets (these must stay server-side)

Wrapping Up

Build-time environment variable injection for SPAs is a pattern that works until it does not. The moment you need multiple environments, runtime config changes, or the same artifact deployed across regions, the rebuild-per-environment model becomes a liability.

Azure App Configuration with Front Door moves SPA configuration from compile-time constants to runtime-fetched data, delivered through a CDN. The trade-off is clear: you accept eventual consistency (cache TTL) and a public endpoint (no per-client auth) in exchange for a single build artifact and runtime configuration changes.

The feature is still in preview, and the SDK support is limited to JavaScript and .NET. But the architectural pattern - fetch config as data, not compile it as code - is sound and worth exploring now.

Want to deploy this exact walkthrough end-to-end? Start with the companion repo: lukemurraynz/appconfig-frontdoor-spa-demo (includes Bicep, azd provisioning, and runtime config/feature-flag demo scripts).

You can also check the official Microsoft samples on GitHub: JavaScript SPA sample (a full React chatbot with A/B testing across LLM models) and .NET MAUI sample.

NimbusIQ: Multi-Agent Azure Drift Remediation

Sun, 15 Mar 2026 10:24:44 GMT

As the AI Dev Days Hackathon comes to an end, I want to share my submission.

Today, I want to walk through something I have been building over the last wee while - a project called NimbusIQ. It is my submission for the AI Dev Days Hackathon, and it sits across the Best Multi-Agent System and Best Enterprise Solution categories.

At its core, NimbusIQ is built on Microsoft Agent Framework - Microsoft's orchestration layer for composing multi-agent pipelines in .NET. It gives you a WorkflowBuilder pattern for wiring agents together with explicit edges, lifecycle management via InProcessExecution, and the structure needed to run ten specialised agents in a coordinated sequence without the whole thing becoming a tangle of custom plumbing.

I spend some time working with Azure environments - helping teams understand their estates, finding configuration drift, catching orphaned resources, and figuring out what to fix first. If you have done any of that work, you will know the pain. Azure gives you no shortage of signals: Azure Advisor, Azure Resource Graph, Cost Management, PSRule for Azure, Azure Quick Review, Policy, Monitor - the list goes on. The problem is not a lack of data. The problem is that all of these signals live in different dashboards, different exports, and different tools. Nobody is joining them up.

So I thought to myself: what if I could build something that does the bit that currently requires a human cloud architect? Not the detection - Azure already does that well enough - but the reasoning, prioritisation, and remediation planning that happens after detection - scoped per Azure Service Group.

That is what NimbusIQ aims to do.

What is NimbusIQ?

In short, NimbusIQ is a multi-agent AI platform that continuously discovers your Azure estate, detects drift and policy violations, reasons across cost, reliability, sustainability, and governance signals, and produces remediation plans that a human can review and approve before anything gets applied.

It uses:

Microsoft Agent Framework for agent orchestration
Microsoft Foundry (GPT-4) for the reasoning and narrative generation
Azure MCP for grounded Azure capability discovery
Azure Container Apps, PostgreSQL, Key Vault, managed identity, and OpenTelemetry for the runtime

View the source

The full source code is on GitHub: github.com/lukemurraynz/NimbusIQ - feel free to explore, fork, or open PRs.

warning

This was created purely for the Hackathon, with a fair amount of hypervelocity engineering effort, although I have done my best to build in production-style safeguards, i.e. security, resilience, circuit breakers, and fallback endpoints. It is missing Entra ID authentication and various other functions, and it comes with no support, so use at your own risk.

The whole thing deploys with azd up.

The problem I was trying to solve

If you manage Azure estates at any sort of scale, you have probably lived this loop:

Gather evidence from multiple Azure tools
Interpret what actually changed and whether it matters
Decide whether cost, reliability, compliance, or architecture should take priority
Draft a remediation plan
Route it through approval
Hope the action actually improved things

That loop is manual, slow, and happens in spreadsheets or meeting rooms. The tools tell you what is wrong, but very few of them can tell you why it matters for a specific workload, what you should fix first, how to remediate it safely, or whether the change you made actually delivered value.

NimbusIQ automates that decision-support loop.

How NimbusIQ differs from existing tools

I want to be clear - NimbusIQ is not a replacement for Azure Advisor, PSRule for Azure, or Azure Quick Review. Those are solid detection and standards tools, and NimbusIQ actually uses their rule sets internally. What NimbusIQ adds is the orchestration and decision-support layer that sits above them.

Capability	Azure Advisor	PSRule	Azure Quick Review	NimbusIQ
Detect configuration violations	✓	✓	✓	✓
Continuous drift trending	✗	✗	✗	✓
AI-powered reasoning across signals	✗	✗	✗	✓ (6 LLM agents)
Workload-scoped analysis	✗	✗	✗	✓ (Azure Service Groups)
Generate deployable IaC (Bicep/Terraform)	✗	✗	✗	✓
Dual-control approval workflow	✗	✗	✗	✓
Explain WHY issues exist	~Basic	~Pattern-based	~Checklist-based	✓ (AI narrative)
Track value realisation	✗	✗	✗	✓
Auditable agent-to-agent lineage	✗	✗	✗	✓ (A2A tracing)

The way I think about it: if Azure Advisor is a dashboard, NimbusIQ is a cloud architect in the loop.

The architecture

NimbusIQ has three services:

Frontend - React with Fluent UI v9, showing a service graph, recommendations, approval workflow, and drift timeline
Control Plane API - ASP.NET Core (.NET 10) handling service groups, analysis runs, decisions, and RFC 9457 error responses
Agent Orchestrator - a .NET 10 background worker that runs the multi-agent pipeline using Microsoft Agent Framework

All three run on Azure Container Apps with managed identity everywhere. No secrets in config files - just DefaultAzureCredential and RBAC.

┌──────────────────────────────────────────────────────────────────┐
│  Frontend (React + Fluent UI v9)                                  │
│  Service graph · Recommendations · Approval workflow · Timeline   │
└─────────────────────────┬────────────────────────────────────────┘
                          │ REST / JWT (Entra ID - planned, not yet implemented)
┌─────────────────────────▼────────────────────────────────────────┐
│  Control Plane API (.NET 10 / ASP.NET Core)                       │
│  Service groups · Analysis runs · Decisions · RFC 9457 errors     │
└──────────┬──────────────────────────────┬────────────────────────┘
           │ PostgreSQL (EF Core)          │ Agent messages
┌──────────▼──────────────────────────────▼────────────────────────┐
│  Agent Orchestrator (.NET 10 background worker)                   │
│                                                                   │
│  DiscoveryWorkflow ──► MultiAgentOrchestrator (Microsoft MAF)    │
│    Resource Graph        │                                        │
│    Cost Management       ├─ ServiceIntelligenceAgent              │
│    Log Analytics         ├─ BestPracticeEngine (700+ rules)      │
│                          ├─ DriftDetectionAgent                   │
│                          ├─ WellArchitectedAssessmentAgent       │
│                          ├─ FinOpsOptimizerAgent                 │
│                          ├─ CloudNativeMaturityAgent             │
│                          ├─ ArchitectureAgent                    │
│                          ├─ ReliabilityAgent                     │
│                          ├─ SustainabilityAgent                  │
│                          └─ GovernanceNegotiationAgent           │
│                                                                   │
│  IacGenerationWorkflow (Foundry-powered Bicep/Terraform)         │
└──────────────────────────────────────────────────────────────────┘
           All on Azure Container Apps + PostgreSQL Flexible Server
           Managed Identity · OpenTelemetry · Key Vault

The ten agents

This is the bit I am most pleased with. NimbusIQ runs ten specialised agents, each with a distinct responsibility. Six of them use Microsoft Foundry (GPT-4) for reasoning; four are deterministic rule-based evaluators.

Here is how they are wired up using Microsoft Agent Framework's WorkflowBuilder:

WorkflowBuilder builder = new(executorBindings[0]);
builder.WithName("nimbusiq-sequential");
builder.WithDescription(
    "NimbusIQ multi-agent orchestration workflow powered by Microsoft Agent Framework.");

for (var index = 0; index < executorBindings.Count - 1; index++)
{
    builder.AddEdge(executorBindings[index], executorBindings[index + 1]);
}

builder.WithOutputFrom(executorBindings[^1]);
var workflow = builder.Build(validateOrphans: true);

await using Run run = await InProcessExecution.RunAsync(
    workflow, executionState, session.SessionId, cancellationToken);

Each agent is registered with a clear name and purpose:

_agents = new Dictionary
{
    ["ServiceIntelligence"] = CreateDeterministicAgent(
        "service-intelligence-agent",
        "Service Intelligence",
        "Calculates service-group intelligence scores.",
        (context, _, _) => Task.FromResult(
            serviceIntelligenceAgent.CalculateScores(context.Snapshot))),

    ["BestPractice"] = CreateDeterministicAgent(
        "best-practice-agent",
        "Best Practice",
        "Evaluates best-practice rules against discovered resources.",
        async (context, _, ct) =>
            await bestPracticeEngine.EvaluateAsync(context.Snapshot, ct)),

    ["DriftDetection"] = CreateDeterministicAgent(
        "drift-detection-agent",
        "Drift Detection",
        "Detects drift across service resources and best-practice violations.",
        async (context, _, ct) =>
            await driftDetectionAgent.AnalyzeDriftAsync(context.Snapshot, null, ct)),

    // ... WellArchitected, FinOps, CloudNative, Architecture,
    //     Reliability, Sustainability, Governance agents follow
};

The BestPracticeEngine sits at the heart of the deterministic layer. It packages over 700 rules sourced from Azure Well-Architected Framework, PSRule for Azure, Azure Quick Review, and the Azure Architecture Centre. The AI agents then reason over those normalised results rather than making things up from scratch.

Why hybrid?

I deliberately kept four agents as pure rule-based evaluators. Not everything needs an LLM - drift scoring, cloud-native maturity checks, and best-practice rule evaluation are deterministic operations where you want consistent, reproducible results. The AI agents handle the subjective bits: explaining trade-offs, generating narratives, and producing remediation code.

Drift detection

One of the features I spent the most time on is continuous drift detection. NimbusIQ does not just compare two ARM templates - it evaluates the current state of your resources against the full rule set and produces a severity-weighted score.

The scoring works like this:

Severity	Weight
Critical	10
High	5
Medium	2
Low	1

Each analysis run produces a drift snapshot with a score, category breakdown, and trend direction (stable, degrading, or improving). The dashboard shows those trends over time, so you can see whether your estate is getting better or worse.
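A minimal sketch of how a severity-weighted score and trend could be computed from the weights above. The post does not show NimbusIQ's actual formula, so two things here are assumptions: the score is a plain sum of weights over open violations, and a higher score than the previous run counts as degrading. The sketch is in TypeScript for brevity, while the project itself is .NET.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";
type Trend = "stable" | "degrading" | "improving";

// Weights taken from the table above.
const SEVERITY_WEIGHT: Record<Severity, number> = {
  Critical: 10,
  High: 5,
  Medium: 2,
  Low: 1,
};

// Assumption: the score is the sum of weights over all open violations.
function driftScore(violations: Severity[]): number {
  return violations.reduce(
    (total, severity) => total + SEVERITY_WEIGHT[severity],
    0,
  );
}

// Assumption: a higher score than the previous run means the estate is degrading.
function trendDirection(previous: number, current: number): Trend {
  if (current > previous) return "degrading";
  if (current < previous) return "improving";
  return "stable";
}

const lastRun = driftScore(["Critical", "High", "Medium", "Medium"]); // 10 + 5 + 2 + 2
const thisRun = driftScore(["High", "Medium", "Low"]); // 5 + 2 + 1
const direction = trendDirection(lastRun, thisRun);
```

In this example the score drops from 19 to 8 once the critical finding is resolved, so the trend reads as improving. The weighting makes a single critical issue count for as much as ten low-severity ones.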

IaC generation

When a recommendation is approved, NimbusIQ calls Microsoft Foundry with structured context - the action type, target SKU, cost impact, and confidence - and generates Bicep or Terraform code. A rollback plan is generated alongside every change.

If Foundry is unavailable (because these things happen), it falls back to built-in code templates rather than failing silently. Every generated plan goes through the dual-control approval workflow before anything is applied.

warning

NimbusIQ generates IaC and presents it for review. It does not apply changes automatically. Every remediation requires explicit human approval through an idempotent state machine. This is a deliberate design choice - enterprise governance requires that a human is always in the loop for infrastructure changes.
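To illustrate what an idempotent, dual-control approval step can look like, here is a small sketch. The state names, the Plan shape, and the rule that two distinct approvers are needed are all assumptions for illustration, not NimbusIQ's actual implementation, which is written in .NET.

```typescript
type ApprovalState = "Pending" | "PartiallyApproved" | "Approved" | "Rejected";

interface Plan {
  state: ApprovalState;
  approvers: string[];
}

// Hypothetical transition function: returns a new plan, never mutates the input.
function approve(plan: Plan, approver: string): Plan {
  // Terminal states and repeated approvals from the same person are no-ops,
  // which is what makes retries safe (idempotent).
  if (plan.state === "Approved" || plan.state === "Rejected") return plan;
  if (plan.approvers.includes(approver)) return plan;

  const approvers = [...plan.approvers, approver];
  // Dual control (assumption): two distinct approvers are required.
  return {
    state: approvers.length >= 2 ? "Approved" : "PartiallyApproved",
    approvers,
  };
}

const pending: Plan = { state: "Pending", approvers: [] };
const first = approve(pending, "alice");
const retried = approve(first, "alice"); // the same request delivered twice
const second = approve(retried, "bob");
```

The retried call returns the plan unchanged, so a duplicated request cannot count as a second approval. Only a different approver moves the plan to Approved.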

Observability

The entire agent pipeline is instrumented with OpenTelemetry. Every agent step, every Foundry call, every MCP tool invocation gets a trace with correlation IDs. You get traces that look like this:

atlas-control-plane-api
    └── AnalysisRun: Execute (3200ms)
         ├── Atlas.AgentOrchestrator.MultiAgent: RunAnalysis (2800ms)
         │    ├── ServiceIntelligence: CalculateScores (45ms)
         │    ├── BestPractice: Evaluate (320ms)
         │    ├── DriftDetection: AnalyzeDrift (180ms)
         │    ├── WellArchitected: Assess (520ms)
         │    │    └── Atlas.AgentOrchestrator.Azure.AIFoundry: GenerateNarrative (340ms)
         │    ├── FinOps: Analyze (410ms)
         │    └── Governance: Negotiate (290ms)
         └── Atlas.AgentOrchestrator.DriftPersistence: PersistSnapshot (15ms)

That level of visibility matters. When an agent produces a questionable recommendation, you can trace exactly what data it saw, what rules fired, and what the LLM was asked.

Deployment

The whole thing deploys with Azure Developer CLI:

azd init
azd env set NIMBUSIQ_POSTGRES_ADMIN_PASSWORD "YourSecurePassword123!"
azd up

The infrastructure is defined in Bicep using Azure Verified Modules where available. It provisions:

Azure Container Apps (all three services)
Azure Container Registry
PostgreSQL Flexible Server
Key Vault
Microsoft Foundry with GPT-4 deployment
Log Analytics workspace
Managed identities with least-privilege RBAC
Optional VNet integration and Network Security Perimeter

tip

If you want to try it yourself, clone the repo and run azd up. You will need an Azure subscription, Docker Desktop, .NET 10 SDK, and Node.js 20+. The deployment takes about 15–20 minutes.

What I learned building this

A few things stood out:

Microsoft Agent Framework is genuinely useful for orchestration. The WorkflowBuilder pattern gives you a clean way to compose agents with explicit edges and validation. The InProcessExecution runner handles the lifecycle well. I would not want to build this kind of multi-agent pipeline without it.

Microsoft Foundry works well when you scope it tightly. The key is not giving the LLM free rein - it is providing structured context (rule results, resource metadata, cost data) and asking it to reason over that context. When you do that, the outputs are useful. When you do not, you get platitudes.

Grounding through Azure MCP makes a real difference. Without MCP, the LLM would be making recommendations based on its training data, which might be months out of date. With Azure MCP and Learn MCP, the agents can check current Azure capabilities and documentation before recommending changes.

Managed identity simplifies everything. No connection strings, no key rotation, no secrets in environment variables. Just DefaultAzureCredential, RBAC role assignments in Bicep, and everything wires up. This is how Azure services should be connected.

Wrapping up

NimbusIQ is my attempt at building the thing I wish existed when I am helping teams sort out their Azure estates. Not another dashboard with red/amber/green indicators, but something that actually reasons across the signals, explains what matters and why, and generates remediation plans that a human can review and approve.

The code is on GitHub: github.com/lukemurraynz/NimbusIQ

If you have questions or want to chat about the architecture, feel free to reach out.

Change-Driven Architecture on Azure with Drasi

Wed, 04 Mar 2026 21:47:48 GMT

Today, we are going to look at change-driven architecture on Azure using Drasi, and why it matters from a Well-Architected perspective.

If you have ever built a system that polls a database every few seconds, asking, "Has anything changed?" - this one is for you.

I recently built three proofs of concept on Azure that use Drasi for reactive data processing: an Emergency Alert System, the Santa Digital Workshop, and Automate Azure Bastion with Drasi Realtime RBAC Monitoring. One of the most interesting things I discovered was that change-driven architecture fundamentally shifts how you think about reliability, cost, and operational efficiency.

info

This article explores architectural patterns from a proof of concept. The patterns are production-applicable, but the implementation itself is a learning exercise.

The Polling Problem

Most event-driven systems I have worked on follow the same pattern: a background service queries the database on a timer, checks for changes, and then acts on them.

It works, but it has some well-known problems:

Wasted compute - 99% of polls return "nothing changed"
Latency - you only detect changes at the poll interval (1 second, 5 seconds, 30 seconds?)
Race conditions - if multiple instances poll simultaneously, you need distributed locks
Scaling challenges - more instances means more database load, not faster detection

From a Well-Architected Cost Optimization perspective, polling is paying for compute that mostly does nothing.

From a Reliability perspective, poll intervals create a detection floor - you simply cannot react faster than your timer.
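A quick back-of-the-envelope calculation shows both effects. The sketch below uses hypothetical numbers (a 5-second poll interval and 200 changes per day) and assumes the best case for polling, where every change lands in its own poll window.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def polling_profile(interval_seconds, changes_per_day):
    """Return (polls per day, fraction of empty polls, worst-case detection delay)."""
    polls = SECONDS_PER_DAY // interval_seconds
    # Best case for polling: every change lands in a different poll window.
    useful = min(changes_per_day, polls)
    empty_fraction = 1 - useful / polls
    return polls, empty_fraction, interval_seconds


polls, empty, worst_delay = polling_profile(interval_seconds=5, changes_per_day=200)
print(polls)                  # 17280
print(round(empty * 100, 1))  # 98.8
print(worst_delay)            # 5
```

Shortening the interval lowers the detection floor but raises the share of empty polls, so the two costs trade off against each other rather than going away.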

Enter Change Data Capture

Change Data Capture (CDC) flips this model. Instead of asking the database whether something has changed, the database tells you when it does.

PostgreSQL Flexible Server (just one of the sources Drasi supports) supports logical replication natively, which streams every INSERT, UPDATE, and DELETE as it happens.

Drasi sits on top of this CDC stream and runs continuous queries - written in Cypher - that evaluate incoming changes against patterns you define. When a pattern matches, Drasi fires a reaction (in my case, an HTTP callback to an API).

The architecture follows a simple flow: Source → Queries → Reactions.

# Drasi CDC Source Configuration
apiVersion: v1
kind: Source
name: postgres-alerts
spec:
  kind: PostgreSQL
  properties:
    host: ${POSTGRES_HOST}
    port: ${POSTGRES_PORT}
    user: ${POSTGRES_USER}
    password: ${POSTGRES_PASSWORD}
    database: ${POSTGRES_DATABASE}
    ssl: true
    tables:
      - emergency_alerts.alerts
      - emergency_alerts.areas
      - emergency_alerts.recipients
      - emergency_alerts.delivery_attempts
      - emergency_alerts.approval_records
      - emergency_alerts.correlation_events
      - emergency_alerts.area_signals
      - emergency_alerts.weather_observations
      - emergency_alerts.road_maintenance

This source watches nine tables. Every change to any of these tables flows into the continuous query engine.

Which Drasi Mode Should You Use?

One useful design decision early on is picking the right Drasi runtime for your workload. Drasi is available in three forms with the same core model (Sources → Continuous Queries → Reactions), but different operational trade-offs.

Drasi for Kubernetes (D4K8s) - best for production-scale, cloud-native platforms where you want Kubernetes-native scaling, observability, and operational controls.
Drasi Server - best for local development, Docker Compose, edge, and non-Kubernetes environments where you still want full Drasi capabilities in a single process/container.
drasi-lib - best when building a Rust app and you want in-process change detection with no separate Drasi infrastructure.

A practical path I have found useful: start with Server to iterate quickly, move to D4K8s as reliability/scale requirements grow, and choose drasi-lib when your change logic should live directly inside a Rust service.

Continuous Queries - The Logic Layer

Here is where it gets interesting.

A continuous query is not a one-off SQL statement. It is a standing query that is continuously evaluated against the stream of changes, from a single source or across several.

For example, the delivery trigger query fires when an alert transitions to Approved with a Pending delivery status:

apiVersion: v1
kind: ContinuousQuery
name: delivery-trigger
spec:
  mode: query
  sources:
    subscriptions:
      - id: postgres-alerts
        nodes:
          - sourceLabel: alerts
  query: |
    MATCH (a:alerts)
    WHERE a.status = 'Approved' AND a.delivery_status = 'Pending'
    RETURN
      a.alert_id AS alertId,
      a.headline AS headline,
      a.severity AS severity,
      a.sent_at AS approvedAt,
      drasi.changeDateTime(a) AS triggeredAt

No polling. No timers.

The moment a row changes in the alerts table and matches these conditions, Drasi fires the reaction.
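To show what "standing query" means in practice, here is a toy model in Python. It is not how Drasi's engine works internally; `StandingQuery` and its methods are hypothetical names. It keeps a result set and runs the reaction only when a row enters that set, which is the behaviour the delivery trigger relies on.

```python
def matches(row):
    """Same predicate as the WHERE clause in the delivery-trigger query."""
    return row["status"] == "Approved" and row["delivery_status"] == "Pending"


class StandingQuery:
    def __init__(self, predicate, reaction):
        self.predicate = predicate
        self.reaction = reaction
        self.result_set = {}      # alert_id -> row currently matching

    def on_change(self, row):
        """Handle one change event and react only when a row enters the result set."""
        key = row["alert_id"]
        was_in = key in self.result_set
        is_in = self.predicate(row)
        if is_in:
            self.result_set[key] = row
            if not was_in:
                self.reaction(row)
        elif was_in:
            del self.result_set[key]


fired = []
query = StandingQuery(matches, reaction=lambda row: fired.append(row["alert_id"]))

query.on_change({"alert_id": "A1", "status": "Draft", "delivery_status": "Pending"})
query.on_change({"alert_id": "A1", "status": "Approved", "delivery_status": "Pending"})
query.on_change({"alert_id": "A1", "status": "Approved", "delivery_status": "Delivered"})
print(fired)             # ['A1']
print(query.result_set)  # {}
```

The reaction fires once, on the transition into Approved and Pending, and the row drops out of the result set as soon as delivery completes.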

The Well-Architected Impact

Reliability

Change-driven architecture eliminates the detection gap.

In a polling model, if your timer runs every 5 seconds, a critical SLA breach might sit undetected for up to 5 seconds. With CDC, detection is near-instantaneous.

In my proof of concept, I run 15+ continuous queries simultaneously - including SLA-breach detection every 60 seconds, approval-timeout detection every 5 minutes, cross-region correlation, and severity-escalation tracking.

Each query runs independently, and if one fails, the others continue operating. This aligns with the Well-Architected failure mode analysis guidance - decompose your detection logic so a failure in one area does not cascade.
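The isolation idea can be sketched in a few lines. This is an illustrative model only: `run_queries` is a hypothetical helper, and the lambdas stand in for real continuous queries. Each query is evaluated on its own, and an exception in one is recorded without stopping the others.

```python
def run_queries(queries, change):
    """Evaluate every query against one change; a failure in one does not stop the rest."""
    results, failures = {}, {}
    for name, evaluate in queries.items():
        try:
            results[name] = evaluate(change)
        except Exception as exc:
            failures[name] = repr(exc)
    return results, failures


queries = {
    "delivery-trigger": lambda c: c["status"] == "Approved",
    # Raises KeyError for events that carry no severity fields.
    "severity-escalation": lambda c: c["severity_rank"] > c["previous_rank"],
}

results, failures = run_queries(queries, {"status": "Approved"})
print(results)           # {'delivery-trigger': True}
print(sorted(failures))  # ['severity-escalation']
```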

Cost Optimization

No idle compute cycles polling an unchanged database.

The compute only activates when data actually changes. For workloads with bursty change patterns (like an emergency alert system), this can significantly reduce steady-state cost compared to a fleet of polling workers.

Operational Excellence

Each continuous query is a declarative YAML file, version-controlled alongside the infrastructure.

Adding a new detection pattern means writing a new query file and deploying it - no code changes to the application, no new background services, no additional infrastructure.

infrastructure/drasi/queries/
├── sla-monitoring/
│   ├── delivery-sla-breach.yaml
│   ├── approval-timeout.yaml
│   └── expiry-warning.yaml
├── risk-detection/
│   ├── geographic-correlation.yaml
│   ├── regional-hotspot.yaml
│   ├── severity-escalation.yaml
│   └── duplicate-suppression.yaml
└── recommendations/
    ├── delivery-trigger.yaml
    ├── all-clear-suggestion.yaml
    └── area-expansion-suggestion.yaml

When to Use This Pattern

Change-driven architecture is a good fit when:

Low-latency detection matters - SLA monitoring, fraud detection, security alerts
Multiple detection rules run in parallel - you need 10+ independent queries watching the same data
The write-to-read ratio is low - changes happen infrequently relative to how often you would poll
You already use PostgreSQL or another source that supports CDC - with PostgreSQL, CDC comes free with logical replication

It is less suited for:

High-frequency OLTP - if every row changes every second, you are essentially processing the full table continuously
Simple CRUD - if you just need "notify me when a row is inserted," a database trigger or Event Grid integration might be simpler
Teams unfamiliar with Cypher - the learning curve for graph-style queries is real

Getting Started

If you want to try this pattern, you need:

Azure Kubernetes Service (AKS) - the Drasi for Kubernetes mode used here needs a cluster (a local KIND cluster in a devcontainer works for testing)
PostgreSQL Flexible Server with logical replication enabled
The Drasi CLI installed on your machine, which you use to install Drasi into the cluster

The Drasi documentation covers installation well. The key Azure-specific step is to enable logical replication on your PostgreSQL Flexible Server - set wal_level = logical and configure max_replication_slots to match the number of sources you plan to run.

info

If you are using Bicep to deploy PostgreSQL Flexible Server and need spatial queries, add postgis to the azure.extensions server parameter. The CDC source does not require PostGIS, but if your queries reference spatial data, the extension must be allow-listed there before your migrations can create it.

Wrapping Up

Change-driven architecture addresses several Well-Architected concerns simultaneously:

It reduces wasted compute (Cost Optimization)
It eliminates detection gaps (Reliability)
It keeps detection logic declarative and version-controlled (Operational Excellence)

Drasi makes this pattern accessible on Azure without writing custom CDC consumers or managing Kafka/Debezium infrastructure yourself.

The shift from "ask the database" to "let the database tell you" is subtle, but the architectural implications are significant.

You can find the full proof of concept on GitHub: lukemurraynz/EmergencyAlertSystem.

Container Security Hardening for Azure Container Apps

Wed, 04 Mar 2026 07:33:14 GMT

Every time I see a production container running as root, I wince.

It is one of those things that is easy to fix but gets overlooked because the app "works fine" without it. But container security is not just about non-root users. It is about the full stack: image build, runtime configuration, network policy, input validation, and rate limiting.

In this post, I will walk through a checklist I used to harden a .NET project running on Azure Container Apps.

1. Non-root containers

Running as root inside a container means that if an attacker exploits a vulnerability in your application, they inherit root privileges within the container. In some scenarios, that can be leveraged for container escape.

The fix is straightforward. In your Dockerfile:

FROM mcr.microsoft.com/dotnet/aspnet:10.0 AS runtime
WORKDIR /app

COPY --from=build /app/publish .

ENV ASPNETCORE_HTTP_PORTS=8080
EXPOSE 8080

# Switch to non-root user
USER $APP_UID

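# Assumes curl is available in the image; the stock aspnet image may not
# include it, so install it in this stage or swap in another probe command.
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD curl -f http://localhost:8080/health/ready || exit 1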

ENTRYPOINT ["dotnet", "App.ControlPlane.Api.dll"]

Key points:

For official Microsoft .NET Linux images (.NET 8+), you do not need to create your own user. The images already include a non-root app user.
Use USER app or USER $APP_UID ($APP_UID is UID 1654). I prefer USER $APP_UID because it also works cleanly with Kubernetes runAsNonRoot checks.
The image is non-root capable, but it is not automatically non-root unless you set USER explicitly.
Place USER after COPY so the app files are copied first and then executed as non-root.
Use port 8080 (not 80/443). Non-privileged ports avoid root requirements, and moving back to port 80 means you cannot run as non-root.

warning

If you are using a base image that does not provide a non-root user (or you have custom filesystem write paths), create/chown a dedicated runtime user for those paths before switching away from root.

2. Multi-stage builds

Multi-stage Docker builds keep build tools (SDK, compilers, npm dev dependencies) out of the runtime image. This reduces the attack surface and image size.

# Build stage — SDK and build toolchain
FROM mcr.microsoft.com/dotnet/sdk:10.0 AS build
WORKDIR /src
COPY . .
RUN dotnet restore src/Api/App.ControlPlane.Api.csproj
RUN dotnet publish src/Api/App.ControlPlane.Api.csproj -c Release -o /app/publish /p:UseAppHost=false

# Runtime stage — minimal runtime only
FROM mcr.microsoft.com/dotnet/aspnet:10.0 AS runtime

For frontend workloads, the pattern is similar:

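# Build stage with Node.js
FROM node:20-alpine AS build
# ... npm ci, vite build

# Runtime stage with production dependencies only
FROM node:20-alpine AS runtime
WORKDIR /app
# npm ci needs the manifest and lockfile in place before it runs
COPY package.json package-lock.json ./
RUN npm ci --omit=dev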

tip

Use --omit=dev (the replacement for the deprecated --only=production flag in current npm versions) in runtime stages so TypeScript, ESLint, Vite, and other dev tooling are not shipped to production.

3. Pin base image versions

Never use latest in production images.

❌ Bad — unpredictable

FROM mcr.microsoft.com/dotnet/aspnet:latest

✅ Good — deterministic and reproducible

FROM mcr.microsoft.com/dotnet/aspnet:10.0

Pinning to major.minor gives you a solid balance between stability and patch cadence. If you need strict reproducibility, pin to an image digest.
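A check like this is easy to automate in CI or a PR review. The sketch below is a hypothetical helper (`check_base_image` is not part of any tool) that classifies a single FROM line. It does not resolve build-stage aliases, so a line such as FROM build would be reported as unpinned.

```python
def check_base_image(from_line):
    """Classify a Dockerfile FROM line as 'digest', 'tag', or 'unpinned'."""
    parts = from_line.split()
    if len(parts) < 2 or parts[0].upper() != "FROM":
        raise ValueError(f"not a FROM instruction: {from_line!r}")
    # Skip optional flags such as --platform=linux/amd64
    image = next(p for p in parts[1:] if not p.startswith("--"))
    if "@sha256:" in image:
        return "digest"
    # Only look for a tag in the final path segment, so a registry port is ignored
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return "unpinned"            # no tag means Docker pulls :latest
    tag = last_segment.split(":", 1)[1]
    return "unpinned" if tag == "latest" else "tag"


print(check_base_image("FROM mcr.microsoft.com/dotnet/aspnet:latest"))           # unpinned
print(check_base_image("FROM mcr.microsoft.com/dotnet/aspnet:10.0 AS runtime"))  # tag
```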

4. Health probes that bypass auth

Health endpoints should bypass authentication middleware. If readiness requires a JWT, the platform cannot accurately determine service health.

app.MapGet("/health/ready", () => Results.Ok(new
{
    Status = "Healthy",
    Timestamp = DateTime.UtcNow,
    Service = "app-control-plane-api",
    Version = "1.0.0"
}));

app.MapGet("/health/live", () => Results.Ok(new
{
    Status = "Alive",
    Timestamp = DateTime.UtcNow
}));

In practice, map these endpoints before strict authorization rules, or explicitly bypass auth for /health/*.

note

Configure both liveness and readiness. Liveness answers "is the process alive?" Readiness answers "can it safely receive traffic?"

5. Rate limiting

The API uses ASP.NET Core rate limiting middleware with a fixed-window policy:

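using System.Threading.RateLimiting;

builder.Services.AddRateLimiter(options =>
{
    // Partition the global limiter by client IP so each caller gets its own window.
    // The explicit type arguments are needed because they cannot be inferred from the lambda.
    options.GlobalLimiter = PartitionedRateLimiter.Create<HttpContext, string>(httpContext =>
        RateLimitPartition.GetFixedWindowLimiter(
            partitionKey: httpContext.Connection.RemoteIpAddress?.ToString() ?? "anonymous",
            factory: _ => new FixedWindowRateLimiterOptions
            {
                PermitLimit = 100,
                Window = TimeSpan.FromMinutes(1),
                QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
                QueueLimit = 0
            }));

    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;
});

// Registering the services is not enough; the middleware must be in the pipeline too.
app.UseRateLimiter();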

This gives a clear policy: 100 requests per minute per IP, fail fast with 429, and no queuing.

warning

In multi-replica environments (including Azure Container Apps), in-memory rate limiting is per instance. For true global limits across replicas, use a distributed store such as Azure Cache for Redis.
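The per-instance effect is easy to demonstrate with a toy fixed-window counter. This Python sketch is not the ASP.NET Core implementation; `FixedWindowLimiter` here is a hypothetical stand-in. It shows two replicas behind a round-robin load balancer letting through twice the intended limit.

```python
class FixedWindowLimiter:
    """In-memory fixed-window counter, one instance per replica."""

    def __init__(self, permit_limit, window_seconds):
        self.permit_limit = permit_limit
        self.window_seconds = window_seconds
        self.counts = {}     # (partition key, window index) -> requests seen

    def allow(self, key, now):
        window = int(now // self.window_seconds)
        bucket = (key, window)
        if self.counts.get(bucket, 0) >= self.permit_limit:
            return False     # the caller would get a 429
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return True


# Two replicas, each with its own counter, behind a round-robin load balancer.
replicas = [FixedWindowLimiter(100, 60), FixedWindowLimiter(100, 60)]
allowed = sum(
    replicas[i % 2].allow("203.0.113.7", now=0.0)
    for i in range(300)
)
print(allowed)   # 200
```

Moving the counter into a shared store means every replica increments the same bucket, so the limit holds no matter how many instances are running.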

6. Input validation at the API boundary

Input validation should happen at the edge of the API, before expensive processing.

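// Validate input length to prevent abuse
const int MaxMessageLength = 4000;
if (string.IsNullOrWhiteSpace(userMessage) || userMessage.Length > MaxMessageLength)
{
    // Return 400 Bad Request with a specific error (BadRequest assumes a controller action)
    return BadRequest($"Message is required and must be at most {MaxMessageLength} characters.");
}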

This is a small change that helps with:

Prompt injection attempts using oversized payloads
Resource exhaustion from unbounded request bodies
Token/cost control for downstream AI calls

7. Authentication with Entra ID JWT bearer

If your system exposes an API, use Microsoft Entra ID bearer tokens for authentication:

builder.Services.AddAuthentication(JwtBearerDefaults.AuthenticationScheme)
    .AddMicrosoftIdentityWebApi(builder.Configuration.GetSection("AzureAd"));

Authorization policies then control operation-level access:

[Authorize(Policy = "AnalysisRead")]
public async Task AgentChat([FromBody] AgentChatRequest request, ...)

Mutating endpoints are authenticated. Health probes remain the only unauthenticated paths.

8. Restrictive CORS

Configure Cross-Origin Resource Sharing (CORS) for known frontend origins only:

builder.Services.AddCors(options =>
{
    options.AddPolicy("AllowFrontend", policy =>
    {
        policy.WithOrigins(allowedOrigins)
              .AllowAnyHeader()
              .AllowAnyMethod()
              .AllowCredentials();
    });
});

tip

If allowed origins are sourced from config, remember most apps load this at startup. Update config and restart the deployment to apply changes.

9. HTTPS termination at ingress (not inside container)

For Azure Container Apps, TLS is terminated at ingress. Your container should listen on HTTP internally:

ENV ASPNETCORE_HTTP_PORTS=8080
EXPOSE 8080

If you force HTTPS in-container (https://+:443) without mounting certificates, startup failures are expected.

Practical hardening checklist

Use this in PR reviews:

Check	Status
Non-root user in Dockerfile	✅
Multi-stage build (no SDK in runtime)	✅
Pinned base image version (not `latest`)	✅
Health probes bypass auth	✅
Liveness and readiness probes configured	✅
Rate limiting enabled	✅
Input validation at API boundary	✅
Entra ID JWT authentication	✅
CORS restricted to known origins	✅
HTTP (not HTTPS) inside container	✅
`imagePullPolicy: Always` in manifests	✅
No secrets in Dockerfile or image layers	✅
`HEALTHCHECK` instruction in Dockerfile	✅

Final thoughts

Container security is not a single switch.

It is a set of patterns that compound: non-root containers, deterministic builds, probe hygiene, rate limiting, input validation, and clear auth boundaries. Applied together, they significantly reduce risk for workloads running on Azure Container Apps.

Also look at Azure Container Registry continuous patching and the Containers Secure Supply Chain framework.

If you want to map this to broader platform guidance, review the Security pillar of the Azure Well-Architected Framework.
