<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>luke.geek.nz Blog</title>
        <link>https://luke.geek.nz/</link>
        <description>luke.geek.nz Blog</description>
        <lastBuildDate>Fri, 12 Jun 2026 02:18:34 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright © 2026 luke.geek.nz.</copyright>
        <item>
            <title><![CDATA[Running LiteLLM on AKS with azd and Bicep]]></title>
            <link>https://luke.geek.nz/azure/litellm-aks/</link>
            <guid>https://luke.geek.nz/azure/litellm-aks/</guid>
            <pubDate>Fri, 12 Jun 2026 02:18:34 GMT</pubDate>
            <description><![CDATA[Deploy LiteLLM on AKS effortlessly using azd and Bicep for a robust, self-hosted LLM gateway with caching and spend tracking.]]></description>
            <content:encoded><![CDATA[<p>I've been spending time with <a href="https://litellm.ai/" target="_blank" rel="noopener noreferrer" class="">LiteLLM</a> and wanted to see how far I could take it as a self-hosted LLM gateway on Azure Kubernetes Service. The goal was simple: build a deployment that you can spin up with a single <code>azd up</code> command, with all the production bits - private networking, Redis caching, PostgreSQL for spend tracking, and a proper ingress with automatic TLS.</p>
<p>Turns out it works pretty well. Here's what I built and what I found.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-litellm-does">What LiteLLM does<a href="https://luke.geek.nz/azure/litellm-aks/#what-litellm-does" class="hash-link" aria-label="Direct link to What LiteLLM does" title="Direct link to What LiteLLM does" translate="no">​</a></h2>
<p>LiteLLM is an open-source proxy that sits between your applications and the LLM providers they call. You hit a single OpenAI-compatible endpoint, and LiteLLM routes to whatever backend you've configured - Azure OpenAI, Anthropic, OpenAI, or any of the 100+ providers it supports.</p>
<p>The useful bit for me was having a single place to manage authentication, rate limiting, caching, and spend tracking across all the models our team uses. Rather than distributing API keys for every provider, the team gets a virtual key from LiteLLM and I control which models they can access and what their budget is.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-deployment">The deployment<a href="https://luke.geek.nz/azure/litellm-aks/#the-deployment" class="hash-link" aria-label="Direct link to The deployment" title="Direct link to The deployment" translate="no">​</a></h2>
<p>I wanted to deploy this on AKS with everything managed through Infrastructure as Code. I used the Azure Developer CLI (<code>azd</code>) with Bicep for the infrastructure and PowerShell hooks for the deployment lifecycle.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>The full source is at <a href="https://github.com/lukemurraynz/LiteLLM.AKSGateway" target="_blank" rel="noopener noreferrer" class="">lukemurraynz/LiteLLM.AKSGateway</a> if you want to skip the walkthrough and just deploy it.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="infrastructure">Infrastructure<a href="https://luke.geek.nz/azure/litellm-aks/#infrastructure" class="hash-link" aria-label="Direct link to Infrastructure" title="Direct link to Infrastructure" translate="no">​</a></h3>
<p>The Bicep template provisions a greenfield environment in a single resource group:</p>
<table><thead><tr><th>Resource</th><th>Purpose</th></tr></thead><tbody><tr><td>AKS cluster</td><td>Standard tier, system + user node pools, Azure CNI Overlay</td></tr><tr><td>Azure Container Registry</td><td>Stores the LiteLLM Docker image</td></tr><tr><td>PostgreSQL Flexible Server</td><td>Spend tracking, virtual key storage, user management</td></tr><tr><td>Azure Managed Redis</td><td>Distributed caching and rate limiting across replicas</td></tr><tr><td>Key Vault</td><td>Managed by the proxy for credential storage</td></tr><tr><td>Azure OpenAI</td><td>GPT-4o model deployment</td></tr><tr><td>VNet with 3 subnets</td><td>AKS nodes, private endpoints, and a dedicated ingress subnet</td></tr><tr><td>NAT Gateway</td><td>Outbound connectivity for the AKS cluster</td></tr><tr><td>Private DNS zones</td><td>Private endpoint resolution for all PaaS services</td></tr></tbody></table>
<p>Every data service (PostgreSQL, Redis, ACR, Key Vault) is configured with private endpoints. No public IPs on the data plane. The AKS cluster uses Azure CNI Overlay with Azure Network Policy for pod-level segmentation.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="network-design">Network design<a href="https://luke.geek.nz/azure/litellm-aks/#network-design" class="hash-link" aria-label="Direct link to Network design" title="Direct link to Network design" translate="no">​</a></h3>
<p>The VNet is split into three subnets. The CIDR blocks are defined in <code>infra/core/networking.bicep</code>:</p>
<table><thead><tr><th>Subnet</th><th>CIDR</th><th>Purpose</th></tr></thead><tbody><tr><td><code>snet-aks</code></td><td><code>10.30.0.0/23</code></td><td>AKS node pool (system + user)</td></tr><tr><td><code>snet-pe</code></td><td><code>10.30.2.0/24</code></td><td>Private endpoints for PostgreSQL, Redis, ACR, Key Vault</td></tr><tr><td><code>snet-ingress</code></td><td><code>10.30.3.0/24</code></td><td>Reserved for the NGINX ingress controller</td></tr></tbody></table>
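<p>The subnet plan above can be sanity-checked with Python's standard <code>ipaddress</code> module. This is a minimal sketch, not part of the repository: the helper names are illustrative, and the only inputs are the three CIDR blocks from the table plus the five addresses Azure reserves in every subnet.</p>

```python
import ipaddress

# CIDR blocks copied from the subnet table above.
SUBNETS = {
    "snet-aks": "10.30.0.0/23",
    "snet-pe": "10.30.2.0/24",
    "snet-ingress": "10.30.3.0/24",
}

# Azure reserves five addresses in every subnet.
AZURE_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Return how many addresses Azure leaves assignable in a subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_PER_SUBNET


def overlapping_pairs(subnets: dict) -> list:
    """Return every pair of subnet names whose CIDR ranges overlap."""
    names = sorted(subnets)
    pairs = []
    for index, first in enumerate(names):
        for second in names[index + 1:]:
            a = ipaddress.ip_network(subnets[first])
            b = ipaddress.ip_network(subnets[second])
            if a.overlaps(b):
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    for name, cidr in SUBNETS.items():
        print(f"{name}: {cidr} -> {usable_addresses(cidr)} usable addresses")
    print("overlaps:", overlapping_pairs(SUBNETS))
```

<p>Because the cluster uses Azure CNI Overlay, pod IPs come from a separate overlay range, so <code>snet-aks</code> only has to fit the nodes themselves.</p>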
<p>Outbound connectivity uses a NAT Gateway with a dedicated public IP (for example, <code>pip-nat-vnet-&lt;suffix&gt;</code>), attached to the AKS subnet. The AKS cluster uses <code>userAssignedNATGateway</code> as its outbound type, which avoids the SNAT port exhaustion that can happen with the <code>loadBalancer</code> outbound type at scale.</p>
<p>Private endpoint DNS resolution uses four Azure Private DNS zones, each linked to the VNet:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">privatelink.postgres.database.azure.com</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">privatelink.redis.azure.net</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">privatelink.azurecr.io</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">privatelink.vaultcore.azure.net</span><br></div></code></pre></div></div>
<p>These zones are created in the Bicep template and linked automatically. Pods resolve private endpoint IPs through the CoreDNS split-DNS patch (more on that later).</p>
<p>The public ingress path is separate: the NGINX ingress controller gets a public LoadBalancer IP, and cert-manager handles the Let's Encrypt TLS certificate through the HTTP-01 challenge. The DNS A record is managed through the <code>sync-public-dns.ps1</code> hook, or created manually if you prefer.</p>
<p><img decoding="async" loading="lazy" alt="LiteLLM AKS architecture showing VNet, subnets, AKS pods, private endpoints, and PaaS services" src="https://luke.geek.nz/assets/images/litellm-aks-architecture-3c9130565a4bbae5bfa0038893ec4ea6.png" width="2775" height="2310" class="img_ev3q"></p>
<p><img decoding="async" loading="lazy" alt="Request flow diagram showing client to LiteLLM proxy through ingress, auth, routing, cache, and provider" src="https://luke.geek.nz/assets/images/litellm-request-flow-b1e8b8a52a3ed31a988b6f0d9a834c88.png" width="2133" height="1947" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-azd-lifecycle">The <code>azd</code> lifecycle<a href="https://luke.geek.nz/azure/litellm-aks/#the-azd-lifecycle" class="hash-link" aria-label="Direct link to the-azd-lifecycle" title="Direct link to the-azd-lifecycle" translate="no">​</a></h3>
<p><code>azd up</code> runs through several stages, each with a PowerShell hook:</p>
<ol>
<li class="">
<p><strong>Preprovision</strong> - generates random secrets for the PostgreSQL admin password, LiteLLM master key, and salt key. Also installs <code>kustomize</code> via Winget if it is not present.</p>
</li>
<li class="">
<p><strong>Provision</strong> - deploys the Bicep template to Azure. This takes about 10-15 minutes and creates the full environment.</p>
</li>
<li class="">
<p><strong>Postprovision</strong> - gets the AKS credentials, patches CoreDNS with a split-DNS configuration (Google DNS for public resolution, Azure DNS for private endpoint resolution), deploys the Kubernetes manifests via kustomize, and sets up the ingress controller.</p>
</li>
<li class="">
<p><strong>Postdeploy</strong> - refreshes the Kubernetes secret with the latest connection strings and keys, restarts the deployment, and syncs the public DNS A record.</p>
</li>
</ol>
<p>One thing I hit early on - the Bicep output <code>aksClusterName</code> was captured by azd as an environment variable, but <code>azure.yaml</code> was referencing <code>${AZURE_AKS_CLUSTER_NAME}</code> which did not exist. A quick fix to <code>${aksClusterName}</code> and the deployment ran clean.</p>
<p><img decoding="async" loading="lazy" alt="Terminal demo showing azd deploy completing, LiteLLM pods reaching Ready, and the HPA status" src="https://luke.geek.nz/assets/images/litellm-deploy-rollout-cf0c450d9ab50e28d2eb2737404e18ac.gif" width="1040" height="484" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-litellm-config">The LiteLLM config<a href="https://luke.geek.nz/azure/litellm-aks/#the-litellm-config" class="hash-link" aria-label="Direct link to The LiteLLM config" title="Direct link to The LiteLLM config" translate="no">​</a></h2>
<p>The proxy is configured through a ConfigMap that kustomize applies to the cluster. The config file defines the models, authentication, caching, and routing.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token key atrule">model_list</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> </span><span class="token key atrule">model_name</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> azure</span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain">gpt</span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain">4o</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">litellm_params</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token key atrule">model</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> azure/gpt</span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain">4o</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token key atrule">api_base</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> 
os.environ/AZURE_OPENAI_ENDPOINT</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token key atrule">api_key</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> os.environ/AZURE_OPENAI_KEY</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token key atrule">api_version</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"2024-10-21"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">general_settings</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">master_key</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> os.environ/LITELLM_MASTER_KEY</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">database_url</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> os.environ/DATABASE_URL</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">database_connection_pool_limit</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> </span><span class="token number">10</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token 
plain">  </span><span class="token key atrule">proxy_batch_write_at</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> </span><span class="token number">60</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">allow_requests_on_db_unavailable</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> </span><span class="token boolean important">true</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">litellm_settings</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">cache</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> </span><span class="token boolean important">true</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">cache_params</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">type</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> redis</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">host</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span 
class="token plain"> os.environ/REDIS_HOST</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">port</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> os.environ/REDIS_PORT</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">password</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> os.environ/REDIS_PASSWORD</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">ssl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> </span><span class="token boolean important">true</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="adding-opencode-zen-and-go-models">Adding OpenCode Zen and Go models<a href="https://luke.geek.nz/azure/litellm-aks/#adding-opencode-zen-and-go-models" class="hash-link" aria-label="Direct link to Adding OpenCode Zen and Go models" title="Direct link to Adding OpenCode Zen and Go models" translate="no">​</a></h2>
<p>Once the Azure OpenAI model was working, I also wanted to see whether LiteLLM could front the OpenCode models I use day to day. The answer is yes. OpenCode Zen exposes an OpenAI-compatible endpoint at <code>https://opencode.ai/zen/v1</code>, and OpenCode Go exposes one at <code>https://opencode.ai/zen/go/v1</code> for the OpenAI-compatible Go models.</p>
<p>That let me put Azure OpenAI, OpenCode Zen, and OpenCode Go behind the same LiteLLM endpoint:</p>
<table><thead><tr><th>Model group</th><th>Example models</th><th>Upstream base URL</th></tr></thead><tbody><tr><td>Azure OpenAI</td><td><code>azure-gpt-4o</code></td><td><code>os.environ/AZURE_OPENAI_ENDPOINT</code></td></tr><tr><td>OpenCode Zen</td><td><code>big-pickle</code>, <code>deepseek-v4-flash-free</code>, <code>mimo-v2.5-free</code></td><td><code>https://opencode.ai/zen/v1</code></td></tr><tr><td>OpenCode Go (OpenAI-compatible)</td><td><code>glm-5.1</code>, <code>kimi-k2.6</code>, <code>deepseek-v4-pro</code></td><td><code>https://opencode.ai/zen/go/v1</code></td></tr><tr><td>OpenCode Go (Anthropic-compatible)</td><td><code>minimax-m2.5</code>, <code>qwen3.7-plus</code></td><td><code>https://opencode.ai/zen/go</code></td></tr></tbody></table>
<p>The last row is the small gotcha. For the Anthropic-compatible Go models, LiteLLM's Anthropic provider appends <code>/v1/messages</code> automatically. If you set the base URL to <code>https://opencode.ai/zen/go/v1</code>, LiteLLM sends the request to <code>/v1/v1/messages</code>, and OpenCode quite correctly returns a <code>404</code>. Setting the base URL to <code>https://opencode.ai/zen/go</code> fixes it.</p>
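<p>The doubled path is easy to reproduce in isolation. The function below is a hypothetical stand-in for the suffix handling described above, not LiteLLM's actual code; it only shows why the base URL must stop before <code>/v1</code> for the Anthropic-compatible models.</p>

```python
# Base URLs taken from the table above.
ZEN_GO_OPENAI_BASE = "https://opencode.ai/zen/go/v1"
ZEN_GO_ANTHROPIC_BASE = "https://opencode.ai/zen/go"


def anthropic_messages_url(api_base: str) -> str:
    """Illustrative only: append the Anthropic messages path to a base URL."""
    return api_base.rstrip("/") + "/v1/messages"


if __name__ == "__main__":
    # Wrong base: the /v1 segment ends up in the path twice.
    print(anthropic_messages_url(ZEN_GO_OPENAI_BASE))
    # Right base: the suffix lands exactly once.
    print(anthropic_messages_url(ZEN_GO_ANTHROPIC_BASE))
```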
<p>The single proxy now exposes 19 models. A few of the Go subscription models currently return a <code>GoUsageLimitError</code> because my Go monthly limit is exhausted, but that still proved the routing path was correct. The free Zen models, including <code>big-pickle</code>, are available through the same proxy key.</p>
<p>The UI is useful once the proxy is running. I recorded this walkthrough after logging in (the login step is deliberately not captured so the master key never appears in the recording). It shows the live dashboard moving through the current configuration: Models + Endpoints, MCP Servers, and Virtual Keys.</p>
<p><img decoding="async" loading="lazy" alt="LiteLLM UI walkthrough showing the configured models, MCP servers, and virtual keys" src="https://luke.geek.nz/assets/images/litellm-ui-configuration-a62065667591022e8e2f5fbabf3c218c.gif" width="1280" height="720" class="img_ev3q"></p>
<p>I also tested the virtual key path from the API. The flow below creates a scoped key, uses it for a chat completion, and records the follow-up issue I saw when trying to delete it.</p>
<p><img decoding="async" loading="lazy" alt="Terminal demo showing LiteLLM virtual key creation, model access, chat completion, and a follow-up note about delete behaviour" src="https://luke.geek.nz/assets/images/litellm-virtual-key-f046d80ff654bd0b1a2ae1c7f1a690e6.gif" width="1040" height="484" class="img_ev3q"></p>
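<p>For reference, creating a scoped key comes down to one authenticated POST to the proxy's <code>/key/generate</code> endpoint. The sketch below uses only the Python standard library; the host name, key values, budget, and helper names are placeholders chosen for illustration, not values from the live deployment.</p>

```python
import json
import urllib.request


def build_key_request(proxy_url, master_key, models, max_budget, duration):
    """Build (but do not send) a POST to the proxy's /key/generate endpoint."""
    body = {"models": models, "max_budget": max_budget, "duration": duration}
    return urllib.request.Request(
        url=proxy_url.rstrip("/") + "/key/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_key(request):
    """Send the request and return the generated virtual key string."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["key"]


if __name__ == "__main__":
    # Placeholder host and master key: swap in real values before sending.
    request = build_key_request(
        "https://litellm.example.com",
        "sk-placeholder",
        models=["azure-gpt-4o", "big-pickle"],
        max_budget=5.0,
        duration="30d",
    )
    print(request.get_method(), request.full_url)
```

<p>Building the request separately from sending it keeps the master key handling in one place and makes the payload easy to inspect before anything touches the proxy.</p>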
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="production-settings-from-the-litellm-docs">Production settings from the LiteLLM docs<a href="https://luke.geek.nz/azure/litellm-aks/#production-settings-from-the-litellm-docs" class="hash-link" aria-label="Direct link to Production settings from the LiteLLM docs" title="Direct link to Production settings from the LiteLLM docs" translate="no">​</a></h2>
<p>The <a href="https://docs.litellm.ai/docs/proxy/prod" target="_blank" rel="noopener noreferrer" class="">LiteLLM production best practices page</a> had a few things worth picking up.</p>
<p>The Redis config was the one that surprised me most - using <code>host</code>/<code>port</code>/<code>password</code> separately rather than a <code>redis_url</code> string is measurably faster (about 80 RPS more throughput according to their benchmarks). I had originally used the URL format and switched it after reading that.</p>
<p>The database settings are less surprising but worth noting. <code>proxy_batch_write_at: 60</code> batches spend updates every 60 seconds rather than writing on every request, which makes a meaningful difference to PostgreSQL write load. I paired that with <code>database_connection_pool_limit: 10</code> - with 3 replicas and 4 workers each that's 120 total connections, sitting comfortably inside PostgreSQL's defaults.</p>
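<p>The connection arithmetic above can be checked in a few lines. This is a minimal sketch: the function name is illustrative, and it assumes every pod runs the same four workers. The second call uses the HPA ceiling of 10 replicas described later in the post, which is the number to compare against the server's <code>max_connections</code> setting.</p>

```python
def total_db_connections(pool_limit: int, workers_per_pod: int, replicas: int) -> int:
    """Worst-case PostgreSQL connections if every worker fills its pool."""
    return pool_limit * workers_per_pod * replicas


if __name__ == "__main__":
    # Figures from the paragraph above: pool limit 10, 4 workers, 3 replicas.
    print("3 replicas:", total_db_connections(10, 4, 3))
    # Same settings at the HPA maximum of 10 replicas.
    print("10 replicas:", total_db_connections(10, 4, 10))
```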
<p>The other two I'd put in any production gateway: <code>allow_requests_on_db_unavailable: true</code> keeps the proxy serving requests even if PostgreSQL is momentarily unreachable (useful when you're in a private VNet and the database briefly hiccups during a scale event), and <code>LITELLM_MODE: "PRODUCTION"</code> disables the <code>load_dotenv()</code> call that would otherwise look for a <code>.env</code> file inside the container at startup.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="testing-multi-replica-behaviour">Testing multi-replica behaviour<a href="https://luke.geek.nz/azure/litellm-aks/#testing-multi-replica-behaviour" class="hash-link" aria-label="Direct link to Testing multi-replica behaviour" title="Direct link to Testing multi-replica behaviour" translate="no">​</a></h2>
<p>One of the main reasons to use AKS is that you can run multiple replicas for high availability. I wanted to verify that the Redis-backed caching works across pods.</p>
<p>I split the capture into three shorter GIFs so each one shows a specific test. The values in these captures are from the live AKS deployment, with API keys redacted.</p>
<p>First, the proxy health and model catalogue:</p>
<p><img decoding="async" loading="lazy" alt="Terminal demo showing LiteLLM readiness and 19 configured models returned by the live proxy" src="https://luke.geek.nz/assets/images/litellm-health-models-af26471afbf65296fcb832581caa2b1d.gif" width="1040" height="484" class="img_ev3q"></p>
<p>Then the completion and cache checks:</p>
<p><img decoding="async" loading="lazy" alt="Terminal demo showing Azure GPT-4o, Big Pickle, and a Redis cache miss then hit from the live proxy" src="https://luke.geek.nz/assets/images/litellm-chat-cache-fc1c25e518f268ef2fc36c34444ed468.gif" width="1040" height="484" class="img_ev3q"></p>
<p>I tested a few things through the public endpoint:</p>
<table><thead><tr><th>Test</th><th>Result</th></tr></thead><tbody><tr><td>Readiness</td><td><code>{"status":"healthy","db":"connected"}</code></td></tr><tr><td>Model count</td><td><code>19</code> models</td></tr><tr><td>Azure OpenAI call</td><td><code>azure-gpt-4o</code>, <code>0.153s</code>, response <code>Hello!</code></td></tr><tr><td>OpenCode Zen call</td><td><code>big-pickle</code>, <code>0.155s</code>, response <code>4</code>, <code>150</code> reasoning tokens</td></tr><tr><td>Redis cache</td><td><code>0.448s</code> miss, <code>0.132s</code> hit</td></tr></tbody></table>
<p>Then I ran a cross-pod cache test against two different LiteLLM pods:</p>
<p><img decoding="async" loading="lazy" alt="Terminal demo showing two AKS LiteLLM pods and a Redis cache hit across replicas" src="https://luke.geek.nz/assets/images/litellm-replica-cache-74236635dee16e0ba4823890531d2659.gif" width="1040" height="484" class="img_ev3q"></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">Pod A: litellm-proxy-9b9d6dffd-p2qqg on aks-userpool-12527619-vmss000001</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">Pod B: litellm-proxy-9b9d6dffd-qjfkr on aks-userpool-12527619-vmss000000</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">Pod A cache miss: 1.040s</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">Pod B cache hit:  0.636s</span><br></div></code></pre></div></div>
<p>Pretty basic test case, but it proves the important bit: the response was cached by one replica and then served by another replica from Azure Managed Redis. That is the behaviour I wanted before trusting horizontal scale-out.</p>
<p>I also tested a shorter prompt through the public ingress:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">Cache miss: 0.448s</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">Cache hit:  0.132s</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">Response:   Blue green</span><br></div></code></pre></div></div>
<p>The NGINX ingress controller distributes requests across the pods transparently, and the Redis cache serves cached responses regardless of which pod handles the request.</p>
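<p>A miss-then-hit check like the ones above can be scripted with the standard library. This is a sketch under assumptions, not the script used for the captures: the helper names, host, and key are placeholders, and it relies only on the proxy's OpenAI-compatible <code>/v1/chat/completions</code> route. To test across replicas, build the two requests against two different pod addresses (for example, two separate port-forwards).</p>

```python
import json
import time
import urllib.request


def build_chat_request(proxy_url, api_key, model, prompt):
    """Build an OpenAI-compatible chat completion request for the proxy."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=proxy_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_over_http(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def timed(send, request):
    """Call send(request) and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = send(request)
    return time.perf_counter() - start, result


def compare_cache(send, first_request, second_request):
    """Send two identical prompts in order and report both timings."""
    miss_seconds, first = timed(send, first_request)
    hit_seconds, second = timed(send, second_request)
    return {
        "miss_seconds": miss_seconds,
        "hit_seconds": hit_seconds,
        "responses": (first, second),
    }


if __name__ == "__main__":
    # Placeholder host and key: replace with a real endpoint and virtual key,
    # then call compare_cache(send_over_http, request, request).
    request = build_chat_request(
        "https://litellm.example.com", "sk-placeholder", "azure-gpt-4o", "Say hello"
    )
    print(request.get_method(), request.full_url)
```

<p>Passing <code>send</code> in as a parameter keeps the timing logic testable without a live proxy. A faster second call suggests a cache hit, but it is only a timing comparison, not proof on its own.</p>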
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="aks-configuration-for-litellm">AKS configuration for LiteLLM<a href="https://luke.geek.nz/azure/litellm-aks/#aks-configuration-for-litellm" class="hash-link" aria-label="Direct link to AKS configuration for LiteLLM" title="Direct link to AKS configuration for LiteLLM" translate="no">​</a></h2>
<p>A few settings in <code>k8s/litellm-deployment.yaml</code> are worth calling out.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="rolling-updates-and-pod-lifecycle">Rolling updates and pod lifecycle<a href="https://luke.geek.nz/azure/litellm-aks/#rolling-updates-and-pod-lifecycle" class="hash-link" aria-label="Direct link to Rolling updates and pod lifecycle" title="Direct link to Rolling updates and pod lifecycle" translate="no">​</a></h3>
<p>The deployment uses <code>maxUnavailable: 0</code> and <code>maxSurge: 1</code>, so Kubernetes never drops below the desired replica count during a rollout. A new pod starts, passes its readiness probe, gets added to the service, and only then does an old pod get terminated.</p>
<p>The readiness probe hits <code>/health/readiness</code> with a 30-second initial delay. LiteLLM won't pass that probe until Prisma has finished running <code>migrate deploy</code>, which matters on first deploy, when the schema migrations have to complete before the pod can take traffic. The liveness probe is separate - <code>/health/liveliness</code> with a 60-second initial delay and a 15-second period. Three failures in a row (roughly 45 seconds of failed checks) trigger a restart.</p>
<p>Graceful shutdown uses <code>terminationGracePeriodSeconds: 620</code> and a 5-second <code>preStop</code> sleep. The grace period is deliberately longer than LiteLLM's 600-second request timeout so in-flight requests can finish. The preStop sleep gives the load balancer a moment to deregister the pod before SIGTERM lands.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token key atrule">terminationGracePeriodSeconds</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> </span><span class="token number">620</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">lifecycle</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">preStop</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">exec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token key atrule">command</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"sh"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"-c"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span 
class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"sleep 5"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><br></div></code></pre></div></div>
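<p>The timing budget behind those numbers is simple to verify. Kubernetes counts the <code>preStop</code> hook against the termination grace period, so the sleep and the longest in-flight request both have to fit inside it. A minimal sketch, with an illustrative function name and the values quoted above:</p>

```python
def shutdown_slack_seconds(grace_period: int, pre_stop_sleep: int, request_timeout: int) -> int:
    """Seconds left before SIGKILL once the preStop sleep and the longest
    in-flight request have both been allowed to finish."""
    return grace_period - pre_stop_sleep - request_timeout


if __name__ == "__main__":
    # 620s grace period, 5s preStop sleep, 600s request timeout.
    print("slack:", shutdown_slack_seconds(620, 5, 600), "seconds")
```

<p>A negative result would mean a request running to the full timeout could be killed mid-flight.</p>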
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="autoscaling-and-disruption-budget">Autoscaling and disruption budget<a href="https://luke.geek.nz/azure/litellm-aks/#autoscaling-and-disruption-budget" class="hash-link" aria-label="Direct link to Autoscaling and disruption budget" title="Direct link to Autoscaling and disruption budget" translate="no">​</a></h3>
<p>The HPA targets 60% CPU and 80% memory, scaling between 2 and 10 replicas. The PodDisruptionBudget sets <code>minAvailable: 1</code>, so <code>kubectl drain</code> during node maintenance can't terminate the last running pod. Worth having, especially in a two-node user pool.</p>
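<p>As a rough sketch of what those two objects look like, the settings above map onto an <code>autoscaling/v2</code> HPA and a <code>policy/v1</code> PodDisruptionBudget along these lines. This is not copied from the template: the <code>litellm</code> resource names and the <code>app: litellm</code> label are assumptions for illustration.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># Sketch only: resource names and the app label are assumptions</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">apiVersion: autoscaling/v2</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">kind: HorizontalPodAutoscaler</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">metadata:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  name: litellm</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  namespace: litellm</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">spec:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  scaleTargetRef:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    apiVersion: apps/v1</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    kind: Deployment</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    name: litellm</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  minReplicas: 2</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  maxReplicas: 10</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  metrics:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    - type: Resource</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      resource:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        name: cpu</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        target:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">          type: Utilization</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">          averageUtilization: 60</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    - type: Resource</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      resource:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        name: memory</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        target:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">          type: Utilization</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">          averageUtilization: 80</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">---</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">apiVersion: policy/v1</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">kind: PodDisruptionBudget</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">metadata:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  name: litellm</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  namespace: litellm</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">spec:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  minAvailable: 1</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  selector:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    matchLabels:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      app: litellm</span><br></div></code></pre></div></div>
<p>Utilisation targets are measured as a percentage of the container's resource requests, so an HPA like this only works if the Deployment sets CPU and memory requests.</p>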
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="read-only-root-filesystem">Read-only root filesystem<a href="https://luke.geek.nz/azure/litellm-aks/#read-only-root-filesystem" class="hash-link" aria-label="Direct link to Read-only root filesystem" title="Direct link to Read-only root filesystem" translate="no">​</a></h3>
<p>The container runs with <code>readOnlyRootFilesystem: true</code>, <code>runAsNonRoot: true</code>, and all capabilities dropped. LiteLLM needs a few writable directories - Prisma writes binaries to a cache directory, migration state needs somewhere to live, and the UI needs writable paths for assets and logos. I used <code>emptyDir</code> volumes at <code>/app/cache</code>, <code>/app/migrations</code>, <code>/app/var/litellm/ui</code>, <code>/app/var/litellm/assets</code>, and <code>/tmp</code>.</p>
<p>The <code>LITELLM_NON_ROOT=true</code> environment variable adjusts the default UI paths to point into <code>/app/var/litellm</code> rather than trying to write into the container root.</p>
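<p>Put together, the hardening described in this section looks roughly like the pod template fragment below. Treat it as a sketch rather than the template's actual manifest: the container name and the volume names are assumptions, and fields such as <code>image</code> and resource requests are left out.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># Pod template fragment (spec.template.spec); container and volume names are assumptions</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">containers:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  - name: litellm</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    env:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      - name: LITELLM_NON_ROOT</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        value: "true"</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    securityContext:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      readOnlyRootFilesystem: true</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      runAsNonRoot: true</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      capabilities:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        drop: ["ALL"]</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    volumeMounts:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      - name: cache</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        mountPath: /app/cache</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      - name: migrations</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        mountPath: /app/migrations</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      - name: ui</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        mountPath: /app/var/litellm/ui</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      - name: assets</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        mountPath: /app/var/litellm/assets</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      - name: tmp</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        mountPath: /tmp</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">volumes:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  - name: cache</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    emptyDir: {}</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  - name: migrations</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    emptyDir: {}</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  - name: ui</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    emptyDir: {}</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  - name: assets</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    emptyDir: {}</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  - name: tmp</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    emptyDir: {}</span><br></div></code></pre></div></div>
<p>Each <code>emptyDir</code> is created empty when the pod is scheduled and deleted with the pod, so nothing written to these paths outlives it.</p>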
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="some-things-i-learned">Some things I learned<a href="https://luke.geek.nz/azure/litellm-aks/#some-things-i-learned" class="hash-link" aria-label="Direct link to Some things I learned" title="Direct link to Some things I learned" translate="no">​</a></h2>
<p>The CoreDNS split-DNS thing caught me early. AKS pods resolve DNS through CoreDNS, which by default forwards everything to the node's resolver - Azure DNS at 168.63.129.16. That works fine for private DNS zones (PostgreSQL, Redis, ACR all resolve correctly through private endpoints), but it breaks for public internet lookups. cert-manager's Let's Encrypt HTTP-01 challenge needs to resolve public domains, and with the default config it can't. The fix is a CoreDNS ConfigMap patch that sends public queries to 8.8.8.8 while keeping Azure private zones on 168.63.129.16 - the <code>postprovision</code> hook applies this automatically, but it's worth understanding why it's there.</p>
<p>Related to that: on my first deploy, cert-manager started the Let's Encrypt flow before the DNS A record had fully propagated. The challenge timed out and left a stale <code>cm-acme-http-solver</code> pod sitting in the namespace. Deleting the <code>Certificate</code> and <code>CertificateRequest</code> objects forced a fresh attempt once the DNS was actually ready.</p>
<p>Kubernetes secrets consumed as environment variables don't hot-reload - worth remembering. When I added the OpenCode Go API key, I updated the secret in the cluster, but the running pods still had the old environment because <code>envFrom</code> secrets are only read at pod startup. A force delete of the pods fixed it.</p>
<p>The virtual key deletion path still needs a closer look. Creating a scoped key and using it for completions worked fine, but the delete step surfaced a Redis cluster <code>MOVED</code> error during auth cache invalidation. The key disappeared from the list so it seems functionally gone, but I wouldn't call that clean. Left it as a follow-up rather than pretending it didn't happen.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="operational-checklist">Operational checklist<a href="https://luke.geek.nz/azure/litellm-aks/#operational-checklist" class="hash-link" aria-label="Direct link to Operational checklist" title="Direct link to Operational checklist" translate="no">​</a></h2>
<p>Once the proxy is deployed, here is how I validate it is working:</p>
<div class="language-powershell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-powershell codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token comment" style="color:rgb(98, 114, 164)"># AKS cluster state</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">az aks show </span><span class="token operator">-</span><span class="token plain">g rg-llm </span><span class="token operator">-</span><span class="token plain">n aks-llm </span><span class="token operator">--</span><span class="token plain">query provisioningState </span><span class="token operator">-</span><span class="token plain">o tsv</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">kubectl get nodes</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">kubectl get pods </span><span class="token operator">-</span><span class="token plain">n litellm </span><span class="token operator">-</span><span class="token plain">o wide</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">kubectl get hpa </span><span class="token operator">-</span><span class="token plain">n litellm</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># LiteLLM health and model catalogue</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">curl https:</span><span class="token operator">/</span><span class="token 
operator">/</span><span class="token plain">litellm</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">headinthecloud</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">co</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">nz/health/readiness</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">curl https:</span><span class="token operator">/</span><span class="token operator">/</span><span class="token plain">litellm</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">headinthecloud</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">co</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">nz/models </span><span class="token operator">-</span><span class="token plain">H </span><span class="token string" style="color:rgb(255, 121, 198)">"Authorization: Bearer &lt;key&gt;"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># Test a completion</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">curl </span><span class="token operator">-</span><span class="token plain">X POST https:</span><span class="token operator">/</span><span class="token operator">/</span><span class="token plain">litellm</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">headinthecloud</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span 
class="token plain">co</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">nz/chat/completions `</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token operator">-</span><span class="token plain">H </span><span class="token string" style="color:rgb(255, 121, 198)">"Content-Type: application/json"</span><span class="token plain"> `</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token operator">-</span><span class="token plain">H </span><span class="token string" style="color:rgb(255, 121, 198)">"Authorization: Bearer &lt;key&gt;"</span><span class="token plain"> `</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token operator">-</span><span class="token plain">d </span><span class="token string" style="color:rgb(255, 121, 198)">'{"model":"azure-gpt-4o","messages":[{"role":"user","content":"hello"}]}'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># Test Redis cache by repeating the same request</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">curl </span><span class="token operator">-</span><span class="token plain">X POST https:</span><span class="token operator">/</span><span class="token operator">/</span><span class="token plain">litellm</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">headinthecloud</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">co</span><span class="token punctuation" 
style="color:rgb(248, 248, 242)">.</span><span class="token plain">nz/chat/completions `</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token operator">-</span><span class="token plain">H </span><span class="token string" style="color:rgb(255, 121, 198)">"Content-Type: application/json"</span><span class="token plain"> `</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token operator">-</span><span class="token plain">H </span><span class="token string" style="color:rgb(255, 121, 198)">"Authorization: Bearer &lt;key&gt;"</span><span class="token plain"> `</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token operator">-</span><span class="token plain">d </span><span class="token string" style="color:rgb(255, 121, 198)">'{"model":"azure-gpt-4o","messages":[{"role":"user","content":"hello"}]}'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># Check Prometheus metrics</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token function" style="color:rgb(80, 250, 123)">Invoke-WebRequest</span><span class="token plain"> https:</span><span class="token operator">/</span><span class="token operator">/</span><span class="token plain">litellm</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">headinthecloud</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">co</span><span 
class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">nz/metrics</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">Content </span><span class="token punctuation" style="color:rgb(248, 248, 242)">|</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">Select-String</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"litellm_"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># Check ingress TLS certificate</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">kubectl get certificate </span><span class="token operator">-</span><span class="token plain">n litellm </span><span class="token operator">-</span><span class="token plain">o wide</span><br></div></code></pre></div></div>
<p>The test suite in <code>scripts/test-litellm.ps1</code> runs 26 checks across health, models, keys, chat, spend, metrics, and security. It cleans up generated test keys at the end.</p>
<div class="language-powershell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-powershell codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token operator">/</span><span class="token plain">scripts/</span><span class="token function" style="color:rgb(80, 250, 123)">test-litellm</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">ps1 </span><span class="token operator">-</span><span class="token plain">BaseUrl https:</span><span class="token operator">/</span><span class="token operator">/</span><span class="token plain">litellm</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">headinthecloud</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">co</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">nz </span><span class="token operator">-</span><span class="token plain">MasterKey &lt;key&gt;</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="cost-and-cleanup">Cost and cleanup<a href="https://luke.geek.nz/azure/litellm-aks/#cost-and-cleanup" class="hash-link" aria-label="Direct link to Cost and cleanup" title="Direct link to Cost and cleanup" translate="no">​</a></h3>
<p>This is not a free deployment. AKS Standard tier with two D4s_v3 node pools, Azure Managed Redis Balanced_B0, and PostgreSQL Standard_B2ms runs roughly $400-$500 per month in Australia East. The Azure OpenAI gpt-4o deployment adds pay-per-token cost.</p>
<p>To tear down the whole environment:</p>
<div class="language-powershell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-powershell codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd down </span><span class="token operator">--</span><span class="token plain">purge</span><br></div></code></pre></div></div>
<p>The <code>predown</code> hook deletes the <code>litellm</code> Kubernetes namespace before the resource group removal starts, so there are no dangling load balancer resources. If you want to keep data for later, export the PostgreSQL database first and store the <code>LITELLM_SALT_KEY</code> somewhere safe - you will need it to decrypt stored credentials when you restore.</p>
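<p>For anyone wiring up something similar, a minimal sketch of that kind of hook in <code>azure.yaml</code> is below. It assumes an inline <code>pwsh</code> command rather than whatever script the template actually calls, so treat it as illustrative only.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># azure.yaml fragment - a sketch, not the hook the template ships</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">hooks:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  predown:</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    shell: pwsh</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    run: kubectl delete namespace litellm --ignore-not-found</span><br></div></code></pre></div></div>
<p><code>kubectl delete</code> waits for the namespace to finish terminating by default, which is what gives the load balancer resources time to be cleaned up before <code>azd</code> starts removing the resource group.</p>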
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="wrapping-up">Wrapping up<a href="https://luke.geek.nz/azure/litellm-aks/#wrapping-up" class="hash-link" aria-label="Direct link to Wrapping up" title="Direct link to Wrapping up" translate="no">​</a></h2>
<p>Hopefully this gives you a starting point for running LiteLLM on AKS. The <code>azd</code> template handles the full deployment lifecycle, and the production configuration covers caching, rate limiting, spend tracking, and high availability out of the box.</p>
<p>I had a lot of fun setting this up. The mind boggles at what you could do with the MCP Gateway features on top of this - wiring up Microsoft Learn documentation tools or GitHub MCP servers through the same proxy. Something for the backlog.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references">References<a href="https://luke.geek.nz/azure/litellm-aks/#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<p><strong>LiteLLM:</strong></p>
<ul>
<li class=""><a href="https://docs.litellm.ai/" target="_blank" rel="noopener noreferrer" class="">LiteLLM documentation</a></li>
<li class=""><a href="https://docs.litellm.ai/docs/proxy/prod" target="_blank" rel="noopener noreferrer" class="">LiteLLM production best practices</a></li>
<li class=""><a href="https://github.com/BerriAI/litellm" target="_blank" rel="noopener noreferrer" class="">LiteLLM on GitHub</a></li>
<li class=""><a href="https://docs.litellm.ai/docs/proxy/mcp" target="_blank" rel="noopener noreferrer" class="">LiteLLM MCP Gateway</a></li>
</ul>
<p><strong>Azure Infrastructure &amp; Deployment:</strong></p>
<ul>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/developer/azure-developer-cli/?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Developer CLI</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/aks/concepts-network?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">AKS network concepts and security</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/azure-resource-manager/bicep/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Bicep language overview</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/aks/ingress-tls?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">HTTPS ingress on AKS with cert-manager</a></li>
</ul>
<p><strong>Networking &amp; Private Connectivity:</strong></p>
<ul>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/private-link/private-endpoint-overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Private Link and Private Endpoints</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/nat/nat-overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure NAT Gateway overview</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Workload Identity on AKS</a></li>
</ul>
<p><strong>Data Services:</strong></p>
<ul>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-how-to-premium-vnet?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Cache for Redis with private endpoints</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-networking?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">PostgreSQL Flexible Server networking</a></li>
</ul>]]></content:encoded>
            <category>Azure</category>
        </item>
        <item>
            <title><![CDATA[From cloud adoption to value realisation]]></title>
            <link>https://luke.geek.nz/misc/cloud-adoption-value-realisation/</link>
            <guid>https://luke.geek.nz/misc/cloud-adoption-value-realisation/</guid>
            <pubDate>Sun, 07 Jun 2026 07:54:38 GMT</pubDate>
            <description><![CDATA[Explore how to transition from cloud adoption to realising true value with Azure, ensuring measurable improvements for your organisation.]]></description>
            <content:encoded><![CDATA[<p>A lot of Azure programmes can answer one question pretty quickly: what did we deploy?</p>
<ul>
<li class="">Landing zone. Done.</li>
<li class="">Workloads migrated. Done.</li>
<li class="">Monitoring enabled. Done.</li>
<li class="">Tags and budgets configured. Done.</li>
<li class="">Security baseline applied. Done.</li>
</ul>
<p>Those are all useful things, and I am not downplaying them. They are part of getting cloud adoption right. But they are not the whole story.</p>
<p>The harder question is usually this one:</p>
<blockquote>
<p>What is measurably better because we adopted Azure?</p>
</blockquote>
<p>That is where the conversation shifts from cloud adoption to value realisation.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="adoption-is-not-the-finish-line">Adoption is not the finish line<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#adoption-is-not-the-finish-line" class="hash-link" aria-label="Direct link to Adoption is not the finish line" title="Direct link to Adoption is not the finish line" translate="no">​</a></h2>
<p>The <a href="https://learn.microsoft.com/azure/cloud-adoption-framework/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Cloud Adoption Framework</a> gives us a really useful way to think about the cloud journey. Strategy, Plan, Ready, Adopt, Govern, Secure, and Manage all have a place, and they are not only technical activities.</p>
<p>That matters because cloud adoption is not just moving workloads or deploying services. It is integrating cloud into the way the organisation works.</p>
<p>The trap I see is that teams often treat the Adopt phase as the finish line.</p>
<p>A workload is migrated.<br>
<!-- -->A platform is available.<br>
<!-- -->A dashboard exists.<br>
<!-- -->A new service is live.</p>
<p>Then the project closes, and everyone moves to the next thing.</p>
<p>That is where value can quietly leak out of the programme. The technical delivery might be successful, but the outcome is left assumed rather than managed.</p>
<p>A simple way I think about it is:</p>
<table><thead><tr><th>Cloud adoption asks</th><th>Value realisation asks</th></tr></thead><tbody><tr><td>Did we move or build the thing?</td><td>Did the organisation get better because of it?</td></tr><tr><td>Is the platform ready?</td><td>Can teams use it safely and consistently?</td></tr><tr><td>Is monitoring enabled?</td><td>Are decisions improving from the signals?</td></tr><tr><td>Do we have cost visibility?</td><td>Is spend changing behaviour?</td></tr><tr><td>Did we migrate the workload?</td><td>Did we retire cost, risk, or friction?</td></tr></tbody></table>
<p>Cloud adoption creates the conditions for value. It does not automatically realise it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-familiar-pattern">The familiar pattern<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#the-familiar-pattern" class="hash-link" aria-label="Direct link to The familiar pattern" title="Direct link to The familiar pattern" translate="no">​</a></h2>
<p>This is a pattern I have seen a few times, in different shapes.</p>
<p>A business case gets approved. The platform team does the right thing and builds the Azure foundation properly. Workloads move. Governance is applied. Azure Monitor starts collecting useful data. Microsoft Defender for Cloud is switched on. Cost Management reports are available.</p>
<p>The project is technically successful.</p>
<p>Then, six months later, someone asks:</p>
<blockquote>
<p>What did we get for the investment?</p>
</blockquote>
<p>And the answer is harder than it should be.</p>
<p>Not because Azure failed. Usually the platform did exactly what it was asked to do.</p>
<p>The gap is that nobody kept ownership of the value after go-live. The original benefit owner moved on, the review cadence became operational only, and the measures that justified the work were not carried into the run state.</p>
<p>The team can show that the workload is healthy. They can show that it is patched, monitored, backed up, and policy-compliant. But they cannot easily show whether the business outcome improved.</p>
<p><img decoding="async" loading="lazy" alt="Diagram showing a programme timeline split into Delivery and Run State zones. The Delivery zone tracks milestones from business case to go-live. The Run State zone shows operational health items staying tracked but benefit ownership, value reviews, and success measures quietly dropping off after project close." src="https://luke.geek.nz/assets/images/adoption-value-gap-df9e438b86daa3cce2c0e66c7fca5d6e.png" width="3096" height="1476" class="img_ev3q"></p>
<p>That is the bit worth fixing.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="start-with-the-outcome-not-the-service">Start with the outcome, not the service<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#start-with-the-outcome-not-the-service" class="hash-link" aria-label="Direct link to Start with the outcome, not the service" title="Direct link to Start with the outcome, not the service" translate="no">​</a></h2>
<p>This is where Azure roadmaps need a bit of discipline.</p>
<p>If the roadmap item says:</p>
<blockquote>
<p>Migrate application X to Azure.</p>
</blockquote>
<p>That is not wrong, but it is incomplete.</p>
<p>A better version is closer to:</p>
<blockquote>
<p>Migrate application X to Azure so that recovery time improves, platform risk reduces, the team can make use of additional functionality, and the legacy hosting contract can be retired by Q4.</p>
</blockquote>
<p>That second version gives you more to work with. It names the technical move, but it also names the expected value.</p>
<p>Some examples:</p>
<table><thead><tr><th>Azure activity</th><th>Better value framing</th></tr></thead><tbody><tr><td>Deploy an Azure landing zone</td><td>Teams can launch compliant workloads faster, with inherited security and governance controls</td></tr><tr><td>Migrate virtual machines</td><td>Legacy infrastructure risk reduces, recovery improves, and old hosting costs can be retired</td></tr><tr><td>Build a Fabric data platform</td><td>Decision latency reduces because trusted data is available at the right cadence</td></tr><tr><td>Roll out Copilot or Azure AI</td><td>A named workflow improves, with human review, quality thresholds, and support ownership</td></tr><tr><td>Enable Azure Monitor</td><td>Teams can act before users are impacted, not only after an incident is raised</td></tr></tbody></table>
<p>The technical activity still matters. It just needs to sit inside a value story.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="four-questions-before-calling-an-azure-roadmap-done">Four questions before calling an Azure roadmap done<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#four-questions-before-calling-an-azure-roadmap-done" class="hash-link" aria-label="Direct link to Four questions before calling an Azure roadmap done" title="Direct link to Four questions before calling an Azure roadmap done" translate="no">​</a></h2>
<p>These are the questions I like to ask before treating an Azure roadmap item as ready.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-business-outcome-should-change">What business outcome should change?<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#what-business-outcome-should-change" class="hash-link" aria-label="Direct link to What business outcome should change?" title="Direct link to What business outcome should change?" translate="no">​</a></h3>
<p>Be specific.</p>
<p>Not "modernise infrastructure".<br>
<!-- -->Not "improve reporting".<br>
<!-- -->Not "adopt AI".</p>
<p>Better outcomes sound like:</p>
<ul>
<li class="">reduce time to recover from a priority incident</li>
<li class="">reduce manual reporting effort for the finance team</li>
<li class="">improve release confidence for a customer-facing workload</li>
<li class="">reduce audit preparation effort</li>
<li class="">improve cost visibility by product, team, or service</li>
<li class="">reduce dependency on unsupported infrastructure</li>
</ul>
<p>If the outcome is hard to name, the roadmap item may still be a technical task rather than a value item.</p>
<p>That is fine, but call it what it is.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="who-owns-the-value-after-go-live">Who owns the value after go-live?<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#who-owns-the-value-after-go-live" class="hash-link" aria-label="Direct link to Who owns the value after go-live?" title="Direct link to Who owns the value after go-live?" translate="no">​</a></h3>
<p>This is a big one.</p>
<p>The delivery owner is not always the value owner.</p>
<p>A platform team might own the Azure landing zone. An application team might own the workload. But the value might belong to operations, finance, risk, customer service, or a product owner.</p>
<p>If nobody owns the value after go-live, the value probably will not be reviewed.</p>
<p>For each roadmap item, I want to know:</p>
<table><thead><tr><th>Role</th><th>Question</th></tr></thead><tbody><tr><td>Delivery owner</td><td>Who gets it live?</td></tr><tr><td>Operational owner</td><td>Who runs it after go-live?</td></tr><tr><td>Value owner</td><td>Who proves whether it was worth doing?</td></tr><tr><td>Decision owner</td><td>Who can change course if the measure does not move?</td></tr></tbody></table>
<p>Those roles might be the same person in a small organisation. In larger environments, they often are not.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-should-move-first">What should move first?<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#what-should-move-first" class="hash-link" aria-label="Direct link to What should move first?" title="Direct link to What should move first?" translate="no">​</a></h3>
<p>Lagging measures are useful, but they can arrive too late.</p>
<p>If the only measure is annual cost reduction, customer satisfaction, or revenue growth, you might wait months before you know whether the change is working.</p>
<p>A leading indicator gives you an earlier signal.</p>
<p>For example:</p>
<table><thead><tr><th>Outcome</th><th>Early signal</th></tr></thead><tbody><tr><td>Faster release cycle</td><td>Deployment frequency or lead time changes within 30-60 days</td></tr><tr><td>Better reliability</td><td>Incident volume, alert quality, or mean time to restore starts improving</td></tr><tr><td>Better reporting</td><td>Manual report preparation time reduces</td></tr><tr><td>Better cost ownership</td><td>Teams review cost by workload or product each month</td></tr><tr><td>Better adoption</td><td>Repeat usage grows after the first training wave</td></tr></tbody></table>
<p>Azure gives us a lot of telemetry. <a href="https://learn.microsoft.com/azure/azure-monitor/fundamentals/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Monitor</a>, <a href="https://learn.microsoft.com/azure/azure-monitor/app/app-insights-overview?tabs=webapps&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Application Insights</a>, <a href="https://learn.microsoft.com/azure/azure-monitor/logs/log-analytics-overview?tabs=simple&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Log Analytics</a>, <a href="https://learn.microsoft.com/azure/cost-management-billing/costs/overview-cost-management?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Cost Management</a>, <a href="https://learn.microsoft.com/azure/defender-for-cloud/defender-for-cloud-introduction?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Defender for Cloud</a>, and <a href="https://learn.microsoft.com/azure/governance/policy/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Policy</a> can all provide useful signals.</p>
<p>The trick is connecting those signals to the outcome the organisation cares about.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-gets-retired">What gets retired?<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#what-gets-retired" class="hash-link" aria-label="Direct link to What gets retired?" title="Direct link to What gets retired?" translate="no">​</a></h3>
<p>One of the cleanest ways to realise value is to stop paying for, supporting, or working around something.</p>
<p>Cloud programmes sometimes add the new thing but keep the old thing alive.</p>
<ul>
<li class="">New platform, old process.</li>
<li class="">New dashboard, old spreadsheet.</li>
<li class="">New cloud environment, old hosting contract.</li>
<li class="">New automation, old manual approval path.</li>
<li class="">New governance model, old exception process.</li>
</ul>
<p>That is how cost and complexity grow together. As someone once told me:</p>
<blockquote>
<p>"Cloud is not where you work, it's how you work."</p>
</blockquote>
<p>For each Azure roadmap item, ask:</p>
<blockquote>
<p>What do we stop, retire, reduce, or simplify if this works?</p>
</blockquote>
<p>That could be:</p>
<ul>
<li class="">a legacy server or hosting contract</li>
<li class="">a manual reporting process</li>
<li class="">an old integration path</li>
<li class="">duplicate tooling</li>
<li class="">a support burden</li>
<li class="">a recurring security exception</li>
<li class="">a governance meeting that no longer adds value</li>
</ul>
<p>If nothing gets retired or simplified, the organisation might still choose to proceed, but it should do so with eyes open.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-few-field-learnings">A few field learnings<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#a-few-field-learnings" class="hash-link" aria-label="Direct link to A few field learnings" title="Direct link to A few field learnings" translate="no">​</a></h2>
<p>None of these are tied to a specific customer. They are patterns I have seen often enough that I now look for them early.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-landing-zone-without-a-first-workload-is-hard-to-explain">A landing zone without a first workload is hard to explain<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#a-landing-zone-without-a-first-workload-is-hard-to-explain" class="hash-link" aria-label="Direct link to A landing zone without a first workload is hard to explain" title="Direct link to A landing zone without a first workload is hard to explain" translate="no">​</a></h3>
<p>Azure landing zones are important. I am a <strong>big fan</strong> of doing the foundation properly: identity, network, management groups, policy, logging, security baseline, subscription structure, and deployment patterns.</p>
<p>But a landing zone by itself can be a hard thing for a business sponsor to value.</p>
<p>The value becomes clearer when it is tied to the first workload or product team that uses it. I generally push the conversation towards one question: what is the business strategy, and how can the platform support it?</p>
<p>Instead of saying:</p>
<blockquote>
<p>We deployed the landing zone.</p>
</blockquote>
<p>Try to get to:</p>
<blockquote>
<p>The first product team can now deploy into a governed subscription without a custom security review every time.</p>
</blockquote>
<p>That moves the conversation from foundation delivered to friction removed.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="cost-visibility-is-not-the-same-as-cost-ownership">Cost visibility is not the same as cost ownership<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#cost-visibility-is-not-the-same-as-cost-ownership" class="hash-link" aria-label="Direct link to Cost visibility is not the same as cost ownership" title="Direct link to Cost visibility is not the same as cost ownership" translate="no">​</a></h3>
<p>I have seen teams build good Azure Cost Management views, dashboards, tags, and budgets, then wonder why spend behaviour does not change.</p>
<p>The missing part is usually ownership.</p>
<p>If a team can see cost but cannot change architecture, scale settings, licensing, retention, or product priority, then visibility becomes reporting rather than management.</p>
<p>The FinOps conversation needs at least three groups in the room:</p>
<table><thead><tr><th>Group</th><th>What they bring</th></tr></thead><tbody><tr><td>Finance</td><td>Budget, forecast, and commercial discipline</td></tr><tr><td>Engineering or platform</td><td>Technical options and trade-offs</td></tr><tr><td>Business or product owner</td><td>Value judgement and prioritisation</td></tr></tbody></table>
<p>If one of those is missing, the conversation usually turns into either "cut spend" or "explain spend", neither of which is enough. I do feel <a href="https://learn.microsoft.com/azure/governance/service-groups/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Service Groups</a> will help here - they provide a way to group and govern resources across subscriptions by business context, making it easier to align cost visibility with the workloads and products that own the spend. I have started including them in all deliverables for that reason.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="monitoring-has-to-produce-a-decision">Monitoring has to produce a decision<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#monitoring-has-to-produce-a-decision" class="hash-link" aria-label="Direct link to Monitoring has to produce a decision" title="Direct link to Monitoring has to produce a decision" translate="no">​</a></h3>
<p>Azure Monitor, Application Insights, and Log Analytics can give you a lot of useful telemetry.</p>
<p>The trap is building dashboards that nobody uses to make a decision.</p>
<p>For each dashboard or workbook, I like to ask:</p>
<ul>
<li class="">who looks at this?</li>
<li class="">how often?</li>
<li class="">what decision can they make from it?</li>
<li class="">what action happens when the number is red?</li>
<li class="">what gets ignored because it is noise?</li>
</ul>
<p>If the answer is "we might need it someday", the dashboard is probably an archive, not an operational tool.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="ai-pilots-need-a-boring-operating-model">AI pilots need a boring operating model<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#ai-pilots-need-a-boring-operating-model" class="hash-link" aria-label="Direct link to AI pilots need a boring operating model" title="Direct link to AI pilots need a boring operating model" translate="no">​</a></h3>
<p>AI pilots are exciting, but the value usually comes from the boring bits around them.</p>
<p>Who owns the workflow?
Who reviews the output?
What quality threshold is acceptable?
What happens when the model is wrong?
Who pays for run cost if usage grows?
Who supports users after the demo?</p>
<p>Those questions are less flashy than the demo, but they are usually what decides whether the pilot becomes useful.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-old-thing-has-a-habit-of-surviving">The old thing has a habit of surviving<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#the-old-thing-has-a-habit-of-surviving" class="hash-link" aria-label="Direct link to The old thing has a habit of surviving" title="Direct link to The old thing has a habit of surviving" translate="no">​</a></h3>
<p>One of the easiest value leaks to miss is the old thing staying alive.</p>
<p>The old report. The old server. The old approval process. The old spreadsheet. The old vendor contract. The old incident workaround.</p>
<p>Sometimes there is a good reason. Maybe the migration is phased, the risk is too high, or the replacement has not proven itself yet.</p>
<p>But if the old thing survives because nobody owns the retirement decision, the business case quietly weakens.</p>
<p>I now try to make retirement part of the roadmap item, not an afterthought.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="value-needs-a-review-cadence">Value needs a review cadence<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#value-needs-a-review-cadence" class="hash-link" aria-label="Direct link to Value needs a review cadence" title="Direct link to Value needs a review cadence" translate="no">​</a></h2>
<p>A value measure without a review cadence is mostly decoration.</p>
<p>The cadence does not need to be heavy. In fact, it is better if it isn't.</p>
<p>A few examples:</p>
<table><thead><tr><th>Cadence</th><th>Useful for</th></tr></thead><tbody><tr><td>Monthly service review</td><td>Operational health, incidents, support, adoption, cost anomalies</td></tr><tr><td>Quarterly value review</td><td>Outcome progress, roadmap adjustment, benefit tracking</td></tr><tr><td>FinOps iteration</td><td>Spend, optimisation, forecasting, unit economics</td></tr><tr><td>Well-Architected review</td><td>Workload risk, trade-offs, quality improvement</td></tr><tr><td>Product or platform steering</td><td>Priorities, ownership, funding, and stop/change/continue decisions</td></tr></tbody></table>
<p>This is where the <a href="https://learn.microsoft.com/azure/cloud-adoption-framework/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Cloud Adoption Framework</a> and <a href="https://learn.microsoft.com/azure/well-architected/?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Well-Architected Framework</a> complement each other nicely.</p>
<p>CAF helps shape the adoption and operating model. Well-Architected helps you keep improving workload quality across reliability, security, cost optimisation, operational excellence, and performance efficiency.</p>
<p>FinOps adds another important lens: cost decisions need to be connected to business value, not just lower spend.</p>
<p>To be frank, that is where some cloud cost conversations go sideways. Saving money is good, but the better question is whether the spend is producing enough value for the workload, product, or service it supports.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-simple-value-realisation-loop">A simple value realisation loop<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#a-simple-value-realisation-loop" class="hash-link" aria-label="Direct link to A simple value realisation loop" title="Direct link to A simple value realisation loop" translate="no">​</a></h2>
<p>The model I keep coming back to is:</p>
<p><img decoding="async" loading="lazy" alt="Value realisation loop showing six steps from business outcome through to stop, change, or continue decision, with a feedback arrow returning to the top" src="https://luke.geek.nz/assets/images/value-realisation-loop-c8adb54cb73e5595bdef06d3b9ff63dd.png" width="1863" height="1722" class="img_ev3q"></p>
<p>The six steps are:</p>
<ol>
<li class=""><strong>Define the business outcome</strong> - name what should improve, not just what gets built</li>
<li class=""><strong>Identify the Azure capability</strong> - the service, pattern, or platform that enables it</li>
<li class=""><strong>Assign ownership</strong> - delivery, operational, and value owners, not just a project lead</li>
<li class=""><strong>Set the leading indicator</strong> - the earliest signal that the outcome is moving</li>
<li class=""><strong>Review on cadence</strong> - a scheduled checkpoint where the measure is actually looked at</li>
<li class=""><strong>Stop, change, or continue</strong> - a decision trigger so the work stays accountable</li>
</ol>
<p>Nothing clever there, but it forces the right conversation.</p>
<p>If an item cannot make it through that loop, it is probably not ready to be called a strategic roadmap item yet. It might still be necessary technical work, but be clear that is what it is.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-practical-example">A practical example<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#a-practical-example" class="hash-link" aria-label="Direct link to A practical example" title="Direct link to A practical example" translate="no">​</a></h2>
<p>Take a simple migration example.</p>
<p>Weak framing:</p>
<blockquote>
<p>Migrate the claims application to Azure.</p>
</blockquote>
<p>Better framing:</p>
<blockquote>
<p>Migrate the claims application to Azure so that recovery time improves, unsupported infrastructure risk is removed, deployment lead time reduces, and the legacy hosting arrangement can be retired.</p>
</blockquote>
<p>The value realisation loop might look like this:</p>
<table><thead><tr><th>Element</th><th>Example</th></tr></thead><tbody><tr><td>Outcome</td><td>Claims platform is more resilient and cheaper to operate</td></tr><tr><td>Azure capability</td><td>Azure landing zone, workload migration, Azure Monitor, Azure Backup, policy baseline</td></tr><tr><td>Operating owner</td><td>Application owner plus platform operations</td></tr><tr><td>Leading indicator</td><td>Restore test completes within RTO target within 30 days; deployment lead time drops from days to hours within 60 days; support tickets down 20% by month 3</td></tr><tr><td>Review cadence</td><td>Monthly service review for 3 months post go-live, then quarterly</td></tr><tr><td>Stop/change/continue</td><td>Continue if recovery and deployment measures improve; revisit scope if legacy hosting contract cannot be confirmed for retirement by end of Q3</td></tr></tbody></table>
<p>The point is not the table itself - it is that the team now has something to look at together at the monthly review, rather than only tracking whether the workload is green.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-this-fits-for-azure-teams">Where this fits for Azure teams<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#where-this-fits-for-azure-teams" class="hash-link" aria-label="Direct link to Where this fits for Azure teams" title="Direct link to Where this fits for Azure teams" translate="no">​</a></h2>
<p>For Azure architects and platform teams, this does not mean turning every technical task into a business case.</p>
<p>Some work is foundational. Some work is hygiene. Some work is risk reduction. Not everything needs a dramatic value story.</p>
<p>But the bigger the investment, the more important this becomes.</p>
<p>If you are asking for funding, executive attention, delivery capacity, migration downtime, security exceptions, or behaviour change, then value ownership matters.</p>
<p>At minimum, each major Azure roadmap item (I would do this as an Epic in Agile) should have:</p>
<ul>
<li class="">a named outcome</li>
<li class="">a value owner</li>
<li class="">one leading indicator</li>
<li class="">one lagging measure if available</li>
<li class="">a review cadence</li>
<li class="">a clear statement of what gets retired, reduced, or simplified</li>
<li class="">a stop/change/continue trigger</li>
</ul>
<p>That is enough to move the conversation from "we adopted Azure" to "Azure helped us improve something that mattered."</p>
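<p>If you track roadmap items in a backlog tool or a repo, the checklist above can be sketched as a small record with a completeness check. This is only an illustration: the class name, field names, and the <code>gaps</code> helper are hypothetical, not part of any Azure or Agile tooling.</p>

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RoadmapItem:
    """One major Azure roadmap item; field names are illustrative only."""
    outcome: str = ""
    value_owner: str = ""
    leading_indicator: str = ""
    review_cadence: str = ""
    retires: str = ""            # what gets retired, reduced, or simplified
    decision_trigger: str = ""   # the stop/change/continue trigger
    lagging_measure: Optional[str] = None  # "if available", so never required


def gaps(item):
    """Return the names of required fields that are still empty."""
    return [
        f.name
        for f in fields(item)
        if f.name != "lagging_measure" and not getattr(item, f.name).strip()
    ]


item = RoadmapItem(
    outcome="Claims platform is more resilient and cheaper to operate",
    value_owner="Application owner",
)
print(gaps(item))
```

<p>For the partly filled item above, the check reports the leading indicator, review cadence, retirement statement, and decision trigger as still missing.</p>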
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="final-thoughts">Final thoughts<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#final-thoughts" class="hash-link" aria-label="Direct link to Final thoughts" title="Direct link to Final thoughts" translate="no">​</a></h2>
<p>The best Azure roadmap is not the one with the most services on it.</p>
<p>It is the one where every platform decision traces to an outcome, every outcome has an owner, every owner has a measure, and every measure is reviewed after go-live.</p>
<p>Adoption gets you into Azure. Value realisation proves Azure was worth adopting.</p>
<p>Hopefully, this helps you look at your next Azure roadmap with a slightly sharper lens.</p>
<p>Before approving the next wave, ask:</p>
<blockquote>
<p>What will be measurably better 90 days after this goes live, and who owns proving it?</p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references">References<a href="https://luke.geek.nz/misc/cloud-adoption-value-realisation/#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://learn.microsoft.com/azure/cloud-adoption-framework/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Cloud Adoption Framework for Azure</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/cloud-adoption-framework/plan/prepare-organization-for-cloud?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Prepare your organisation for the cloud</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/cloud-adoption-framework/plan/document-cloud-adoption-plan?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Document your cloud adoption plan</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/cloud-adoption-framework/manage/ready-cloud-operations?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Ready your Azure cloud operations</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/well-architected/what-is-well-architected-framework?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Well-Architected Framework</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/well-architected/operational-excellence/principles?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Operational Excellence design principles</a></li>
<li class=""><a href="https://learn.microsoft.com/cloud-computing/finops/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">What is FinOps?</a></li>
<li class=""><a href="https://learn.microsoft.com/cloud-computing/finops/framework/quantify/quantify-business-value?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Quantify business value with FinOps</a></li>
</ul>]]></content:encoded>
            <category>Misc</category>
        </item>
        <item>
            <title><![CDATA[OMO Teams: Multi-agent project delivery with ARB gates]]></title>
            <link>https://luke.geek.nz/misc/omo-teams-arb-gates/</link>
            <guid>https://luke.geek.nz/misc/omo-teams-arb-gates/</guid>
            <pubDate>Mon, 01 Jun 2026 02:28:10 GMT</pubDate>
            <description><![CDATA[Explore how OMO Teams streamline multi-agent project delivery with ARB gates, enhancing governance and efficiency in Azure workflows.]]></description>
            <content:encoded><![CDATA[<p>I've spent the last year building AI agent workflows for Azure projects, and I kept running into the same problem. The agents were useful in isolation - writing code, reviewing PRs, checking security - but there was no structure connecting them. No governance. No audit trail. No one could tell me who approved what and why.</p>
<p>So I built some Teams using <a href="https://omo.dev/docs#team-mode" target="_blank" rel="noopener noreferrer" class="">Oh My OpenAgent Team Mode</a> on the open-source <a href="https://opencode.ai/" target="_blank" rel="noopener noreferrer" class="">OpenCode</a> harness.</p>
<p>The idea is simple: five phases, each with a dedicated team of AI agents, and an Architecture Review Board (ARB) gate between them. The gate has real voters, real quorum rules, and a real escalation path when things deadlock. Every decision gets committed as a Markdown file - essentially governance as code.</p>
<p>And because I believe in eating your own dog food, I used OMO Teams to build the OMO Teams Quickstart. This post walks through what happened.</p>
<p><img decoding="async" loading="lazy" alt="OMO Teams overview" src="https://luke.geek.nz/assets/images/OhmyOpenCodeAgent-Overview-e8905de1476658361197430b6a9a8732.png" width="654" height="192" class="img_ev3q"></p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-five-phase-model">The five-phase model<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#the-five-phase-model" class="hash-link" aria-label="Direct link to The five-phase model" title="Direct link to The five-phase model" translate="no">​</a></h2>
<p>The lifecycle covers everything from intake to production. Each phase has a different voter set because the decisions at each gate are different.</p>
<table><thead><tr><th>Phase</th><th>Voters</th><th>What they're signing off</th></tr></thead><tbody><tr><td>Phase 0 - Intake</td><td>Product Owner, Cloud Economics</td><td>Is the business case sound? Is the budget real?</td></tr><tr><td>Phase 1 - Architecture</td><td>Principal Architect, Security Lead, Product Owner</td><td>Are the ADRs correct? Is the threat model complete?</td></tr><tr><td>Phase 2 - Build</td><td>Principal Architect, Product Owner</td><td>Does the code match the architecture?</td></tr><tr><td>Phase 3 - Validate</td><td>Security Lead, Product Owner, Cloud Economics</td><td>Do the tests pass? Are there open security findings?</td></tr><tr><td>Phase 4 - Prod</td><td>Product Owner, Security Lead, Cloud Economics</td><td>Are the runbooks ready? Has DR been tested?</td></tr></tbody></table>
<p>The voters are defined in a YAML config file that gets fed into a Python tally script. The script reads individual vote files, checks quorum, and writes an immutable outcome. No dashboards, no ticket systems, no approval workflows in a SaaS tool. Just Markdown and a Python script.</p>
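<p>To make the quorum check concrete, here is a minimal sketch of the tally logic. It is not the actual script: the voter identifiers, the <code>approve</code>/<code>reject</code> vote values, the outcome names, and the rule that a tie escalates are all assumptions, and it works on an in-memory mapping rather than reading vote files.</p>

```python
from collections import Counter

# Hypothetical identifiers for the Phase 1 voter set from the table above.
PHASE1_VOTERS = ["principal-architect", "security-lead", "product-owner"]


def tally(voters, votes, quorum):
    """Return a gate outcome from a mapping of voter name to vote.

    voters: names allowed to vote at this gate
    votes:  mapping of voter name to "approve" or "reject" (assumed values)
    quorum: minimum number of eligible votes needed for the gate to decide
    """
    # Ignore votes from anyone outside the configured voter set.
    eligible = {name: vote for name, vote in votes.items() if name in voters}
    if quorum > len(eligible):
        return "no-quorum"
    counts = Counter(eligible.values())
    if counts["approve"] > counts["reject"]:
        return "approved"
    if counts["reject"] > counts["approve"]:
        return "rejected"
    # A tie is treated as a deadlock and handed to the escalation path.
    return "escalate"


outcome = tally(
    PHASE1_VOTERS,
    {"principal-architect": "approve", "security-lead": "approve", "product-owner": "approve"},
    quorum=2,
)
print(outcome)
```

<p>Keeping the decision a pure function of the voter list and the votes is what makes the outcome easy to commit and audit: the same inputs always give the same result.</p>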
<p><img decoding="async" loading="lazy" alt="OMO five-stage lifecycle teams" src="https://luke.geek.nz/assets/images/OMO_5StageLifecycleTeams-55a4bc703b8a1941cc9bc56e29c99489.jpg" width="4503" height="1056" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-the-teams-are-wired">How the teams are wired<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#how-the-teams-are-wired" class="hash-link" aria-label="Direct link to How the teams are wired" title="Direct link to How the teams are wired" translate="no">​</a></h2>
<p>Before walking through what happened, it helps to understand how these teams actually get their capabilities.</p>
<p>Each team is a JSON config file in <code>.omo/teams/</code>. A team has a <strong>lead</strong> (always the <code>atlas</code> subagent type, which acts as the phase coordinator) and a list of <strong>members</strong>. Every member has two things: an <code>always_load</code> skill list and a <code>conditional_load</code> list.</p>
<p><code>always_load</code> gives the member its baseline capabilities - skills that load regardless of the project. <code>conditional_load</code> is where it gets interesting: the orchestrator scans the ADR files for keywords and loads additional skills only when they match. A backend architect working on a Cosmos DB project gets <code>cosmos-db-nosql-patterns</code> loaded automatically. One working on Postgres gets <code>postgresql-npgsql</code> instead. The member prompts stay generic; the skills carry the domain depth.</p>
<p><img decoding="async" loading="lazy" alt="OMO skill loading model" src="https://luke.geek.nz/assets/images/OMO_SkillLoading-f22a810d695431ef89f0582f68b63227.jpg" width="1920" height="2553" class="img_ev3q"></p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token property">"always_load"</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"context-map"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"adr-management"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"azure-container-apps"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token property">"conditional_load"</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"> </span><span class="token property">"if_adr_contains"</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 
198)">"networking"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"vnet"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"private endpoint"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token property">"load"</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"private-networking"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"azure-network-security-perimeter"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"> </span><span class="token property">"if_adr_contains"</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"managed identity"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span 
class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token property">"load"</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"identity-managed-identity"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><br></div></code></pre></div></div>
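<p>The keyword matching can be sketched in a few lines of Python. This is a guess at the behaviour rather than the orchestrator's real code: the function name <code>resolve_skills</code>, the case-insensitive substring match, and the sample ADR text are assumptions. The member config is the JSON fragment above.</p>

```python
# Member config copied from the JSON fragment above.
MEMBER = {
    "always_load": ["context-map", "adr-management", "azure-container-apps"],
    "conditional_load": [
        {
            "if_adr_contains": ["networking", "vnet", "private endpoint"],
            "load": ["private-networking", "azure-network-security-perimeter"],
        },
        {
            "if_adr_contains": ["managed identity"],
            "load": ["identity-managed-identity"],
        },
    ],
}


def resolve_skills(member, adr_text):
    """Return always_load skills plus conditional skills whose keywords match."""
    text = adr_text.lower()  # assumption: matching ignores case
    skills = list(member["always_load"])
    for rule in member["conditional_load"]:
        if any(keyword in text for keyword in rule["if_adr_contains"]):
            for skill in rule["load"]:
                if skill not in skills:  # avoid loading a skill twice
                    skills.append(skill)
    return skills


# Hypothetical ADR sentence, used only to exercise the second rule.
skills = resolve_skills(MEMBER, "Use a Managed Identity for Cosmos DB access")
print(skills)
```

<p>With that sample text, only the managed identity rule fires, so the member ends up with its three baseline skills plus <code>identity-managed-identity</code>.</p>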
<p>Members also consume <code>load_instructions</code> - paths to instruction files (coding standards, writing style guides, data sovereignty rules) that get prepended to the member's prompt before it runs. A backend builder picks up C# and ASP.NET Core standards automatically. An infra builder picks up Bicep conventions. The knowledge lives in the instruction files, not duplicated across every member prompt.</p>
<p>Members can also declare <code>trivial_project_skip: true</code>. For a simple stateless API like LinkSnap, the orchestrator reads a trivial-project-mode marker and skips members that aren't relevant - the data architect, agent architect, frontend builder, and MCP server builder all sat out. That's not a limitation; it's the teams adapting to project scope.</p>
<p>The orchestrator detects the current phase by reading <code>.sisyphus/state/</code> - a set of marker files that gate progression. No state variable in memory, no database. Just files on disk. <code>phase1-gate-approved.md</code> exists → Phase 2 can start. It doesn't exist yet → ARB Gate 1 must run first.</p>
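<p>A rough sketch of that file-based check follows. It assumes every gate writes a marker named <code>phaseN-gate-approved.md</code>, which is generalised from the single <code>phase1-gate-approved.md</code> example; the real naming scheme and the <code>current_phase</code> helper are hypothetical.</p>

```python
import tempfile
from pathlib import Path


def current_phase(state_dir, phase_count=5):
    """Return the first phase whose gate marker is missing.

    Assumes markers are named phaseN-gate-approved.md (generalised from the
    phase1-gate-approved.md example; the real scheme may differ).
    """
    state = Path(state_dir)
    for phase in range(phase_count - 1):
        marker = state / f"phase{phase}-gate-approved.md"
        if not marker.exists():
            # This phase's gate has not been approved, so work stays here.
            return phase
    # Every gate before the final phase has been approved.
    return phase_count - 1


# Demo against a throwaway directory standing in for .sisyphus/state/.
with tempfile.TemporaryDirectory() as tmp:
    demo_phases = [current_phase(tmp)]
    (Path(tmp) / "phase0-gate-approved.md").write_text("approved\n")
    demo_phases.append(current_phase(tmp))
    (Path(tmp) / "phase1-gate-approved.md").write_text("approved\n")
    demo_phases.append(current_phase(tmp))

print(demo_phases)
```

<p>Because the state is just files, the current phase can be recomputed from a fresh checkout at any time, and the Git history shows when each gate was passed.</p>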
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-project-linksnap">The project: LinkSnap<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#the-project-linksnap" class="hash-link" aria-label="Direct link to The project: LinkSnap" title="Direct link to The project: LinkSnap" translate="no">​</a></h2>
<p>The application itself is deliberately boring. It's a URL shortener API built with FastAPI, deployed on Azure Container Apps, with Cosmos DB for NoSQL. About 200 lines of Python and 60 lines of Bicep. I wanted the product to be simple enough that nobody would mistake it for the interesting part. The interesting part is the process.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="phase-0-intake">Phase 0: intake<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#phase-0-intake" class="hash-link" aria-label="Direct link to Phase 0: intake" title="Direct link to Phase 0: intake" translate="no">​</a></h2>
<p>Phase 0 is a single-member team: the <strong>product-owner</strong> agent, running on a <code>writing</code> category model with <code>quick</code> as fallback.</p>
<p><img decoding="async" loading="lazy" alt="Phase 0 intake artifacts" src="https://luke.geek.nz/assets/images/OhmyOpenCodeAgent-Phase0Intake-32f3f140af08fb9e75746a77ddac2b16.png" width="1461" height="264" class="img_ev3q"></p>
<p>The product-owner carries a substantial <code>always_load</code> skill set:</p>
<ul>
<li class=""><code>create-prd</code> - structures the project requirements document</li>
<li class=""><code>business-case-investment-justification</code> - formalises the financial case</li>
<li class=""><code>stakeholder-map</code> - identifies who needs to be in the room</li>
<li class=""><code>risk-register</code> - Likelihood × Impact matrix</li>
<li class=""><code>agent-economics</code> - estimates token spend across all five phases before a single line of code is written</li>
<li class=""><code>identify-assumptions</code>, <code>pre-mortem</code>, <code>prioritize-features</code>, <code>value-proposition</code>, <code>naming-strategist</code>, <code>user-stories</code></li>
</ul>
<p>The <code>agent-economics</code> skill matters most here: running it at intake means the ARB gate voters know what the agent budget is before they approve anything. It's not just infrastructure cost - it's a budget for the agents doing the work.</p>
<p>The output gate requires six documents to exist before the ARB will even convene: problem statement, success metrics, budget constraints, compliance scope, risk register, and stakeholder map. If any are missing, the gate blocks.</p>
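<p>The blocking rule is simple enough to sketch. The document identifiers and function names below are hypothetical; only the list of six required documents comes from the gate description above.</p>

```python
# The six intake documents named above; the identifiers are hypothetical.
REQUIRED_DOCS = [
    "problem-statement",
    "success-metrics",
    "budget-constraints",
    "compliance-scope",
    "risk-register",
    "stakeholder-map",
]


def missing_intake_docs(present):
    """Return the required documents that are absent, in the order listed."""
    found = set(present)
    return [doc for doc in REQUIRED_DOCS if doc not in found]


def gate_can_convene(present):
    """The ARB convenes only when no required document is missing."""
    return not missing_intake_docs(present)


# Example: everything is ready except the stakeholder map.
ready = [doc for doc in REQUIRED_DOCS if doc != "stakeholder-map"]
print(gate_can_convene(ready), missing_intake_docs(ready))
```

<p>Returning the list of missing documents, rather than a bare yes or no, gives the product-owner agent something actionable when the gate blocks.</p>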
<p>For LinkSnap, the product-owner ran a discovery session on the project brief. The output was a risk register with seven items, a cost estimate showing $60/month, and an agent economics budget of around $18 in total token spend across all five phases.</p>
<p>The risk register is worth a closer look because it shows the kind of thing that usually gets missed until it becomes a problem.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">R01 - Team has no production experience with ACA secrets</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      Likelihood: High, Impact: High</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      Mitigation: Spike on managed identity + Key Vault in Phase 1</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">R06 - No DR plan for single-region ACA deployment</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      Likelihood: Low, Impact: High</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      Acceptance: Document in runbook as known gap</span><br></div></code></pre></div></div>
<p>R01 turned out to be prescient. More on that later.</p>
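<p>One way to read the Likelihood × Impact matrix is as a numeric score used to rank entries. The sketch below assumes a three-point scale (Low = 1, Medium = 2, High = 3); the post only shows the labels, so the numbers and the <code>Risk</code> class are illustrative.</p>

```python
from dataclasses import dataclass

# Assumed 3-point scale; the post only shows "Low" and "High" labels.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}


@dataclass(frozen=True)
class Risk:
    risk_id: str
    likelihood: str
    impact: str

    @property
    def score(self) -> int:
        """Likelihood multiplied by impact on the assumed scale."""
        return LEVELS[self.likelihood] * LEVELS[self.impact]


risks = [
    Risk("R06", likelihood="Low", impact="High"),
    Risk("R01", likelihood="High", impact="High"),
]

# Highest score first, so the riskiest entry leads the register.
ranked = sorted(risks, key=lambda r: r.score, reverse=True)
for risk in ranked:
    print(risk.risk_id, risk.score)
```

<p>On that scale R01 scores 9 and R06 scores 3, which matches R01 getting an active mitigation while R06 is accepted as a documented gap.</p>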
<p>Gate 0 to 1 passed unanimously. Product Owner approved the business case. Cloud Economics signed off on the $60 monthly estimate with a 20% variance clause. The Quickstart had a green light.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="phase-1-architecture">Phase 1: architecture<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#phase-1-architecture" class="hash-link" aria-label="Direct link to Phase 1: architecture" title="Direct link to Phase 1: architecture" translate="no">​</a></h2>
<p>Phase 1 is the most complex team: six members running in three sequential waves, each wave gated by artifacts from the previous one.</p>
<p><strong>Wave 1 - backend-arch and infra-arch run in parallel.</strong> Neither needs the other's output to start.</p>
<p>The <strong>backend-arch</strong> member uses a <code>quick</code> model and always loads <code>context-map</code>, <code>adr-management</code>, <code>dotnet-backend-patterns</code>, <code>typespec-api-design</code>, <code>observability-monitoring</code>, and <code>azure-well-architected-assessment</code>. Its job is to produce the backend API ADR (REST style, Problem Details, versioning), an async/event strategy ADR, an OpenAPI contract, and an observability design. For LinkSnap, <code>cosmos-db-nosql-patterns</code> loaded conditionally because the word "cosmos" appeared in the project brief.</p>
<p>The <strong>infra-arch</strong> member always loads <code>azure-container-apps</code>, <code>azd-deployment</code>, <code>azure-deployment-preflight</code>, <code>finops</code>, and <code>azure-well-architected-assessment</code>. It produces the hosting platform ADR with a cost estimate, a secret management ADR, a deployment toolchain ADR, and a DR strategy ADR. For LinkSnap, <code>identity-managed-identity</code> loaded conditionally from the <code>key vault, secret</code> match.</p>
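<p>The conditional loading described above can be sketched as a keyword lookup. This assumes a simple case-insensitive substring match and a single shared trigger table; the real matching rules and per-member configuration may differ, and the sample brief text is made up.</p>

```python
# Hypothetical trigger table; only these two matches are described in the post.
CONDITIONAL_SKILLS = {
    "cosmos-db-nosql-patterns": ["cosmos"],
    "identity-managed-identity": ["key vault", "secret"],
}


def resolve_skills(always_load: list[str], text: str) -> list[str]:
    """Return always-loaded skills plus any conditional skill whose keyword appears in text."""
    lowered = text.lower()
    loaded = list(always_load)
    for skill, keywords in CONDITIONAL_SKILLS.items():
        if skill not in loaded and any(keyword in lowered for keyword in keywords):
            loaded.append(skill)
    return loaded


# Made-up brief text, for illustration only.
brief = "A URL shortener on Container Apps with Cosmos DB; secrets live in Key Vault."
print(resolve_skills(["azure-container-apps", "finops"], brief))
```

<p>Keeping the always-loaded skills first means a member's baseline stays the same and the project text only ever adds to it.</p>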
<p><strong>Wave 2 - data-arch and agent-arch</strong> require at least two ADRs to already exist before they run. They self-report missing prerequisites by writing a <code>*-waiting.md</code> file and stopping. The orchestrator's pre-flight scan catches these waiting files at the start of each invocation and runs the blocker first. Both members have <code>trivial_project_skip: true</code> - they sat out for LinkSnap entirely.</p>
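<p>A rough sketch of the orchestrator side of that pre-flight scan, assuming the waiting files all sit in one state directory and are named <code>&lt;member&gt;-waiting.md</code>. The function name and directory layout are assumptions, not the Quickstart's actual code.</p>

```python
from pathlib import Path


def find_waiting_members(state_dir: Path) -> list[str]:
    """Return member names that left a *-waiting.md file in state_dir, sorted by name."""
    suffix = "-waiting.md"
    return sorted(path.name[: -len(suffix)] for path in state_dir.glob(f"*{suffix}"))
```

<p>An orchestrator could call this at the start of each invocation and schedule whatever unblocks the returned members before anything else.</p>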
<p><strong>Wave 3 - security-arch and cloud-economics</strong> need at least four ADRs plus a data model before they start.</p>
<p>The <strong>security-arch</strong> member always loads <code>threat-modelling</code>, <code>api-security-review</code>, and <code>risk-register</code>. It writes a STRIDE threat model, an OWASP security review, and updates the risk register with any high or critical findings. For agentic projects it picks up <code>owasp-agentic</code> and <code>agent-governance-toolkit</code> conditionally.</p>
<p>The <strong>cloud-economics</strong> member always loads <code>finops</code>, <code>cost-optimization</code>, <code>agent-economics</code>, and <code>azure-cost-calculator</code>. It produces the Azure cost estimate and an agent economics phase report. The exit gate requires at least six ADRs before the ARB can convene.</p>
<p>For LinkSnap: three ADRs got written. ACA hosting (ADR-001), Cosmos DB partition key on <code>tenant_id</code> (ADR-002), and managed identity for all Azure resource authentication (ADR-003). The backend-arch member reviewed the ADRs and confirmed the Cosmos partition key choice was correct. The infra-arch member approved the hosting and secret management approach.</p>
<p>Then the security ARB voter flagged a conditional.</p>
<p>The managed identity ADR said "use managed identity" but didn't say how local development would authenticate. The security lead's point was fair: without a documented local auth path, someone would inevitably drop a connection string into a <code>.env</code> file and commit it. Two mandatory revisions came out of it: document the DefaultAzureCredential fallback path, and decide whether Cosmos DB would use a private endpoint or public access with IP firewall.</p>
<table><thead><tr><th>Voter</th><th>Vote</th><th>Why</th></tr></thead><tbody><tr><td>Principal Architect</td><td>Approved</td><td>ADRs are solid</td></tr><tr><td>Security Lead</td><td>Conditional</td><td>Need local auth path and network decision</td></tr><tr><td>Product Owner</td><td>Approved</td><td>Conditions are manageable</td></tr></tbody></table>
<p>The gate passed as CONDITIONAL. Phase 2 couldn't start until both revisions were committed. It took about an hour to resolve. That's the whole point of a gate - catch it before it becomes code.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="phase-2-build">Phase 2: build<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#phase-2-build" class="hash-link" aria-label="Direct link to Phase 2: build" title="Direct link to Phase 2: build" translate="no">​</a></h2>
<p>Phase 2 has six members and runs in two waves. The first wave produces contracts; the second wave builds everything.</p>
<p><strong>Wave 1 - contracts only.</strong> The <strong>backend</strong> member writes the OpenAPI spec to <code>api-contracts/openapi.yaml</code>, a contract summary, and a <code>backend-contracts-ready.md</code> state marker, then stops. The <strong>mcp-server</strong> member (if not skipped) writes tool schemas to <code>tool-contracts/tools.json</code> and stops. Wave 2 can't start until the contracts exist - the frontend and AI engineer members check for their contract files and write a <code>*-waiting.md</code> file if they're missing.</p>
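<p>From a Wave 2 member's point of view, the contract check might look like the sketch below. The waiting-file contents, the state directory, and the function name are assumptions; only the contract path and the <code>*-waiting.md</code> naming come from the description above.</p>

```python
from pathlib import Path


def check_contract(member: str, contract: Path, state_dir: Path) -> bool:
    """Return True if the contract exists; otherwise write a waiting file and return False."""
    if contract.is_file():
        return True
    state_dir.mkdir(parents=True, exist_ok=True)
    waiting = state_dir / f"{member}-waiting.md"
    # Hypothetical file body: record which prerequisite is missing.
    waiting.write_text(
        f"# {member} waiting\n\nMissing prerequisite: {contract.as_posix()}\n",
        encoding="utf-8",
    )
    return False
```

<p>A frontend member would call something like <code>check_contract("frontend", Path("api-contracts/openapi.yaml"), state_dir)</code> and stop if it returns <code>False</code>.</p>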
<p>For LinkSnap, both <code>mcp-server</code> and <code>frontend</code> had <code>trivial_project_skip: true</code>, so Wave 1 was just the backend producing its OpenAPI contract.</p>
<p><strong>Wave 2 - full implementation.</strong> Two members ran for LinkSnap; the rest were skipped:</p>
<p>The <strong>backend</strong> member uses a <code>deep</code> category model - the highest compute tier in the team. Its always_load includes <code>dotnet-backend-patterns</code>, <code>observability-monitoring</code>, and <code>code-review</code>. The <code>identity-managed-identity</code> and <code>azure-role-selector</code> skills loaded conditionally from the managed identity ADR. It built the FastAPI app with three endpoints (<code>/health</code>, <code>POST /links</code>, <code>GET /links/{short_code}</code>), managed identity auth to Cosmos DB, and a container image pinned to a specific Python 3.12-slim digest. It then ran <code>code-review</code> on its own output before declaring done.</p>
<p>The <strong>infra</strong> member always loads <code>azd-deployment</code>, <code>azure-container-apps</code>, <code>azure-defaults</code>, <code>code-review</code>, and <code>azure-deployment-preflight</code>. The <code>identity-managed-identity</code> and <code>azure-role-selector</code> skills loaded from the managed identity and RBAC ADR keywords. It built the Bicep template: Container Apps environment, Cosmos DB in serverless mode, ACR registry, and the RBAC assignment wiring the ACA managed identity to Cosmos read/write. No connection strings anywhere. It also ran <code>code-review</code> on its own output.</p>
<p>The <strong>data-engineer</strong> and <strong>ai-engineer</strong> members both had <code>trivial_project_skip: true</code> for a stateless API demo.</p>
<p>Code review caught a hardcoded ACR URL in the CI pipeline. Security scan flagged a base image with no SHA pin - fixed before merge. Both issues got resolved in the same PR. That's the advantage of having agents that review as you build rather than finding these things in a production incident post-mortem.</p>
<p>Gate 2 to 3 passed clean. Code matched the ADRs, tests passed, CI pipeline was green.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="phase-3-validate">Phase 3: validate<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#phase-3-validate" class="hash-link" aria-label="Direct link to Phase 3: validate" title="Direct link to Phase 3: validate" translate="no">​</a></h2>
<p>Phase 3 has three members, all running without trivial project skips.</p>
<p>The <strong>integration</strong> member always loads <code>code-review</code> and <code>azure-well-architected-assessment</code>. It verifies integration points match the ADRs and runs a WAF assessment. For Cosmos-based projects it picks up <code>cosmos-db-nosql-patterns</code> conditionally. Results go to <code>evidence/phase3/integration/test-results.md</code>.</p>
<p>The <strong>qa</strong> member has no always_load skills - it picks everything up conditionally. Performance testing? <code>azure-performance-resilience-validation</code>. E2E tests? <code>playwright-testing</code>. Deployment pipeline? <code>github-actions-ci-cd</code>. For LinkSnap the load was light: just functional test verification. Results go to <code>evidence/phase3/qa/test-results.md</code>.</p>
<p>The <strong>security-engineering</strong> member always loads <code>threat-modelling</code> and <code>api-security-review</code>. It reads the threat model from Phase 1, implements all mitigations, checks for committed secrets, and addresses OWASP Top 10. For agentic projects it picks up <code>owasp-agentic</code> conditionally. Results go to <code>evidence/phase3/security/hardening-evidence.md</code>.</p>
<p>The exit gate requires all three evidence files to exist before the ARB convenes.</p>
<p>For LinkSnap: load test ran with 50 concurrent users against a staging revision. Average response time was 180ms for writes and 45ms for reads. ACA scaled from one to two replicas during the test. Cosmos DB serverless handled the load without throttling.</p>
<p>Trivy scan found zero critical or high vulnerabilities. pip-audit found zero known advisories in the dependency tree. The two accepted gaps from the threat model - no auth and no rate limiting - were documented in the runbook as known v1 limitations.</p>
<p>Security Lead approved. Product Owner approved. Cloud Economics confirmed no cost overrun.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="phase-4-production">Phase 4: production<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#phase-4-production" class="hash-link" aria-label="Direct link to Phase 4: production" title="Direct link to Phase 4: production" translate="no">​</a></h2>
<p>Phase 4 has three members. Entry requires the Phase 3 evidence directory to exist.</p>
<p>The <strong>reliability</strong> member always loads <code>observability-monitoring</code>. It picks up <code>dr-design</code>, <code>chaos-engineering</code>, <code>azure-sre-agent</code>, and <code>azure-safe-deployment-practices</code> conditionally based on ADR keywords. It writes SLOs to <code>reliability/slos.md</code> and produces dashboards, alerts, and DR validation evidence.</p>
<p>The <strong>runbooks</strong> member uses a <code>writing</code> category model and always loads <code>docs-style</code>. It picks up <code>azure-troubleshooting</code>, <code>release-notes</code>, <code>post-mortem</code>, and <code>adr-management</code> conditionally. It also consumes writing-style and markdown instruction files from <code>load_instructions</code>, so the runbook prose matches the project's documentation standards. Outputs: deployment runbook and incident response guide.</p>
<p>The <strong>cloud-economics</strong> member always loads <code>finops</code>, <code>cost-optimization</code>, <code>agent-economics</code>, and <code>azure-cost-calculator</code>. It writes the final variance report comparing actuals against the Phase 1 estimates, and a phase-4 agent economics summary.</p>
<p>The exit gate requires SLOs, the deployment runbook, and the incident response guide to all exist before the final ARB convenes.</p>
<p>For LinkSnap: the runbook covers health checks, log queries, scaling commands, escalation contacts, and the known gaps. DR is documented as a single-region best-effort arrangement with manual redeployment as the recovery path. Realistic for a Quickstart. Wouldn't fly for a production customer workload, but it's honest about what it is.</p>
<p>The final gate passed unanimously. Five phases, five gates, one shipped application.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-arb-team">The ARB team<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#the-arb-team" class="hash-link" aria-label="Direct link to The ARB team" title="Direct link to The ARB team" translate="no">​</a></h2>
<p>Every gate invokes the same <code>arb-review</code> team. It has four members.</p>
<p><strong>Janus</strong> is the chair. Janus always loads <code>governance-gate</code>, <code>azure-well-architected-assessment</code>, <code>pressure-test</code>, and <code>risk-register</code>. Janus runs the evidence inventory using the governance-gate checklist, writes a WAF assessment, assembles the gate checklist, collects the three votes, and tallies the outcome. The quorum rule: 4/5 = APPROVED, 3/5 = CONDITIONAL (mandatory revisions), 2+/5 = REJECTED (phase must be redone). If the ARB deadlocks, Janus writes <code>escalated.md</code> and surfaces to a human arbiter.</p>
<p>The three voters are <strong>product-owner</strong>, <strong>security</strong>, and <strong>cloud-economics</strong> - each reading their relevant artifacts and writing a vote file to <code>arb/{phase}-gate/votes/</code>.</p>
<p>The product-owner voter always loads <code>pressure-test</code> and votes on alignment with business intent, success metrics, budget, and scope.</p>
<p>The security voter always loads <code>threat-modelling</code> and votes on whether threats are mitigated or accepted, OWASP findings addressed, and compliance met. For agentic projects it picks up <code>owasp-agentic</code>, <code>responsible-ai-operating-model</code>, and <code>agent-governance-toolkit</code> conditionally.</p>
<p>The cloud-economics voter always loads <code>cost-optimization</code>, <code>agent-economics</code>, and <code>azure-cost-calculator</code> and votes on whether cost is within budget, agent economics are within baseline, and variance is acceptable.</p>
<p>Every vote file, every checklist, and every outcome is committed to <code>.sisyphus/knowledge/arb/{phase}-gate/</code>. The audit trail is the repository.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-the-gates-caught">What the gates caught<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#what-the-gates-caught" class="hash-link" aria-label="Direct link to What the gates caught" title="Direct link to What the gates caught" translate="no">​</a></h2>
<p>The CONDITIONAL at Gate 1 was the most valuable outcome of the entire process. The security ARB voter's two revisions forced a decision that would have been painful to retrofit. The local dev auth path is the kind of thing that gets deferred until someone commits a key. The network perimeter decision is the kind of thing that gets deferred until a compliance audit finds a public Cosmos endpoint.</p>
<p>The earlier you catch these, the cheaper they are to fix. That's not a new insight. What's new is that the ARB gate made it structural rather than depending on someone happening to ask the right question in a meeting.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-the-economics-looked-like">What the economics looked like<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#what-the-economics-looked-like" class="hash-link" aria-label="Direct link to What the economics looked like" title="Direct link to What the economics looked like" translate="no">​</a></h2>
<p>One thing I tracked across all five phases was the agent token budget. The <code>agent-economics</code> skill, loaded by both the Phase 0 product-owner and the Phase 1 cloud-economics member, allocated 510,000 tokens across the lifecycle, split by model tier. Phase 1 (architecture) used the heaviest models (Opus for lead, Sonnet for members). Phase 2 (build) had the highest raw token count because of all the code generation and review loops - the backend member runs on the <code>deep</code> category, which is the costliest tier.</p>
<p>The actuals came in under budget. Phase 0 used 14,200 tokens against a 20,000 budget. Phase 1 used 72,000 against 80,000. Phase 2 was the tightest at 185,000 against 200,000 - the code review loops chewed up more iterations than expected. The total across all five phases was about 460,000 tokens, which at current API pricing works out to roughly $16. The cheapest governance process I've ever run.</p>
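<p>The arithmetic behind those figures, as a quick worked example. It uses only the numbers quoted above; the blended per-token rate is derived from the rounded $16 total, so treat it as an approximation rather than a price list.</p>

```python
# Figures quoted in the post: (actual tokens, budgeted tokens) per phase.
phases = {
    "Phase 0": (14_200, 20_000),
    "Phase 1": (72_000, 80_000),
    "Phase 2": (185_000, 200_000),
}

for name, (actual, budget) in phases.items():
    print(f"{name}: {actual / budget:.1%} of budget used")

total_actual, total_budget = 460_000, 510_000  # approximate lifecycle totals from the post
total_cost_usd = 16.0  # "roughly $16"

utilisation = total_actual / total_budget
blended_usd_per_million = total_cost_usd / total_actual * 1_000_000

print(f"Lifecycle: {utilisation:.1%} of budget used")
print(f"Blended rate: ${blended_usd_per_million:.2f} per million tokens")
```

<p>Phase 2 comes out at 92.5% of its budget against 71.0% for Phase 0, which is what "tightest" means here. Across the lifecycle, utilisation is about 90.2%, at a blended rate of roughly $34.78 per million tokens.</p>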
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="lessons-learned">Lessons learned<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#lessons-learned" class="hash-link" aria-label="Direct link to Lessons learned" title="Direct link to Lessons learned" translate="no">​</a></h2>
<p>Three things stood out from running this end to end.</p>
<p>First, the conditional gate was the most valuable outcome. If every gate had passed unanimously, I'd be writing a different post - one about how governance is easy when nothing goes wrong. The security ARB voter's conditional at Gate 1 proved the system works. It caught something real, produced a concrete revision list, and the team resolved it in an hour.</p>
<p>Second, the skill-loading design means the teams adapt to the project. For LinkSnap, four of the six Phase 2 members were skipped entirely via trivial project mode, and the remaining two picked up exactly the skills they needed (Cosmos patterns, managed identity) from the ADR keyword matching. A more complex project with AI agents, a frontend, and a data layer would activate all six members and load the relevant skills for each domain. The same config file, different capability surface.</p>
<p>Third, the artifact trail is the killer feature. Every approval, every condition, every rejection rationale is committed as a Markdown file. If someone asks six months from now who approved the Cosmos partition key decision, the answer is in <code>.sisyphus/knowledge/adrs/ADR-002-cosmos-partition-key.md</code> with a signed-off vote from the Principal Architect. No digging through Teams messages or email threads.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-quickstart">The Quickstart<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#the-quickstart" class="hash-link" aria-label="Direct link to The Quickstart" title="Direct link to The Quickstart" translate="no">​</a></h2>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>Quick link</div><div class="admonitionContent_BuS1"><p>Browse the full implementation in the <a href="https://github.com/lukemurraynz/omo-teams-quickstart" target="_blank" rel="noopener noreferrer" class="">OMO Teams Quickstart repository</a>.</p></div></div>
<p>The full Quickstart repo is at <code>lukemurraynz/omo-teams-quickstart</code> if you want to see all the artifacts. Every ADR, every vote file, every gate outcome, every risk register entry, and the full application code. The README walks through how to deploy it yourself.</p>
<p>If you're running AI agents on Azure projects and wondering where the governance is, this is a pattern you can adapt. Start with the five-phase model, pick your voters, and run your first gate.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references">References<a href="https://luke.geek.nz/misc/omo-teams-arb-gates/#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://opencode.ai/" target="_blank" rel="noopener noreferrer" class="">OpenCode</a></li>
<li class=""><a href="https://omo.dev/" target="_blank" rel="noopener noreferrer" class="">Oh My OpenAgent</a></li>
<li class=""><a href="https://github.com/lukemurraynz/omo-teams-quickstart" target="_blank" rel="noopener noreferrer" class="">OMO Teams Quickstart repo</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/container-apps/?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Container Apps documentation</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/cosmos-db/serverless/?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Cosmos DB for NoSQL serverless</a></li>
</ul>]]></content:encoded>
            <category>Azure</category>
            <category>Misc</category>
        </item>
        <item>
            <title><![CDATA[Agentic Operations Lakehouse: Drasi & Microsoft Framework]]></title>
            <link>https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/</link>
            <guid>https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/</guid>
            <pubDate>Wed, 27 May 2026 09:41:53 GMT</pubDate>
            <description><![CDATA[Explore how to enhance hospital operations with an Agentic Operations Lakehouse using Drasi and Microsoft Agent Framework for efficient risk management.]]></description>
            <content:encoded><![CDATA[<p>Hospital operations run on a web of concurrent signals. Theatre lists change throughout the day. PACU bays fill and empty. Sterile tray queues build up. Discharge blockers cascade into bed shortages. None of these individually defines a risk <em>(it's the combination that matters)</em>, and the window to act is often under an hour.</p>
<p>A traditional response is a coordinator checking spreadsheets, chasing phone calls, and making judgement calls with incomplete information. A common way to apply AI to this kind of scenario is to route the problem through a chat assistant and hope the prompt captures enough context. In this kind of operational workflow, that is not enough on its own: the system needs an audit trail, grounding in historical outcomes, and a clear boundary between what it can decide autonomously and what needs a human to approve.</p>
<p>I wanted to see if a different approach was feasible:</p>
<blockquote>
<p>One where AI agents can produce evidence-backed recommendations grounded in historical patterns, high-impact actions always require human approval, every decision is recorded for audit and replay, and the detection logic is deterministic and testable <em>(not buried in a prompt)</em>.</p>
</blockquote>
<p>This post covers the Proof of Technology I built to validate an <strong>Agentic Operations Lakehouse</strong> style pattern <em>(and to be frank, it was a good chance for some fun, tying these technologies together)</em>.</p>
<!-- -->
<p>Three Azure/hero technologies each own a distinct part of the problem:</p>
<blockquote>
<p><strong><a href="https://drasi.io/" target="_blank" rel="noopener noreferrer" class="">Drasi</a></strong> for live risk detection,
<strong><a href="https://learn.microsoft.com/agent-framework/overview/?pivots=programming-language-csharp&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Agent Framework</a></strong> for governed agent reasoning, and
<strong><a href="https://learn.microsoft.com/fabric/fundamentals/microsoft-fabric-overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Fabric</a></strong> for operational memory.</p>
</blockquote>
<p>To do this, we will use a healthcare scenario built entirely on synthetic data <em>(no real patient data, clinical records, or live hospital systems at any point)</em>, and the full source is in the <a href="https://github.com/lukemurraynz/AgenticLakehousePoT" target="_blank" rel="noopener noreferrer" class="">AgenticLakehousePoT repository</a>.</p>
<blockquote>
<p><strong>The short version.</strong> Drasi runs continuous queries that detect risk signals deterministically <em>(in my scenario it's multiple tables in Azure Postgres and Event Hub sources, but it could be from multiple different sources)</em>. Microsoft Agent Framework runs a 14-stage workflow <em>(5 LLM calls, 9 deterministic stages)</em> that reasons about the event and produces a recommendation. In the <a href="https://github.com/lukemurraynz/AgenticLakehousePoT" target="_blank" rel="noopener noreferrer" class="">source implementation</a>, the LLM cannot write directly to storage or bypass the action routing table, and every decision is recorded in Fabric for audit. High-impact actions require human approval.</p>
</blockquote>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>The full implementation is in the <a href="https://github.com/lukemurraynz/AgenticLakehousePoT" target="_blank" rel="noopener noreferrer" class="">AgenticLakehousePoT repository</a> on GitHub. Everything in this post deploys from that repo with <code>azd up</code>. Feel free to fork, review etc.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architecture-in-one-sentence">The architecture in one sentence<a href="https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/#the-architecture-in-one-sentence" class="hash-link" aria-label="Direct link to The architecture in one sentence" title="Direct link to The architecture in one sentence" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" alt="Agentic Operations Lakehouse - System Architecture" src="https://luke.geek.nz/assets/images/AgenticOperationsLakehouse_SystemArchitecture-6308f7ef2fd1443729c8d97dd9917e24.png" width="1525" height="982" class="img_ev3q">
<em>Full system architecture: Drasi detects risks from PostgreSQL and Event Hubs; the 14-stage MAF workflow reasons over them using Microsoft Foundry agents; Microsoft Fabric stores every outcome; the React operator portal surfaces recommendations and approvals to role-selected operators.</em></p>
<p><img decoding="async" loading="lazy" alt="System overview walkthrough" src="https://luke.geek.nz/assets/images/SystemOverview-171195a9c0eaa9f47c7d812fe3e4b7bd.gif" width="1689" height="977" class="img_ev3q">
<em>Live walkthrough of the operator portal: risk events detected by Drasi appearing as recommendations, with role-based access and action approval flow visible in the UI.</em></p>
<p>Drasi detects that a risk exists. Microsoft Agent Framework reasons about it and produces a recommendation. Fabric stores the evidence. Humans approve or reject. The workflow records everything. It is straightforward when you break it down, but the interesting part is how these three pieces fit together (and what each one is not allowed to do).</p>
<p><img decoding="async" loading="lazy" alt="End-to-end event flow - signal to outcome" src="https://luke.geek.nz/assets/images/EndtoEndFlow-SignalToOutcome-3a99a0a129cf3ae83c919e7462fd21a9.png" width="1555" height="974" class="img_ev3q">
<em>Temporal ordering of the full pipeline: synthetic signals trigger Drasi detection, which kicks off the MAF workflow, which reads and writes Fabric context, ultimately serving the operator portal. Self-messages show internal processing; dashed arrows are replies.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-split-that-actually-matters">The split that actually matters<a href="https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/#the-split-that-actually-matters" class="hash-link" aria-label="Direct link to The split that actually matters" title="Direct link to The split that actually matters" translate="no">​</a></h2>
<p>The design principle that sets this apart from "put everything in a prompt" is the agentic/deterministic boundary. In this implementation, there are fourteen workflow stages: five invoke a Foundry LLM agent, the other nine are deterministic <em>(schema validation, database queries, KQL reads, static routing lookups, API calls, and Fabric writes)</em>.</p>
<p><img decoding="async" loading="lazy" alt="14-stage RiskDecisionWorkflow - agentic and deterministic stages annotated" src="https://luke.geek.nz/assets/images/RiskDecisionWorkflow_14StageExecution-e444e368af461e9bc5ad147fb0e53c39.png" width="845" height="1069" class="img_ev3q">
<em>All 14 stages in execution order. Navy fill = LLM-backed (Azure AI Foundry agent). White fill = fully deterministic. Stage 7 (ApplySafetyPolicy) runs both layers.</em></p>
<p>The LLM handles the parts that need contextual reasoning <em>(classifying a risk, routing to the right specialist, producing a recommendation for a specific operator role, checking whether a risk is still live)</em>. The deterministic code handles the parts that need repeatability and safety enforcement <em>(validation, state queries, the action routing table, SLA policy resolution, and Fabric writes)</em>.</p>
<p>If an agent returns unexpected output, the deterministic stages either catch it <em>(the <code>decisionDrivers</code> validation)</em>, ignore it <em>(the routing lookup blocks unknown actions)</em>, or record it for audit without acting on it. The LLM cannot bypass the routing table or write to Fabric directly.</p>
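<p>To make the routing-table idea concrete, here is a small sketch of an allowlist lookup. It is an illustration in Python, not the repository's code: the action names, the <code>requires_approval</code> flag, and the status values are all hypothetical.</p>

```python
from dataclasses import dataclass

# Hypothetical action names and flags; the real routing table lives in the repository.
ROUTING_TABLE = {
    "notify_bed_manager": {"requires_approval": False},
    "delay_theatre_case": {"requires_approval": True},
}


@dataclass(frozen=True)
class RoutedAction:
    action: str
    status: str  # "auto", "pending_approval", or "blocked"


def route(proposed_action: str) -> RoutedAction:
    """Look up an LLM-proposed action; anything not in the table is blocked."""
    entry = ROUTING_TABLE.get(proposed_action)
    if entry is None:
        return RoutedAction(proposed_action, "blocked")
    if entry["requires_approval"]:
        return RoutedAction(proposed_action, "pending_approval")
    return RoutedAction(proposed_action, "auto")
```

<p>Because the lookup is a plain dictionary, an agent that invents an action name gets a <code>blocked</code> result it cannot talk its way around, and the blocked proposal can still be recorded for audit.</p>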
<p><img decoding="async" loading="lazy" alt="Component boundaries - what each system owns and does not own" src="https://luke.geek.nz/assets/images/ComponentBoundaries-0bf4be2b690eebaedf6cdd8e6712c95c.png" width="1435" height="500" class="img_ev3q">
<em>Each component's explicit boundary: green = owned responsibility, grey = explicitly excluded. When something goes wrong, you know exactly which component to investigate.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="drasi-keeps-detection-honest">Drasi keeps detection honest<a href="https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/#drasi-keeps-detection-honest" class="hash-link" aria-label="Direct link to Drasi keeps detection honest" title="Direct link to Drasi keeps detection honest" translate="no">​</a></h2>
<p>First design question: who decides that a risk exists? The answer is not the AI agent. It is <a href="https://drasi.io/" target="_blank" rel="noopener noreferrer" class="">Drasi</a>.</p>
<p>Drasi is an open-source project from Microsoft, and a Cloud Native Computing Foundation (CNCF) Sandbox project, that runs continuous queries over live operational state. When source state changes (a new theatre entry, a PACU bay becoming unavailable, a discharge flag being set), Drasi re-evaluates its queries and emits structured change events. I wrote one query per risk type. Each query defines the signal combination that constitutes a risk. The bed capacity query, for example:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> v1</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> ContinuousQuery</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">name</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> healthcare</span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain">bed</span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain">capacity</span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain">risk</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">spec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">mode</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> query</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">queryLanguage</span><span 
class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> Cypher</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">sources</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">subscriptions</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> </span><span class="token key atrule">id</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> aol</span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain">operational</span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain">postgres</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        </span><span class="token key atrule">nodes</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">          </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> </span><span class="token key atrule">sourceLabel</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> surgical_cases</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">          </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token 
plain"> </span><span class="token key atrule">sourceLabel</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> ward_bed_forecasts</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">query</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">&gt;</span><span class="token scalar string" style="color:rgb(255, 121, 198)"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">    MATCH (c:surgical_cases)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">    MATCH (w:ward_bed_forecasts)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">    WHERE</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      c.ScenarioRunId = w.ScenarioRunId</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      AND c.CorrelationId = w.CorrelationId</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      AND w.StateValue = 'blocked'</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">    RETURN</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      c.Id AS workItemId,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      'bed-capacity-risk' AS 
riskType,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      'high' AS riskLevel,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      'Post-op bed forecast indicates blocked capacity' AS observedFact,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      'human-approval-required' AS approvalRequirement</span><br></div></code></pre></div></div>
<p>It watches two PostgreSQL tables (<code>surgical_cases</code>, <code>ward_bed_forecasts</code>), matches them on a correlation ID, and when a ward forecast flips to <code>blocked</code> it emits a structured risk event. The output feeds directly into the MAF <em>(Microsoft Agent Framework)</em> workflow as the <code>observedFacts</code> you see in the code samples.</p>
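<p>To make the hand-off concrete, here is a minimal sketch of how a worker might parse one result row from that query. The <code>RiskEvent</code> record and <code>RiskEventParser</code> class are hypothetical names, not the repository's types, and the flat JSON shape is an assumption: the property names simply mirror the aliases in the query's <code>RETURN</code> clause, and <code>workItemId</code> is assumed to arrive as a string.</p>

```csharp
using System;
using System.Text.Json;

// Hypothetical DTO: property names mirror the aliases in the query's RETURN clause.
public sealed record RiskEvent(
    string WorkItemId,
    string RiskType,
    string RiskLevel,
    string ObservedFact,
    string ApprovalRequirement);

public static class RiskEventParser
{
    // Web defaults map camelCase JSON names onto the PascalCase record properties.
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web);

    // Parses one result row and rejects rows that lack the identifying fields.
    public static RiskEvent Parse(string json)
    {
        RiskEvent? evt = JsonSerializer.Deserialize<RiskEvent>(json, Options);
        if (evt is null
            || string.IsNullOrWhiteSpace(evt.WorkItemId)
            || string.IsNullOrWhiteSpace(evt.RiskType))
        {
            throw new InvalidOperationException("Risk event is missing workItemId or riskType.");
        }
        return evt;
    }
}
```

<p>Rejecting incomplete rows at the boundary keeps the rest of the workflow free of null checks, which fits the deterministic side of the split described earlier.</p>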
<p><img decoding="async" loading="lazy" alt="Drasi continuous queries running on AKS" src="https://luke.geek.nz/assets/images/Drasi_AKS_Cluster_List-30191c1492c8550ab4e9c19144453b86.gif" width="1005" height="438" class="img_ev3q">
<em>Drasi continuous query containers deployed on Azure Kubernetes Service, listing running pods across the cluster namespace.</em></p>
<p>Detection is testable (a Drasi query is a declarative expression you can write unit tests against and replay historical events through - if detection logic were inside an agent prompt, testing it would mean evaluating LLM outputs). Detection is observable (Drasi emits structured events with correlation IDs, observed facts, and lifecycle state. When an operator asks "why was this risk flagged?", the answer comes from structured output, not from reconstructing what a model was thinking). And detection is separated from recommendation (if the recommendation is wrong, you can tell whether the agent misread the situation or simply gave bad advice). Drasi hands the MAF workflow a structured, verifiable risk event, and that separation is the important design boundary.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-workflow-is-the-product-not-the-chat">The workflow is the product, not the chat<a href="https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/#the-workflow-is-the-product-not-the-chat" class="hash-link" aria-label="Direct link to The workflow is the product, not the chat" title="Direct link to The workflow is the product, not the chat" translate="no">​</a></h2>
<p>Once Drasi emits a risk event, the MAF workflow runs. For this pattern, a single LLM call is the wrong boundary because waiting for human approval, checkpointing state for restart, and separating contextual reasoning from structural routing need explicit workflow state. The 14-stage workflow makes each concern an explicit, auditable stage with a clear input, output, and responsibility.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="role-aware-recommendations">Role-aware recommendations<a href="https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/#role-aware-recommendations" class="hash-link" aria-label="Direct link to Role-aware recommendations" title="Direct link to Role-aware recommendations" translate="no">​</a></h3>
<p>The bed manager wanted discharge-blocker advice. The theatre coordinator wanted case-sequencing language. I had to map the risk type to a role before the recommendation prompt made sense to its reader. The fix was a lookup that derives the primary operator role from the risk type and injects it into the context:</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">private static (string Role, string Context) MapRoleFromRiskType(string? riskType) =&gt; riskType switch {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    "bed-capacity-risk" or "post-op-discharge-coordination-risk" =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        ("Bed Manager",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">         "Responsible for ward capacity and patient discharge flow..."),</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    "pacu-throughput-risk" or "theatre-turnover-risk" =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        ("Theatre Coordinator",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">         "Responsible for theatre list execution and perioperative flow..."),</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    _ =&gt; ("Operational Manager", "...")</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">};</span><br></div></code></pre></div></div>
<p>Before I added this, the <code>operatorGuidance</code> field read like generic advice. After adding the role name and one sentence of role context, it started using domain vocabulary. A single string addition to the prompt context.</p>
<p>The prompts themselves are versioned artefacts stored in <a href="https://learn.microsoft.com/azure/azure-app-configuration/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure App Configuration</a>, not hardcoded in the worker. They're loaded at startup by <code>PromptLibrary.LoadFromAppConfigurationAsync</code> with a version label and cached with a configurable TTL so the 14-stage workflow doesn't hit App Configuration on every stage call. Updating an agent's instructions means updating an App Configuration key-value pair, not redeploying the service.</p>
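<p>The caching idea can be sketched in a few lines. This is not the repository's <code>PromptLibrary</code>: <code>TtlPromptCache</code> is a hypothetical class, and the loader delegate stands in for whatever call reads the key-value pair from App Configuration. The injectable clock is also an assumption, added only to make expiry easy to test.</p>

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical TTL cache; the loader delegate stands in for the App Configuration read.
public sealed class TtlPromptCache
{
    private readonly ConcurrentDictionary<string, (string Value, DateTimeOffset ExpiresAt)> _entries = new();
    private readonly Func<string, Task<string>> _loader;
    private readonly TimeSpan _ttl;
    private readonly Func<DateTimeOffset> _clock;

    public TtlPromptCache(
        Func<string, Task<string>> loader,
        TimeSpan ttl,
        Func<DateTimeOffset>? clock = null)
    {
        _loader = loader;
        _ttl = ttl;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    public async Task<string> GetAsync(string key)
    {
        DateTimeOffset now = _clock();
        if (_entries.TryGetValue(key, out var entry) && entry.ExpiresAt > now)
        {
            return entry.Value; // Still fresh: no remote call.
        }

        // Missing or expired: reload and restart the TTL window.
        string value = await _loader(key);
        _entries[key] = (value, now + _ttl);
        return value;
    }
}
```

<p>Two concurrent callers can both miss and both reload the same key. For prompt text that is harmless, since the last write wins with an identical value, but a stricter version would need per-key locking.</p>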
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="handling-transient-foundry-failures">Handling transient Foundry failures<a href="https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/#handling-transient-foundry-failures" class="hash-link" aria-label="Direct link to Handling transient Foundry failures" title="Direct link to Handling transient Foundry failures" translate="no">​</a></h3>
<p>A number of the early PACU throughput risk runs hit an <code>incomplete</code> status from the Foundry agent on the first attempt (a transient infrastructure failure with no error detail). The initial code threw immediately, which restarted the entire 14-stage workflow from scratch. The fix was an internal retry within <code>RunAgentAsync</code> on <code>incomplete</code> status, before escalating to the workflow-level retry:</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">const int MaxAttempts = 2;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">for (int attempt = 1; attempt &lt;= MaxAttempts; attempt++) {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    PersistentAgentThread thread = await agentsClient.Threads.CreateThreadAsync(...);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    try {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        if (run.Status != RunStatus.Completed) {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            bool isIncomplete = string.Equals(run.Status.ToString(), "incomplete", ...);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            if (isIncomplete &amp;&amp; attempt &lt; MaxAttempts) { continue; }</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            throw new InvalidOperationException(...);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        }</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        return result;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    }</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    finally {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        await 
agentsClient.Threads.DeleteThreadAsync(thread.Id, ...);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    }</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></div></code></pre></div></div>
<p>The workflow session stays alive, operators see no stale-progress UX, and the workflow-level retry remains as a safety net for other failure modes.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="fabric-as-operational-memory">Fabric as operational memory<a href="https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/#fabric-as-operational-memory" class="hash-link" aria-label="Direct link to Fabric as operational memory" title="Direct link to Fabric as operational memory" translate="no">​</a></h2>
<p>Every record the workflow produces is written to Fabric. Every record the next run needs is read from Fabric. That bidirectional relationship is what makes recommendations grounded rather than speculative. When the agent generates a recommendation, it can see how many times this risk type has occurred historically, what the typical escalation latency looks like, and which actions have been effective. The Fabric context is fed directly into the generation prompt as structured data.</p>
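<p>As a rough illustration of "structured data in the prompt", the sketch below serialises a small history record to JSON and appends it to the generation context. The <code>HistoricalContext</code> record, its field names, the minutes unit, and <code>PromptContextBuilder</code> are all hypothetical: the three fields are taken from the description above, not from the repository's schema.</p>

```csharp
using System;
using System.Text.Json;

// Hypothetical shape for the history read back from Fabric; fields follow the prose above.
public sealed record HistoricalContext(
    int OccurrenceCount,
    double TypicalEscalationLatencyMinutes,
    string[] EffectiveActions);

public static class PromptContextBuilder
{
    // Web defaults emit camelCase property names.
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web);

    // Builds the context text that would be appended to the generation prompt.
    public static string Build(string riskType, HistoricalContext history)
    {
        string json = JsonSerializer.Serialize(history, Options);
        return $"Risk type: {riskType}\nHistorical context (JSON): {json}";
    }
}
```

<p>Passing the history as JSON rather than free text keeps the grounding data inspectable: the same payload can be logged alongside the recommendation it produced.</p>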
<p><img decoding="async" loading="lazy" alt="Fabric workspace containing Eventhouse and KQL data sources" src="https://luke.geek.nz/assets/images/FabricWorkspaceOverview-dac800da822a231cd4ac29d6c7c3fb95.gif" width="1897" height="862" class="img_ev3q">
<em>The Fabric workspace hosting the Eventhouse that stores risk events, recommendations, and approval records (the operational memory layer).</em></p>
<p>The operational memory lives in a Fabric Eventhouse. Risk events stream in from Drasi, the MAF workflow writes recommendation records directly, and the operator portal reads the current state through KQL queries. The Eventhouse schema was designed around three core tables: risk events, recommendation records, and action outcomes.</p>
<p>The KQL schema was the first thing I built, and once frozen it stayed stable through the rest of the PoT <em>(Proof of Technology)</em>. Early on I changed it three times, and each change required updating the write path, read path, and API contract simultaneously, so I froze it and worked around the constraints instead.</p>
<p><img decoding="async" loading="lazy" alt="Risk event lifecycle data summarised over 7 days in Fabric Eventhouse" src="https://luke.geek.nz/assets/images/Fabric_Eventhouse_RiskEventLifecycle_SummaryOVer7Days-bb9a57b6460ae5258d1b15289222dace.png" width="2158" height="861" class="img_ev3q">
<em>KQL query showing risk event lifecycle data aggregated over a seven-day window. Answers the question "how many risks did I detect, and what happened to each one?"</em></p>
<p>This query is the one I reached for most when testing. It tells you at a glance whether the pipeline is healthy - whether new risk events are arriving, whether recommendations are being produced, and whether they're reaching the operator portal.</p>
<p><img decoding="async" loading="lazy" alt="Recommendation records query - top 100 results in Eventhouse" src="https://luke.geek.nz/assets/images/Fabric_EventHouse_RecommendationsRecords_QueryTop100-a0e942e7dc7cee45bad829cf827cef4c.png" width="2153" height="934" class="img_ev3q">
<em>Querying the top 100 recommendation records in Eventhouse. Each record captures the full decision chain: risk event, agent recommendation, safety policy evaluation, and human approval outcome.</em></p>
<p>The recommendation records are the audit trail. Every decision (agentic and human) is captured in a single row you can trace from the risk event through to the outcome. This was important to me from the start - if someone asks "what happened with this risk?", there is one place to look.</p>
<p><img decoding="async" loading="lazy" alt="System dashboard showing Fabric insights view within the operator portal" src="https://luke.geek.nz/assets/images/SystemOverview_FabricInsightsView-e3f96f46dc035d358cb4909db591cc1a.png" width="1892" height="636" class="img_ev3q">
<em>Fabric insights surfaced in the operator portal, giving operators real-time visibility into the health and throughput of the operational memory layer.</em></p>
<p>A design decision that surprised me: the frontend polls for recommendations and gets nothing until stage 13 (seven stages after the recommendation is generated at stage 6). The recommendation lives in workflow state from stage 6. Fabric only sees it after stage 13, when the complete record (including approval decisions and policy evaluations) is written. This is by design, but operators will see "checking again" for 30-60 seconds. I had to make this explicit in the frontend (Fluent on React).</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="three-layers-of-safety">Three layers of safety<a href="https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/#three-layers-of-safety" class="hash-link" aria-label="Direct link to Three layers of safety" title="Direct link to Three layers of safety" translate="no">​</a></h2>
<p>Every recommended action is classified before it reaches the operator portal:</p>
<p><img decoding="async" loading="lazy" alt="Safety Model - Three-Layer Action Classification" src="https://luke.geek.nz/assets/images/ThreeLayerSafetyModel-637de320d30297597ce1905932b1108a.png" width="1328" height="947" class="img_ev3q">
<em>Layer 1: deterministic routing table. Layer 2: LLM policy evaluation (parallel). Layer 3: deterministic post-approval gate. Only Layer 2 invokes a Foundry agent.</em></p>
<p><strong>Layer 1 - deterministic routing.</strong> <code>ActionRoutingSteps.Route()</code> classifies each action against hardcoded lookup sets: <code>SafeActions</code>, <code>ApprovalRequiredActions</code>, and a blocked bucket. This runs before any LLM evaluation. An unknown action goes to blocked, regardless of what the agent recommended.</p>
<p><strong>Layer 2 - LLM policy evaluation.</strong> <code>EvaluateSafetyPolicyAsync</code> calls the Foundry agent for each action in parallel to produce contextual rationale. The LLM provides the explanation; the routing table provides the enforced classification.</p>
<p><strong>Layer 3 - deterministic final gate (post-approval).</strong> After the human approves, <code>SafetyPolicyEngine.Evaluate()</code> runs again with freshness signals (no LLM). If the risk has gone stale, no approved actions execute.</p>
<table><thead><tr><th>Class</th><th>Actions</th><th>Behaviour</th></tr></thead><tbody><tr><td>Safe-automated</td><td>create-risk-board-entry, send-role-notification, pacu-throughput-coordination</td><td>Recorded immediately</td></tr><tr><td>Approval-required</td><td>theatre-case-resequencing, alternate-ward-placement, overtime-approval, duty-manager-escalation</td><td>Approval request to Duty Manager</td></tr><tr><td>Blocked (never-allowed)</td><td>surgery-cancellation, clinical-prioritisation, any unknown action</td><td>Blocked before LLM evaluation</td></tr></tbody></table>
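<p>The default-deny behaviour of Layer 1 is easy to sketch. This is an illustration rather than the repository's <code>ActionRoutingSteps.Route()</code>: <code>ActionRouter</code> and <code>ActionClass</code> are hypothetical names, the action strings come from the table above, and the exact-match (ordinal) comparison is an assumption.</p>

```csharp
using System;
using System.Collections.Generic;

public enum ActionClass { SafeAutomated, ApprovalRequired, Blocked }

// Hypothetical router; action names are taken from the classification table.
public static class ActionRouter
{
    private static readonly HashSet<string> SafeActions = new(StringComparer.Ordinal)
    {
        "create-risk-board-entry",
        "send-role-notification",
        "pacu-throughput-coordination"
    };

    private static readonly HashSet<string> ApprovalRequiredActions = new(StringComparer.Ordinal)
    {
        "theatre-case-resequencing",
        "alternate-ward-placement",
        "overtime-approval",
        "duty-manager-escalation"
    };

    public static ActionClass Route(string? action)
    {
        if (action is not null && SafeActions.Contains(action))
        {
            return ActionClass.SafeAutomated;
        }
        if (action is not null && ApprovalRequiredActions.Contains(action))
        {
            return ActionClass.ApprovalRequired;
        }
        // Default-deny: null, never-allowed, and unknown actions all end up here.
        return ActionClass.Blocked;
    }
}
```

<p>Because blocked is the fall-through case rather than a list to maintain, an action the agent invents can never be executed by omission. It has to be added to one of the two allow sets first.</p>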
<p><img decoding="async" loading="lazy" alt="Recommendation technical detail blade showing the full decision chain" src="https://luke.geek.nz/assets/images/SystemOverview-RecommendationsTechnicalDetailBlade-36cae9952682697944c3d95e0cf92356.gif" width="1897" height="862" class="img_ev3q">
<em>Operator view of a recommendation technical detail blade, showing the risk event summary, agent-generated recommendation, applied safety policy classification, and the approval action buttons for the Duty Manager role.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-i-learned">What I learned<a href="https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/#what-i-learned" class="hash-link" aria-label="Direct link to What I learned" title="Direct link to What I learned" translate="no">​</a></h2>
<p>These were not obvious up front. I hit each of them the first time I ran a full scenario end-to-end.</p>
<p><strong>The <code>incomplete</code> status from Foundry is a retry problem, not a workflow problem.</strong> I initially threw immediately on the first <code>incomplete</code> return, which restarted all 14 stages from scratch. The fix was an internal retry loop inside <code>RunAgentAsync</code> — operators never saw the stall once I hid it behind the call-level retry. The workflow-level retry is still there as a safety net for everything else.</p>
<p><strong>The recommendation is not visible in Fabric until stage 13, even though stage 6 generates it.</strong> This surprised me the first time I watched the operator portal — the screen looked like it was doing nothing for 30-60 seconds after a risk appeared. I had to add a stage-start event so operators see "running..." instead of a frozen UI. Make this explicit in the portal from the start.</p>
<p><strong>The KQL schema was the first thing I built, and it stayed stable once I froze it.</strong> I changed it three times and each change required updating the write path, read path, and API contract simultaneously. I froze it and worked around the constraints. That was the right call: locking the schema early kept those contracts stable through the rest of the PoT.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-reusable-pattern">The reusable pattern<a href="https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/#the-reusable-pattern" class="hash-link" aria-label="Direct link to The reusable pattern" title="Direct link to The reusable pattern" translate="no">​</a></h2>
<p>The healthcare scenario is one instance of a general pattern. The same architecture applies anywhere live signals create operational risk, humans need evidence-backed recommendations, and high-impact actions need approval.</p>
<p><img decoding="async" loading="lazy" alt="Cross-industry architecture mapping" src="https://luke.geek.nz/assets/images/CrossIndustryServiceMapping-b70beef458778165d107afe480b0d27b.png" width="1461" height="641" class="img_ev3q">
<em>The same architecture pattern maps across healthcare, manufacturing, logistics, and field service. Each industry has its own work item, capacity constraint, and risk event, but the detection/reasoning/approval/memory structure stays the same.</em></p>
<p>The repository includes three scenario packs (healthcare, manufacturing, and a manufacturing stub). The stub is the quickest way to understand the pattern - it implements <code>IScenarioPack</code> with inline comments mapping each healthcare concept to its manufacturing equivalent. Adding a new industry means implementing that same interface: define risk types, actions, roles, and synthetic data rules. The cross-industry mapping document covers utilities and emergency management as worked examples.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="open-source">Open source<a href="https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/#open-source" class="hash-link" aria-label="Direct link to Open source" title="Direct link to Open source" translate="no">​</a></h2>
<p>The full implementation is on GitHub under the MIT licence in the <a href="https://github.com/lukemurraynz/AgenticLakehousePoT" target="_blank" rel="noopener noreferrer" class="">AgenticLakehousePoT repository</a>. It includes all five microservices, both full scenario packs, Fabric workspace item definitions, Drasi continuous query definitions, the React/Fluent UI operator portal, and eight test projects covering domain logic, integration, safety policy, and agent evaluation with AgentEval for the .NET/MAF agents.</p>
<p>Clone the repo, run the deployment script, and try it with your own risk types. The manufacturing stub is a good starting point - walk through the inline comments and you can see how the full detection-to-approval flow maps to a different industry. If you build on this pattern for a new industry, I would like to hear about it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references">References<a href="https://luke.geek.nz/azure/building-agentic-operations-lakehouse-drasi-maf/#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://github.com/lukemurraynz/AgenticLakehousePoT" target="_blank" rel="noopener noreferrer" class="">Agentic Operations Lakehouse on GitHub</a></li>
<li class=""><a href="https://drasi.io/" target="_blank" rel="noopener noreferrer" class="">Drasi: open-source continuous query engine</a></li>
<li class=""><a href="https://learn.microsoft.com/agent-framework/overview/?pivots=programming-language-csharp&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Agent Framework documentation</a></li>
<li class=""><a href="https://learn.microsoft.com/fabric/real-time-intelligence/eventhouse?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Fabric Eventhouse documentation</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/foundry/what-is-foundry?tabs=python&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Foundry</a></li>
<li class=""><a href="https://github.com/lukemurraynz/AgenticLakehousePoT/blob/main/docs/architecture/overview.md" target="_blank" rel="noopener noreferrer" class="">Architecture overview</a></li>
<li class=""><a href="https://github.com/lukemurraynz/AgenticLakehousePoT/blob/main/docs/scenarios/healthcare-theatre-flow.md" target="_blank" rel="noopener noreferrer" class="">Healthcare theatre-flow walkthrough</a></li>
<li class=""><a href="https://github.com/lukemurraynz/AgenticLakehousePoT/blob/main/docs/approval-action-system.md" target="_blank" rel="noopener noreferrer" class="">Approval and action system</a></li>
<li class=""><a href="https://github.com/lukemurraynz/AgenticLakehousePoT/blob/main/docs/cross-industry-mapping.md" target="_blank" rel="noopener noreferrer" class="">Cross-industry mapping</a></li>
<li class=""><a href="https://github.com/lukemurraynz/AgenticLakehousePoT/blob/main/docs/safety-boundaries.md" target="_blank" rel="noopener noreferrer" class="">Safety boundaries</a></li>
<li class=""><a href="https://agenteval.dev/" target="_blank" rel="noopener noreferrer" class="">AgentEval: .NET-native evaluation for AI agents</a></li>
</ul>]]></content:encoded>
            <category>Azure</category>
        </item>
        <item>
            <title><![CDATA[Running Azure SRE Agent for AKS and Drasi Operations]]></title>
            <link>https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/</link>
            <guid>https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/</guid>
            <pubDate>Fri, 08 May 2026 07:57:10 GMT</pubDate>
            <description><![CDATA[A practical walkthrough of deploying Azure SRE Agent for AKS and Drasi with AZD, then testing how it handles real platform and runtime issues.]]></description>
            <content:encoded><![CDATA[<p>I have been spending time with <a href="https://learn.microsoft.com/azure/sre-agent/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure SRE Agent</a> and wanted to see how far I could take it beyond the "click around the portal" experience.</p>
<p>The goal was simple: build a public, repeatable blueprint that deploys an Azure SRE Agent for <a href="https://learn.microsoft.com/azure/aks/what-is-aks?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">AKS</a> and <a href="https://drasi.io/" target="_blank" rel="noopener noreferrer" class="">Drasi</a> operations with:</p>
<ul>
<li class="">infrastructure deployed through <a href="https://learn.microsoft.com/azure/developer/azure-developer-cli/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Developer CLI</a></li>
<li class="">custom SRE subagents</li>
<li class="">skills and runbooks</li>
<li class="">Azure Monitor response plans</li>
<li class="">scheduled health checks</li>
<li class="">MCP connectors for Microsoft Learn and Drasi docs</li>
<li class="">fault-injection tests for AKS and Drasi failure modes</li>
</ul>
<p>The result is an <a href="https://github.com/lukemurraynz/drasi-aks-sre-agent/" target="_blank" rel="noopener noreferrer" class="">Azure SRE Agent with support for Drasi on AKS</a> that can be deployed with <code>azd up</code> using an AVM-style (Azure Verified Modules) Bicep module and PowerShell.</p>
<!-- -->
<p><img decoding="async" loading="lazy" alt="Azure SRE Agent Operations Hub" src="https://luke.geek.nz/assets/images/AzureSREAgent_AKSDrasiOperationsHubOverview-59e0ce1ecc0f10a7e7f22a8bdb2fdeaf.png" width="1892" height="927" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-i-built-this">Why I Built This<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#why-i-built-this" class="hash-link" aria-label="Direct link to Why I Built This" title="Direct link to Why I Built This" translate="no">​</a></h2>
<p><a href="https://drasi.io/" target="_blank" rel="noopener noreferrer" class="">Drasi</a> is a good workload for this pattern because it sits right on the boundary between application runtime and platform reliability.</p>
<p>When a Drasi query is stale or a source is not delivering changes, the root cause might be Drasi itself.</p>
<p>But it might also be:</p>
<ul>
<li class="">an AKS scheduling problem</li>
<li class="">a missing metrics API</li>
<li class="">a broken admission webhook</li>
<li class="">a node under pressure</li>
<li class="">a stopped cluster</li>
<li class="">a data collection rule (DCR) or data collection rule association (DCRA) problem</li>
<li class="">a private-cluster operations path issue</li>
</ul>
<p>That is where <a href="https://learn.microsoft.com/azure/sre-agent/overview?tabs=task&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">SRE Agent</a> becomes interesting. I had a lot of fun setting this up, and the mind boggles at what this can do!</p>
<p>The Azure SRE Agent can receive an incident from Azure Monitor, route it to a specialist agent, collect evidence, reason through likely causes, and either propose or execute a remediation depending on the response plan mode.</p>
<p>The trick is giving it enough structure so it does not treat every symptom as 'restart the app' and instead goes through appropriate troubleshooting and evidence-gathering steps.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-this-blueprint-deploys">What this Blueprint Deploys<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#what-this-blueprint-deploys" class="hash-link" aria-label="Direct link to What this Blueprint Deploys" title="Direct link to What this Blueprint Deploys" translate="no">​</a></h2>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>The full source is at <a href="https://github.com/lukemurraynz/drasi-aks-sre-agent/" target="_blank" rel="noopener noreferrer" class="">lukemurraynz/drasi-aks-sre-agent</a> on GitHub. Everything in this post deploys from that repo with <code>azd up</code>.</p></div></div>
<p>The repository deploys the resources for the SRE Agent with Bicep and wires the agent configuration through a post-provision script.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">drasi-aks-sre-agent/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">├── infra/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   ├── main.bicep</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   ├── drasi-sre-agent.bicep</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   └── drasi-sre-agent-rbac.bicep</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">├── avm/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   └── res/app/agent/main.bicep</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">├── scripts/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   └── setup-sre-agent.ps1</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">├── sre-config/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   ├── agents/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   ├── skills/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   ├── response-plans/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   ├── scheduled-tasks/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   └── testing/</span><br></div><div class="token-line" 
style="color:#F8F8F2"><span class="token plain">└── azure.yaml</span><br></div></code></pre></div></div>
<p>At a high level, <code>azd up</code> gives you:</p>
<ul>
<li class=""><code>Microsoft.App/agents</code> Azure SRE Agent</li>
<li class="">managed identity for resource operations</li>
<li class="">Application Insights</li>
<li class="">Log Analytics workspace integration</li>
<li class="">Azure Monitor incident platform</li>
<li class="">Azure Monitor, Log Analytics, Application Insights, Microsoft Learn, and Drasi docs connectors</li>
<li class="">response plans for AKS and Drasi incidents</li>
<li class="">scheduled health probes and daily resilience summaries</li>
<li class="">scoped RBAC for the Drasi resource group and AKS cluster</li>
</ul>
<p><img decoding="async" loading="lazy" alt="Azure Deployed Resources" src="https://luke.geek.nz/assets/images/AzureSREAgent_AzureDeployedResourceOverview-1aa0a515d40b8d21d190cc7782ce07ce.png" width="782" height="344" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="agent-design">Agent Design<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#agent-design" class="hash-link" aria-label="Direct link to Agent Design" title="Direct link to Agent Design" translate="no">​</a></h2>
<p>I split the agent capability into four custom agents:</p>
<table><thead><tr><th>Agent</th><th>Purpose</th></tr></thead><tbody><tr><td><code>drasi-incident-triage</code></td><td>First responder. Classifies the incident and routes by failure phase.</td></tr><tr><td><code>aks-platform-diagnostics</code></td><td>Handles AKS, node, networking, autoscaler, metrics, admission, and upgrade issues.</td></tr><tr><td><code>drasi-runtime-diagnostics</code></td><td>Handles Drasi sources, continuous queries, reactions, Dapr, Redis, Mongo, and Drasi rollout issues.</td></tr><tr><td><code>drasi-remediation-review</code></td><td>Reviews proposed fixes for evidence, risk, rollback, and validation.</td></tr></tbody></table>
<p>I did not want the Drasi runtime agent to debug a cluster-wide scheduling issue. I also did not want the AKS agent deleting Drasi resources when a query or source isn't working.</p>
<p>So the response plans route by failure phase first:</p>
<table><thead><tr><th>Failure phase</th><th>Prefer this route</th></tr></thead><tbody><tr><td>Pod creation fails</td><td>Admission webhook, workload identity, policy, or API server</td></tr><tr><td>Pod is pending</td><td>Scheduler, node capacity, autoscaler, subnet, or quota</td></tr><tr><td>HPA/KEDA is blind</td><td>Metrics API or external metrics API</td></tr><tr><td>Broad <code>kubectl</code> and controller timeouts</td><td>API server, konnectivity, node/network health</td></tr><tr><td>Only Drasi resources are unhealthy after source/query changes</td><td>Drasi lifecycle diagnostics</td></tr></tbody></table>
<p><img decoding="async" loading="lazy" alt="Azure SRE Agent - Agent Canvas View" src="https://luke.geek.nz/assets/images/AzureSREAgent_DrasiAKS_AgentCanvas_TableView-4353936a53d8cce65795f053de5d158b.gif" width="1621" height="867" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="built-in-skills-still-matter">Built-In Skills Still Matter<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#built-in-skills-still-matter" class="hash-link" aria-label="Direct link to Built-In Skills Still Matter" title="Direct link to Built-In Skills Still Matter" translate="no">​</a></h2>
<p>One thing I tested was whether custom skills replaced the built-in skills.</p>
<p>They should not.</p>
<p>For <a href="https://learn.microsoft.com/azure/aks/what-is-aks?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Kubernetes Service (AKS)</a>, the built-in <code>aks_general</code> skill is still useful for generic Kubernetes operations. The custom <code>aks-platform-diagnostics</code> skill I added contains more local context for Drasi, known false-positive patterns, and our route-specific evidence bundles.</p>
<p>The setup script only upserts custom skills and agents. It does not overwrite the built-in SRE Agent skills.</p>
<p>That distinction matters because future platform improvements should continue to flow through the built-in skill set.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="custom-skills">Custom Skills<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#custom-skills" class="hash-link" aria-label="Direct link to Custom Skills" title="Direct link to Custom Skills" translate="no">​</a></h2>
<p>Skills are the runbooks that tell each agent what to collect, what to query, and how to reason before proposing a fix.</p>
<p>I wrote three custom skills for this blueprint:</p>
<table><thead><tr><th>Skill</th><th>Attached to</th><th>Evidence bundle</th></tr></thead><tbody><tr><td><code>aks-platform-diagnostics</code></td><td><code>aks-platform-diagnostics</code> agent</td><td>Node status, pod events, admission webhook health, metrics API availability, konnectivity tunnel state, SNAT stats</td></tr><tr><td><code>drasi-runtime-diagnostics</code></td><td><code>drasi-runtime-diagnostics</code> agent</td><td>Drasi source and query status, Dapr sidecar health, Redis and Mongo connectivity, resource-provider logs</td></tr><tr><td><code>drasi-remediation-review</code></td><td><code>drasi-remediation-review</code> agent</td><td>Evidence completeness checklist, risk classification, rollback path verification, validation steps</td></tr></tbody></table>
<p>The setup script applies them on every <code>azd up</code> without touching the built-in skills.</p>
<p>I kept each evidence bundle deliberately narrow. The Drasi runtime skill, for example, always checks source status before looking at any continuous query — because a stale-looking query usually has a source connection problem behind it. If I left that ordering to the model, it would take longer and sometimes go the wrong way.</p>
<p><img decoding="async" loading="lazy" alt="Azure SRE Agent - Skills View" src="https://luke.geek.nz/assets/images/AzureSREAgent_DrasiAKS_SkillsView-3f16a4c46fcd9a6d27555a07c225dbdb.gif" width="1879" height="862" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="connector-lesson-connected-does-not-always-mean-enabled">Connector Lesson: Connected Does Not Always Mean Enabled<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#connector-lesson-connected-does-not-always-mean-enabled" class="hash-link" aria-label="Direct link to Connector Lesson: Connected Does Not Always Mean Enabled" title="Direct link to Connector Lesson: Connected Does Not Always Mean Enabled" translate="no">​</a></h2>
<p>The first issue I hit was with the Microsoft Learn and Drasi docs MCP connectors.</p>
<p>The connector status was healthy, but the tools were not active for the agent. In the portal, they showed up as connected but with zero active tools.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>A healthy connector status does not mean the tools are active for your agent. Always verify the tool assignment in the portal, not just the connector health indicator.</p></div></div>
<p>The fix was to configure both the connector metadata and the agent tool assignment:</p>
<div class="language-powershell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-powershell codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token function" style="color:rgb(80, 250, 123)">Enable-AgentTools</span><span class="token plain"> </span><span class="token operator">-</span><span class="token plain">ToolNames @</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token string" style="color:rgb(255, 121, 198)">'microsoft-learn_microsoft_docs_search'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token string" style="color:rgb(255, 121, 198)">'microsoft-learn_microsoft_code_sample_search'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token string" style="color:rgb(255, 121, 198)">'microsoft-learn_microsoft_docs_fetch'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token string" style="color:rgb(255, 121, 198)">'drasi-docs_fetch_docs_documentation'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  
</span><span class="token string" style="color:rgb(255, 121, 198)">'drasi-docs_search_docs_documentation'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token string" style="color:rgb(255, 121, 198)">'drasi-docs_search_docs_code'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token string" style="color:rgb(255, 121, 198)">'drasi-docs_fetch_generic_url_content'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><br></div></code></pre></div></div>
<p>After that, the agent had access to current Microsoft documentation and live Drasi docs during investigations.</p>
<p><img decoding="async" loading="lazy" alt="Azure SRE Agent Tools" src="https://luke.geek.nz/assets/images/AzureSREAgent_DrasiAKS_ToolsView-135a8680203d1fe88dad2a0b71136c17.gif" width="1621" height="867" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="response-plans">Response Plans<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#response-plans" class="hash-link" aria-label="Direct link to Response Plans" title="Direct link to Response Plans" translate="no">​</a></h2>
<p>The repo includes direct routes for common <a href="https://learn.microsoft.com/azure/aks/what-is-aks?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Kubernetes Service (AKS)</a> and <a href="https://drasi.io/" target="_blank" rel="noopener noreferrer" class="">Drasi</a> incidents.</p>
<p>For <a href="https://learn.microsoft.com/azure/aks/what-is-aks?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Kubernetes Service (AKS)</a>:</p>
<ul>
<li class="">cluster stopped</li>
<li class="">CoreDNS unavailable</li>
<li class="">node pressure</li>
<li class="">image pull failures</li>
<li class="">pod scheduling failures</li>
<li class="">storage mount failures</li>
<li class="">Dapr system faults</li>
<li class="">Cilium/network faults</li>
<li class="">Azure Monitor agent faults</li>
<li class="">admission webhook failures</li>
<li class="">autoscaler stuck or capped</li>
<li class="">metrics API unavailable</li>
<li class="">SNAT port exhaustion</li>
<li class="">API server overload</li>
<li class="">konnectivity tunnel faults</li>
<li class="">AKS upgrade blockers</li>
<li class="">namespace or PVC stuck terminating</li>
</ul>
<p>For <a href="https://drasi.io/" target="_blank" rel="noopener noreferrer" class="">Drasi</a>:</p>
<ul>
<li class="">platform fault</li>
<li class="">source unavailable</li>
<li class="">query staleness</li>
<li class="">reaction unavailable</li>
<li class="">Redis/Mongo/Dapr state store faults</li>
<li class="">partial upgrade or failed rollback</li>
<li class="">source bootstrap race</li>
<li class="">source dependency break</li>
</ul>
<p>Most routes stay in <strong>Review</strong> mode. One route is intentionally <strong>Autonomous</strong>:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token property">"id"</span><span class="token operator">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"aks-cluster-stopped"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token property">"handlingAgent"</span><span class="token operator">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"aks-platform-diagnostics"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token property">"agentMode"</span><span class="token operator">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"autonomous"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><br></div></code></pre></div></div>
<p>If the cluster is stopped, the agent is allowed to start the same AKS cluster <em>(provided you have granted the user-assigned managed identity permission on the resource to do so)</em>. Otherwise, you can have it notify you through email or Teams, and you can elevate the permissions yourself <em>(as long as you have access to do so)</em>. The start operation is:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">az aks start </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">-g</span><span class="token plain"> </span><span class="token operator">&lt;</span><span class="token plain">resource-group</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">-n</span><span class="token plain"> </span><span class="token operator">&lt;</span><span class="token plain">aks-cluster-name</span><span class="token operator">&gt;</span><br></div></code></pre></div></div>
<p>That is a bounded, reversible-enough action for my use case. It does not authorize node-pool scale-out, upgrades, networking changes, add-on changes, or cluster recreation.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>Autonomy should be route-specific. Do not make the entire agent autonomous just because a single remediation is safe to automate in your environment.</p></div></div>
<p><img decoding="async" loading="lazy" alt="Azure SRE Agent - Stopped Cluster" src="https://luke.geek.nz/assets/images/AzureSREAgent_DrasiAKS_ShutdownStartupClusterTest-bc3344326e9c8d8261611c7dc3b4a7ed.gif" width="1893" height="894" class="img_ev3q"></p>
<p>The alert then changed to Acknowledged, and the agent produced a Kepner-Tregoe problem management table <em>(i.e., IS vs IS NOT)</em>.</p>
<p>We can even look at the Trace of the process to see the steps it took, which helps us improve the agents and their skill calling:</p>
<p><img decoding="async" loading="lazy" alt="Azure SRE Agent - Stopped Cluster Trace" src="https://luke.geek.nz/assets/images/AzureSREAgent_DrasiAKS_ShutdownStartupClusterTestTrace-88759f384a16f338a482d59e2014b05b.gif" width="1390" height="584" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="session-insights">Session Insights<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#session-insights" class="hash-link" aria-label="Direct link to Session Insights" title="Direct link to Session Insights" translate="no">​</a></h2>
<p>Every incident creates an investigation session that you can open in the portal. I found these worth going back and reading properly after each test run.</p>
<p>Each session shows you:</p>
<ul>
<li class="">the triggering alert and incident metadata</li>
<li class="">which response plan and subagent handled the route</li>
<li class="">every tool call made during the investigation (Log Analytics queries, <code>kubectl</code> commands, Azure REST calls, MCP doc lookups)</li>
<li class="">the evidence collected and how the agent reasoned about it</li>
<li class="">the proposed or executed remediation</li>
<li class="">a Kepner-Tregoe IS / IS NOT table where the agent produced one</li>
</ul>
<p>That last part is worth calling out. It is not just tidy output — it forces the agent to be explicit about what is not broken, which is often as useful as knowing what is.</p>
<p><img decoding="async" loading="lazy" alt="Azure SRE Agent - Session Insights" src="https://luke.geek.nz/assets/images/AzureSREAgent_DrasiAKS_SessionInsights-f27b66431043a9f1ff484dae94587a87.gif" width="1900" height="862" class="img_ev3q"></p>
<p>Because the blueprint wires in Application Insights as a connector, you can query the agent's own telemetry directly:</p>
<div class="language-kusto codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-kusto codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">dependencies</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">| where cloud_RoleName == "sre-agent"</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">| where timestamp &gt; ago(1h)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">| project timestamp, name, duration, success</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">| order by timestamp desc</span><br></div></code></pre></div></div>
<p>That helps surface slow tool calls or failed skill invocations that the session view does not always make obvious.</p>
<p>After a real incident, I would go through the session and:</p>
<ol>
<li class="">Check which tools fired and in what order.</li>
<li class="">Look for tool calls that did not make it into the reasoning — wasted round-trips.</li>
<li class="">Look for places where the agent guessed at evidence rather than retrieved it.</li>
<li class="">Update the skill to tighten the evidence bundle for that route.</li>
</ol>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>Sessions are the fastest way to improve your agent over time. One review after a real incident is worth more than ten synthetic tests.</p></div></div>
<p>The Trace view shows the order of skill calls and subagent handoffs. If a route touched three agents before finding the right one, the triage logic in <code>drasi-incident-triage</code> needs to be adjusted.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="scheduled-tasks">Scheduled Tasks<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#scheduled-tasks" class="hash-link" aria-label="Direct link to Scheduled Tasks" title="Direct link to Scheduled Tasks" translate="no">​</a></h2>
<p>Azure SRE Agent scheduled tasks are useful for proactive reliability checks. The Microsoft docs describe them as scheduled natural-language checks that create a conversation thread, query data sources, reason about findings, and return an actionable summary.</p>
<p>This blueprint adds:</p>
<table><thead><tr><th>Task</th><th>Purpose</th></tr></thead><tbody><tr><td><code>drasi-health-probe-15m</code></td><td>Recurring AKS and Drasi health probe</td></tr><tr><td><code>drasi-daily-resilience-report</code></td><td>Daily operational risk and resilience summary</td></tr></tbody></table>
<p>The 15-minute task checks the cluster power state before trying any Kubernetes command. If the cluster is stopped, it reports that directly and avoids wasting time on failed <code>kubectl</code> calls.</p>
<p>The daily report is more architectural: recurring risks, noisy components, failed remediations, and follow-up work.</p>
<p><img decoding="async" loading="lazy" alt="Azure SRE Agent - Scheduled Tasks" src="https://luke.geek.nz/assets/images/AzureSREAgent_DrasiAKS_ScheduledTasks-ece3aaba6d0525b88bc92c7be478e42e.gif" width="1617" height="862" class="img_ev3q"></p>
<p>You could also use scheduled tasks for cost analysis reporting, configuration drift detection, and more.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="fault-injection">Fault Injection<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#fault-injection" class="hash-link" aria-label="Direct link to Fault Injection" title="Direct link to Fault Injection" translate="no">​</a></h2>
<p>I wanted this to be testable without breaking a shared AKS cluster, so the repo includes a fault-injection matrix and synthetic route validation.</p>
<p>For destructive or noisy cases, use synthetic alerts:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">az monitor metrics alert create </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  --resource-group </span><span class="token operator">&lt;</span><span class="token plain">resource-group</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--name</span><span class="token plain"> sre-e2e-aks-admission-webhook-failure </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--scopes</span><span class="token plain"> </span><span class="token operator">&lt;</span><span class="token plain">aks-cluster-resource-id</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 
249);font-style:italic">--description</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"Synthetic route validation. Expected route: aks-admission-webhook-failure"</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--severity</span><span class="token plain"> </span><span class="token number">3</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  --evaluation-frequency 1m </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  --window-size 5m </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--condition</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"avg kube_node_status_allocatable_cpu_cores &gt; 0"</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--action</span><span class="token plain"> </span><span class="token operator">&lt;</span><span class="token 
plain">sre-agent-action-group-resource-id</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  --auto-mitigate </span><span class="token boolean">false</span><br></div></code></pre></div></div>
<p>That alert fires by design without damaging the cluster: a running cluster always has more than zero allocatable CPU cores, so the condition is always true. The important part is the route ID in the alert name and description.</p>
<p>The Bicep also supports this with an opt-in flag:</p>
<div class="language-bicep codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bicep codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">param</span><span class="token plain"> deploySyntheticRouteValidationAlerts </span><span class="token datatype class-name">bool</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token boolean">false</span><br></div></code></pre></div></div>
<p>Keep it off by default. Turn it on only for validation windows.</p>
<div class="theme-admonition theme-admonition-danger admonition_xJq3 alert alert--danger"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M5.05.31c.81 2.17.41 3.38-.52 4.31C3.55 5.67 1.98 6.45.9 7.98c-1.45 2.05-1.7 6.53 3.53 7.7-2.2-1.16-2.67-4.52-.3-6.61-.61 2.03.53 3.33 1.94 2.86 1.39-.47 2.3.53 2.27 1.67-.02.78-.31 1.44-1.13 1.81 3.42-.59 4.78-3.42 4.78-5.56 0-2.84-2.53-3.22-1.25-5.61-1.52.13-2.03 1.13-1.89 2.75.09 1.08-1.02 1.8-1.86 1.33-.67-.41-.66-1.19-.06-1.78C8.18 5.31 8.68 2.45 5.05.32L5.03.3l.02.01z"></path></svg></span>danger</div><div class="admonitionContent_BuS1"><p>Synthetic alerts that fire continuously will keep triggering autonomous or review-mode agent runs, burning through tokens and tool calls. Deploy them, validate them, then delete or disable them.</p></div></div>
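<p>As a rough sketch of that cleanup step, the two helpers below wrap the Azure CLI calls for deleting or pausing a synthetic alert rule. The function names are hypothetical and not part of the repo; the alert name in the usage comment is the one created above.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Hypothetical helpers (not part of the repo) for cleaning up after a validation window.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"># $1 = resource group, $2 = alert rule name. Deletes the rule.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">remove_synthetic_alert() {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  az monitor metrics alert delete \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    --resource-group "$1" \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    --name "$2"</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Same arguments. Keeps the rule but stops it evaluating until it is re-enabled.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">pause_synthetic_alert() {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  az monitor metrics alert update \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    --resource-group "$1" \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    --name "$2" \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    --enabled false</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Usage: remove_synthetic_alert my-resource-group sre-e2e-aks-admission-webhook-failure</span><br></div></code></pre></div></div>
<p>Deleting is the safer default; pausing only makes sense if you plan to rerun the same validation soon.</p>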
<p><img decoding="async" loading="lazy" alt="Synthetic Incidents" src="https://luke.geek.nz/assets/images/AzureSREAgent_DrasiAKS_Synthetic_Incidents-ecfa00bbb70bc9be0894fe531cdbcfce.gif" width="1617" height="862" class="img_ev3q"></p>
<p><img decoding="async" loading="lazy" alt="Azure SRE Agent Canvas View" src="https://luke.geek.nz/assets/images/AzureSREAgent_DrasiAKS_AgentCanvas_CanvasView-4164193ea14f372a40f2f422b0e004f9.png" width="1778" height="826" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="real-finding-container-insights-was-broken">Real Finding: Container Insights Was Broken<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#real-finding-container-insights-was-broken" class="hash-link" aria-label="Direct link to Real Finding: Container Insights Was Broken" title="Direct link to Real Finding: Container Insights Was Broken" translate="no">​</a></h2>
<p>One useful outcome from testing was that the SRE Agent surfaced a real platform issue.</p>
<p>The AKS monitoring add-on was enabled, and <code>ama-logs</code> pods were running, but Log Analytics had no recent rows in:</p>
<ul>
<li class=""><code>KubePodInventory</code></li>
<li class=""><code>ContainerLogV2</code></li>
<li class=""><code>Heartbeat</code></li>
<li class=""><code>InsightsMetrics</code></li>
</ul>
<p>The <code>ama-logs</code> pod logs showed DCR parsing errors, and there were no Data Collection Rules or DCR associations.</p>
<p>That is a perfect example of why you need platform routes before application routes. If Drasi looks unhealthy but your AKS telemetry pipeline is broken, the first incident is not "restart Drasi". It is "fix monitoring".</p>
<p>I added a baseline alert for that:</p>
<div class="language-kusto codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-kusto codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">KubePodInventory</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">| where TimeGenerated &gt; ago(30m)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">| summarize CurrentRows=count()</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">| where CurrentRows == 0</span><br></div></code></pre></div></div>
<p>This routes to:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">aks-monitoring-agent-fault</span><br></div></code></pre></div></div>
<p>The SRE Agent correctly diagnosed the missing DCR/DCRA path and proposed re-onboarding Container Insights. That is a sensible fix, but it changes AKS monitoring configuration, so the remediation review skill keeps it as a human approval path.</p>
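<p>If you want to confirm the same finding by hand before approving anything, a minimal sketch with the Azure CLI could look like this. The function names are hypothetical, and depending on your CLI version the <code>data-collection</code> command group may need to be installed as an extension first.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Hypothetical helpers (not part of the repo) for confirming the finding by hand.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"># $1 = Log Analytics workspace ID (GUID). Counts KubePodInventory rows from the last 30 minutes.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">recent_inventory_rows() {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  az monitor log-analytics query \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    --workspace "$1" \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    --analytics-query "KubePodInventory | where TimeGenerated &gt; ago(30m) | summarize CurrentRows=count()"</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"># $1 = AKS cluster resource ID. An empty result matches the missing DCR association symptom.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">list_dcr_associations() {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  az monitor data-collection rule association list \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    --resource "$1" \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    --output table</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"># $1 = resource group. Lists any Data Collection Rules that exist there.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">list_dcrs() {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  az monitor data-collection rule list \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    --resource-group "$1" \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    --output table</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></div></code></pre></div></div>
<p>All three calls are read-only, so they sit comfortably on the safe side of the approval boundary.</p>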
<p><img decoding="async" loading="lazy" alt="Azure SRE Agent - Container Insights Incident" src="https://luke.geek.nz/assets/images/AzureSREAgent_DrasiAKS_ContainerInsightsMissingTest-151b30300559d1fa68f1e5a68ce3d820.gif" width="1617" height="862" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="drasi-example-source-and-query-issues">Drasi Example: Source and Query Issues<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#drasi-example-source-and-query-issues" class="hash-link" aria-label="Direct link to Drasi Example: Source and Query Issues" title="Direct link to Drasi Example: Source and Query Issues" translate="no">​</a></h2>
<p>Drasi has its own failure modes that are not generic Kubernetes failures.</p>
<p>One route in the blueprint handles a documented lifecycle case: creating a Source and then immediately creating a dependent Continuous Query before the Source has connected cleanly.</p>
<p>The response plan is:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">drasi-source-bootstrap-race</span><br></div></code></pre></div></div>
<p>The correct remediation is not to restart the cluster. It is:</p>
<ol>
<li class="">Confirm the Source is healthy.</li>
<li class="">Inspect the Continuous Query status and resource-provider logs.</li>
<li class="">Delete and recreate only the affected Continuous Query if the bootstrap failed.</li>
</ol>
<p>That is the kind of domain-specific behavior that belongs in a Drasi runtime skill, not a generic AKS skill.</p>
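<p>The three steps above could be sketched with the Drasi CLI roughly as follows. The helper names are hypothetical, and the <code>drasi-system</code> namespace and <code>drasi-resource-provider</code> deployment name are assumptions based on a default install, so adjust them to match your cluster.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Hypothetical helpers (not part of the repo). $1 = Source name, $2 = Continuous Query name.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">inspect_bootstrap_race() {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  # Step 1: confirm the Source is healthy.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  drasi describe source "$1"</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  # Step 2: inspect the Continuous Query status and the resource-provider logs.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  drasi describe query "$2"</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  # Assumption: default install, resource provider runs in drasi-system.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  kubectl logs --namespace drasi-system deployment/drasi-resource-provider --tail=100</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Step 3, only if the bootstrap failed. $1 = query name, $2 = path to the query YAML.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">recreate_query() {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  drasi delete query "$1"</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  drasi apply -f "$2"</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></div></code></pre></div></div>
<p>Keeping inspection and recreation in separate helpers mirrors the response plan: the first is read-only, the second changes state and touches only the one affected query.</p>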
<p><img decoding="async" loading="lazy" alt="Drasi source fix" src="https://luke.geek.nz/assets/images/AzureSREAgent_DrasiAKS_DrasiIncidentFix-bf05c41f8a4e2361deeca7cf00c246d2.gif" width="1617" height="862" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-deployment-flow">The Deployment Flow<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#the-deployment-flow" class="hash-link" aria-label="Direct link to The Deployment Flow" title="Direct link to The Deployment Flow" translate="no">​</a></h2>
<p>To deploy, the flow is:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token function" style="color:rgb(80, 250, 123)">git</span><span class="token plain"> clone https://github.com/lukemurraynz/drasi-aks-sre-agent.git</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">cd</span><span class="token plain"> drasi-aks-sre-agent</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd auth login</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">az login</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd </span><span class="token function" style="color:rgb(80, 250, 123)">env</span><span class="token plain"> new drasi-sre-dev</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd </span><span class="token function" style="color:rgb(80, 250, 123)">env</span><span class="token plain"> </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">set</span><span class="token plain"> DRASI_RESOURCE_GROUP_NAME </span><span class="token operator">&lt;</span><span class="token plain">drasi-resource-group</span><span class="token operator">&gt;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token 
plain">azd </span><span class="token function" style="color:rgb(80, 250, 123)">env</span><span class="token plain"> </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">set</span><span class="token plain"> DRASI_AKS_CLUSTER_NAME </span><span class="token operator">&lt;</span><span class="token plain">aks-cluster-name</span><span class="token operator">&gt;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd </span><span class="token function" style="color:rgb(80, 250, 123)">env</span><span class="token plain"> </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">set</span><span class="token plain"> DRASI_LOG_ANALYTICS_WORKSPACE_NAME </span><span class="token operator">&lt;</span><span class="token plain">workspace-name</span><span class="token operator">&gt;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd </span><span class="token function" style="color:rgb(80, 250, 123)">env</span><span class="token plain"> </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">set</span><span class="token plain"> AZURE_RESOURCE_GROUP </span><span class="token operator">&lt;</span><span class="token plain">agent-resource-group</span><span class="token operator">&gt;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd </span><span class="token function" style="color:rgb(80, 250, 123)">env</span><span class="token plain"> </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">set</span><span class="token plain"> AZURE_SRE_AGENT_NAME </span><span class="token operator">&lt;</span><span class="token plain">agent-name</span><span class="token operator">&gt;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token 
plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd up</span><br></div></code></pre></div></div>
<blockquote>
<p>Refer to my previous blog article <a href="https://luke.geek.nz/azure/drasi-azd-extension/" target="_blank" rel="noopener noreferrer" class="">Deploy Drasi Faster with the Azure Developer CLI Extension</a> if you want to get Drasi running on AKS using an AZD extension.</p>
</blockquote>
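<p>Before the final <code>azd up</code>, it can be worth confirming that the five values set above actually landed in the selected environment. This is a minimal preflight sketch, not something the repo ships; <code>check_azd_env</code> is a hypothetical name, and it relies on <code>azd env get-values</code> printing one <code>KEY="value"</code> pair per line.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Hypothetical preflight (not part of the repo): report required values missing from the azd environment.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">check_azd_env() {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  local values name missing=0</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  values="$(azd env get-values)"</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  for name in DRASI_RESOURCE_GROUP_NAME DRASI_AKS_CLUSTER_NAME \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    DRASI_LOG_ANALYTICS_WORKSPACE_NAME AZURE_RESOURCE_GROUP AZURE_SRE_AGENT_NAME; do</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    # Each value is expected at the start of a line, followed by an equals sign.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    if ! printf '%s\n' "$values" | grep -q "^${name}="; then</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      echo "missing: ${name}"</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      missing=1</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    fi</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  done</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  # Non-zero exit status when anything is missing, so it can gate the deployment.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  return "$missing"</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Usage: check_azd_env, then run azd up only if it succeeded.</span><br></div></code></pre></div></div>
<p>The check only confirms that each name is present, not that the resource it points at exists.</p>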
<p>The first run provisions the agent and then applies the data-plane configuration:</p>
<ul>
<li class="">custom agents</li>
<li class="">skills</li>
<li class="">response plans</li>
<li class="">scheduled tasks</li>
<li class="">MCP tool enablement</li>
</ul>
<p>The reason for the post-provision step is pragmatic: not every SRE Agent object is cleanly portable through ARM in every tenant yet, so the repo uses Bicep for infrastructure and the SRE Agent data-plane API for operational content.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="lessons-learned">Lessons Learned<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#lessons-learned" class="hash-link" aria-label="Direct link to Lessons Learned" title="Direct link to Lessons Learned" translate="no">​</a></h2>
<p>A few things stood out.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-route-by-failure-phase-before-the-product">1. Route by failure phase before the product<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#1-route-by-failure-phase-before-the-product" class="hash-link" aria-label="Direct link to 1. Route by failure phase before the product" title="Direct link to 1. Route by failure phase before the product" translate="no">​</a></h3>
<ul>
<li class="">Creation-time failures usually mean admission, workload identity, policy, or API-server health.</li>
<li class="">Pending-time failures usually mean scheduling, capacity, subnet, or autoscaler.</li>
<li class="">Metrics blindness usually means the metrics API or the monitoring pipeline.</li>
</ul>
<p>Only after those are clean should the Drasi specialist take over.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-autonomous-should-be-boring">2. Autonomous should be boring<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#2-autonomous-should-be-boring" class="hash-link" aria-label="Direct link to 2. Autonomous should be boring" title="Direct link to 2. Autonomous should be boring" translate="no">​</a></h3>
<p>Starting a stopped AKS cluster is boring enough for my environment.</p>
<p>Recreating Container Insights, changing DCRs, scaling node pools, changing webhooks, deleting finalizers, or modifying networking is not.</p>
<p>Those remain approval-gated.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-synthetic-alerts-are-useful-but-dangerous-if-left-on">3. Synthetic alerts are useful, but dangerous if left on<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#3-synthetic-alerts-are-useful-but-dangerous-if-left-on" class="hash-link" aria-label="Direct link to 3. Synthetic alerts are useful, but dangerous if left on" title="Direct link to 3. Synthetic alerts are useful, but dangerous if left on" translate="no">​</a></h3>
<p>Always-firing metric alerts are great for response-plan validation.</p>
<p>They are terrible as a permanent baseline.</p>
<p>Deploy them behind a flag, run the validation, capture the evidence, and delete them.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-connected-is-not-the-same-as-usable">4. "Connected" is not the same as "usable."<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#4-connected-is-not-the-same-as-usable" class="hash-link" aria-label="Direct link to 4. &quot;Connected&quot; is not the same as &quot;usable.&quot;" title="Direct link to 4. &quot;Connected&quot; is not the same as &quot;usable.&quot;" translate="no">​</a></h3>
<p>An MCP connector can show as connected and healthy while its tools are still not active for the agent.</p>
<p>Check the actual tool assignment, not just connector health.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-observability-needs-its-own-alert">5. Observability needs its own alert<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#5-observability-needs-its-own-alert" class="hash-link" aria-label="Direct link to 5. Observability needs its own alert" title="Direct link to 5. Observability needs its own alert" translate="no">​</a></h3>
<p>If Container Insights stops sending inventory, many AKS alerts become blind.</p>
<p>That is a reliability incident in its own right.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-this-fits-in-well-architected">Where This Fits in Well-Architected<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#where-this-fits-in-well-architected" class="hash-link" aria-label="Direct link to Where This Fits in Well-Architected" title="Direct link to Where This Fits in Well-Architected" translate="no">​</a></h2>
<p>From a <a href="https://learn.microsoft.com/azure/well-architected/reliability/?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Well-Architected Reliability</a> perspective, this is about reducing detection and diagnosis time without blindly increasing the risk of automation.</p>
<p>From an Operational Excellence perspective, it gives you:</p>
<ul>
<li class="">version-controlled runbooks</li>
<li class="">repeatable deployment</li>
<li class="">consistent incident routing</li>
<li class="">explicit approval boundaries</li>
<li class="">scheduled operational review</li>
<li class="">post-incident feedback loops</li>
</ul>
<p>From a Cost Optimization perspective, it matters because noisy autonomous agents can quickly burn through tokens and tool calls. Route narrowly, scope tools, and keep high-impact flows in Review until you have real evidence.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="final-thoughts">Final Thoughts<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#final-thoughts" class="hash-link" aria-label="Direct link to Final Thoughts" title="Direct link to Final Thoughts" translate="no">​</a></h2>
<p><a href="https://learn.microsoft.com/azure/sre-agent/overview?tabs=task&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure SRE Agent</a> is most useful when you treat it like an operational platform, not a chatbot.</p>
<p>The value comes from the structure around it:</p>
<ul>
<li class="">focused agents</li>
<li class="">route-specific response plans</li>
<li class="">current documentation tools</li>
<li class="">scoped RBAC</li>
<li class="">review-mode safety gates</li>
<li class="">scheduled checks</li>
<li class="">fault-injection evidence</li>
</ul>
<p>For AKS and Drasi, that structure matters even more because the symptoms overlap. A Drasi issue can look like a Kubernetes issue, and a Kubernetes issue can make Drasi look broken.</p>
<p>That is exactly the kind of ambiguity SRE Agents can help with, as long as we give them the right guardrails. Hopefully this gives you enough of a view, and enough of a scaffold, to adapt to your own purposes.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references">References<a href="https://luke.geek.nz/azure/azure-sre-agent-aks-drasi/#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://learn.microsoft.com/azure/sre-agent/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure SRE Agent overview</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/sre-agent/incident-platforms?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure SRE Agent incident platforms</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/sre-agent/incident-response-plans?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure SRE Agent response plans</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/sre-agent/scheduled-tasks?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure SRE Agent scheduled tasks</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/sre-agent/sub-agents?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure SRE Agent custom agents</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/sre-agent/connectors?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure SRE Agent connectors</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/aks/monitor-aks?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Monitor Azure Kubernetes Service</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/azure-monitor/containers/kubernetes-monitoring-enable?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Enable monitoring for AKS clusters</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/azure-monitor/containers/container-insights-troubleshoot?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Troubleshoot container log collection</a></li>
<li class=""><a href="https://learn.microsoft.com/azure/developer/azure-developer-cli/azd-up-workflow?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Developer CLI <code>azd up</code> workflow</a></li>
<li class=""><a href="https://drasi.io/" target="_blank" rel="noopener noreferrer" class="">Drasi documentation</a></li>
</ul>]]></content:encoded>
            <category>Azure</category>
        </item>
        <item>
            <title><![CDATA[Deploy Drasi Faster with the Azure Developer CLI Extension]]></title>
            <link>https://luke.geek.nz/azure/drasi-azd-extension/</link>
            <guid>https://luke.geek.nz/azure/drasi-azd-extension/</guid>
            <pubDate>Wed, 15 Apr 2026 06:24:12 GMT</pubDate>
            <description><![CDATA[Learn how to use the azure.drasi extension to standardize Drasi project setup, deployment, and operations using native azd workflows.]]></description>
            <content:encoded><![CDATA[<p>I have deployed <a href="https://drasi.io/" target="_blank" rel="noopener noreferrer" class="">Drasi</a> enough times now to know exactly where the pain shows up: too much manual scaffolding, inconsistent post-provision steps, and "it worked in one environment but not the other" cluster setup drift.</p>
<p>So I built a custom <a href="https://learn.microsoft.com/azure/developer/azure-developer-cli/extensions/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Developer CLI extension</a> for <a href="https://learn.microsoft.com/azure/developer/azure-developer-cli/overview?tabs=windows&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">AZD</a> called <code>azure.drasi</code> to standardize that workflow end-to-end.</p>
<p>It gives you a clean, repeatable way to:</p>
<ul>
<li class="">Scaffold <a href="https://drasi.io/" target="_blank" rel="noopener noreferrer" class="">Drasi</a> projects from templates</li>
<li class="">Validate config before touching infrastructure</li>
<li class="">Provision <a href="https://learn.microsoft.com/azure/aks/what-is-aks?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">AKS</a> + supporting Azure resources in one flow</li>
<li class="">Deploy sources, queries, middleware, and reactions in dependency order</li>
<li class="">Operate and troubleshoot Drasi workloads with native <code>azd</code> commands</li>
</ul>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-i-built-this">Why I Built This<a href="https://luke.geek.nz/azure/drasi-azd-extension/#why-i-built-this" class="hash-link" aria-label="Direct link to Why I Built This" title="Direct link to Why I Built This" translate="no">​</a></h2>
<p>Drasi deployments are not just "deploy app and move on". You normally need to coordinate:</p>
<ul>
<li class="">Azure Kubernetes Service (AKS) configuration (including Workload Identity)</li>
<li class="">Namespace/runtime setup</li>
<li class="">Managed identity + Key Vault + diagnostics plumbing</li>
<li class="">Correct deployment order for Drasi components</li>
</ul>
<p>This is exactly the kind of process that becomes fragile if left to handwritten, ad hoc scripts per repo.</p>
<p>The extension wraps those moving parts into a consistent set of AZD commands, so your Drasi workloads feel like any other <code>azd</code> project lifecycle.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-the-extension-covers">What the Extension Covers<a href="https://luke.geek.nz/azure/drasi-azd-extension/#what-the-extension-covers" class="hash-link" aria-label="Direct link to What the Extension Covers" title="Direct link to What the Extension Covers" translate="no">​</a></h2>
<p>The current <code>azure.drasi</code> extension supports:</p>
<ul>
<li class="">Project scaffolding templates:<!-- -->
<ul>
<li class=""><code>blank</code></li>
<li class=""><code>blank-terraform</code></li>
<li class=""><code>event-hub-routing</code></li>
<li class=""><code>postgresql-source</code></li>
</ul>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="supported-template-matrix">Supported Template Matrix<a href="https://luke.geek.nz/azure/drasi-azd-extension/#supported-template-matrix" class="hash-link" aria-label="Direct link to Supported Template Matrix" title="Direct link to Supported Template Matrix" translate="no">​</a></h3>
<table><thead><tr><th>Template</th><th>Best for</th><th>Typical use case</th></tr></thead><tbody><tr><td><code>blank</code></td><td>Starting from scratch</td><td>Build a custom Drasi topology with your own sources/queries/reactions</td></tr><tr><td><code>blank-terraform</code></td><td>Infra-first teams</td><td>Use Terraform-based provisioning workflows with Drasi project scaffolding</td></tr><tr><td><code>event-hub-routing</code></td><td>Streaming/event routing</td><td>Ingest from Event Hubs and route/filter events with Drasi queries</td></tr><tr><td><code>postgresql-source</code></td><td>Relational CDC demos/POCs</td><td>Capture PostgreSQL changes and validate end-to-end Drasi flow quickly</td></tr></tbody></table>
<ul>
<li class="">
<p><strong>These templates are starting points, not rigid blueprints.</strong> Before you run <code>azd drasi provision</code>, you can modify infrastructure settings (for example, VM sizes/SKUs, PostgreSQL sizing, networking, and environment parameters) to fit your subscription limits, region availability, and production standards.</p>
</li>
<li class="">
<p>Offline validation of Drasi config before deployment</p>
</li>
<li class="">
<p>Infrastructure provisioning for AKS, Key Vault, UAMI, and Log Analytics</p>
</li>
<li class="">
<p>Ordered Drasi component deployment with health checks</p>
</li>
<li class="">
<p>Operations commands for status, logs, and diagnostics</p>
</li>
<li class="">
<p>Safe teardown and runtime upgrade actions</p>
</li>
</ul>
<p><img decoding="async" loading="lazy" alt="azd drasi init template selection" src="https://luke.geek.nz/assets/images/drasiazdextensiotemplateselection-37edf4f73d47907a2a62cb10620b2dd3.gif" width="1009" height="421" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="installation">Installation<a href="https://luke.geek.nz/azure/drasi-azd-extension/#installation" class="hash-link" aria-label="Direct link to Installation" title="Direct link to Installation" translate="no">​</a></h2>
<p>Install the extension from my GitHub Releases registry:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd extension </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">source</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">add</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">-n</span><span class="token plain"> drasi-lukemurray-azdext </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">-t</span><span class="token plain"> url </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">-l</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"https://github.com/lukemurraynz/azd.extensions.drasi/releases/latest/download/registry.json"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd extension </span><span class="token function" style="color:rgb(80, 250, 123)">install</span><span class="token plain"> azure.drasi </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">-s</span><span class="token plain"> drasi-lukemurray-azdext</span><br></div></code></pre></div></div>
<p>Verify:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--help</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi version</span><br></div></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="Drasi azd extension install" src="https://luke.geek.nz/assets/images/drasiazdextensioninstall-53d34874f7bc90939c52167b921399d3.gif" width="1009" height="421" class="img_ev3q"></p>
<p>You can upgrade the extension to the latest version from my repo using:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd extension upgrade azure.drasi</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="quick-start-first-run">Quick Start (First Run)<a href="https://luke.geek.nz/azure/drasi-azd-extension/#quick-start-first-run" class="hash-link" aria-label="Direct link to Quick Start (First Run)" title="Direct link to Quick Start (First Run)" translate="no">​</a></h2>
<p>This is the fast path from an empty folder to deployed Drasi components:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token function" style="color:rgb(80, 250, 123)">mkdir</span><span class="token plain"> my-drasi-app </span><span class="token operator">&amp;&amp;</span><span class="token plain"> </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">cd</span><span class="token plain"> my-drasi-app</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd init </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--minimal</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">-force</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi init </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--template</span><span class="token plain"> postgresql-source</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd </span><span class="token function" style="color:rgb(80, 250, 123)">env</span><span class="token plain"> new drasienv</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi validate </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--strict</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd auth login</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">az login</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi provision</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi deploy</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi status</span><br></div></code></pre></div></div>
<blockquote>
<p><strong>Cost note:</strong> <code>azd drasi provision</code> can create billable resources (especially AKS and Log Analytics). Use a dedicated dev/test subscription or budget guardrails for experimentation. The figures below are examples only, to give a rough sense of scale. Azure Developer CLI shines here because you can remove and redeploy entire environments on demand.</p>
</blockquote>
<p>Estimated cost of the <code>postgresql-source</code> template baseline, in USD at pay-as-you-go rates and running 24 hours a day (SKUs as defined in the Bicep: 2× <code>Standard_D2s_v5</code> AKS nodes, <code>Standard_B1ms</code> PostgreSQL, Standard NAT Gateway + Public IP):</p>
<p><strong>newzealandnorth</strong></p>
<table><thead><tr><th>Resource</th><th>SKU</th><th style="text-align:right">1 day</th><th style="text-align:right">7 days</th><th style="text-align:right">30 days</th></tr></thead><tbody><tr><td>AKS nodes ×2</td><td>Standard_D2s_v5</td><td style="text-align:right">$6.05</td><td style="text-align:right">$42.34</td><td style="text-align:right">$181.44</td></tr><tr><td>PostgreSQL compute</td><td>Standard_B1ms (Burstable)</td><td style="text-align:right">$0.66</td><td style="text-align:right">$4.59</td><td style="text-align:right">$19.66</td></tr><tr><td>NAT Gateway</td><td>Standard</td><td style="text-align:right">$1.08</td><td style="text-align:right">$7.56</td><td style="text-align:right">$32.40</td></tr><tr><td>Public IP</td><td>Standard Static</td><td style="text-align:right">$0.12</td><td style="text-align:right">$0.84</td><td style="text-align:right">$3.60</td></tr><tr><td><strong>Total</strong></td><td></td><td style="text-align:right"><strong>$7.90</strong></td><td style="text-align:right"><strong>$55.32</strong></td><td style="text-align:right"><strong>$237.10</strong></td></tr></tbody></table>
<p><em>Key Vault (Standard) and Log Analytics are consumption-based: Key Vault is negligible for dev use; Log Analytics adds $3.51/GB (NZ North) beyond the free 5 GB per billing account per month. VNet and managed identities are free.</em></p>
<blockquote>
<p><strong>Region note:</strong> If a SKU/offer is restricted in your default location, set a supported region before provisioning. For example:</p>
</blockquote>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd </span><span class="token function" style="color:rgb(80, 250, 123)">env</span><span class="token plain"> </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">set</span><span class="token plain"> AZURE_LOCATION australiaeast</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi provision</span><br></div></code></pre></div></div>
<p>This flow is intentionally opinionated: validate early, provision once, then deploy in a known order.</p>
<p><img decoding="async" loading="lazy" alt="End-to-end quick provision" src="https://luke.geek.nz/assets/images/drasiazdextensionprovision-857b141d3b5b840e6fca7e2d53a89a68.gif" width="1005" height="327" class="img_ev3q"></p>
<p><img decoding="async" loading="lazy" alt="azd drasi status" src="https://luke.geek.nz/assets/images/azd_drasi_status-aa744a66aa107b4f2b01991168d14f3a.png" width="280" height="228" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="common-scenarios">Common Scenarios<a href="https://luke.geek.nz/azure/drasi-azd-extension/#common-scenarios" class="hash-link" aria-label="Direct link to Common Scenarios" title="Direct link to Common Scenarios" translate="no">​</a></h2>
<p>These are the scenarios I hit most often when building demos and internal proofs-of-concept.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-scaffold-and-start-with-a-known-pattern">1. Scaffold and Start with a Known Pattern<a href="https://luke.geek.nz/azure/drasi-azd-extension/#1-scaffold-and-start-with-a-known-pattern" class="hash-link" aria-label="Direct link to 1. Scaffold and Start with a Known Pattern" title="Direct link to 1. Scaffold and Start with a Known Pattern" translate="no">​</a></h3>
<p>When you want to get moving quickly with a real source/reaction shape, start from a template:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi init </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--template</span><span class="token plain"> event-hub-routing</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi validate</span><br></div></code></pre></div></div>
<p>This avoids copy/paste YAML drift and gives you a repeatable baseline across contributors.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-validate-in-ci-before-provisiondeploy">2. Validate in CI Before Provision/Deploy<a href="https://luke.geek.nz/azure/drasi-azd-extension/#2-validate-in-ci-before-provisiondeploy" class="hash-link" aria-label="Direct link to 2. Validate in CI Before Provision/Deploy" title="Direct link to 2. Validate in CI Before Provision/Deploy" translate="no">​</a></h3>
<p>If you want fast feedback on pull requests:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi validate </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--strict</span><br></div></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="azd drasi validate" src="https://luke.geek.nz/assets/images/azd_drasi_validate-feceaaa59235917dac162093ec09c290.jpg" width="624" height="273" class="img_ev3q"></p>
<p>Because validation runs offline, you can fail quickly without needing cluster access.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-dry-run-before-a-live-deploy">3. Dry-Run Before a Live Deploy<a href="https://luke.geek.nz/azure/drasi-azd-extension/#3-dry-run-before-a-live-deploy" class="hash-link" aria-label="Direct link to 3. Dry-Run Before a Live Deploy" title="Direct link to 3. Dry-Run Before a Live Deploy" translate="no">​</a></h3>
<p>Useful when you want confidence in component changes:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi deploy --dry-run</span><br></div></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="azd drasi deploy --dry-run" src="https://luke.geek.nz/assets/images/azd_drasi_deploy_dryrun-cc0a2b458b81cd65889886927642a0c8.jpg" width="990" height="83" class="img_ev3q"></p>
<p>Think of this as your safety rail before touching a shared environment.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-multi-environment-deployments">4. Multi-Environment Deployments<a href="https://luke.geek.nz/azure/drasi-azd-extension/#4-multi-environment-deployments" class="hash-link" aria-label="Direct link to 4. Multi-Environment Deployments" title="Direct link to 4. Multi-Environment Deployments" translate="no">​</a></h3>
<p>Use overlays and environment targeting for dev/stage/prod separation:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi provision </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--environment</span><span class="token plain"> dev</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi deploy </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--environment</span><span class="token plain"> dev</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi provision </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--environment</span><span class="token plain"> prod</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi deploy </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--environment</span><span class="token plain"> prod</span><br></div></code></pre></div></div>
<p>This is where the extension helps prevent "prod got dev settings" moments.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-operate-and-troubleshoot-a-running-deployment">5. Operate and Troubleshoot a Running Deployment<a href="https://luke.geek.nz/azure/drasi-azd-extension/#5-operate-and-troubleshoot-a-running-deployment" class="hash-link" aria-label="Direct link to 5. Operate and Troubleshoot a Running Deployment" title="Direct link to 5. Operate and Troubleshoot a Running Deployment" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi status</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi status </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--kind</span><span class="token plain"> continuousquery </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--output</span><span class="token plain"> json</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi logs </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--kind</span><span class="token plain"> continuousquery </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--component</span><span class="token plain"> order-changes</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi diagnose</span><br></div></code></pre></div></div>
<p>The <code>diagnose</code> command is especially useful when something is failing across auth, cluster connectivity, or runtime dependencies.</p>
<p><img decoding="async" loading="lazy" alt="azd drasi status" src="https://luke.geek.nz/assets/images/azd_drasi_troubleshooting-f2dc72ad3a68e9ecb8ce60a3a5c62b0c.jpg" width="509" height="405" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="6-teardown-with-guardrails">6. Teardown with Guardrails<a href="https://luke.geek.nz/azure/drasi-azd-extension/#6-teardown-with-guardrails" class="hash-link" aria-label="Direct link to 6. Teardown with Guardrails" title="Direct link to 6. Teardown with Guardrails" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token comment" style="color:rgb(98, 114, 164)"># Components only</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi teardown </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--force</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># Components + infrastructure</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd drasi teardown </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--force</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--infrastructure</span><br></div></code></pre></div></div>
<blockquote>
<p><strong>Cleanup note:</strong> If infrastructure remains provisioned, AKS and Log Analytics can continue incurring cost. Use <code>azd drasi teardown --force --infrastructure</code> (or <code>azd down</code> when applicable) to clean up fully.</p>
</blockquote>
<p><img decoding="async" loading="lazy" alt="azd drasi teardown --force" src="https://luke.geek.nz/assets/images/drasiazdextensionteardown-cf271425ca6740f17357203c43c51ee9.gif" width="1005" height="327" class="img_ev3q"></p>
<p>This is force-gated by design so you are less likely to accidentally wipe an environment.</p>
<p>And a normal <code>azd down</code> works:</p>
<p><img decoding="async" loading="lazy" alt="azd down" src="https://luke.geek.nz/assets/images/azd_drasi_azddown-6fed6692267175272ba840e4195b2d2b.jpg" width="642" height="335" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="day-2-operations-notes">Day-2 Operations Notes<a href="https://luke.geek.nz/azure/drasi-azd-extension/#day-2-operations-notes" class="hash-link" aria-label="Direct link to Day-2 Operations Notes" title="Direct link to Day-2 Operations Notes" translate="no">​</a></h2>
<p>Some practical notes after using this in repeated demo cycles:</p>
<ul>
<li class="">Prefer <code>--environment</code> consistently, even in dev, so context switching is explicit.</li>
<li class="">Use <code>--output json</code> in automation jobs where you need machine-readable state.</li>
<li class="">Keep secrets in Key Vault references and out of repo config.</li>
<li class="">Use <code>validate --strict</code> as a pre-deploy gate in CI.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="gotchas-i-found">Gotchas I Found<a href="https://luke.geek.nz/azure/drasi-azd-extension/#gotchas-i-found" class="hash-link" aria-label="Direct link to Gotchas I Found" title="Direct link to Gotchas I Found" translate="no">​</a></h2>
<p><strong>Kube context confusion still happens.</strong> If your local context points at the wrong cluster, operations commands can surprise you. Prefer explicit environment targeting where possible.</p>
<p><strong>Validation is not a replacement for live diagnostics.</strong> <code>validate</code> catches config-level issues early, but connectivity/auth/runtime checks still belong to <code>diagnose</code> on a live target.</p>
<p><strong>Teardown is intentionally friction-filled.</strong> You must use <code>--force</code>, and that is a good thing.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="who-this-is-for">Who This Is For<a href="https://luke.geek.nz/azure/drasi-azd-extension/#who-this-is-for" class="hash-link" aria-label="Direct link to Who This Is For" title="Direct link to Who This Is For" translate="no">​</a></h2>
<p>This extension is useful if you:</p>
<ul>
<li class="">Deploy Drasi repeatedly across multiple environments</li>
<li class="">Want a reusable bootstrap path for sources/queries/reactions</li>
<li class="">Need cleaner team handover (same commands, same flow)</li>
<li class="">Prefer AZD-native workflows over custom one-off scripts</li>
</ul>
<p>If you only run one tiny local experiment once, this may feel like overkill. For anything beyond that, consistency pays for itself quickly.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="wrapping-up">Wrapping Up<a href="https://luke.geek.nz/azure/drasi-azd-extension/#wrapping-up" class="hash-link" aria-label="Direct link to Wrapping Up" title="Direct link to Wrapping Up" translate="no">​</a></h2>
<p>The main goal of <code>azure.drasi</code> is simple: remove the repetitive plumbing and make Drasi delivery predictable.</p>
<p>Instead of rebuilding the same script stack every time, you can use one AZD extension workflow to scaffold, validate, provision, deploy, operate, and clean up.</p>
<p>I will add more walkthrough GIFs and scenario demos over time, but the extension is already usable today for practical Drasi workflows.</p>
<blockquote>
<p>Code: <a href="https://github.com/lukemurraynz/azd.extensions.drasi" target="_blank" rel="noopener noreferrer" class="">lukemurraynz/azd.extensions.drasi</a></p>
</blockquote>
<p>If you try <code>azure.drasi</code>, I’d love your feedback:</p>
<ul>
<li class="">Issues: <a href="https://github.com/lukemurraynz/azd.extensions.drasi/issues" target="_blank" rel="noopener noreferrer" class="">Report bugs or request features</a></li>
</ul>]]></content:encoded>
            <category>Azure</category>
        </item>
        <item>
            <title><![CDATA[Remove Build-Time Environment Variables with Azure App Configuration with Front Door for Static Web Apps]]></title>
            <link>https://luke.geek.nz/azure/appconfig-frontdoor-spa/</link>
            <guid>https://luke.geek.nz/azure/appconfig-frontdoor-spa/</guid>
            <pubDate>Sat, 04 Apr 2026 04:11:36 GMT</pubDate>
            <description><![CDATA[Discover how to eliminate build-time environment variables in SPAs using Azure App Configuration with Front Door for seamless deployments.]]></description>
            <content:encoded><![CDATA[<p>Today, we are going to look at a preview feature that solves one of the most common pain points in SPA (single page application) or Static Web App deployments - build-time environment variable injection - using <a href="https://learn.microsoft.com/azure/azure-app-configuration/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure App Configuration</a> with <a href="https://learn.microsoft.com/azure/frontdoor/front-door-overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Front Door</a>.</p>
<p>If you have ever had to rebuild a React or Vue app just because the API URL changed between staging and production, this one is for you.</p>
<!-- -->
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><p>This article walks through a proof of concept using preview SDKs. The pattern is production-applicable, but the Azure Front Door integration for App Configuration is currently in <a href="https://learn.microsoft.com/azure/azure-app-configuration/concept-hyperscale-client-configuration?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">public preview</a>. SDK versions and APIs may change before GA.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-problem-everyone-has-hit">The Problem Everyone Has Hit<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#the-problem-everyone-has-hit" class="hash-link" aria-label="Direct link to The Problem Everyone Has Hit" title="Direct link to The Problem Everyone Has Hit" translate="no">​</a></h2>
<p>Every Vite, React, Next.js, or Vue developer knows this pattern:</p>
<div class="language-dockerfile codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-dockerfile codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Build stage - config is compiled INTO the JavaScript</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">ARG VITE_API_URL</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">ENV VITE_API_URL=$VITE_API_URL</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">RUN npm run build</span><br></div></code></pre></div></div>
<p>Vite replaces <code>import.meta.env.VITE_API_URL</code> with the literal string value at build time. The output JavaScript file contains <code>"https://api-staging.example.com"</code> as a hardcoded constant. To point at production, you rebuild the entire application.</p>
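<p>To make that replacement concrete, here is a toy sketch of what happens to a line of source during a staging build. The <code>inlineEnv</code> helper is hypothetical and only mimics the effect with a regular expression; it is not how Vite implements it.</p>
<div class="language-typescript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-typescript codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">// Toy stand-in for build-time replacement; an assumption for illustration, not Vite's real implementation.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">function inlineEnv(source: string, env: Record&lt;string, string&gt;): string {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  // Swap every import.meta.env.NAME reference for the quoted literal value, if one is defined.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  return source.replace(/import\.meta\.env\.(\w+)/g, (match: string, key: string) =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    env[key] !== undefined ? JSON.stringify(env[key]) : match,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  );</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// What the developer writes.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">const source = 'fetch(`${import.meta.env.VITE_API_URL}/orders`)';</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// What ends up in the bundle after a staging build.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">const bundled = inlineEnv(source, { VITE_API_URL: "https://api-staging.example.com" });</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">console.log(bundled); // fetch(`${"https://api-staging.example.com"}/orders`)</span><br></div></code></pre></div></div>
<p>Once the build finishes, the variable name is gone and only the staging hostname remains in the emitted code, so changing it means building again.</p>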
<p>This causes real problems:</p>
<ul>
<li class=""><strong>One build per environment</strong> - staging, UAT, production each need their own Docker image or pipeline run</li>
<li class=""><strong>Leaked URLs</strong> - a staging API hostname baked into a production bundle is a common incident</li>
<li class=""><strong>CI/CD coupling</strong> - your frontend pipeline needs to know infrastructure details at build time</li>
<li class=""><strong>No runtime changes</strong> - updating a feature flag or API version requires a full rebuild and redeploy</li>
</ul>
<p>Because of this issue, I developed my own Copilot skill dedicated entirely to diagnosing <code>ERR_NAME_NOT_RESOLVED</code> errors caused by incorrect build-time URLs. The fact that this needs its own troubleshooting guide tells you something about how often it goes wrong.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-changed">What Changed<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#what-changed" class="hash-link" aria-label="Direct link to What Changed" title="Direct link to What Changed" translate="no">​</a></h2>
<p>In late 2025, Azure App Configuration added <a href="https://learn.microsoft.com/azure/azure-app-configuration/concept-hyperscale-client-configuration?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Front Door integration</a>. The idea is straightforward: serve your configuration through a CDN endpoint that browsers can call directly, without authentication.</p>
<p>The architecture shift looks like this:</p>
<p><strong>Before (build-time injection):</strong></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">Build Pipeline → injects VITE_API_URL → npm run build → baked into JS bundle</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                                                              ↓</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                                              One artifact per environment</span><br></div></code></pre></div></div>
<p><strong>After (runtime fetch via CDN):</strong></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">npm run build → single artifact (no config baked in)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                         ↓</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">Browser loads app → JS calls Front Door CDN endpoint (HTTPS GET, no auth)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                         ↓</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">Front Door → (managed identity) → App Configuration store → returns JSON</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                         ↓</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">App receives { "ApiUrl": "https://api-prod.example.com", "Theme": "dark" }</span><br></div></code></pre></div></div>
<p>The built JavaScript bundle is identical across dev, staging, and production. Configuration arrives as an HTTP response at runtime, not as compiled constants.</p>
<p><img decoding="async" loading="lazy" alt="Runtime configuration and feature flags flowing from App Configuration through Front Door to the SPA" src="https://luke.geek.nz/assets/images/RuntimeVariablesFeatureFlagAppConfigFrontDoor-cb59cbe87dffac4facada6ae92ffde30.gif" width="1897" height="962" class="img_ev3q"></p>
<p><em>Runtime config and feature flags are delivered at request time via Front Door, not compiled into the bundle.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-front-door-can-i-just-use-app-configuration-directly">Why Front Door? Can I Just Use App Configuration Directly?<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#why-front-door-can-i-just-use-app-configuration-directly" class="hash-link" aria-label="Direct link to Why Front Door? Can I Just Use App Configuration Directly?" title="Direct link to Why Front Door? Can I Just Use App Configuration Directly?" translate="no">​</a></h2>
<p>This is the first question I had. Azure App Configuration already has a JavaScript SDK <a href="https://learn.microsoft.com/javascript/api/overview/azure/app-configuration-readme?view=azure-node-latest&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">(@azure/app-configuration)</a>. Why add Front Door in the middle?</p>
<blockquote>
<p>The answer is authentication. Access to App Configuration requires credentials - either a connection string or a Microsoft Entra ID token. An SPA running in a browser cannot securely hold either of these. You cannot embed a connection string in JavaScript that ships to the client. And you cannot run <code>DefaultAzureCredential</code> in a browser - there is no managed identity context.</p>
</blockquote>
<p>Front Door solves this by acting as an authentication proxy:</p>
<table><thead><tr><th></th><th>App Configuration Direct</th><th>App Configuration + Front Door</th></tr></thead><tbody><tr><td><strong>Client auth required</strong></td><td>Yes (connection string or Entra token)</td><td>No (unauthenticated HTTPS GET)</td></tr><tr><td><strong>Works in browser/SPA</strong></td><td>No (cannot hold secrets)</td><td>Yes</td></tr><tr><td><strong>Works server-side</strong></td><td>Yes (managed identity)</td><td>Yes (but overkill)</td></tr><tr><td><strong>CDN caching</strong></td><td>No</td><td>Yes (global edge, DDoS protection)</td></tr><tr><td><strong>Scoped exposure</strong></td><td>N/A (full access with credentials)</td><td>Yes (only configured key filters served)</td></tr><tr><td><strong>Feature flags</strong></td><td>Yes</td><td>Yes</td></tr><tr><td><strong>Cost</strong></td><td>App Config only</td><td>App Config + Front Door Standard/Premium</td></tr></tbody></table>
<p><strong>The rule is simple:</strong> server-side apps (APIs, Functions, background workers) use App Configuration directly with managed identity. Client-side apps (SPAs, mobile) that cannot hold secrets use App Configuration through Front Door.</p>
<p>This is not a replacement for server-side App Configuration. It is the missing piece for browser-based clients that previously had no safe way to consume runtime configuration.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="does-this-work-on-azure-static-web-apps">Does This Work on Azure Static Web Apps?<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#does-this-work-on-azure-static-web-apps" class="hash-link" aria-label="Direct link to Does This Work on Azure Static Web Apps?" title="Direct link to Does This Work on Azure Static Web Apps?" translate="no">​</a></h2>
<p>Yes. This is one of the strongest use cases.</p>
<p><a href="https://learn.microsoft.com/azure/static-web-apps/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Static Web Apps</a> serves pre-built static files from a global CDN. There is no server-side runtime to inject environment variables at request time. Today, if you need a different config per environment (staging vs production), you either:</p>
<ol>
<li class="">Rebuild the app per environment with different <code>VITE_*</code> build args</li>
<li class="">Use a workaround like a <code>/config.json</code> file served from the API backend</li>
<li class="">Use Static Web Apps <a href="https://learn.microsoft.com/azure/static-web-apps/application-settings?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">environment variables</a> injected at build time (same rebuild problem)</li>
</ol>
<p>With App Configuration + Front Door, none of this is needed. The built JavaScript makes an HTTPS <code>fetch()</code> call to the Front Door CDN endpoint when the app loads. It works the same way whether the app is hosted on Static Web Apps, Blob Storage with a CDN, or Nginx in a container. The hosting platform does not matter because the config fetch is a standard browser HTTP request.</p>
<p><img decoding="async" loading="lazy" alt="Using the Front Door URL succeeds while the direct Static Web App hostname is blocked for this pattern" src="https://luke.geek.nz/assets/images/RuntimeVariablesShowCaseFrontDoorWorkingvsStaticWebAppDirectNotWorking-77a04f4b7afbb7d4d552c6ba9d4f71e2.gif" width="1915" height="874" class="img_ev3q"></p>
<p><em>In this demo, accessing via the Front Door endpoint is the intended path; the direct Static Web App hostname is intentionally not the runtime-config path.</em></p>
<p>The deployment flow becomes:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">GitHub Actions → npm run build → deploy to Static Web App (once)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                                        ↓</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">              The same artifact serves staging AND production</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">              Config values differ per App Configuration store/labels</span><br></div></code></pre></div></div>
<p>No rebuild per environment. No pipeline secrets leaking into static assets.</p>
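<p>The build-once flow above implies that the app resolves its settings at startup rather than at build time. Here is a minimal sketch of that bootstrap step. It assumes a hypothetical <code>loadSettings</code> function (a stand-in for whatever call fetches from the Front Door endpoint) and hypothetical built-in defaults, so the page still renders if the fetch fails. None of these names come from the demo repo.</p>
<div class="language-typescript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-typescript codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv">// Hypothetical settings shape; the names are illustrative only.
interface RuntimeSettings {
  apiUrl: string;
  theme: string;
}

// Safe defaults baked into the artifact; they contain no secrets.
const fallbackSettings: RuntimeSettings = {
  apiUrl: "https://api.open-meteo.com/v1/forecast",
  theme: "light",
};

// Stand-in for the real loader; `typeof exampleLoader` below reuses its signature.
async function exampleLoader() {
  return { ...fallbackSettings, theme: "dark" };
}

// Resolve settings before the first render; fall back if the fetch fails.
async function bootstrap(
  loadSettings: typeof exampleLoader,
  render: (settings: RuntimeSettings) => void,
) {
  let settings = fallbackSettings;
  try {
    settings = await loadSettings();
  } catch (error) {
    console.warn("Runtime config unavailable, using defaults", error);
  }
  render(settings);
  return settings;
}
</code></pre></div></div>
<p>Because the loader is passed in, the same built artifact can be pointed at staging or production without changing this code.</p>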
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-scenario">The Scenario<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#the-scenario" class="hash-link" aria-label="Direct link to The Scenario" title="Direct link to The Scenario" translate="no">​</a></h2>
<p>To demonstrate this, I built a simple weather dashboard SPA. If you want the full deployable implementation (Vite app + Bicep + <code>azd</code> workflows), the companion repository is here: <a href="https://github.com/lukemurraynz/appconfig-frontdoor-spa-demo" target="_blank" rel="noopener noreferrer" class="">lukemurraynz/appconfig-frontdoor-spa-demo</a>.</p>
<p>The dashboard has three settings that traditionally would be build-time environment variables:</p>
<table><thead><tr><th>Setting</th><th>Purpose</th><th>Traditional Approach</th></tr></thead><tbody><tr><td><code>WeatherDashboard:ApiUrl</code></td><td>Backend API endpoint</td><td><code>VITE_API_URL</code> build arg</td></tr><tr><td><code>WeatherDashboard:RefreshIntervalSeconds</code></td><td>Data refresh frequency</td><td>Hardcoded or <code>VITE_REFRESH_INTERVAL</code></td></tr><tr><td><code>WeatherDashboard:Theme</code></td><td>UI theme (light/dark)</td><td><code>VITE_THEME</code> or CSS variable</td></tr></tbody></table>
<p>It also has a feature flag - <code>WeatherDashboard.ExtendedForecast</code> - that toggles an extended forecast section on and off without a code change or redeploy. This is the kind of thing you would normally hardcode or gate behind a build-time flag.</p>
<p>With App Configuration + Front Door, all three settings and the feature flag become runtime-fetched values that can be changed in the Azure portal without touching the deployed application.</p>
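<p>One way to consume these values in the app is to map the flat keys onto a typed object with fallbacks. This is a minimal sketch, assuming the settings have already been fetched into a plain key-value object. The <code>RawSettings</code> and <code>DashboardSettings</code> types, the <code>parseSettings</code> helper, and the fallback values are hypothetical and not taken from the demo repo.</p>
<div class="language-typescript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-typescript codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv">// Flat key-values as they might arrive at runtime (hypothetical shape).
type RawSettings = { [key: string]: string | undefined };

interface DashboardSettings {
  apiUrl: string;
  refreshIntervalSeconds: number;
  theme: "light" | "dark";
}

// Map raw strings onto typed settings, using fallbacks for missing or invalid values.
function parseSettings(raw: RawSettings): DashboardSettings {
  const interval = Number(raw["WeatherDashboard:RefreshIntervalSeconds"]);
  // Accept only positive whole numbers of seconds.
  const intervalIsValid = Number.isInteger(interval) ? interval > 0 : false;
  return {
    apiUrl: raw["WeatherDashboard:ApiUrl"] ?? "https://api.open-meteo.com/v1/forecast",
    refreshIntervalSeconds: intervalIsValid ? interval : 300,
    theme: raw["WeatherDashboard:Theme"] === "dark" ? "dark" : "light",
  };
}
</code></pre></div></div>
<p>Validating at this boundary means a typo made in the portal degrades to a fallback instead of breaking the dashboard.</p>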
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="setting-up-the-azure-resources">Setting Up the Azure Resources<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#setting-up-the-azure-resources" class="hash-link" aria-label="Direct link to Setting Up the Azure Resources" title="Direct link to Setting Up the Azure Resources" translate="no">​</a></h2>
<p>You need two Azure resources: an App Configuration store and an Azure Front Door profile.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-1-create-the-app-configuration-store">Step 1: Create the App Configuration Store<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#step-1-create-the-app-configuration-store" class="hash-link" aria-label="Direct link to Step 1: Create the App Configuration Store" title="Direct link to Step 1: Create the App Configuration Store" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">az appconfig create </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--name</span><span class="token plain"> appconfig-weather-demo </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  --resource-group rg-appconfig-demo </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--location</span><span class="token plain"> australiaeast </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--sku</span><span class="token plain"> Standard</span><br></div></code></pre></div></div>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>The Free tier works for testing, but Standard is required for production workloads (replicas, Private Link, higher request limits).</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-2-add-configuration-values">Step 2: Add Configuration Values<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#step-2-add-configuration-values" class="hash-link" aria-label="Direct link to Step 2: Add Configuration Values" title="Direct link to Step 2: Add Configuration Values" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">az appconfig kv </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">set</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--name</span><span class="token plain"> appconfig-weather-demo </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--key</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"WeatherDashboard:ApiUrl"</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--value</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"https://api.open-meteo.com/v1/forecast"</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">-y</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">az appconfig kv </span><span 
class="token builtin class-name" style="color:rgb(189, 147, 249)">set</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--name</span><span class="token plain"> appconfig-weather-demo </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--key</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"WeatherDashboard:RefreshIntervalSeconds"</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--value</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"300"</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">-y</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">az appconfig kv </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">set</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--name</span><span class="token plain"> appconfig-weather-demo </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token 
plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--key</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"WeatherDashboard:Theme"</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--value</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"light"</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">-y</span><br></div></code></pre></div></div>
<p>I am using the <a href="https://open-meteo.com/" target="_blank" rel="noopener noreferrer" class="">Open-Meteo API</a> here because it is free, requires no API key, and returns real weather data. This keeps the demo self-contained with no additional service dependencies.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="add-a-feature-flag">Add a Feature Flag<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#add-a-feature-flag" class="hash-link" aria-label="Direct link to Add a Feature Flag" title="Direct link to Add a Feature Flag" translate="no">​</a></h4>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">az appconfig feature </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">set</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--name</span><span class="token plain"> appconfig-weather-demo </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--feature</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"WeatherDashboard.ExtendedForecast"</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--description</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"Show extended 3-day forecast section"</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">-y</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">az appconfig feature </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">enable</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--name</span><span class="token plain"> appconfig-weather-demo </span><span class="token punctuation" style="color:rgb(248, 248, 242)">\</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--feature</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"WeatherDashboard.ExtendedForecast"</span><span class="token plain"> </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">-y</span><br></div></code></pre></div></div>
<p>Feature flags in App Configuration are stored as key-values with a reserved prefix (<code>.appconfig.featureflag/</code>). When you configure the Front Door endpoint, the <strong>Key of feature flag filter</strong> field controls which flags are exposed. Set it to <code>WeatherDashboard.*</code> to match our flag.</p>
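<p>To make the storage format concrete: a feature flag is a key-value whose key is the reserved prefix plus the flag name, and whose value is a JSON document with an <code>enabled</code> field. Here is a minimal sketch of reading one by hand. The <code>FlagKeyValue</code> type and <code>isFlagEnabled</code> helper are hypothetical, and the sketch ignores conditional filters. In a real app you would normally let a feature management library evaluate flags.</p>
<div class="language-typescript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-typescript codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv">// Reserved key prefix that App Configuration uses for feature flags.
const FEATURE_FLAG_PREFIX = ".appconfig.featureflag/";

// Hypothetical minimal view of a stored key-value.
interface FlagKeyValue {
  key: string;
  value: string; // JSON document describing the flag
}

// Return true only when the key-value is the named flag and it is switched on.
// Conditional filters are ignored here; this is a sketch, not a full evaluator.
function isFlagEnabled(kv: FlagKeyValue, flagName: string): boolean {
  if (kv.key !== FEATURE_FLAG_PREFIX + flagName) {
    return false;
  }
  try {
    const flag = JSON.parse(kv.value);
    return flag.enabled === true;
  } catch {
    return false; // Malformed or unexpected JSON is treated as "off".
  }
}
</code></pre></div></div>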
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-3-connect-azure-front-door">Step 3: Connect Azure Front Door<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#step-3-connect-azure-front-door" class="hash-link" aria-label="Direct link to Step 3: Connect Azure Front Door" title="Direct link to Step 3: Connect Azure Front Door" translate="no">​</a></h3>
<p>In the Azure portal:</p>
<ol>
<li class="">
<p>Navigate to your App Configuration store</p>
</li>
<li class="">
<p>Under <strong>Settings</strong>, select <strong>Azure Front Door (preview)</strong></p>
</li>
<li class="">
<p>Select <strong>Create new</strong> profile</p>
</li>
<li class="">
<p>Configure:</p>
<ul>
<li class=""><strong>Profile name</strong>: <code>afd-weather-config</code></li>
<li class=""><strong>Pricing tier</strong>: Standard</li>
<li class=""><strong>Endpoint name</strong>: <code>weather-config</code></li>
<li class=""><strong>Origin host name</strong>: select your App Configuration store</li>
<li class=""><strong>Identity type</strong>: System-assigned managed identity</li>
<li class=""><strong>Cache Duration</strong>: 10 minutes</li>
<li class=""><strong>Key filter</strong>: <code>WeatherDashboard:*</code></li>
<li class=""><strong>Feature flag filter</strong>: <code>WeatherDashboard.*</code></li>
</ul>
</li>
<li class="">
<p>Select <strong>Create &amp; Connect</strong></p>
</li>
</ol>
<p>The portal automatically assigns the <strong>App Configuration Data Reader</strong> role to the managed identity.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>The key filter you configure on the Front Door endpoint must <strong>exactly match</strong> the selector in your application code. If your app requests <code>WeatherDashboard:*</code> but Front Door is configured for <code>Weather:*</code>, the request will be rejected. This is the most common setup mistake.</p></div></div>
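<p>One way to reduce the risk of a mismatch is to keep the selectors in a single constant and compare them against the filters you configured, for example in a unit test. This is a minimal sketch. The constant names and the <code>findUnconfiguredSelectors</code> helper are hypothetical, and the comparison is a plain string equality check that mirrors the exact-match rule above.</p>
<div class="language-typescript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-typescript codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv">// Filters as configured on the Front Door endpoint (copied by hand from the portal).
const configuredFilters = ["WeatherDashboard:*"];

// Selectors the application requests; keep them in one place.
const appSelectors = ["WeatherDashboard:*"];

// Return every selector that has no identical configured filter.
function findUnconfiguredSelectors(selectors: string[], filters: string[]): string[] {
  return selectors.filter((selector) => filters.indexOf(selector) === -1);
}

// An empty result means every selector has an exactly matching filter.
const mismatches = findUnconfiguredSelectors(appSelectors, configuredFilters);
console.log(mismatches.length === 0 ? "Selectors match" : "Mismatched: " + mismatches.join(", "));
</code></pre></div></div>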
<p>After creation, note your Front Door endpoint URL from the <strong>Existing endpoints</strong> table. It looks like: <code>https://weather-config-xxxxxxxxx.z01.azurefd.net</code></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-this-looks-like-in-iac-from-my-demo-repo">What This Looks Like in IaC (from my demo repo)<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#what-this-looks-like-in-iac-from-my-demo-repo" class="hash-link" aria-label="Direct link to What This Looks Like in IaC (from my demo repo)" title="Direct link to What This Looks Like in IaC (from my demo repo)" translate="no">​</a></h3>
<p>The demo also codifies the App Configuration-to-Front Door relationship in Bicep, so it is reproducible across environments. I had to reverse engineer the ARM template here: <a href="https://github.com/azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.appconfiguration/app-configuration-afd" target="_blank" rel="noopener noreferrer" class="">App Configuration integration with Azure Front Door</a>.</p>
<p><strong>1. App Configuration resource linked to Front Door profile</strong> (<code>infra/main.bicep</code>):</p>
<div class="language-bicep codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bicep codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">resource</span><span class="token plain"> appConfig </span><span class="token string" style="color:rgb(255, 121, 198)">'Microsoft.AppConfiguration/configurationStores@2025-06-01-preview'</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token property">name</span><span class="token operator">:</span><span class="token plain"> appConfigName</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token property">location</span><span class="token operator">:</span><span class="token plain"> location</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token property">sku</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token property">name</span><span class="token operator">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'standard'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token property">properties</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token property">azureFrontDoor</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token property">resourceId</span><span class="token operator">:</span><span class="token plain"> frontDoorProfileRef</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">id</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><br></div></code></pre></div></div>
<p><strong>2. AFD managed identity auth scope for App Configuration origin</strong> (<code>infra/modules/frontdoor-environment.bicep</code>):</p>
<div class="language-bicep codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bicep codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">resource</span><span class="token plain"> configOriginGroup </span><span class="token string" style="color:rgb(255, 121, 198)">'Microsoft.Cdn/profiles/originGroups@2025-06-01'</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token property">parent</span><span class="token operator">:</span><span class="token plain"> frontDoorProfile</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token property">name</span><span class="token operator">:</span><span class="token plain"> configOriginGroupName</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token property">properties</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token property">authentication</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token 
plain">      </span><span class="token property">type</span><span class="token operator">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'SystemAssignedIdentity'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token property">scope</span><span class="token operator">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'https://appconfig.azure.com/.default'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><br></div></code></pre></div></div>
<p>That <code>scope</code> value is the token audience that Front Door (AFD) requests for App Configuration. Combined with the <code>App Configuration Data Reader</code> role assignment, it lets Front Door fetch config on behalf of the browser while keeping credentials out of client code.</p>
<p><img decoding="async" loading="lazy" alt="Feature flags and runtime values loaded through Front Door with environment-specific behavior" src="https://luke.geek.nz/assets/images/RuntimeVariablesFeatureFlagAppConfigFrontDoorShowMultipleEnvironmentsSharedFrontDoor-29f9eb00ad69809f3e318f957ae25b0d.gif" width="1371" height="874" class="img_ev3q"></p>
<p><em>This is the live outcome: runtime values and feature flags can differ by environment without rebuilding the SPA.</em></p>
<p>If you want to deploy exactly this setup, use the repo's <code>azd up</code> flow and scripts documented in <a href="https://github.com/lukemurraynz/appconfig-frontdoor-spa-demo/blob/main/README.md" target="_blank" rel="noopener noreferrer" class="">the demo README</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="building-the-weather-dashboard">Building the Weather Dashboard<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#building-the-weather-dashboard" class="hash-link" aria-label="Direct link to Building the Weather Dashboard" title="Direct link to Building the Weather Dashboard" translate="no">​</a></h2>
<p>The demo is a vanilla TypeScript application built with Vite. No framework dependencies beyond what I needed to demonstrate the pattern.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="project-setup">Project Setup<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#project-setup" class="hash-link" aria-label="Direct link to Project Setup" title="Direct link to Project Setup" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token function" style="color:rgb(80, 250, 123)">npm</span><span class="token plain"> create vite@latest weather-dashboard -- </span><span class="token parameter variable" style="color:rgb(189, 147, 249);font-style:italic">--template</span><span class="token plain"> vanilla-ts</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">cd</span><span class="token plain"> weather-dashboard</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token function" style="color:rgb(80, 250, 123)">npm</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">install</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token function" style="color:rgb(80, 250, 123)">npm</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">install</span><span class="token plain"> @azure/app-configuration-provider@2.3.0-preview.1</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token function" style="color:rgb(80, 250, 123)">npm</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">install</span><span class="token plain"> @microsoft/feature-management</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-configuration-loader">The Configuration Loader<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#the-configuration-loader" class="hash-link" aria-label="Direct link to The Configuration Loader" title="Direct link to The Configuration Loader" translate="no">​</a></h3>
<p>Create <code>src/config.ts</code>:</p>
<div class="language-typescript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-typescript codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">import</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"> loadFromAzureFrontDoor </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"@azure/app-configuration-provider"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">import</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  FeatureManager</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  ConfigurationMapFeatureFlagProvider</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 
242)">}</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"@microsoft/feature-management"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">export</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">interface</span><span class="token plain"> </span><span class="token class-name">AppConfig</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  apiUrl</span><span class="token operator">:</span><span class="token plain"> </span><span class="token builtin" style="color:rgb(189, 147, 249)">string</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  refreshIntervalSeconds</span><span class="token operator">:</span><span class="token plain"> </span><span class="token builtin" style="color:rgb(189, 147, 249)">number</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  theme</span><span class="token operator">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"light"</span><span class="token 
plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"dark"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  featureManager</span><span class="token operator">:</span><span class="token plain"> FeatureManager</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">const</span><span class="token plain"> </span><span class="token constant" style="color:rgb(189, 147, 249)">AFD_ENDPOINT</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">import</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">meta</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">env</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token constant" style="color:rgb(189, 147, 249)">VITE_AFD_ENDPOINT</span><span class="token plain"> </span><span class="token operator">??</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#F8F8F2"><span class="token plain">  </span><span class="token string" style="color:rgb(255, 121, 198)">"https://weather-config-xxxxxxxxx.z01.azurefd.net"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">export</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">async</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">function</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">loadConfig</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token operator">:</span><span class="token plain"> </span><span class="token builtin" style="color:rgb(189, 147, 249)">Promise</span><span class="token operator">&lt;</span><span class="token plain">AppConfig</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">const</span><span class="token plain"> settingsMap </span><span class="token operator">=</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">await</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 
123)">loadFromAzureFrontDoor</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token constant" style="color:rgb(189, 147, 249)">AFD_ENDPOINT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    selectors</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"> keyFilter</span><span class="token operator">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"WeatherDashboard:*"</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    featureFlagOptions</span><span class="token operator">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"> enabled</span><span class="token operator">:</span><span class="token plain"> </span><span class="token boolean">true</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    refreshOptions</span><span class="token operator">:</span><span class="token plain"> 
</span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      enabled</span><span class="token operator">:</span><span class="token plain"> </span><span class="token boolean">true</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      refreshIntervalInMs</span><span class="token operator">:</span><span class="token plain"> </span><span class="token number">60_000</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">const</span><span class="token plain"> featureManager </span><span class="token operator">=</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">new</span><span class="token plain"> </span><span class="token class-name">FeatureManager</span><span class="token 
punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">new</span><span class="token plain"> </span><span class="token class-name">ConfigurationMapFeatureFlagProvider</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">settingsMap</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">return</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    apiUrl</span><span class="token operator">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      settingsMap</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token function" style="color:rgb(80, 250, 123)">get</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">"WeatherDashboard:ApiUrl"</span><span class="token punctuation" 
style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">??</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token string" style="color:rgb(255, 121, 198)">"https://api.open-meteo.com/v1/forecast"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    refreshIntervalSeconds</span><span class="token operator">:</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">parseInt</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      settingsMap</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token function" style="color:rgb(80, 250, 123)">get</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">"WeatherDashboard:RefreshIntervalSeconds"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">??</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"300"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token number">10</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(248, 248, 
242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    theme</span><span class="token operator">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">settingsMap</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token function" style="color:rgb(80, 250, 123)">get</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">"WeatherDashboard:Theme"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"light"</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"dark"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">??</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token string" style="color:rgb(255, 121, 198)">"light"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    featureManager</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></div><div class="token-line" 
style="color:#F8F8F2"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><br></div></code></pre></div></div>
<p>Two things to notice:</p>
<ol>
<li class=""><code>featureFlagOptions: { enabled: true }</code> tells the provider to load feature flags alongside key-values. Feature flags use the reserved <code>.appconfig.featureflag/</code> prefix, which the provider handles automatically.</li>
<li class=""><code>ConfigurationMapFeatureFlagProvider</code> wraps the settings map so <code>FeatureManager</code> can evaluate flags. You then use <code>featureManager.isEnabled("WeatherDashboard.ExtendedForecast")</code> anywhere in your app.</li>
</ol>
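<p>As a minimal sketch of how that check might gate a section of the UI: the <code>FlagEvaluator</code> interface below is a hypothetical stand-in that mirrors only the <code>isEnabled</code> call (which returns a promise in the JavaScript library), and <code>visibleSections</code> and the section names are illustrative, not taken from the demo app.</p>

```typescript
// Hypothetical structural type: just the one FeatureManager method this sketch uses.
interface FlagEvaluator {
  isEnabled(featureName: string): Promise<boolean>;
}

// Illustrative helper: decide which dashboard sections to render.
async function visibleSections(flags: FlagEvaluator): Promise<string[]> {
  const sections = ["current-weather"];
  // Flag name from the walkthrough; evaluated at runtime, not at build time.
  if (await flags.isEnabled("WeatherDashboard.ExtendedForecast")) {
    sections.push("extended-forecast");
  }
  return sections;
}
```

<p>Because the helper only depends on the <code>isEnabled</code> shape, you can pass the real <code>featureManager</code> in the app and a stub in unit tests.</p>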
<p>The only "baked in" value is the Front Door endpoint URL itself. This URL is stable per environment and rarely changes, unlike API endpoints, feature flags, and display settings. You could also inject it as a single build arg or serve it from a <code>/config.json</code> on the same host.</p>
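<p>If you go the <code>/config.json</code> route, the lookup might look something like the sketch below. The <code>afdEndpoint</code> field name, the <code>FetchLike</code> type, and the <code>resolveAfdEndpoint</code> helper are assumptions for illustration; in a browser you would pass the global <code>fetch</code>.</p>

```typescript
// Minimal shape of the fetch response this sketch relies on (an assumption, so it can be stubbed).
type FetchLike = (url: string) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

// Hypothetical helper: read the Front Door endpoint from /config.json on the same origin,
// falling back to a supplied default when the file is missing or malformed.
async function resolveAfdEndpoint(fetchFn: FetchLike, fallback: string): Promise<string> {
  try {
    const response = await fetchFn("/config.json");
    if (!response.ok) return fallback;
    const body = (await response.json()) as { afdEndpoint?: unknown } | null;
    const endpoint = body?.afdEndpoint;
    // Only accept an https URL; anything else falls back.
    return typeof endpoint === "string" && endpoint.startsWith("https://") ? endpoint : fallback;
  } catch {
    return fallback; // network error or invalid JSON
  }
}
```

<p>The fallback keeps the app usable if the file is absent, which matches the default-value approach already used for <code>AFD_ENDPOINT</code> in the loader.</p>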
<p>The feature flag evaluation happens at runtime on every refresh cycle. Toggle <code>WeatherDashboard.ExtendedForecast</code> on or off in the Azure portal, and the extended forecast section appears or disappears on the next refresh - no rebuild, no redeploy.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="running-it">Running It<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#running-it" class="hash-link" aria-label="Direct link to Running It" title="Direct link to Running It" translate="no">​</a></h2>
<p>Open the deployed website. You should see:</p>
<ol>
<li class="">A brief "Loading configuration from Azure Front Door..." message</li>
<li class="">The weather card populated with real Auckland weather data</li>
<li class="">A footer showing the config source: <code>Config loaded at runtime via CDN | API: https://api.open-meteo.com/v1/forecast | Refresh: 300s | Theme: light</code></li>
</ol>
<p>Now go to the Azure portal and try two things:</p>
<ol>
<li class="">Change <code>WeatherDashboard:Theme</code> from <code>light</code> to <code>dark</code> - the app switches themes on the next refresh</li>
<li class="">Disable the <code>WeatherDashboard.ExtendedForecast</code> feature flag - the 3-day forecast section disappears</li>
</ol>
<p>Both changes take effect without a rebuild or redeploy. The status bar shows the feature flag state so you can confirm it is working.</p>
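<p>The refresh that picks up these changes is driven from client code. A minimal sketch, assuming the settings object returned by the provider exposes <code>refresh()</code> and <code>get()</code>: the <code>RefreshableSettings</code> interface is a hypothetical stand-in for it, and <code>parseTheme</code> and <code>refreshTheme</code> are illustrative helpers rather than code from the demo.</p>

```typescript
// Hypothetical stand-in for the settings object returned by the provider.
interface RefreshableSettings {
  refresh(): Promise<void>;
  get(key: string): unknown;
}

type Theme = "light" | "dark";

// Narrow an untrusted config value to a known theme instead of casting it.
function parseTheme(value: unknown, fallback: Theme = "light"): Theme {
  return value === "light" || value === "dark" ? value : fallback;
}

// Illustrative helper: ask the provider to refresh, then re-read the theme key.
async function refreshTheme(settings: RefreshableSettings): Promise<Theme> {
  await settings.refresh();
  return parseTheme(settings.get("WeatherDashboard:Theme"));
}
```

<p>You would call something like this on a timer (for example, with <code>setInterval</code>) and re-render only when the returned value differs from the current one.</p>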
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-docker-build---one-artifact-every-environment">The Docker Build - One Artifact, Every Environment<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#the-docker-build---one-artifact-every-environment" class="hash-link" aria-label="Direct link to The Docker Build - One Artifact, Every Environment" title="Direct link to The Docker Build - One Artifact, Every Environment" translate="no">​</a></h2>
<p>Here is where the value becomes concrete. The Dockerfile no longer needs environment-specific build args:</p>
<div class="language-dockerfile codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-dockerfile codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">FROM node:22-alpine AS build</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">WORKDIR /app</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">COPY package*.json .</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">RUN npm ci</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">COPY . .</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">RUN npm run build</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">FROM nginx:alpine</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">COPY --from=build /app/dist /usr/share/nginx/html</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">EXPOSE 80</span><br></div></code></pre></div></div>
<p>No <code>ARG VITE_API_URL</code>. No <code>ENV VITE_API_URL</code>. The same image runs in dev, staging, and production.</p>
<p>The only environment-specific value is the Front Door endpoint URL, which you can inject via a single environment variable or serve from a static <code>/config.json</code> on the same origin. Everything else - API URLs, refresh intervals, themes, feature flags - comes from App Configuration through Front Door at runtime.</p>
<p><img decoding="async" loading="lazy" alt="Single build across multiple environments with shared Front Door and isolated configuration" src="https://luke.geek.nz/assets/images/RuntimeVariablesShowCaseSharedFrontDoorMultipleEnvironmentsStandaloneConfig-e84802266dc6607a748046f0d80f4675.gif" width="1904" height="963" class="img_ev3q"></p>
<p><em>One artifact, multiple environments: shared Front Door profile, separate endpoints/stores, isolated runtime config.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="security-considerations">Security Considerations<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#security-considerations" class="hash-link" aria-label="Direct link to Security Considerations" title="Direct link to Security Considerations" translate="no">​</a></h2>
<p>The Front Door endpoint is unauthenticated. Any browser (or <code>curl</code>) can hit it. This is the same threat model as any public CDN asset.</p>
<p><strong>What is safe to serve through this channel:</strong></p>
<ul>
<li class="">UI themes and display strings</li>
<li class="">Public API base URLs (these are already visible in your JS bundle today)</li>
<li class="">Feature flags for non-sensitive features</li>
<li class="">Version numbers and refresh intervals</li>
</ul>
<p><strong>What should never go through this channel:</strong></p>
<ul>
<li class="">API keys, tokens, or connection strings</li>
<li class="">Internal service URLs that reveal infrastructure</li>
<li class="">Business-critical pricing or logic config that competitors should not see</li>
</ul>
<p>Sensitive configuration stays server-side with managed identity authentication. The Front Door channel is for config that is already effectively public in your shipped JavaScript bundle.</p>
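<p>One cheap guardrail for that rule is a CI check that flags key names that look like secrets before they land in a store exposed through Front Door. This is a rough heuristic sketch: the pattern list and the <code>findSuspiciousKeys</code> helper are assumptions of mine, not part of any Azure tooling, and a clean result does not prove a value is safe to publish.</p>

```typescript
// Assumed, non-exhaustive patterns for key names that suggest a secret.
const SECRET_LIKE: RegExp[] = [
  /secret/i,
  /password/i,
  /token/i,
  /api[-_]?key/i,
  /connection[-_]?string/i,
];

// Illustrative helper: return the key names that should not be served publicly.
function findSuspiciousKeys(keys: string[]): string[] {
  return keys.filter((key) => SECRET_LIKE.some((pattern) => pattern.test(key)));
}
```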
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="gotchas-i-found">Gotchas I Found<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#gotchas-i-found" class="hash-link" aria-label="Direct link to Gotchas I Found" title="Direct link to Gotchas I Found" translate="no">​</a></h2>
<p><strong>Filter matching is character-exact.</strong> The <code>keyFilter</code> in your JavaScript must match the filter configured on the Front Door endpoint character-for-character. Pairing <code>WeatherDashboard:*</code> in code with <code>WeatherDashboard*</code> (no colon) in Front Door results in a rejected request with no useful error message.</p>
<p><strong>No sentinel key refresh.</strong> Unlike server-side App Configuration, you cannot use a sentinel key to trigger refresh. The SDK uses "monitor all selected keys" mode, which checks all keys for changes on the refresh interval.</p>
<p><strong>Cache TTL matters.</strong> Front Door caches responses. If you set a 10-minute cache TTL, config changes take up to 10 minutes to reach clients. Setting it too low increases origin requests and risks throttling your App Configuration store.</p>
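<p>As a back-of-the-envelope estimate, the client refresh interval stacks on top of the cache TTL: a client may have refreshed just before the cached entry expired. The sketch below assumes the worst case is simply the two added together (my simplification, ignoring network time); <code>worstCaseStalenessSeconds</code> is an illustrative helper.</p>

```typescript
// Rough upper bound on how long a change can take to reach a client, in seconds.
// Assumption: staleness is bounded by the cache TTL plus one client refresh interval.
function worstCaseStalenessSeconds(cacheTtlSeconds: number, refreshIntervalMs: number): number {
  return cacheTtlSeconds + refreshIntervalMs / 1000;
}

// 10-minute cache TTL with the 60_000 ms refresh interval from the loader.
const bound = worstCaseStalenessSeconds(600, 60_000); // 660 seconds, i.e. 11 minutes
```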
<p><strong>Language support is limited.</strong> As of April 2026, only JavaScript (<code>@azure/app-configuration-provider</code> v2.3.0-preview) and .NET (<code>Microsoft.Extensions.Configuration.AzureAppConfiguration</code> v8.5.0-preview) have Front Door support. Java, Python, and Go are listed as "work in progress."</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-to-use-this-pattern">When to Use This Pattern<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#when-to-use-this-pattern" class="hash-link" aria-label="Direct link to When to Use This Pattern" title="Direct link to When to Use This Pattern" translate="no">​</a></h2>
<p>This pattern makes sense when:</p>
<ul>
<li class="">You deploy the same SPA to multiple environments and are tired of rebuilding per environment</li>
<li class="">You want to change feature flags or display settings without a CI/CD run</li>
<li class="">Your SPA currently uses <code>VITE_*</code> or <code>NEXT_PUBLIC_*</code> build args for configuration that changes between environments</li>
<li class="">You need CDN-level performance for config delivery (global latency, DDoS protection)</li>
</ul>
<p>It is less suited for:</p>
<ul>
<li class="">Server-rendered applications (use server-side App Configuration with managed identity instead)</li>
<li class="">Apps with only one or two config values that genuinely never change</li>
<li class="">Configurations containing secrets (these must stay server-side)</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="wrapping-up">Wrapping Up<a href="https://luke.geek.nz/azure/appconfig-frontdoor-spa/#wrapping-up" class="hash-link" aria-label="Direct link to Wrapping Up" title="Direct link to Wrapping Up" translate="no">​</a></h2>
<p>Build-time environment variable injection for SPAs is a pattern that works until it does not. The moment you need multiple environments, runtime config changes, or the same artifact deployed across regions, the rebuild-per-environment model becomes a liability.</p>
<p>Azure App Configuration with Front Door moves SPA configuration from compile-time constants to runtime-fetched data, delivered through a CDN. The trade-off is clear: you accept eventual consistency (cache TTL) and a public endpoint (no per-client auth) in exchange for a single build artifact and runtime configuration changes.</p>
<p>The feature is still in preview, and the SDK support is limited to JavaScript and .NET. But the architectural pattern - fetch config as data, not compile it as code - is sound and worth exploring now.</p>
<blockquote>
<p>Want to deploy this exact walkthrough end-to-end? Start with the companion repo: <a href="https://github.com/lukemurraynz/appconfig-frontdoor-spa-demo" target="_blank" rel="noopener noreferrer" class="">lukemurraynz/appconfig-frontdoor-spa-demo</a> (includes Bicep, <code>azd</code> provisioning, and runtime config/feature-flag demo scripts).</p>
<p>You can also check the official Microsoft samples on GitHub: <a href="https://github.com/Azure-Samples/appconfig-javascript-clientapp-with-afd" target="_blank" rel="noopener noreferrer" class="">JavaScript SPA sample</a> (a full React chatbot with A/B testing across LLM models) and <a href="https://github.com/Azure-Samples/appconfig-maui-app-with-afd" target="_blank" rel="noopener noreferrer" class="">.NET MAUI sample</a>.</p>
</blockquote>]]></content:encoded>
            <category>Azure</category>
        </item>
        <item>
            <title><![CDATA[NimbusIQ: Multi-Agent Azure Drift Remediation]]></title>
            <link>https://luke.geek.nz/azure/nimbusiq/</link>
            <guid>https://luke.geek.nz/azure/nimbusiq/</guid>
            <pubDate>Sun, 15 Mar 2026 10:24:44 GMT</pubDate>
            <description><![CDATA[A deep dive into NimbusIQ, my AI Dev Days Hackathon project for Azure estate analysis, drift detection, prioritised remediation, and reviewable IaC generation.]]></description>
            <content:encoded><![CDATA[<p>As the AI Dev Days Hackathon comes to an end, I want to share my submission.</p>
<p>Today, I want to walk through something I have been building over the last wee while - a project called <strong>NimbusIQ</strong>. It is my submission for the <a href="https://developer.microsoft.com/en-us/reactor/events/26647/" target="_blank" rel="noopener noreferrer" class="">AI Dev Days Hackathon</a>, and it sits across the <strong>Best Multi-Agent System</strong> and <strong>Best Enterprise Solution</strong> categories.</p>
<p>At its core, NimbusIQ is built on <a href="https://learn.microsoft.com/en-us/agent-framework/overview/?pivots=programming-language-csharp&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Agent Framework</a> - Microsoft's orchestration layer for composing multi-agent pipelines in .NET. It gives you a <code>WorkflowBuilder</code> pattern for wiring agents together with explicit edges, lifecycle management via <code>InProcessExecution</code>, and the structure needed to run ten specialised agents in a coordinated sequence without the whole thing becoming a tangle of custom plumbing.</p>
<p><img decoding="async" loading="lazy" alt="NimbusIQ Dashboard" src="https://luke.geek.nz/assets/images/NimbusIQDashboard-fe2f92c126c81e4f18a1af4053322173.png" width="1891" height="950" class="img_ev3q"></p>
<p><img decoding="async" loading="lazy" alt="Nimbus IQ - Recommendations Blade" src="https://luke.geek.nz/assets/images/NimbusIQRecommendPaneDisplay-bd2425f4c5d346bb663bb7bb2a2050a3.png" width="1890" height="903" class="img_ev3q"></p>
<p>I spend some time working with Azure environments - helping teams understand their estates, finding configuration drift, catching orphaned resources, and figuring out what to fix first. If you have done any of that work, you will know the pain. Azure gives you no shortage of signals: <a href="https://learn.microsoft.com/azure/advisor/advisor-overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Advisor</a>, <a href="https://learn.microsoft.com/azure/governance/resource-graph/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Resource Graph</a>, <a href="https://learn.microsoft.com/azure/cost-management-billing/costs/overview-cost-management?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Cost Management</a>, <a href="https://azure.github.io/PSRule.Rules.Azure/" target="_blank" rel="noopener noreferrer" class="">PSRule for Azure</a>, <a href="https://azure.github.io/azqr/docs/" target="_blank" rel="noopener noreferrer" class="">Azure Quick Review</a>, Policy, Monitor - the list goes on. The problem is not a lack of data. The problem is that all of these signals live in different dashboards, different exports, and different tools. Nobody is joining them up.</p>
<p>So I thought to myself: what if I could build something that does the bit that currently requires a human cloud architect? Not the detection - Azure already does that well enough - but the reasoning, prioritisation, and remediation planning that happens after detection, scoped per <a href="https://learn.microsoft.com/en-us/azure/governance/service-groups/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Service Group</a>.</p>
<blockquote>
<p>That is what NimbusIQ aimed to do.</p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-nimbusiq">What is NimbusIQ?<a href="https://luke.geek.nz/azure/nimbusiq/#what-is-nimbusiq" class="hash-link" aria-label="Direct link to What is NimbusIQ?" title="Direct link to What is NimbusIQ?" translate="no">​</a></h2>
<p>In short, NimbusIQ is a multi-agent AI platform that continuously discovers your Azure estate, detects drift and policy violations, reasons across cost, reliability, sustainability, and governance signals, and produces remediation plans that a human can review and approve before anything gets applied.</p>
<p>It uses:</p>
<ul>
<li class=""><a href="https://learn.microsoft.com/en-us/agent-framework/overview/?pivots=programming-language-csharp&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Agent Framework</a> for agent orchestration</li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/foundry/what-is-foundry?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Foundry</a> (GPT-4) for the reasoning and narrative generation</li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/developer/azure-mcp-server/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure MCP</a> for grounded Azure capability discovery</li>
<li class=""><a href="https://learn.microsoft.com/azure/container-apps/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Container Apps</a>, PostgreSQL, Key Vault, managed identity, and OpenTelemetry for the runtime</li>
</ul>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>View the source</div><div class="admonitionContent_BuS1"><p>The full source code is on GitHub: <strong><a href="https://github.com/lukemurraynz/NimbusIQ" target="_blank" rel="noopener noreferrer" class="">github.com/lukemurraynz/NimbusIQ</a></strong> - feel free to explore, fork, or open PRs.</p></div></div>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>This was created purely for the Hackathon, with a fair amount of hypervelocity engineering effort, although I have done my best to build in production logic, i.e. security and resilience (circuit breakers, fallback endpoints, etc.). It is missing Entra ID authentication and various other functions, and it comes with no support, so use it at your own risk.</p></div></div>
<p>The whole thing deploys with <code>azd up</code>.</p>
<p><img decoding="async" loading="lazy" alt="NimbusIQ platform overview" src="https://luke.geek.nz/assets/images/NimbusIQOverview-427bee6951747c2aa4a77fa6ed6b7fd9.gif" width="1897" height="998" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-problem-i-was-trying-to-solve">The problem I was trying to solve<a href="https://luke.geek.nz/azure/nimbusiq/#the-problem-i-was-trying-to-solve" class="hash-link" aria-label="Direct link to The problem I was trying to solve" title="Direct link to The problem I was trying to solve" translate="no">​</a></h2>
<p>If you manage Azure estates at any sort of scale, you have probably lived this loop:</p>
<ol>
<li class="">Gather evidence from multiple Azure tools</li>
<li class="">Interpret what actually changed and whether it matters</li>
<li class="">Decide whether cost, reliability, compliance, or architecture should take priority</li>
<li class="">Draft a remediation plan</li>
<li class="">Route it through approval</li>
<li class="">Hope the action actually improved things</li>
</ol>
<p>That loop is manual, slow, and happens in spreadsheets or meeting rooms. The tools tell you <strong>what</strong> is wrong, but very few of them can tell you <strong>why</strong> it matters for a specific workload, <strong>what</strong> you should fix first, <strong>how</strong> to remediate it safely, or <strong>whether</strong> the change you made actually delivered value.</p>
<p>NimbusIQ automates that decision-support loop.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-nimbusiq-differs-from-existing-tools">How NimbusIQ differs from existing tools<a href="https://luke.geek.nz/azure/nimbusiq/#how-nimbusiq-differs-from-existing-tools" class="hash-link" aria-label="Direct link to How NimbusIQ differs from existing tools" title="Direct link to How NimbusIQ differs from existing tools" translate="no">​</a></h2>
<p>I want to be clear - NimbusIQ is not a replacement for Azure Advisor, PSRule for Azure, or Azure Quick Review. Those are solid detection and standards tools, and NimbusIQ actually uses their rule sets internally. What NimbusIQ adds is the orchestration and decision-support layer that sits above them.</p>
<table><thead><tr><th>Capability</th><th>Azure Advisor</th><th>PSRule</th><th>Azure Quick Review</th><th>NimbusIQ</th></tr></thead><tbody><tr><td>Detect configuration violations</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td></tr><tr><td>Continuous drift trending</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td></tr><tr><td>AI-powered reasoning across signals</td><td>✗</td><td>✗</td><td>✗</td><td>✓ (6 LLM agents)</td></tr><tr><td>Workload-scoped analysis</td><td>✗</td><td>✗</td><td>✗</td><td>✓ (Azure Service Groups)</td></tr><tr><td>Generate deployable IaC (Bicep/Terraform)</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td></tr><tr><td>Dual-control approval workflow</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td></tr><tr><td>Explain WHY issues exist</td><td>~Basic</td><td>~Pattern-based</td><td>~Checklist-based</td><td>✓ (AI narrative)</td></tr><tr><td>Track value realisation</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td></tr><tr><td>Auditable agent-to-agent lineage</td><td>✗</td><td>✗</td><td>✗</td><td>✓ (A2A tracing)</td></tr></tbody></table>
<p>The way I think about it: if Azure Advisor is a dashboard, NimbusIQ is a cloud architect in the loop.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architecture">The architecture<a href="https://luke.geek.nz/azure/nimbusiq/#the-architecture" class="hash-link" aria-label="Direct link to The architecture" title="Direct link to The architecture" translate="no">​</a></h2>
<p>NimbusIQ has three services:</p>
<ol>
<li class=""><strong>Frontend</strong> - React with <a href="https://storybooks.fluentui.dev/react/" target="_blank" rel="noopener noreferrer" class="">Fluent UI v9</a>, showing a service graph, recommendations, approval workflow, and drift timeline</li>
<li class=""><strong>Control Plane API</strong> - ASP.NET Core (.NET 10) handling service groups, analysis runs, decisions, and RFC 9457 error responses</li>
<li class=""><strong>Agent Orchestrator</strong> - a .NET 10 background worker that runs the multi-agent pipeline using <a href="https://learn.microsoft.com/en-us/agent-framework/overview/?pivots=programming-language-csharp&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Agent Framework</a></li>
</ol>
<p>All three run on Azure Container Apps with managed identity everywhere. No secrets in config files - just <code>DefaultAzureCredential</code> and RBAC.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">┌──────────────────────────────────────────────────────────────────┐</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│  Frontend (React + Fluent UI v9)                                  │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│  Service graph · Recommendations · Approval workflow · Timeline   │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">└─────────────────────────┬────────────────────────────────────────┘</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                          │ REST / JWT (Entra ID - planned, not yet implemented)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">┌─────────────────────────▼────────────────────────────────────────┐</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│  Control Plane API (.NET 10 / ASP.NET Core)                       │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│  Service groups · Analysis runs · Decisions · RFC 9457 errors     │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">└──────────┬──────────────────────────────┬────────────────────────┘</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">           │ PostgreSQL (EF Core)          │ Agent messages</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token 
plain">┌──────────▼──────────────────────────────▼────────────────────────┐</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│  Agent Orchestrator (.NET 10 background worker)                   │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│                                                                   │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│  DiscoveryWorkflow ──► MultiAgentOrchestrator (Microsoft MAF)    │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│    Resource Graph        │                                        │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│    Cost Management       ├─ ServiceIntelligenceAgent              │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│    Log Analytics         ├─ BestPracticeEngine (700+ rules)      │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│                          ├─ DriftDetectionAgent                   │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│                          ├─ WellArchitectedAssessmentAgent       │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│                          ├─ FinOpsOptimizerAgent                 │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│                          ├─ CloudNativeMaturityAgent             │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│                          ├─ ArchitectureAgent                    │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│                          ├─ ReliabilityAgent                     │</span><br></div><div 
class="token-line" style="color:#F8F8F2"><span class="token plain">│                          ├─ SustainabilityAgent                  │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│                          └─ GovernanceNegotiationAgent           │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│                                                                   │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│  IacGenerationWorkflow (Foundry-powered Bicep/Terraform)         │</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">└──────────────────────────────────────────────────────────────────┘</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">           All on Azure Container Apps + PostgreSQL Flexible Server</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">           Managed Identity · OpenTelemetry · Key Vault</span><br></div></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="NimbusIQ Deployment Architecture on Azure" src="https://luke.geek.nz/assets/images/nimbusiq-hackathon-submission-Deployment%20Architecture-1575ed17fc8aaa4981e6034c0a139810.jpg" width="1010" height="742" class="img_ev3q"></p>
<p><img decoding="async" loading="lazy" alt="NimbusIQ Dashboard in action" src="https://luke.geek.nz/assets/images/NimbusIQDashboard-8c584b811383fab926c7912a9636f22c.gif" width="1883" height="927" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-ten-agents">The ten agents<a href="https://luke.geek.nz/azure/nimbusiq/#the-ten-agents" class="hash-link" aria-label="Direct link to The ten agents" title="Direct link to The ten agents" translate="no">​</a></h2>
<p>This is the bit I am most pleased with. NimbusIQ runs ten specialised agents, each with a distinct responsibility. Six of them use Microsoft Foundry (GPT-4) for reasoning; four are deterministic rule-based evaluators.</p>
<p><img decoding="async" loading="lazy" alt="NimbusIQ Agent Orchestration Flow" src="https://luke.geek.nz/assets/images/nimbusiq-hackathon-submission-AgentOrchestrationFlow-9d779793097690cf27fbba418ac5af8b.jpg" width="1052" height="751" class="img_ev3q"></p>
<p>Here is how they are wired up using <a href="https://learn.microsoft.com/en-us/agent-framework/overview/?pivots=programming-language-csharp&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Agent Framework</a>'s <code>WorkflowBuilder</code>:</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">WorkflowBuilder builder = new(executorBindings[0]);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">builder.WithName("nimbusiq-sequential");</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">builder.WithDescription(</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    "NimbusIQ multi-agent orchestration workflow powered by Microsoft Agent Framework.");</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">for (var index = 0; index &lt; executorBindings.Count - 1; index++)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    builder.AddEdge(executorBindings[index], executorBindings[index + 1]);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">builder.WithOutputFrom(executorBindings[^1]);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">var workflow = builder.Build(validateOrphans: true);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">await using Run run = await InProcessExecution.RunAsync(</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    workflow, executionState, session.SessionId, cancellationToken);</span><br></div></code></pre></div></div>
<p>Each agent is registered with a clear name and purpose:</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">_agents = new Dictionary&lt;string, AIAgent&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    ["ServiceIntelligence"] = CreateDeterministicAgent(</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        "service-intelligence-agent",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        "Service Intelligence",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        "Calculates service-group intelligence scores.",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        (context, _, _) =&gt; Task.FromResult&lt;object&gt;(</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            serviceIntelligenceAgent.CalculateScores(context.Snapshot))),</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    ["BestPractice"] = CreateDeterministicAgent(</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        "best-practice-agent",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        "Best Practice",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        "Evaluates best-practice 
rules against discovered resources.",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        async (context, _, ct) =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            await bestPracticeEngine.EvaluateAsync(context.Snapshot, ct)),</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    ["DriftDetection"] = CreateDeterministicAgent(</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        "drift-detection-agent",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        "Drift Detection",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        "Detects drift across service resources and best-practice violations.",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        async (context, _, ct) =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            await driftDetectionAgent.AnalyzeDriftAsync(context.Snapshot, null, ct)),</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    // ... WellArchitected, FinOps, CloudNative, Architecture,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    //     Reliability, Sustainability, Governance agents follow</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">};</span><br></div></code></pre></div></div>
<p>The <code>BestPracticeEngine</code> sits at the heart of the deterministic layer. It packages over 700 rules sourced from Azure Well-Architected Framework, PSRule for Azure, Azure Quick Review, and the Azure Architecture Centre. The AI agents then reason over those normalised results rather than making things up from scratch.</p>
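<p>To make the deterministic layer a bit more concrete, here is a minimal sketch of how a rule could be evaluated against discovered resources and normalised into findings. It is not taken from the NimbusIQ source: the <code>Resource</code>, <code>Rule</code>, <code>Finding</code>, and <code>RuleEvaluator</code> types, the <code>DEMO-</code> rule IDs, and the resource properties are all hypothetical and only illustrate the idea of pure, repeatable rule checks.</p>

```csharp
using System;
using System.Linq;

// Hypothetical snapshot: two discovered resources with a few flattened properties.
var resources = new[]
{
    new Resource("stprod001", "Microsoft.Storage/storageAccounts", HttpsOnly: false, PublicAccess: true),
    new Resource("kv-prod", "Microsoft.KeyVault/vaults", HttpsOnly: true, PublicAccess: false),
};

// Hypothetical rules: each check is a pure predicate, so the same input always gives the same result.
var rules = new[]
{
    new Rule("DEMO-001", "High", "Enforce HTTPS-only traffic", r => r.HttpsOnly),
    new Rule("DEMO-002", "Critical", "Disable public network access", r => !r.PublicAccess),
};

var findings = RuleEvaluator.Evaluate(resources, rules);
foreach (var f in findings)
    Console.WriteLine($"{f.Severity,-8} {f.RuleId} {f.ResourceName}: {f.Title}");

// Returns true when the resource complies with the rule.
delegate bool Check(Resource resource);

record Resource(string Name, string Type, bool HttpsOnly, bool PublicAccess);
record Rule(string Id, string Severity, string Title, Check IsCompliant);
record Finding(string RuleId, string Severity, string Title, string ResourceName);

static class RuleEvaluator
{
    // Emits one normalised finding per (resource, rule) pair that fails its check.
    public static Finding[] Evaluate(Resource[] resources, Rule[] rules) =>
        resources
            .SelectMany(res => rules
                .Where(rule => !rule.IsCompliant(res))
                .Select(rule => new Finding(rule.Id, rule.Severity, rule.Title, res.Name)))
            .ToArray();
}
```

<p>In this sketch only the storage account fails its checks, so two findings come out. The point is the shape: every finding carries the same fields regardless of which rule set it came from, which is what lets a later reasoning step treat the results uniformly.</p>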
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Why hybrid?</div><div class="admonitionContent_BuS1"><p>I deliberately kept four agents as pure rule-based evaluators. Not everything needs an LLM - drift scoring, cloud-native maturity checks, and best-practice rule evaluation are deterministic operations where you want consistent, reproducible results. The AI agents handle the subjective bits: explaining trade-offs, generating narratives, and producing remediation code.</p></div></div>
<p><img decoding="async" loading="lazy" alt="NimbusIQ Conflict and Governance pane" src="https://luke.geek.nz/assets/images/NimbusIQConflictGovernancePane-bebbd427064a0ced4c5f37cd737e3ef8.gif" width="1883" height="927" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="drift-detection">Drift detection<a href="https://luke.geek.nz/azure/nimbusiq/#drift-detection" class="hash-link" aria-label="Direct link to Drift detection" title="Direct link to Drift detection" translate="no">​</a></h2>
<p>One of the features I spent the most time on is continuous drift detection. NimbusIQ does not just compare two ARM templates - it evaluates the current state of your resources against the full rule set and produces a severity-weighted score.</p>
<p>The scoring works like this:</p>
<table><thead><tr><th>Severity</th><th>Weight</th></tr></thead><tbody><tr><td>Critical</td><td>10</td></tr><tr><td>High</td><td>5</td></tr><tr><td>Medium</td><td>2</td></tr><tr><td>Low</td><td>1</td></tr></tbody></table>
<p><img decoding="async" loading="lazy" alt="NimbusIQ Timeline and Drift pane" src="https://luke.geek.nz/assets/images/NimbusIQTimelineDriftPane-456f912a8bf5785be8957ccc8db8f2b9.gif" width="1900" height="888" class="img_ev3q"></p>
<p>Each analysis run produces a drift snapshot with a score, category breakdown, and trend direction (<code>stable</code>, <code>degrading</code>, or <code>improving</code>). The dashboard shows those trends over time, so you can see whether your estate is getting better or worse.</p>
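<p>As a rough illustration of severity-weighted scoring, the sketch below uses the weights from the table above. Two things are my assumptions rather than facts about NimbusIQ: that the score is a plain sum of weights (so a higher score means more drift), and that the trend is derived by comparing against the previous snapshot with a small tolerance. The <code>DriftScoring</code> class and its <code>tolerance</code> parameter are hypothetical.</p>

```csharp
using System;
using System.Linq;

// Severities of the violations found in two consecutive (hypothetical) analysis runs.
var previousRun = new[] { "Critical", "High", "High", "Medium", "Low" };
var currentRun = new[] { "High", "High", "Medium", "Low", "Low" };

int previousScore = DriftScoring.Score(previousRun); // 10 + 5 + 5 + 2 + 1 = 23
int currentScore = DriftScoring.Score(currentRun);   // 5 + 5 + 2 + 1 + 1 = 14

Console.WriteLine(
    $"previous={previousScore} current={currentScore} trend={DriftScoring.Trend(previousScore, currentScore)}");

static class DriftScoring
{
    // Weights from the severity table.
    public static int Weight(string severity) => severity switch
    {
        "Critical" => 10,
        "High" => 5,
        "Medium" => 2,
        "Low" => 1,
        _ => throw new ArgumentOutOfRangeException(nameof(severity), severity, "Unknown severity"),
    };

    // Assumption: the score is the plain sum of weights, so higher means more drift.
    public static int Score(string[] severities) => severities.Sum(s => Weight(s));

    // Assumption: the trend compares against the previous snapshot within a tolerance band.
    public static string Trend(int previous, int current, int tolerance = 1)
    {
        if (current > previous + tolerance) return "degrading";
        if (current < previous - tolerance) return "improving";
        return "stable";
    }
}
```

<p>With these inputs, resolving the single Critical violation moves the score from 23 to 14, which this sketch reports as <code>improving</code>. A single Critical outweighs several Low findings, which is the behaviour you want from a weighted score.</p>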
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="iac-generation">IaC generation<a href="https://luke.geek.nz/azure/nimbusiq/#iac-generation" class="hash-link" aria-label="Direct link to IaC generation" title="Direct link to IaC generation" translate="no">​</a></h2>
<p>When a recommendation is approved, NimbusIQ calls Microsoft Foundry with structured context - the action type, target SKU, cost impact, and confidence - and generates Bicep or Terraform code. A rollback plan is generated alongside every change.</p>
<p><img decoding="async" loading="lazy" alt="NimbusIQ Recommendations and approval workflow" src="https://luke.geek.nz/assets/images/NimbusIQRecommendationsPane-b9d9ccabce9f638601b653fe408317e8.gif" width="1897" height="998" class="img_ev3q"></p>
<p>If Foundry is unavailable (because these things happen), it falls back to built-in code templates rather than failing silently. Every generated plan goes through the dual-control approval workflow before anything is applied.</p>
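<p>The fallback behaviour can be sketched roughly as follows. This is an illustration of the pattern only, not the NimbusIQ implementation: the <code>IacGenerator</code> delegate stands in for the real model call, and <code>IacPlan</code>, <code>IacFallback</code>, the <code>Source</code> field, and the template text are hypothetical. It is kept synchronous for brevity, where a real call would be asynchronous.</p>

```csharp
using System;

// Simulate an unavailable model endpoint; a real generator would call the model here.
IacGenerator unavailableModel = (actionType, targetSku) =>
    throw new TimeoutException("model endpoint did not respond");

var plan = IacFallback.Generate(unavailableModel, "resize", "Standard_B2s");
Console.WriteLine($"source={plan.Source}");
Console.WriteLine(plan.Code);

// Hypothetical stand-in for the model-backed generator.
delegate string IacGenerator(string actionType, string targetSku);

record IacPlan(string Code, string Source);

static class IacFallback
{
    public static IacPlan Generate(IacGenerator model, string actionType, string targetSku)
    {
        try
        {
            return new IacPlan(model(actionType, targetSku), "model");
        }
        catch (Exception ex) when (ex is TimeoutException or InvalidOperationException)
        {
            // Fall back loudly: log the reason and record where the plan came from.
            Console.Error.WriteLine($"Model unavailable ({ex.Message}); using built-in template.");
            return new IacPlan(Template(actionType, targetSku), "template");
        }
    }

    // Hypothetical built-in template: a Bicep parameter fragment, not a full deployment.
    static string Template(string actionType, string targetSku) =>
        $"// action: {actionType}\nparam skuName string = '{targetSku}'";
}
```

<p>Recording the source on the plan means a reviewer can tell a templated plan from a model-generated one before approving it, which is the difference between falling back and failing silently.</p>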
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>NimbusIQ generates IaC and presents it for review. It does not apply changes automatically. Every remediation requires explicit human approval through an idempotent state machine. This is a deliberate design choice - enterprise governance requires that a human is always in the loop for infrastructure changes.</p></div></div>
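<p>Here is a minimal sketch of what a dual-control, idempotent approval state machine could look like. The states, the <code>ApprovalRequest</code> class, and the rule that the second approval must come from a different person are my assumptions about what dual control means here, not the actual NimbusIQ state machine.</p>

```csharp
using System;

var request = new ApprovalRequest();

request.Approve("alice");
request.Approve("alice");            // Retry of the same call: no state change.
Console.WriteLine(request.State);    // FirstApproval

request.Approve("bob");              // A second, different approver completes dual control.
Console.WriteLine(request.State);    // Approved

request.Approve("bob");              // Already approved: still a no-op.
Console.WriteLine(request.State);    // Approved

enum ApprovalState { Pending, FirstApproval, Approved, Rejected }

class ApprovalRequest
{
    string _firstApprover = "";

    public ApprovalState State { get; private set; } = ApprovalState.Pending;

    public void Approve(string approver)
    {
        switch (State)
        {
            case ApprovalState.Pending:
                _firstApprover = approver;
                State = ApprovalState.FirstApproval;
                break;
            case ApprovalState.FirstApproval when approver != _firstApprover:
                State = ApprovalState.Approved;
                break;
            case ApprovalState.Rejected:
                throw new InvalidOperationException("A rejected request cannot be approved.");
            // Same approver again, or already Approved: idempotent no-op.
        }
    }

    public void Reject()
    {
        if (State == ApprovalState.Approved)
            throw new InvalidOperationException("An approved request cannot be rejected.");
        State = ApprovalState.Rejected;
    }
}
```

<p>Idempotency matters because approval calls get retried - a double click or a replayed HTTP request should never count as the second approval.</p>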
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="observability">Observability<a href="https://luke.geek.nz/azure/nimbusiq/#observability" class="hash-link" aria-label="Direct link to Observability" title="Direct link to Observability" translate="no">​</a></h2>
<p>The entire agent pipeline is instrumented with <a href="https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry</a>. Every agent step, every Foundry call, every MCP tool invocation gets a trace with correlation IDs. You get traces that look like this:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">atlas-control-plane-api</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    └── AnalysisRun: Execute (3200ms)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">         ├── Atlas.AgentOrchestrator.MultiAgent: RunAnalysis (2800ms)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">         │    ├── ServiceIntelligence: CalculateScores (45ms)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">         │    ├── BestPractice: Evaluate (320ms)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">         │    ├── DriftDetection: AnalyzeDrift (180ms)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">         │    ├── WellArchitected: Assess (520ms)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">         │    │    └── Atlas.AgentOrchestrator.Azure.AIFoundry: GenerateNarrative (340ms)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">         │    ├── FinOps: Analyze (410ms)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">         │    └── Governance: Negotiate (290ms)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">         └── Atlas.AgentOrchestrator.DriftPersistence: PersistSnapshot (15ms)</span><br></div></code></pre></div></div>
<p>That level of visibility matters. When an agent produces a questionable recommendation, you can trace exactly what data it saw, what rules fired, and what the LLM was asked.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="deployment">Deployment<a href="https://luke.geek.nz/azure/nimbusiq/#deployment" class="hash-link" aria-label="Direct link to Deployment" title="Direct link to Deployment" translate="no">​</a></h2>
<p>The whole thing deploys with <a href="https://learn.microsoft.com/azure/developer/azure-developer-cli/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Developer CLI</a>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd init</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd </span><span class="token function" style="color:rgb(80, 250, 123)">env</span><span class="token plain"> </span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">set</span><span class="token plain"> NIMBUSIQ_POSTGRES_ADMIN_PASSWORD </span><span class="token string" style="color:rgb(255, 121, 198)">"YourSecurePassword123!"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">azd up</span><br></div></code></pre></div></div>
<p>The infrastructure is defined in Bicep using <a href="https://azure.github.io/Azure-Verified-Modules/" target="_blank" rel="noopener noreferrer" class="">Azure Verified Modules</a> where available. It provisions:</p>
<ul>
<li class="">Azure Container Apps (all three services)</li>
<li class="">Azure Container Registry</li>
<li class="">PostgreSQL Flexible Server</li>
<li class="">Key Vault</li>
<li class="">Microsoft Foundry with GPT-4 deployment</li>
<li class="">Log Analytics workspace</li>
<li class="">Managed identities with least-privilege RBAC</li>
<li class="">Optional VNet integration and Network Security Perimeter</li>
</ul>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>If you want to try it yourself, clone the repo and run <code>azd up</code>. You will need an Azure subscription, Docker Desktop, .NET 10 SDK, and Node.js 20+. The deployment takes about 15–20 minutes.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-i-learned-building-this">What I learned building this<a href="https://luke.geek.nz/azure/nimbusiq/#what-i-learned-building-this" class="hash-link" aria-label="Direct link to What I learned building this" title="Direct link to What I learned building this" translate="no">​</a></h2>
<p>A few things stood out:</p>
<p><strong>Microsoft Agent Framework is genuinely useful for orchestration.</strong> The <code>WorkflowBuilder</code> pattern gives you a clean way to compose agents with explicit edges and validation. The <code>InProcessExecution</code> runner handles the lifecycle well. I would not want to build this kind of multi-agent pipeline without it.</p>
<p><strong>Microsoft Foundry works well when you scope it tightly.</strong> The key is not giving the LLM free rein - it is providing structured context (rule results, resource metadata, cost data) and asking it to reason over that context. When you do that, the outputs are useful. When you do not, you get platitudes.</p>
<p><strong>Grounding through Azure MCP makes a real difference.</strong> Without MCP, the LLM would be making recommendations based on its training data, which might be months out of date. With Azure MCP and Learn MCP, the agents can check current Azure capabilities and documentation before recommending changes.</p>
<p><img decoding="async" loading="lazy" alt="NimbusIQ AI Chat pane" src="https://luke.geek.nz/assets/images/NimbusIQAIChatPane-e9a18b899cf3e0e432970d815506032a.gif" width="1900" height="888" class="img_ev3q"></p>
<p><strong>Managed identity simplifies everything.</strong> No connection strings, no key rotation, no secrets in environment variables. Just <code>DefaultAzureCredential</code>, RBAC role assignments in Bicep, and everything wires up. This is how Azure services should be connected.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="wrapping-up">Wrapping up<a href="https://luke.geek.nz/azure/nimbusiq/#wrapping-up" class="hash-link" aria-label="Direct link to Wrapping up" title="Direct link to Wrapping up" translate="no">​</a></h2>
<p>NimbusIQ is my attempt at building the thing I wish existed when I am helping teams sort out their Azure estates. Not another dashboard with red/amber/green indicators, but something that actually reasons across the signals, explains what matters and why, and generates remediation plans that a human can review and approve.</p>
<blockquote>
<p>The code is on GitHub: <strong><a href="https://github.com/lukemurraynz/NimbusIQ" target="_blank" rel="noopener noreferrer" class="">github.com/lukemurraynz/NimbusIQ</a></strong></p>
</blockquote>
<p>If you have questions or want to chat about the architecture, feel free to reach out.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Change-Driven Architecture on Azure with Drasi]]></title>
            <link>https://luke.geek.nz/azure/change-driven-architecture/</link>
            <guid>https://luke.geek.nz/azure/change-driven-architecture/</guid>
            <pubDate>Wed, 04 Mar 2026 21:47:48 GMT</pubDate>
            <description><![CDATA[A practical look at change-driven architecture on Azure with Drasi and PostgreSQL CDC, based on an Emergency Alert System proof of concept.]]></description>
            <content:encoded><![CDATA[<p>Today, we are going to look at change-driven architecture on Azure using <a href="https://drasi.io/" target="_blank" rel="noopener noreferrer" class="">Drasi</a>, and why it matters from a <a href="https://learn.microsoft.com/azure/well-architected?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Well-Architected</a> perspective.</p>
<p>If you have ever built a system that polls a database every few seconds, asking, "Has anything changed?" - this one is for you.</p>
<blockquote>
<p>I recently built three proofs of concept on Azure that use Drasi for reactive data processing: an <a href="https://github.com/lukemurraynz/EmergencyAlertSystem" target="_blank" rel="noopener noreferrer" class="">Emergency Alert System</a>, a <a href="https://github.com/lukemurraynz/SantaDigitalShowcase25" target="_blank" rel="noopener noreferrer" class="">Santa Digital Workshop</a>, and <a href="https://luke.geek.nz/azure/drasi-bastion-rbac-automation/" target="_blank" rel="noopener noreferrer" class="">Automate Azure Bastion with Drasi Realtime RBAC Monitoring</a>. One of the most interesting things I discovered was that change-driven architecture fundamentally shifts how you think about reliability, cost, and operational efficiency.</p>
</blockquote>
<!-- -->
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><p>This article explores architectural patterns from a proof of concept. The patterns are production-applicable, but the implementation itself is a learning exercise.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-polling-problem">The Polling Problem<a href="https://luke.geek.nz/azure/change-driven-architecture/#the-polling-problem" class="hash-link" aria-label="Direct link to The Polling Problem" title="Direct link to The Polling Problem" translate="no">​</a></h2>
<p>Most event-driven systems I have worked on follow the same pattern: a background service queries the database on a timer, checks for changes, and then acts on them.</p>
<p>It works, but it has some well-known problems:</p>
<ul>
<li class=""><strong>Wasted compute</strong> - 99% of polls return "nothing changed"</li>
<li class=""><strong>Latency</strong> - you only detect changes at the poll interval (1 second, 5 seconds, 30 seconds?)</li>
<li class=""><strong>Race conditions</strong> - if multiple instances poll simultaneously, you need distributed locks</li>
<li class=""><strong>Scaling challenges</strong> - more instances means more database load, not faster detection</li>
</ul>
<p>From a <a href="https://learn.microsoft.com/azure/well-architected/cost-optimization/?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Well-Architected Cost Optimization</a> perspective, polling is paying for compute that mostly does nothing.</p>
<p>From a <a href="https://learn.microsoft.com/azure/well-architected/reliability/?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Reliability</a> perspective, poll intervals create a detection floor - you simply cannot react faster than your timer.</p>
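<p>To put rough numbers on both points, here is a minimal Python sketch. It is not taken from the proof of concept; the function names and the 5-second interval are illustrative assumptions. It computes how long each change waits for the next poll tick, and how many polls a fleet issues per day whether or not anything changed.</p>

```python
import math


def polling_detection_delays(poll_interval_s, change_times_s):
    """Return how long each change waits until the next poll tick sees it."""
    delays = []
    for changed_at in change_times_s:
        # Polls happen at 0, interval, 2 * interval, ...
        # The change is seen at the first tick at or after it occurred.
        next_poll = math.ceil(changed_at / poll_interval_s) * poll_interval_s
        delays.append(next_poll - changed_at)
    return delays


def polls_per_day(poll_interval_s, instances=1):
    """Number of queries issued per day, regardless of whether data changed."""
    seconds_per_day = 24 * 60 * 60
    return instances * seconds_per_day // poll_interval_s


if __name__ == "__main__":
    # Hypothetical change timestamps (seconds) against a 5-second poll timer.
    print(polling_detection_delays(5, [0, 1, 4, 5, 6]))
    print(polls_per_day(5, instances=3))
```

<p>The worst case always approaches the full poll interval, and adding instances multiplies the query count without lowering that floor.</p>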
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="enter-change-data-capture">Enter Change Data Capture<a href="https://luke.geek.nz/azure/change-driven-architecture/#enter-change-data-capture" class="hash-link" aria-label="Direct link to Enter Change Data Capture" title="Direct link to Enter Change Data Capture" translate="no">​</a></h2>
<p><a href="https://learn.microsoft.com/azure/postgresql/flexible-server/concepts-logical?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Change Data Capture (CDC)</a> flips this model. Instead of asking the database whether something has changed, the database tells you when it does.</p>
<p>PostgreSQL Flexible Server <em>(just one of the <a href="https://drasi.io/concepts/sources/" target="_blank" rel="noopener noreferrer" class="">Drasi sources</a>)</em> supports logical replication natively, which streams every <code>INSERT</code>, <code>UPDATE</code>, and <code>DELETE</code> as it happens.</p>
<p>Drasi sits on top of this CDC stream and runs <a href="https://drasi.io/concepts/continuous-queries/" target="_blank" rel="noopener noreferrer" class="">continuous queries</a> - written in Cypher - that evaluate incoming changes against patterns you define. When a pattern matches, Drasi fires a reaction <em>(in my case, an HTTP callback to an API)</em>.</p>
<p>The architecture follows a simple flow: <strong>Source → Queries → Reactions</strong>.</p>
<p><img decoding="async" loading="lazy" alt="Change-Driven Architecture: Polling vs CDC with Drasi" src="https://luke.geek.nz/assets/images/Polling-vs-cdc-architecture-PollingvsCDC-61dd1868f8317d002eec4803eb8ebbdd.jpg" width="1322" height="831" class="img_ev3q"></p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token comment" style="color:rgb(98, 114, 164)"># Drasi CDC Source Configuration</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> v1</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> Source</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">name</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> postgres</span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain">alerts</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">spec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> PostgreSQL</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">properties</span><span class="token 
punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">host</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> $</span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain">POSTGRES_HOST</span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">port</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> $</span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain">POSTGRES_PORT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">user</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> $</span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain">POSTGRES_USER</span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">password</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> $</span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain">POSTGRES_PASSWORD</span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"></span><br></div><div 
class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">database</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> $</span><span class="token punctuation" style="color:rgb(248, 248, 242)">{</span><span class="token plain">POSTGRES_DATABASE</span><span class="token punctuation" style="color:rgb(248, 248, 242)">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">ssl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> </span><span class="token boolean important">true</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key atrule">tables</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> emergency_alerts.alerts</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> emergency_alerts.areas</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> emergency_alerts.recipients</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> emergency_alerts.delivery_attempts</span><br></div><div class="token-line" 
style="color:#F8F8F2"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> emergency_alerts.approval_records</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> emergency_alerts.correlation_events</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> emergency_alerts.area_signals</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> emergency_alerts.weather_observations</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> emergency_alerts.road_maintenance</span><br></div></code></pre></div></div>
<p>This source watches nine tables. Every change to any of these tables flows into the continuous query engine.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="which-drasi-mode-should-you-use">Which Drasi Mode Should You Use?<a href="https://luke.geek.nz/azure/change-driven-architecture/#which-drasi-mode-should-you-use" class="hash-link" aria-label="Direct link to Which Drasi Mode Should You Use?" title="Direct link to Which Drasi Mode Should You Use?" translate="no">​</a></h2>
<p>One useful design decision early on is picking the right Drasi runtime for your workload. Drasi is available in three forms with the same core model (<strong>Sources → Continuous Queries → Reactions</strong>), but different operational trade-offs.</p>
<p><img decoding="async" loading="lazy" alt="Drasi mode comparison across D4K8s, Server, and Library" src="https://luke.geek.nz/assets/images/drasi_sku_types-c88027159af99f2845acff8519138128.png" width="1142" height="651" class="img_ev3q"></p>
<ul>
<li class=""><strong><a href="https://drasi.io/drasi-kubernetes/" target="_blank" rel="noopener noreferrer" class="">Drasi for Kubernetes (D4K8s)</a></strong> - best for production-scale, cloud-native platforms where you want Kubernetes-native scaling, observability, and operational controls.</li>
<li class=""><strong><a href="https://drasi.io/drasi-server/" target="_blank" rel="noopener noreferrer" class="">Drasi Server</a></strong> - best for local development, Docker Compose, edge, and non-Kubernetes environments where you still want full Drasi capabilities in a single process/container.</li>
<li class=""><strong><a href="https://drasi.io/drasi-lib/" target="_blank" rel="noopener noreferrer" class="">drasi-lib</a></strong> - best when building a Rust app and you want in-process change detection with no separate Drasi infrastructure.</li>
</ul>
<p>A practical path I have found useful: start with <strong>Server</strong> to iterate quickly, move to <strong>D4K8s</strong> as reliability/scale requirements grow, and choose <strong>drasi-lib</strong> when your change logic should live directly inside a Rust service.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="continuous-queries---the-logic-layer">Continuous Queries - The Logic Layer<a href="https://luke.geek.nz/azure/change-driven-architecture/#continuous-queries---the-logic-layer" class="hash-link" aria-label="Direct link to Continuous Queries - The Logic Layer" title="Direct link to Continuous Queries - The Logic Layer" translate="no">​</a></h2>
<p>Here is where it gets interesting.</p>
<p>A continuous query is not a one-off SQL statement. It is a standing query that is continuously evaluated against the stream of changes from one or more sources.</p>
<p>For example, the delivery trigger query fires when an alert transitions to <code>Approved</code> with a <code>Pending</code> delivery status:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> v1</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> ContinuousQuery</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">name</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> delivery</span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain">trigger</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token key atrule">spec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">mode</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> query</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">sources</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    </span><span class="token key 
atrule">subscriptions</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> </span><span class="token key atrule">id</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> postgres</span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain">alerts</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        </span><span class="token key atrule">nodes</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">          </span><span class="token punctuation" style="color:rgb(248, 248, 242)">-</span><span class="token plain"> </span><span class="token key atrule">sourceLabel</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> alerts</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">  </span><span class="token key atrule">query</span><span class="token punctuation" style="color:rgb(248, 248, 242)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">|</span><span class="token scalar string" style="color:rgb(255, 121, 198)"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">    MATCH (a:alerts)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">    WHERE a.status = 'Approved' AND a.delivery_status = 'Pending'</span><br></div><div class="token-line" 
style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">    RETURN</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      a.alert_id AS alertId,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      a.headline AS headline,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      a.severity AS severity,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      a.sent_at AS approvedAt,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token scalar string" style="color:rgb(255, 121, 198)">      drasi.changeDateTime(a) AS triggeredAt</span><br></div></code></pre></div></div>
<p>No polling. No timers.</p>
<p>The moment a row changes in the <code>alerts</code> table and matches these conditions, Drasi fires the reaction.</p>
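<p>The standing-query idea can be illustrated with a toy Python model. This is not how Drasi is implemented; it is a sketch under the assumption that the engine tracks which rows currently satisfy the query and reports transitions in and out of that result set. The <code>StandingQuery</code> class and the event names are hypothetical, and the <code>matches</code> function mirrors the <code>WHERE</code> clause above.</p>

```python
from dataclasses import dataclass, field


def matches(row: dict) -> bool:
    """Mirror of the WHERE clause in the delivery-trigger query."""
    return row.get("status") == "Approved" and row.get("delivery_status") == "Pending"


@dataclass
class StandingQuery:
    """Toy standing query: remembers matching rows and reports transitions."""

    results: dict = field(default_factory=dict)

    def apply_change(self, op: str, row: dict) -> list:
        key = row["alert_id"]
        was_in = key in self.results
        is_in = op != "DELETE" and matches(row)

        events = []
        if is_in and not was_in:
            events.append(("added", key))      # row entered the result set
        elif is_in and self.results[key] != row:
            events.append(("updated", key))    # still matches, but values changed
        elif was_in and not is_in:
            events.append(("deleted", key))    # row left the result set

        # Keep the tracked result set in step with the latest change.
        if is_in:
            self.results[key] = row
        else:
            self.results.pop(key, None)
        return events


if __name__ == "__main__":
    query = StandingQuery()
    draft = {"alert_id": 1, "status": "Draft", "delivery_status": "Pending"}
    approved = {"alert_id": 1, "status": "Approved", "delivery_status": "Pending"}
    print(query.apply_change("INSERT", draft))
    print(query.apply_change("UPDATE", approved))
```

<p>Nothing is emitted for the draft row; the event appears only when the update moves the row into the matching set, which is the point at which a reaction would be triggered.</p>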
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-well-architected-impact">The Well-Architected Impact<a href="https://luke.geek.nz/azure/change-driven-architecture/#the-well-architected-impact" class="hash-link" aria-label="Direct link to The Well-Architected Impact" title="Direct link to The Well-Architected Impact" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="reliability">Reliability<a href="https://luke.geek.nz/azure/change-driven-architecture/#reliability" class="hash-link" aria-label="Direct link to Reliability" title="Direct link to Reliability" translate="no">​</a></h3>
<p>Change-driven architecture eliminates the detection gap.</p>
<p>In a polling model, if your timer runs every 5 seconds, a critical SLA breach might sit undetected for up to 5 seconds. With CDC, detection is near-instantaneous.</p>
<p>In my proof of concept, I run 15+ continuous queries simultaneously - including SLA-breach detection every 60 seconds, approval-timeout detection every 5 minutes, cross-region correlation, and severity-escalation tracking.</p>
<p>Each query runs independently, and if one fails, the others continue operating. This aligns with the Well-Architected <a href="https://learn.microsoft.com/azure/well-architected/reliability/failure-mode-analysis?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">failure mode analysis</a> guidance - decompose your detection logic so a failure in one area does not cascade.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="cost-optimization">Cost Optimization<a href="https://luke.geek.nz/azure/change-driven-architecture/#cost-optimization" class="hash-link" aria-label="Direct link to Cost Optimization" title="Direct link to Cost Optimization" translate="no">​</a></h3>
<p>No idle compute cycles polling an unchanged database.</p>
<p>The compute only activates when data actually changes. For workloads with bursty change patterns <em>(like an emergency alert system)</em>, this can significantly reduce steady-state cost compared to a fleet of polling workers.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="operational-excellence">Operational Excellence<a href="https://luke.geek.nz/azure/change-driven-architecture/#operational-excellence" class="hash-link" aria-label="Direct link to Operational Excellence" title="Direct link to Operational Excellence" translate="no">​</a></h3>
<p>Each continuous query is a declarative YAML file, version-controlled alongside the infrastructure.</p>
<p>Adding a new detection pattern means writing a new query file and deploying it - no code changes to the application, no new background services, no additional infrastructure.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">infrastructure/drasi/queries/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">├── sla-monitoring/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   ├── delivery-sla-breach.yaml</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   ├── approval-timeout.yaml</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   └── expiry-warning.yaml</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">├── risk-detection/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   ├── geographic-correlation.yaml</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   ├── regional-hotspot.yaml</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   ├── severity-escalation.yaml</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">│   └── duplicate-suppression.yaml</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">└── recommendations/</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    ├── delivery-trigger.yaml</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    ├── all-clear-suggestion.yaml</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    └── 
area-expansion-suggestion.yaml</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-to-use-this-pattern">When to Use This Pattern<a href="https://luke.geek.nz/azure/change-driven-architecture/#when-to-use-this-pattern" class="hash-link" aria-label="Direct link to When to Use This Pattern" title="Direct link to When to Use This Pattern" translate="no">​</a></h2>
<p>Change-driven architecture is a good fit when:</p>
<ul>
<li class=""><strong>Low-latency detection matters</strong> - SLA monitoring, fraud detection, security alerts</li>
<li class=""><strong>Multiple detection rules run in parallel</strong> - you need 10+ independent queries watching the same data</li>
<li class=""><strong>The write-to-read ratio is low</strong> - changes happen infrequently relative to how often you would poll</li>
<li class=""><strong>You already use PostgreSQL or another source that supports CDC</strong> - with PostgreSQL, CDC comes built in through logical replication</li>
</ul>
<p>It is less suited for:</p>
<ul>
<li class=""><strong>High-frequency OLTP</strong> - if every row changes every second, you are essentially processing the full table continuously</li>
<li class=""><strong>Simple CRUD</strong> - if you just need "notify me when a row is inserted," a database trigger or Event Grid integration might be simpler</li>
<li class=""><strong>Teams unfamiliar with Cypher</strong> - the learning curve for graph-style queries is real</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="getting-started">Getting Started<a href="https://luke.geek.nz/azure/change-driven-architecture/#getting-started" class="hash-link" aria-label="Direct link to Getting Started" title="Direct link to Getting Started" translate="no">​</a></h2>
<p>If you want to try this pattern, you need:</p>
<ol>
<li class=""><a href="https://learn.microsoft.com/azure/aks/what-is-aks?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Kubernetes Service (AKS)</a> - the Drasi for Kubernetes mode needs a cluster to run on <em>(a local KIND cluster in a devcontainer also works for testing)</em></li>
<li class=""><a href="https://learn.microsoft.com/azure/postgresql/flexible-server/overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">PostgreSQL Flexible Server</a> with logical replication enabled</li>
<li class="">The <a href="https://drasi.io/drasi-server/getting-started/" target="_blank" rel="noopener noreferrer" class="">Drasi CLI</a>, which you use to install Drasi into your cluster</li>
</ol>
<p>The Drasi documentation covers installation well. The key Azure-specific step is to enable logical replication on your PostgreSQL Flexible Server - set <code>wal_level = logical</code> and configure <code>max_replication_slots</code> to match the number of sources you plan to run.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><p>If you are using Bicep to deploy PostgreSQL Flexible Server and need spatial queries, add <code>POSTGIS</code> to the <code>azure.extensions</code> server parameter. That parameter only allow-lists the extension, so you still need to run <code>CREATE EXTENSION postgis</code> in the database before running migrations. The CDC source itself does not require PostGIS.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="wrapping-up">Wrapping Up<a href="https://luke.geek.nz/azure/change-driven-architecture/#wrapping-up" class="hash-link" aria-label="Direct link to Wrapping Up" title="Direct link to Wrapping Up" translate="no">​</a></h2>
<p>Change-driven architecture addresses several Well-Architected concerns simultaneously:</p>
<ul>
<li class="">It reduces wasted compute (<strong>Cost Optimization</strong>)</li>
<li class="">It eliminates detection gaps (<strong>Reliability</strong>)</li>
<li class="">It keeps detection logic declarative and version-controlled (<strong>Operational Excellence</strong>)</li>
</ul>
<p>Drasi makes this pattern accessible on Azure without writing custom CDC consumers or managing Kafka/Debezium infrastructure yourself.</p>
<p>The shift from "ask the database" to "let the database tell you" is subtle, but the architectural implications are significant.</p>
<blockquote>
<p>You can find the full proof of concept on GitHub: <a href="https://github.com/lukemurraynz/EmergencyAlertSystem" target="_blank" rel="noopener noreferrer" class="">lukemurraynz/EmergencyAlertSystem</a>.</p>
</blockquote>]]></content:encoded>
            <category>Azure</category>
        </item>
        <item>
            <title><![CDATA[Container Security Hardening for Azure Container Apps]]></title>
            <link>https://luke.geek.nz/azure/container-security-hardening-checklist/</link>
            <guid>https://luke.geek.nz/azure/container-security-hardening-checklist/</guid>
            <pubDate>Wed, 04 Mar 2026 07:33:14 GMT</pubDate>
            <description><![CDATA[A practical checklist for hardening containerised .NET workloads on Azure Container Apps, based on patterns implemented in NimbusIQ.]]></description>
            <content:encoded><![CDATA[<p>Every time I see a production container running as root, I wince.</p>
<p>It is one of those things that is easy to fix but gets overlooked because the app "works fine" without it. But container security is not just about non-root users. It is about the full stack: image build, runtime configuration, network policy, input validation, and rate limiting.</p>
<p>In this post, I will walk through a checklist I used to harden a .NET project running on <a href="https://learn.microsoft.com/azure/container-apps/?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Container Apps</a>.</p>
<!-- -->
<p><img decoding="async" loading="lazy" alt="Container Security" src="https://luke.geek.nz/assets/images/container-security-60bbc4b5abee4b5bbcead3cc9524e206.jpg" width="1121" height="651" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-non-root-containers">1. Non-root containers<a href="https://luke.geek.nz/azure/container-security-hardening-checklist/#1-non-root-containers" class="hash-link" aria-label="Direct link to 1. Non-root containers" title="Direct link to 1. Non-root containers" translate="no">​</a></h2>
<p>Running as root inside a container means that if an attacker exploits a vulnerability in your application, they inherit root privileges within the container. In some scenarios, that can be leveraged for container escape.</p>
<p>The fix is straightforward. In your Dockerfile:</p>
<div class="language-dockerfile codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-dockerfile codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">FROM mcr.microsoft.com/dotnet/aspnet:10.0 AS runtime</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">WORKDIR /app</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">COPY --from=build /app/publish .</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">ENV ASPNETCORE_HTTP_PORTS=8080</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">EXPOSE 8080</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Switch to non-root user</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">USER $APP_UID</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    CMD curl -f http://localhost:8080/health/ready || exit 1</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" 
style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">ENTRYPOINT ["dotnet", "App.ControlPlane.Api.dll"]</span><br></div></code></pre></div></div>
<p>Key points:</p>
<ul>
<li class="">For <a href="https://devblogs.microsoft.com/dotnet/securing-containers-with-rootless/?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">official Microsoft .NET <strong>Linux</strong> images (.NET 8+)</a>, you do <strong>not</strong> need to create your own user. The images already include a non-root <code>app</code> user.</li>
<li class="">Use <code>USER app</code> or <code>USER $APP_UID</code> (<code>$APP_UID</code> is UID <code>1654</code>). I prefer <code>USER $APP_UID</code> because it also works cleanly with Kubernetes <code>runAsNonRoot</code> checks.</li>
<li class="">The image is <strong>non-root capable</strong>, but it is not automatically non-root unless you set <code>USER</code> explicitly.</li>
<li class="">Place <code>USER</code> after <code>COPY</code> so the app files are copied first and then executed as non-root.</li>
<li class="">Use port <code>8080</code> (not 80/443). Non-privileged ports avoid root requirements; binding to a port below 1024, such as <code>80</code>, typically requires root or the <code>NET_BIND_SERVICE</code> capability.</li>
</ul>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>If you are using a base image that does <strong>not</strong> provide a non-root user (or you have custom filesystem write paths), create/chown a dedicated runtime user for those paths before switching away from root.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-multi-stage-builds">2. Multi-stage builds<a href="https://luke.geek.nz/azure/container-security-hardening-checklist/#2-multi-stage-builds" class="hash-link" aria-label="Direct link to 2. Multi-stage builds" title="Direct link to 2. Multi-stage builds" translate="no">​</a></h2>
<p>Multi-stage Docker builds keep build tools (SDK, compilers, npm dev dependencies) out of the runtime image. This reduces the attack surface and image size.</p>
<div class="language-dockerfile codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-dockerfile codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Build stage — SDK and build toolchain</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">FROM mcr.microsoft.com/dotnet/sdk:10.0 AS build</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">WORKDIR /src</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">COPY . .</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">RUN dotnet restore src/Api/App.ControlPlane.Api.csproj</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">RUN dotnet publish src/Api/App.ControlPlane.Api.csproj -c Release -o /app/publish /p:UseAppHost=false</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Runtime stage — minimal runtime only</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">FROM mcr.microsoft.com/dotnet/aspnet:10.0 AS runtime</span><br></div></code></pre></div></div>
<p>For frontend workloads, the pattern is similar:</p>
<div class="language-dockerfile codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-dockerfile codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Build stage with Node.js</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">FROM node:20-alpine AS build</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"># ... npm ci, vite build</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain"># Runtime stage with production dependencies only</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">FROM node:20-alpine AS runtime</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">RUN npm ci --only=production</span><br></div></code></pre></div></div>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>Use <code>--omit=dev</code> (npm 7 and later; <code>--only=production</code> is the legacy spelling) in runtime stages so TypeScript, ESLint, Vite, and other dev tooling are not shipped to production.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-pin-base-image-versions">3. Pin base image versions<a href="https://luke.geek.nz/azure/container-security-hardening-checklist/#3-pin-base-image-versions" class="hash-link" aria-label="Direct link to 3. Pin base image versions" title="Direct link to 3. Pin base image versions" translate="no">​</a></h2>
<p>Never use <code>latest</code> in production images.</p>
<p>❌ Bad — unpredictable</p>
<div class="language-dockerfile codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-dockerfile codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">FROM mcr.microsoft.com/dotnet/aspnet:latest</span><br></div></code></pre></div></div>
<p>✅ Good — deterministic and reproducible</p>
<div class="language-dockerfile codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-dockerfile codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">FROM mcr.microsoft.com/dotnet/aspnet:10.0</span><br></div></code></pre></div></div>
<p>Pinning to major.minor gives you a solid balance between stability and patch cadence. If you need strict reproducibility, pin to an image digest.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-health-probes-that-bypass-auth">4. Health probes that bypass auth<a href="https://luke.geek.nz/azure/container-security-hardening-checklist/#4-health-probes-that-bypass-auth" class="hash-link" aria-label="Direct link to 4. Health probes that bypass auth" title="Direct link to 4. Health probes that bypass auth" translate="no">​</a></h2>
<p>Health endpoints should bypass authentication middleware. If readiness requires a JWT, the platform cannot accurately determine service health.</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.MapGet("/health/ready", () =&gt; Results.Ok(new</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    Status = "Healthy",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    Timestamp = DateTime.UtcNow,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    Service = "app-control-plane-api",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    Version = "1.0.0"</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}));</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.MapGet("/health/live", () =&gt; Results.Ok(new</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    Status = "Alive",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    Timestamp = DateTime.UtcNow</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}));</span><br></div></code></pre></div></div>
<p>In practice, map these endpoints before strict authorization rules, or explicitly bypass auth for <code>/health/*</code>.</p>
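<p>A minimal sketch of the explicit bypass, assuming a fallback policy that requires an authenticated user on every endpoint. The fallback policy and the trimmed response bodies are illustrative assumptions, not taken from the project:</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">using Microsoft.AspNetCore.Authentication.JwtBearer;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">using Microsoft.AspNetCore.Authorization;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">using Microsoft.Identity.Web;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">var builder = WebApplication.CreateBuilder(args);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">builder.Services.AddAuthentication(JwtBearerDefaults.AuthenticationScheme)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    .AddMicrosoftIdentityWebApi(builder.Configuration.GetSection("AzureAd"));</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">builder.Services.AddAuthorization(options =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    // Assumption: every endpoint requires an authenticated user unless it opts out.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    options.FallbackPolicy = new AuthorizationPolicyBuilder()</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        .RequireAuthenticatedUser()</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        .Build();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">});</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">var app = builder.Build();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.UseAuthentication();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.UseAuthorization();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// Probes opt out explicitly so the platform can call them without a token.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.MapGet("/health/ready", () =&gt; Results.Ok(new { Status = "Healthy" })).AllowAnonymous();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.MapGet("/health/live", () =&gt; Results.Ok(new { Status = "Alive" })).AllowAnonymous();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.Run();</span><br></div></code></pre></div></div>
<p>With a deny-by-default policy like this, a probe that is missing <code>AllowAnonymous()</code> fails with a <code>401</code> straight away rather than silently depending on middleware order.</p>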
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Configure both liveness and readiness. Liveness answers "Is the process alive?" Readiness answers "Can it safely receive traffic?"</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-rate-limiting">5. Rate limiting<a href="https://luke.geek.nz/azure/container-security-hardening-checklist/#5-rate-limiting" class="hash-link" aria-label="Direct link to 5. Rate limiting" title="Direct link to 5. Rate limiting" translate="no">​</a></h2>
<p>The API uses <a href="https://learn.microsoft.com/aspnet/core/performance/rate-limit?view=aspnetcore-10.0&amp;WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">ASP.NET Core rate limiting middleware</a> with a fixed-window policy:</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">builder.Services.AddRateLimiter(options =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    options.GlobalLimiter = PartitionedRateLimiter.Create&lt;HttpContext, string&gt;(httpContext =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        RateLimitPartition.GetFixedWindowLimiter(</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            partitionKey: httpContext.Connection.RemoteIpAddress?.ToString() ?? "anonymous",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            factory: _ =&gt; new FixedWindowRateLimiterOptions</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                PermitLimit = 100,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                Window = TimeSpan.FromMinutes(1),</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                QueueProcessingOrder = QueueProcessingOrder.OldestFirst,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                QueueLimit = 0</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            }));</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">});</span><br></div></code></pre></div></div>
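<p>This gives a clear policy: 100 requests per minute per client IP, fail fast with <code>429</code>, and no queuing. Behind an ingress or reverse proxy, <code>RemoteIpAddress</code> is the proxy's address unless forwarded headers are processed, so confirm the partition key reflects the real client.</p>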
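<p>Registering the limiter is only half of the setup: it is enforced once the middleware is added to the pipeline. A minimal sketch, assuming you also want health probes exempt from the limit (the exemption and the <code>/api/ping</code> endpoint are illustrative assumptions):</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">using System.Threading.RateLimiting;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">using Microsoft.AspNetCore.RateLimiting;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">var builder = WebApplication.CreateBuilder(args);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">builder.Services.AddRateLimiter(options =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    options.GlobalLimiter = PartitionedRateLimiter.Create&lt;HttpContext, string&gt;(httpContext =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        RateLimitPartition.GetFixedWindowLimiter(</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            partitionKey: httpContext.Connection.RemoteIpAddress?.ToString() ?? "anonymous",</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            factory: _ =&gt; new FixedWindowRateLimiterOptions</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                PermitLimit = 100,</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                Window = TimeSpan.FromMinutes(1),</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">                QueueLimit = 0</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">            }));</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">});</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">var app = builder.Build();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// Without this call the limiter is registered but never enforced.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.UseRateLimiter();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// Hypothetical endpoint, subject to the global limiter.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.MapGet("/api/ping", () =&gt; Results.Ok("pong"));</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// Assumption: probes are exempt so platform health checks never receive a 429.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.MapGet("/health/live", () =&gt; Results.Ok(new { Status = "Alive" })).DisableRateLimiting();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.Run();</span><br></div></code></pre></div></div>
<p>Whether probes should be exempt is a judgment call. The exemption keeps a noisy client from making a healthy replica look unhealthy, at the cost of leaving those paths unthrottled.</p>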
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>In multi-replica environments (including Azure Container Apps), in-memory rate limiting is per instance. For true global limits across replicas, use a distributed store such as <a href="https://learn.microsoft.com/azure/azure-cache-for-redis/cache-overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Cache for Redis</a>.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="6-input-validation-at-the-api-boundary">6. Input validation at the API boundary<a href="https://luke.geek.nz/azure/container-security-hardening-checklist/#6-input-validation-at-the-api-boundary" class="hash-link" aria-label="Direct link to 6. Input validation at the API boundary" title="Direct link to 6. Input validation at the API boundary" translate="no">​</a></h2>
<p>Input validation should happen at the edge of the API, before expensive processing.</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">// Validate input length to prevent abuse</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">const int MaxMessageLength = 4000;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">if (userMessage.Length &gt; MaxMessageLength)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    // Return 400 Bad Request with specific error</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></div></code></pre></div></div>
<p>This is a small change that helps with:</p>
<ul>
<li class="">Prompt injection attempts using oversized payloads</li>
<li class="">Resource exhaustion from unbounded request bodies</li>
<li class="">Token/cost control for downstream AI calls</li>
</ul>
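<p>To make the length check concrete, here is a minimal sketch as a minimal API endpoint. The <code>/api/chat</code> route, the <code>ChatRequest</code> record, and the error messages are hypothetical names used for illustration only:</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">const int MaxMessageLength = 4000;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">var builder = WebApplication.CreateBuilder(args);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">var app = builder.Build();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// Hypothetical endpoint: reject bad input before any expensive downstream call.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.MapPost("/api/chat", (ChatRequest request) =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    if (string.IsNullOrWhiteSpace(request.Message))</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        return Results.BadRequest(new { error = "Message is required." });</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    }</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    if (request.Message.Length &gt; MaxMessageLength)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        return Results.BadRequest(new { error = $"Message exceeds {MaxMessageLength} characters." });</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    }</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    // Placeholder for the real work (for example, a downstream AI call).</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    return Results.Ok(new { accepted = true });</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">});</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.Run();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// Hypothetical request shape for this sketch.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">record ChatRequest(string? Message);</span><br></div></code></pre></div></div>
<p>A length check on the parsed message does not cap the raw request body, so it is worth pairing with a server-level request size limit.</p>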
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="7-authentication-with-entra-id-jwt-bearer">7. Authentication with Entra ID JWT bearer<a href="https://luke.geek.nz/azure/container-security-hardening-checklist/#7-authentication-with-entra-id-jwt-bearer" class="hash-link" aria-label="Direct link to 7. Authentication with Entra ID JWT bearer" title="Direct link to 7. Authentication with Entra ID JWT bearer" translate="no">​</a></h2>
<p>If you have a system such as an API, use <a href="https://learn.microsoft.com/entra/identity-platform/?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Microsoft Entra ID</a> bearer tokens for authentication:</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">builder.Services.AddAuthentication(JwtBearerDefaults.AuthenticationScheme)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    .AddMicrosoftIdentityWebApi(builder.Configuration.GetSection("AzureAd"));</span><br></div></code></pre></div></div>
<p>Authorization policies then control operation-level access:</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">[Authorize(Policy = "AnalysisRead")]</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">public async Task AgentChat([FromBody] AgentChatRequest request, ...)</span><br></div></code></pre></div></div>
<p>Every application endpoint, including the mutating ones, requires authentication. Health probes remain the only unauthenticated paths.</p>
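<p>The <code>AnalysisRead</code> policy has to be registered somewhere. A minimal sketch of one way to do that, assuming the policy maps to an app role; the role name <code>Analysis.Read</code> and the <code>/api/analysis</code> endpoint are hypothetical and not taken from the project:</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">using Microsoft.AspNetCore.Authentication.JwtBearer;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">using Microsoft.Identity.Web;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">var builder = WebApplication.CreateBuilder(args);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">builder.Services.AddAuthentication(JwtBearerDefaults.AuthenticationScheme)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    .AddMicrosoftIdentityWebApi(builder.Configuration.GetSection("AzureAd"));</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">builder.Services.AddAuthorization(options =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    // Assumption: "Analysis.Read" is a hypothetical app role on the Entra ID app registration.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    options.AddPolicy("AnalysisRead", policy =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        policy.RequireAuthenticatedUser()</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">              .RequireRole("Analysis.Read"));</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">});</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">var app = builder.Build();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.UseAuthentication();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.UseAuthorization();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// Hypothetical endpoint protected by the named policy.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.MapGet("/api/analysis", () =&gt; Results.Ok(new { Message = "authorized" }))</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    .RequireAuthorization("AnalysisRead");</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.Run();</span><br></div></code></pre></div></div>
<p>Whether <code>RequireRole</code> matches depends on how the token's role claim is mapped, so check the claim type in your token validation settings if the policy rejects a token you expect to pass.</p>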
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="8-restrictive-cors">8. Restrictive CORS<a href="https://luke.geek.nz/azure/container-security-hardening-checklist/#8-restrictive-cors" class="hash-link" aria-label="Direct link to 8. Restrictive CORS" title="Direct link to 8. Restrictive CORS" translate="no">​</a></h2>
<p>Configure Cross-Origin Resource Sharing (CORS) for known frontend origins only:</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">builder.Services.AddCors(options =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    options.AddPolicy("AllowFrontend", policy =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        policy.WithOrigins(allowedOrigins)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">              .AllowAnyHeader()</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">              .AllowAnyMethod()</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">              .AllowCredentials();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    });</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">});</span><br></div></code></pre></div></div>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>If allowed origins are sourced from config, remember most apps load this at startup. Update config and restart the deployment to apply changes.</p></div></div>
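<p>For completeness, a minimal sketch of loading the origins from configuration and enabling the policy in the pipeline. The <code>Cors:AllowedOrigins</code> key and the <code>/api/ping</code> endpoint are hypothetical; the project may source its origins differently:</p>
<div class="language-csharp codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-csharp codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">var builder = WebApplication.CreateBuilder(args);</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// Assumption: "Cors:AllowedOrigins" is a hypothetical key holding an array of origins.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// The value is read once here, which is why a restart is needed to pick up changes.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">var allowedOrigins = builder.Configuration</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    .GetSection("Cors:AllowedOrigins")</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    .Get&lt;string[]&gt;() ?? Array.Empty&lt;string&gt;();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">builder.Services.AddCors(options =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    options.AddPolicy("AllowFrontend", policy =&gt;</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    {</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">        policy.WithOrigins(allowedOrigins)</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">              .AllowAnyHeader()</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">              .AllowAnyMethod()</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">              .AllowCredentials();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">    });</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">});</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">var app = builder.Build();</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// The policy only applies once the CORS middleware is in the pipeline.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.UseCors("AllowFrontend");</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">// Hypothetical endpoint used to exercise the policy.</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.MapGet("/api/ping", () =&gt; Results.Ok("pong"));</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">app.Run();</span><br></div></code></pre></div></div>
<p>Falling back to an empty array means a missing setting allows no cross-origin callers, which is the safer failure mode.</p>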
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="9-https-termination-at-ingress-not-inside-container">9. HTTPS termination at ingress (not inside container)<a href="https://luke.geek.nz/azure/container-security-hardening-checklist/#9-https-termination-at-ingress-not-inside-container" class="hash-link" aria-label="Direct link to 9. HTTPS termination at ingress (not inside container)" title="Direct link to 9. HTTPS termination at ingress (not inside container)" translate="no">​</a></h2>
<p>For Azure Container Apps, TLS is terminated at ingress. Your container should listen on HTTP internally:</p>
<div class="language-dockerfile codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-dockerfile codeBlock_bY9V thin-scrollbar" style="color:#F8F8F2;background-color:#282A36"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#F8F8F2"><span class="token plain">ENV ASPNETCORE_HTTP_PORTS=8080</span><br></div><div class="token-line" style="color:#F8F8F2"><span class="token plain">EXPOSE 8080</span><br></div></code></pre></div></div>
<p>If you force HTTPS in-container (<code>https://+:443</code>) without mounting certificates, startup failures are expected.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="practical-hardening-checklist">Practical hardening checklist<a href="https://luke.geek.nz/azure/container-security-hardening-checklist/#practical-hardening-checklist" class="hash-link" aria-label="Direct link to Practical hardening checklist" title="Direct link to Practical hardening checklist" translate="no">​</a></h2>
<p>Use this in PR reviews:</p>
<table><thead><tr><th>Check</th><th>Status</th></tr></thead><tbody><tr><td>Non-root user in Dockerfile</td><td>✅</td></tr><tr><td>Multi-stage build (no SDK in runtime)</td><td>✅</td></tr><tr><td>Pinned base image version (not <code>latest</code>)</td><td>✅</td></tr><tr><td>Health probes bypass auth</td><td>✅</td></tr><tr><td>Liveness and readiness probes configured</td><td>✅</td></tr><tr><td>Rate limiting enabled</td><td>✅</td></tr><tr><td>Input validation at API boundary</td><td>✅</td></tr><tr><td>Entra ID JWT authentication</td><td>✅</td></tr><tr><td>CORS restricted to known origins</td><td>✅</td></tr><tr><td>HTTP (not HTTPS) inside container</td><td>✅</td></tr><tr><td><code>imagePullPolicy: Always</code> in manifests</td><td>✅</td></tr><tr><td>No secrets in Dockerfile or image layers</td><td>✅</td></tr><tr><td><code>HEALTHCHECK</code> instruction in Dockerfile</td><td>✅</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="final-thoughts">Final thoughts<a href="https://luke.geek.nz/azure/container-security-hardening-checklist/#final-thoughts" class="hash-link" aria-label="Direct link to Final thoughts" title="Direct link to Final thoughts" translate="no">​</a></h2>
<p>Container security is not a single switch.</p>
<p>It is a set of patterns that compound: non-root containers, deterministic builds, probe hygiene, rate limiting, input validation, and clear auth boundaries. Applied together, they significantly reduce risk for workloads running on Azure Container Apps.</p>
<blockquote>
<p>And don't forget <a href="https://learn.microsoft.com/azure/container-registry/key-concept-continuous-patching?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Azure Container Registry Continuous Patching</a> and <a href="https://learn.microsoft.com/azure/security/container-secure-supply-chain/articles/container-secure-supply-chain-implementation/containers-secure-supply-chain-overview?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Containers Supply Chain Framework</a>.</p>
</blockquote>
<p>If you want to map this to broader platform guidance, review the <a href="https://learn.microsoft.com/azure/well-architected/security/?WT.mc_id=AZ-MVP-5004796" target="_blank" rel="noopener noreferrer" class="">Security pillar of the Azure Well-Architected Framework</a>.</p>]]></content:encoded>
            <category>Azure</category>
        </item>
    </channel>
</rss>