January 22, 2026

Azure SRE Agent on AKS: Practical SRE with Agentic DevOps

Running production on Azure Kubernetes Service (AKS) is not hard because Kubernetes is complex. It is hard because operations never stay in one place. Metrics are in one tool, logs in another, alerts somewhere else, and the real knowledge lives in people’s heads.

As a professional, I see this pattern again and again. Teams invest in observability, automation, and incident tooling, but still depend on humans to connect the dots. During an incident, this becomes very visible. People jump between dashboards, CLI tools, runbooks, and Teams or Slack messages. Time is lost, not because data is missing, but because context is fragmented.

Azure SRE Agent tries to solve exactly this problem. Not by replacing engineers, but by acting as a reliability assistant that understands Azure, can reason over signals, and can execute operational work in a controlled way.

In this article, I want to look at Azure SRE Agent without selling it to you. We will first look at why this kind of agent exists, how it fits into SRE thinking, and how the architecture works. After that, we will build one step by step and connect it to a realistic AKS environment.

Why SRE Agents Exist

Classic automation in ops is mostly script-based. Bash scripts, PowerShell, Terraform, Helm, pipelines. This works fine for known problems. Restart a pod, scale a deployment, rotate a secret.

But real incidents are rarely that clean. In AKS, for example, a slowdown can come from:

CPU pressure on nodes
Bad resource requests
Network policy changes
A failing dependency outside the cluster
A deployment from yesterday that nobody remembers

Scripts do not reason. They only execute. Humans do reason, but humans are slow, tired, and not always available. This is where SRE agents come in. The idea is not new. What is new is that an agent can:

Understand the environment (Azure, AKS, monitoring)
Ask the right follow-up questions
Test multiple hypotheses
Suggest or perform actions only after that

From an SRE point of view, this is about reducing toil, not removing responsibility.

What Azure SRE Agent Actually Is

Azure SRE Agent is an Azure-native service that acts as an AI-powered operational assistant. It has built-in knowledge of Azure services and can interact with them using Azure CLI and REST APIs.

It is important to be clear about one thing: the agent does nothing without permission. Every action is either reviewed by a human or explicitly allowed through an incident response plan. In practice, this means you can ask questions in natural language, and the agent can query metrics, logs, and resource state, correlate signals across services, propose remediation steps, and, if allowed, execute them. For cloud and platform teams, this becomes interesting when Kubernetes is only part of the system, not the whole system.

How the Agent Fits into the Architecture

At a high level, Azure SRE Agent sits next to your workloads, not inside them. When you create an agent, Azure automatically creates:

A managed identity
An Application Insights instance
A Log Analytics workspace

The agent uses these to read telemetry, understand alerts, and track its own actions. From an AKS perspective, the important part is that the agent does not run as a pod in your cluster. It accesses AKS through Azure control-plane APIs. This means that if your cluster has strict network restrictions, access may be limited. It also means that Kubernetes objects are visible only if Azure APIs allow it, and RBAC and Azure permissions matter a lot. This is very different from in-cluster operators or controllers, and you should design with that in mind.
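To make "control-plane access" concrete, here is a minimal sketch of the kind of Azure Resource Manager URL a control-plane client reads for an AKS cluster. The function name, the placeholder identifiers, and the chosen api-version are my assumptions for illustration. It does not describe which calls the agent makes internally.

```python
from urllib.parse import quote

ARM_ENDPOINT = "https://management.azure.com"
# Assumption: use an api-version that is available in your environment.
AKS_API_VERSION = "2024-02-01"


def aks_cluster_url(subscription_id: str, resource_group: str, cluster_name: str) -> str:
    """Build the ARM (control-plane) URL for an AKS managed cluster resource."""
    path = (
        f"/subscriptions/{quote(subscription_id, safe='')}"
        f"/resourceGroups/{quote(resource_group, safe='')}"
        "/providers/Microsoft.ContainerService"
        f"/managedClusters/{quote(cluster_name, safe='')}"
    )
    return f"{ARM_ENDPOINT}{path}?api-version={AKS_API_VERSION}"


if __name__ == "__main__":
    # Hypothetical identifiers, not real resources.
    print(aks_cluster_url("00000000-0000-0000-0000-000000000000", "rg-aks-prod", "aks-prod-01"))
```

A request to a URL like this is authorized by the caller's Azure role assignments, which is why the permissions of the agent's managed identity matter so much.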

SRE Agent and AKS: Where It Helps, Where It Doesn’t

Let’s be honest. Azure SRE Agent is not a Kubernetes replacement, and it is not an Open Source SRE tool. It shines at cross-service visibility (AKS + App Gateway + SQL + Storage), incident triage using Azure Monitor data, understanding recent changes and deployments, and guiding humans during incidents. It is weaker when you need deep pod-level debugging, when network policies block all control-plane access, or when you expect full autonomy without guardrails. This is fine. In SRE thinking, tools have boundaries. The trick is knowing them.

Agentic DevOps in Practice

Now let’s move to the hands-on part. Before you begin this tutorial, you will need the following:

An active Azure subscription, with sufficient Azure permissions to create the agent and give it access to the resources it will monitor.
An existing Azure Kubernetes Service (AKS) cluster.

Creating the Azure SRE Agent

We start in the Azure portal. Yes, portal. This is still a preview service, and this is where the control lives.

You create a new Azure SRE Agent resource. During creation, you select:

A subscription
A resource group for the agent itself
A region (for now, East US 2 is common)
One or more resource groups to monitor

This choice is important. Do not select everything.

In SRE, scope is safety. Start with the resource group that contains your AKS cluster and its direct dependencies. You can always add more later.
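One way to reason about scope before you click through the portal is to check which of your resources would fall inside the resource groups you plan to select. The sketch below is a hypothetical helper of my own, not part of the agent. It only parses standard Azure resource IDs, and the example IDs are made up.

```python
from typing import Iterable, Optional


def resource_group_of(resource_id: str) -> Optional[str]:
    """Return the resource group name from an Azure resource ID, or None."""
    parts = resource_id.strip("/").split("/")
    # Expected shape: subscriptions/<id>/resourceGroups/<name>/providers/...
    if (
        len(parts) >= 4
        and parts[0].lower() == "subscriptions"
        and parts[2].lower() == "resourcegroups"
    ):
        return parts[3]
    return None


def in_scope(resource_id: str, monitored_groups: Iterable[str]) -> bool:
    """True if the resource lives in one of the monitored resource groups."""
    group = resource_group_of(resource_id)
    if group is None:
        return False
    # Resource group names are case-insensitive in Azure.
    return group.lower() in {g.lower() for g in monitored_groups}


if __name__ == "__main__":
    # Hypothetical resource IDs for illustration.
    cluster = ("/subscriptions/0000/resourceGroups/rg-aks-prod/providers/"
               "Microsoft.ContainerService/managedClusters/aks-prod-01")
    database = "/subscriptions/0000/resourceGroups/rg-data/providers/Microsoft.Sql/servers/sql-01"
    for rid in (cluster, database):
        print(in_scope(rid, ["rg-aks-prod"]), rid)
```

Running a check like this over an inventory export quickly shows which dependencies sit outside the initial scope and would be invisible to the agent.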

After creation, Azure provisions the agent and connects it to monitoring automatically.

Talking to the Agent

Once the agent is ready, you open it and start a chat.

The first questions I usually ask are very boring, but very useful:

“What resources are you managing?”
“What alerts are currently active?”
“What changed in the last 24 hours?”

This does two things. First, it validates access. Second, it builds trust. You quickly see what the agent can and cannot see. If your AKS cluster is visible, you can go further. Ask about node pools, failing workloads, or recent scaling events. This already replaces a lot of manual clicking.
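If you want to cross-check the agent's answer to "What changed in the last 24 hours?", the logic behind that question is a simple time-window filter. Below is a minimal sketch, assuming you already have change records as dictionaries with a timezone-aware `timestamp`. The record shape and the sample entries are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def recent_changes(events: List[Dict], now: datetime,
                   window: timedelta = timedelta(hours=24)) -> List[Dict]:
    """Return the events inside the window that ends at `now`, newest first."""
    cutoff = now - window
    recent = [e for e in events if cutoff <= e["timestamp"] <= now]
    return sorted(recent, key=lambda e: e["timestamp"], reverse=True)


if __name__ == "__main__":
    # Hypothetical change records; real ones would come from your own change history.
    now = datetime(2026, 1, 22, 12, 0, tzinfo=timezone.utc)
    events = [
        {"timestamp": now - timedelta(hours=3), "change": "deployment rollout"},
        {"timestamp": now - timedelta(hours=30), "change": "node pool scale-up"},
    ]
    for event in recent_changes(events, now):
        print(event["timestamp"].isoformat(), event["change"])
```

If the agent's list and your own list disagree, that is usually a sign of a scope or permission gap rather than a reasoning problem.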

Investigating a Real AKS Problem

Let’s say users report that an API is slow. Instead of guessing, you ask: “Why is my API slow?” The agent starts with a standard investigation:

It looks at request metrics
It checks error rates
It reviews recent deployments
It correlates this with infrastructure signals

For simple issues, this is often enough. But production is rarely simple. This is where deep investigation mode becomes useful. When enabled, the agent does not jump to conclusions. It creates hypotheses, for example high CPU usage, network latency, or backend dependency issues. It then validates or rejects these paths one by one. This feels very similar to a real incident war room, but without humans doing all the mechanical work.
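The pattern of creating hypotheses and then validating or rejecting them one by one can be sketched in a few lines. This is my own illustration of the idea, not the agent's implementation. The signal names and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Signals = Dict[str, float]


@dataclass
class Hypothesis:
    name: str
    check: Callable[[Signals], bool]  # returns True when the signals support it


def investigate(hypotheses: List[Hypothesis], signals: Signals) -> Dict[str, str]:
    """Evaluate each hypothesis against the same signals and record a verdict."""
    verdicts = {}
    for hypothesis in hypotheses:
        supported = hypothesis.check(signals)
        verdicts[hypothesis.name] = "supported" if supported else "rejected"
    return verdicts


# Hypothetical thresholds; tune them to your own service level objectives.
HYPOTHESES = [
    Hypothesis("high CPU usage", lambda s: s.get("node_cpu_percent", 0.0) > 85.0),
    Hypothesis("network latency", lambda s: s.get("p95_network_ms", 0.0) > 200.0),
    Hypothesis("backend dependency issues", lambda s: s.get("dependency_error_rate", 0.0) > 0.05),
]

if __name__ == "__main__":
    signals = {"node_cpu_percent": 42.0, "p95_network_ms": 35.0, "dependency_error_rate": 0.12}
    for name, verdict in investigate(HYPOTHESES, signals).items():
        print(f"{name}: {verdict}")
```

The value is in the structure: every path gets an explicit verdict, so nobody has to remember which theories were already ruled out.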

This is the part I like most. Not because it is “AI”, but because it enforces structured thinking.

Automation with Control

Now comes the dangerous part: automation. Azure SRE Agent allows you to define incident response plans. These plans decide which incidents are handled, how serious they must be, and whether actions are automatic or reviewed. My strong advice is to always start in review mode. In AKS environments, one wrong automated action can make things worse. Let the agent diagnose and propose first. Watch how it behaves. Learn its patterns. Only later, once trust is built, should you allow more autonomy.
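To show how such a plan gates actions, here is a minimal model of the decision. The class and field names are hypothetical and do not mirror the agent's real configuration schema. I assume the Azure Monitor convention where a lower severity number means a more serious alert (Sev0 is the most critical).

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    REVIEW = "review"          # the agent proposes, a human approves
    AUTONOMOUS = "autonomous"  # the agent may act on its own


@dataclass(frozen=True)
class ResponsePlan:
    # Assumption: lower number = more severe, so 2 means "handle Sev0 to Sev2".
    least_severe_handled: int
    mode: Mode = Mode.REVIEW


def decide(plan: ResponsePlan, severity: int, human_approved: bool = False) -> str:
    """Return 'ignore', 'propose', or 'execute' for an incident of the given severity."""
    if severity > plan.least_severe_handled:
        return "ignore"
    if plan.mode is Mode.AUTONOMOUS or human_approved:
        return "execute"
    return "propose"


if __name__ == "__main__":
    plan = ResponsePlan(least_severe_handled=2)  # defaults to review mode
    print(decide(plan, severity=3))                       # out of scope for this plan
    print(decide(plan, severity=1))                       # waits for a human
    print(decide(plan, severity=1, human_approved=True))  # approved, so it runs
```

Making review mode the default in the model reflects the advice above: autonomy is something you switch on deliberately, per plan.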

Why One Agent Is Never Enough

In real systems, no single engineer knows everything. The same applies to agents. Subagents allow you to split responsibilities. For example:

One subagent focused on AKS workloads
One focused on databases
One on networking

This maps very nicely to how most SRE teams already work. You can enrich these subagents with your own runbooks, architecture notes, and postmortems. This is where your operational knowledge finally becomes reusable.
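The split of responsibilities can be pictured as a routing table from resource provider to subagent. The sketch below is only an illustration of that idea. The subagent names and the mapping are hypothetical, and the real service handles delegation in its own way.

```python
# Hypothetical mapping from Azure resource provider namespace to a subagent name.
SUBAGENTS = {
    "microsoft.containerservice": "aks-workloads",
    "microsoft.sql": "databases",
    "microsoft.network": "networking",
}


def route(resource_id: str, default: str = "general") -> str:
    """Pick a subagent based on the resource provider in an Azure resource ID."""
    parts = resource_id.lower().strip("/").split("/")
    if "providers" in parts:
        idx = parts.index("providers")
        if idx + 1 < len(parts):
            # The segment after 'providers' is the provider namespace.
            return SUBAGENTS.get(parts[idx + 1], default)
    return default


if __name__ == "__main__":
    # Hypothetical resource IDs for illustration.
    for rid in (
        "/subscriptions/0/resourceGroups/rg/providers/Microsoft.ContainerService/managedClusters/aks",
        "/subscriptions/0/resourceGroups/rg/providers/Microsoft.Sql/servers/sql-01",
        "/subscriptions/0/resourceGroups/rg/providers/Microsoft.Storage/storageAccounts/st01",
    ):
        print(route(rid), rid)
```

Anything that does not match a specialist falls back to a general owner, which is the same rule most on-call rotations already follow.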

Closing Words

Azure SRE Agent is not magic. It will not fix bad architecture, missing alerts, or unclear ownership. But used correctly, it reduces noise, speeds up understanding, and supports SRE thinking instead of fighting it. If you treat it as a teammate, not as a miracle, it fits surprisingly well into modern platform engineering.

From an AKS and Open Source perspective, that is exactly what we should ask from tooling. Azure SRE Agent is not Open Source, but it does not fight it either. You still use Prometheus, Grafana, Kubernetes-native tooling, GitOps, and open standards. The agent sits above this, as a reasoning layer. If you expect it to replace your stack, you will be disappointed. If you use it to connect the dots, it adds real value. SRE is not about tools. It is about reducing uncertainty under pressure. Azure SRE Agent does not remove that pressure, but it helps structure it. And that alone can already make a real difference.

Thank you for taking the time to read this post to the end. Stay tuned, because we will keep publishing more content on topics like this.

Author: Rolf Schutten
