From Runbooks to Reasoning: The Shift to Agentic Network Automation
By Alex Cronin


Network operations teams have spent decades building runbooks that encode step-by-step instructions for every failure mode. These decision trees and scripts automate the predictable, but networks don't fail predictably.
When an adjacency drops, the runbook says to check the interface. If the interface looks fine, the engineer starts hunting through logs, counters, neighbor state, and protocol timers. Minutes turn into an hour as they piece together what happened.
This is where a category shift is happening in network operations. The answer isn't better scripts or smarter dashboards. It's agents that can reason over evidence and investigate problems the way experienced engineers do.
Agentic network automation represents a fundamental change in how we approach network troubleshooting. Instead of encoding every possible path through a problem, you give an agent the ability to observe, hypothesize, and investigate based on what it finds.
The agent doesn't follow a flowchart. It asks what data would help it understand the situation, then goes and gets that data. Based on the results, it decides what to look at next.
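A minimal sketch of that loop, under heavy simplification. The tool and model interfaces here (tools.execute, llm.reason_about) are hypothetical placeholders for illustration, not a real Nanites API:

```python
# Hypothetical agent loop: observe, hypothesize, fetch more evidence.
# tools.execute() and llm.reason_about() are illustrative stand-ins.

def investigate(alert, tools, llm, max_steps=10):
    evidence = [alert]
    for _ in range(max_steps):
        # Ask the model what data would clarify the situation next.
        plan = llm.reason_about(evidence)   # e.g. "check BGP state on R2"
        if plan.confident_conclusion:
            return plan.conclusion          # hypothesis plus confidence score
        # Fetch that data and feed the result (or its failure) back in.
        evidence.append(tools.execute(plan.next_action))
    return llm.summarize(evidence)          # best-effort answer with caveats
```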
Why Scripts Hit a Ceiling
Traditional automation excels at handling known failure modes. If an interface flaps, the script runs five predetermined commands. If a BGP session goes down, it checks three specific things. If SNMP stops responding, it applies a remediation playbook.
The fundamental problem is a combinatorial explosion. A modern multi-vendor network might run IOS-XE on edge routers, IOS-XR in the core, and SONiC on data center switches. Each platform has different CLI syntax, different command semantics, and different failure modes. Encoding every path through every failure on every platform isn't just tedious; it becomes impossible to maintain as complexity grows.
Scripts also struggle with ambiguity in ways that limit their usefulness. When someone reports that the network feels slow, that's not a trigger condition a script can act on. Neither is the complaint that something changed and now customers are experiencing issues.
Scripts require precise inputs to function. Crucially, they treat the absence of data as an error. If a script tries to connect to a device and times out, the script crashes or halts.
But in networking, silence is a signal. A timeout isn't just an error; it’s a clue that the problem is likely physical or infrastructure-related rather than a configuration mismatch. This mismatch explains why so much troubleshooting time is still spent on manual investigation: humans know how to interpret 'no signal,' but traditional tools do not.
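The difference shows up even in a toy sketch. Assuming a hypothetical connect_and_run() helper standing in for an SSH session, a script and an agent handle the same timeout very differently:

```python
# connect_and_run() is a hypothetical helper standing in for an
# SSH/NETCONF session; the point is what happens when it times out.

def script_check(device: str) -> dict:
    # Script-style: absence of data is a fatal error; the run halts here.
    output = connect_and_run(device, "show ip interface brief")
    return {"device": device, "data": output}

def agent_check(device: str) -> dict:
    # Agent-style: absence of data is itself a finding to reason over.
    try:
        output = connect_and_run(device, "show ip interface brief", timeout=10)
        return {"device": device, "reachable": True, "data": output}
    except (TimeoutError, ConnectionError):
        # Silence is a signal: an unreachable device points toward a
        # physical or infrastructure failure, not a config mismatch.
        return {"device": device, "reachable": False, "data": None}
```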
What Makes an Agent Different
An agent doesn't need a predefined path through every problem. Instead, it needs tools and the ability to reason about what those tools reveal.
When you give an agent access to device CLIs, telemetry databases, and log streams, it can observe what's happening across the network. It then decides what to investigate next based on what it finds, adjusting its approach as new evidence emerges.
This is fundamentally different from a large language model that simply generates CLI commands from a prompt. That approach breaks the moment syntax varies across platforms or when the investigation requires adaptive reasoning.
Real agentic automation requires several capabilities working together. The agent needs multi-vendor command translation that happens automatically rather than through static lookup tables. It needs the ability to investigate multiple devices in parallel without requiring the operator to specify which ones are relevant. It needs evidence-based reasoning that adjusts the investigation based on findings, and confidence scoring that flags uncertainty instead of overclaiming results.
Perhaps most importantly, the agent needs security guardrails that prevent dangerous operations even when the underlying model makes mistakes. An agent with read-only access and human-in-the-loop approval for changes lets engineering teams move faster without compromising safety.
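One application-layer guardrail can be as simple as an allowlist of read-only command prefixes checked before anything reaches a device. This is a minimal sketch with illustrative patterns; device-level AAA remains the final backstop:

```python
import re

# Illustrative read-only guardrail; real deployments would layer this
# with session controls and device-level AAA as a final backstop.
READ_ONLY_PREFIXES = ("show ", "display ", "ping ", "traceroute ")
BLOCKED_PATTERNS = (r"\breload\b", r"\bconfigure\b", r"\bwrite\b", r"\bdelete\b")

def is_safe(command: str) -> bool:
    cmd = command.strip().lower()
    if any(re.search(p, cmd) for p in BLOCKED_PATTERNS):
        return False  # fail closed on anything that looks like a write
    return cmd.startswith(READ_ONLY_PREFIXES)

assert is_safe("show ip bgp summary")
assert not is_safe("configure terminal")
```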
The Three-Level Mental Model
One of the insights that separates working agents from impressive demos is understanding how network protocols expose information at different levels.
Every routing protocol organizes its data into three distinct levels. State information tells you what neighbors exist and whether adjacencies are currently up. Configuration information tells you about roles, timers, and interface assignments. Statistics tell you about counters, error rates, and historical behavior patterns.
These different types of information live in different commands. On different platforms, those commands have different names and different syntax, but the conceptual organization remains consistent.
An agent that treats "show isis neighbors" and "show isis interface" as interchangeable will miss crucial information like DIS election status when investigating flooding issues. An agent that understands the three-level model can reason its way to the right command even if it has never encountered that exact question before, because it understands what type of data it needs.
This distinction between pattern matching and genuine understanding determines whether an agent can handle novel situations or only repeat what it has seen in training.
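As a concrete sketch, the three-level model can be expressed as data the agent reasons over before it ever picks a command. The IS-IS command names below are illustrative:

```python
# Three-level model for IS-IS, expressed as data. The agent first
# classifies what *kind* of information a question needs, then maps
# that level to a platform-appropriate command.
ISIS_LEVELS = {
    "state": "show isis neighbors",           # who is adjacent right now
    "configuration": "show isis interface",   # roles, timers, DIS election
    "statistics": "show isis statistics",     # counters, error history
}

def command_for(level: str) -> str:
    return ISIS_LEVELS[level]

# A flooding question needs configuration-level data (DIS election),
# not state-level data, even though both commands mention "isis".
print(command_for("configuration"))
```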
Multi-Vendor Reality
Most networks today are multi-vendor. Acquisitions, cost optimization initiatives, and the rise of open networking mean that a typical environment contains equipment from several vendors running different operating systems.
As an example, SONiC presents a particularly interesting challenge for automation. It runs on commodity switches, uses FRRouting for BGP and OSPF routing protocols, and exposes three distinct CLI syntax layers: the native CLI commands for system operations, the FRRouting vtysh shell for routing commands, and standard Linux commands for everything else.
An agent that tries to run "show bgp summary" directly on a SONiC switch will fail because that command doesn't exist in the native CLI. The correct invocation is "sudo vtysh -c 'show bgp summary'" to reach the FRRouting layer. This detail isn't well documented in most training data, so it has to be learned empirically through actual device interaction.
The same pattern of platform-specific quirks repeats across the industry. IOS-XR uses "show bgp summary" while IOS-XE requires "show ip bgp summary" for the same information. Virtual router images used in lab environments lack hardware health commands that exist on physical gear. Some commands hang indefinitely instead of failing cleanly with an error message.
These quirks aren't rare edge cases that can be ignored. They represent the daily reality of network operations in heterogeneous environments.
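A static lookup table makes the problem easy to see, even though, as noted earlier, real agentic translation has to happen dynamically rather than through tables like this. Here is a sketch covering just a single "BGP summary" intent across the three platforms discussed:

```python
# Static translation table for ONE intent on three platforms. This is
# the approach that doesn't scale; it is shown here only to make the
# per-platform divergence concrete.
TRANSLATIONS = {
    "bgp_summary": {
        "ios-xe": "show ip bgp summary",
        "ios-xr": "show bgp summary",
        "sonic": "sudo vtysh -c 'show bgp summary'",  # FRRouting layer
    },
}

def translate(intent: str, platform: str) -> str:
    try:
        return TRANSLATIONS[intent][platform]
    except KeyError:
        raise ValueError(f"no known translation for {intent!r} on {platform!r}")

print(translate("bgp_summary", "sonic"))
```

Multiply that table by hundreds of intents and a half dozen platforms and the maintenance burden becomes obvious, which is exactly why the translation has to be reasoned rather than enumerated.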
The Nanites View
Nanites approaches agentic network automation with five core capabilities that address these challenges directly.
Natural language orchestration allows operators to describe what they want to know rather than specifying which commands to run. The system translates intent into platform-appropriate CLI syntax automatically, handling the translation layer so engineers can focus on the problem rather than syntax details.
Parallel multi-device investigation enables the system to fan out sub-agents simultaneously when a health check spans multiple devices across different operating systems. Results from all devices synthesize into a unified view that highlights issues and correlates findings across the network.
Confidence scoring and self-critique ensure that the system tracks data gaps and incomplete coverage honestly. If telemetry data has an 18-minute hole during the investigation window, the confidence score reflects that uncertainty instead of claiming certainty about conclusions.
Defense-in-depth safety architecture enforces read-only operation through multiple independent security layers. Device-level AAA provides a final backstop that prevents write operations even if all application-layer controls somehow fail. The agent proposes changes for human review rather than executing them directly.
Evidence-gated investigation triggers automatic follow-up when initial findings suggest a root cause without definitively proving it. Symptoms lead to hypotheses, and hypotheses lead to targeted data collection that either confirms or refutes them.
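To make the fan-out and the coverage-based confidence idea concrete, here is a minimal sketch using asyncio; investigate_device() is a hypothetical stand-in for a real sub-agent:

```python
import asyncio

async def investigate_device(device: str) -> dict:
    # Hypothetical sub-agent entry point; real work would run CLI
    # commands and telemetry queries against the device.
    await asyncio.sleep(0)
    return {"device": device, "ok": True, "findings": []}

async def health_check(devices: list[str]) -> dict:
    results = await asyncio.gather(
        *(investigate_device(d) for d in devices), return_exceptions=True
    )
    answered = [r for r in results if isinstance(r, dict)]
    # Confidence reflects coverage: unreachable devices and data gaps
    # lower the score instead of being silently dropped.
    return {
        "results": answered,
        "confidence": round(len(answered) / len(devices), 2),
    }

print(asyncio.run(health_check(["R1-IOSXR", "R2-XE", "SONIC-LEAF1"])))
```

The table below contrasts this adaptive model with runbook-style automation.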
| Feature | Automated Runbooks / Scripts | Nanites Agentic AI |
| --- | --- | --- |
| Primary goal | Reduce coordination time | Reduce investigation time |
| Logic model | Linear (Step 1 → Step 2 → Step 3) | Adaptive (Observe → Hypothesize → Pivot) |
| Handling failure | Stops and pages a human ("Step 2 failed") | Pivots logic ("Step 2 failed, trying Step 2b") |
| Setup required | High (every step written explicitly) | Low (learns topology and commands dynamically) |
| Best for | Known, repetitive tasks (e.g., restart a pod) | Unknown, complex analysis (e.g., "the network is slow") |
Real Example of Fully Autonomous Agentic Workflow
The difference between a script and an agent becomes clear when the first approach fails. During routine internal testing, an unexpected Grafana alert fired, indicating a critical BGP neighbor down event on R2-XE. The agent's initial reasoning step took 0.11 seconds: analyze interface status, BGP neighbor state, logs from the last 60 minutes, and device health to determine root cause.
It spawned a sub-agent for R2-XE and executed four diagnostic commands in parallel: show bgp neighbors, show ip interface brief, show logging, and show platform resources. All four failed. The device was unreachable.
This is where a traditional automation script would typically stop and page an engineer. The agent did something different.
Before taking another action, the agent paused to reason. It explicitly stated what it knew and what it needed to find out: "All direct diagnostics failed due to loss of connectivity to R2-XE, so I am escalating to indirect evidence gathering from its BGP peer." It noted that BGP neighbor 10.10.1.1 was the loopback of R1-IOSXR in the same routing domain. It formed a hypothesis that the BGP session was down due to connectivity or operational failure on R2-XE, possibly affecting both data and management planes.
This is the pivotal moment that distinguishes an agent from a playbook. No pre-written rule in the system said, 'If R2 is unreachable, check R1.' In a complex mesh, writing such rules for every neighbor pair is combinatorially intractable.
Instead, the agent derived this strategy from first principles. It understood the topology, realized it was blind to the victim (R2), and autonomously decided to interview the witness (R1). It performed 'negative fault isolation', using the failure of its own tools to narrow down the blast radius.
Then it listed what evidence would confirm or refute that hypothesis: Is the BGP session also down from R1-IOSXR's perspective? Are there interface or protocol errors on the peer? Are there relevant log messages?
The agent spawned a second sub-agent for R1-IOSXR and ran three more commands. When those also failed, it had enough information to conclude: both routers in the cisco-as100 domain were simultaneously unreachable, suggesting a broader infrastructure outage rather than a protocol-specific BGP issue.
Total time from alert to accurate hypothesis: 89 seconds. Confidence score: 90%, with a self-critique noting it could have more explicitly highlighted the management-layer failure pattern.
The reasoning chain is what matters here. The agent didn't follow a predefined path. It observed, hypothesized, gathered evidence, adapted when that evidence was unavailable, and reached a conclusion that matched what manual verification later confirmed: both virtual machines had powered off unexpectedly.¹
This was fully autonomous. No human input, no scripts, no playbooks. Alert received, safe investigation started, correct hypothesis produced; all in under 2 minutes.
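Compressed into pseudocode-like Python, the pivot looks roughly like this; run_diagnostics(), topology_peers(), and analyze() are hypothetical stand-ins for the agent's real tools:

```python
# Heavily simplified sketch of the pivot described above. All helper
# functions here are illustrative, not a real Nanites API.

def investigate_bgp_down(target: str) -> str:
    direct = run_diagnostics(target)  # e.g. four commands in parallel
    if not direct.all_failed:
        return analyze(direct)
    # Negative fault isolation: our own tool failures are evidence.
    for peer in topology_peers(target):       # e.g. R1-IOSXR for R2-XE
        witness = run_diagnostics(peer)
        if witness.all_failed:
            # Both ends dark: suspect infrastructure, not protocol.
            return "broader infrastructure outage in the domain"
        if witness.shows_session_down(target):
            return f"{target} down, confirmed from {peer}'s perspective"
    return f"{target} unreachable; no witness data available"
```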
Watch the full investigation: [video]
The Implications for Network Teams
This technology doesn't replace network engineers. Instead, it changes what they spend their time doing each day.
The investigative grind of logging into devices, running show commands, correlating outputs across multiple sessions, and switching syntax between vendors consumes roughly 80% of troubleshooting time. Automating that investigative work allows engineers to focus on architecture decisions, capacity planning, and the genuinely difficult problems that require human judgment.
This capability can reduce Mean Time To Innocence (MTTI) by more than 90%. The network is often blamed for application failures. An agent that can quickly verify path health, even when devices are uncooperative, lets network teams prove 'it's not the network', or confirm that it is, in seconds rather than hours, ending cross-team finger-pointing sooner.
There's also a knowledge-capture dimension worth considering. Senior engineers carry decades of intuition about which commands reveal which problems and how different symptoms correlate with root causes. That expertise often leaves the organization when they do. An agentic system can encode that reasoning in a way that persists across personnel changes and scales across the team.
The read-only constraint in Nanites is a deliberate feature rather than a limitation. The majority of mean-time-to-repair is spent on diagnosis, not remediation. Once you identify the root cause, the fix is usually straightforward. The agent handles the detective work while humans make the actual changes with full understanding of what they're doing and why.
| Task | Manual | Agentic | Reduction |
| --- | --- | --- | --- |
| Adjacency failure diagnosis | 30-60 min | <3 min | ~95% |
| Network health assessment | 30-60 min | <90 sec | ~97% |
| Interface/telemetry queries | 5-10 min | <20 sec | ~95% |
| Alert triage | 15-30 min | <60 sec | ~95% |
| Multi-vendor diagnostics | 10-20 min | <60 sec | ~95% |
Metrics are based on internal validation across Cisco IOS-XE, IOS-XR, and SONiC devices.¹
What Comes Next
Agentic network automation is still in its early stages of adoption. The underlying technology works reliably, but widespread deployment requires building trust through demonstrated performance.
That trust develops through transparency in how the system operates. Confidence scores that openly admit uncertainty help operators understand when to dig deeper. Security architectures that fail closed rather than open provide assurance that mistakes won't cascade into outages. Human-in-the-loop designs keep operators in control of their networks while still providing the benefits of automated investigation.
Networks are not getting any simpler. Multi-cloud deployments, edge computing, disaggregated hardware, and open-source operating systems keep adding complexity faster than teams can hire to address it. Agentic network automation is a way to close that gap.
Endnotes
¹ Benchmarked on internal validation testing in controlled lab and pilot environments using EVE-NG and Cisco CML testbeds. Test scenarios include Cisco IOS-XE (CSR1000V, cat8000v), Cisco IOS-XR (XRv 9000), and SONiC-VS across 46 validated test categories. Time comparisons reflect manual effort estimates from experienced network engineers versus measured agent completion times. "Production" status indicates capabilities currently enabled in the platform; "Validated, disabled" indicates capabilities proven functional but disabled by policy pending additional safety validation.
About Nanites
Nanites is a system of specialized AI agents on call 24/7, helping resolve network issues in minutes and making networks easier to operate. We are working with leading companies and organizations to build the world's first AI network autopilot. Nanites has been a featured speaker at CableLabs and NetworkX, has been mentioned by Cisco, and was a Top 12 finalist in the 2025 T-Mobile T-Challenge. Nanites is in early access with select design partners. We're validating reliability, guardrails, and workflows in controlled lab and pilot environments before broader production availability.




