Agent Architecture

The agent_architecture metric evaluates agent and orchestration architecture in codebases that use tool-calling agents and multi-step workflows.

Overview and Why It Matters

Agent systems fail in ways that classic service code does not:

Unbounded loops and retries can trigger runaway cost and unstable behavior.
Weak tool governance can expose dangerous capabilities.
Missing observability and eval harnesses make regressions invisible.
Poor coordination and state handling can cause deadlocks or cross-session leaks.

This metric provides actionable scores and findings so teams can gate these risks in CI.

What It Measures

The metric scores four axes: Reliability, Governance, Safety, and Coordination. Each axis produces a 0–100 score; overall health is their equally weighted average, reported on a 0–1 scale.

Reliability

Loop guard coverage and budget propagation
Memory bounds and retention controls
Retry/backoff behavior
Observability and evaluation (scored as part of Reliability): step-level trace completeness, runtime SLO coverage, agent eval harness coverage, and presence of adversarial/stochastic runs
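
As an illustration of the kind of loop guard this axis looks for, here is a minimal sketch of an agent loop bounded by both a step count and a wall-clock timeout. The names `LoopBudget`, `LoopBudgetExceeded`, and `run_agent`, and the default limits, are hypothetical; they are not part of the metric or of any particular agent framework.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class LoopBudgetExceeded(RuntimeError):
    """Raised when the loop hits its step or wall-clock limit."""


@dataclass
class LoopBudget:
    max_steps: int = 8        # hypothetical default
    timeout_s: float = 30.0   # hypothetical default


def run_agent(step: Callable[[int], Optional[str]], budget: LoopBudget) -> str:
    """Call `step` until it returns a result or the budget is exhausted."""
    deadline = time.monotonic() + budget.timeout_s
    for i in range(budget.max_steps):
        # Check the wall-clock guard before starting each step.
        if time.monotonic() > deadline:
            raise LoopBudgetExceeded(f"timeout after {i} steps")
        result = step(i)
        if result is not None:
            return result
    # The step guard: the loop can never run more than max_steps times.
    raise LoopBudgetExceeded(f"no result within {budget.max_steps} steps")
```

Passing the same `LoopBudget` object down to sub-agents or nested tool loops is one way to approximate budget propagation.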

Governance

Tool policy presence and scope controls
Input/output schema validation
Tool result validation
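
The controls above can be pictured with a small sketch that combines a tool allowlist with input validation. `TOOL_POLICY`, the `search_docs` tool, and `validate_tool_call` are hypothetical names chosen for illustration; a real codebase might use a schema library instead of plain type checks.

```python
from typing import Any, Dict

# Hypothetical policy: tool name -> required argument names and types.
TOOL_POLICY: Dict[str, Dict[str, type]] = {
    "search_docs": {"query": str, "limit": int},
}


def validate_tool_call(name: str, args: Dict[str, Any]) -> None:
    """Reject calls to tools outside the policy or with malformed arguments."""
    schema = TOOL_POLICY.get(name)
    if schema is None:
        raise PermissionError(f"tool not allowed by policy: {name}")
    unexpected = set(args) - set(schema)
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    for key, expected in schema.items():
        if key not in args:
            raise ValueError(f"missing argument: {key}")
        if not isinstance(args[key], expected):
            raise TypeError(f"{key} must be {expected.__name__}")
```

The same pattern can be applied to tool results before they are fed back into the agent's context.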

Safety

Tool execution safety: sandbox, approval gates, human-in-the-loop, prompt-injection and PII defenses
MCP safety: auth, OAuth binding, tool annotations, poisoning/rug-pull risks
A2A safety: agent cards, task state machines, webhook auth, handoff guardrails

Coordination and Concurrency

Routing and planner/executor coordination risks
Fanout/deadlock/callback depth risks
Instruction boundary and state isolation risks

Composite Scores and Interpretation

Metric	Range	Interpretation
`agent_architecture.agent_reliability_score`	`0..100`	Higher is better. Reliability: loop guards, memory, retries, observability, eval.
`agent_architecture.governance_readiness`	`0..100`	Higher is better. Governance: tool policy, schema validation, tool result validation.
`agent_architecture.safety_protocol_score`	`0..100`	Higher is better. Safety: tool execution, MCP, and A2A sub-scores.
`agent_architecture.coordination_maturity_score`	`0..100`	Higher is better. Coordination: routing, deadlock/fanout, state isolation.
`agent_architecture.overall_agent_health`	`0..1`	Higher is better. Weighted average of the four axis scores (0.25 each), expressed on a 0–1 scale.
`agent_architecture.weakest_axis_min`	`0..1`	Higher is better. Minimum of the four axis scores, expressed on a 0–1 scale; use for “worst axis” reporting.
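
The two composite rows can be sketched as a short calculation. This assumes that each 0..100 axis score is rescaled to 0..1 by dividing by 100, which the ranges in the table suggest but do not state outright; `composite_scores` is a hypothetical helper, not the tool's API.

```python
from typing import Dict

AXES = (
    "agent_reliability_score",
    "governance_readiness",
    "safety_protocol_score",
    "coordination_maturity_score",
)


def composite_scores(axis_scores: Dict[str, float]) -> Dict[str, float]:
    """Combine four 0..100 axis scores into the two 0..1 composites."""
    # Assumption: axis scores are rescaled to 0..1 by dividing by 100.
    unit = [axis_scores[name] / 100.0 for name in AXES]
    return {
        "overall_agent_health": sum(0.25 * s for s in unit),
        "weakest_axis_min": min(unit),
    }
```

For example, axis scores of 90, 60, 80, and 70 give an overall health of about 0.75 and a weakest-axis value of 0.6, which shows how a single weak axis can hide behind a reasonable average.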

Practical reading:

High reliability + low governance: stable behavior but unsafe controls.
Low reliability + high governance: safer controls but fragile operations.
High safety + low coordination: good protocol controls but coordination risks (e.g. deadlock, state leaks).
High on all four: a healthy, production-ready posture.

Config Quick Reference

metrics:
  - id: agent_architecture
    enabled: true
    config:
      require_loop_guards: true
      require_tool_policy: true
      require_eval_harness: true
      languages: ["python", "typescript"]
      eval_path_patterns: ["tests/agents", "evals", "agent_specs"]
      scoring:
        governance_weights:
          tool_policy: 0.35
          schema_validation: 0.35
          tool_result_validation: 0.30
        reliability_weights:
          loop_guard: 0.17
          memory: 0.11
          retry: 0.11
          trace_linkage: 0.10
          runtime_slo: 0.08
          eval_maturity: 0.08
          trace_eval: 0.07
          cost_budget: 0.06
          checkpoint_durability: 0.06
          interrupt_resume: 0.04
          otel_semconv: 0.04
          otel_event_coverage: 0.04
          decision_observability: 0.04
          circuit_breaker: 0.06
          output_validation: 0.05
          hallucination_propagation: 0.04
        coordination_weights:
          coordination_risk: 0.143
          routing_pattern: 0.143
          instruction_boundary: 0.143
          deadlock: 0.143
          fanout_control: 0.143
          state_isolation: 0.143
          callback_depth: 0.143
        safety_protocol_weights:
          tool_execution: 0.40
          mcp: 0.30
          a2a: 0.30

Config Semantics

languages: extension-based filter over detected agent call sites. Empty means no filter.
eval_path_patterns: extra directories to scan for eval/trace harnesses in addition to defaults.
scoring.governance_weights, scoring.reliability_weights, scoring.coordination_weights: each group is normalized independently when its sum is greater than 0, so the weights in a group do not need to sum to 1; if a group sums to 0, the defaults are restored.
scoring.safety_protocol_weights: normalized across tool_execution, mcp, and a2a; if all are 0, the defaults (0.40/0.30/0.30) are restored.
require_loop_guards, require_tool_policy, require_eval_harness: when a flag is enabled, the severity of the corresponding finding is escalated to Critical if its mapped risk score exceeds 0.2.
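
The normalization and escalation rules above can be sketched as follows. This is an illustration of the documented behavior under stated assumptions, not the tool's actual implementation: `normalize_weights` and `escalate` are hypothetical names, and only the safety defaults are taken from the text.

```python
from typing import Dict

# Defaults for scoring.safety_protocol_weights, as documented above.
DEFAULT_SAFETY_WEIGHTS = {"tool_execution": 0.40, "mcp": 0.30, "a2a": 0.30}


def normalize_weights(
    weights: Dict[str, float], defaults: Dict[str, float]
) -> Dict[str, float]:
    """Scale a weight group to sum to 1; restore defaults if it sums to 0."""
    total = sum(weights.values())
    if total <= 0:
        weights, total = defaults, sum(defaults.values())
    return {name: value / total for name, value in weights.items()}


def escalate(
    severity: str, risk_score: float, required: bool, threshold: float = 0.2
) -> str:
    """Return "Critical" when the require_* flag is on and risk exceeds 0.2."""
    return "Critical" if required and risk_score > threshold else severity
```

With this sketch, weights of 2, 1, and 1 normalize to 0.5, 0.25, and 0.25, and a finding with a risk score of exactly 0.2 is not escalated because the comparison is strict.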

Policy Quick Start

metrics:
  - id: agent_architecture

policy:
  invariants:
    - metric: agent_architecture.loop_guard_absence
      op: "<="
      value: 0.20
      message: "Agent loops must have max-steps or timeout guards"
    - metric: agent_architecture.governance_readiness
      op: ">="
      value: 80
      message: "Tool governance and schema controls must be strong"
    - metric: agent_architecture.agent_reliability_score
      op: ">="
      value: 75
      message: "Agent reliability baseline not met"
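
To make the invariant semantics concrete, here is a minimal sketch of how such a policy might be checked against computed metric values. `check_invariants` and the `OPS` table are hypothetical; only the `<=` and `>=` operators appear in the policy above, and the other operators are included as an assumption.

```python
import operator
from typing import Callable, Dict, List

# Only "<=" and ">=" appear in the policy above; the rest are assumed.
OPS: Dict[str, Callable[[float, float], bool]] = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "==": operator.eq,
}


def check_invariants(invariants: List[dict], metrics: Dict[str, float]) -> List[str]:
    """Return the message of every invariant whose comparison fails."""
    failures = []
    for inv in invariants:
        actual = metrics[inv["metric"]]
        if not OPS[inv["op"]](actual, inv["value"]):
            failures.append(inv["message"])
    return failures
```

A CI step could fail the build whenever the returned list is non-empty and print the messages as the reason.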
