Agent Architecture

The agent_architecture metric evaluates agent and orchestration architecture in codebases that use tool-calling agents and multi-step workflows.

Agent systems fail in ways that classic service code does not:

  • Unbounded loops and retries can trigger runaway cost and unstable behavior.
  • Weak tool governance can expose dangerous capabilities.
  • Missing observability and eval harnesses make regressions invisible.
  • Poor coordination and state handling can cause deadlocks or cross-session leaks.

This metric provides actionable scores and findings so teams can gate these risks in CI.
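To make the first risk concrete, here is a minimal sketch of the kind of loop guard the metric looks for: a hard step limit plus a wall-clock timeout. The names `run_agent_loop`, `LoopBudgetExceeded`, `max_steps`, and `timeout_s` are illustrative assumptions, not part of Arxo or any agent framework.

```python
import time
from typing import Callable, Optional


class LoopBudgetExceeded(RuntimeError):
    """Raised when the agent loop hits its step or time limit."""


def run_agent_loop(
    step: Callable[[int], Optional[str]],
    max_steps: int = 8,
    timeout_s: float = 30.0,
) -> str:
    """Call `step` until it returns a final answer or a guard trips.

    `step` is a hypothetical callable: it receives the step index and
    returns a final answer string, or None to request another iteration.
    """
    deadline = time.monotonic() + timeout_s
    for i in range(max_steps):
        # Time guard: checked before each step, so a slow step can overrun once.
        if time.monotonic() > deadline:
            raise LoopBudgetExceeded(f"timeout after {i} steps")
        result = step(i)
        if result is not None:
            return result
    # Step guard: the loop can never run more than max_steps iterations.
    raise LoopBudgetExceeded(f"no answer within {max_steps} steps")
```

Raising an exception rather than returning a partial answer makes the budget violation visible to the caller, which can then log it or fall back to a cheaper path.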

The metric uses four axes: Reliability, Governance, Safety, and Coordination. Each axis produces a 0–100 score; overall health combines them.

Reliability

  • Loop guard coverage and budget propagation
  • Memory bounds and retention controls
  • Retry/backoff behavior
  • Observability and eval: step-level trace completeness, runtime SLO coverage, agent eval harness coverage, and adversarial/stochastic run presence

Governance

  • Tool policy presence and scope controls
  • Input/output schema validation
  • Tool result validation

Safety

  • Tool execution safety: sandbox, approval gates, human-in-the-loop, prompt-injection and PII defenses
  • MCP safety: auth, OAuth binding, tool annotations, poisoning/rug-pull risks
  • A2A safety: agent cards, task state machines, webhook auth, handoff guardrails

Coordination

  • Routing and planner/executor coordination risks
  • Fanout/deadlock/callback depth risks
  • Instruction boundary and state isolation risks

| Metric | Range | Interpretation |
| --- | --- | --- |
| agent_architecture.agent_reliability_score | 0..100 | Higher is better. Reliability: loop guards, memory, retries, observability, eval. |
| agent_architecture.governance_readiness | 0..100 | Higher is better. Governance: tool policy, schema validation, tool result validation. |
| agent_architecture.safety_protocol_score | 0..100 | Higher is better. Safety: tool execution, MCP, and A2A sub-scores. |
| agent_architecture.coordination_maturity_score | 0..100 | Higher is better. Coordination: routing, deadlock/fanout, state isolation. |
| agent_architecture.overall_agent_health | 0..1 | Weighted average of the four axes (0.25 each). |
| agent_architecture.weakest_axis_min | 0..1 | Minimum of the four axis scores (as 0–1); use for “worst axis” reporting. |
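
The two summary metrics can be reproduced by hand from the four axis scores. The sketch below assumes the 0–100 axis scores are divided by 100 before being combined, which is inferred from the 0..1 ranges listed above; the function name `summarize_axes` is hypothetical.

```python
from typing import Dict

# Equal weights, as stated for overall_agent_health (0.25 each).
AXIS_WEIGHTS = {
    "agent_reliability_score": 0.25,
    "governance_readiness": 0.25,
    "safety_protocol_score": 0.25,
    "coordination_maturity_score": 0.25,
}


def summarize_axes(scores: Dict[str, float]) -> Dict[str, float]:
    """Combine four 0-100 axis scores into the two 0-1 summary metrics."""
    # Assumption: axis scores are rescaled to 0-1 by dividing by 100.
    unit = {name: scores[name] / 100.0 for name in AXIS_WEIGHTS}
    overall = sum(AXIS_WEIGHTS[name] * unit[name] for name in AXIS_WEIGHTS)
    return {
        "overall_agent_health": overall,
        "weakest_axis_min": min(unit.values()),
    }


summary = summarize_axes({
    "agent_reliability_score": 80,
    "governance_readiness": 60,
    "safety_protocol_score": 90,
    "coordination_maturity_score": 70,
})
# overall_agent_health is about 0.75; weakest_axis_min is 0.6 (governance).
```

Reporting the minimum alongside the average matters because a single weak axis can hide behind three strong ones in the average.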

Practical reading:

  • High reliability + low governance: stable behavior but unsafe controls.
  • Low reliability + high governance: safer controls but fragile operations.
  • High safety + low coordination: good protocol controls but coordination risks (e.g. deadlock, state leaks).
  • High all four: healthy and production-ready posture.
metrics:
  - id: agent_architecture
    enabled: true
    config:
      require_loop_guards: true
      require_tool_policy: true
      require_eval_harness: true
      languages: ["python", "typescript"]
      eval_path_patterns: ["tests/agents", "evals", "agent_specs"]
      scoring:
        governance_weights:
          tool_policy: 0.35
          schema_validation: 0.35
          tool_result_validation: 0.30
        reliability_weights:
          loop_guard: 0.17
          memory: 0.11
          retry: 0.11
          trace_linkage: 0.10
          runtime_slo: 0.08
          eval_maturity: 0.08
          trace_eval: 0.07
          cost_budget: 0.06
          checkpoint_durability: 0.06
          interrupt_resume: 0.04
          otel_semconv: 0.04
          otel_event_coverage: 0.04
          decision_observability: 0.04
          circuit_breaker: 0.06
          output_validation: 0.05
          hallucination_propagation: 0.04
        coordination_weights:
          coordination_risk: 0.143
          routing_pattern: 0.143
          instruction_boundary: 0.143
          deadlock: 0.143
          fanout_control: 0.143
          state_isolation: 0.143
          callback_depth: 0.143
        safety_protocol_weights:
          tool_execution: 0.40
          mcp: 0.30
          a2a: 0.30

  • languages: filters detected agent call sites by file extension. An empty list means no filter.
  • eval_path_patterns: extra directories to scan for eval/trace harnesses, in addition to the defaults.
  • scoring.governance_weights, scoring.reliability_weights, scoring.coordination_weights: each group is normalized independently when its sum is greater than 0, so the weights need not add up to exactly 1; if a group sums to 0, the defaults are restored.
  • scoring.safety_protocol_weights: normalized across tool_execution, mcp, and a2a; if all three are 0, the defaults 0.40/0.30/0.30 are restored.
  • require_loop_guards, require_tool_policy, require_eval_harness: when enabled, severity is escalated to Critical if the mapped risk score is greater than 0.2.
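
The normalization and escalation rules above can be sketched as follows. This is an illustration of the described behavior, not Arxo's implementation: `normalize_weights` and `escalate` are hypothetical names, treating a negative sum like a zero sum is an assumption, and "Warning" is a placeholder severity level.

```python
from typing import Dict

DEFAULT_SAFETY_WEIGHTS = {"tool_execution": 0.40, "mcp": 0.30, "a2a": 0.30}


def normalize_weights(
    weights: Dict[str, float], defaults: Dict[str, float]
) -> Dict[str, float]:
    """Scale one weight group to sum to 1, restoring defaults on a zero sum."""
    total = sum(weights.values())
    if total <= 0:  # assumption: non-positive sums are handled like a zero sum
        weights = defaults
        total = sum(defaults.values())
    return {name: value / total for name, value in weights.items()}


def escalate(
    severity: str, required: bool, risk: float, threshold: float = 0.2
) -> str:
    """Return "Critical" when a require_* flag is on and risk exceeds 0.2."""
    return "Critical" if required and risk > threshold else severity


# Weights that sum to 4 are rescaled to 0.5 / 0.25 / 0.25.
custom = normalize_weights(
    {"tool_execution": 2, "mcp": 1, "a2a": 1}, DEFAULT_SAFETY_WEIGHTS
)
# An all-zero group falls back to the documented defaults.
restored = normalize_weights(
    {"tool_execution": 0, "mcp": 0, "a2a": 0}, DEFAULT_SAFETY_WEIGHTS
)
```

Because each group is normalized on its own, the sample reliability weights, which sum to more than 1, are still valid input.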
metrics:
  - id: agent_architecture
policy:
  invariants:
    - metric: agent_architecture.loop_guard_absence
      op: "<="
      value: 0.20
      message: "Agent loops must have max-steps or timeout guards"
    - metric: agent_architecture.governance_readiness
      op: ">="
      value: 80
      message: "Tool governance and schema controls must be strong"
    - metric: agent_architecture.agent_reliability_score
      op: ">="
      value: 75
      message: "Agent reliability baseline not met"
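
To show how such invariants gate a CI run, here is a minimal sketch that checks a set of metric values against the three rules above. The evaluator `check_invariants` and the sample values are hypothetical; only the two operators that appear in the policy are supported.

```python
import operator
from typing import Dict, List

# Only the comparison operators used in the policy above.
OPS = {"<=": operator.le, ">=": operator.ge}

INVARIANTS = [
    {"metric": "agent_architecture.loop_guard_absence", "op": "<=",
     "value": 0.20,
     "message": "Agent loops must have max-steps or timeout guards"},
    {"metric": "agent_architecture.governance_readiness", "op": ">=",
     "value": 80,
     "message": "Tool governance and schema controls must be strong"},
    {"metric": "agent_architecture.agent_reliability_score", "op": ">=",
     "value": 75,
     "message": "Agent reliability baseline not met"},
]


def check_invariants(
    invariants: List[dict], values: Dict[str, float]
) -> List[str]:
    """Return one failure line per invariant whose comparison is false."""
    failures = []
    for inv in invariants:
        actual = values[inv["metric"]]
        if not OPS[inv["op"]](actual, inv["value"]):
            failures.append(f'{inv["metric"]}: {inv["message"]} (got {actual})')
    return failures


# Made-up sample values: governance is below its threshold of 80.
failures = check_invariants(INVARIANTS, {
    "agent_architecture.loop_guard_absence": 0.10,
    "agent_architecture.governance_readiness": 72,
    "agent_architecture.agent_reliability_score": 81,
})
```

A CI step could fail the build whenever the returned list is non-empty and print each entry as the reason.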