Agent Architecture
The agent_architecture metric evaluates agent and orchestration architecture in codebases that use tool-calling agents and multi-step workflows.
Overview and Why It Matters
Agent systems fail in ways that are different from classic service code:
- Unbounded loops and retries can trigger runaway cost and unstable behavior.
- Weak tool governance can expose dangerous capabilities.
- Missing observability and eval harnesses make regressions invisible.
- Poor coordination and state handling can cause deadlocks or cross-session leaks.
This metric provides actionable scores and findings so teams can gate these risks in CI.
What It Measures
The metric uses four axes: Reliability, Governance, Safety, and Coordination. Each axis produces a 0–100 score; overall health combines them.
Reliability
- Loop guard coverage and budget propagation
- Memory bounds and retention controls
- Retry/backoff behavior
- Observability and eval: step-level trace completeness, runtime SLO coverage, agent eval harness coverage, and presence of adversarial/stochastic runs
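
As an illustration of what loop guards, budget propagation, and retry/backoff can look like in agent code, here is a minimal Python sketch. The names `Budget`, `BudgetExceeded`, `call_with_backoff`, and `run_agent` are hypothetical and are not part of the metric; the sketch only shows the kind of pattern the Reliability axis looks for.

```python
import time
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when the loop runs out of steps or wall-clock time."""


@dataclass
class Budget:
    max_steps: int
    deadline: float  # absolute time.monotonic() value

    def spend_step(self) -> None:
        # Check both guards before every step.
        if self.max_steps <= 0:
            raise BudgetExceeded("step budget exhausted")
        if time.monotonic() >= self.deadline:
            raise BudgetExceeded("time budget exhausted")
        self.max_steps -= 1


def call_with_backoff(fn, retries=3, base_delay=0.01, sleep=time.sleep):
    """Retry fn with exponential backoff; re-raise after the last attempt."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))


def run_agent(step_fn, budget: Budget):
    """Run step_fn until it returns a non-None result or the budget runs out."""
    while True:
        # Passing the same Budget object to nested agents propagates the limit.
        budget.spend_step()
        result = call_with_backoff(step_fn)
        if result is not None:
            return result
```

Sharing one `Budget` instance across sub-agents is one way to propagate a budget; a real system might also track token or cost budgets.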
Governance
- Tool policy presence and scope controls
- Input/output schema validation
- Tool result validation
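
The three governance checks can be pictured as a single wrapper around every tool call. The following sketch is an assumption-laden illustration: `ToolSpec`, `validate`, `call_tool`, and the scope strings are hypothetical names, and the "schema" is reduced to a mapping from field name to Python type.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    name: str
    input_schema: dict   # field name -> required Python type
    output_schema: dict  # field name -> required Python type
    scopes: frozenset = field(default_factory=frozenset)


def validate(payload: dict, schema: dict, what: str) -> None:
    """Reject missing, extra, or wrongly typed fields."""
    missing = schema.keys() - payload.keys()
    extra = payload.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"{what}: missing={sorted(missing)} extra={sorted(extra)}")
    for key, expected in schema.items():
        if not isinstance(payload[key], expected):
            raise TypeError(f"{what}: {key!r} must be {expected.__name__}")


def call_tool(spec: ToolSpec, impl, args: dict, granted_scopes: set) -> dict:
    """Check policy, validate input, run the tool, then validate its result."""
    if not spec.scopes <= granted_scopes:
        lacking = sorted(spec.scopes - granted_scopes)
        raise PermissionError(f"{spec.name}: missing scopes {lacking}")
    validate(args, spec.input_schema, f"{spec.name} input")
    result = impl(**args)
    validate(result, spec.output_schema, f"{spec.name} result")
    return result
```

In practice a schema library would replace the type mapping; the point of the sketch is only the order of checks: policy first, input second, result last.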
Safety
- Tool execution safety: sandbox, approval gates, human-in-the-loop, prompt-injection and PII defenses
- MCP safety: auth, OAuth binding, tool annotations, poisoning/rug-pull risks
- A2A safety: agent cards, task state machines, webhook auth, handoff guardrails
Coordination and Concurrency
- Routing and planner/executor coordination risks
- Fanout/deadlock/callback depth risks
- Instruction boundary and state isolation risks
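
To make the fanout and deadlock risks concrete, here is a small sketch of bounded fanout using Python's `asyncio`. The helper name `bounded_fanout` and its default limits are hypothetical; the sketch shows one common mitigation, not how the metric detects the risk.

```python
import asyncio


async def bounded_fanout(factories, max_concurrency=4, timeout=5.0):
    """Run coroutine factories with a concurrency cap and a per-task timeout."""
    gate = asyncio.Semaphore(max_concurrency)

    async def run_one(factory):
        async with gate:
            # A timeout turns a potential deadlock into a visible error.
            return await asyncio.wait_for(factory(), timeout)

    # return_exceptions=True keeps one failed sub-agent from cancelling the rest.
    return await asyncio.gather(
        *(run_one(f) for f in factories), return_exceptions=True
    )
```

Each element of `factories` is a zero-argument callable that returns a coroutine, for example `lambda: sub_agent(task)`, where `sub_agent` stands in for whatever the codebase uses.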
Composite Scores and Interpretation
| Metric | Range | Interpretation |
|---|---|---|
| `agent_architecture.agent_reliability_score` | 0..100 | Higher is better. Reliability: loop guards, memory, retries, observability, eval. |
| `agent_architecture.governance_readiness` | 0..100 | Higher is better. Governance: tool policy, schema validation, tool result validation. |
| `agent_architecture.safety_protocol_score` | 0..100 | Higher is better. Safety: tool execution, MCP, and A2A sub-scores. |
| `agent_architecture.coordination_maturity_score` | 0..100 | Higher is better. Coordination: routing, deadlock/fanout, state isolation. |
| `agent_architecture.overall_agent_health` | 0..1 | Weighted average of the four axes (0.25 each). |
| `agent_architecture.weakest_axis_min` | 0..1 | Minimum of the four axis scores (as 0–1); use for “worst axis” reporting. |
Practical reading:
- High reliability + low governance: stable behavior but unsafe controls.
- Low reliability + high governance: safer controls but fragile operations.
- High safety + low coordination: good protocol controls but coordination risks (e.g. deadlock, state leaks).
- High on all four: healthy, production-ready posture.
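
The two composite values can be reproduced by hand from the four axis scores. The sketch below assumes that each 0–100 axis score is divided by 100 before combining, which matches the 0..1 ranges in the table but is not stated explicitly; the function name `composite` is hypothetical.

```python
AXES = (
    "agent_reliability_score",
    "governance_readiness",
    "safety_protocol_score",
    "coordination_maturity_score",
)


def composite(scores: dict) -> dict:
    """Combine four 0-100 axis scores into the two 0-1 composite values."""
    # Assumption: axis scores are rescaled to 0-1 by dividing by 100.
    scaled = {axis: scores[axis] / 100.0 for axis in AXES}
    overall = sum(0.25 * value for value in scaled.values())
    weakest = min(scaled, key=scaled.get)
    return {
        "overall_agent_health": overall,
        "weakest_axis_min": scaled[weakest],
        "weakest_axis": weakest,
    }
```

For axis scores of 80, 60, 90, and 70 this gives an overall health of about 0.75 and a weakest-axis value of 0.6, so a healthy-looking average can still hide a weak governance axis.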
Config Quick Reference
    metrics:
      - id: agent_architecture
        enabled: true
        config:
          require_loop_guards: true
          require_tool_policy: true
          require_eval_harness: true
          languages: ["python", "typescript"]
          eval_path_patterns: ["tests/agents", "evals", "agent_specs"]
          scoring:
            governance_weights:
              tool_policy: 0.35
              schema_validation: 0.35
              tool_result_validation: 0.30
            reliability_weights:
              loop_guard: 0.17
              memory: 0.11
              retry: 0.11
              trace_linkage: 0.10
              runtime_slo: 0.08
              eval_maturity: 0.08
              trace_eval: 0.07
              cost_budget: 0.06
              checkpoint_durability: 0.06
              interrupt_resume: 0.04
              otel_semconv: 0.04
              otel_event_coverage: 0.04
              decision_observability: 0.04
              circuit_breaker: 0.06
              output_validation: 0.05
              hallucination_propagation: 0.04
            coordination_weights:
              coordination_risk: 0.143
              routing_pattern: 0.143
              instruction_boundary: 0.143
              deadlock: 0.143
              fanout_control: 0.143
              state_isolation: 0.143
              callback_depth: 0.143
            safety_protocol_weights:
              tool_execution: 0.40
              mcp: 0.30
              a2a: 0.30

Config Semantics
- `languages`: extension-based filter over detected agent call sites. Empty means no filter.
- `eval_path_patterns`: extra directories to scan for eval/trace harnesses in addition to the defaults.
- `scoring.governance_weights`, `scoring.reliability_weights`, `scoring.coordination_weights`: each group is normalized independently when its sum is `> 0`; if a group sums to `0`, defaults are restored.
- `scoring.safety_protocol_weights`: normalized across `tool_execution`, `mcp`, and `a2a`; if all are `0`, the defaults `0.40/0.30/0.30` are restored.
- `require_loop_guards`, `require_tool_policy`, `require_eval_harness`: when enabled, severity is escalated to `Critical` if the mapped risk score is `> 0.2`.
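
The normalization and escalation rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, and the base severity label `"Warning"` is made up because the page names only `Critical`.

```python
DEFAULT_SAFETY_WEIGHTS = {"tool_execution": 0.40, "mcp": 0.30, "a2a": 0.30}


def normalize(weights: dict, defaults: dict) -> dict:
    """Scale a weight group to sum to 1; fall back to defaults on a zero sum."""
    source = weights if sum(weights.values()) > 0 else defaults
    total = sum(source.values())
    return {name: value / total for name, value in source.items()}


def severity(risk_score: float, required: bool, base: str = "Warning") -> str:
    """Escalate to Critical only when the require_* flag is on and risk > 0.2."""
    # "Warning" is a placeholder base severity, not a documented value.
    return "Critical" if required and risk_score > 0.2 else base
```

Because each group is normalized, only the ratios between weights matter: `{"tool_execution": 2, "mcp": 1, "a2a": 1}` yields the same result as `0.50/0.25/0.25`.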
Policy Quick Start
    metrics:
      - id: agent_architecture

    policy:
      invariants:
        - metric: agent_architecture.loop_guard_absence
          op: "<="
          value: 0.20
          message: "Agent loops must have max-steps or timeout guards"
        - metric: agent_architecture.governance_readiness
          op: ">="
          value: 80
          message: "Tool governance and schema controls must be strong"
        - metric: agent_architecture.agent_reliability_score
          op: ">="
          value: 75
          message: "Agent reliability baseline not met"
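
Each invariant is a comparison between a reported metric value and a threshold. The sketch below shows how such a gate could be evaluated in a CI script. It is hypothetical: `check_invariants` is not part of the tool, only the `<=` and `>=` operators from the example are supported, and treating a missing metric as a failure is an assumption.

```python
import operator

# Only the two operators that appear in the policy example.
OPS = {"<=": operator.le, ">=": operator.ge}


def check_invariants(invariants: list, values: dict) -> list:
    """Return one failure message per violated or unevaluable invariant."""
    failures = []
    for inv in invariants:
        actual = values.get(inv["metric"])
        # Assumption: a metric that was not reported counts as a failure.
        if actual is None or not OPS[inv["op"]](actual, inv["value"]):
            failures.append(f'{inv["metric"]}: {inv["message"]} (actual={actual})')
    return failures
```

A CI step would exit with a non-zero status when the returned list is non-empty.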