Scoring and Keys

This page documents the exact scoring behavior and all emitted metric keys for agent_architecture.

agent_architecture.governance_readiness combines a weighted base axis and an additive extended axis:

base_governance =
  (1 - tool_policy_absence) * tool_policy_w +
  (1 - schema_validation_gap) * schema_w +
  (1 - tool_result_validation_gap) * tool_result_w

extended_governance =
  (
    (1 - mcp_auth_gap) +
    (1 - mcp_oauth_resource_binding_gap) +
    (1 - tool_sandbox_enforcement_gap) +
    (1 - tool_approval_bypass_risk) +
    (1 - guardrail_hook_absence) +
    (1 - a2a_webhook_auth_gap)
  ) / 6

governance_readiness =
  (base_governance * 0.65 + extended_governance * 0.35) * 100

The weights tool_policy_w, schema_w, and tool_result_w come from the following governance_weights keys and are normalized to sum to 1.0:

  • tool_policy
  • schema_validation
  • tool_result_validation

If all governance weights are configured to 0, defaults are used:

  • tool_policy = 0.40
  • schema_validation = 0.35
  • tool_result_validation = 0.25
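
The governance formula and weight handling above can be sketched in Python as follows. This is a minimal illustration, not the engine's implementation: the function names and the dictionary-based input are assumptions, and treating a missing weight key as 0 is a guess that the source does not confirm.

```python
# Defaults and formula constants are taken from the text above.
DEFAULT_GOVERNANCE_WEIGHTS = {
    "tool_policy": 0.40,
    "schema_validation": 0.35,
    "tool_result_validation": 0.25,
}

EXTENDED_GAPS = (
    "mcp_auth_gap",
    "mcp_oauth_resource_binding_gap",
    "tool_sandbox_enforcement_gap",
    "tool_approval_bypass_risk",
    "guardrail_hook_absence",
    "a2a_webhook_auth_gap",
)


def normalize_weights(configured, defaults):
    """Scale weights to sum to 1.0; use defaults when every weight is 0."""
    # Assumption: a key missing from the configuration counts as 0.
    raw = {name: float(configured.get(name, 0.0)) for name in defaults}
    if sum(raw.values()) == 0:
        raw = dict(defaults)
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def governance_readiness(metrics, governance_weights=None):
    """Return governance_readiness (0..100) from un-prefixed metric values."""
    w = normalize_weights(governance_weights or {}, DEFAULT_GOVERNANCE_WEIGHTS)
    base = (
        (1 - metrics["tool_policy_absence"]) * w["tool_policy"]
        + (1 - metrics["schema_validation_gap"]) * w["schema_validation"]
        + (1 - metrics["tool_result_validation_gap"]) * w["tool_result_validation"]
    )
    extended = sum(1 - metrics[key] for key in EXTENDED_GAPS) / len(EXTENDED_GAPS)
    return (base * 0.65 + extended * 0.35) * 100
```

With default weights, a project whose only problem is tool_policy_absence = 1 gets base_governance = 0.6 and extended_governance = 1.0, so governance_readiness = (0.6 * 0.65 + 1.0 * 0.35) * 100 = 74.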

agent_architecture.agent_reliability_score combines a weighted base axis and two additive reliability axes:

base_reliability =
  (1 - loop_guard_absence) * loop_guard_w +
  (1 - memory_unbounded) * memory_w +
  (1 - retry_storm_risk) * retry_w +
  (1 - agent_observability_gap) * observability_w +
  (1 - agent_eval_absence) * eval_w

durable_reliability =
  (1 - checkpoint_durability_gap) * 0.6 +
  (1 - interrupt_resume_contract_gap) * 0.4

observability_semconv =
  (1 - otel_genai_semconv_gap) * 0.6 +
  (1 - otel_genai_event_coverage_gap) * 0.4

agent_reliability_score =
  (base_reliability * 0.7 + durable_reliability * 0.15 + observability_semconv * 0.15) * 100

The weights loop_guard_w, memory_w, retry_w, observability_w, and eval_w come from the following reliability_weights keys and are normalized to sum to 1.0:

  • loop_guard
  • memory
  • retry
  • observability
  • eval

If all reliability weights are configured to 0, defaults are used:

  • loop_guard = 0.25
  • memory = 0.20
  • retry = 0.20
  • observability = 0.20
  • eval = 0.15
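
The reliability formula can be sketched the same way. As before, this is an illustrative Python reading of the formulas above rather than the engine's code; the function name, the BASE_TERMS mapping table, and the handling of missing weight keys (treated as 0) are assumptions.

```python
DEFAULT_RELIABILITY_WEIGHTS = {
    "loop_guard": 0.25,
    "memory": 0.20,
    "retry": 0.20,
    "observability": 0.20,
    "eval": 0.15,
}

# Weight key -> metric consumed by that term of base_reliability.
BASE_TERMS = {
    "loop_guard": "loop_guard_absence",
    "memory": "memory_unbounded",
    "retry": "retry_storm_risk",
    "observability": "agent_observability_gap",
    "eval": "agent_eval_absence",
}


def agent_reliability_score(metrics, reliability_weights=None):
    """Return agent_reliability_score (0..100) from un-prefixed metric values."""
    configured = reliability_weights or {}
    # Assumption: a key missing from the configuration counts as 0.
    raw = {name: float(configured.get(name, 0.0)) for name in BASE_TERMS}
    if sum(raw.values()) == 0:
        raw = dict(DEFAULT_RELIABILITY_WEIGHTS)
    total = sum(raw.values())

    base = sum(
        (1 - metrics[metric]) * raw[name] / total
        for name, metric in BASE_TERMS.items()
    )
    durable = (
        (1 - metrics["checkpoint_durability_gap"]) * 0.6
        + (1 - metrics["interrupt_resume_contract_gap"]) * 0.4
    )
    semconv = (
        (1 - metrics["otel_genai_semconv_gap"]) * 0.6
        + (1 - metrics["otel_genai_event_coverage_gap"]) * 0.4
    )
    return (base * 0.7 + durable * 0.15 + semconv * 0.15) * 100
```

For example, with default weights and every metric at 0 except checkpoint_durability_gap = 1, durable_reliability drops to 0.4 and the score is (1.0 * 0.7 + 0.4 * 0.15 + 1.0 * 0.15) * 100 = 91.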
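
The overall health value is the lower of the two composite scores, each rescaled to 0..1:

overall_agent_health = min(agent_reliability_score / 100, governance_readiness / 100)

It is published as agent_architecture.overall_agent_health, in the range 0..1.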
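
Because the value is a minimum rather than an average, a strong score on one axis cannot compensate for a weak score on the other. A small Python sketch of that behavior (the function name is illustrative, not an engine API):

```python
def overall_agent_health(agent_reliability_score, governance_readiness):
    """Weakest-link combination of the two 0..100 composites, returned in 0..1."""
    return min(agent_reliability_score / 100, governance_readiness / 100)


# High reliability does not offset weaker governance.
print(overall_agent_health(91.0, 74.0))  # prints 0.74
```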

agent_architecture starts from detected agent call sites. If no agent call sites are detected:

  • Risk/gap metrics are emitted as 0 with Good severity.
  • Composite outputs can appear fully healthy (agent_reliability_score = 100, governance_readiness = 100, overall_agent_health = 1.0).
  • Findings are usually absent because no detector evidence is produced.

Interpret this state as “not applicable/no detected agent orchestration”, not as proof that agent architecture is production-ready.
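
A consumer of these metrics might want to flag this state instead of reporting it as a pass. The sketch below is only a heuristic under stated assumptions: the source does not describe a dedicated "no call sites" signal, so the check infers the state from perfect composites plus absent findings. A project that genuinely scores perfectly would match too, so the result means "confirm manually", not "not applicable". The function name and input shapes are hypothetical.

```python
PREFIX = "agent_architecture."


def looks_like_no_agent_detected(metrics, findings=None):
    """Heuristic: perfect composites and no findings may mean nothing was detected.

    `metrics` maps full metric keys to values; `findings` is a list or None.
    """
    perfect = (
        metrics.get(PREFIX + "agent_reliability_score") == 100
        and metrics.get(PREFIX + "governance_readiness") == 100
        and metrics.get(PREFIX + "overall_agent_health") == 1.0
    )
    return perfect and not findings
```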

If a require_* flag is enabled and the score of its corresponding absence metric exceeds the threshold, that metric's severity is escalated to Critical.

Threshold behavior:

  • Escalation applies only when the metric score is > 0.2.

Flag-to-metric mapping:

  • require_loop_guards -> loop_guard_absence
  • require_tool_policy -> tool_policy_absence
  • require_eval_harness -> agent_eval_absence
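
The escalation rule can be expressed as a short Python sketch. The flag names, metric names, the 0.2 threshold, and the Critical severity come from the text above; the function name and the plain-dictionary inputs are assumptions made for illustration.

```python
REQUIRE_FLAG_TO_METRIC = {
    "require_loop_guards": "loop_guard_absence",
    "require_tool_policy": "tool_policy_absence",
    "require_eval_harness": "agent_eval_absence",
}

ESCALATION_THRESHOLD = 0.2


def escalate_severities(severities, scores, config):
    """Return a copy of `severities` with required-but-absent metrics set to Critical.

    severities: metric name -> severity label assigned by the detector
    scores:     metric name -> score in 0..1
    config:     flag name -> bool
    """
    escalated = dict(severities)
    for flag, metric in REQUIRE_FLAG_TO_METRIC.items():
        # Strictly greater than the threshold: a score of exactly 0.2 is not escalated.
        if config.get(flag, False) and scores.get(metric, 0.0) > ESCALATION_THRESHOLD:
            escalated[metric] = "Critical"
    return escalated
```

Note that escalation leaves the metric score itself unchanged; only the severity label changes in this sketch.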

| Metric Key | Range / Type | Direction |
| --- | --- | --- |
| agent_architecture.loop_guard_absence | 0..1 | Lower is better |
| agent_architecture.effective_step_budget_ratio | 0..1 | Higher is better |
| agent_architecture.budget_propagation_coverage | 0..1 | Higher is better |
| agent_architecture.max_nested_loop_depth | Number | Lower is better |
| agent_architecture.memory_unbounded | 0..1 | Lower is better |
| agent_architecture.context_memory_limits_score | 0..1 | Higher is better |
| agent_architecture.tool_state_limits_score | 0..1 | Higher is better |
| agent_architecture.long_term_memory_retention_score | 0..1 | Higher is better |
| agent_architecture.tool_policy_absence | 0..1 | Lower is better |
| agent_architecture.dangerous_tool_count | Number | Lower is better |
| agent_architecture.scoped_tool_ratio | 0..1 | Higher is better |
| agent_architecture.schema_validation_gap | 0..1 | Lower is better |
| agent_architecture.tool_input_schema_coverage | 0..1 | Higher is better |
| agent_architecture.tool_output_schema_coverage | 0..1 | Higher is better |
| agent_architecture.retry_storm_risk | 0..1 | Lower is better |
| agent_architecture.agent_observability_gap | 0..1 | Lower is better |
| agent_architecture.step_trace_completeness_score | 0..1 | Higher is better |
| agent_architecture.agent_eval_absence | 0..1 | Lower is better |
| agent_architecture.trajectory_eval_coverage | 0..1 | Higher is better |
| agent_architecture.adversarial_eval_present | 0 or 1 | Higher is better |
| agent_architecture.stochastic_runs_present | 0 or 1 | Higher is better |
| agent_architecture.agent_reliability_score | 0..100 | Higher is better |
| agent_architecture.governance_readiness | 0..100 | Higher is better |
| agent_architecture.overall_agent_health | 0..1 | Higher is better |
| agent_architecture.human_approval_absence | 0..1 | Lower is better |
| agent_architecture.idempotency_gap | 0..1 | Lower is better |
| agent_architecture.fanout_control_absence | 0..1 | Lower is better |
| agent_architecture.state_isolation_risk | 0..1 | Lower is better |
| agent_architecture.instruction_boundary_violation | 0..1 | Lower is better |
| agent_architecture.deadlock_risk | 0..1 | Lower is better |
| agent_architecture.routing_pattern_risk | 0..1 | Lower is better |
| agent_architecture.coordination_risk | 0..1 | Lower is better |
| agent_architecture.callback_depth_risk | 0..1 | Lower is better |
| agent_architecture.tool_result_validation_gap | 0..1 | Lower is better |
| agent_architecture.handoff_input_filter_gap | 0..1 | Lower is better |
| agent_architecture.guardrail_hook_absence | 0..1 | Lower is better |
| agent_architecture.checkpoint_durability_gap | 0..1 | Lower is better |
| agent_architecture.interrupt_resume_contract_gap | 0..1 | Lower is better |
| agent_architecture.otel_genai_semconv_gap | 0..1 | Lower is better |
| agent_architecture.otel_genai_event_coverage_gap | 0..1 | Lower is better |
| agent_architecture.agent_shell_capable | 0..1 | Lower is better |
| agent_architecture.agent_files_with_process | Number | Lower is better |
| agent_architecture.agent_tools_external_api | 0..1 | Lower is better |
| agent_architecture.agent_files_with_external_api | Number | Lower is better |
| agent_architecture.mcp_auth_gap | 0..1 | Lower is better |
| agent_architecture.mcp_oauth_resource_binding_gap | 0..1 | Lower is better |
| agent_architecture.mcp_tool_annotation_gap | 0..1 | Lower is better |
| agent_architecture.mcp_structured_output_gap | 0..1 | Lower is better |
| agent_architecture.tool_sandbox_enforcement_gap | 0..1 | Lower is better |
| agent_architecture.tool_approval_bypass_risk | 0..1 | Lower is better |
| agent_architecture.untrusted_tool_output_boundary_gap | 0..1 | Lower is better |
| agent_architecture.trace_eval_coverage | 0..1 | Higher is better |
| agent_architecture.trace_eval_regression_risk | 0..1 | Lower is better |
| agent_architecture.a2a_agent_card_gap | 0..1 | Lower is better |
| agent_architecture.a2a_task_state_machine_gap | 0..1 | Lower is better |
| agent_architecture.a2a_webhook_auth_gap | 0..1 | Lower is better |
| agent_architecture.handoff_cycle_risk | 0..1 | Lower is better |
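
Because the keys mix "lower is better" and "higher is better" directions, any comparison between two runs has to be direction-aware. The sketch below shows one way a consumer could do that in Python. The set of higher-is-better names is copied from the Direction column; the function name and the dictionary inputs are hypothetical and not part of any Arxo API.

```python
PREFIX = "agent_architecture."

# Un-prefixed keys whose Direction is "Higher is better"; all others are lower-is-better.
HIGHER_IS_BETTER = {
    "effective_step_budget_ratio",
    "budget_propagation_coverage",
    "context_memory_limits_score",
    "tool_state_limits_score",
    "long_term_memory_retention_score",
    "scoped_tool_ratio",
    "tool_input_schema_coverage",
    "tool_output_schema_coverage",
    "step_trace_completeness_score",
    "trajectory_eval_coverage",
    "adversarial_eval_present",
    "stochastic_runs_present",
    "agent_reliability_score",
    "governance_readiness",
    "overall_agent_health",
    "trace_eval_coverage",
}


def regressions(before, after):
    """Return the sorted keys present in both runs whose value got worse."""
    worse = []
    for key in sorted(before.keys() & after.keys()):
        name = key[len(PREFIX):] if key.startswith(PREFIX) else key
        delta = after[key] - before[key]
        if name in HIGHER_IS_BETTER:
            got_worse = delta < 0
        else:
            got_worse = delta > 0
        if got_worse:
            worse.append(key)
    return worse
```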

agent_architecture emits SARIF-style findings by converting detector evidence into Finding records.

  • Findings are created from metric evidence in metrics_to_findings.
  • One finding is emitted per evidence item.
  • severity is inherited from the detector MetricScore.
  • rule_id uses stable arxo/agent-* identifiers (for example, arxo/agent-loop-guard-absence, arxo/agent-mcp-auth-gap).
  • If line is known, evidence is emitted as Evidence::CodeSpan(path, line, message).
  • If line is missing, evidence is emitted as Evidence::Text("path: reason").
  • Findings are emitted only when evidence exists; when no findings exist, MetricResult.findings is omitted.
  • Output order is deterministic (sorted by rule_id, title, and location).
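
The conversion rules above can be modeled with a small Python sketch. It is not the engine's code: the source uses Rust-style names such as Evidence::CodeSpan, and the dataclasses, tuple encodings, and build_result helper here are hypothetical stand-ins. The Python metrics_to_findings function only mirrors the documented behavior of the step with the same name.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EvidenceItem:
    path: str
    message: str
    line: Optional[int] = None  # None when the detector has no line number


@dataclass
class MetricScore:
    rule_id: str   # e.g. "arxo/agent-loop-guard-absence"
    title: str
    severity: str
    evidence: List[EvidenceItem] = field(default_factory=list)


@dataclass(frozen=True)
class Finding:
    rule_id: str
    title: str
    severity: str  # inherited from the MetricScore
    path: str
    line: Optional[int]
    evidence: Tuple  # ("CodeSpan", path, line, message) or ("Text", "path: reason")


def metrics_to_findings(metric_scores):
    """Emit one Finding per evidence item, in a deterministic order."""
    findings = []
    for score in metric_scores:
        for item in score.evidence:
            if item.line is not None:
                evidence = ("CodeSpan", item.path, item.line, item.message)
            else:
                evidence = ("Text", f"{item.path}: {item.message}")
            findings.append(
                Finding(score.rule_id, score.title, score.severity,
                        item.path, item.line, evidence)
            )
    # Sort by rule_id, title, then location; a missing line sorts first.
    findings.sort(key=lambda f: (f.rule_id, f.title, f.path,
                                 -1 if f.line is None else f.line))
    return findings


def build_result(metric_scores):
    """Omit the findings field entirely when there is nothing to report."""
    result = {}
    findings = metrics_to_findings(metric_scores)
    if findings:
        result["findings"] = findings
    return result
```

Sorting after collection, rather than relying on detector order, is what makes the output stable across runs in this sketch.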

agent_architecture is currently documented against engine metric version 3.0.0.

Breaking note in 3.0.0:

  • Renamed A2A keys:
    • agent_architecture.agent_card_absence -> agent_architecture.a2a_agent_card_gap
    • agent_architecture.task_lifecycle_contract_gap -> agent_architecture.a2a_task_state_machine_gap

Historical note:

  • Legacy compatibility aliases are not emitted in current releases:
    • agent_architecture.reliability_score
    • agent_architecture.control_plane_score
  • Use these keys instead:
    • agent_architecture.agent_reliability_score
    • agent_architecture.governance_readiness
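
Consumers that still store or query the old key names may need a one-time migration. The following Python sketch shows one possible approach. The two A2A renames are stated explicitly above; pairing reliability_score with agent_reliability_score and control_plane_score with governance_readiness is inferred from the order of the two lists and should be treated as an assumption. The function name is hypothetical.

```python
PREFIX = "agent_architecture."

RENAMED_KEYS = {
    # Renamed in 3.0.0 (stated in the breaking note).
    PREFIX + "agent_card_absence": PREFIX + "a2a_agent_card_gap",
    PREFIX + "task_lifecycle_contract_gap": PREFIX + "a2a_task_state_machine_gap",
    # Legacy aliases (pairing inferred from list order; an assumption).
    PREFIX + "reliability_score": PREFIX + "agent_reliability_score",
    PREFIX + "control_plane_score": PREFIX + "governance_readiness",
}


def migrate_keys(metrics):
    """Return a copy of `metrics` with old key names replaced by current ones."""
    migrated = {}
    for key, value in metrics.items():
        new_key = RENAMED_KEYS.get(key, key)
        # If the current name is already present, keep its value and drop the old one.
        if new_key != key and new_key in metrics:
            continue
        migrated[new_key] = value
    return migrated
```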