Scoring and Keys
This page documents the exact scoring behavior and all emitted metric keys for agent_architecture.
Scoring Formulas
Governance Axis
agent_architecture.governance_readiness combines a weighted base axis and an additive extended axis:
base_governance = ( (1 - tool_policy_absence) * tool_policy_w + (1 - schema_validation_gap) * schema_w + (1 - tool_result_validation_gap) * tool_result_w )
extended_governance = ( (1 - mcp_auth_gap) + (1 - mcp_oauth_resource_binding_gap) + (1 - tool_sandbox_enforcement_gap) + (1 - tool_approval_bypass_risk) + (1 - guardrail_hook_absence) + (1 - a2a_webhook_auth_gap) ) / 6
governance_readiness = (base_governance * 0.65 + extended_governance * 0.35) * 100
Weights come from governance_weights and are normalized to sum to 1.0:
- tool_policy
- schema_validation
- tool_result_validation
If all governance weights are configured to 0, defaults are used:
- tool_policy = 0.40
- schema_validation = 0.35
- tool_result_validation = 0.25
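The governance formulas above can be sketched in Python as follows. This is a minimal illustration, not the engine's implementation: the function names, the dictionary-based inputs, and the use of unprefixed metric names are assumptions made for the example. Only the formulas, the 0.65/0.35 blend, and the default weights come from this page.

```python
# Default weights quoted from this page; used when every configured weight is 0.
DEFAULT_GOVERNANCE_WEIGHTS = {
    "tool_policy": 0.40,
    "schema_validation": 0.35,
    "tool_result_validation": 0.25,
}

# The six gap/risk metrics averaged by the extended axis.
EXTENDED_GOVERNANCE_METRICS = [
    "mcp_auth_gap",
    "mcp_oauth_resource_binding_gap",
    "tool_sandbox_enforcement_gap",
    "tool_approval_bypass_risk",
    "guardrail_hook_absence",
    "a2a_webhook_auth_gap",
]


def normalize_weights(configured, defaults):
    """Scale weights so they sum to 1.0; fall back to defaults if all are 0."""
    total = sum(configured.values())
    if total == 0:
        return dict(defaults)
    return {name: value / total for name, value in configured.items()}


def governance_readiness(metrics, configured_weights):
    """Hypothetical helper: returns the 0..100 governance score."""
    w = normalize_weights(configured_weights, DEFAULT_GOVERNANCE_WEIGHTS)
    base = (
        (1 - metrics["tool_policy_absence"]) * w["tool_policy"]
        + (1 - metrics["schema_validation_gap"]) * w["schema_validation"]
        + (1 - metrics["tool_result_validation_gap"]) * w["tool_result_validation"]
    )
    extended = sum(1 - metrics[name] for name in EXTENDED_GOVERNANCE_METRICS) / 6
    return (base * 0.65 + extended * 0.35) * 100
```

With default weights, a project whose only problem is tool_policy_absence = 1.0 gets base_governance = 0.60 and extended_governance = 1.0, so the score is (0.60 * 0.65 + 1.0 * 0.35) * 100 = 74.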
Reliability Axis
agent_architecture.agent_reliability_score combines a weighted base axis and two additive reliability axes:
base_reliability = ( (1 - loop_guard_absence) * loop_guard_w + (1 - memory_unbounded) * memory_w + (1 - retry_storm_risk) * retry_w + (1 - agent_observability_gap) * observability_w + (1 - agent_eval_absence) * eval_w )
durable_reliability = ( (1 - checkpoint_durability_gap) * 0.6 + (1 - interrupt_resume_contract_gap) * 0.4 )
observability_semconv = ( (1 - otel_genai_semconv_gap) * 0.6 + (1 - otel_genai_event_coverage_gap) * 0.4 )
agent_reliability_score = (base_reliability * 0.7 + durable_reliability * 0.15 + observability_semconv * 0.15) * 100
Weights come from reliability_weights and are normalized to sum to 1.0:
- loop_guard
- memory
- retry
- observability
- eval
If all reliability weights are configured to 0, defaults are used:
- loop_guard = 0.25
- memory = 0.20
- retry = 0.20
- observability = 0.20
- eval = 0.15
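The reliability axis follows the same pattern. The sketch below is illustrative only: the function name, the weight-to-metric lookup table, and the unprefixed metric names are assumptions for the example. The formulas, blend factors, and defaults are taken from this page.

```python
# Default weights quoted from this page; used when every configured weight is 0.
DEFAULT_RELIABILITY_WEIGHTS = {
    "loop_guard": 0.25,
    "memory": 0.20,
    "retry": 0.20,
    "observability": 0.20,
    "eval": 0.15,
}

# Which risk metric each weight applies to in base_reliability.
WEIGHT_TO_METRIC = {
    "loop_guard": "loop_guard_absence",
    "memory": "memory_unbounded",
    "retry": "retry_storm_risk",
    "observability": "agent_observability_gap",
    "eval": "agent_eval_absence",
}


def agent_reliability_score(metrics, configured_weights):
    """Hypothetical helper: returns the 0..100 reliability score."""
    total = sum(configured_weights.values())
    if total == 0:
        weights = dict(DEFAULT_RELIABILITY_WEIGHTS)
    else:
        weights = {name: value / total for name, value in configured_weights.items()}

    base = sum(
        (1 - metrics[WEIGHT_TO_METRIC[name]]) * weight
        for name, weight in weights.items()
    )
    durable = (
        (1 - metrics["checkpoint_durability_gap"]) * 0.6
        + (1 - metrics["interrupt_resume_contract_gap"]) * 0.4
    )
    semconv = (
        (1 - metrics["otel_genai_semconv_gap"]) * 0.6
        + (1 - metrics["otel_genai_event_coverage_gap"]) * 0.4
    )
    return (base * 0.7 + durable * 0.15 + semconv * 0.15) * 100
```

With default weights and loop_guard_absence = 0.5 as the only non-zero risk, base_reliability is 0.875 and the score is (0.875 * 0.7 + 0.15 + 0.15) * 100 = 91.25.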
Overall Health
overall_agent_health = min(agent_reliability_score / 100, governance_readiness / 100)
Published as agent_architecture.overall_agent_health in 0..1.
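Because the overall value is a minimum rather than an average, the weaker axis alone determines it. A small sketch (the function name is hypothetical):

```python
def overall_agent_health(agent_reliability_score, governance_readiness):
    """Both inputs are 0..100 scores; the result is in 0..1."""
    return min(agent_reliability_score / 100, governance_readiness / 100)


# A strong reliability score cannot compensate for weak governance.
health = overall_agent_health(91.25, 74.0)
```

Here health is 0.74, the governance value, even though reliability is above 0.9.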
No Agent Sites Detected (N/A Semantics)
agent_architecture starts from detected agent call sites. If no agent call sites are detected:
- Risk/gap metrics are emitted as 0 with Good severity.
- Composite outputs can appear fully healthy (agent_reliability_score = 100, governance_readiness = 100, overall_agent_health = 1.0).
- Findings are usually absent because no detector evidence is produced.
Interpret this state as “not applicable/no detected agent orchestration”, not as proof that agent architecture is production-ready.
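A consumer that gates on these scores may want to flag this shape explicitly instead of treating it as a pass. The following is a heuristic sketch only: the function name and its inputs are hypothetical, and a genuinely clean codebase with agent call sites could produce the same values, so the result should be read as "possibly not applicable".

```python
# Composite keys and the values they take when no agent sites are detected.
COMPOSITE_MAX = {
    "agent_architecture.agent_reliability_score": 100,
    "agent_architecture.governance_readiness": 100,
    "agent_architecture.overall_agent_health": 1.0,
}


def looks_not_applicable(metrics, risk_keys, findings):
    """Heuristic: True when the output matches the 'no agent sites' shape.

    metrics   -- dict of metric key -> value
    risk_keys -- the risk/gap keys the caller wants to inspect
    findings  -- list of findings (empty or None when omitted)
    """
    composites_maxed = all(
        metrics.get(key) == value for key, value in COMPOSITE_MAX.items()
    )
    risks_zero = all(metrics.get(key, 0) == 0 for key in risk_keys)
    return composites_maxed and risks_zero and not findings
```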
Require-Policy Severity Escalation
If a require_* flag is enabled and the corresponding absence risk is high, severity is escalated to Critical.
Threshold behavior:
- Escalate only when the mapped absence score is > 0.2.
- require_loop_guards -> loop_guard_absence
- require_tool_policy -> tool_policy_absence
- require_eval_harness -> agent_eval_absence
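The escalation rule can be sketched as a lookup plus a strict comparison. The function name and the dictionary inputs are assumptions for illustration. The flag-to-metric mapping and the 0.2 threshold come from this page.

```python
# Flag -> absence metric, as listed on this page.
REQUIRE_FLAG_TO_METRIC = {
    "require_loop_guards": "loop_guard_absence",
    "require_tool_policy": "tool_policy_absence",
    "require_eval_harness": "agent_eval_absence",
}

ESCALATION_THRESHOLD = 0.2


def escalated_metrics(flags, metrics):
    """Return the metrics whose severity would be escalated to Critical.

    flags   -- dict of require_* flag name -> bool
    metrics -- dict of absence metric name -> score in 0..1
    """
    return [
        metric
        for flag, metric in REQUIRE_FLAG_TO_METRIC.items()
        # Strictly greater than the threshold: a score of exactly 0.2 is not escalated.
        if flags.get(flag, False) and metrics.get(metric, 0.0) > ESCALATION_THRESHOLD
    ]
```

For example, with require_tool_policy enabled and tool_policy_absence = 0.5, only tool_policy_absence is escalated. A high loop_guard_absence is left at its detector severity unless require_loop_guards is also enabled.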
Metric Key Contract (Emitted Keys)
| Metric Key | Range / Type | Direction |
|---|---|---|
| agent_architecture.loop_guard_absence | 0..1 | Lower is better |
| agent_architecture.effective_step_budget_ratio | 0..1 | Higher is better |
| agent_architecture.budget_propagation_coverage | 0..1 | Higher is better |
| agent_architecture.max_nested_loop_depth | Number | Lower is better |
| agent_architecture.memory_unbounded | 0..1 | Lower is better |
| agent_architecture.context_memory_limits_score | 0..1 | Higher is better |
| agent_architecture.tool_state_limits_score | 0..1 | Higher is better |
| agent_architecture.long_term_memory_retention_score | 0..1 | Higher is better |
| agent_architecture.tool_policy_absence | 0..1 | Lower is better |
| agent_architecture.dangerous_tool_count | Number | Lower is better |
| agent_architecture.scoped_tool_ratio | 0..1 | Higher is better |
| agent_architecture.schema_validation_gap | 0..1 | Lower is better |
| agent_architecture.tool_input_schema_coverage | 0..1 | Higher is better |
| agent_architecture.tool_output_schema_coverage | 0..1 | Higher is better |
| agent_architecture.retry_storm_risk | 0..1 | Lower is better |
| agent_architecture.agent_observability_gap | 0..1 | Lower is better |
| agent_architecture.step_trace_completeness_score | 0..1 | Higher is better |
| agent_architecture.agent_eval_absence | 0..1 | Lower is better |
| agent_architecture.trajectory_eval_coverage | 0..1 | Higher is better |
| agent_architecture.adversarial_eval_present | 0 or 1 | Higher is better |
| agent_architecture.stochastic_runs_present | 0 or 1 | Higher is better |
| agent_architecture.agent_reliability_score | 0..100 | Higher is better |
| agent_architecture.governance_readiness | 0..100 | Higher is better |
| agent_architecture.overall_agent_health | 0..1 | Higher is better |
| agent_architecture.human_approval_absence | 0..1 | Lower is better |
| agent_architecture.idempotency_gap | 0..1 | Lower is better |
| agent_architecture.fanout_control_absence | 0..1 | Lower is better |
| agent_architecture.state_isolation_risk | 0..1 | Lower is better |
| agent_architecture.instruction_boundary_violation | 0..1 | Lower is better |
| agent_architecture.deadlock_risk | 0..1 | Lower is better |
| agent_architecture.routing_pattern_risk | 0..1 | Lower is better |
| agent_architecture.coordination_risk | 0..1 | Lower is better |
| agent_architecture.callback_depth_risk | 0..1 | Lower is better |
| agent_architecture.tool_result_validation_gap | 0..1 | Lower is better |
| agent_architecture.handoff_input_filter_gap | 0..1 | Lower is better |
| agent_architecture.guardrail_hook_absence | 0..1 | Lower is better |
| agent_architecture.checkpoint_durability_gap | 0..1 | Lower is better |
| agent_architecture.interrupt_resume_contract_gap | 0..1 | Lower is better |
| agent_architecture.otel_genai_semconv_gap | 0..1 | Lower is better |
| agent_architecture.otel_genai_event_coverage_gap | 0..1 | Lower is better |
| agent_architecture.agent_shell_capable | 0..1 | Lower is better |
| agent_architecture.agent_files_with_process | Number | Lower is better |
| agent_architecture.agent_tools_external_api | 0..1 | Lower is better |
| agent_architecture.agent_files_with_external_api | Number | Lower is better |
| agent_architecture.mcp_auth_gap | 0..1 | Lower is better |
| agent_architecture.mcp_oauth_resource_binding_gap | 0..1 | Lower is better |
| agent_architecture.mcp_tool_annotation_gap | 0..1 | Lower is better |
| agent_architecture.mcp_structured_output_gap | 0..1 | Lower is better |
| agent_architecture.tool_sandbox_enforcement_gap | 0..1 | Lower is better |
| agent_architecture.tool_approval_bypass_risk | 0..1 | Lower is better |
| agent_architecture.untrusted_tool_output_boundary_gap | 0..1 | Lower is better |
| agent_architecture.trace_eval_coverage | 0..1 | Higher is better |
| agent_architecture.trace_eval_regression_risk | 0..1 | Lower is better |
| agent_architecture.a2a_agent_card_gap | 0..1 | Lower is better |
| agent_architecture.a2a_task_state_machine_gap | 0..1 | Lower is better |
| agent_architecture.a2a_webhook_auth_gap | 0..1 | Lower is better |
| agent_architecture.handoff_cycle_risk | 0..1 | Lower is better |
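Consumers can use the range column as a sanity check on ingested results. The sketch below covers only a handful of keys from the table. The function name and the choice to ignore unknown keys are assumptions for the example, not documented engine behavior.

```python
# A small subset of the contract above: key -> (min, max), both inclusive.
KEY_RANGES = {
    "agent_architecture.loop_guard_absence": (0.0, 1.0),
    "agent_architecture.tool_policy_absence": (0.0, 1.0),
    "agent_architecture.agent_reliability_score": (0.0, 100.0),
    "agent_architecture.governance_readiness": (0.0, 100.0),
    "agent_architecture.overall_agent_health": (0.0, 1.0),
}


def out_of_range_keys(metrics):
    """Return the known keys whose values fall outside the documented range."""
    violations = []
    for key, value in metrics.items():
        if key not in KEY_RANGES:
            continue  # keys outside this subset are not checked here
        low, high = KEY_RANGES[key]
        if not (low <= value <= high):
            violations.append(key)
    return sorted(violations)
```

A common mistake this catches is treating overall_agent_health as a 0..100 value like the two composite scores it is derived from.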
Findings Contract
agent_architecture emits SARIF-style findings by converting detector evidence into Finding records.
- Findings are created from metric evidence in metrics_to_findings.
- One finding is emitted per evidence item.
- severity is inherited from the detector MetricScore.
- rule_id uses stable arxo/agent-* identifiers (for example, arxo/agent-loop-guard-absence, arxo/agent-mcp-auth-gap).
- If line is known, evidence is emitted as Evidence::CodeSpan(path, line, message).
- If line is missing, evidence is emitted as Evidence::Text("path: reason").
- Findings are emitted only when evidence exists; when no findings exist, MetricResult.findings is omitted.
- Output order is deterministic (sorted by rule_id, title, and location).
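The evidence shape and ordering rules above can be modeled for consumer-side tests as follows. This Python sketch is not the engine's data model: the Finding dataclass, the tuple encoding of the two evidence variants, and the choice to sort line-less findings before those with a line are all assumptions. The page states only that ordering is by rule_id, title, and location.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """Hypothetical stand-in for one finding built from one evidence item."""
    rule_id: str
    title: str
    path: str
    line: Optional[int]
    message: str

    def evidence(self):
        # Known line -> CodeSpan(path, line, message); otherwise Text("path: reason").
        if self.line is not None:
            return ("CodeSpan", self.path, self.line, self.message)
        return ("Text", f"{self.path}: {self.message}")


def sort_findings(findings):
    """Deterministic order: rule_id, then title, then location (path, line)."""
    return sorted(
        findings,
        key=lambda f: (
            f.rule_id,
            f.title,
            f.path,
            f.line if f.line is not None else -1,  # assumed placement for missing lines
        ),
    )
```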
Version Compatibility Note
agent_architecture is currently documented against engine metric version 3.0.0.
Breaking note in 3.0.0:
- Renamed A2A keys:
  - agent_architecture.agent_card_absence -> agent_architecture.a2a_agent_card_gap
  - agent_architecture.task_lifecycle_contract_gap -> agent_architecture.a2a_task_state_machine_gap
Historical note:
- Legacy compatibility aliases are not emitted in current releases:
  - agent_architecture.reliability_score
  - agent_architecture.control_plane_score
- Use:
  - agent_architecture.agent_reliability_score
  - agent_architecture.governance_readiness
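Dashboards or stored baselines that still reference pre-3.0.0 keys can be migrated with a small mapping. The sketch below is a hypothetical consumer-side helper. It remaps only the two renames stated above and reports the legacy aliases rather than remapping them, because this page does not state that their values are equivalent to the replacement keys.

```python
# Renames stated in the 3.0.0 breaking note.
RENAMED_IN_3_0_0 = {
    "agent_architecture.agent_card_absence":
        "agent_architecture.a2a_agent_card_gap",
    "agent_architecture.task_lifecycle_contract_gap":
        "agent_architecture.a2a_task_state_machine_gap",
}

# Aliases that current releases no longer emit.
LEGACY_ALIASES = {
    "agent_architecture.reliability_score",
    "agent_architecture.control_plane_score",
}


def migrate_metric_keys(metrics):
    """Return (migrated_metrics, legacy_keys_seen).

    Renamed keys are rewritten to their 3.0.0 names. Legacy aliases are
    dropped from the result and reported so the caller can switch to
    agent_reliability_score / governance_readiness deliberately.
    """
    migrated = {}
    legacy_seen = []
    for key, value in metrics.items():
        if key in LEGACY_ALIASES:
            legacy_seen.append(key)
            continue
        migrated[RENAMED_IN_3_0_0.get(key, key)] = value
    return migrated, sorted(legacy_seen)
```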