Scoring and Keys
This page documents the exact scoring behavior and emitted metric key contract for agent_architecture.
Scoring Formulas
Governance Axis
agent_architecture.governance_readiness (0–100) is the weighted complement of the three governance metrics:
- tool_policy_absence
- schema_validation_gap
- tool_result_validation_gap
Weights come from config.scoring.governance_weights (e.g. tool_policy, schema_validation, tool_result_validation) and are normalized to sum to 1.0. Defaults: tool_policy = 0.35, schema_validation = 0.35, tool_result_validation = 0.30. MCP, A2A, and tool-execution safety are in the Safety axis, not Governance.
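The governance calculation can be sketched as follows. This is a minimal illustration, not the engine's implementation: it assumes "weighted complement" means the sum of weight × (1 − gap) with weights normalized to sum to 1.0, scaled to 0–100. The function name and the WEIGHT_TO_METRIC mapping are hypothetical.

```python
# Sketch only: assumes "weighted complement" = sum(w_i * (1 - gap_i)),
# with weights normalized to sum to 1.0 and the result scaled to 0..100.

DEFAULT_GOVERNANCE_WEIGHTS = {
    "tool_policy": 0.35,
    "schema_validation": 0.35,
    "tool_result_validation": 0.30,
}

# Hypothetical mapping from weight name to the gap metric it applies to.
WEIGHT_TO_METRIC = {
    "tool_policy": "tool_policy_absence",
    "schema_validation": "schema_validation_gap",
    "tool_result_validation": "tool_result_validation_gap",
}


def governance_readiness(gaps, weights=None):
    """Return a 0..100 readiness score from 0..1 gap metrics."""
    weights = dict(weights or DEFAULT_GOVERNANCE_WEIGHTS)
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("governance weights must sum to a positive value")
    score = 0.0
    for name, weight in weights.items():
        # Clamp each gap to 0..1 before taking its complement.
        gap = min(max(gaps[WEIGHT_TO_METRIC[name]], 0.0), 1.0)
        score += (weight / total) * (1.0 - gap)
    return 100.0 * score
```

Because the weights are normalized, passing 7, 7, and 6 gives the same result as the defaults 0.35, 0.35, and 0.30.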
Reliability Axis
agent_architecture.agent_reliability_score (0–100) is the weighted complement of reliability gap metrics, including:
- Loop guards, memory bounds, retry storm risk
- Observability: trace_linkage_gap, runtime_slo_coverage_gap
- Eval: agent_eval_maturity_gap (graded eval maturity)
- Cost budget, trace eval regression, checkpoint durability, interrupt/resume, OTel GenAI semconv and event coverage, decision observability, circuit breaker absence, output validation, hallucination propagation
Weights are configurable via config.scoring.reliability_weights and normalized per the engine.
Note: context_memory_limits_score, tool_state_limits_score, and long_term_memory_retention_score are emitted scalar outputs, but they are not directly weighted into composite axes.
Safety Axis
agent_architecture.safety_protocol_score (0–100) combines three sub-scores:
- Tool execution: sandbox, approval bypass, human approval, shell/external API, untrusted output boundary, prompt injection, sensitive data, idempotency, memory poisoning, code execution sandbox, goal integrity
- MCP: auth, OAuth resource binding, tool annotations, structured output, tool poisoning, rug pull, supply chain provenance, credential scoping, shadow server risk
- A2A: agent card, task state machine, webhook auth, handoff cycle, handoff input filter, guardrail hooks
Formula: safety_protocol_score = 100 * (0.40 * tool_execution + 0.30 * mcp + 0.30 * a2a) (each sub-score is the complement average of its gap metrics).
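As a worked sketch of the formula above: each sub-score is taken here as the mean of (1 − gap) over its gap metrics, which is one reading of "complement average". The helper names are hypothetical, and treating an empty metric list as fully healthy is an assumption rather than documented behavior.

```python
def complement_average(gaps):
    """Mean of (1 - gap) over a sub-score's gap metrics, in 0..1."""
    if not gaps:
        # Assumption: with no gap metrics there is nothing to penalize.
        return 1.0
    return sum(1.0 - g for g in gaps) / len(gaps)


def safety_protocol_score(tool_execution_gaps, mcp_gaps, a2a_gaps):
    """Combine the three sub-scores with the 0.40 / 0.30 / 0.30 weights."""
    tool_execution = complement_average(tool_execution_gaps)
    mcp = complement_average(mcp_gaps)
    a2a = complement_average(a2a_gaps)
    return 100.0 * (0.40 * tool_execution + 0.30 * mcp + 0.30 * a2a)
```

For example, tool-execution gaps of 0.0 and 0.5, a single MCP gap of 1.0, and a single A2A gap of 0.0 give sub-scores of 0.75, 0.0, and 1.0, so the score is 100 × (0.30 + 0.00 + 0.30) = 60.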
Coordination Axis
agent_architecture.coordination_maturity_score (0–100) is the weighted complement of coordination risk metrics: coordination_risk, routing_pattern_risk, instruction_boundary_violation, deadlock_risk, fanout_control_absence, state_isolation_risk, callback_depth_risk.
Overall Health
overall_agent_health = 0.25 * (agent_reliability_score + governance_readiness + safety_protocol_score + coordination_maturity_score) / 100
Published as agent_architecture.overall_agent_health in 0..1.
agent_architecture.weakest_axis_min is the minimum of the four axis scores (as 0–1); use it for “worst axis” reporting.
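The two roll-up values can be sketched like this. Inputs are the four axis scores in 0–100; the function names simply mirror the metric keys and are not taken from the implementation.

```python
def overall_agent_health(reliability, governance, safety, coordination):
    """Unweighted mean of the four 0..100 axis scores, rescaled to 0..1."""
    return 0.25 * (reliability + governance + safety + coordination) / 100.0


def weakest_axis_min(reliability, governance, safety, coordination):
    """Lowest of the four axis scores, rescaled to 0..1."""
    return min(reliability, governance, safety, coordination) / 100.0
```

With axis scores of 80, 60, 100, and 40, overall health is 0.25 × 280 / 100 = 0.70 while the weakest axis is 0.40, which is why the two values are worth reporting together: a reasonable average can hide one poor axis.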
No Agent Sites Detected (N/A Semantics)
agent_architecture starts from detected agent call sites. If no agent call sites are detected:
- Risk/gap metrics are emitted as 0 with Good severity.
- Composite outputs can appear fully healthy (all four axis scores 100, overall_agent_health = 1.0).
- Findings are usually absent because no detector evidence is produced.
Interpret this state as “not applicable/no detected agent orchestration”, not as proof that agent architecture is production-ready.
Require-Policy Severity Escalation
If a require_* flag is enabled and the corresponding absence risk is high, severity is escalated to Critical.
Threshold behavior:
- Escalate only when score is > 0.2
- require_loop_guards -> loop_guard_absence
- require_tool_policy -> tool_policy_absence
- require_eval_harness -> agent_eval_maturity_gap
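The escalation rule can be sketched as a small lookup. This is an illustration under assumptions: the flags are passed as a plain dict (where they live in the real config is not stated here), the function name is hypothetical, and "Warning" in the usage note is a placeholder label rather than a documented severity.

```python
ESCALATION_THRESHOLD = 0.2

# Flag -> metric mapping as listed above.
REQUIRE_FLAG_TO_METRIC = {
    "require_loop_guards": "loop_guard_absence",
    "require_tool_policy": "tool_policy_absence",
    "require_eval_harness": "agent_eval_maturity_gap",
}


def apply_require_policy(scores, severities, flags):
    """Return a copy of severities with require_* escalations applied."""
    result = dict(severities)
    for flag, metric in REQUIRE_FLAG_TO_METRIC.items():
        enabled = flags.get(flag, False)
        # Strictly greater than the threshold: a score of exactly 0.2
        # is not escalated.
        if enabled and scores.get(metric, 0.0) > ESCALATION_THRESHOLD:
            result[metric] = "Critical"
    return result
```

For instance, with require_loop_guards enabled and loop_guard_absence at 0.5, a "Warning" severity becomes Critical, while tool_policy_absence at exactly 0.2 keeps its original severity even when require_tool_policy is enabled.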
Metric Key Contract (Emitted Keys)
The metric emits:
- 56 scalar detector keys
- 6 composite keys
- 25 supplemental signal-derived numeric keys
- 3 summary table keys
Gap/risk keys are 0..1 (lower is better); the three memory-limit score keys and composite keys are higher-is-better.
Scalar keys (56)
| Metric Key | Range | Direction |
|---|---|---|
| agent_architecture.loop_guard_absence | 0..1 | Lower is better |
| agent_architecture.memory_unbounded | 0..1 | Lower is better |
| agent_architecture.context_memory_limits_score | 0..1 | Higher is better |
| agent_architecture.tool_state_limits_score | 0..1 | Higher is better |
| agent_architecture.long_term_memory_retention_score | 0..1 | Higher is better |
| agent_architecture.tool_policy_absence | 0..1 | Lower is better |
| agent_architecture.schema_validation_gap | 0..1 | Lower is better |
| agent_architecture.retry_storm_risk | 0..1 | Lower is better |
| agent_architecture.trace_linkage_gap | 0..1 | Lower is better |
| agent_architecture.runtime_slo_coverage_gap | 0..1 | Lower is better |
| agent_architecture.agent_eval_maturity_gap | 0..1 | Lower is better |
| agent_architecture.cost_budget_enforcement_gap | 0..1 | Lower is better |
| agent_architecture.coordination_risk | 0..1 | Lower is better |
| agent_architecture.routing_pattern_risk | 0..1 | Lower is better |
| agent_architecture.idempotency_gap | 0..1 | Lower is better |
| agent_architecture.instruction_boundary_violation | 0..1 | Lower is better |
| agent_architecture.deadlock_risk | 0..1 | Lower is better |
| agent_architecture.fanout_control_absence | 0..1 | Lower is better |
| agent_architecture.state_isolation_risk | 0..1 | Lower is better |
| agent_architecture.callback_depth_risk | 0..1 | Lower is better |
| agent_architecture.tool_result_validation_gap | 0..1 | Lower is better |
| agent_architecture.human_approval_absence | 0..1 | Lower is better |
| agent_architecture.handoff_input_filter_gap | 0..1 | Lower is better |
| agent_architecture.guardrail_hook_absence | 0..1 | Lower is better |
| agent_architecture.checkpoint_durability_gap | 0..1 | Lower is better |
| agent_architecture.interrupt_resume_contract_gap | 0..1 | Lower is better |
| agent_architecture.otel_genai_semconv_gap | 0..1 | Lower is better |
| agent_architecture.otel_genai_event_coverage_gap | 0..1 | Lower is better |
| agent_architecture.decision_observability_gap | 0..1 | Lower is better |
| agent_architecture.agent_shell_capable | 0..1 | Lower is better |
| agent_architecture.agent_tools_external_api | 0..1 | Lower is better |
| agent_architecture.mcp_auth_gap | 0..1 | Lower is better |
| agent_architecture.mcp_oauth_resource_binding_gap | 0..1 | Lower is better |
| agent_architecture.mcp_tool_annotation_gap | 0..1 | Lower is better |
| agent_architecture.mcp_structured_output_gap | 0..1 | Lower is better |
| agent_architecture.mcp_tool_poisoning_risk | 0..1 | Lower is better |
| agent_architecture.mcp_rug_pull_risk | 0..1 | Lower is better |
| agent_architecture.tool_sandbox_enforcement_gap | 0..1 | Lower is better |
| agent_architecture.tool_approval_bypass_risk | 0..1 | Lower is better |
| agent_architecture.untrusted_tool_output_boundary_gap | 0..1 | Lower is better |
| agent_architecture.prompt_injection_defense_gap | 0..1 | Lower is better |
| agent_architecture.sensitive_data_exposure_gap | 0..1 | Lower is better |
| agent_architecture.trace_eval_regression_risk | 0..1 | Lower is better |
| agent_architecture.a2a_agent_card_gap | 0..1 | Lower is better |
| agent_architecture.a2a_task_state_machine_gap | 0..1 | Lower is better |
| agent_architecture.a2a_webhook_auth_gap | 0..1 | Lower is better |
| agent_architecture.handoff_cycle_risk | 0..1 | Lower is better |
| agent_architecture.circuit_breaker_absence | 0..1 | Lower is better |
| agent_architecture.memory_poisoning_defense_gap | 0..1 | Lower is better |
| agent_architecture.supply_chain_provenance_gap | 0..1 | Lower is better |
| agent_architecture.agent_code_execution_sandbox_gap | 0..1 | Lower is better |
| agent_architecture.output_validation_gap | 0..1 | Lower is better |
| agent_architecture.credential_scoping_gap | 0..1 | Lower is better |
| agent_architecture.mcp_shadow_server_risk | 0..1 | Lower is better |
| agent_architecture.goal_integrity_defense_gap | 0..1 | Lower is better |
| agent_architecture.hallucination_propagation_risk | 0..1 | Lower is better |
Composite keys
| Metric Key | Range | Direction |
|---|---|---|
| agent_architecture.agent_reliability_score | 0..100 | Higher is better |
| agent_architecture.governance_readiness | 0..100 | Higher is better |
| agent_architecture.safety_protocol_score | 0..100 | Higher is better |
| agent_architecture.coordination_maturity_score | 0..100 | Higher is better |
| agent_architecture.overall_agent_health | 0..1 | Higher is better |
| agent_architecture.weakest_axis_min | 0..1 | Higher is better |
Supplemental signal keys (numeric)
| Metric Key | Range / Type | Semantics |
|---|---|---|
| agent_architecture.effective_step_budget_ratio | 0..1 | Loop-guard signal: proportion of sites with effective step budget. |
| agent_architecture.budget_propagation_coverage | 0..1 | Loop-guard signal: budget propagation through nested calls. |
| agent_architecture.max_nested_loop_depth | Number (depth) | Maximum nested loop depth observed near agent loops. |
| agent_architecture.dangerous_tool_count | Number (count) | Count of dangerous tools detected in scope. |
| agent_architecture.scoped_tool_ratio | 0..1 | Ratio of tools with explicit scoping/policy markers. |
| agent_architecture.tool_input_schema_coverage | 0..1 | Input schema validation coverage for tools. |
| agent_architecture.tool_output_schema_coverage | 0..1 | Output schema validation coverage for tools. |
| agent_architecture.trace_linkage_coverage_ratio | 0..1 | Trace ID linkage coverage around agent execution. |
| agent_architecture.step_span_coverage_ratio | 0..1 | Step/span linkage coverage around agent execution. |
| agent_architecture.runtime_slo_coverage_ratio | 0..1 | Runtime SLO instrumentation coverage. |
| agent_architecture.latency_coverage_ratio | 0..1 | Latency instrumentation coverage. |
| agent_architecture.error_coverage_ratio | 0..1 | Error/outcome instrumentation coverage. |
| agent_architecture.cost_accounting_coverage_ratio | 0..1 | Token/cost accounting coverage. |
| agent_architecture.cost_budget_coverage_ratio | 0..1 | Budget enforcement coverage near agent sites. |
| agent_architecture.agent_eval_maturity_score | 0..1 | Raw eval maturity signal (before gap inversion). |
| agent_architecture.trajectory_eval_coverage | 0..1 | Trajectory eval coverage signal. |
| agent_architecture.adversarial_eval_present | 0..1 | Presence marker for adversarial eval checks. |
| agent_architecture.stochastic_runs_present | 0..1 | Presence marker for stochastic eval runs. |
| agent_architecture.decision_point_logging_coverage_ratio | 0..1 | Coverage of decision-point logging. |
| agent_architecture.state_transition_coverage_ratio | 0..1 | Coverage of state transition logging. |
| agent_architecture.confidence_routing_logging_coverage_ratio | 0..1 | Coverage of routing/confidence logging. |
| agent_architecture.outcome_tracking_coverage_ratio | 0..1 | Coverage of outcome tracking instrumentation. |
| agent_architecture.agent_files_with_process | Number (count) | Count of agent files capable of process/shell execution. |
| agent_architecture.agent_files_with_external_api | Number (count) | Count of agent files calling external APIs. |
| agent_architecture.trace_eval_coverage | 0..1 | Coverage of trace-eval regression checks. |
Summary table keys
| Metric Key | Type | Semantics |
|---|---|---|
| agent_architecture.detector_summary | Table | Per-detector summary (risk, severity, confidence, evidence count, impact, effort). |
| agent_architecture.top_evidence | Table | Top evidence rows across detectors (file, line, detector, severity, reason). |
| agent_architecture.axis_summary | Table | Four-axis summary with detector counts and severity breakdown. |
Findings Contract
agent_architecture emits SARIF-style findings by converting detector evidence into Finding records.
- Findings are created from metric evidence in metrics_to_findings.
- Findings are grouped by (rule_id, file); each group becomes one finding.
- severity is inherited from the detector MetricScore.
- rule_id uses stable arxo/agent-* identifiers (for example, arxo/agent-loop-guard-absence, arxo/agent-mcp-auth-gap).
- If line is known, evidence is emitted as Evidence::CodeSpan(path, line, message).
- If line is missing, evidence is emitted as Evidence::Path.
- Evidence is deduplicated by (file, line, symbol, reason) and capped to 5 items per finding; overflow is summarized with trailing text evidence.
- Findings are emitted only when evidence exists; when no findings exist, MetricResult.findings is omitted.
- Output order is deterministic (sorted by rule_id, title, and location).
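The grouping, deduplication, and capping rules above can be illustrated with a short Python sketch. It is not the metrics_to_findings implementation: the row shape (dicts with rule_id, file, line, symbol, reason), the tuple encoding of evidence, and the wording of the overflow text are all assumptions made for the example, and sorting is simplified to (rule_id, file) rather than rule_id, title, and location.

```python
MAX_EVIDENCE_PER_FINDING = 5


def group_findings(evidence_rows):
    """Group evidence rows into one finding per (rule_id, file)."""
    groups = {}
    for row in evidence_rows:
        groups.setdefault((row["rule_id"], row["file"]), []).append(row)

    findings = []
    # Sorting the group keys keeps the output order deterministic.
    for (rule_id, file), rows in sorted(groups.items()):
        seen = set()
        unique = []
        for row in rows:
            key = (row["file"], row.get("line"), row.get("symbol"), row["reason"])
            if key not in seen:
                seen.add(key)
                unique.append(row)

        kept = unique[:MAX_EVIDENCE_PER_FINDING]
        evidence = []
        for row in kept:
            if row.get("line") is not None:
                # Stand-in for Evidence::CodeSpan(path, line, message).
                evidence.append(("CodeSpan", file, row["line"], row["reason"]))
            else:
                # Stand-in for Evidence::Path.
                evidence.append(("Path", file))

        overflow = len(unique) - len(kept)
        if overflow > 0:
            # Hypothetical wording for the trailing text evidence.
            evidence.append(("Text", f"{overflow} more evidence item(s) omitted"))

        findings.append({"rule_id": rule_id, "file": file, "evidence": evidence})
    return findings
```

An empty input yields an empty list, which corresponds to the case where MetricResult.findings is omitted.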