Scoring and Keys

This page documents the exact scoring behavior and emitted metric key contract for agent_architecture.

agent_architecture.governance_readiness (0–100) is the weighted complement of the three governance metrics:

  • tool_policy_absence
  • schema_validation_gap
  • tool_result_validation_gap

Weights come from config.scoring.governance_weights (e.g. tool_policy, schema_validation, tool_result_validation) and are normalized to sum to 1.0. Defaults: tool_policy = 0.35, schema_validation = 0.35, tool_result_validation = 0.30. MCP, A2A, and tool-execution safety are in the Safety axis, not Governance.
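The calculation above can be sketched as follows. This is a minimal illustration, not the engine's code: it assumes "weighted complement" means 100 times the weighted average of (1 - gap), and the function name and the mapping from weight names to metric keys are hypothetical.

```python
# Documented defaults; keys mirror config.scoring.governance_weights.
DEFAULT_GOVERNANCE_WEIGHTS = {
    "tool_policy": 0.35,
    "schema_validation": 0.35,
    "tool_result_validation": 0.30,
}

# Assumed mapping from each weight name to the gap metric it applies to.
WEIGHT_TO_METRIC = {
    "tool_policy": "tool_policy_absence",
    "schema_validation": "schema_validation_gap",
    "tool_result_validation": "tool_result_validation_gap",
}


def governance_readiness(gaps, weights=None):
    """Return a 0-100 score from 0..1 gap metrics and (unnormalized) weights."""
    weights = weights or DEFAULT_GOVERNANCE_WEIGHTS
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for name, weight in weights.items():
        gap = gaps[WEIGHT_TO_METRIC[name]]
        # Normalize the weight, then credit the complement of the gap.
        score += (weight / total) * (1.0 - gap)
    return 100.0 * score
```

Under these assumptions, a project whose only governance gap is a fully missing tool policy (tool_policy_absence = 1.0) would score about 65 with the default weights.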

agent_architecture.agent_reliability_score (0–100) is the weighted complement of reliability gap metrics, including:

  • Loop guards, memory bounds, retry storm risk
  • Observability: trace_linkage_gap, runtime_slo_coverage_gap
  • Eval: agent_eval_maturity_gap (graded eval maturity)
  • Cost budget, trace eval regression, checkpoint durability, interrupt/resume, OTel GenAI semconv and event coverage, decision observability, circuit breaker absence, output validation, hallucination propagation

Weights are configurable via config.scoring.reliability_weights; the engine normalizes them before computing the score.

Note: context_memory_limits_score, tool_state_limits_score, and long_term_memory_retention_score are emitted scalar outputs, but they are not directly weighted into composite axes.

agent_architecture.safety_protocol_score (0–100) combines three sub-scores:

  • Tool execution: sandbox, approval bypass, human approval, shell/external API, untrusted output boundary, prompt injection, sensitive data, idempotency, memory poisoning, code execution sandbox, goal integrity
  • MCP: auth, OAuth resource binding, tool annotations, structured output, tool poisoning, rug pull, supply chain provenance, credential scoping, shadow server risk
  • A2A: agent card, task state machine, webhook auth, handoff cycle, handoff input filter, guardrail hooks

Formula: safety_protocol_score = 100 * (0.40 * tool_execution + 0.30 * mcp + 0.30 * a2a), where each sub-score (0..1) is the average of the complements (1 - gap) of its gap metrics.
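A short sketch of this formula, assuming each sub-score is the plain (unweighted) mean of 1 - gap over its gap metrics and that every sub-score has at least one metric. The function names are illustrative only.

```python
def complement_average(gaps):
    """Mean of (1 - gap) over a non-empty list of 0..1 gap metrics."""
    if not gaps:
        raise ValueError("expected at least one gap metric")
    return sum(1.0 - g for g in gaps) / len(gaps)


def safety_protocol_score(tool_execution_gaps, mcp_gaps, a2a_gaps):
    """Combine the three sub-scores with the documented 0.40/0.30/0.30 weights."""
    tool_execution = complement_average(tool_execution_gaps)
    mcp = complement_average(mcp_gaps)
    a2a = complement_average(a2a_gaps)
    return 100.0 * (0.40 * tool_execution + 0.30 * mcp + 0.30 * a2a)
```

Because tool execution carries the largest weight, a fully failing tool-execution sub-score caps the axis at about 60 even when the MCP and A2A sub-scores are perfect.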

agent_architecture.coordination_maturity_score (0–100) is the weighted complement of coordination risk metrics: coordination_risk, routing_pattern_risk, instruction_boundary_violation, deadlock_risk, fanout_control_absence, state_isolation_risk, callback_depth_risk.

overall_agent_health = 0.25 * (agent_reliability_score + governance_readiness + safety_protocol_score + coordination_maturity_score) / 100

Published as agent_architecture.overall_agent_health in 0..1.

agent_architecture.weakest_axis_min is the minimum of the four axis scores (as 0–1); use it for “worst axis” reporting.
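As a worked sketch of the two aggregate outputs, taking the four axis scores on their 0–100 scale (the function names are illustrative, not part of the tool):

```python
def overall_agent_health(reliability, governance, safety, coordination):
    """Equal-weight mean of the four 0-100 axis scores, rescaled to 0..1."""
    return 0.25 * (reliability + governance + safety + coordination) / 100.0


def weakest_axis_min(reliability, governance, safety, coordination):
    """Lowest of the four axis scores, rescaled to 0..1."""
    return min(reliability, governance, safety, coordination) / 100.0
```

For axis scores of 80, 60, 100, and 40, the overall health is 0.70 while the weakest axis is 0.40, which is why the minimum is the better signal for "worst axis" reporting: the average can hide a single poor axis.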

agent_architecture starts from detected agent call sites. If no agent call sites are detected:

  • Risk/gap metrics are emitted as 0 with Good severity.
  • Composite outputs can appear fully healthy (all four axis scores 100, overall_agent_health = 1.0).
  • Findings are usually absent because no detector evidence is produced.

Interpret this state as “not applicable/no detected agent orchestration”, not as proof that agent architecture is production-ready.
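One way a consumer might flag this state is sketched below. It is a heuristic under stated assumptions: the result is available as a dict of metric keys to values plus an optional findings list, and the function name is hypothetical.

```python
AXIS_KEYS = (
    "agent_architecture.agent_reliability_score",
    "agent_architecture.governance_readiness",
    "agent_architecture.safety_protocol_score",
    "agent_architecture.coordination_maturity_score",
)


def looks_like_no_agent_code(metrics, findings=None):
    """True when the result has the 'no detected call sites' shape."""
    all_axes_perfect = all(metrics.get(key) == 100 for key in AXIS_KEYS)
    health_perfect = metrics.get("agent_architecture.overall_agent_health") == 1.0
    # findings may be None because the field is omitted when empty.
    return all_axes_perfect and health_perfect and not findings
```

A genuinely clean codebase would match the same shape, so a True result is best treated as a prompt to confirm whether agent orchestration exists at all, not as a verdict.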

If a require_* flag is enabled and the corresponding absence or gap metric is high, that metric's severity is escalated to Critical.

Threshold behavior:

  • Escalation applies only when the mapped metric's score is > 0.2.

Flag-to-metric mapping:

  • require_loop_guards -> loop_guard_absence
  • require_tool_policy -> tool_policy_absence
  • require_eval_harness -> agent_eval_maturity_gap
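The escalation rule can be sketched as below. The flag names, metric names, threshold, and the Good and Critical severities come from this page; the dict-based data shapes and the function name are assumptions for illustration.

```python
# Documented mapping from require_* flag to the metric it escalates.
REQUIRE_FLAG_TO_METRIC = {
    "require_loop_guards": "loop_guard_absence",
    "require_tool_policy": "tool_policy_absence",
    "require_eval_harness": "agent_eval_maturity_gap",
}

ESCALATION_THRESHOLD = 0.2  # escalate only when score is strictly greater


def escalate_severities(scores, severities, config):
    """Return a copy of severities with required-but-missing controls set to Critical."""
    escalated = dict(severities)
    for flag, metric in REQUIRE_FLAG_TO_METRIC.items():
        enabled = bool(config.get(flag, False))
        if enabled and scores.get(metric, 0.0) > ESCALATION_THRESHOLD:
            escalated[metric] = "Critical"
    return escalated
```

Note the strict comparison: a score of exactly 0.2 is left at its original severity.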

The metric emits:

  • 56 scalar detector keys
  • 6 composite keys
  • 25 supplemental signal-derived numeric keys
  • 3 summary table keys

Gap/risk keys are 0..1 (lower is better); the three memory-limit score keys and composite keys are higher-is-better.

Scalar detector keys:

| Metric Key | Range | Direction |
|---|---|---|
| agent_architecture.loop_guard_absence | 0..1 | Lower is better |
| agent_architecture.memory_unbounded | 0..1 | Lower is better |
| agent_architecture.context_memory_limits_score | 0..1 | Higher is better |
| agent_architecture.tool_state_limits_score | 0..1 | Higher is better |
| agent_architecture.long_term_memory_retention_score | 0..1 | Higher is better |
| agent_architecture.tool_policy_absence | 0..1 | Lower is better |
| agent_architecture.schema_validation_gap | 0..1 | Lower is better |
| agent_architecture.retry_storm_risk | 0..1 | Lower is better |
| agent_architecture.trace_linkage_gap | 0..1 | Lower is better |
| agent_architecture.runtime_slo_coverage_gap | 0..1 | Lower is better |
| agent_architecture.agent_eval_maturity_gap | 0..1 | Lower is better |
| agent_architecture.cost_budget_enforcement_gap | 0..1 | Lower is better |
| agent_architecture.coordination_risk | 0..1 | Lower is better |
| agent_architecture.routing_pattern_risk | 0..1 | Lower is better |
| agent_architecture.idempotency_gap | 0..1 | Lower is better |
| agent_architecture.instruction_boundary_violation | 0..1 | Lower is better |
| agent_architecture.deadlock_risk | 0..1 | Lower is better |
| agent_architecture.fanout_control_absence | 0..1 | Lower is better |
| agent_architecture.state_isolation_risk | 0..1 | Lower is better |
| agent_architecture.callback_depth_risk | 0..1 | Lower is better |
| agent_architecture.tool_result_validation_gap | 0..1 | Lower is better |
| agent_architecture.human_approval_absence | 0..1 | Lower is better |
| agent_architecture.handoff_input_filter_gap | 0..1 | Lower is better |
| agent_architecture.guardrail_hook_absence | 0..1 | Lower is better |
| agent_architecture.checkpoint_durability_gap | 0..1 | Lower is better |
| agent_architecture.interrupt_resume_contract_gap | 0..1 | Lower is better |
| agent_architecture.otel_genai_semconv_gap | 0..1 | Lower is better |
| agent_architecture.otel_genai_event_coverage_gap | 0..1 | Lower is better |
| agent_architecture.decision_observability_gap | 0..1 | Lower is better |
| agent_architecture.agent_shell_capable | 0..1 | Lower is better |
| agent_architecture.agent_tools_external_api | 0..1 | Lower is better |
| agent_architecture.mcp_auth_gap | 0..1 | Lower is better |
| agent_architecture.mcp_oauth_resource_binding_gap | 0..1 | Lower is better |
| agent_architecture.mcp_tool_annotation_gap | 0..1 | Lower is better |
| agent_architecture.mcp_structured_output_gap | 0..1 | Lower is better |
| agent_architecture.mcp_tool_poisoning_risk | 0..1 | Lower is better |
| agent_architecture.mcp_rug_pull_risk | 0..1 | Lower is better |
| agent_architecture.tool_sandbox_enforcement_gap | 0..1 | Lower is better |
| agent_architecture.tool_approval_bypass_risk | 0..1 | Lower is better |
| agent_architecture.untrusted_tool_output_boundary_gap | 0..1 | Lower is better |
| agent_architecture.prompt_injection_defense_gap | 0..1 | Lower is better |
| agent_architecture.sensitive_data_exposure_gap | 0..1 | Lower is better |
| agent_architecture.trace_eval_regression_risk | 0..1 | Lower is better |
| agent_architecture.a2a_agent_card_gap | 0..1 | Lower is better |
| agent_architecture.a2a_task_state_machine_gap | 0..1 | Lower is better |
| agent_architecture.a2a_webhook_auth_gap | 0..1 | Lower is better |
| agent_architecture.handoff_cycle_risk | 0..1 | Lower is better |
| agent_architecture.circuit_breaker_absence | 0..1 | Lower is better |
| agent_architecture.memory_poisoning_defense_gap | 0..1 | Lower is better |
| agent_architecture.supply_chain_provenance_gap | 0..1 | Lower is better |
| agent_architecture.agent_code_execution_sandbox_gap | 0..1 | Lower is better |
| agent_architecture.output_validation_gap | 0..1 | Lower is better |
| agent_architecture.credential_scoping_gap | 0..1 | Lower is better |
| agent_architecture.mcp_shadow_server_risk | 0..1 | Lower is better |
| agent_architecture.goal_integrity_defense_gap | 0..1 | Lower is better |
| agent_architecture.hallucination_propagation_risk | 0..1 | Lower is better |

Composite keys:

| Metric Key | Range | Direction |
|---|---|---|
| agent_architecture.agent_reliability_score | 0..100 | Higher is better |
| agent_architecture.governance_readiness | 0..100 | Higher is better |
| agent_architecture.safety_protocol_score | 0..100 | Higher is better |
| agent_architecture.coordination_maturity_score | 0..100 | Higher is better |
| agent_architecture.overall_agent_health | 0..1 | Higher is better |
| agent_architecture.weakest_axis_min | 0..1 | Higher is better |

Supplemental signal-derived keys:

| Metric Key | Range / Type | Semantics |
|---|---|---|
| agent_architecture.effective_step_budget_ratio | 0..1 | Loop-guard signal: proportion of sites with effective step budget. |
| agent_architecture.budget_propagation_coverage | 0..1 | Loop-guard signal: budget propagation through nested calls. |
| agent_architecture.max_nested_loop_depth | Number (depth) | Maximum nested loop depth observed near agent loops. |
| agent_architecture.dangerous_tool_count | Number (count) | Count of dangerous tools detected in scope. |
| agent_architecture.scoped_tool_ratio | 0..1 | Ratio of tools with explicit scoping/policy markers. |
| agent_architecture.tool_input_schema_coverage | 0..1 | Input schema validation coverage for tools. |
| agent_architecture.tool_output_schema_coverage | 0..1 | Output schema validation coverage for tools. |
| agent_architecture.trace_linkage_coverage_ratio | 0..1 | Trace ID linkage coverage around agent execution. |
| agent_architecture.step_span_coverage_ratio | 0..1 | Step/span linkage coverage around agent execution. |
| agent_architecture.runtime_slo_coverage_ratio | 0..1 | Runtime SLO instrumentation coverage. |
| agent_architecture.latency_coverage_ratio | 0..1 | Latency instrumentation coverage. |
| agent_architecture.error_coverage_ratio | 0..1 | Error/outcome instrumentation coverage. |
| agent_architecture.cost_accounting_coverage_ratio | 0..1 | Token/cost accounting coverage. |
| agent_architecture.cost_budget_coverage_ratio | 0..1 | Budget enforcement coverage near agent sites. |
| agent_architecture.agent_eval_maturity_score | 0..1 | Raw eval maturity signal (before gap inversion). |
| agent_architecture.trajectory_eval_coverage | 0..1 | Trajectory eval coverage signal. |
| agent_architecture.adversarial_eval_present | 0..1 | Presence marker for adversarial eval checks. |
| agent_architecture.stochastic_runs_present | 0..1 | Presence marker for stochastic eval runs. |
| agent_architecture.decision_point_logging_coverage_ratio | 0..1 | Coverage of decision-point logging. |
| agent_architecture.state_transition_coverage_ratio | 0..1 | Coverage of state transition logging. |
| agent_architecture.confidence_routing_logging_coverage_ratio | 0..1 | Coverage of routing/confidence logging. |
| agent_architecture.outcome_tracking_coverage_ratio | 0..1 | Coverage of outcome tracking instrumentation. |
| agent_architecture.agent_files_with_process | Number (count) | Count of agent files capable of process/shell execution. |
| agent_architecture.agent_files_with_external_api | Number (count) | Count of agent files calling external APIs. |
| agent_architecture.trace_eval_coverage | 0..1 | Coverage of trace-eval regression checks. |

Summary table keys:

| Metric Key | Type | Semantics |
|---|---|---|
| agent_architecture.detector_summary | Table | Per-detector summary (risk, severity, confidence, evidence count, impact, effort). |
| agent_architecture.top_evidence | Table | Top evidence rows across detectors (file, line, detector, severity, reason). |
| agent_architecture.axis_summary | Table | Four-axis summary with detector counts and severity breakdown. |

agent_architecture emits SARIF-style findings by converting detector evidence into Finding records.

  • Findings are created from metric evidence in metrics_to_findings.
  • Findings are grouped by (rule_id, file); each group becomes one finding.
  • severity is inherited from the detector MetricScore.
  • rule_id uses stable arxo/agent-* identifiers (for example, arxo/agent-loop-guard-absence, arxo/agent-mcp-auth-gap).
  • If line is known, evidence is emitted as Evidence::CodeSpan(path, line, message).
  • If line is missing, evidence is emitted as Evidence::Path.
  • Evidence is deduplicated by (file, line, symbol, reason) and capped to 5 items per finding; overflow is summarized with trailing text evidence.
  • Findings are emitted only when evidence exists; when no findings exist, MetricResult.findings is omitted.
  • Output order is deterministic (sorted by rule_id, title, and location).
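The grouping, deduplication, capping, and ordering rules above can be sketched as follows. This is an illustration under assumptions: evidence rows are modeled as plain dicts, the evidence variants as tuples, the overflow wording is invented, and the function is not the tool's actual metrics_to_findings.

```python
MAX_EVIDENCE_PER_FINDING = 5


def group_findings_sketch(rows):
    """Group evidence rows into findings keyed by (rule_id, file)."""
    groups = {}
    for row in rows:
        groups.setdefault((row["rule_id"], row["file"]), []).append(row)

    findings = []
    for (rule_id, path), group in groups.items():
        # Deduplicate by (file, line, symbol, reason), keeping first occurrence.
        seen, unique = set(), []
        for row in group:
            key = (row["file"], row.get("line"), row.get("symbol"), row["reason"])
            if key not in seen:
                seen.add(key)
                unique.append(row)

        evidence = []
        for row in unique[:MAX_EVIDENCE_PER_FINDING]:
            if row.get("line") is not None:
                evidence.append(("CodeSpan", path, row["line"], row["reason"]))
            else:
                evidence.append(("Path", path))

        # Summarize anything beyond the cap as one trailing text item.
        overflow = len(unique) - MAX_EVIDENCE_PER_FINDING
        if overflow > 0:
            evidence.append(("Text", f"{overflow} more evidence item(s) not shown"))

        findings.append({
            "rule_id": rule_id,
            "title": group[0].get("title", ""),
            "file": path,
            "severity": group[0]["severity"],
            "evidence": evidence,
        })

    # Deterministic order: rule_id, then title, then location.
    findings.sort(key=lambda f: (f["rule_id"], f["title"], f["file"]))
    return findings
```

An empty input yields an empty list, which corresponds to the case where MetricResult.findings is omitted.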