
Remediation Playbook

Use this playbook to convert findings into concrete fixes.

  • Symptom metric: agent_architecture.loop_guard_absence
  • Typical cause: Agent loop/invoke paths without max_steps, iteration cap, or timeout budget.
  • Minimal fix: Add step budget and wall-clock timeout in orchestration entrypoints and propagate through nested loops.
  • Validation check: loop_guard_absence decreases and effective_step_budget_ratio increases.
  • Symptom metric: agent_architecture.memory_unbounded
  • Typical cause: Conversation/state memory without TTL, size cap, summarization, or retention policy.
  • Minimal fix: Set token/window limits, TTL/eviction for state stores, and periodic summarization for long threads.
  • Validation check: memory_unbounded decreases; limit scores (context_memory_limits_score, tool_state_limits_score, long_term_memory_retention_score) increase.
  • Symptom metric: agent_architecture.tool_policy_absence, agent_architecture.schema_validation_gap, agent_architecture.tool_result_validation_gap
  • Typical cause: Unscoped tools and untyped tool inputs/outputs.
  • Minimal fix: Add allowlists/scope constraints and enforce input/output schemas (including error shapes).
  • Validation check: Gap metrics decrease; scoped_tool_ratio and schema coverage metrics increase.
  • Symptom metric: agent_architecture.retry_storm_risk, agent_architecture.fanout_control_absence, agent_architecture.deadlock_risk
  • Typical cause: Nested retries without backoff/jitter and unconstrained parallel fanout.
  • Minimal fix: Add exponential backoff with jitter, cap retries, add a concurrency limiter, and use explicit join/barrier patterns.
  • Validation check: Retry and concurrency risk metrics decrease.
  • Symptom metric: agent_architecture.agent_observability_gap, agent_architecture.agent_eval_absence
  • Typical cause: Missing step-level traces and missing regression/eval trajectories.
  • Minimal fix: Emit trace/span IDs for each agent step; add golden trajectory tests and adversarial/stochastic eval runs.
  • Validation check: step_trace_completeness_score, trajectory_eval_coverage, adversarial_eval_present, and stochastic_runs_present improve.
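
Two of the fixes above, the step budget with a wall-clock timeout and capped retries with backoff and jitter, can be sketched as follows. This is a minimal illustration, not a prescribed implementation: RunBudget, BudgetExceeded, run_agent, and backoff_delays are hypothetical names, and the sketch assumes the orchestration entrypoint can pass a single budget object down to any nested loops.

```python
import random
import time


class BudgetExceeded(RuntimeError):
    """Raised when the step budget or the wall-clock deadline is exhausted."""


class RunBudget:
    """Hypothetical shared budget: one instance per agent run."""

    def __init__(self, max_steps, timeout_s):
        self.max_steps = max_steps
        self.steps_used = 0
        # A monotonic clock is unaffected by system clock adjustments.
        self.deadline = time.monotonic() + timeout_s

    def spend(self):
        """Charge one step; raise if either limit has been reached."""
        if self.steps_used >= self.max_steps:
            raise BudgetExceeded("step budget exhausted")
        if time.monotonic() >= self.deadline:
            raise BudgetExceeded("wall-clock timeout")
        self.steps_used += 1


def run_agent(step_fn, budget):
    """Call step_fn until it returns a non-None result or the budget runs out."""
    while True:
        budget.spend()
        # Nested loops receive the same budget, so they draw from one pool.
        result = step_fn(budget)
        if result is not None:
            return result


def backoff_delays(max_retries, base_s=0.5, cap_s=30.0, rng=random.random):
    """Yield at most max_retries delays: capped exponential backoff with full jitter."""
    for attempt in range(max_retries):
        yield rng() * min(cap_s, base_s * 2 ** attempt)
```

A caller would sleep for each value from backoff_delays between attempts and stop retrying once the generator is exhausted. Sharing one RunBudget across nested loops means an inner loop cannot exceed the limit set at the entrypoint.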

Instruction Boundaries and State Isolation

  • Symptom metric: agent_architecture.instruction_boundary_violation, agent_architecture.state_isolation_risk, agent_architecture.idempotency_gap
  • Typical cause: Mixed trust boundaries (system/user/tool outputs), shared mutable state across sessions, non-idempotent side-effect tools.
  • Minimal fix: Separate prompt roles explicitly, scope state by session/user, and enforce idempotency keys for side-effectful tools.
  • Validation check: Boundary/isolation/idempotency risk metrics decrease and overall_agent_health trends upward.
  • Symptom metric: agent_architecture.mcp_auth_gap, agent_architecture.mcp_oauth_resource_binding_gap, agent_architecture.mcp_tool_annotation_gap, agent_architecture.mcp_structured_output_gap, agent_architecture.tool_sandbox_enforcement_gap, agent_architecture.tool_approval_bypass_risk
  • Typical cause: MCP surfaces without auth/resource binding, incomplete tool annotations/structured outputs, or process-capable tools without sandbox and approval gates.
  • Minimal fix: Add explicit auth + resource/audience binding for MCP auth flows, annotate MCP tools with safety metadata, enforce structured output contracts, sandbox process-capable tools, and gate high-risk actions behind explicit approval.
  • Validation check: MCP and tool execution gap metrics decrease; approval/sandbox coverage signals increase.
  • Symptom metric: agent_architecture.checkpoint_durability_gap, agent_architecture.interrupt_resume_contract_gap
  • Typical cause: Multi-step runs without persisted checkpoints or resumable interrupt contracts.
  • Minimal fix: Add checkpoint write/read boundaries around long-running steps and define explicit interrupt/resume semantics for workflow state transitions.
  • Validation check: Durability and interrupt/resume gaps decrease, improving reliability posture.
  • Symptom metric: agent_architecture.otel_genai_semconv_gap, agent_architecture.otel_genai_event_coverage_gap, agent_architecture.trace_eval_regression_risk
  • Typical cause: Missing semantic convention fields/events and weak trace-quality regression checks.
  • Minimal fix: Emit required OTel GenAI semantic attributes/events per step and add trace-eval checks to CI with pass/fail thresholds.
  • Validation check: OTel gap metrics and trace eval regression risk decrease; trace_eval_coverage rises.
  • Symptom metric: agent_architecture.a2a_agent_card_gap, agent_architecture.a2a_task_state_machine_gap, agent_architecture.a2a_webhook_auth_gap, agent_architecture.handoff_input_filter_gap, agent_architecture.guardrail_hook_absence, agent_architecture.handoff_cycle_risk
  • Typical cause: Missing A2A contracts, unauthenticated handoff webhooks, or handoff paths without filtering/guardrails and cycle control.
  • Minimal fix: Define agent cards and task lifecycle contracts, authenticate handoff webhooks, filter handoff inputs, enforce guardrail hooks, and cap/inspect handoff recursion paths.
  • Validation check: A2A/handoff risk metrics decrease and severity distribution shifts away from High/Critical.
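
The session-scoped state and idempotency-key fixes from the first entry in this section can be sketched as follows. SessionStore, idempotency_key, and IdempotentTool are hypothetical names for illustration. The sketch keeps everything in memory and assumes tool arguments are JSON-serializable. A real deployment would likely need a durable, shared store for the recorded results.

```python
import hashlib
import json


class SessionStore:
    """Keeps state per (user_id, session_id) so sessions never share mutable state."""

    def __init__(self):
        self._data = {}

    def state_for(self, user_id, session_id):
        # Each (user, session) pair gets its own dict, created on first access.
        return self._data.setdefault((user_id, session_id), {})


def idempotency_key(tool_name, args):
    """Derive a stable key from the tool name and its canonicalized arguments."""
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class IdempotentTool:
    """Wraps a side-effectful callable; a repeated key reuses the first result."""

    def __init__(self, name, fn):
        self.name = name
        self._fn = fn
        self._results = {}  # in-memory ledger; an assumption for this sketch

    def call(self, **args):
        key = idempotency_key(self.name, args)
        if key not in self._results:
            # The side effect runs only the first time this key is seen.
            self._results[key] = self._fn(**args)
        return self._results[key]
```

Deriving the key from the arguments makes an agent retry of the same call a no-op. If two identical calls are ever meant to act twice, the caller would need to supply an explicit key instead, which this sketch does not cover.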