
Remediation Playbook

Use this playbook to translate ml_architecture findings into concrete engineering fixes.

  • Symptom metric: ml_architecture.train_serve_skew_score
  • Likely cause: training and serving feature pipelines diverged.
  • Minimal fix: extract shared feature transforms into a single reusable module.
  • Validation: score increases and skew evidence count drops.
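The fix above can be sketched as a single shared feature module that both the training pipeline and the serving code import, so the transforms cannot silently diverge. A minimal sketch, assuming hypothetical feature names (`amount_cents`, `age`) and a hypothetical module layout:

```python
# features.py -- hypothetical shared module imported by BOTH the training
# pipeline and the serving code; neither side re-implements a transform.

def normalize_amount(amount_cents: int) -> float:
    """Convert cents to dollars with the same rounding in train and serve."""
    return round(amount_cents / 100.0, 2)

def bucket_age(age: int) -> str:
    """Coarse age bucket used as a categorical feature."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

def build_features(raw: dict) -> dict:
    """Single entry point: both pipelines call this, never a local copy."""
    return {
        "amount_usd": normalize_amount(raw["amount_cents"]),
        "age_bucket": bucket_age(raw["age"]),
    }
```

With one entry point, any transform change is picked up by both paths on the next deploy, which is what drives the skew evidence count down.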
  • Symptom metric: ml_architecture.skew_test_absence_score
  • Likely cause: no parity or skew regression tests.
  • Minimal fix: add train-vs-serve feature parity tests in CI.
  • Validation: score increases and CI findings include skew test coverage.
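A parity test of this kind can be as simple as running a set of golden rows through both feature paths and requiring identical output. A sketch, assuming hypothetical `train_features` / `serve_features` entry points that in practice would be imported from the two pipelines:

```python
# Hypothetical CI parity check: same golden rows through the training-side
# and serving-side feature paths must produce identical feature dicts.

def train_features(row: dict) -> dict:
    # imagine this imported from the batch/training pipeline
    return {"amount_usd": row["amount_cents"] / 100.0}

def serve_features(row: dict) -> dict:
    # imagine this imported from the online serving code
    return {"amount_usd": row["amount_cents"] / 100.0}

def assert_parity(golden_rows):
    """Fail the CI job on the first row where the two paths disagree."""
    for row in golden_rows:
        t, s = train_features(row), serve_features(row)
        assert t == s, f"train/serve skew on {row}: {t} != {s}"

GOLDEN = [{"amount_cents": 0}, {"amount_cents": 1999}, {"amount_cents": -500}]
```

Wiring `assert_parity(GOLDEN)` into the existing test suite is usually enough for the CI findings to start reporting skew test coverage.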
  • Symptom metric: ml_architecture.pipeline_complexity_score
  • Likely cause: DAG depth/fanout/cycle growth across pipeline stages.
  • Minimal fix: split oversized stages and remove cyclic dependencies between pipeline artifacts.
  • Validation: lower depth/cycle signals and higher complexity score.
  • Symptom metric: ml_architecture.reproducibility_score
  • Likely cause: missing seed controls and floating dependency versions.
  • Minimal fix: enforce deterministic seeds and exact dependency pinning.
  • Validation: seed/dependency evidence gaps decrease and score rises.
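Deterministic seeding usually means one helper called at the top of every entry point, plus exact version pins (`package==1.2.3`, not `>=`) in the dependency file. A minimal sketch of the seeding half, using only the standard library; extend it with `np.random.seed` / `torch.manual_seed` if those stacks are in use:

```python
import os
import random

def seed_everything(seed=42):
    """Pin every source of randomness the stack uses. Extend with
    np.random.seed(seed) / torch.manual_seed(seed) where applicable."""
    random.seed(seed)
    # PYTHONHASHSEED only takes effect for subprocesses launched after this:
    os.environ["PYTHONHASHSEED"] = str(seed)

seed_everything(123)
first = [random.random() for _ in range(3)]
seed_everything(123)
second = [random.random() for _ in range(3)]  # identical to `first`
```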
  • Symptom metric: ml_architecture.data_lineage_integrity_score
  • Likely cause: unversioned dataset reads and mutable model load paths.
  • Minimal fix: use immutable dataset/model identifiers (hash/version, registry URIs).
  • Validation: lineage score rises and unversioned path findings shrink.
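An immutable identifier can be derived from the artifact's content hash so the same bytes always resolve to the same URI, and a changed dataset cannot hide behind a reused path. A sketch, with the `registry://` URI scheme as an assumption, not a real registry API:

```python
import hashlib

def dataset_digest(content: bytes) -> str:
    """Content hash that becomes part of the dataset's immutable identifier."""
    return hashlib.sha256(content).hexdigest()

def versioned_uri(name: str, content: bytes) -> str:
    """e.g. 'registry://datasets/churn@sha256:...' instead of a mutable
    filesystem path that can be overwritten in place."""
    return f"registry://datasets/{name}@sha256:{dataset_digest(content)}"
```

Training code then records `versioned_uri(...)` in the run metadata, and the serving side loads models by the same digest-pinned reference.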
  • Symptom metric: ml_architecture.experiment_isolation_score
  • Likely cause: shared temp/output paths or global mutable state across runs.
  • Minimal fix: isolate run outputs by run ID and remove module-level shared mutable state.
  • Validation: isolation evidence count declines and score rises.
  • Symptom metric: ml_architecture.eval_integrity_score
  • Likely cause: leakage-prone fit/split ordering and weak split hygiene.
  • Minimal fix: enforce split-before-fit and explicit group/time-aware split strategy.
  • Validation: eval integrity score rises and leakage indicators decline.
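Split-before-fit means the split happens on raw rows, and every fit-time statistic (scaler means, encoders, vocabularies) is computed on the train partition only. A time-aware sketch with hypothetical row fields `ts` and `x`:

```python
def time_split(rows, cutoff_ts):
    """Split strictly by event time BEFORE any fitting happens, so nothing
    from the held-out window leaks into fit-time statistics."""
    train = [r for r in rows if r["ts"] < cutoff_ts]
    held_out = [r for r in rows if r["ts"] >= cutoff_ts]
    return train, held_out

def fit_scaler_mean(train_rows):
    """Fit-time statistic computed on the train split only."""
    return sum(r["x"] for r in train_rows) / len(train_rows)

rows = [{"ts": t, "x": float(t)} for t in range(1, 5)]
train, held_out = time_split(rows, cutoff_ts=3)
mu = fit_scaler_mean(train)  # never sees held-out rows
```

For grouped data the same idea applies with a group key instead of a timestamp: all rows of a group land on one side of the split.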
  • Symptom metric: ml_architecture.serving_maturity_score
  • Likely cause: model loading in the request path and missing warmup/signature checks.
  • Minimal fix: move model load to startup and validate input/output contracts.
  • Validation: serving maturity score improves and cold-start evidence decreases.
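The serving-side shape is: load once at process start, validate every request against the model's input signature, and keep the request handler free of I/O-heavy work. A framework-agnostic sketch; the signature dict and `Model` class are assumptions standing in for the real artifact:

```python
# Hypothetical serving skeleton: model loaded at startup, contract checked
# per request, inference already warm in the request path.

EXPECTED_INPUTS = {"amount_usd": float, "age_bucket": str}  # assumed signature

class Model:
    def predict(self, features: dict) -> float:
        return 0.5  # stand-in for real inference

MODEL = Model()  # loaded ONCE at startup, not inside the request handler

def validate(features: dict) -> None:
    for name, typ in EXPECTED_INPUTS.items():
        if name not in features:
            raise ValueError(f"missing input: {name}")
        if not isinstance(features[name], typ):
            raise TypeError(f"{name} must be {typ.__name__}")

def handle_request(features: dict) -> float:
    validate(features)              # input contract check
    return MODEL.predict(features)  # no load cost in the hot path
```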
  • Symptom metric: ml_architecture.drift_monitoring_score
  • Likely cause: missing drift metrics and alert thresholds.
  • Minimal fix: add feature/prediction drift monitors with thresholded alerts.
  • Validation: monitoring coverage increases and score improves.
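One common feature-drift monitor is the Population Stability Index (PSI) between a training baseline and a live window, with an alert threshold; the ~0.2 rule of thumb below is an assumption to tune, not a standard. A dependency-free sketch:

```python
import math

def psi(expected, actual, bins=4):
    """Population Stability Index of `actual` against the `expected`
    baseline; identical distributions score 0, shifted ones score higher."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[sum(x > e for e in edges)] += 1
        # floor at a tiny mass so empty bins don't blow up the log
        return [max(c / len(xs), 1e-6) for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [0.1 * i for i in range(100)]
shifted = [0.1 * i + 5 for i in range(100)]
DRIFT_ALERT_THRESHOLD = 0.2  # assumed rule of thumb; tune per feature
```

In production the same computation runs per feature (and on prediction distributions) on a schedule, and `psi(...) > DRIFT_ALERT_THRESHOLD` feeds the alerting policy.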
  • Symptom metric: ml_architecture.data_validation_score
  • Likely cause: no schema/range/null checks in data ingress paths.
  • Minimal fix: add data validation contracts at ingestion and pre-training boundaries.
  • Validation: validation score rises and missing-control evidence decreases.
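A data validation contract is typically a declarative schema with type, nullability, and range rules, evaluated at ingestion and again before training. A minimal sketch with a hypothetical two-column schema:

```python
# Hypothetical ingestion contract: schema, range, and null checks applied at
# the data ingress boundary, before anything reaches training.

SCHEMA = {
    "user_id": {"type": str, "nullable": False},
    "amount_usd": {"type": float, "nullable": False, "min": 0.0, "max": 1e6},
}

def validate_row(row: dict) -> list:
    """Return a list of violations; an empty list means the row passes."""
    errors = []
    for col, rule in SCHEMA.items():
        val = row.get(col)
        if val is None:
            if not rule["nullable"]:
                errors.append(f"{col}: null not allowed")
            continue
        if not isinstance(val, rule["type"]):
            errors.append(f"{col}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and val < rule["min"]:
            errors.append(f"{col}: below {rule['min']}")
        if "max" in rule and val > rule["max"]:
            errors.append(f"{col}: above {rule['max']}")
    return errors
```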
  • Symptom metric: ml_architecture.ci_integration_score
  • Likely cause: ML checks not executed in CI.
  • Minimal fix: add ML test/eval stages to CI workflows with fail thresholds.
  • Validation: CI integration score increases and CI-related findings drop.
  • Symptom metric: ml_architecture.fairness_audit_score
  • Likely cause: no fairness metric checks for protected cohorts.
  • Minimal fix: add fairness evaluation suite and enforce thresholds before release.
  • Validation: fairness score improves and audit coverage evidence appears.
  • Symptom metric: ml_architecture.ab_testing_score
  • Likely cause: no controlled rollout path for model versions.
  • Minimal fix: add treatment/control gating with experiment tracking.
  • Validation: A/B score improves and experiment evidence increases.
  • Symptom metric: ml_architecture.shadow_canary_score
  • Likely cause: direct full rollout with no shadow/canary stage.
  • Minimal fix: introduce shadow traffic and phased canary deployment policy.
  • Validation: shadow/canary score rises and rollout-risk findings decrease.
  • Symptom metric: ml_architecture.monitoring_alerting_score
  • Likely cause: incomplete runtime metrics/alerts for serving SLIs.
  • Minimal fix: instrument latency/error/throughput/quality signals and paging policies.
  • Validation: score increases and alerting-gap evidence decreases.
  • Symptom metric: ml_architecture.model_staleness_score
  • Likely cause: no retraining cadence or freshness SLA checks.
  • Minimal fix: define staleness SLO and retrain triggers tied to data/model age.
  • Validation: staleness score improves and stale-model indicators decline.
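A staleness SLO reduces to comparing artifact age against an agreed freshness budget and firing a retrain trigger when it is exceeded. A sketch; the 30-day budget is an assumption to replace with the team's actual SLA:

```python
import time

STALENESS_SLO_DAYS = 30  # assumed freshness SLA; set per model

def needs_retrain(trained_at, now=None, slo_days=STALENESS_SLO_DAYS):
    """True when the model's age (in days) exceeds the staleness SLO.
    `trained_at`/`now` are Unix timestamps in seconds."""
    now = time.time() if now is None else now
    age_days = (now - trained_at) / 86400
    return age_days > slo_days
```

The same check extends to data age: trigger on whichever of model age or last-ingested-data age breaches its budget first.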
  • Symptom metric: ml_architecture.serving_ops_score
  • Likely cause: missing health/readiness, graceful shutdown, rollback controls.
  • Minimal fix: add health/readiness probes, safe shutdown hooks, and rollback playbooks.
  • Validation: serving-ops score increases and infra-control evidence improves.
  • Symptom metric: ml_architecture.model_validation_gates_score
  • Likely cause: no explicit promotion gates against baseline quality thresholds.
  • Minimal fix: enforce baseline-vs-candidate checks with fail-closed promotion criteria.
  • Validation: gate-coverage evidence appears and score improves.
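Fail-closed here means a candidate is promoted only when every tracked metric meets or beats the baseline within tolerance, and a missing metric fails the gate rather than passing it. A minimal sketch with an assumed per-metric tolerance:

```python
# Hypothetical fail-closed promotion gate: baseline vs candidate metrics.

TOLERANCE = 0.005  # assumed allowed regression per metric

def gate(baseline: dict, candidate: dict) -> bool:
    """Promote only if every baseline metric is matched within TOLERANCE;
    absent evidence fails the gate (fail closed)."""
    for metric, base_val in baseline.items():
        cand_val = candidate.get(metric)
        if cand_val is None:                 # missing metric -> block
            return False
        if cand_val < base_val - TOLERANCE:  # regression -> block
            return False
    return True
```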
  • Symptom metric: ml_architecture.calibration_uncertainty_score
  • Likely cause: confidence outputs are uncalibrated and uncertainty handling is undefined.
  • Minimal fix: add calibration evaluation and a fallback/abstain policy for low-confidence predictions.
  • Validation: calibration-control evidence increases and score rises.
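The calibration half is commonly measured with Expected Calibration Error (ECE), and the uncertainty half with an explicit abstain/fallback policy. A dependency-free sketch; the 0.6 abstain threshold is an assumption to tune against the fallback's cost:

```python
def expected_calibration_error(confs, corrects, bins=10):
    """ECE: confidence-weighted gap between predicted confidence and
    observed accuracy, binned by confidence."""
    totals = [0] * bins
    conf_sum = [0.0] * bins
    hit_sum = [0.0] * bins
    for c, ok in zip(confs, corrects):
        i = min(int(c * bins), bins - 1)
        totals[i] += 1
        conf_sum[i] += c
        hit_sum[i] += ok
    n = len(confs)
    return sum(
        t / n * abs(conf_sum[i] / t - hit_sum[i] / t)
        for i, t in enumerate(totals) if t
    )

def predict_or_abstain(conf, threshold=0.6):
    """Fallback policy: abstain on low-confidence predictions."""
    return "predict" if conf >= threshold else "abstain"
```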
  • Symptom metric: ml_architecture.feature_store_consistency_score
  • Likely cause: offline training features and online serving features are not point-in-time consistent.
  • Minimal fix: enforce online/offline parity checks and point-in-time correctness tests.
  • Validation: consistency evidence appears and parity-gap indicators decline.
  • Symptom metric: ml_architecture.progressive_delivery_analysis_score
  • Likely cause: canary rollouts lack quantitative guardrails and abort automation.
  • Minimal fix: add canary analysis templates with SLO guardrails and automatic rollback triggers.
  • Validation: rollout-analysis evidence appears and score increases.
  • Symptom metric: ml_architecture.provenance_attestation_score
  • Likely cause: model/data artifacts are not accompanied by signed provenance metadata.
  • Minimal fix: generate provenance attestations and bind artifact digests to build/release metadata.
  • Validation: attestation evidence increases and provenance gaps shrink.
  • Symptom metric: ml_architecture.responsible_ai_governance_score
  • Likely cause: model cards, risk assessments, and limitations are missing or incomplete.
  • Minimal fix: publish model governance artifacts and require review before release.
  • Validation: governance-document evidence appears and score rises.
  • Symptom metric: ml_architecture.attestation_enforcement_score
  • Likely cause: deployment admission does not verify signatures/provenance.
  • Minimal fix: enforce deploy-time signature and provenance checks in admission policy.
  • Validation: enforcement evidence appears and bypass paths are reduced.
  • Symptom metric: ml_architecture.model_registry_governance_score
  • Likely cause: mutable aliases and weak approval controls in the registry.
  • Minimal fix: require immutable version references, staged aliases, and approval metadata.
  • Validation: registry-governance evidence increases and score improves.
  • Symptom metric: ml_architecture.lineage_schema_fidelity_score
  • Likely cause: lineage events omit required run/input/output/schema facets.
  • Minimal fix: standardize lineage schema and enforce completeness in pipeline emission.
  • Validation: schema-completeness evidence improves and score rises.
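Enforcing completeness usually means rejecting any lineage event that omits a required facet before it is emitted. A sketch with an assumed facet set; `missing_facets` treats empty facets as absent:

```python
# Hypothetical completeness check for lineage events: every emitted event
# must carry the required run/input/output/schema facets.

REQUIRED_FACETS = {"run_id", "inputs", "outputs", "schema"}

def missing_facets(event: dict) -> set:
    """Facets the event omits (empty values count as missing);
    an empty set means the event is complete and may be emitted."""
    return {f for f in REQUIRED_FACETS if not event.get(f)}
```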
  • Symptom metric: ml_architecture.adversarial_resilience_score
  • Likely cause: no adversarial/poisoning/backdoor evaluations in model validation.
  • Minimal fix: add adversarial robustness tests and promotion thresholds in CI.
  • Validation: resilience-eval evidence appears and score increases.
  • Symptom metric: ml_architecture.post_market_incident_readiness_score
  • Likely cause: incident runbooks, kill switch controls, and retention plans are incomplete.
  • Minimal fix: define post-deployment incident procedures with rollback/kill-switch drills.
  • Validation: incident-readiness evidence increases and score improves.
  • Symptom metric: ml_architecture.genai_telemetry_semconv_score
  • Likely cause: GenAI serving telemetry is missing semantic-convention aligned attributes.
  • Minimal fix: adopt OpenTelemetry GenAI semantic conventions for token/error/latency telemetry.
  • Validation: semconv telemetry evidence appears and score rises.

After applying category-level fixes, re-check these metrics together:

  • ml_architecture.overall_score
  • ml_architecture.overall_score_extended
  • detector-level scores for changed categories
  • finding severity trend in high-centrality modules

If overall_score stalls, inspect low-confidence categories and unresolved adjacent bottlenecks from Scoring and Keys.