Remediation Playbook
Use this playbook to translate ml_architecture findings into concrete engineering fixes.
Train/Serve Skew
- Symptom metric: ml_architecture.train_serve_skew_score
- Likely cause: training and serving feature pipelines diverged.
- Minimal fix: extract shared feature transforms into a single reusable module.
- Validation: score increases and skew evidence count drops.
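One way to implement the shared-module fix is a single feature-building entry point imported verbatim by both the training and serving code paths. This is a minimal sketch; the feature names (`age_scaled`, `income_scaled`) and bounds are hypothetical placeholders, not part of any real pipeline:

```python
# shared_features.py: one transform module imported by BOTH the training
# and serving pipelines, so feature logic cannot silently diverge.

def clip_and_scale(value: float, lo: float, hi: float) -> float:
    """Clamp a raw feature to [lo, hi] and scale it to [0, 1]."""
    clipped = max(lo, min(hi, value))
    return (clipped - lo) / (hi - lo)

def build_features(raw: dict) -> dict:
    """Single entry point used verbatim in both train and serve code paths."""
    return {
        "age_scaled": clip_and_scale(raw["age"], 0.0, 100.0),
        "income_scaled": clip_and_scale(raw["income"], 0.0, 250_000.0),
    }
```

Both pipelines then call `build_features` instead of maintaining their own copies of the transform logic.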
Skew Test Absence
- Symptom metric: ml_architecture.skew_test_absence_score
- Likely cause: no parity or skew regression tests.
- Minimal fix: add train-vs-serve feature parity tests in CI.
- Validation: score increases and CI findings include skew test coverage.
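A parity test can be as simple as pushing the same raw records through both pipelines and asserting identical outputs. The two pipeline functions below are stand-ins for illustration, assuming both paths expose a callable feature builder:

```python
# Parity test: the same raw record must produce identical features through
# the training path and the serving path. Run under pytest in CI.

def train_features(raw: dict) -> dict:
    """Stand-in for the batch/training feature pipeline."""
    return {"x": raw["x"] * 2.0}

def serve_features(raw: dict) -> dict:
    """Stand-in for the online/serving feature pipeline."""
    return {"x": raw["x"] * 2.0}

def test_train_serve_parity():
    samples = [{"x": 0.0}, {"x": 1.5}, {"x": -3.25}]
    for raw in samples:
        assert train_features(raw) == serve_features(raw)

test_train_serve_parity()
```

In practice the sample records should come from a fixed golden dataset so the test is deterministic across CI runs.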
Pipeline Complexity
- Symptom metric: ml_architecture.pipeline_complexity_score
- Likely cause: DAG depth/fanout/cycle growth across pipeline stages.
- Minimal fix: split oversized stages and remove cyclic dependencies between pipeline artifacts.
- Validation: lower depth/cycle signals and higher complexity score.
Reproducibility
- Symptom metric: ml_architecture.reproducibility_score
- Likely cause: missing seed controls and floating dependency versions.
- Minimal fix: enforce deterministic seeds and exact dependency pinning.
- Validation: seed/dependency evidence gaps decrease and score rises.
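The seed-control half of the fix is usually a single `seed_everything` helper called at the top of every entry point. A minimal stdlib-only sketch (extend with `numpy.random.seed` / `torch.manual_seed` if those libraries are in use):

```python
import os
import random

def seed_everything(seed: int) -> None:
    """Seed every RNG in use so repeated runs produce identical draws."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
assert first == second  # identical draws under the same seed
```

The dependency half is exact pinning (`package==1.2.3` in requirements, or a lock file) so rebuilds resolve the same versions.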
Data Lineage Integrity
- Symptom metric: ml_architecture.data_lineage_integrity_score
- Likely cause: unversioned dataset reads and mutable model load paths.
- Minimal fix: use immutable dataset/model identifiers (hash/version, registry URIs).
- Validation: lineage score rises and unversioned path findings shrink.
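An immutable identifier can be derived directly from artifact content. This sketch hashes a dataset file so lineage records can reference `sha256:<digest>` instead of a mutable path; the chunked read keeps memory flat for large files:

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(path: Path) -> str:
    """Content hash used as an immutable dataset identifier in lineage logs."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Pipelines then log and load `dataset@sha256:<fingerprint>` (or a registry URI) rather than a bare file path that can be overwritten in place.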
Experiment Isolation
- Symptom metric: ml_architecture.experiment_isolation_score
- Likely cause: shared temp/output paths or global mutable state across runs.
- Minimal fix: isolate run outputs by run ID and remove module-level shared mutable state.
- Validation: isolation evidence count declines and score rises.
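Output isolation by run ID can be a one-function convention, sketched here with a random run ID; a tracking-system run ID works equally well:

```python
import uuid
from pathlib import Path

def make_run_dir(base: Path) -> Path:
    """Create a unique output directory per run; never a shared temp path.

    exist_ok=False makes accidental run-ID collisions fail loudly instead
    of silently mixing two runs' artifacts.
    """
    run_dir = base / uuid.uuid4().hex
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir
```

All checkpoints, metrics, and logs for a run go under its own directory, and module-level mutable state is replaced by per-run objects passed explicitly.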
Eval Integrity
- Symptom metric: ml_architecture.eval_integrity_score
- Likely cause: leakage-prone fit/split ordering and weak split hygiene.
- Minimal fix: enforce split-before-fit and explicit group/time-aware split strategy.
- Validation: eval integrity score rises and leakage indicators decline.
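"Split before fit" means the holdout set is carved out before any statistic (scaler, encoder, vocabulary) is computed. A minimal time-aware split sketch, with hypothetical record fields:

```python
def time_split(records: list, timestamp_key: str, cutoff) -> tuple:
    """Split strictly by time: everything at or after `cutoff` is held out,
    so no future information can leak into fitting."""
    train = [r for r in records if r[timestamp_key] < cutoff]
    holdout = [r for r in records if r[timestamp_key] >= cutoff]
    return train, holdout

rows = [{"t": 1, "y": 0}, {"t": 2, "y": 1}, {"t": 3, "y": 0}, {"t": 4, "y": 1}]
train, holdout = time_split(rows, "t", cutoff=3)
# Fit normalizers/encoders on `train` only, then apply them to `holdout`.
```

For grouped data (e.g. multiple rows per user), split on the group key instead so no entity appears on both sides.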
Serving Maturity
- Symptom metric: ml_architecture.serving_maturity_score
- Likely cause: model loading in request path, missing warmup/signature checks.
- Minimal fix: move model load to startup and validate input/output contracts.
- Validation: serving maturity score improves and cold-start evidence decreases.
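The load-at-startup pattern can be sketched framework-free: the model is loaded and warmed once in the constructor, and the request handler only validates inputs and runs inference. `StubModel` below is a hypothetical stand-in for a real artifact loader:

```python
class ModelServer:
    """Load the model once at startup, never inside the request handler."""

    def __init__(self, load_model):
        self.model = load_model()   # cold-start cost paid once at startup
        self._warmup()

    def _warmup(self):
        # Prime caches/JIT with a dummy request before taking traffic.
        self.model.predict([0.0])

    def handle(self, features):
        # Minimal input-contract check before inference.
        if not isinstance(features, list):
            raise ValueError("expected a list of floats")
        return self.model.predict(features)


class StubModel:
    """Stand-in for a real model artifact; replace with your loader."""

    def predict(self, xs):
        return [x * 2.0 for x in xs]


server = ModelServer(StubModel)
```

In a real service the same shape maps onto a web framework's startup hook, with the output contract checked symmetrically before the response is returned.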
Drift Monitoring
- Symptom metric: ml_architecture.drift_monitoring_score
- Likely cause: missing drift metrics and alert thresholds.
- Minimal fix: add feature/prediction drift monitors with thresholded alerts.
- Validation: monitoring coverage increases and score improves.
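A common choice of drift metric is the Population Stability Index over binned feature or prediction distributions, alerting when it crosses a threshold (0.2 is a widely used rule of thumb, not a universal constant):

```python
import math

def psi(expected: list, actual: list, eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions
    (bin fractions summing to 1). Larger values mean more drift."""
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)  # clamp so empty bins don't blow up the log
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total
```

A monitor job computes `psi(training_bins, live_window_bins)` per feature on a schedule and pages when the threshold is exceeded.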
Data Validation
- Symptom metric: ml_architecture.data_validation_score
- Likely cause: no schema/range/null checks in data ingress paths.
- Minimal fix: add data validation contracts at ingestion and pre-training boundaries.
- Validation: validation score rises and missing-control evidence decreases.
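A validation contract at an ingress boundary can start as a row-level checker that reports every violation instead of failing on the first. The fields and bounds below are hypothetical examples of a schema/range/null contract:

```python
def validate_row(row: dict) -> list:
    """Return a list of violations; an empty list means the row passes."""
    errors = []
    if row.get("user_id") is None:
        errors.append("user_id is null")
    age = row.get("age")
    if age is None or not (0 <= age <= 120):
        errors.append(f"age out of range: {age!r}")
    return errors
```

The same contract runs at ingestion and again at the pre-training boundary, so bad rows are quarantined before they reach a fit.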
CI Integration
- Symptom metric: ml_architecture.ci_integration_score
- Likely cause: ML checks not executed in CI.
- Minimal fix: add ML test/eval stages to CI workflows with fail thresholds.
- Validation: CI integration score increases and CI-related findings drop.
Fairness Audit
- Symptom metric: ml_architecture.fairness_audit_score
- Likely cause: no fairness metric checks for protected cohorts.
- Minimal fix: add fairness evaluation suite and enforce thresholds before release.
- Validation: fairness score improves and audit coverage evidence appears.
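One simple check in a fairness suite is a demographic-parity style gap: compare positive-prediction rates across cohorts and gate on the largest difference. A minimal sketch (one metric among several a real audit should include):

```python
def group_rates(preds: list, groups: list) -> dict:
    """Positive-prediction rate per cohort."""
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        rates[g] = sum(preds[i] for i in idx) / len(idx)
    return rates

def parity_gap(rates: dict) -> float:
    """Largest rate difference between any two cohorts; gate releases on it."""
    return max(rates.values()) - min(rates.values())
```

The release gate then asserts `parity_gap(rates) <= threshold` alongside per-cohort error-rate checks.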
A/B Testing
- Symptom metric: ml_architecture.ab_testing_score
- Likely cause: no controlled rollout path for model versions.
- Minimal fix: add treatment/control gating with experiment tracking.
- Validation: A/B score improves and experiment evidence increases.
Shadow/Canary
- Symptom metric: ml_architecture.shadow_canary_score
- Likely cause: direct full rollout with no shadow/canary stage.
- Minimal fix: introduce shadow traffic and phased canary deployment policy.
- Validation: shadow/canary score rises and rollout-risk findings decrease.
Monitoring and Alerting
- Symptom metric: ml_architecture.monitoring_alerting_score
- Likely cause: incomplete runtime metrics/alerts for serving SLIs.
- Minimal fix: instrument latency/error/throughput/quality signals and paging policies.
- Validation: score increases and alerting-gap evidence decreases.
Model Staleness
- Symptom metric: ml_architecture.model_staleness_score
- Likely cause: no retraining cadence or freshness SLA checks.
- Minimal fix: define staleness SLO and retrain triggers tied to data/model age.
- Validation: staleness score improves and stale-model indicators decline.
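A staleness SLO reduces to a timestamp comparison that a scheduled job can run. A minimal sketch, assuming the model's training timestamp is recorded somewhere queryable (e.g. registry metadata):

```python
from datetime import datetime, timedelta, timezone

def is_stale(trained_at: datetime, max_age: timedelta) -> bool:
    """Freshness check against a staleness SLO; a scheduler calls this and
    triggers the retraining pipeline when it returns True."""
    return datetime.now(timezone.utc) - trained_at > max_age
```

Data-age triggers (last ingested partition older than N hours) complement the model-age trigger shown here.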
Serving Ops
- Symptom metric: ml_architecture.serving_ops_score
- Likely cause: missing health/readiness probes, graceful shutdown, and rollback controls.
- Minimal fix: add health/readiness probes, safe shutdown hooks, and rollback playbooks.
- Validation: serving-ops score increases and infra-control evidence improves.
Model Validation Gates
- Symptom metric: ml_architecture.model_validation_gates_score
- Likely cause: no explicit promotion gates against baseline quality thresholds.
- Minimal fix: enforce baseline-vs-candidate checks with fail-closed promotion criteria.
- Validation: gate-coverage evidence appears and score improves.
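A fail-closed promotion gate compares candidate metrics to the baseline and blocks on any missing or regressed metric. A minimal sketch, assuming higher is better for every tracked metric:

```python
def promote(candidate_metrics: dict, baseline_metrics: dict,
            min_delta: float = 0.0) -> bool:
    """Fail-closed gate: promote only if every baseline metric is present
    in the candidate and at least as good (plus an optional margin)."""
    for name, base in baseline_metrics.items():
        cand = candidate_metrics.get(name)
        if cand is None or cand < base + min_delta:
            return False  # missing or regressed metric blocks promotion
    return True
```

"Fail-closed" is the key property: an absent metric is treated as a failure, never a pass.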
Calibration and Uncertainty
- Symptom metric: ml_architecture.calibration_uncertainty_score
- Likely cause: confidence outputs are uncalibrated and uncertainty handling is undefined.
- Minimal fix: add calibration evaluation and a fallback/abstain policy for low-confidence predictions.
- Validation: calibration-control evidence increases and score rises.
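The fallback/abstain half of the fix can be a small decision policy: predictions whose probability falls inside a band around the decision threshold are routed to a fallback (human review, default action) instead of being acted on. The threshold and band width below are hypothetical and should be tuned against calibration curves:

```python
def decide(prob: float, threshold: float = 0.5,
           abstain_band: float = 0.1) -> str:
    """Abstain when the calibrated probability is too close to the decision
    threshold to act on confidently."""
    if abs(prob - threshold) < abstain_band:
        return "abstain"  # route to fallback path
    return "positive" if prob >= threshold else "negative"
```

This only makes sense after the calibration-evaluation half: the probabilities fed in must first be checked (and if needed recalibrated) so the band is meaningful.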
Feature Store Consistency
- Symptom metric: ml_architecture.feature_store_consistency_score
- Likely cause: offline training features and online serving features are not point-in-time consistent.
- Minimal fix: enforce online/offline parity checks and point-in-time correctness tests.
- Validation: consistency evidence appears and parity-gap indicators decline.
Progressive Delivery Analysis
- Symptom metric: ml_architecture.progressive_delivery_analysis_score
- Likely cause: canary rollouts lack quantitative guardrails and abort automation.
- Minimal fix: add canary analysis templates with SLO guardrails and automatic rollback triggers.
- Validation: rollout-analysis evidence appears and score increases.
Provenance Attestation
- Symptom metric: ml_architecture.provenance_attestation_score
- Likely cause: model/data artifacts are not accompanied by signed provenance metadata.
- Minimal fix: generate provenance attestations and bind artifact digests to build/release metadata.
- Validation: attestation evidence increases and provenance gaps shrink.
Responsible AI Governance
- Symptom metric: ml_architecture.responsible_ai_governance_score
- Likely cause: model cards, risk assessments, and limitations documentation are missing or incomplete.
- Minimal fix: publish model governance artifacts and require review before release.
- Validation: governance-document evidence appears and score rises.
Attestation Enforcement
- Symptom metric: ml_architecture.attestation_enforcement_score
- Likely cause: deployment admission does not verify signatures/provenance.
- Minimal fix: enforce deploy-time signature and provenance checks in admission policy.
- Validation: enforcement evidence appears and bypass paths are reduced.
Model Registry Governance
- Symptom metric: ml_architecture.model_registry_governance_score
- Likely cause: mutable aliases and weak approval controls in the registry.
- Minimal fix: require immutable version references, staged aliases, and approval metadata.
- Validation: registry-governance evidence increases and score improves.
Lineage Schema Fidelity
- Symptom metric: ml_architecture.lineage_schema_fidelity_score
- Likely cause: lineage events omit required run/input/output/schema facets.
- Minimal fix: standardize lineage schema and enforce completeness in pipeline emission.
- Validation: schema-completeness evidence improves and score rises.
Adversarial Resilience
- Symptom metric: ml_architecture.adversarial_resilience_score
- Likely cause: no adversarial/poisoning/backdoor evaluations in model validation.
- Minimal fix: add adversarial robustness tests and promotion thresholds in CI.
- Validation: resilience-eval evidence appears and score increases.
Post-Market Incident Readiness
- Symptom metric: ml_architecture.post_market_incident_readiness_score
- Likely cause: incident runbooks, kill-switch controls, and retention plans are incomplete.
- Minimal fix: define post-deployment incident procedures with rollback/kill-switch drills.
- Validation: incident-readiness evidence increases and score improves.
GenAI Telemetry SemConv
- Symptom metric: ml_architecture.genai_telemetry_semconv_score
- Likely cause: GenAI serving telemetry is missing semantic-convention-aligned attributes.
- Minimal fix: adopt OpenTelemetry GenAI semantic conventions for token/error/latency telemetry.
- Validation: semconv telemetry evidence appears and score rises.
Composite Cross-Check
After category fixes, confirm these together:
- ml_architecture.overall_score
- ml_architecture.overall_score_extended
- detector-level scores for changed categories
- finding severity trend in high-centrality modules
If overall_score stalls, inspect low-confidence categories and unresolved adjacent bottlenecks from Scoring and Keys.