
Scoring and Keys

This page documents scoring behavior and emitted keys for finetuning_architecture.

All detector and composite scores are normalized to 0..1 (higher is better), except eval_maturity_level (0..4).

reproducibility_score =
mean(base_model_versioning_score, run_lineage_score, determinism_envelope_score, checkpoint_eval_lineage_score)
data_integrity_score =
mean(dataset_contamination_score, eval_absence_score, loss_template_alignment_score, distillation_integrity_score)
safety_governance_score =
mean(model_artifact_access_score, artifact_trust_surface_score, privacy_recordkeeping_score, data_provenance_score, adapter_isolation_score)
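
As a minimal sketch, the three composites above can be computed as unweighted means of their detector scores. The `COMPOSITES` mapping and the `composite_scores` helper are hypothetical names used only for illustration, and the handling of a missing detector score (here, a `KeyError`) is an assumption rather than documented behavior.

```python
from statistics import fmean

# Detector inputs per composite, copied from the formulas above.
# COMPOSITES and composite_scores are illustrative names, not part of the tool.
COMPOSITES = {
    "reproducibility_score": [
        "base_model_versioning_score",
        "run_lineage_score",
        "determinism_envelope_score",
        "checkpoint_eval_lineage_score",
    ],
    "data_integrity_score": [
        "dataset_contamination_score",
        "eval_absence_score",
        "loss_template_alignment_score",
        "distillation_integrity_score",
    ],
    "safety_governance_score": [
        "model_artifact_access_score",
        "artifact_trust_surface_score",
        "privacy_recordkeeping_score",
        "data_provenance_score",
        "adapter_isolation_score",
    ],
}


def composite_scores(detectors: dict[str, float]) -> dict[str, float]:
    """Return the unweighted mean of each composite's detector scores.

    Assumption: every listed detector score is present; a missing key raises KeyError.
    """
    return {
        name: fmean([detectors[key] for key in keys])
        for name, keys in COMPOSITES.items()
    }
```

Because each composite is a plain mean, one weak detector in a four-input composite can lower that composite by at most 0.25.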

overall_finetuning_health is a weighted mean:

weights = [
1.4, # method_integrity
1.2, # distillation_integrity
1.3, # checkpoint_eval_lineage
1.2, # artifact_trust_surface
1.2, # privacy_recordkeeping
1.3, # reproducibility_score
1.3, # data_integrity_score
1.1, # safety_governance_score
1.0, # memory_strategy
0.9, # cost_tracking
0.8 # prompt_format_inconsistency
]
overall_finetuning_health = sum(score_i * weight_i) / sum(weight_i)
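
The weighted mean above can be sketched as follows. The dictionary keys reuse the comment labels from the weights list; how each label maps to an emitted score key is an assumption, and `WEIGHTS` and `overall_finetuning_health` are illustrative names.

```python
# Weights copied from the list above, keyed by the label in each comment.
# Assumption: each label corresponds to the matching score for that detector or composite.
WEIGHTS = {
    "method_integrity": 1.4,
    "distillation_integrity": 1.2,
    "checkpoint_eval_lineage": 1.3,
    "artifact_trust_surface": 1.2,
    "privacy_recordkeeping": 1.2,
    "reproducibility_score": 1.3,
    "data_integrity_score": 1.3,
    "safety_governance_score": 1.1,
    "memory_strategy": 1.0,
    "cost_tracking": 0.9,
    "prompt_format_inconsistency": 0.8,
}


def overall_finetuning_health(scores: dict[str, float]) -> float:
    """Weighted mean: sum(score_i * weight_i) / sum(weight_i)."""
    weighted_sum = sum(scores[name] * weight for name, weight in WEIGHTS.items())
    return weighted_sum / sum(WEIGHTS.values())
```

The weights sum to 12.7, so a score of 0 on `method_integrity` (weight 1.4) costs more overall health than a score of 0 on `prompt_format_inconsistency` (weight 0.8), all else equal.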

Optional config gates:

  • require_base_pinning=true and base_model_versioning_score < 0.8 => Critical
  • require_eval_harness=true and eval_absence_score < 0.8 => Critical
  • require_full_determinism=true and determinism_envelope_score < 0.8 => Critical
  • require_preference_eval=true and profile in {"dpo","ppo","rft","grpo","rloo"} and preference_pipeline_integrity_score < 0.8 => Critical
  • require_checkpoint_eval_lineage=true and checkpoint_eval_lineage_score < 0.8 => Critical
  • require_safe_serialization=true and artifact_trust_surface_score < 0.85 => Critical
  • privacy_profile in {"dp", "strict"} and privacy_recordkeeping_score < 0.8 => Critical
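
The gates above can be read as a small rule table. The sketch below assumes the configuration is a flat dictionary and returns the score keys whose gate trips; `critical_gates` and the constant names are hypothetical, and how the tool attaches the Critical severity to a finding is not shown here.

```python
# (config flag, score key, threshold) for the simple boolean gates listed above.
GATES = [
    ("require_base_pinning", "base_model_versioning_score", 0.8),
    ("require_eval_harness", "eval_absence_score", 0.8),
    ("require_full_determinism", "determinism_envelope_score", 0.8),
    ("require_checkpoint_eval_lineage", "checkpoint_eval_lineage_score", 0.8),
    ("require_safe_serialization", "artifact_trust_surface_score", 0.85),
]
PREFERENCE_PROFILES = {"dpo", "ppo", "rft", "grpo", "rloo"}
PRIVACY_PROFILES = {"dp", "strict"}


def critical_gates(config: dict, scores: dict[str, float]) -> list[str]:
    """Return the score keys that a config gate would escalate to Critical.

    Assumption: config is a flat dict; unset flags are treated as false.
    """
    hits = [
        key for flag, key, threshold in GATES
        if config.get(flag) and scores[key] < threshold
    ]
    # The preference gate also depends on the training profile.
    if (
        config.get("require_preference_eval")
        and config.get("profile") in PREFERENCE_PROFILES
        and scores["preference_pipeline_integrity_score"] < 0.8
    ):
        hits.append("preference_pipeline_integrity_score")
    # The privacy gate is keyed on privacy_profile rather than a boolean flag.
    if (
        config.get("privacy_profile") in PRIVACY_PROFILES
        and scores["privacy_recordkeeping_score"] < 0.8
    ):
        hits.append("privacy_recordkeeping_score")
    return hits
```

Note that `require_safe_serialization` uses a stricter threshold (0.85) than the other gates (0.8), so a score of 0.82 passes base pinning but fails safe serialization.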

Profile-driven eval escalation:

  • profile="dpo" and eval_maturity_level < 3 and eval_absence_score < 1.0 => eval_absence severity escalates to Critical
  • profile in {"ppo","rft","grpo","rloo"} and safety_eval_present < 0.5 and eval_absence_score < 1.0 => eval_absence severity escalates to Critical
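
A minimal sketch of the two escalation rules above, assuming `safety_eval_present` is a numeric signal compared against 0.5 as written. The function name and signature are illustrative only.

```python
RL_PROFILES = {"ppo", "rft", "grpo", "rloo"}


def eval_absence_escalates(
    profile: str,
    eval_maturity_level: int,
    safety_eval_present: float,
    eval_absence_score: float,
) -> bool:
    """Return True when the eval_absence severity escalates to Critical."""
    # Both rules require an imperfect eval_absence_score.
    if eval_absence_score >= 1.0:
        return False
    if profile == "dpo":
        return eval_maturity_level < 3
    if profile in RL_PROFILES:
        return safety_eval_present < 0.5
    # Other profiles are not covered by the rules above.
    return False
```

For example, a `dpo` run with `eval_maturity_level` 2 and `eval_absence_score` 0.9 escalates, while the same run with a perfect `eval_absence_score` of 1.0 does not.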

Detector score keys (0..1, higher is better)

Metric Key
finetuning_architecture.base_model_versioning_score
finetuning_architecture.run_lineage_score
finetuning_architecture.eval_absence_score
finetuning_architecture.dataset_contamination_score
finetuning_architecture.checkpoint_management_score
finetuning_architecture.adapter_isolation_score
finetuning_architecture.model_artifact_access_score
finetuning_architecture.chat_template_score
finetuning_architecture.oom_risk_score
finetuning_architecture.resume_safety_score
finetuning_architecture.cost_tracking_score
finetuning_architecture.artifact_metadata_score
finetuning_architecture.prompt_format_inconsistency_score
finetuning_architecture.determinism_envelope_score
finetuning_architecture.preference_pipeline_integrity_score
finetuning_architecture.loss_template_alignment_score
finetuning_architecture.memory_strategy_score
finetuning_architecture.data_provenance_score
finetuning_architecture.method_integrity_score
finetuning_architecture.distillation_integrity_score
finetuning_architecture.checkpoint_eval_lineage_score
finetuning_architecture.artifact_trust_surface_score
finetuning_architecture.privacy_recordkeeping_score
Metric Key                                          Range / Type   Direction
finetuning_architecture.eval_maturity_level         0..4           Higher is better
finetuning_architecture.reproducibility_score       0..1           Higher is better
finetuning_architecture.data_integrity_score        0..1           Higher is better
finetuning_architecture.safety_governance_score     0..1           Higher is better
finetuning_architecture.overall_finetuning_health   0..1           Higher is better
Metric Key                                                   Range / Type   Direction
finetuning_architecture.pipeline_dag_depth                   Number         Informational
finetuning_architecture.pipeline_cycle_count                 Number         Informational
finetuning_architecture.pipeline_completeness_score          0..1           Higher is better
finetuning_architecture.training_files_with_gpu_count        Number         Informational
finetuning_architecture.training_files_with_database_count   Number         Informational
finetuning_architecture.training_files_with_storage_count    Number         Informational
  • Findings are generated from detector evidence and include rule_id, severity, and code-span evidence.
  • A detector with no evidence produces no finding, even if its score is low.

This contract is documented against metric version 2.0.0.