Scoring and Keys
This page documents scoring behavior and emitted keys for `finetuning_architecture`.
Version 2.0.0 Contract
All detector and composite scores are normalized to 0..1 (higher is better), except `eval_maturity_level`, which ranges over 0..4.
Scoring Formulas
Composite Metrics (0..1)
- `reproducibility_score = mean(base_model_versioning_score, run_lineage_score, determinism_envelope_score, checkpoint_eval_lineage_score)`
- `data_integrity_score = mean(dataset_contamination_score, eval_absence_score, loss_template_alignment_score, distillation_integrity_score)`
- `safety_governance_score = mean(model_artifact_access_score, artifact_trust_surface_score, privacy_recordkeeping_score, data_provenance_score, adapter_isolation_score)`

`overall_finetuning_health` is a weighted mean:
```
weights = [
    1.4,  # method_integrity
    1.2,  # distillation_integrity
    1.3,  # checkpoint_eval_lineage
    1.2,  # artifact_trust_surface
    1.2,  # privacy_recordkeeping
    1.3,  # reproducibility_score
    1.3,  # data_integrity_score
    1.1,  # safety_governance_score
    1.0,  # memory_strategy
    0.9,  # cost_tracking
    0.8,  # prompt_format_inconsistency
]
overall_finetuning_health = sum(score_i * weight_i) / sum(weight_i)
```
Severity Escalation Gates
Optional config gates:
- `require_base_pinning=true` and `base_model_versioning_score < 0.8` => `Critical`
- `require_eval_harness=true` and `eval_absence_score < 0.8` => `Critical`
- `require_full_determinism=true` and `determinism_envelope_score < 0.8` => `Critical`
- `require_preference_eval=true` and `profile in {"dpo","ppo","rft","grpo","rloo"}` and `preference_pipeline_integrity_score < 0.8` => `Critical`
- `require_checkpoint_eval_lineage=true` and `checkpoint_eval_lineage_score < 0.8` => `Critical`
- `require_safe_serialization=true` and `artifact_trust_surface_score < 0.85` => `Critical`
- `privacy_profile in {"dp", "strict"}` and `privacy_recordkeeping_score < 0.8` => `Critical`
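The config gates listed above can be read as a small lookup table plus two special cases. The following is a minimal sketch, not the tool's implementation: the function name `critical_gates`, the shape of the `config` and `scores` dictionaries, and the choice to skip a gate when its score is missing are all assumptions made for illustration. The flags, score names, and thresholds are copied from the list.

```python
from typing import Dict, List

# (config flag, score key, threshold) transcribed from the documented gates.
SCORE_GATES = [
    ("require_base_pinning", "base_model_versioning_score", 0.8),
    ("require_eval_harness", "eval_absence_score", 0.8),
    ("require_full_determinism", "determinism_envelope_score", 0.8),
    ("require_checkpoint_eval_lineage", "checkpoint_eval_lineage_score", 0.8),
    ("require_safe_serialization", "artifact_trust_surface_score", 0.85),
]
PREFERENCE_PROFILES = {"dpo", "ppo", "rft", "grpo", "rloo"}
PRIVACY_PROFILES = {"dp", "strict"}


def _below(scores: Dict[str, float], key: str, threshold: float) -> bool:
    # Assumption: a missing score never fires a gate.
    value = scores.get(key)
    return value is not None and value < threshold


def critical_gates(config: dict, scores: Dict[str, float]) -> List[str]:
    """Return the score keys whose gate condition holds (hypothetical helper)."""
    fired = []
    for flag, key, threshold in SCORE_GATES:
        if config.get(flag) is True and _below(scores, key, threshold):
            fired.append(key)
    # The preference gate also depends on the training profile.
    if (config.get("require_preference_eval") is True
            and config.get("profile") in PREFERENCE_PROFILES
            and _below(scores, "preference_pipeline_integrity_score", 0.8)):
        fired.append("preference_pipeline_integrity_score")
    # The privacy gate is keyed on privacy_profile rather than a require_* flag.
    if (config.get("privacy_profile") in PRIVACY_PROFILES
            and _below(scores, "privacy_recordkeeping_score", 0.8)):
        fired.append("privacy_recordkeeping_score")
    return fired
```

Note that the comparisons are strict: a score exactly at the threshold (for example `artifact_trust_surface_score == 0.85`) does not fire its gate.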
Profile-driven eval escalation:
- `profile="dpo"` and `eval_maturity_level < 3` and `eval_absence_score < 1.0` => `eval_absence` severity escalates to `Critical`
- `profile in {"ppo","rft","grpo","rloo"}` and `safety_eval_present < 0.5` and `eval_absence_score < 1.0` => `eval_absence` severity escalates to `Critical`
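The two profile-driven rules share the `eval_absence_score < 1.0` condition and differ only in the second check. A minimal sketch of that decision follows; the function name and its flat argument list are hypothetical, and `safety_eval_present` is treated as a number because the rule compares it against 0.5.

```python
RL_PROFILES = {"ppo", "rft", "grpo", "rloo"}


def eval_absence_escalates(profile: str,
                           eval_maturity_level: int,
                           safety_eval_present: float,
                           eval_absence_score: float) -> bool:
    """Return True when eval_absence would escalate to Critical (hypothetical helper)."""
    # Both rules require an imperfect eval_absence_score.
    if eval_absence_score >= 1.0:
        return False
    if profile == "dpo":
        return eval_maturity_level < 3
    if profile in RL_PROFILES:
        return safety_eval_present < 0.5
    # Any other profile has no profile-driven escalation in this contract.
    return False
```

For example, `eval_absence_escalates("dpo", 2, 0.0, 0.9)` is true under this sketch, while the same call with `eval_maturity_level=3` is not.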
Emitted Key Contract
Detector score keys (0..1, higher is better)

| Metric Key |
|---|
| `finetuning_architecture.base_model_versioning_score` |
| `finetuning_architecture.run_lineage_score` |
| `finetuning_architecture.eval_absence_score` |
| `finetuning_architecture.dataset_contamination_score` |
| `finetuning_architecture.checkpoint_management_score` |
| `finetuning_architecture.adapter_isolation_score` |
| `finetuning_architecture.model_artifact_access_score` |
| `finetuning_architecture.chat_template_score` |
| `finetuning_architecture.oom_risk_score` |
| `finetuning_architecture.resume_safety_score` |
| `finetuning_architecture.cost_tracking_score` |
| `finetuning_architecture.artifact_metadata_score` |
| `finetuning_architecture.prompt_format_inconsistency_score` |
| `finetuning_architecture.determinism_envelope_score` |
| `finetuning_architecture.preference_pipeline_integrity_score` |
| `finetuning_architecture.loss_template_alignment_score` |
| `finetuning_architecture.memory_strategy_score` |
| `finetuning_architecture.data_provenance_score` |
| `finetuning_architecture.method_integrity_score` |
| `finetuning_architecture.distillation_integrity_score` |
| `finetuning_architecture.checkpoint_eval_lineage_score` |
| `finetuning_architecture.artifact_trust_surface_score` |
| `finetuning_architecture.privacy_recordkeeping_score` |

Composite and evaluation keys

| Metric Key | Range / Type | Direction |
|---|---|---|
| `finetuning_architecture.eval_maturity_level` | 0..4 | Higher is better |
| `finetuning_architecture.reproducibility_score` | 0..1 | Higher is better |
| `finetuning_architecture.data_integrity_score` | 0..1 | Higher is better |
| `finetuning_architecture.safety_governance_score` | 0..1 | Higher is better |
| `finetuning_architecture.overall_finetuning_health` | 0..1 | Higher is better |

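The composite keys in this table are derived from detector scores using the formulas under Scoring Formulas. The sketch below works through that arithmetic. It assumes each weight label (for example `method_integrity`) refers to the matching `*_score` key without the `finetuning_architecture.` prefix; that mapping, the function name `composite_scores`, and the input dictionary shape are illustrative assumptions, not confirmed behavior.

```python
from typing import Dict, List

REPRODUCIBILITY_INPUTS = [
    "base_model_versioning_score", "run_lineage_score",
    "determinism_envelope_score", "checkpoint_eval_lineage_score",
]
DATA_INTEGRITY_INPUTS = [
    "dataset_contamination_score", "eval_absence_score",
    "loss_template_alignment_score", "distillation_integrity_score",
]
SAFETY_GOVERNANCE_INPUTS = [
    "model_artifact_access_score", "artifact_trust_surface_score",
    "privacy_recordkeeping_score", "data_provenance_score",
    "adapter_isolation_score",
]
# Weights from the documented list; the key names are an assumed mapping.
HEALTH_WEIGHTS = {
    "method_integrity_score": 1.4,
    "distillation_integrity_score": 1.2,
    "checkpoint_eval_lineage_score": 1.3,
    "artifact_trust_surface_score": 1.2,
    "privacy_recordkeeping_score": 1.2,
    "reproducibility_score": 1.3,
    "data_integrity_score": 1.3,
    "safety_governance_score": 1.1,
    "memory_strategy_score": 1.0,
    "cost_tracking_score": 0.9,
    "prompt_format_inconsistency_score": 0.8,
}


def _mean(scores: Dict[str, float], keys: List[str]) -> float:
    return sum(scores[k] for k in keys) / len(keys)


def composite_scores(detector: Dict[str, float]) -> Dict[str, float]:
    """Compute the three composites and the weighted overall health."""
    out = {
        "reproducibility_score": _mean(detector, REPRODUCIBILITY_INPUTS),
        "data_integrity_score": _mean(detector, DATA_INTEGRITY_INPUTS),
        "safety_governance_score": _mean(detector, SAFETY_GOVERNANCE_INPUTS),
    }
    # The weighted mean mixes detector scores with the composites just computed.
    inputs = {**detector, **out}
    weighted = sum(inputs[k] * w for k, w in HEALTH_WEIGHTS.items())
    out["overall_finetuning_health"] = weighted / sum(HEALTH_WEIGHTS.values())
    return out
```

Because every input lies in 0..1 and the weights are positive, the weighted mean also stays in 0..1. The weights sum to 12.7, so under these assumptions a zero `method_integrity_score` with all other inputs at 1.0 gives 11.3 / 12.7, roughly 0.89.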
Pipeline and effect diagnostics

| Metric Key | Range / Type | Direction |
|---|---|---|
| `finetuning_architecture.pipeline_dag_depth` | Number | Informational |
| `finetuning_architecture.pipeline_cycle_count` | Number | Informational |
| `finetuning_architecture.pipeline_completeness_score` | 0..1 | Higher is better |
| `finetuning_architecture.training_files_with_gpu_count` | Number | Informational |
| `finetuning_architecture.training_files_with_database_count` | Number | Informational |
| `finetuning_architecture.training_files_with_storage_count` | Number | Informational |

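A consumer of these keys may want to check emitted values against the documented ranges. The sketch below is one way to do that and is not part of the tool. It relies on a naming observation that holds for the keys listed on this page: every 0..1 key ends in `_score` or is `overall_finetuning_health`. The function name `contract_violations` and the flat `metrics` dictionary are hypothetical.

```python
from typing import Dict, List

PREFIX = "finetuning_architecture."


def contract_violations(metrics: Dict[str, float]) -> List[str]:
    """List keys whose value falls outside the documented range."""
    problems = []
    for key, value in metrics.items():
        if not key.startswith(PREFIX):
            continue  # other namespaces are out of scope for this check
        name = key[len(PREFIX):]
        if name == "eval_maturity_level":
            low, high = 0, 4
        elif name.endswith("_score") or name == "overall_finetuning_health":
            low, high = 0.0, 1.0
        else:
            continue  # informational counts and depths have no documented range
        if not (low <= value <= high):
            problems.append(f"{key}={value} outside {low}..{high}")
    return problems
```

The suffix rule is a convenience for this key set only; an explicit allow-list of keys would be the stricter choice if the contract grows.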
Findings Emission Notes
- Findings are generated from detector evidence and include `rule_id`, `severity`, and code-span evidence.
- No evidence means no finding for that detector, even if its score is low.
Version Note
This contract is documented against metric version 2.0.0.