
Fine-tuning Architecture

The finetuning_architecture metric evaluates architecture quality for LLM fine-tuning systems. It focuses on reproducibility, data/eval integrity, safety/governance, and operational reliability.

Last verified against engine metric version 2.0.0.

Fine-tuning failures are often caused by pipeline architecture gaps, not by model design alone:

  • Unpinned base models and weak lineage break reproducibility.
  • Missing eval harnesses and weak split hygiene allow regressions to ship.
  • Weak checkpoint/resume controls create unstable training recovery.
  • Adapter/metadata/access-control gaps increase governance and deployment risk.
  • Missing budget and OOM controls increase cost and operational failures.
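A minimal illustration of the pinning-and-lineage idea: record the exact base-model revision, a dataset fingerprint, and the seed in a run manifest so a rerun can be verified against identical inputs. This is a hedged sketch, not Arxo's implementation; all names (`run_manifest`, `org/base-7b`) are hypothetical.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Hash a dataset's serialized records so reruns can verify identical inputs."""
    h = hashlib.sha256()
    for rec in records:
        h.update(json.dumps(rec, sort_keys=True).encode("utf-8"))
    return h.hexdigest()

def run_manifest(base_model, revision, records, seed):
    """Capture the minimum lineage needed to reproduce a fine-tuning run."""
    return {
        "base_model": base_model,          # e.g. a model-hub ID
        "revision": revision,              # an exact commit/revision, never "latest"
        "dataset_sha256": dataset_fingerprint(records),
        "seed": seed,
    }

data = [{"prompt": "hi", "completion": "hello"}]
m1 = run_manifest("org/base-7b", "abc123", data, seed=42)
m2 = run_manifest("org/base-7b", "abc123", data, seed=42)
```

Two manifests built from the same pinned inputs compare equal, which is exactly the property an unpinned `"latest"` revision breaks.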

finetuning_architecture surfaces these risks as detector scores plus evidence-backed findings.

  • Reproducibility gaps: base-model pinning, lineage capture, determinism envelope, checkpoint-eval linkage.
  • Data and evaluation integrity gaps: missing eval harness, contamination risk, prompt/template-loss mismatch, weak preference/distillation controls.
  • Safety and governance gaps: artifact access checks, unsafe serialization/trust surfaces, privacy recordkeeping, metadata and provenance gaps.
  • Operational risk gaps: OOM controls, checkpoint management, resume safety, and cost tracking.

The metric does not assess:

  • Runtime model quality or benchmark accuracy.
  • End-to-end privacy compliance proof.
  • Security guarantees for infrastructure outside scanned repositories.
  • Dynamic behavior hidden behind external services without repository evidence.
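As one concrete illustration of a contamination check, an exact-match overlap probe between training and eval splits might look like the following sketch (real detectors typically add n-gram and fuzzy matching; the helper names are illustrative, not Arxo's detector):

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially reformatted text still matches."""
    return " ".join(text.lower().split())

def contamination_overlap(train_texts, eval_texts):
    """Fraction of eval examples whose normalized text also appears in training data."""
    train_set = {normalize(t) for t in train_texts}
    hits = sum(1 for t in eval_texts if normalize(t) in train_set)
    return hits / len(eval_texts) if eval_texts else 0.0

train = ["The quick brown fox", "alpha beta"]
evals = ["the quick  brown fox", "gamma delta"]
overlap = contamination_overlap(train, evals)
```

A nonzero overlap means eval examples leaked into training data, so eval scores overstate generalization.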

Use this metric as an architectural risk signal, then validate critical paths with runtime tests and governance review.

How it works:

  • Discovers likely fine-tuning files using file-extension and framework-pattern anchors.
  • Uses semantic indexing and targeted parsing where applicable to reduce noisy matches.
  • Emits normalized metric scores and actionable findings with file/line evidence when available.
  • Requires analysis data that includes call-graph and effect-index enrichment.
  • Scans code/config candidates from common training extensions (.py, .ipynb, .ts, .js, .rs, .yml, .yaml, .json, .toml).
  • Scans dataset-like files (.jsonl, .json, .parquet, .csv, .arrow) with bounded sampling for content-based checks.
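The extension-based discovery step can be sketched as a simple classifier over repository paths (illustrative only; the engine's actual matching also uses framework patterns and semantic indexing). Note that `.json` appears in both extension sets above, so such files are candidates for both passes:

```python
from pathlib import PurePosixPath

# Extension sets taken from the lists above.
CODE_EXTS = {".py", ".ipynb", ".ts", ".js", ".rs", ".yml", ".yaml", ".json", ".toml"}
DATA_EXTS = {".jsonl", ".json", ".parquet", ".csv", ".arrow"}

def classify_candidates(paths):
    """Split repository paths into code/config and dataset-like candidates.

    A path whose extension appears in both sets (.json) lands in both lists.
    """
    code, data = [], []
    for p in paths:
        ext = PurePosixPath(p).suffix.lower()
        if ext in CODE_EXTS:
            code.append(p)
        if ext in DATA_EXTS:
            data.append(p)
    return code, data

paths = ["train.py", "data/corpus.jsonl", "README.md", "config.yaml", "records.json"]
code, data = classify_candidates(paths)
```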

Detector score keys:

  • finetuning_architecture.base_model_versioning_score
  • finetuning_architecture.run_lineage_score
  • finetuning_architecture.eval_absence_score
  • finetuning_architecture.eval_maturity_level
  • finetuning_architecture.dataset_contamination_score
  • finetuning_architecture.chat_template_score
  • finetuning_architecture.checkpoint_management_score
  • finetuning_architecture.resume_safety_score
  • finetuning_architecture.adapter_isolation_score
  • finetuning_architecture.model_artifact_access_score
  • finetuning_architecture.artifact_trust_surface_score
  • finetuning_architecture.privacy_recordkeeping_score
  • finetuning_architecture.method_integrity_score
  • finetuning_architecture.distillation_integrity_score
  • finetuning_architecture.checkpoint_eval_lineage_score
  • finetuning_architecture.prompt_format_inconsistency_score
  • finetuning_architecture.oom_risk_score
  • finetuning_architecture.cost_tracking_score
  • finetuning_architecture.artifact_metadata_score

Composite and pipeline outputs:

| Metric Key | Range / Type | Direction |
| --- | --- | --- |
| finetuning_architecture.reproducibility_score | 0..1 | Higher is better |
| finetuning_architecture.data_integrity_score | 0..1 | Higher is better |
| finetuning_architecture.safety_governance_score | 0..1 | Higher is better |
| finetuning_architecture.overall_finetuning_health | 0..1 | Higher is better |
| finetuning_architecture.pipeline_dag_depth | Number | Informational |
| finetuning_architecture.pipeline_cycle_count | Number | Informational |
| finetuning_architecture.pipeline_completeness_score | 0..1 | Higher is better |
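pipeline_dag_depth and pipeline_cycle_count describe the shape of the detected training pipeline graph. One plausible way to compute them from an adjacency list, shown as a sketch (the engine's exact definitions may differ; see Scoring and Keys for the contract):

```python
def dag_depth_and_cycles(graph):
    """Longest downstream path length (depth) and count of nodes reached by a
    back edge (a simple cycle signal) for a graph given as {node: [downstream]}.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    depth = {}
    cycle_nodes = set()

    def visit(n):
        color[n] = GRAY
        best = 0
        for m in graph.get(n, []):
            if color.get(m, WHITE) == GRAY:   # back edge -> cycle
                cycle_nodes.add(m)
                continue
            if color.get(m, WHITE) == WHITE:
                visit(m)
            best = max(best, depth.get(m, 0) + 1)
        color[n] = BLACK
        depth[n] = best

    for n in list(graph):
        if color[n] == WHITE:
            visit(n)
    return (max(depth.values()) if depth else 0), len(cycle_nodes)

depth, cycles = dag_depth_and_cycles({"ingest": ["train"], "train": ["eval"], "eval": []})
```

A linear ingest → train → eval pipeline has depth 2 and no cycles; any cycle indicates a malformed pipeline definition.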

For the full key contract and formulas, see Scoring and Keys.

Findings are emitted when evidence is available, including:

  • rule_id for automation and policy correlation
  • severity for triage priority
  • evidence with code span (path, line) where possible
  • recommendation, impact, and effort to guide remediation
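This finding shape lends itself to straightforward triage tooling. A hedged sketch of filtering and ordering findings by severity (the `Finding` type and severity scale are assumptions for illustration, not Arxo's schema):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    rule_id: str
    severity: str          # assumed scale: "low" | "medium" | "high" | "critical"
    evidence: str          # "path:line" code span when available
    recommendation: str

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def triage(findings, minimum="high"):
    """Return findings at or above a severity floor, worst first."""
    floor = SEVERITY_RANK[minimum]
    kept = [f for f in findings if SEVERITY_RANK[f.severity] >= floor]
    return sorted(kept, key=lambda f: SEVERITY_RANK[f.severity], reverse=True)

findings = [
    Finding("FT001", "medium", "train.py:10", "Pin the base model revision"),
    Finding("FT002", "critical", "load.py:3", "Avoid unsafe serialization"),
    Finding("FT003", "high", "eval.py:1", "Add an eval harness"),
]
urgent = triage(findings)
```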
Example configuration:

metrics:
  - id: finetuning_architecture
    enabled: true
    config:
      profile: "sft"                  # "sft" | "dpo" | "ppo" | "rft" | "grpo" | "rloo" | "distill"
      require_eval_harness: true
      require_base_pinning: true
      require_full_determinism: false
      require_preference_eval: false
      require_checkpoint_eval_lineage: true
      require_safe_serialization: true
      privacy_profile: "strict"       # "none" | "dp" | "recordkeeping" | "strict"
      large_sequence_threshold: 2048
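The enum-valued fields accept a fixed set of values. A quick sanity check over a parsed config dict might look like this sketch (the validation helper is hypothetical, not part of the engine):

```python
PROFILES = {"sft", "dpo", "ppo", "rft", "grpo", "rloo", "distill"}
PRIVACY_PROFILES = {"none", "dp", "recordkeeping", "strict"}

def validate_config(cfg):
    """Collect human-readable errors for out-of-range config fields."""
    errors = []
    if cfg.get("profile") not in PROFILES:
        errors.append(f"profile must be one of {sorted(PROFILES)}")
    if cfg.get("privacy_profile") not in PRIVACY_PROFILES:
        errors.append(f"privacy_profile must be one of {sorted(PRIVACY_PROFILES)}")
    if not isinstance(cfg.get("large_sequence_threshold", 0), int):
        errors.append("large_sequence_threshold must be an integer")
    return errors

ok = validate_config({"profile": "sft", "privacy_profile": "strict",
                      "large_sequence_threshold": 2048})
bad = validate_config({"profile": "sfft", "privacy_profile": "strict"})
```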
Example CI policy:

metrics:
  - id: finetuning_architecture
    policy:
      invariants:
        - metric: finetuning_architecture.overall_finetuning_health
          op: ">="
          value: 0.70
          message: "Overall fine-tuning architecture health baseline not met"
        - metric: finetuning_architecture.base_model_versioning_score
          op: ">="
          value: 0.80
          message: "Base model and tokenizer pinning baseline not met"
        - metric: finetuning_architecture.eval_absence_score
          op: ">="
          value: 0.75
          message: "Eval harness and maturity baseline not met"
        - metric: finetuning_architecture.dataset_contamination_score
          op: ">="
          value: 0.75
          message: "Dataset contamination controls are insufficient"
        - metric: finetuning_architecture.resume_safety_score
          op: ">="
          value: 0.80
          message: "Checkpoint resume safety baseline not met"
        - metric: finetuning_architecture.checkpoint_eval_lineage_score
          op: ">="
          value: 0.80
          message: "Checkpoint-eval lineage baseline not met"
        - metric: finetuning_architecture.artifact_trust_surface_score
          op: ">="
          value: 0.85
          message: "Artifact trust/safe serialization baseline not met"
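Gates of this shape reduce to comparing reported metric values against thresholds. A sketch of how such invariants could be evaluated in CI tooling (illustrative; not Arxo's gate runner):

```python
import operator

OPS = {">=": operator.ge, ">": operator.gt,
       "<=": operator.le, "<": operator.lt, "==": operator.eq}

def check_invariants(metrics, invariants):
    """Return the failure messages for invariants that do not hold.

    A missing metric is treated as a failure rather than a silent pass.
    """
    failures = []
    for inv in invariants:
        value = metrics.get(inv["metric"])
        if value is None or not OPS[inv["op"]](value, inv["value"]):
            failures.append(inv["message"])
    return failures

metrics = {"finetuning_architecture.overall_finetuning_health": 0.65}
invariants = [{
    "metric": "finetuning_architecture.overall_finetuning_health",
    "op": ">=", "value": 0.70,
    "message": "Overall fine-tuning architecture health baseline not met",
}]
failures = check_invariants(metrics, invariants)
```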

For staged rollout profiles, see Policy and CI Gates.

  • Documentation route: /metrics/finetuning-architecture
  • Stable metric ID: finetuning_architecture