
Fine-tuning Architecture

The finetuning_architecture metric evaluates architecture quality for LLM fine-tuning systems. It focuses on reproducibility, data/eval integrity, safety/governance, and operational reliability.

Last verified against engine metric version 2.0.0.

Fine-tuning failures are often caused by pipeline architecture gaps, not by model design alone:

  • Unpinned base models and weak lineage break reproducibility.
  • Missing eval harnesses and weak split hygiene allow regressions to ship.
  • Weak checkpoint/resume controls create unstable training recovery.
  • Adapter/metadata/access-control gaps increase governance and deployment risk.
  • Missing budget and OOM controls increase cost and operational failures.
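A minimal illustration of the pinning-and-lineage idea: record the exact base-model revision, a dataset fingerprint, and the seed in a run manifest so a rerun can be verified against identical inputs. This is a hedged sketch, not Arxo's implementation; all names (`run_manifest`, `org/base-7b`) are hypothetical.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Hash a dataset's serialized records so reruns can verify identical inputs."""
    h = hashlib.sha256()
    for rec in records:
        h.update(json.dumps(rec, sort_keys=True).encode("utf-8"))
    return h.hexdigest()

def run_manifest(base_model, revision, records, seed):
    """Capture the minimum lineage needed to reproduce a fine-tuning run."""
    return {
        "base_model": base_model,          # e.g. a model-hub ID
        "revision": revision,              # an exact commit/revision, never "latest"
        "dataset_sha256": dataset_fingerprint(records),
        "seed": seed,
    }

data = [{"prompt": "hi", "completion": "hello"}]
m1 = run_manifest("org/base-7b", "abc123", data, seed=42)
m2 = run_manifest("org/base-7b", "abc123", data, seed=42)
```

Two manifests built from the same pinned inputs compare equal, which is exactly the property an unpinned `"latest"` revision breaks.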

finetuning_architecture surfaces these risks as detector scores plus evidence-backed findings.

  • Reproducibility gaps: base-model pinning, lineage capture, determinism envelope, checkpoint-eval linkage.
  • Data and evaluation integrity gaps: missing eval harness, contamination risk, prompt/template-loss mismatch, weak preference/distillation controls.
  • Safety and governance gaps: artifact access checks, unsafe serialization/trust surfaces, privacy recordkeeping, metadata and provenance gaps.
  • Operational risk gaps: OOM controls, checkpoint management, resume safety, and cost tracking.

The metric does not assess:

  • Runtime model quality or benchmark accuracy.
  • End-to-end privacy compliance proof.
  • Security guarantees for infrastructure outside scanned repositories.
  • Dynamic behavior hidden behind external services without repository evidence.
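As one concrete illustration of a contamination check, an exact-match overlap probe between training and eval splits might look like the following sketch (real detectors typically add n-gram and fuzzy matching; the helper names are illustrative, not Arxo's detector):

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially reformatted text still matches."""
    return " ".join(text.lower().split())

def contamination_overlap(train_texts, eval_texts):
    """Fraction of eval examples whose normalized text also appears in training data."""
    train_set = {normalize(t) for t in train_texts}
    hits = sum(1 for t in eval_texts if normalize(t) in train_set)
    return hits / len(eval_texts) if eval_texts else 0.0

train = ["The quick brown fox", "alpha beta"]
evals = ["the quick  brown fox", "gamma delta"]
overlap = contamination_overlap(train, evals)
```

A nonzero overlap means eval examples leaked into training data, so eval scores overstate generalization.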

Use this metric as an architectural risk signal, then validate critical paths with runtime tests and governance review.

How it works:

  • Discovers likely fine-tuning files using file-extension and framework-pattern anchors.
  • Uses semantic indexing and targeted parsing where applicable to reduce noisy matches.
  • Emits normalized metric scores and actionable findings with file/line evidence when available.
  • Requires analysis data that includes call-graph and effect-index enrichment.
  • Scans code/config candidates from common training extensions (.py, .ipynb, .ts, .js, .rs, .yml, .yaml, .json, .toml).
  • Scans dataset-like files (.jsonl, .json, .parquet, .csv, .arrow) with bounded sampling for content-based checks.
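The extension-based discovery step can be sketched as a simple classifier over repository paths (illustrative only; the engine's actual matching also uses framework patterns and semantic indexing). Note that `.json` appears in both extension sets above, so such files are candidates for both passes:

```python
from pathlib import PurePosixPath

# Extension sets taken from the lists above.
CODE_EXTS = {".py", ".ipynb", ".ts", ".js", ".rs", ".yml", ".yaml", ".json", ".toml"}
DATA_EXTS = {".jsonl", ".json", ".parquet", ".csv", ".arrow"}

def classify_candidates(paths):
    """Split repository paths into code/config and dataset-like candidates.

    A path whose extension appears in both sets (.json) lands in both lists.
    """
    code, data = [], []
    for p in paths:
        ext = PurePosixPath(p).suffix.lower()
        if ext in CODE_EXTS:
            code.append(p)
        if ext in DATA_EXTS:
            data.append(p)
    return code, data

paths = ["train.py", "data/corpus.jsonl", "README.md", "config.yaml", "records.json"]
code, data = classify_candidates(paths)
```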

Detector score keys:

  • finetuning_architecture.base_model_versioning_score
  • finetuning_architecture.run_lineage_score
  • finetuning_architecture.eval_absence_score
  • finetuning_architecture.eval_maturity_level
  • finetuning_architecture.dataset_contamination_score
  • finetuning_architecture.chat_template_score
  • finetuning_architecture.checkpoint_management_score
  • finetuning_architecture.resume_safety_score
  • finetuning_architecture.adapter_isolation_score
  • finetuning_architecture.model_artifact_access_score
  • finetuning_architecture.artifact_trust_surface_score
  • finetuning_architecture.privacy_recordkeeping_score
  • finetuning_architecture.method_integrity_score
  • finetuning_architecture.distillation_integrity_score
  • finetuning_architecture.checkpoint_eval_lineage_score
  • finetuning_architecture.prompt_format_inconsistency_score
  • finetuning_architecture.oom_risk_score
  • finetuning_architecture.cost_tracking_score
  • finetuning_architecture.artifact_metadata_score

Composite and pipeline outputs:

| Metric Key | Range / Type | Direction |
| --- | --- | --- |
| finetuning_architecture.reproducibility_score | 0..1 | Higher is better |
| finetuning_architecture.data_integrity_score | 0..1 | Higher is better |
| finetuning_architecture.safety_governance_score | 0..1 | Higher is better |
| finetuning_architecture.overall_finetuning_health | 0..1 | Higher is better |
| finetuning_architecture.pipeline_dag_depth | Number | Informational |
| finetuning_architecture.pipeline_cycle_count | Number | Informational |
| finetuning_architecture.pipeline_completeness_score | 0..1 | Higher is better |
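pipeline_dag_depth and pipeline_cycle_count describe the shape of the detected training pipeline graph. One plausible way to compute them from an adjacency list, shown as a sketch (the engine's exact definitions may differ; see Scoring and Keys for the contract):

```python
def dag_depth_and_cycles(graph):
    """Longest downstream path length (depth) and count of nodes reached by a
    back edge (a simple cycle signal) for a graph given as {node: [downstream]}.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    depth = {}
    cycle_nodes = set()

    def visit(n):
        color[n] = GRAY
        best = 0
        for m in graph.get(n, []):
            if color.get(m, WHITE) == GRAY:   # back edge -> cycle
                cycle_nodes.add(m)
                continue
            if color.get(m, WHITE) == WHITE:
                visit(m)
            best = max(best, depth.get(m, 0) + 1)
        color[n] = BLACK
        depth[n] = best

    for n in list(graph):
        if color[n] == WHITE:
            visit(n)
    return (max(depth.values()) if depth else 0), len(cycle_nodes)

depth, cycles = dag_depth_and_cycles({"ingest": ["train"], "train": ["eval"], "eval": []})
```

A linear ingest → train → eval pipeline has depth 2 and no cycles; any cycle indicates a malformed pipeline definition.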

For the full key contract and formulas, see Scoring and Keys.

Findings are emitted when evidence is available, including:

  • rule_id for automation and policy correlation
  • severity for triage priority
  • evidence with code span (path, line) where possible
  • recommendation, impact, and effort to guide remediation
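This finding shape lends itself to straightforward triage tooling. A hedged sketch of filtering and ordering findings by severity (the `Finding` type and severity scale are assumptions for illustration, not Arxo's schema):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    rule_id: str
    severity: str          # assumed scale: "low" | "medium" | "high" | "critical"
    evidence: str          # "path:line" code span when available
    recommendation: str

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def triage(findings, minimum="high"):
    """Return findings at or above a severity floor, worst first."""
    floor = SEVERITY_RANK[minimum]
    kept = [f for f in findings if SEVERITY_RANK[f.severity] >= floor]
    return sorted(kept, key=lambda f: SEVERITY_RANK[f.severity], reverse=True)

findings = [
    Finding("FT001", "medium", "train.py:10", "Pin the base model revision"),
    Finding("FT002", "critical", "load.py:3", "Avoid unsafe serialization"),
    Finding("FT003", "high", "eval.py:1", "Add an eval harness"),
]
urgent = triage(findings)
```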
Example configuration:

metrics:
  - id: finetuning_architecture
    enabled: true
    config:
      profile: "sft"                  # "sft" | "dpo" | "ppo" | "rft" | "grpo" | "rloo" | "distill"
      require_eval_harness: true
      require_base_pinning: true
      require_full_determinism: false
      require_preference_eval: false
      require_checkpoint_eval_lineage: true
      require_safe_serialization: true
      privacy_profile: "strict"       # "none" | "dp" | "recordkeeping" | "strict"
      large_sequence_threshold: 2048
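The enum-valued fields accept a fixed set of values. A quick sanity check over a parsed config dict might look like this sketch (the validation helper is hypothetical, not part of the engine):

```python
PROFILES = {"sft", "dpo", "ppo", "rft", "grpo", "rloo", "distill"}
PRIVACY_PROFILES = {"none", "dp", "recordkeeping", "strict"}

def validate_config(cfg):
    """Collect human-readable errors for out-of-range config fields."""
    errors = []
    if cfg.get("profile") not in PROFILES:
        errors.append(f"profile must be one of {sorted(PROFILES)}")
    if cfg.get("privacy_profile") not in PRIVACY_PROFILES:
        errors.append(f"privacy_profile must be one of {sorted(PRIVACY_PROFILES)}")
    if not isinstance(cfg.get("large_sequence_threshold", 0), int):
        errors.append("large_sequence_threshold must be an integer")
    return errors

ok = validate_config({"profile": "sft", "privacy_profile": "strict",
                      "large_sequence_threshold": 2048})
bad = validate_config({"profile": "sfft", "privacy_profile": "strict"})
```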
Example CI policy:

metrics:
  - id: finetuning_architecture
    policy:
      invariants:
        - metric: finetuning_architecture.overall_finetuning_health
          op: ">="
          value: 0.70
          message: "Overall fine-tuning architecture health baseline not met"
        - metric: finetuning_architecture.base_model_versioning_score
          op: ">="
          value: 0.80
          message: "Base model and tokenizer pinning baseline not met"
        - metric: finetuning_architecture.eval_absence_score
          op: ">="
          value: 0.75
          message: "Eval harness and maturity baseline not met"
        - metric: finetuning_architecture.dataset_contamination_score
          op: ">="
          value: 0.75
          message: "Dataset contamination controls are insufficient"
        - metric: finetuning_architecture.resume_safety_score
          op: ">="
          value: 0.80
          message: "Checkpoint resume safety baseline not met"
        - metric: finetuning_architecture.checkpoint_eval_lineage_score
          op: ">="
          value: 0.80
          message: "Checkpoint-eval lineage baseline not met"
        - metric: finetuning_architecture.artifact_trust_surface_score
          op: ">="
          value: 0.85
          message: "Artifact trust/safe serialization baseline not met"
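Gates of this shape reduce to comparing reported metric values against thresholds. A sketch of how such invariants could be evaluated in CI tooling (illustrative; not Arxo's gate runner):

```python
import operator

OPS = {">=": operator.ge, ">": operator.gt,
       "<=": operator.le, "<": operator.lt, "==": operator.eq}

def check_invariants(metrics, invariants):
    """Return the failure messages for invariants that do not hold.

    A missing metric is treated as a failure rather than a silent pass.
    """
    failures = []
    for inv in invariants:
        value = metrics.get(inv["metric"])
        if value is None or not OPS[inv["op"]](value, inv["value"]):
            failures.append(inv["message"])
    return failures

metrics = {"finetuning_architecture.overall_finetuning_health": 0.65}
invariants = [{
    "metric": "finetuning_architecture.overall_finetuning_health",
    "op": ">=", "value": 0.70,
    "message": "Overall fine-tuning architecture health baseline not met",
}]
failures = check_invariants(metrics, invariants)
```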

For staged rollout profiles, see Policy and CI Gates.

  • Documentation route: /metrics/finetuning-architecture
  • Stable metric ID: finetuning_architecture