
Examples and Report Walkthrough

This page walks through the bundled finetuning_architecture sample artifacts and shows how to interpret their results.

Bundled local sample outputs:

  • crates/arxo-engine/src/metrics/ai_observability/finetuning_architecture/samples/toy-finetune-workflow-report.json
  • crates/arxo-engine/src/metrics/ai_observability/finetuning_architecture/samples/toy-finetune-workflow-report.md

Sample project and config:

  • crates/arxo-engine/src/metrics/ai_observability/finetuning_architecture/samples/toy-finetune-workflow/
  • crates/arxo-engine/src/metrics/ai_observability/finetuning_architecture/samples/finetuning-architecture-config.yaml

From your project directory, run Arxo with the path to your fine-tuning project and config:

arxo analyze \
--path /path/to/your/finetune-project \
--config finetuning-architecture-config.yaml \
--format json \
--output report.json

In report.json, start with the headline score keys:

  • finetuning_architecture.overall_finetuning_health
  • finetuning_architecture.reproducibility_score
  • finetuning_architecture.data_integrity_score
  • finetuning_architecture.safety_governance_score

These summarize whether the pipeline is broadly healthy before detector-level triage.
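
As a minimal sketch of reading those headline scores (the nesting of keys under a `finetuning_architecture` object and the 0.5 cutoff are assumptions for illustration, not a confirmed Arxo schema or threshold):

```python
import json

# Hypothetical report fragment; the key layout is an assumption,
# not a confirmed Arxo schema.
report = json.loads("""
{
  "finetuning_architecture": {
    "overall_finetuning_health": 0.42,
    "reproducibility_score": 0.55,
    "data_integrity_score": 0.38,
    "safety_governance_score": 0.31
  }
}
""")

scores = report["finetuning_architecture"]
for key in ("overall_finetuning_health", "reproducibility_score",
            "data_integrity_score", "safety_governance_score"):
    flag = "OK" if scores[key] >= 0.5 else "NEEDS ATTENTION"
    print(f"{key}: {scores[key]:.2f} ({flag})")
```

A loop like this is enough for a quick healthy/unhealthy triage before digging into individual detector findings.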

Detector-level checks fall into four broad areas:

  • Reproducibility: base model pinning, run lineage, determinism envelope, checkpoint-eval linkage.
  • Data/eval: eval harness maturity, contamination risk, prompt/template-loss consistency, distillation integrity.
  • Safety/governance: artifact access, trust surface, privacy recordkeeping, provenance.
  • Operations: OOM controls, cost tracking, checkpoint hygiene, resume safety.

Each finding carries a rule_id and CodeSpan evidence, so you can prioritize fixes at concrete files and lines.
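
A sketch of one way to triage such findings, grouping by rule and surfacing the rules with the most code evidence first (the field names `rule_id`, `file`, and `lines` here are assumptions about the finding shape, not a confirmed schema):

```python
from collections import defaultdict

# Hypothetical findings list; field names are illustrative only.
findings = [
    {"rule_id": "eval_absence", "file": "train.py", "lines": "88-95"},
    {"rule_id": "artifact_trust_surface", "file": "load.py", "lines": "12-14"},
    {"rule_id": "eval_absence", "file": "configs/run.yaml", "lines": "3-7"},
]

by_rule = defaultdict(list)
for f in findings:
    by_rule[f["rule_id"]].append((f["file"], f["lines"]))

# Rules with the most concrete evidence spans first.
for rule_id, spans in sorted(by_rule.items(), key=lambda kv: -len(kv[1])):
    print(rule_id)
    for path, lines in spans:
        print(f"  {path}:{lines}")
```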

The report also includes graph and effect metrics:

  • pipeline_dag_depth
  • pipeline_cycle_count
  • pipeline_completeness_score
  • effect counts for GPU/database/storage in training files

These help explain operational topology and missing stages.
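
To make the first two metrics concrete, here is a toy sketch of how DAG depth and cycle counts can be computed over a hand-written pipeline graph; the node names are invented and this is not Arxo's implementation:

```python
# Toy pipeline graph: each stage maps to its downstream stages.
pipeline = {
    "load_data": ["tokenize"],
    "tokenize": ["train"],
    "train": ["evaluate", "checkpoint"],
    "evaluate": [],
    "checkpoint": [],
}

def dag_depth(graph):
    """Length in edges of the longest path, assuming the graph is acyclic."""
    memo = {}
    def depth(node):
        if node not in memo:
            memo[node] = max((depth(n) + 1 for n in graph[node]), default=0)
        return memo[node]
    return max(depth(n) for n in graph)

def count_back_edges(graph):
    """Count DFS back edges; any value above zero means a cycle exists."""
    state, back = {}, 0
    def visit(node):
        nonlocal back
        state[node] = "active"
        for succ in graph[node]:
            if state.get(succ) == "active":
                back += 1
            elif succ not in state:
                visit(succ)
        state[node] = "done"
    for node in graph:
        if node not in state:
            visit(node)
    return back

print("pipeline_dag_depth:", dag_depth(pipeline))           # 3
print("pipeline_cycle_count:", count_back_edges(pipeline))  # 0
```

A depth of 3 with zero cycles matches a simple linear load → tokenize → train → evaluate flow; a nonzero cycle count would indicate stages feeding back into each other.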

Consider first a report where the key detector scores come back low.

Signals:

  • eval_absence_score: low
  • dataset_contamination_score: low
  • checkpoint_eval_lineage_score: low
  • artifact_trust_surface_score: low

Action order:

  1. Add eval split and quality/safety metrics.
  2. Enforce split contamination controls and dedup checks.
  3. Link checkpoints to eval outcomes and rollback criteria.
  4. Harden artifact trust surface (safe_serialization, avoid unsafe trust paths).
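
Step 2 can be sketched as an exact-match contamination check over normalized example fingerprints (a minimal illustration; real contamination controls also need near-duplicate and n-gram overlap detection):

```python
import hashlib

def fingerprint(text: str) -> str:
    # Normalize whitespace and case before hashing so trivial
    # formatting differences do not hide duplicates.
    return hashlib.sha256(" ".join(text.split()).lower().encode()).hexdigest()

# Invented toy data for illustration.
train = ["The cat sat on the mat.", "Fine-tune with care."]
eval_set = ["the cat  sat on the mat.", "A genuinely held-out example."]

train_hashes = {fingerprint(t) for t in train}
leaked = [e for e in eval_set if fingerprint(e) in train_hashes]

print(f"contaminated eval examples: {len(leaked)} / {len(eval_set)}")
```

Here the first eval example is caught despite differing in case and spacing, because the fingerprint is taken over the normalized text.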

Now consider a broadly healthy report.

Signals:

  • reproducibility_score: high
  • data_integrity_score: high
  • safety_governance_score: high
  • overall_finetuning_health: high

Action order:

  1. Keep baseline no-regression policies enabled in CI.
  2. Raise thresholds gradually for high-impact detector keys.
  3. Focus remediation on new findings only, not already green categories.
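
Step 1 can be sketched as a no-regression gate that compares the current report's scores against a saved baseline (the score maps and tolerance here are hypothetical; in practice you would load both from Arxo JSON reports):

```python
# Hypothetical baseline and current score maps; the key names mirror
# the score keys listed earlier on this page.
baseline = {"reproducibility_score": 0.80, "data_integrity_score": 0.75}
current = {"reproducibility_score": 0.82, "data_integrity_score": 0.70}

TOLERANCE = 0.02  # absorb small run-to-run noise before failing the gate

regressions = {
    key: (baseline[key], current.get(key, 0.0))
    for key in baseline
    if current.get(key, 0.0) < baseline[key] - TOLERANCE
}

for key, (old, new) in regressions.items():
    print(f"REGRESSION {key}: {old:.2f} -> {new:.2f}")

gate_failed = bool(regressions)
print("gate:", "FAIL" if gate_failed else "PASS")
```

In CI, a nonempty `regressions` map would fail the build, while scores that improve or hold steady pass untouched.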

Finally, two interpretation tips:

  • A low score without findings usually means weak evidence density; inspect central training/config files first.
  • Findings are best used as fix entry points, while scores are better for release gates and trend tracking.