
LLM Architecture

The llm_integration metric evaluates architecture quality for LLM-enabled systems. It focuses on production safety and operability risks, not prompt-writing style.

Last verified against engine metric version 1.2.0.

LLM features often fail through architecture gaps rather than model capability:

  • Missing observability hides regressions and incident root causes.
  • Missing context and cost controls create runaway spend and latency.
  • Missing governance and version controls make behavior drift hard to manage.
  • Missing safety boundaries increase prompt injection and data leakage risk.

llm_integration surfaces these gaps as normalized llm.* metrics plus evidence-backed findings.

Each run emits:

  • Primary risk/coverage keys (for observability, cost/context, security, governance, reliability, model/tooling)
  • Additive detail keys for MCP and eval quality:
    • llm.mcp_oauth21_gap
    • llm.mcp_pkce_gap
    • llm.mcp_resource_binding_gap
    • llm.mcp_audience_validation_gap
    • llm.mcp_tool_annotations_gap
    • llm.mcp_tool_output_schema_gap
    • llm.eval_presence_score
    • llm.eval_breadth_score
    • llm.eval_gating_score
    • llm.eval_dataset_versioning_score
    • llm.eval_quality_score
  • Confidence keys for all primary metrics (for example llm.observability_gap.confidence)
  • Diagnostic signals:
    • llm.pii_taint_used
    • llm.pii_fallback_reason
    • llm.blast_radius_available
    • llm.call_sites_total
    • llm.call_sites_discovered_count
    • llm.call_sites_enriched_count
    • llm.call_sites_unresolved_count

Metric semantics:

  • Most keys use risk/gap/absence semantics: higher is worse.
  • Score keys are higher is better:
    • llm.prompt_hardcoding_score
    • llm.model_coupling_score
    • llm.overall_integration_health
    • llm.eval_presence_score
    • llm.eval_breadth_score
    • llm.eval_gating_score
    • llm.eval_dataset_versioning_score
    • llm.eval_quality_score
  • Confidence keys are always higher is better (0..1).
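The direction rules above can be sketched as a small helper. Everything here is illustrative, not Arxo API: `as_health`, `is_higher_better`, and the suffix/key sets are assumptions derived from the key lists on this page.

```python
# Hypothetical helper (not part of Arxo): normalize 0..1 llm.* metric values so
# that higher always means healthier, per the direction rules above.
# Count-style diagnostic keys (e.g. llm.call_sites_total) are not 0..1 and
# should not be passed through this.

HIGHER_IS_BETTER_SUFFIXES = ("_score", ".confidence")
HIGHER_IS_BETTER_KEYS = {"llm.overall_integration_health"}

def is_higher_better(key: str) -> bool:
    return key in HIGHER_IS_BETTER_KEYS or key.endswith(HIGHER_IS_BETTER_SUFFIXES)

def as_health(key: str, value: float) -> float:
    """Return a value where higher is always better."""
    return value if is_higher_better(key) else 1.0 - value

print(round(as_health("llm.observability_gap", 0.22), 2))  # gap -> health
print(as_health("llm.eval_quality_score", 0.67))           # already a score
```

This keeps dashboards and gates uniform: every 0..1 value can be compared on a single "higher is healthier" axis.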
metrics:
  - id: llm_integration
    enabled: true
    config:
      # Optional language filter (extension-based)
      languages: ["python", "typescript", "rust", "java"]
      # Policy-oriented thresholds
      min_observability_coverage: 0.8
      max_pii_risk_tolerance: 0.0
      require_eval_harness: true
      # Weight boost for upstream callers of LLM files (0.0-1.0)
      upstream_llm_call_boost: 0.6
      # AST confirmation controls
      ast_confirmation_enabled: true
      ast_confirmation_languages: ["python", "typescript"]
      ast_confirmation_severity_cap_on_fallback: "medium"
      ast_confirmation_confidence_on_fallback: 0.45
      # GenAI OTel semantic convention behavior
      otel_semconv_mode: compat # compat | strict
      otel_semconv_maturity_mode: development # stable | development
      # Optional overrides for overall health weighting
      health_weights:
        observability: 0.12
        context_budget: 0.12
        eval_harness: 0.12
        pii: 0.16
        cost_tracking: 0.10
        prompt_hardcoding: 0.06
        model_coupling: 0.04
        fallback: 0.04
        model_version: 0.04
        tool_policy: 0.02
        cache_idempotency: 0.02
        streaming: 0.02
        rate_limit: 0.02
        security: 0.18
        embedding_drift: 0.01
        template_governance: 0.01
        agent_loop: 0.01
        instruction_boundary: 0.01
        supply_chain: 0.04
        data_model_poisoning: 0.03
        system_prompt_leakage: 0.03
        vector_embedding_weakness: 0.02
        misinformation_overreliance: 0.03
        mcp_authz: 0.02
        mcp_tool_contract: 0.02
        genai_otel_semconv: 0.03
        structured_output_enforcement: 0.04
        model_rollout_guardrail: 0.03

Notes:

  • health_weights are normalized at runtime so the effective total is always 1.0.
  • Severity values for ast_confirmation_severity_cap_on_fallback: good | low | medium | high | critical.
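The normalization note above can be sketched in a few lines. The function name and shape are hypothetical; the only assumed behavior is the one stated in the note, that raw weights are rescaled so the effective total is 1.0.

```python
# Illustrative sketch of runtime weight normalization (not Arxo's code):
# raw health_weights are divided by their sum so the effective total is 1.0.

def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}

raw = {"observability": 0.12, "pii": 0.16, "security": 0.18}  # subset for illustration
effective = normalize_weights(raw)

assert abs(sum(effective.values()) - 1.0) < 1e-9
# Relative ordering is preserved: security still outweighs pii.
assert effective["security"] > effective["pii"]
```

A practical consequence: you can override only the weights you care about; the remaining values keep their relative proportions after rescaling.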
Balanced Profile Quick Start (Recommended)

metrics:
  - id: llm_integration
    policy:
      invariants:
        - metric: llm.pii_leakage_risk
          op: "=="
          value: 0
          message: "PII must not flow to LLM prompts without controls"
        - metric: llm.observability_gap
          op: "<="
          value: 0.2
          message: "LLM call observability coverage is below baseline"
        - metric: llm.fallback_absence
          op: "<="
          value: 0.2
          message: "Fallback and timeout strategy is required"
        - metric: llm.overall_integration_health
          op: ">="
          value: 0.7
          message: "Overall LLM architecture health baseline not met"
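To make the invariant semantics concrete, here is a minimal sketch of how a gate like the one above could be evaluated against emitted metrics. This is a hypothetical helper, not Arxo's implementation; it only assumes the `metric`/`op`/`value`/`message` fields shown in the profile.

```python
# Sketch: evaluate policy invariants against a metrics dict (hypothetical).
import operator

OPS = {"==": operator.eq, "<=": operator.le, ">=": operator.ge,
       "<": operator.lt, ">": operator.gt}

def check_invariants(metrics: dict[str, float], invariants: list[dict]) -> list[str]:
    """Return the message of every invariant that fails (or is missing data)."""
    failures = []
    for inv in invariants:
        value = metrics.get(inv["metric"])
        if value is None or not OPS[inv["op"]](value, inv["value"]):
            failures.append(inv["message"])
    return failures

metrics = {"llm.pii_leakage_risk": 0.0, "llm.observability_gap": 0.22}
invariants = [
    {"metric": "llm.pii_leakage_risk", "op": "==", "value": 0,
     "message": "PII must not flow to LLM prompts without controls"},
    {"metric": "llm.observability_gap", "op": "<=", "value": 0.2,
     "message": "LLM call observability coverage is below baseline"},
]
print(check_invariants(metrics, invariants))
```

With the sample values, the PII invariant passes (0.0 == 0) and the observability invariant fails (0.22 > 0.2), so only the second message is reported.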

For full profile presets, see Policy and CI Gates:

  • Balanced (Recommended)
  • Strict (Production-Sensitive)
  • Exploratory (Early Adoption)

Static analysis quality can degrade when context is missing. Check these diagnostics before tightening gates:

  • llm.blast_radius_available = 0: call-graph-dependent context is unavailable.
  • llm.pii_taint_used = 0: PII analysis used fallback mode.
  • llm.pii_fallback_reason:
    • 1: no call graph or no call sites
    • 2: taint propagation failed
    • 3: source detection failed

Identifiers:

  • Metric ID: llm_integration
  • CLI metric flag: --metric llm_integration
  • MCP tool name: check_llm_integration
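The pre-gate diagnostic checks described above can be sketched as follows. The helper and its warning strings are hypothetical; only the key names and reason codes come from this page.

```python
# Sketch: inspect diagnostic signals before tightening CI gates (hypothetical
# helper; key names and reason codes are from the list above).
PII_FALLBACK_REASONS = {
    1: "no call graph or no call sites",
    2: "taint propagation failed",
    3: "source detection failed",
}

def diagnostics_warnings(data: dict[str, float]) -> list[str]:
    warnings = []
    if data.get("llm.blast_radius_available") == 0:
        warnings.append("call-graph-dependent context is unavailable")
    if data.get("llm.pii_taint_used") == 0:
        reason = PII_FALLBACK_REASONS.get(
            int(data.get("llm.pii_fallback_reason", 0)), "unknown")
        warnings.append(f"PII analysis used fallback mode: {reason}")
    return warnings

print(diagnostics_warnings({
    "llm.blast_radius_available": 0,
    "llm.pii_taint_used": 0,
    "llm.pii_fallback_reason": 2,
}))
```

If either warning fires, strict thresholds on the affected metrics will gate on degraded analysis rather than real regressions.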
Sample output:

{
  "id": "llm_integration",
  "version": "1.2.0",
  "data": [
    { "key": "llm.observability_gap", "value": 0.22 },
    { "key": "llm.pii_leakage_risk", "value": 0.0 },
    { "key": "llm.eval_quality_score", "value": 0.67 },
    { "key": "llm.mcp_oauth21_gap", "value": 0.0 },
    { "key": "llm.overall_integration_health", "value": 0.74 },
    { "key": "llm.pii_taint_used", "value": 1 },
    { "key": "llm.call_sites_total", "value": 18 }
  ],
  "findings": [
    {
      "rule_id": "arxo/llm-observability-gap",
      "severity": "medium",
      "title": "Improve LLM call observability",
      "evidence": [
        {
          "kind": "code_span",
          "locator": {
            "path": "src/llm/service.py",
            "line_start": 42
          }
        }
      ],
      "recommendation": "Wrap LLM calls with tracing spans and token/cost attributes."
    }
  ]
}
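A consumer of this report might read it like the sketch below. Only the fields visible in the sample above are assumed; any other structure is not guaranteed by this page.

```python
# Sketch: consuming a report shaped like the sample above (fields assumed from
# the sample only). The embedded JSON is an abridged copy for illustration.
import json

report_json = """
{
  "id": "llm_integration",
  "version": "1.2.0",
  "data": [
    {"key": "llm.observability_gap", "value": 0.22},
    {"key": "llm.overall_integration_health", "value": 0.74}
  ],
  "findings": [
    {"rule_id": "arxo/llm-observability-gap", "severity": "medium",
     "title": "Improve LLM call observability"}
  ]
}
"""

report = json.loads(report_json)
# Flatten the data array into a key -> value lookup.
data = {entry["key"]: entry["value"] for entry in report["data"]}

print("health:", data["llm.overall_integration_health"])
for finding in report["findings"]:
    print(f'{finding["severity"]}: {finding["title"]}')
```

The key/value `data` array flattens naturally into a dict, which makes it easy to feed into the policy invariants shown earlier.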