
LLM Architecture

The llm_integration metric evaluates architecture quality for LLM-enabled systems. It focuses on production safety and operability risks, not prompt-writing style.

Last verified against engine metric version 1.2.0.

LLM features often fail through architecture gaps rather than model capability:

  • Missing observability hides regressions and incident root causes.
  • Missing context and cost controls create runaway spend and latency.
  • Missing governance and version controls make behavior drift hard to manage.
  • Missing safety boundaries increase prompt injection and data leakage risk.

llm_integration surfaces these gaps as normalized llm.* metrics plus evidence-backed findings.

Each run emits:

  • Primary risk/coverage keys (for observability, cost/context, security, governance, reliability, model/tooling)
  • Additive detail keys for MCP and eval quality:
    • llm.mcp_oauth21_gap
    • llm.mcp_pkce_gap
    • llm.mcp_resource_binding_gap
    • llm.mcp_audience_validation_gap
    • llm.mcp_tool_annotations_gap
    • llm.mcp_tool_output_schema_gap
    • llm.eval_presence_score
    • llm.eval_breadth_score
    • llm.eval_gating_score
    • llm.eval_dataset_versioning_score
    • llm.eval_quality_score
  • Confidence keys for all primary metrics (for example llm.observability_gap.confidence)
  • Diagnostic signals:
    • llm.pii_taint_used
    • llm.pii_fallback_reason
    • llm.blast_radius_available
    • llm.call_sites_total
    • llm.call_sites_discovered_count
    • llm.call_sites_enriched_count
    • llm.call_sites_unresolved_count

Metric semantics:

  • Most keys use risk/gap/absence semantics: higher is worse.
  • Score keys are higher is better:
    • llm.prompt_hardcoding_score
    • llm.model_coupling_score
    • llm.overall_integration_health
    • llm.eval_presence_score
    • llm.eval_breadth_score
    • llm.eval_gating_score
    • llm.eval_dataset_versioning_score
    • llm.eval_quality_score
  • Confidence keys are always higher is better (0..1).
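The direction rules above can be sketched as a small helper. Everything here is illustrative, not Arxo API: `as_health`, `is_higher_better`, and the suffix/key sets are assumptions derived from the key lists on this page.

```python
# Hypothetical helper (not part of Arxo): normalize 0..1 llm.* metric values so
# that higher always means healthier, per the direction rules above.
# Count-style diagnostic keys (e.g. llm.call_sites_total) are not 0..1 and
# should not be passed through this.

HIGHER_IS_BETTER_SUFFIXES = ("_score", ".confidence")
HIGHER_IS_BETTER_KEYS = {"llm.overall_integration_health"}

def is_higher_better(key: str) -> bool:
    return key in HIGHER_IS_BETTER_KEYS or key.endswith(HIGHER_IS_BETTER_SUFFIXES)

def as_health(key: str, value: float) -> float:
    """Return a value where higher is always better."""
    return value if is_higher_better(key) else 1.0 - value

print(round(as_health("llm.observability_gap", 0.22), 2))  # gap -> health
print(as_health("llm.eval_quality_score", 0.67))           # already a score
```

This keeps dashboards and gates uniform: every 0..1 value can be compared on a single "higher is healthier" axis.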
metrics:
  - id: llm_integration
    enabled: true
    config:
      # Optional language filter (extension-based)
      languages: ["python", "typescript", "rust", "java"]
      # Policy-oriented thresholds
      min_observability_coverage: 0.8
      max_pii_risk_tolerance: 0.0
      require_eval_harness: true
      # Weight boost for upstream callers of LLM files (0.0-1.0)
      upstream_llm_call_boost: 0.6
      # AST confirmation controls
      ast_confirmation_enabled: true
      ast_confirmation_languages: ["python", "typescript"]
      ast_confirmation_severity_cap_on_fallback: "medium"
      ast_confirmation_confidence_on_fallback: 0.45
      # GenAI OTel semantic convention behavior
      otel_semconv_mode: compat # compat | strict
      otel_semconv_maturity_mode: development # stable | development
      # Optional overrides for overall health weighting
      health_weights:
        observability: 0.12
        context_budget: 0.12
        eval_harness: 0.12
        pii: 0.16
        cost_tracking: 0.10
        prompt_hardcoding: 0.06
        model_coupling: 0.04
        fallback: 0.04
        model_version: 0.04
        tool_policy: 0.02
        cache_idempotency: 0.02
        streaming: 0.02
        rate_limit: 0.02
        security: 0.18
        embedding_drift: 0.01
        template_governance: 0.01
        agent_loop: 0.01
        instruction_boundary: 0.01
        supply_chain: 0.04
        data_model_poisoning: 0.03
        system_prompt_leakage: 0.03
        vector_embedding_weakness: 0.02
        misinformation_overreliance: 0.03
        mcp_authz: 0.02
        mcp_tool_contract: 0.02
        genai_otel_semconv: 0.03
        structured_output_enforcement: 0.04
        model_rollout_guardrail: 0.03

Notes:

  • health_weights are normalized at runtime so the effective total is always 1.0.
  • Severity values for ast_confirmation_severity_cap_on_fallback: good | low | medium | high | critical.
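The normalization note above can be sketched in a few lines. The function name and shape are hypothetical; the only assumed behavior is the one stated in the note, that raw weights are rescaled so the effective total is 1.0.

```python
# Illustrative sketch of runtime weight normalization (not Arxo's code):
# raw health_weights are divided by their sum so the effective total is 1.0.

def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}

raw = {"observability": 0.12, "pii": 0.16, "security": 0.18}  # subset for illustration
effective = normalize_weights(raw)

assert abs(sum(effective.values()) - 1.0) < 1e-9
# Relative ordering is preserved: security still outweighs pii.
assert effective["security"] > effective["pii"]
```

A practical consequence: you can override only the weights you care about; the remaining values keep their relative proportions after rescaling.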
Balanced Profile Quick Start (Recommended)

metrics:
  - id: llm_integration
    policy:
      invariants:
        - metric: llm.pii_leakage_risk
          op: "=="
          value: 0
          message: "PII must not flow to LLM prompts without controls"
        - metric: llm.observability_gap
          op: "<="
          value: 0.2
          message: "LLM call observability coverage is below baseline"
        - metric: llm.fallback_absence
          op: "<="
          value: 0.2
          message: "Fallback and timeout strategy is required"
        - metric: llm.overall_integration_health
          op: ">="
          value: 0.7
          message: "Overall LLM architecture health baseline not met"
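To make the invariant semantics concrete, here is a minimal sketch of how a gate like the one above could be evaluated against emitted metrics. This is a hypothetical helper, not Arxo's implementation; it only assumes the `metric`/`op`/`value`/`message` fields shown in the profile.

```python
# Sketch: evaluate policy invariants against a metrics dict (hypothetical).
import operator

OPS = {"==": operator.eq, "<=": operator.le, ">=": operator.ge,
       "<": operator.lt, ">": operator.gt}

def check_invariants(metrics: dict[str, float], invariants: list[dict]) -> list[str]:
    """Return the message of every invariant that fails (or is missing data)."""
    failures = []
    for inv in invariants:
        value = metrics.get(inv["metric"])
        if value is None or not OPS[inv["op"]](value, inv["value"]):
            failures.append(inv["message"])
    return failures

metrics = {"llm.pii_leakage_risk": 0.0, "llm.observability_gap": 0.22}
invariants = [
    {"metric": "llm.pii_leakage_risk", "op": "==", "value": 0,
     "message": "PII must not flow to LLM prompts without controls"},
    {"metric": "llm.observability_gap", "op": "<=", "value": 0.2,
     "message": "LLM call observability coverage is below baseline"},
]
print(check_invariants(metrics, invariants))
```

With the sample values, the PII invariant passes (0.0 == 0) and the observability invariant fails (0.22 > 0.2), so only the second message is reported.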

For full profile presets, see Policy and CI Gates:

  • Balanced (Recommended)
  • Strict (Production-Sensitive)
  • Exploratory (Early Adoption)

Static analysis quality can degrade when context is missing. Check these diagnostics before tightening gates:

  • llm.blast_radius_available = 0: call-graph-dependent context is unavailable.
  • llm.pii_taint_used = 0: PII analysis used fallback mode.
  • llm.pii_fallback_reason:
    • 1: no call graph or no call sites
    • 2: taint propagation failed
    • 3: source detection failed

Identifiers:

  • Metric ID: llm_integration
  • CLI metric flag: --metric llm_integration
  • MCP tool name: check_llm_integration
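The pre-gate diagnostic checks described above can be sketched as follows. The helper and its warning strings are hypothetical; only the key names and reason codes come from this page.

```python
# Sketch: inspect diagnostic signals before tightening CI gates (hypothetical
# helper; key names and reason codes are from the list above).
PII_FALLBACK_REASONS = {
    1: "no call graph or no call sites",
    2: "taint propagation failed",
    3: "source detection failed",
}

def diagnostics_warnings(data: dict[str, float]) -> list[str]:
    warnings = []
    if data.get("llm.blast_radius_available") == 0:
        warnings.append("call-graph-dependent context is unavailable")
    if data.get("llm.pii_taint_used") == 0:
        reason = PII_FALLBACK_REASONS.get(
            int(data.get("llm.pii_fallback_reason", 0)), "unknown")
        warnings.append(f"PII analysis used fallback mode: {reason}")
    return warnings

print(diagnostics_warnings({
    "llm.blast_radius_available": 0,
    "llm.pii_taint_used": 0,
    "llm.pii_fallback_reason": 2,
}))
```

If either warning fires, strict thresholds on the affected metrics will gate on degraded analysis rather than real regressions.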
Sample output:

{
  "id": "llm_integration",
  "version": "1.2.0",
  "data": [
    { "key": "llm.observability_gap", "value": 0.22 },
    { "key": "llm.pii_leakage_risk", "value": 0.0 },
    { "key": "llm.eval_quality_score", "value": 0.67 },
    { "key": "llm.mcp_oauth21_gap", "value": 0.0 },
    { "key": "llm.overall_integration_health", "value": 0.74 },
    { "key": "llm.pii_taint_used", "value": 1 },
    { "key": "llm.call_sites_total", "value": 18 }
  ],
  "findings": [
    {
      "rule_id": "arxo/llm-observability-gap",
      "severity": "medium",
      "title": "Improve LLM call observability",
      "evidence": [
        {
          "kind": "code_span",
          "locator": {
            "path": "src/llm/service.py",
            "line_start": 42
          }
        }
      ],
      "recommendation": "Wrap LLM calls with tracing spans and token/cost attributes."
    }
  ]
}
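A consumer of this report might read it like the sketch below. Only the fields visible in the sample above are assumed; any other structure is not guaranteed by this page.

```python
# Sketch: consuming a report shaped like the sample above (fields assumed from
# the sample only). The embedded JSON is an abridged copy for illustration.
import json

report_json = """
{
  "id": "llm_integration",
  "version": "1.2.0",
  "data": [
    {"key": "llm.observability_gap", "value": 0.22},
    {"key": "llm.overall_integration_health", "value": 0.74}
  ],
  "findings": [
    {"rule_id": "arxo/llm-observability-gap", "severity": "medium",
     "title": "Improve LLM call observability"}
  ]
}
"""

report = json.loads(report_json)
# Flatten the data array into a key -> value lookup.
data = {entry["key"]: entry["value"] for entry in report["data"]}

print("health:", data["llm.overall_integration_health"])
for finding in report["findings"]:
    print(f'{finding["severity"]}: {finding["title"]}')
```

The key/value `data` array flattens naturally into a dict, which makes it easy to feed into the policy invariants shown earlier.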