# LLM Architecture

The `llm_integration` metric evaluates architecture quality for LLM-enabled systems. It focuses on production safety and operability risks, not prompt-writing style.
Last verified against engine metric version 1.2.0.
## Why It Matters

LLM features often fail through architecture gaps rather than model capability:
- Missing observability hides regressions and incident root causes.
- Missing context and cost controls create runaway spend and latency.
- Missing governance and version controls make behavior drift hard to manage.
- Missing safety boundaries increase prompt injection and data leakage risk.
`llm_integration` surfaces these gaps as normalized `llm.*` metrics plus evidence-backed findings.
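The gaps above correspond to concrete wrapper code around each LLM call site. As a minimal sketch (not the tool's implementation; `call_model`, the log fields, and the fallback message are hypothetical placeholders), here is an LLM call guarded with latency logging, a timeout, and a fallback:

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CallTimeout

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")


def call_model(prompt: str) -> str:
    # Hypothetical provider call; replace with your client SDK.
    return f"echo: {prompt}"


def guarded_call(prompt: str, timeout_s: float = 10.0) -> str:
    """Wrap an LLM call with latency logging, a timeout, and a fallback answer."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(call_model, prompt)
        try:
            result = future.result(timeout=timeout_s)
        except CallTimeout:
            log.warning("llm call timed out after %.1fs; using fallback", timeout_s)
            result = "The assistant is unavailable right now."
    # Observability: emit latency and request-size attributes for every call.
    log.info("llm call latency=%.3fs prompt_chars=%d",
             time.monotonic() - start, len(prompt))
    return result


print(guarded_call("hello"))
```

A real integration would attach these attributes to tracing spans (for example via OpenTelemetry) rather than plain logs, which is what the observability metrics in this page measure.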
## What You Get in Output

Each run emits:
- Primary risk/coverage keys (for observability, cost/context, security, governance, reliability, model/tooling)
- Additive detail keys for MCP and eval quality:
  - `llm.mcp_oauth21_gap`
  - `llm.mcp_pkce_gap`
  - `llm.mcp_resource_binding_gap`
  - `llm.mcp_audience_validation_gap`
  - `llm.mcp_tool_annotations_gap`
  - `llm.mcp_tool_output_schema_gap`
  - `llm.eval_presence_score`
  - `llm.eval_breadth_score`
  - `llm.eval_gating_score`
  - `llm.eval_dataset_versioning_score`
  - `llm.eval_quality_score`
- Confidence keys for all primary metrics (for example `llm.observability_gap.confidence`)
- Diagnostic signals:
  - `llm.pii_taint_used`
  - `llm.pii_fallback_reason`
  - `llm.blast_radius_available`
  - `llm.call_sites_total`
  - `llm.call_sites_discovered_count`
  - `llm.call_sites_enriched_count`
  - `llm.call_sites_unresolved_count`
## How to Read Values

- Most keys use risk/gap/absence semantics: higher is worse.
- Score keys are higher-is-better:
  - `llm.prompt_hardcoding_score`
  - `llm.model_coupling_score`
  - `llm.overall_integration_health`
  - `llm.eval_presence_score`
  - `llm.eval_breadth_score`
  - `llm.eval_gating_score`
  - `llm.eval_dataset_versioning_score`
  - `llm.eval_quality_score`
- Confidence keys are always higher-is-better (`0..1`).
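The two polarities can be captured in a small helper when consuming raw metric values. This is an illustrative sketch, not part of the tool: the function name `within_baseline` and the threshold convention are hypothetical, chosen only to show how score keys and risk/gap keys invert:

```python
def higher_is_better(key: str) -> bool:
    """Score and confidence keys pass when high; everything else when low."""
    return (key.endswith("_score")
            or key.endswith(".confidence")
            or key == "llm.overall_integration_health")


def within_baseline(key: str, value: float, threshold: float = 0.7) -> bool:
    """Hypothetical check: scores must reach the threshold,
    risk/gap/absence keys must stay under its complement."""
    if higher_is_better(key):
        return value >= threshold
    return value <= 1.0 - threshold


print(within_baseline("llm.overall_integration_health", 0.74))  # score key: 0.74 >= 0.7
print(within_baseline("llm.observability_gap", 0.22))           # gap key: 0.22 <= 0.3
```

In practice you would express these thresholds as policy invariants (see the quick-start profile below) rather than hand-rolling the comparison.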
## Full Config Reference

```yaml
metrics:
  - id: llm_integration
    enabled: true
    config:
      # Optional language filter (extension-based)
      languages: ["python", "typescript", "rust", "java"]

      # Policy-oriented thresholds
      min_observability_coverage: 0.8
      max_pii_risk_tolerance: 0.0
      require_eval_harness: true

      # Weight boost for upstream callers of LLM files (0.0-1.0)
      upstream_llm_call_boost: 0.6

      # AST confirmation controls
      ast_confirmation_enabled: true
      ast_confirmation_languages: ["python", "typescript"]
      ast_confirmation_severity_cap_on_fallback: "medium"
      ast_confirmation_confidence_on_fallback: 0.45

      # GenAI OTel semantic convention behavior
      otel_semconv_mode: compat              # compat | strict
      otel_semconv_maturity_mode: development  # stable | development

      # Optional overrides for overall health weighting
      health_weights:
        observability: 0.12
        context_budget: 0.12
        eval_harness: 0.12
        pii: 0.16
        cost_tracking: 0.10
        prompt_hardcoding: 0.06
        model_coupling: 0.04
        fallback: 0.04
        model_version: 0.04
        tool_policy: 0.02
        cache_idempotency: 0.02
        streaming: 0.02
        rate_limit: 0.02
        security: 0.18
        embedding_drift: 0.01
        template_governance: 0.01
        agent_loop: 0.01
        instruction_boundary: 0.01
        supply_chain: 0.04
        data_model_poisoning: 0.03
        system_prompt_leakage: 0.03
        vector_embedding_weakness: 0.02
        misinformation_overreliance: 0.03
        mcp_authz: 0.02
        mcp_tool_contract: 0.02
        genai_otel_semconv: 0.03
        structured_output_enforcement: 0.04
        model_rollout_guardrail: 0.03
```

Notes:

- `health_weights` are normalized at runtime so the effective total is always `1.0`.
- Severity values for `ast_confirmation_severity_cap_on_fallback`: `good | low | medium | high | critical`.
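The runtime normalization of `health_weights` can be sketched as follows. This is illustrative, not the engine's code; `overall_health` is a hypothetical combiner showing why partial weight overrides still yield a well-defined weighted average:

```python
def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    """Scale weights so they sum to 1.0, as the engine does at runtime."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def overall_health(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Hypothetical weighted average of per-dimension health scores."""
    norm = normalize_weights(weights)
    return sum(norm[k] * scores.get(k, 0.0) for k in norm)


# Even a partial override (only three dimensions) normalizes to a total of 1.0.
w = normalize_weights({"observability": 0.12, "pii": 0.16, "security": 0.18})
print(round(sum(w.values()), 6))  # 1.0
```

Because of the normalization, you only need to override the dimensions you care about; the relative proportions are what matter.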
## Balanced Profile Quick Start (Recommended)

```yaml
metrics:
  - id: llm_integration

policy:
  invariants:
    - metric: llm.pii_leakage_risk
      op: "=="
      value: 0
      message: "PII must not flow to LLM prompts without controls"
    - metric: llm.observability_gap
      op: "<="
      value: 0.2
      message: "LLM call observability coverage is below baseline"
    - metric: llm.fallback_absence
      op: "<="
      value: 0.2
      message: "Fallback and timeout strategy is required"
    - metric: llm.overall_integration_health
      op: ">="
      value: 0.7
      message: "Overall LLM architecture health baseline not met"
```

For full profile presets, see Policy and CI Gates:

- `Balanced` (Recommended)
- `Strict` (Production-Sensitive)
- `Exploratory` (Early Adoption)
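The invariant shape above (`metric`, `op`, `value`, `message`) can be evaluated with a few lines of code. A sketch of the gate logic, assuming this structure; `check_invariants` is a hypothetical name, not the tool's API:

```python
import operator

OPS = {"==": operator.eq, "!=": operator.ne,
       "<=": operator.le, ">=": operator.ge,
       "<": operator.lt, ">": operator.gt}


def check_invariants(metrics: dict[str, float],
                     invariants: list[dict]) -> list[str]:
    """Return the messages of all violated (or missing-metric) invariants."""
    failures = []
    for inv in invariants:
        actual = metrics.get(inv["metric"])
        if actual is None or not OPS[inv["op"]](actual, inv["value"]):
            failures.append(inv["message"])
    return failures


balanced = [
    {"metric": "llm.pii_leakage_risk", "op": "==", "value": 0,
     "message": "PII must not flow to LLM prompts without controls"},
    {"metric": "llm.observability_gap", "op": "<=", "value": 0.2,
     "message": "LLM call observability coverage is below baseline"},
]

run = {"llm.pii_leakage_risk": 0.0, "llm.observability_gap": 0.22}
print(check_invariants(run, balanced))
# ['LLM call observability coverage is below baseline']
```

A CI job would fail the build when the returned list is non-empty.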
## Known Limits and Diagnostics

Static analysis quality can degrade when context is missing. Check these diagnostics before tightening gates:

- `llm.blast_radius_available = 0`: call-graph-dependent context is unavailable.
- `llm.pii_taint_used = 0`: PII analysis used fallback mode.
- `llm.pii_fallback_reason`:
  - `1`: no call graph or no call sites
  - `2`: taint propagation failed
  - `3`: source detection failed
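Before tightening CI gates, these diagnostics can be turned into human-readable warnings. A sketch of such a pre-gate check, assuming the diagnostic keys and reason codes listed above (`diagnose` is a hypothetical helper, not part of the tool):

```python
FALLBACK_REASONS = {
    1: "no call graph or no call sites",
    2: "taint propagation failed",
    3: "source detection failed",
}


def diagnose(metrics: dict[str, float]) -> list[str]:
    """Explain degraded-context diagnostics before strict gates are enforced."""
    notes = []
    if metrics.get("llm.blast_radius_available") == 0:
        notes.append("call-graph-dependent context is unavailable")
    if metrics.get("llm.pii_taint_used") == 0:
        code = int(metrics.get("llm.pii_fallback_reason", 0))
        reason = FALLBACK_REASONS.get(code, "unknown")
        notes.append(f"PII analysis used fallback mode: {reason}")
    return notes


run = {"llm.blast_radius_available": 0,
       "llm.pii_taint_used": 0,
       "llm.pii_fallback_reason": 2}
print(diagnose(run))
```

If this returns anything, treat strict PII or blast-radius gates as lower-confidence signals for that run.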
## Runtime IDs

- Metric ID: `llm_integration`
- CLI metric flag: `--metric llm_integration`
- MCP tool name: `check_llm_integration`
## Example Output (Trimmed)

```json
{
  "id": "llm_integration",
  "version": "1.2.0",
  "data": [
    { "key": "llm.observability_gap", "value": 0.22 },
    { "key": "llm.pii_leakage_risk", "value": 0.0 },
    { "key": "llm.eval_quality_score", "value": 0.67 },
    { "key": "llm.mcp_oauth21_gap", "value": 0.0 },
    { "key": "llm.overall_integration_health", "value": 0.74 },
    { "key": "llm.pii_taint_used", "value": 1 },
    { "key": "llm.call_sites_total", "value": 18 }
  ],
  "findings": [
    {
      "rule_id": "arxo/llm-observability-gap",
      "severity": "medium",
      "title": "Improve LLM call observability",
      "evidence": [
        {
          "kind": "code_span",
          "locator": { "path": "src/llm/service.py", "line_start": 42 }
        }
      ],
      "recommendation": "Wrap LLM calls with tracing spans and token/cost attributes."
    }
  ]
}
```
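The report is plain JSON, so downstream tooling can consume it directly. A sketch of extracting metric values and finding IDs from a trimmed report shaped like the example above (the consuming code is hypothetical; only the `data`/`findings` key shapes come from the example):

```python
import json

# Trimmed report matching the documented output shape.
report_json = """
{
  "id": "llm_integration",
  "data": [
    { "key": "llm.observability_gap", "value": 0.22 },
    { "key": "llm.overall_integration_health", "value": 0.74 }
  ],
  "findings": [
    { "rule_id": "arxo/llm-observability-gap", "severity": "medium" }
  ]
}
"""

report = json.loads(report_json)

# Flatten the data array into a key -> value lookup.
values = {d["key"]: d["value"] for d in report["data"]}
print(values["llm.overall_integration_health"])
print([f["rule_id"] for f in report["findings"]])
```

This lookup dictionary is the natural input for the kind of invariant gating shown in the quick-start profile.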