ConstructionOS: Agentic AI for Construction

Command Center
Site Active · 3 Projects
Active Projects
12
$847M total contract value
On-Schedule Projects
8
67% vs industry avg 42%
Safety Incidents MTD
0
847 work hours since LTI
Cost Variance
+2.1%
Over budget, monitoring
🤖 AI Agent Status
13 construction AI agents across project, site, and procurement
Schedule Intelligence: 2 delays detected
Safety Monitor: 3 active alerts
Cost Control: +2.1% variance
RFI Processor: 12 RFIs in queue
Quality Inspection: 4 sign-offs today
Procurement AI: 3 POs raised today
📡 Live Site Intelligence Feed
Real-time AI monitoring across all projects and sites
Priority Project Status
PRJ-2024-0001
CRITICAL
Riverside Tower: 24-floor RC Frame
$142M · 18 months · Day 287 of 540
⚠ AI: Concrete pour delayed 8 days due to crane breakdown at Level 14
PRJ-2024-0007
AT RISK
M-40 Motorway Extension: 12km
$284M · 36 months · Day 421 of 1095
AI: Bitumen supply chain 3-week delay; re-sequence recommended
PRJ-2024-0011
ON TRACK
St. Andrews Hospital Wing: $84M
$84M · 24 months · Day 180 of 720
AI: All milestones met · next: structural steel week 27
Why ConstructionOS
📅 Schedule Overruns
90% of construction projects finish late. The average delay is 20% beyond planned duration. ConstructionOS detects schedule risks 14 days before they become critical, while there is still time to act.
⛑ Safety Incidents
Construction accounts for 20% of all workplace fatalities despite being 6% of the workforce. The Safety Monitor analyses site footage, toolbox talk records, and near-miss reports to prevent incidents before they occur.
💰 Cost Overruns
The average construction project runs 66% over budget (McKinsey). The Cost Control Agent tracks earned value, flags cost variances, and predicts final cost at completion before the overrun becomes irreversible.
Active
12
Critical
2
At Risk
4
On Track
6
TCV
$847M
All Active Projects
PRJ-2024-0001
CRITICAL
Riverside Tower · 24-floor RC
$142M · S. Ramirez · Day 287/540
PRJ-2024-0003
DELAYED
Greenfields Residential · 240 units
$96M · J. Okafor · Day 512/720
PRJ-2024-0007
AT RISK
M-40 Motorway Extension · 12km
$284M · L. Chen · Day 421/1095
PRJ-2024-0011
ON TRACK
St. Andrews Hospital Wing
$84M · S. Ramirez · Day 180/720
PRJ-2024-0014
ON TRACK
Central Business Park · Phase 2
$47M · T. Williams · Day 90/365
PRJ-2024-0018
AT RISK
Solar Farm · 150MW · Site prep
$124M · L. Chen · Day 45/180
Project Detail: PRJ-2024-0001
Riverside Tower: 24-floor Residential
RC Frame · Started Nov 2024 · Target complete: May 2026
CRITICAL
Contract Value
$142,000,000
Schedule Status
-8 days (crane)
Cost Variance (CPI)
0.94 (-6%)
Completion (SPI)
0.97 · 53%
⚠ AI Flags
1. Tower crane TC-2 breakdown at Level 14: 8-day concrete pour delay
2. Delay cascades to structural steel: knock-on +12 working days
3. Final completion risk: +3 weeks if crane not repaired by May 22
4. EAC revised to $148.2M (+4.4%) per AI cost-to-complete forecast
Total Agents
13
Actions Today
847
Safety Flags
3
Schedule Alerts
2
Project Intelligence Agents
📅
Schedule Intelligence
Analyses programme using critical path method, resource-loaded schedules, and dependency chains. Detects delay risks 14 days early. Suggests mitigation sequences to recover float.
Running · 2 delays
ReAct + CPM
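A minimal sketch of the critical path method named above, assuming a simple finish-to-start dependency model. The task names and durations are hypothetical illustrations, not project data, and this is not presented as the agent's actual implementation.

```python
from collections import defaultdict

def critical_path(durations, predecessors):
    """Return (project_duration, total_float_per_task) via forward/backward pass."""
    # Build a topological order: take tasks whose predecessors are all done.
    order, done = [], set()
    while len(order) < len(durations):
        ready = [t for t in durations
                 if t not in done and all(p in done for p in predecessors.get(t, []))]
        if not ready:
            raise ValueError("dependency cycle detected")
        for t in sorted(ready):
            order.append(t)
            done.add(t)

    # Forward pass: earliest finish of each task.
    early_finish = {}
    for t in order:
        early_start = max((early_finish[p] for p in predecessors.get(t, [])), default=0)
        early_finish[t] = early_start + durations[t]
    project_duration = max(early_finish.values())

    # Backward pass: latest finish that does not delay the project.
    successors = defaultdict(list)
    for t, preds in predecessors.items():
        for p in preds:
            successors[p].append(t)
    late_finish = {}
    for t in reversed(order):
        late_finish[t] = min((late_finish[s] - durations[s] for s in successors[t]),
                             default=project_duration)

    total_float = {t: late_finish[t] - early_finish[t] for t in order}
    return project_duration, total_float

# Hypothetical four-task fragment (durations in working days).
durations = {"pour_L14": 10, "steel": 15, "mep_rough_in": 8, "facade": 12}
predecessors = {"steel": ["pour_L14"], "mep_rough_in": ["pour_L14"],
                "facade": ["steel", "mep_rough_in"]}

baseline, floats = critical_path(durations, predecessors)
delayed, _ = critical_path({**durations, "pour_L14": 18}, predecessors)
slip = delayed - baseline  # an 8-day delay on a zero-float task moves the end date by 8 days
```

Tasks with zero float are on the critical path, so any delay to them moves the completion date one for one, while a task with positive float can absorb that many days before it matters.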
💰
Cost Control Agent
Earned value management: tracks CPI, SPI, EAC, TCPI in real time. Flags cost variances above threshold. Predicts final cost at completion using regression on similar past projects.
Running · +2.1% var
Reflection + EVM
RFI & Submittal Processor
Analyses RFIs against contract drawings and specs. Generates draft responses with clause citations. Tracks submittal register, review cycles, and approval status.
Running · 12 RFIs
Reflection + RAG
Safety & Quality Agents
⛑
Safety Monitor
Analyses site induction records, toolbox talks, near-miss reports, and permit-to-work compliance. Flags non-compliant activities in real time. Integrates with IoT wearables and camera feeds.
Running · 3 alerts
ReAct + Vision
✅
Quality Inspection AI
Generates ITPs (Inspection and Test Plans), tracks NCRs (Non-Conformance Reports), and manages hold/witness point compliance. Analyses photo evidence with vision AI.
Running · 4 sign-offs
Reflection + Vision
🏗
BIM Clash Detection
Integrates with Revit and Navisworks. Detects hard and soft clashes between structural, MEP, and architectural models. Generates clash reports with priority ranking and resolution suggestions.
Processing · 14 clashes
Multi-Agent + BIM
Site Operations Agents
📦
Procurement AI
Matches material requirements to project programme. Raises POs ahead of delivery need. Tracks supply chain disruptions and recommends alternative sourcing when lead times are at risk.
Running · 3 POs raised
Planning + Supply Chain
👷
Labour Intelligence
Tracks site headcount vs programme requirements. Flags trade shortfalls 2 weeks ahead. Manages subcontractor performance scores, payment milestones, and attendance records.
Running · 284 workers
ReAct + Forecasting
Document Control AI
Manages drawing revisions, transmittal logs, and contractual correspondence. Flags superseded documents, ensures latest-revision-only usage, and tracks contractual notice deadlines.
Idle · all current
Sequential + Registry
Active Alerts
3
Immediate action
Days Since LTI
847
Lost Time Injury free
Safety Walks (AI)
47
Daily AI site review
Near Misses (MTD)
2
Both investigated
Active Safety Alerts
⛑ Working at Height: No Harness Detected · PRJ-0001 Level 12
CRITICAL
Camera feed AI detected a worker at the Level 12 parapet without a fall arrest harness. Zone: grid C4-D5. Time: 19:41. Worker ID: unknown (no face recognition, GDPR compliant). Site supervisor alerted immediately. Work zone should be halted until compliance is confirmed.
⚠ Permit to Work Expired: Hot Works · PRJ-0001 Level 8
HIGH
Hot works permit #PTW-0847 expired at 18:00. Welding activity detected via smoke sensor at Level 8 at 19:15, 75 minutes after permit expiry. Subcontractor: Apex Steelworks. Continuing without a valid permit is a contractual breach and a regulatory violation.
📋 Induction Overdue: 4 Workers · PRJ-0007 Site
MEDIUM
4 new subcontractor workers on the PRJ-0007 site have not completed site induction within the required 24-hour window. Subcontractor: Delta Groundworks. Site access should be restricted until induction is complete.
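The hot works alert above rests on a simple time comparison, sketched here. The function name and the calendar date are hypothetical placeholders. Only the 18:00 expiry and 19:15 detection times come from the alert.

```python
from datetime import datetime

def minutes_past_expiry(permit_expiry, activity_time):
    """Minutes of activity after the permit expired; 0 if the permit was still valid."""
    delta = activity_time - permit_expiry
    return max(0, int(delta.total_seconds() // 60))

# Placeholder date; the times are those quoted in the alert.
expiry = datetime(2025, 1, 1, 18, 0)
detected = datetime(2025, 1, 1, 19, 15)

overrun = minutes_past_expiry(expiry, detected)
permit_breached = overrun > 0
```

A positive overrun is what would raise the flag. Whether to halt work remains a decision for the site supervisor.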
Critical Path Tasks
247
Delays Detected
2
14 days advance warning
Float Saved
34 days
AI recovery sequences
Forecast Accuracy
89%
📅 Schedule Analysis: PRJ-2024-0001
Critical path impact of Level 14 crane breakdown
schedule-agent · PRJ-0001
ANALYSE → Critical path recalculated post-crane event
IMPACT → L14 concrete: -8 working days
CASCADE → Structural steel: knocked on -12 days
OPTIONS → 3 recovery sequences generated
OPT1: Weekend concrete pours (+$84K cost)
OPT2: Rented mobile crane (+$47K/wk)
OPT3: Re-sequence floors 15-17 parallel
RECMD → OPT2 + OPT3 combined: recover 10 days
AI Recommendation: Mobilise rental crane within 3 days AND re-sequence Levels 15-17 to run concurrently. Net recovery: 10 of 12 days. Residual delay: 2 days, within contractual EOT allowance for unforeseen plant failure.
📊 Programme Health: All Projects
SPI (Schedule Performance Index) across 12 active projects
PRJ-0001 · Riverside Tower
0.97
PRJ-0003 · Greenfields
0.88
PRJ-0007 · M-40 Motorway
0.94
PRJ-0011 · St Andrews
1.02
PRJ-0014 · Business Park
1.00
Total Budgeted Value
$847M
Overall CPI
0.96
Below 1.0 = over budget
EAC (all projects)
$881M
Forecast at completion
Variance to Budget
+$34M
+4.0% over budget forecast
💰 Earned Value: PRJ-2024-0001
Cost performance vs plan · AI forecast to completion
Planned Value (PV): $74.2M
Earned Value (EV): $71.9M
Actual Cost (AC): $76.5M
Cost Performance Index (CPI): 0.94
Schedule Performance Index (SPI): 0.97
EAC (AI forecast): $148.2M (+4.4%)
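The indices in this panel follow from the standard earned value definitions (CPI = EV / AC, SPI = EV / PV). The sketch below recomputes them from the figures shown. The function name is hypothetical and the budget at completion is taken to be the $142M contract value, which is an assumption.

```python
def evm_indices(pv, ev, ac):
    """Standard earned value indices; all inputs in the same currency unit."""
    return {
        "CPI": ev / ac,   # cost efficiency: below 1.0 means over budget
        "SPI": ev / pv,   # schedule efficiency: below 1.0 means behind plan
        "CV": ev - ac,    # cost variance
        "SV": ev - pv,    # schedule variance
    }

# Figures in $M from the panel above.
m = evm_indices(pv=74.2, ev=71.9, ac=76.5)

# Textbook CPI-based estimate at completion, assuming BAC = contract value.
bac = 142.0
eac_cpi = bac / m["CPI"]
```

CPI and SPI round to the 0.94 and 0.97 shown. The CPI-based estimate comes out near $151.1M, so the $148.2M figure presumably reflects the agent's own forecast rather than this formula.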
⚠️ Cost Variance Drivers
AI-identified root causes of cost overrun
Crane breakdown repair + rental: +$284K direct + $47K/week rental. Uninsured portion: $180K.
Concrete price escalation: Ready-mix +12% above bill of quantities rate since contract award. Change order required.
Rework (Level 9 formwork): Defective formwork resulted in re-pour. Subcontractor liable; NCR issued. Recovery: $180K via retention.
Value engineering saving: Alternative precast staircase saved $340K vs in-situ design. Net variance positive on this line item.
RFIs Open
12
AI Draft Response
94%
Accepted by engineer
Avg Response Time
4h
vs 3 days manual
Submittals Tracking
284
RFI Processing: How It Works
The RFI Agent reads each Request for Information against the contract drawings, specifications, and schedules. It extracts the technical question, retrieves relevant drawing revisions and specification clauses from the document control RAG corpus, and generates a draft response with clause citations. For complex RFIs requiring engineer's professional judgment, the draft is flagged for senior review. All RFI responses include contractual basis, relevant drawing/spec references, and cost/time implications if any. Average engineer review time: 15 minutes vs 3 days manual.
ITPs Completed
847
NCRs Open
4
Hold Points
3
Awaiting engineer
Photo Evidence
2,847
✅ Quality Management System
Quality Inspection AI manages the full ITP lifecycle: generates inspection and test plans from specification requirements, tracks hold/witness/review points, and uses vision AI to analyse photo evidence of completed work. NCR detection: AI flags deviations from specification in inspection photos (e.g., rebar spacing, concrete surface finish, weld quality). All NCRs are tracked to closure with root cause analysis. The result is a fully documented, ISO 9001 compliant quality management trail.
POs Raised Today
3
Supply Chain Risks
2
Cost Savings AI
$284K
Lead Times Monitored
847
📦 AI Procurement Intelligence
The Procurement Agent analyses project programme to identify material requirements 6–8 weeks ahead of need date. Monitors supplier lead times and flags shortfalls before they become critical path events. Bitumen delay (PRJ-0007): detected 3 weeks before impact. Three alternative suppliers identified with comparable spec at +2% cost premium. Recommendation: split order across two suppliers to de-risk single-source dependency. Cost saving from AI-negotiated bulk buys: $284K YTD across all projects.
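One way to picture the lead-time check described above is as date arithmetic: a material is at risk once the latest possible order date has passed. This is a sketch under that assumption. The function names, dates, and 42-day lead time are hypothetical.

```python
from datetime import date, timedelta

def latest_order_date(need_date, lead_time_days):
    """Last day an order can be placed and still arrive by the need date."""
    return need_date - timedelta(days=lead_time_days)

def is_at_risk(today, need_date, lead_time_days):
    """True once ordering today would no longer meet the need date."""
    return today > latest_order_date(need_date, lead_time_days)

# Hypothetical material needed on site 30 June with a 6-week supplier lead time.
need = date(2025, 6, 30)
order_by = latest_order_date(need, lead_time_days=42)
late = is_at_risk(date(2025, 5, 20), need, lead_time_days=42)
```

Running the same check against each supplier's quoted lead time would show which alternative sources still meet the programme date.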
Workers On Site
284
Trades Shortfalls
2
Next 2 weeks
Productivity Index
0.92
Subcontractors
18
👷 Labour Intelligence
Labour Agent tracks site headcount vs the programme resource requirement curve for every trade. Forecasts shortfalls 2 weeks ahead using programme look-ahead + subcontractor resource returns. Current flags: Electrical (PRJ-0001, Week 32): 4 electricians required, subcontractor confirmed only 2. Recommended action: engage additional electrical sub. Structural steel (PRJ-0011, Week 28): 6 ironworkers needed ahead of schedule milestone. Productivity index tracks actual progress vs planned man-hours to identify low-output activities early.
Drawings Managed
4,821
Current Revision
100%
Superseded in Use
0
Transmittals
847
Document Control AI
Document Control Agent manages 4,821 project documents across drawings, specifications, submittals, RFIs, and correspondence. Key function: superseded drawing detection, which alerts when site teams are referencing an outdated revision. Automated transmittal register ensures all design changes are formally issued and acknowledged. Contractual notice tracking: monitors notice deadlines under NEC/JCT/FIDIC contract forms to protect time and cost entitlements. AI parsing of architect's instructions, variation orders, and engineer's certificates for automated cost register updates.
Agents Active
13
Actions/Day
847
Safety Events
3
Schedule Alerts
2
📡 Live Agent Trace
All AI decisions logged, ISO 19650 compliant
🛡 Construction AI Governance
Why every AI output is advisory, not autonomous
Safety decisions are always human: AI flags safety hazards but never issues stop-work orders autonomously. Site supervisor confirms and acts. Life-safety decisions require human judgment.
Contractual actions are engineer approved: RFI responses, variation orders, and NCRs require engineer sign-off. AI generates drafts, humans approve. Actions are legally binding only with an authorised signature.
ISO 19650 BIM compliance: All document management actions logged per ISO 19650 Common Data Environment requirements. Full audit trail for employer information requirements.
AgentOps: Live Agent Observability

📡 Live Trace Feed

📊 Session Metrics (24h)

Total Sessions: 2,847
Avg Latency: 1.4s
P95 Latency: 3.1s
Error Rate: 0.3%
Tool Calls: 12,284
HITL Escalations: 47
RAGAS Gate: PASS ✓

💰 Cost & Tokens

Cost (24h): £847
Input Tokens: 48.2M
Output Tokens: 12.4M
Cache Hit Rate: 67%
Cost/Session: £0.30

🎯 RAGAS Quality Scores

Faithfulness: 0.94 ✓
Answer Relevance: 0.91 ✓
Context Precision: 0.89 ✓
Context Recall: 0.93 ✓
Hallucination Rate: 0.8%

🤖 Agent Health

All agents: Healthy
Orchestrator: Active
Tool registry: Online
MCP servers: Connected
Memory store: Healthy
MLOps / LLMOps: Model Lifecycle

🧠 Model Registry

claude-sonnet-4-5 · PRODUCTION · Primary
claude-haiku-4-5 · ROUTING · Fast path
claude-opus-4-5 · SHADOW · Complex
text-embedding-3-large · RAG · Vectors

Automatic fallback routing. Versioned in MLflow. Prompt changes require RAGAS eval gate pass.

📈 Drift Detection

Faithfulness drift (7d): +0.02 stable
Latency drift (7d): +120ms watch
Output length drift: Within ±5%
Sentiment drift: No anomaly
Alert threshold: Δ>0.05 → PagerDuty
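A minimal sketch of how a Δ>0.05 threshold could be applied to score-type metrics, assuming the threshold is an absolute change against a 7-day baseline. The function name and baseline value are hypothetical. Latency drift in milliseconds is on a different scale and would need its own threshold.

```python
def drift_alerts(baseline, current, threshold=0.05):
    """Return the metrics whose absolute change from baseline exceeds the threshold."""
    return {
        name: round(current[name] - baseline[name], 4)
        for name in baseline
        if abs(current[name] - baseline[name]) > threshold
    }

# A +0.02 move in faithfulness stays under the threshold, so nothing is raised.
stable = drift_alerts({"faithfulness": 0.92}, {"faithfulness": 0.94})

# A hypothetical drop to 0.85 would exceed it and could be routed to paging.
breached = drift_alerts({"faithfulness": 0.92}, {"faithfulness": 0.85})
```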

🔀 A/B Experiment Controller

Prompt v2.3 vs v2.4: Running
CoT vs Direct: Staging

Statistical significance (p<0.05) required before promotion.

🏪 Feature Store

Vector Index: Pinecone
Dimensions: 3,072
Indexed Docs: 284K
Retrieval P95: 42ms

📦 Prompt Version Control

System prompts: Git-tracked
Few-shot examples: Versioned
Eval datasets: DVC tracked
DevSecOps: Security-First CI/CD Pipeline

🚀 CI/CD Pipeline

SAST (Semgrep + Bandit): PASS
📦 SCA (SBOM + Trivy): PASS
🧪 Unit + Integration tests: 847/847
🎯 RAGAS eval gate (≥0.92): 0.94 ✓
Secrets scan (Gitleaks): CLEAN
🐳 Container scan (Grype): 0 CRITICAL
🚒 Deploy → Kubernetes: DEPLOYED

Security Posture

RBAC (role-based access): Enforced
API keys (HashiCorp Vault): Rotated 30d
mTLS (Istio service mesh): Active
PII scrubbing (NeMo): Active
Audit log (immutable): CloudWatch
Pen test: Quarterly
SOC 2 Type II: In progress
ISO 27001: Compliant

🏗 Infrastructure as Code

Terraform: Cloud infra
Helm: K8s workloads
ArgoCD GitOps: Synced
Kustomize overlays: dev/stg/prd

♻️ Rollback & DR

RTO Target: <15 min
RPO Target: <5 min
Blue/Green Deploy: Active
Auto-rollback: Error rate >1%

📋 Regulatory Compliance

GDPR Art. 22 HITL: Enforced
EU AI Act Art. 9: Documented
NIST AI RMF: Mapped
ISO/IEC 42001: Compliant
AI Observability: OpenTelemetry + Langfuse

🔭 Observability Stack

L1 Traces: OpenTelemetry → Jaeger
L2 Metrics: Prometheus → Grafana
L3 LLM Traces: Langfuse (self-hosted)
L4 Logs: Fluentd → OpenSearch
L5 Alerts: AlertManager → PagerDuty

📊 SLO Dashboard

Availability SLO: 99.9% target
Current (30d): 99.96%
Error Budget: 73% remain
P50 Response: 0.8s
P95 Response: 3.1s
P99 Response: 7.4s
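Error budget is conventionally the unavailability an SLO permits, and the remaining share follows from measured availability over the same window. The sketch below applies that textbook definition to the two availability figures shown. The function name is hypothetical.

```python
def error_budget_remaining(slo_target, measured_availability):
    """Fraction of the error budget left, with both inputs as fractions (0.999 = 99.9%)."""
    allowed = 1.0 - slo_target               # unavailability the SLO permits
    consumed = 1.0 - measured_availability   # unavailability actually observed
    return 1.0 - consumed / allowed

remaining = error_budget_remaining(0.999, 0.9996)
```

On these two inputs the formula gives roughly 60% remaining. The panel's 73% may therefore be computed over a different window or weighting, which the source does not specify.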

🚨 Active Alerts

Latency P95: Normal
Error rate: 0.3% ✓
Token budget: 84% remain
RAG recall: 0.93 ✓
Latency drift: +120ms watch

🔬 Langfuse Trace Explorer

📈 Avg Span Breakdown

API Gateway: 12ms
Auth + RBAC: 8ms
RAG retrieval: 42ms
Guardrail check: 18ms
LLM inference: 1,240ms
Tool execution: 84ms
Total E2E: 1,452ms
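A quick way to read a span breakdown like this is to sum the named spans and compare against the end-to-end total. The sketch below does that with the figures listed. The variable names are illustrative.

```python
# Average span durations in milliseconds, as listed in the breakdown.
spans_ms = {
    "API Gateway": 12,
    "Auth + RBAC": 8,
    "RAG retrieval": 42,
    "Guardrail check": 18,
    "LLM inference": 1240,
    "Tool execution": 84,
}
total_e2e_ms = 1452

attributed = sum(spans_ms.values())                     # time covered by named spans
unattributed = total_e2e_ms - attributed                # time not covered by any listed span
llm_share = spans_ms["LLM inference"] / total_e2e_ms    # fraction spent in model inference
```

The named spans cover 1,404 ms, leaving 48 ms not attributed to any listed span (possibly network or serialisation overhead, though the source does not say). LLM inference accounts for about 85% of the total, so it is the first place to look when latency drifts.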
Guardrails: Responsible AI Framework

🛡 NeMo Guardrails: Active Rails

✅ Human-in-the-Loop (HITL) Gate
All consequential actions require human approval before execution. Confidence <0.85 always escalates. GDPR Article 22 compliant: no fully automated consequential decisions.
PII Detection & Scrubbing
Microsoft Presidio + custom patterns. Names, emails, NI/SSN, card numbers scrubbed from all LLM I/O before logging. 47 entity types across 12 jurisdictions.
🚫 Toxicity & Hallucination Filter
NeMo topic rails block off-topic responses. Factual grounding check cross-references every claim against retrieved context. Hallucination >5% triggers human review queue.
⏱ Rate Limiting & Abuse Prevention
Per-user token budgets at API gateway. 10× anomalous usage triggers suspension + security alert. Cloudflare WAF DDoS protection.
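The HITL gate rule above (consequential actions always need approval, and confidence below 0.85 always escalates) can be sketched as a small routing function. The class, field names, and route labels are hypothetical. Only the 0.85 floor comes from the text.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.85  # from the HITL gate description

@dataclass(frozen=True)
class Recommendation:
    action: str
    confidence: float
    consequential: bool

def route(rec):
    """Send a recommendation to human review unless it is low-stakes and confident."""
    if rec.consequential or rec.confidence < CONFIDENCE_FLOOR:
        return "human_review"
    return "advisory_only"

halt = route(Recommendation("halt work zone", confidence=0.97, consequential=True))
unsure = route(Recommendation("tag photo", confidence=0.60, consequential=False))
routine = route(Recommendation("tag photo", confidence=0.95, consequential=False))
```

The two conditions are joined with `or`, so high confidence never bypasses review of a consequential action.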

📋 Audit Trail & Explainability

Immutable Decision Log
Every AI recommendation logged: input context, retrieved docs, reasoning chain, confidence, model version, user ID, timestamp. 7-year retention for regulated decisions.
🔎 Explainability (XAI)
Every recommendation includes source citations, confidence intervals, alternatives considered, and limitation disclosures. SHAP attribution for structured ML models.
⚖️ Bias Monitoring
Fairness metrics tracked across protected characteristics. Disparate impact analysis monthly. EU AI Act Article 10 data governance requirements met.
🏛 Regulatory Mapping
GDPR Art. 5/22 · EU AI Act Art. 9/10/13/14 · NIST AI RMF · ISO/IEC 42001 · IEEE 7001 Transparency. Compliance evidence pack generated quarterly.
0.3%
Hallucination Rate
Target <2%
100%
HITL Coverage
Consequential acts
0
PII Leaks (30d)
Target: 0
A+
Security Grade
Mozilla Observatory
Multi-Agent Architecture: Mesh & Orchestration

🕸 Agent Mesh Topology

Orchestrator
Agent 1
Agent 2
Agent 3
Agent 4
Agent 5
Agent 6

Orchestrator decomposes tasks, routes to specialists, aggregates results, handles conflicts. All inter-agent communication via typed schemas. No agent takes external action without Orchestrator validation.
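A minimal sketch of typed inter-agent messages with orchestrator validation, using only the standard library. The message fields, class names, and allow-list are hypothetical and stand in for whatever schema technology the platform actually uses.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AgentMessage:
    sender: str
    task_id: str
    action: str
    payload: dict = field(default_factory=dict)
    external: bool = False  # True if the action would touch a system outside the mesh

    def __post_init__(self):
        # Reject malformed messages at construction time.
        for name in ("sender", "task_id", "action"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value:
                raise ValueError(f"{name} must be a non-empty string")

class Orchestrator:
    def __init__(self, allowed_external_actions):
        self.allowed = set(allowed_external_actions)

    def validate(self, msg):
        """Approve internal messages; approve external actions only if allow-listed."""
        return (not msg.external) or (msg.action in self.allowed)

orch = Orchestrator(allowed_external_actions={"raise_po_draft"})
internal_ok = orch.validate(AgentMessage("schedule-agent", "T-1", "recalculate_cpm"))
external_ok = orch.validate(AgentMessage("procurement-agent", "T-2", "raise_po_draft", external=True))
external_blocked = orch.validate(AgentMessage("safety-agent", "T-3", "issue_stop_work", external=True))
```

Keeping validation in one place means an agent cannot reach an external system just by constructing a well-formed message.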

⚙️ Agent Patterns

ReAct (Reason + Act loops): Analytical
Reflection (self-critique cycles): High-stakes
Planning (hierarchical decomposition): Multi-step
RAG (retrieval-augmented generation): Knowledge
HITL (human-in-the-loop): All consequential
Tool Use (function calling): All agents

🔄 Temporal.io Orchestration

Active Workflows: 2,847
HITL Signals Pending: 47
Retry Policy: Exp backoff ×3
Saga Pattern: Compensating txns
Durable Execution: Crash-safe ✓
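The retry policy can be pictured with a plain-Python sketch of exponential backoff. This does not use the Temporal SDK. It reads "×3" as a maximum of three attempts, which is an assumption, and the function name, base delay, and doubling factor are illustrative.

```python
import time

def retry_with_backoff(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on any exception with delays of base_delay * 2**n."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))

# Usage with a recorder in place of real sleeping.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

delays = []
result = retry_with_backoff(flaky, sleep=delays.append)
```

Injecting `sleep` keeps the example testable without waiting. A workflow engine would normally own this loop and persist state between attempts.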

📨 Kafka Message Bus

Topics: 47 agent topics
Throughput: 12K msgs/s
Consumer Lag: <100ms
Schema Registry: Confluent
Dead Letter Queue: Monitored

🔌 MCP Integration Layer

MCP (data sources): Active
MCP (CRM/ERP): Active
MCP (document store): Active
OAuth 2.0 auth: All connectors
JSON Schema validation: All tools
Evaluation Framework: Continuous Quality Gates
0.94
Faithfulness
Gate ≥0.92 ✓
0.91
Answer Relevance
Gate ≥0.88 ✓
0.89
Context Precision
Gate ≥0.85 ✓
0.93
Context Recall
Gate ≥0.90 ✓
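A minimal sketch of how these four gates could be enforced in CI, assuming scores arrive as a plain dictionary. The function and key names are hypothetical. The thresholds and scores are those shown above.

```python
# Gate thresholds from the panel above (score must be >= threshold).
GATES = {
    "faithfulness": 0.92,
    "answer_relevance": 0.88,
    "context_precision": 0.85,
    "context_recall": 0.90,
}

def failed_gates(scores, gates=GATES):
    """Return the names of metrics below their gate; a missing metric counts as a failure."""
    return sorted(name for name, floor in gates.items()
                  if scores.get(name, 0.0) < floor)

current = {"faithfulness": 0.94, "answer_relevance": 0.91,
           "context_precision": 0.89, "context_recall": 0.93}
failures = failed_gates(current)
merge_allowed = not failures
```

Treating a missing metric as a failure is a deliberate choice: a broken eval run should block the merge rather than pass silently.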

🧪 Eval Suite Composition

Golden dataset: 2,847 Q&A pairs
Unit evals (per agent): 120–400 cases
Integration evals: 84 end-to-end flows
Adversarial probes: 47 jailbreak tests
LLM-as-judge: claude-opus-4-5
Human eval cadence: Weekly 5% sample

Eval-Driven Dev Flow

1
Change proposed → PR opened
Automated eval suite runs against golden dataset in CI. Results posted to PR.
2
RAGAS gate enforced
All metrics must meet thresholds. Failure blocks merge.
3
Canary deploy (5%)
Langfuse online evals on live traffic. Drift alerts trigger auto-rollback.
4
Full rollout + monitor
Weekly human eval sample. Monthly RAGAS full re-run.
Infrastructure: Kubernetes · Scale · Resilience

☸️ Kubernetes Cluster

Cluster: EKS / GKE / AKS
Node pools: 3 (system · app · GPU)
HPA target: CPU 70% → scale
KEDA triggers: Kafka consumer lag
Spot instances: 80% non-critical
Multi-AZ: 3 zones

💾 Data Architecture

PostgreSQL (RDS): Operational
Redis (ElastiCache): Session + cache
Pinecone / pgvector: Vector search
S3 Intelligent Tier: Documents
Kafka (MSK): Event streaming
Snowflake / BigQuery: Analytics DWH

💰 Cost Architecture

LLM API (Anthropic): ~45% of AI cost
Vector DB: ~12% of AI cost
Compute (K8s): ~28% of AI cost
Prompt cache savings: −67% input tokens
Haiku fast-path saving: −40% LLM spend
Est. monthly total: £8–28K

Disaster Recovery

1
Primary failure detected (<2 min)
Route53 health check fails → DNS failover. Temporal promotes standby. Kafka MirrorMaker live.
2
DR validates (<5 min)
Smoke tests auto-run. PagerDuty alert to on-call. RTO target: 15 minutes.
3
Data reconciled (<15 min)
PostgreSQL read replica promoted. S3 cross-region lag <5min. RPO: 5 minutes.

📊 Capacity Planning

  • Baseline: 3 app nodes · 2 vCPU · 8GB RAM each
  • Scale trigger: Kafka consumer lag >10K msgs
  • Max scale: 20 nodes via KEDA + HPA
  • LLM concurrency: 50 parallel sessions managed
  • Vector search: Pinecone p1 → p2 at 500K docs
  • DB connections: PgBouncer pool (max 500)
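The scaling bounds above can be illustrated with a simplified lag-based calculation. It assumes the 10K figure is a per-replica lag target and treats replicas and nodes interchangeably, which is a simplification. It is not KEDA's actual algorithm, and the function name is hypothetical.

```python
import math

def desired_replicas(total_lag, lag_per_replica=10_000, min_replicas=3, max_replicas=20):
    """Replicas needed to keep per-replica lag under target, clamped to the stated bounds."""
    wanted = math.ceil(total_lag / lag_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

idle = desired_replicas(0)          # stays at the 3-node baseline
busy = desired_replicas(45_000)     # scales out to cover the backlog
capped = desired_replicas(500_000)  # capped at the 20-node maximum
```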
Documentation: Deployment Guide & Runbook

🚀 10-Week Deployment Guide

1
Week 1–2: Data Foundation & Infrastructure
Deploy K8s cluster. Provision Temporal.io, Kafka, PostgreSQL, Pinecone. Connect source systems via MCP. Establish data governance and RBAC. Run baseline eval on golden dataset.
2
Week 3–4: Core Agents Live
Deploy first 3 highest-value agents. Wire HITL approval workflows in Temporal. Configure NeMo guardrails and PII scrubbing. Set up Langfuse tracing and RAGAS eval gate.
3
Week 5–7: Full Agent Mesh
Deploy all agents. Configure Orchestrator routing. A/B test prompt variants. Enable drift detection. Train end-users on HITL workflow.
4
Week 8–10: Production Hardening
Pen test + SAST/DAST scan. Load test 10× baseline. Configure PagerDuty. Compliance review (GDPR, EU AI Act). Produce runbook. Go-live.

🏗 7-Layer Platform Stack

L7 Presentation: React · Next.js · SSO
L6 API Gateway: FastAPI · OAuth2 · WAF
L5 Orchestration: Temporal.io · LangGraph
L4 Agent Runtime: NeMo · RAGAS · Tools
L3 Model + Tools: Claude API · MCP servers
L2 Data + Integration: Kafka · PostgreSQL · Redis
L1 Observability: OTel · Langfuse · Grafana

🔌 Integration How-To

  • MCP server per data source (REST/GraphQL/gRPC)
  • OAuth 2.0 service account per enterprise system
  • Kafka topics per agent capability namespace
  • Schema registry for typed message contracts
  • Data lineage via OpenLineage → Marquez
  • Webhooks for real-time event ingestion
  • dbt + Airflow for batch data refresh

👤 RBAC User Roles

Viewer: Read dashboards
Analyst: Run queries + export
Approver: HITL decisions
Manager: Config + agents
Admin: Full platform
AI Engineer: Models + prompts

IdP via Okta/Azure AD. MFA enforced for Approver+.

📞 Incident Runbook

  • High latency (>5s): Check Langfuse trace → vector store → LLM API status
  • RAGAS gate fail: Roll back last prompt change → notify AI engineer
  • Error spike: Circuit breaker → fallback to previous version
  • PII leak: Suspend session → DPO notification within 24h
  • HITL queue backup: Escalate to senior approver
  • Cost overrun: Auto-throttle → route to Haiku