ARTlligence Platform: From Prototype to Production

The Prototype Gap Platform Architecture
6 · Production Dimensions
100% · Of demos fail in prod
12 wk · From prototype to MVP
£400K · Typical first engagement
ARTlligence · Closes the gap
🎯 The real objection is not "this isn't impressive"
When a CTO says "cool prototype", they mean: I don't know how to go from this to something I can run my business on. They're not dismissing the value; they're identifying the gap between a compelling demo and a system they'd stake their operations on. Your job is to close that gap, credibly.
Why every AI prototype looks the same to a CTO
What they see: Impressive demo with hardcoded scenarios, simulated agent responses, mock data, and a beautiful UI that works perfectly for the 3 use cases you prepared.
→ What they need: A system connected to their actual data, handling their actual edge cases, failing gracefully, auditable, secure, and operated by their team.
What they see: Agents that respond in 2 seconds in a demo. Clean outputs. Every decision looks right.
→ What they need: Agents that handle 10,000 requests/day, degrade gracefully under load, retry on failure, and tell you when they're uncertain.
What they see: A live agent log that looks like decisions are being made. Hard to tell what's real vs simulated.
→ What they need: Complete observability: every LLM call traced, every decision logged with cost and latency, evaluation metrics tracked, and alerts when quality degrades.
📋 The 6 Production Dimensions
Every dimension must be solved. Missing one breaks the system.
🏗 Agent Infrastructure: Durable orchestration, fault tolerance, stateful workflows that survive restarts
🔌 Real Integrations: Live connections to enterprise systems: SAP, Salesforce, SharePoint, ServiceNow
📊 Evaluation Framework: Continuous quality measurement, because you need to know if it's right
📡 LLMOps & Observability: Full trace capture, cost attribution, drift detection, alerting
🔐 Security & Governance: RBAC, audit trails, guardrails, human-in-the-loop approval gates
⚙️ Production Deployment: Containerised, versioned, scalable, monitored, with rollback
💡 The ARTlligence position
The 20 OS products prove we understand the domain and the value. The Platform Architecture is what closes the sale. We're not selling demos; we're selling a defined, deliverable path from prototype to production enterprise system.
What the prototypes have vs what production needs
Dimension | Prototype (Current) | Production (Required) | Technology
Agent Orchestration | Hardcoded if/else, direct LLM calls | Temporal/LangGraph: durable, stateful, retryable workflows | Temporal, LangGraph
Data & Integrations | Mock data, simulated responses | Live MCP connectors to SAP, Salesforce, SharePoint, ServiceNow | MCP, REST APIs
State Management | In-memory, lost on restart | Persistent state in Redis/Postgres: workflows survive crashes | Redis, PostgreSQL
Message Queue | Direct agent-to-agent calls | Kafka/Redis queues: decoupled, buffered, replayable | Kafka, Redis Streams
Evaluation | No quality measurement | RAGAS continuous evaluation: faithfulness, relevancy, accuracy tracked | RAGAS, Custom evals
Observability | Simulated log stream | Langfuse/LangSmith: every LLM call traced, cost attributed, latency tracked | Langfuse, LangSmith
Authentication & RBAC | None (open access) | SSO + RBAC: who can trigger which agents and see which outputs | Auth0, Entra ID
Audit Trail | Nothing logged permanently | Immutable audit log: every decision with full context, actor, timestamp | OpenTelemetry, SIEM
Guardrails | No input/output safety | NeMo Guardrails: PII detection, topic restriction, hallucination mitigation | NeMo, Lakera Guard
Human-in-the-Loop | Mentioned in UI only | Temporal signals: workflow pauses, sends approval request, waits for human | Temporal, Webhooks
Scalability | Single thread, single user | Kubernetes horizontal scaling: 10,000+ concurrent agent tasks | K8s, Docker
Error Handling | Silent failures, no retries | Dead letter queues, exponential backoff, circuit breakers, fallback models | Resilience4j, Custom
Cost Control | Unknown cost per operation | Token budget per agent, cost attribution per workflow, budget alerts | Langfuse, Custom
Multi-tenancy | Single tenant only | Tenant isolation: separate data, agent config, and cost attribution per client | Custom, Row-level security
Deployment | Netlify static HTML | CI/CD pipeline, versioned agents, blue-green deployment, rollback | GitHub Actions, ArgoCD
🏛 The ARTlligence Platform Stack: 7 Layers
Every enterprise AI system is built on these 7 layers. The same stack powers all 20 OS products. Build the platform once, then deploy any OS on top of it.
Layer 7: Enterprise Presentation
The dashboards and UIs (what we've built). React / HTML. Calls the Agent API. Real-time updates via WebSocket. SSO-authenticated.
React / Next.js · Tailwind CSS · WebSocket · Auth0 / Entra SSO · shadcn/ui
↕
Layer 6: Agent API Gateway
FastAPI service. Authenticates requests. Validates RBAC. Routes to Temporal workflows. Returns structured results. Rate-limits per tenant.
FastAPI · JWT / OAuth2 · RBAC middleware · Rate limiting · Request validation
↕
Layer 5: Workflow Orchestration
Temporal for durable, stateful, retryable workflows. LangGraph for agentic reasoning loops. Agents are Activities inside Temporal Workflows: they can fail, retry, wait for signals (human approval), and recover from crashes.
Temporal.io · LangGraph · Google ADK · Signals (HITL) · Saga patterns · Compensation logic
↕
Layer 4: Agent Runtime
Individual agents as stateless Python/TypeScript functions. Input schema → reasoning → tool calls → output schema. NeMo Guardrails wrap every agent. RAGAS eval on every output. Cost budget enforced per agent call.
Python agents · NeMo Guardrails · RAGAS evaluation · Token budgets · Fallback models · Output validation
↕
Layer 3: Model & Tool Layer
LLM routing: Claude Sonnet 4 for reasoning, Haiku for classification, GPT-4o for multimodal. MCP for standardised tool connectivity. RAG retrieval from vector store. Structured tool calls with JSON schema validation.
Claude Sonnet 4 · GPT-4o · Gemini 1.5 Pro · MCP connectors · Pinecone / Qdrant · Model routing
↕
Layer 2: Data & Integration Layer
Enterprise connectors: SAP, Salesforce, SharePoint, ServiceNow, Jira, Confluence, custom ERPs. Kafka for real-time event streaming. PostgreSQL for structured state. Redis for caching and queues. S3/Blob for document storage.
Kafka · PostgreSQL · Redis · SAP connector · Salesforce API · SharePoint MCP · ServiceNow
↕
Layer 1: Observability & Security Foundation
OpenTelemetry for all traces. Langfuse for LLM observability. Every agent call: input, output, model, tokens, cost, latency. Immutable audit log. PII scrubbing before any LLM call. Anomaly alerts on cost/quality drift.
Langfuse · OpenTelemetry · Prometheus · Grafana · PII scrubbing · Audit log (immutable) · Alerting
Production agent design principles
Agent Contract Pattern
Every production agent has a defined contract, not just a prompt
# Production agent contract.
# Sketch using pydantic models (an assumption; any schema library works).
# BaseAgent, ClaimType and Signal are project types defined elsewhere.
from pydantic import BaseModel, Field

class ClaimInput(BaseModel):
    # Defined inputs, validated at runtime
    claim_id: str
    claimant_id: str
    amount: float
    claim_type: ClaimType

class FraudScore(BaseModel):
    # Defined outputs, validated before return
    score: float = Field(ge=0.0, le=1.0)  # 0.0–1.0
    signals: list[Signal]
    recommendation: str
    confidence: float
    requires_human: bool

class FraudDetectionAgent(BaseAgent):
    input_schema = ClaimInput
    output_schema = FraudScore

    # Hard constraints
    max_tokens = 1500
    max_latency_ms = 3000
    fallback_model = "claude-haiku-4"
    requires_guardrail = True
🔄 The 5 Agentic Patterns
Choose the right pattern for each use case
Sequential: Agent A → Agent B → Agent C. Use for: compliance checks, document processing, report generation. Predictable, auditable, easy to test.
ReAct (Reasoning + Acting): Think → Act → Observe → Think. Use for: fraud detection, anomaly investigation, research. Handles uncertainty well.
Planning + Execution: Plan all steps first, then execute. Use for: production scheduling, logistics routing, complex multi-step tasks.
Multi-Agent Collaboration: Specialist agents with a coordinator. Use for: 100+ agent systems, parallel research, complex workflows requiring different expertise.
Reflection: Agent reviews its own output before returning. Use for: drafting, legal/medical content, any output where quality matters more than speed.
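To make the Reflection pattern concrete, here is a minimal sketch of the draft, critique, revise loop. The function names (`reflect`, `demo_draft`, `demo_critique`, `demo_revise`) and the two-round limit are assumptions for illustration; in a real agent each callable would wrap an LLM call.

```python
from typing import Callable, List


def reflect(
    task: str,
    draft_fn: Callable[[str], str],
    critique_fn: Callable[[str, str], List[str]],
    revise_fn: Callable[[str, str, List[str]], str],
    max_rounds: int = 2,
) -> str:
    """Draft once, then critique and revise until no issues remain or rounds run out."""
    draft = draft_fn(task)
    for _ in range(max_rounds):
        issues = critique_fn(task, draft)
        if not issues:
            break  # the reviewer found nothing to fix
        draft = revise_fn(task, draft, issues)
    return draft


# Stand-in functions (hypothetical); each would be an LLM call in production.
def demo_draft(task: str) -> str:
    return "Claim approved"


def demo_critique(task: str, draft: str) -> List[str]:
    return [] if "because" in draft else ["missing justification"]


def demo_revise(task: str, draft: str, issues: List[str]) -> str:
    return draft + " because the documents match the policy terms"


final = reflect("Summarise claim decision", demo_draft, demo_critique, demo_revise)
```

Capping the number of rounds keeps latency and token cost bounded, which is the trade-off this pattern makes against speed.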
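⚡ Human-in-the-Loop: The Right Way
Not a UI checkbox: a durable workflow pause that waits for a real human decision
# Temporal HITL pattern (Python SDK).
# fraud_check, notify_investigator and settle are activities defined elsewhere.
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class ClaimsWorkflow:
    def __init__(self) -> None:
        self.decision = None  # stays None when no human review is needed

    @workflow.signal
    def investigator_decision(self, decision: str) -> None:
        # Called when the human sends the signal
        self.decision = decision

    @workflow.run
    async def run(self, claim):
        # Step 1: AI triage
        fraud_score = await workflow.execute_activity(
            fraud_check, claim, start_to_close_timeout=timedelta(minutes=5))
        # Step 2: Human gate, workflow PAUSES
        if fraud_score > 0.7:
            await workflow.execute_activity(
                notify_investigator, args=[claim, fraud_score],
                start_to_close_timeout=timedelta(minutes=5))
            # Workflow sleeps until the signal arrives;
            # raises asyncio.TimeoutError after 48 hours
            await workflow.wait_condition(
                lambda: self.decision is not None, timeout=timedelta(hours=48))
        # Resumes when human sends signal
        return await workflow.execute_activity(
            settle, args=[claim, self.decision],
            start_to_close_timeout=timedelta(minutes=5))
Why Temporal: The workflow survives server restarts. Whether the investigator responds in 3 minutes or just inside the 48-hour timeout, the workflow is still waiting, exactly where it left off. No polling loops, no lost state.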
Audit trail: Every pause, every signal received, every decision is recorded in Temporal's event history. Fully auditable for regulatory purposes.
Escalation: Timeouts trigger escalation workflows: if no response arrives within 48h, the workflow escalates to a senior investigator.
Not just approvals: Same pattern for any blocking action, such as data corrections, ambiguity resolution, or policy exceptions. The workflow pauses and waits for the human input it needs.
🔄 Temporal: Why It's the Right Choice
Durable execution: your workflows survive any failure
Durability: If a server crashes mid-workflow, Temporal replays the workflow's recorded event history to rebuild its state, and the workflow continues exactly where it stopped. No data loss, no zombie tasks.
Visibility: Every workflow execution is visible in the Temporal UI: what state it's in, what activities have run, what's waiting. Production debugging becomes possible.
Retry logic: Activities retry automatically with exponential backoff. Transient failures (rate limits, network blips) are handled by the retry policy rather than hand-written retry code. Permanent failures can trigger the compensating actions you define.
Timers: Schedule an action 6 weeks from now. Trigger SLAs. Escalate after timeout. All natively, without cron jobs or polling.
Versioning: Update workflow logic without breaking in-flight workflows. Deploy new agent versions alongside old ones during migration.
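As a rough illustration of what "retry with exponential backoff" means, the sketch below computes a capped delay schedule and applies it to a flaky call. It is plain Python, not Temporal code. The parameter names loosely follow the fields of a retry policy (initial interval, backoff coefficient, maximum interval, maximum attempts), and the default values are illustrative assumptions.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(initial_s: float = 1.0, coefficient: float = 2.0,
                   max_interval_s: float = 100.0, max_attempts: int = 5) -> List[float]:
    """Seconds to wait before each retry; max_attempts includes the first try."""
    return [min(initial_s * coefficient ** i, max_interval_s)
            for i in range(max_attempts - 1)]


def call_with_retry(fn: Callable[[], T], delays: List[float],
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn; on an exception wait the next delay and try again."""
    for delay in delays:
        try:
            return fn()
        except Exception:
            sleep(delay)
    return fn()  # final attempt: any exception propagates to the caller
```

Passing `sleep` as a parameter lets tests record the waits instead of actually sleeping.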
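🔀 LangGraph: Agentic Reasoning Loops
For agents that need to reason, not just execute
# LangGraph inside a Temporal Activity.
# FraudState, the node functions and checkpointer are defined elsewhere.
from langgraph.graph import StateGraph, START, END

def build_fraud_agent():
    graph = StateGraph(FraudState)
    # Nodes: each is a reasoning step
    graph.add_node("analyse_claim", analyse_claim)
    graph.add_node("check_history", check_history)
    graph.add_node("network_analysis", network_analysis)
    graph.add_node("score_and_report", score_report)
    graph.add_edge(START, "analyse_claim")
    # Conditional routing based on state
    graph.add_conditional_edges(
        "analyse_claim",
        lambda s: "network_analysis" if s.suspicious else "score_and_report",
    )
    # Assumed wiring: suspicious claims also get a history check before scoring
    graph.add_edge("network_analysis", "check_history")
    graph.add_edge("check_history", "score_and_report")
    graph.add_edge("score_and_report", END)
    return graph.compile(checkpointer=checkpointer)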
LangGraph is used inside Temporal Activities for reasoning loops. Temporal handles durability and HITL. LangGraph handles the agent's internal reasoning. Separation of concerns.
📨 Message Queue Architecture: Why Agents Need Queues
Direct agent-to-agent calls create brittle systems. Queues create resilient ones.
❌ Direct calls (prototype pattern)
# What the demos do
result = agent_a(input)
result2 = agent_b(result)
result3 = agent_c(result2)
# If agent_b crashes → everything lost
# If agent_b is slow → caller blocks
# Can't retry, can't scale, can't replay
✅ Queue-based (production pattern)
# What production systems use
kafka.publish("claims.new", claim_event)
# Agent B consumes from queue
# Independently scaled (10 instances)
# Crash → message stays in queue
# Slow → backpressure handled
# Replay → re-process any message
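The queue behaviour can be illustrated without a broker. The sketch below uses an in-memory `deque` as a stand-in for Kafka or Redis Streams: a failed message goes back on the queue, and after a maximum number of attempts it is moved to a dead-letter list. `drain`, `flaky_agent`, and the claim IDs are hypothetical names for this example only.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


def drain(queue: Deque[Tuple[dict, int]], handler: Callable[[dict], str],
          max_attempts: int = 3) -> Tuple[List[str], List[dict]]:
    """Process every message; redeliver failures, dead-letter after max_attempts."""
    processed: List[str] = []
    dead_letters: List[dict] = []
    while queue:
        message, attempts = queue.popleft()
        try:
            processed.append(handler(message))
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                dead_letters.append(message)       # park it for inspection
            else:
                queue.append((message, attempts))  # message is not lost
    return processed, dead_letters


seen: Dict[str, int] = {}


def flaky_agent(message: dict) -> str:
    """Hypothetical consumer: CLM-2 fails once, CLM-3 always fails."""
    seen[message["id"]] = seen.get(message["id"], 0) + 1
    if message["id"] == "CLM-2" and seen["CLM-2"] == 1:
        raise TimeoutError("transient failure")
    if message["id"] == "CLM-3":
        raise ValueError("malformed claim")
    return "scored " + message["id"]


queue = deque(({"id": f"CLM-{n}"}, 0) for n in (1, 2, 3))
processed, dead_letters = drain(queue, flaky_agent)
```

A real broker adds persistence, consumer groups, and offsets on top of this, but the failure semantics are the same idea.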
🔌 MCP is the integration game-changer
Model Context Protocol (MCP) is an open standard, introduced by Anthropic, for connecting AI to data sources and tools. Instead of building custom integrations for every enterprise system, MCP provides a standardised connector interface. One MCP server for SharePoint works with any agent. The ecosystem is growing fast: 200+ community MCP servers already built.
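🏗 MCP Connector Architecture
How MCP replaces custom integration code
# MCP server for Salesforce (example, using the MCP Python SDK).
# sf_client is a Salesforce client defined elsewhere.
import re
from mcp.server.fastmcp import FastMCP

salesforce_mcp = FastMCP("salesforce")

@salesforce_mcp.tool()
async def get_account(account_id: str):
    """Retrieve account details from Salesforce"""
    # Validate the ID before building the query (prevents SOQL injection)
    if not re.fullmatch(r"[A-Za-z0-9]{15}(?:[A-Za-z0-9]{3})?", account_id):
        raise ValueError("invalid Salesforce account ID")
    # SOQL has no SELECT *, so name the fields you need
    return await sf_client.query(
        f"SELECT Id, Name, Industry FROM Account WHERE Id = '{account_id}'"
    )

# Agent uses the tool, no custom code needed.
# Agent is a placeholder for your agent framework's class.
agent = Agent(
    tools=[salesforce_mcp, sharepoint_mcp, sap_mcp],
    # Agent automatically discovers available tools
    # and knows how to call them from schema
)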
📋 Enterprise Integration Catalogue
What's available via MCP today
System | MCP Available | Auth Method
SharePoint Online | ✓ Ready | OAuth2 + Entra
Salesforce | ✓ Ready | OAuth2
ServiceNow | ✓ Ready | API Key + OAuth
Jira / Confluence | ✓ Ready | API Token
SAP S/4HANA | Build | SAP BTP OAuth
SAP ERP (older) | Build | RFC / BAPI
Oracle ERP | Build | REST API
MS Dynamics 365 | ✓ Ready | Entra OAuth
PostgreSQL / SQL | ✓ Ready | Connection string
Custom REST APIs | Generic | Any
🔐 Integration Security Principles
Every connector must follow these rules
Least privilege: MCP connectors are scoped to the minimum data required. A claims fraud agent can read claims; it cannot modify or delete them. Permissions are defined at connector level, not agent level.
No credentials in agent prompts: All credentials stored in secrets manager (Vault/AWS Secrets Manager). Injected at runtime. Never in code, never in prompts, never in logs.
Audit every tool call: Every MCP tool call logged: which agent, which tool, what parameters, what was returned, latency, cost. Data access auditable for compliance.
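One way to approach "audit every tool call" is a decorator that wraps each tool function. The sketch below is an assumption about how this could look, not the platform's actual implementation. `audited`, `AUDIT_LOG`, `SENSITIVE_PARAMS`, and `get_claim` are hypothetical names, and the log is an in-memory list where production would use an append-only store. It masks credential-like parameters, in line with the no-credentials-in-logs rule, and records the result size rather than the full payload.

```python
import functools
import time
from typing import Any, Callable, Dict, List

AUDIT_LOG: List[Dict[str, Any]] = []                # stand-in for an append-only store
SENSITIVE_PARAMS = {"password", "token", "api_key"}  # never written to the log


def audited(agent: str) -> Callable:
    """Decorator factory: record which agent called which tool, with what, and how long it took."""
    def decorator(tool: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(tool)
        def wrapper(**params: Any) -> Any:
            started = time.perf_counter()
            result = tool(**params)
            AUDIT_LOG.append({
                "agent": agent,
                "tool": tool.__name__,
                "params": {k: "***" if k in SENSITIVE_PARAMS else v
                           for k, v in params.items()},
                "result_chars": len(str(result)),
                "latency_ms": round((time.perf_counter() - started) * 1000, 3),
            })
            return result
        return wrapper
    return decorator


@audited(agent="claims-fraud")
def get_claim(claim_id: str, token: str) -> dict:
    """Hypothetical read-only tool."""
    return {"claim_id": claim_id, "amount": 1200.0}


claim = get_claim(claim_id="CLM-0847", token="s3cret")
```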
⚠️ The most neglected production requirement
LLM outputs degrade silently. A model update, a prompt change, or a subtle shift in input distribution can reduce quality by 30% without any error being thrown. Without evaluation infrastructure, you won't know until a client calls. Evaluation is not testing; it's continuous quality monitoring.
📊 RAGAS: RAG Evaluation Framework
4 metrics that catch the ways RAG fails in production
Faithfulness (target: >0.92): Is every claim in the output supported by the retrieved context? Catches hallucinations, where the model invents facts not in the source. Most critical metric.
Answer Relevancy (target: >0.85): Does the answer actually address the question? Catches cases where the model retrieves relevant context but answers a different question.
Context Precision (target: >0.80): Is the retrieved context actually useful? Low precision means you're retrieving noise, and the model is working around irrelevant documents.
Context Recall (target: >0.80): Did retrieval find all the relevant information? Low recall means the model is answering from incomplete context: confident but wrong.
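The four targets above can be checked with a simple gate. This sketch assumes the scores have already been computed (by RAGAS or otherwise) and arrive as a plain dictionary. `TARGETS` mirrors the thresholds listed above, while `failing_metrics` and the sample scores are illustrative.

```python
from typing import Dict

# Targets as listed above; a metric must be strictly above its target to pass.
TARGETS: Dict[str, float] = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.80,
}


def failing_metrics(scores: Dict[str, float]) -> Dict[str, float]:
    """Return every metric that is missing or not strictly above its target."""
    return {name: scores.get(name, 0.0)
            for name, target in TARGETS.items()
            if scores.get(name, 0.0) <= target}


scores = {"faithfulness": 0.95, "answer_relevancy": 0.81,
          "context_precision": 0.88, "context_recall": 0.84}
failures = failing_metrics(scores)
```

Treating a missing metric as a failure is a deliberate choice here: an evaluation run that silently skips a metric should not pass the gate.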
🧪 Golden Dataset: The Evaluation Foundation
Build this before you deploy anything to production
# Golden dataset structure
golden_dataset = [
    {
        "input": "Analyse claim CLM-0847 for fraud",
        "expected_output": {
            "fraud_score": 0.91,
            "signals": ["3rd fire claim 24mo", "assessor link"],
            "recommendation": "decline_and_refer_SIU",
        },
        "grading_criteria": {
            "score_range": (0.85, 1.0),
            "required_signals": ["fire_claim_frequency"],
            "acceptable_recommendations": ["decline_and_refer_SIU"],
        },
    },
    # 50–200 cases per agent
]

# Run on every deployment (eval_score comes from the evaluation run)
assert eval_score >= 0.92, "Deploy blocked"
📈 Continuous Evaluation Pipeline
Quality monitoring doesn't stop at deployment
Pre-deployment gate: Golden dataset evaluation runs automatically on every PR. Evaluation score below threshold → deployment blocked. No exceptions. This is your minimum quality bar.
Production sampling: 5% of live agent outputs evaluated continuously using RAGAS + LLM-as-judge. Score logged in Langfuse. Alert fires if 7-day average drops below threshold.
Human feedback loop: When humans override an AI recommendation, that's a training signal. Overrides collected, reviewed, and periodically used to update the golden dataset and retune prompts.
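The production-sampling step can be sketched in two small functions: one that deterministically selects roughly 5% of traces by hashing the trace ID, and one that checks a 7-day average against a threshold. Both functions, and the 0.92 default, are assumptions for illustration. The text above does not specify how sampling is implemented or which threshold the alert uses.

```python
import hashlib
from statistics import mean
from typing import List


def in_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically pick about `rate` of traces by hashing the trace ID."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000


def should_alert(daily_scores: List[float], threshold: float = 0.92,
                 window: int = 7) -> bool:
    """True when the mean of the last `window` daily scores falls below threshold."""
    recent = daily_scores[-window:]
    return len(recent) == window and mean(recent) < threshold
```

Hash-based sampling means the same trace is always in or out of the sample, which makes evaluation runs reproducible.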
📡 What Langfuse Captures
The complete picture of every agent interaction
Captured | Category
LLM call input/output | Every call
Model used · prompt version · temperature | Config
Input tokens · output tokens · cost | Cost
End-to-end latency · per-step latency | Perf
Tool calls made · tool call results | Tools
RAGAS scores · faithfulness · relevancy | Quality
User feedback · human override events | Feedback
Guardrail trigger events · blocked inputs | Safety
Session ID · user ID · tenant ID | Identity
🚨 Alert Architecture
What triggers an alert in production
P0 (page immediately): Error rate >5% on any agent · Cost spike >3× daily baseline · RAGAS faithfulness below 0.85 · Any PII leak detected
P1 (notify in 30 min): P95 latency >10s · Human override rate >20% (signals AI recommendations degraded) · Evaluation score trending down 3 days
P2 (daily digest): Token cost increase >15% week-over-week · New tool call patterns · Unusual input distribution shifts
Weekly review: Model performance report · Cost attribution by agent · Human feedback summary · Evaluation trend analysis
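The numeric alert rules above translate directly into a classifier. This is a sketch under assumptions: the metric key names (`error_rate`, `cost_vs_daily_baseline`, and so on) are invented for the example, and only the P2 rule with a numeric threshold is encoded.

```python
from typing import Dict, Optional


def classify(m: Dict[str, float]) -> Optional[str]:
    """Map a metrics snapshot to the highest applicable severity, or None."""
    if (m.get("error_rate", 0.0) > 0.05
            or m.get("cost_vs_daily_baseline", 1.0) > 3.0
            or m.get("faithfulness", 1.0) < 0.85
            or m.get("pii_leaks", 0) > 0):
        return "P0"  # page immediately
    if (m.get("p95_latency_s", 0.0) > 10.0
            or m.get("human_override_rate", 0.0) > 0.20
            or m.get("eval_down_days", 0) >= 3):
        return "P1"  # notify within 30 minutes
    if m.get("token_cost_wow_increase", 0.0) > 0.15:
        return "P2"  # daily digest
    return None
```

Checking severities from highest to lowest means a snapshot that trips several rules is reported once, at its worst level.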
💰 Cost Architecture: The Hidden Production Challenge
Uncontrolled LLM costs are a common reason enterprise AI projects stall at scale
Model tiering: Route to the cheapest model that meets the quality bar. Classification → Haiku (cheap). Reasoning → Sonnet (mid). Complex synthesis → Opus (expensive). 60–80% cost reduction on mixed workloads.
Semantic caching: Cache LLM responses for semantically similar inputs. Langfuse + Redis. Cache hit rate 30–50% on typical enterprise workloads. Cost reduction proportional to cache hit rate.
Token budgets: Hard token limits per agent per call. Context window management: summarise old context rather than passing full history. Prompt compression for repetitive patterns.
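A toy calculation shows why model tiering matters. The prices below are made-up relative units, not real vendor pricing, and the workload mix is hypothetical. The saving you actually see depends entirely on your own mix and prices. The comparison baseline here is "send everything to the top tier".

```python
from typing import Dict

TIER_BY_TASK: Dict[str, str] = {
    "classification": "haiku",
    "reasoning": "sonnet",
    "synthesis": "opus",
}
# Hypothetical relative price per 1K tokens (NOT real pricing).
PRICE_PER_1K: Dict[str, float] = {"haiku": 1.0, "sonnet": 4.0, "opus": 20.0}


def route(task_kind: str) -> str:
    """Pick the tier for a task kind; unknown kinds fall back to the mid tier."""
    return TIER_BY_TASK.get(task_kind, "sonnet")


def workload_cost(calls: Dict[str, int], tokens_per_call: int = 1000) -> float:
    """Total cost of a workload given calls per task kind."""
    return sum(PRICE_PER_1K[route(kind)] * tokens_per_call / 1000 * n
               for kind, n in calls.items())


calls = {"classification": 700, "reasoning": 250, "synthesis": 50}
tiered = workload_cost(calls)
all_top_tier = sum(calls.values()) * PRICE_PER_1K["opus"]
saving = 1 - tiered / all_top_tier
```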
🛡 NeMo Guardrails: Production Safety Layer
Every agent input and output passes through guardrails
# guardrails config (Colang-style sketch; action names are illustrative)
define flow check input
  # Detect PII before LLM sees it
  $pii = execute pii_detection(input)
  if $pii.detected:
    $input = execute redact_pii(input)
    log_event("pii_redacted", $pii.types)
  # Block prompt injection attempts
  $injection = execute injection_check(input)
  if $injection.detected:
    log_event("injection_blocked")
    return "Request blocked by safety policy"

define flow check output
  # Verify no hallucinated citations
  $faith = execute faithfulness_check(output)
  if $faith.score < 0.85:
    log_event("low_faithfulness", $faith.score)
    return rerun_with_stricter_prompt()
🔑 RBAC Architecture
Who can do what, enforced at the API layer
Role | Can View | Can Trigger | Can Approve
AI Consumer | Agent outputs | None | None
Operator | All outputs + traces | Standard workflows | None
Approver | Flagged items | Standard workflows | Assigned flags
Manager | Team scope | All workflows | All approvals
Admin | Everything + costs | Everything | Everything
AI Engineer | Traces + evals | Dev/test only | Eval gates
RBAC is enforced at the API Gateway layer, not in the agent or UI. Every request carries a JWT. Permissions are checked before any agent workflow is started. Tenant isolation: users can only see and act on their tenant's data.
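The "Can Trigger" column can be sketched as a lookup plus a tenant check. This is an illustrative assumption, not the gateway's real code. The claims dictionary stands in for an already-verified JWT payload, and the workflow class names (`standard`, `dev_test`) and the `*` wildcard are invented for the example.

```python
from typing import Dict, Set

ALL = "*"  # wildcard: may trigger any workflow class

# Mirrors the "Can Trigger" column of the role table.
TRIGGER_RIGHTS: Dict[str, Set[str]] = {
    "ai_consumer": set(),
    "operator": {"standard"},
    "approver": {"standard"},
    "manager": {ALL},
    "admin": {ALL},
    "ai_engineer": {"dev_test"},
}


def can_trigger(claims: Dict[str, str], workflow_class: str,
                workflow_tenant: str) -> bool:
    """Check tenant isolation first, then the role's trigger rights."""
    if claims.get("tenant_id") != workflow_tenant:
        return False
    allowed = TRIGGER_RIGHTS.get(claims.get("role", ""), set())
    return ALL in allowed or workflow_class in allowed
```

Unknown roles get an empty permission set, so the check fails closed.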
📋 Audit Trail Architecture
Every AI decision is permanently logged, because it has to be
What's logged: Agent triggered (who, when, why), inputs provided, outputs generated, tool calls made, human approval decisions, overrides, final action taken. Immutable: cannot be edited or deleted.
Why it matters: The FCA Consumer Duty expects firms to evidence good customer outcomes, which in practice means being able to explain financial decisions. GDPR Article 22 gives individuals the right to human intervention in solely automated decisions with legal or similarly significant effects. The ICO expects organisations to be able to show who accessed personal data. The audit trail is not optional.
Storage: OpenTelemetry traces → Grafana Tempo. Agent decisions → append-only PostgreSQL table with cryptographic hash chain. SIEM integration for security events. Retention: 7 years for regulated industries.
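A cryptographic hash chain makes tampering detectable: each entry's hash covers its own content plus the previous entry's hash, so editing any earlier record breaks every later link. The sketch below keeps the chain in a Python list for illustration. In the design described above it would be an append-only PostgreSQL table. The function names and record fields are assumptions.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # previous-hash value used for the first entry


def _digest(prev_hash: str, record: Dict[str, Any]) -> str:
    payload = json.dumps(record, sort_keys=True)  # stable serialisation
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


def append_entry(log: List[Dict[str, Any]], record: Dict[str, Any]) -> Dict[str, Any]:
    """Append a record whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"record": record, "prev_hash": prev_hash,
             "hash": _digest(prev_hash, record)}
    log.append(entry)
    return entry


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute every link; any edited or reordered entry fails the check."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if _digest(prev_hash, entry["record"]) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


audit_log: List[Dict[str, Any]] = []
append_entry(audit_log, {"agent": "fraud", "claim": "CLM-0847", "decision": "refer"})
append_entry(audit_log, {"actor": "investigator-7", "claim": "CLM-0847", "decision": "decline"})
```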
🐳 Container Architecture
Each layer deployed independently: scale what needs scaling
# docker-compose.prod.yml (simplified)
services:
  api-gateway:
    image: artlligence/api-gateway:latest
    deploy:
      replicas: 3    # horizontal scale
  fraud-detection-worker:
    image: artlligence/fraud-agent:v2.1.4
    deploy:
      replicas: 10   # scale with queue depth
    environment:
      MAX_TOKENS_PER_CALL: 1500
      FALLBACK_MODEL: claude-haiku-4
  temporal-worker:
    image: artlligence/temporal-worker:latest
    deploy:
      replicas: 5
  # Service definitions omitted for brevity:
  #   langfuse         observability
  #   nemo-guardrails  safety layer
  #   redis            caching + queues
  #   postgres         state + audit log
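🚀 CI/CD Pipeline: Agent Deployment
Never deploy a degraded agent to production
# .github/workflows/deploy-agent.yml
# Sketch: the scripts/ helpers are placeholders for your own tooling.
on:
  push:
    branches: [main]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/eval/    # golden dataset
      # RAG quality gate: fails (and blocks deploy) unless
      # faithfulness >= 0.92 and answer_relevancy >= 0.85
      - run: python scripts/ragas_gate.py --faithfulness 0.92 --answer-relevancy 0.85
  deploy-canary:
    needs: evaluate                # only if eval passes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh --traffic 5     # deploy to 5% of traffic
      - run: sleep 1800                          # wait 30 min
      # Fails unless error_rate < 0.01 and p95_latency < 5000ms
      - run: python scripts/canary_gate.py --max-error-rate 0.01 --max-p95-ms 5000
  deploy-full:
    needs: deploy-canary
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh --traffic 100   # roll out to 100% traffic
      # Old version kept for 1h rollback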
☁ Cloud Deployment Options
Where to run depends on client requirements
Cloud-native (default): GCP Cloud Run + GKE, or AWS ECS + EKS, or Azure AKS. Managed Temporal Cloud. Langfuse Cloud. Fastest deployment, lowest ops burden. Most clients start here.
Private cloud / VPC: Deploy entire stack in client's VPC. LLM calls stay within their network using Vertex AI or Azure OpenAI. Required for financial services, government, healthcare with strict data residency.
On-premise / air-gapped: Local LLM models (Ollama / vLLM) for clients where data cannot leave premises. Performance trade-off vs cloud models. Defence, intelligence, critical national infrastructure.
📅 12 weeks from signed contract to production MVP
This is the standard ARTlligence delivery timeline for any single OS product. After Week 12, the client has a production-ready system connected to their data, with full observability, governance, and an ongoing improvement cycle. Subsequent OS products on the same platform are faster because the platform is already built.
1. Weeks 1–2: Discovery & Data Architecture
Team: Solution Architect + 1 AI Engineer
Stakeholder interviews to define exact use cases and success metrics. Data source mapping: what exists, what's accessible, what quality it's in. Integration feasibility: SAP/Salesforce/SharePoint connectivity tested. RBAC and data governance requirements documented. Security and compliance baseline. Output: Technical Architecture Document + Integration Plan + Evaluation Criteria.
2. Weeks 3–4: Platform Foundation
Team: 2 AI Engineers + DevOps
Temporal cluster deployed. Langfuse observability live. NeMo Guardrails configured for sector. MCP connectors built and tested against live data. RBAC and SSO integrated with client identity provider. PostgreSQL audit schema defined. Kafka topics configured. First golden dataset built (50 cases) with client domain experts. CI/CD pipeline established.
3. Weeks 5–8: Agent Development
Team: 3 AI Engineers + 1 Domain Expert
Agents built one at a time, highest value first. Each agent: schema defined → prompt engineered → evaluation score >0.92 → HITL workflow → integration tested → code reviewed → deployed to staging. Target: 5–8 agents per week with quality gates. Weekly demo to stakeholders with live data. Evaluation score tracked: no agent ships below threshold. Human-in-the-loop flows tested with actual approvers.
4. Weeks 9–10: Integration & Load Testing
Team: 2 AI Engineers + QA
End-to-end workflow testing with production data volumes. Load testing: target throughput × 3. Failure injection: kill random services, verify recovery. Cost profiling: token cost per workflow documented. Security penetration test on agent API. Canary deployment to 5% of real traffic. Monitor evaluation scores on live data to confirm alignment with the golden dataset.
5. Weeks 11–12: Production Launch & Handover
Team: Full team + Client operations team
Full production rollout. Runbook documentation. On-call escalation procedures. Client operations team trained on Temporal UI, Langfuse, and alert handling. Feedback collection mechanism live. 30-day post-launch hypercare: ARTlligence on-call for P0/P1 incidents. Ongoing improvement cycle agreed: monthly evaluation reviews, quarterly model updates, continuous golden dataset expansion.
Cost architecture for enterprise AI systems
💷 Build Cost: Indicative Ranges
Typical ARTlligence engagement structure
Engagement Type | Scope | Investment
Proof of Value | 2 agents · 4 weeks · single integration | £40K–£80K
Single OS MVP | 8–12 agents · 12 weeks · 3–5 integrations | £200K–£400K
Enterprise Platform | Full OS · 12+ agents · all integrations | £500K–£1.2M
Multi-OS Programme | 3+ OS products on shared platform | £1M–£3M+
💰 LLM Running Cost: At Scale
Monthly operational cost examples (Claude Sonnet 4)
Workload | Volume/day | Monthly LLM Cost
Insurance claims triage | 500 claims | £800–£2,000
Document intelligence | 2,000 docs | £2,000–£5,000
Customer service AI | 5,000 queries | £3,000–£8,000
Full enterprise OS | 10,000+ ops | £8,000–£25,000
Model tiering (routing simple tasks to Haiku) and semantic caching typically reduce LLM costs by 50–70% vs worst-case estimates. Cost per output tracked to the penny in Langfuse.
📈 ROI Framing for Clients
How to present the investment case: make it undeniable
Payback period: For InsuranceOS, fraud prevention of £2.1M/month against a £300K build cost means payback in under a week of steady-state operation (£300K ÷ £2.1M/month ≈ 0.14 months). Frame the investment as a fraction of Year 1 value, not as a technology cost.
Cost avoidance vs revenue: Both matter. ManufacturingOS downtime reduction (£12M/year avoided) is harder to see than revenue but easier to quantify precisely. Use real production data from the discovery phase.
Ongoing value: Unlike a software licence, AI systems improve over time. The golden dataset grows. Models improve. Human feedback refines recommendations. Year 3 value > Year 1 value from the same investment.
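The payback arithmetic is simple enough to keep in a helper so the numbers presented to a client can be reproduced. This sketch uses the InsuranceOS figures quoted in this section (£300K build, £2.1M/month of fraud prevented). It assumes steady-state value from day one and ignores build time and ramp-up, both of which lengthen real-world payback.

```python
def payback_days(build_cost: float, monthly_value: float) -> float:
    """Days of steady-state operation needed for cumulative value to equal build cost."""
    daily_value = monthly_value * 12 / 365
    return build_cost / daily_value


days = payback_days(build_cost=300_000, monthly_value=2_100_000)
months = 300_000 / 2_100_000
```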
Team structure for production AI delivery
🏗 Solution Architect
1 per engagement · ARTlligence lead
Owns the technical architecture. Runs discovery. Defines integration patterns. Ensures platform decisions are sound at scale. Presents to client CTOs. Reviews all agent contracts before build. Accountable for delivery.
System design · LLM architecture · Enterprise integration · Client communication
🤖 AI Engineer
2–4 per engagement · core build team
Builds agents. Engineers prompts. Writes evaluation tests. Builds MCP connectors. Implements Temporal workflows. Owns RAGAS scores for their agents. Fixes quality issues found in production sampling.
Python / TypeScript · LangGraph · Temporal · Prompt engineering · Evaluation
🔧 MLOps / DevOps
1 per programme · shared across engagements
Owns platform infrastructure: Kubernetes, CI/CD, Langfuse, Temporal cluster, monitoring. Manages deployments, rollbacks, scaling. On-call for infrastructure incidents. Builds and maintains the deployment pipeline.
Kubernetes · Terraform · GitHub Actions · Grafana · Incident response
🎓 Skills to Build or Hire
The specific capability gaps that separate prototype shops from production delivery firms
Skill | Why Critical | How to Build | Priority
Temporal.io | Durable workflow orchestration is non-negotiable at enterprise scale | Temporal.io docs + build a HITL workflow from scratch | P0
RAGAS Evaluation | Without eval, you can't prove quality or detect degradation | RAGAS docs + build golden dataset for one agent | P0
Langfuse LLMOps | Every enterprise client will ask "how do you monitor this in production" | Langfuse self-hosted + instrument one real agent | P0
MCP Server Building | SAP and Oracle integrations require custom MCP servers | Build a custom MCP server for one internal tool | P1
NeMo Guardrails | Every financial/healthcare/government client requires it | NeMo docs + configure for one sector use case | P1
Enterprise Kubernetes | Production systems require proper container orchestration | GKE or EKS with autoscaling + deploy Temporal on it | P1
🎯 When a CTO says "cool prototype": here's exactly what to say
Don't defend the demo. Agree with them and then go further. The demo is meant to show you what's possible, not what we'd ship. Here's the platform we use to turn that into something you'd bet your operations on.
The 4 objections and the exact responses
Objection 1: "This is just a demo; it's not connected to our real data"
Response: "Correct, and that's deliberate. The demo shows you the intelligence layer. The integration layer is our Week 3–4 work. We connect to your SAP / Salesforce / SharePoint via MCP, a standardised protocol that means we're not writing custom integration code for every system. We've done this connection before for [comparable client]. The first integration usually takes 2–3 weeks. After that, your agents are working on your actual data, in your actual environment."
Back it up with: Show the MCP connector architecture. Name the specific systems they use and confirm you have connectors for them. Offer a paid technical feasibility sprint (Weeks 1–2) before full commitment.
Objection 2: "How do I know it's giving correct answers?"
Response: "This is the right question, and it's why we build an evaluation framework before we deploy a single agent. We work with your domain experts to build a golden dataset: 50–200 cases with known-correct answers. Every agent must score above a threshold on this dataset before it ships to production. And we run RAGAS evaluation on 5% of live outputs continuously, so if quality degrades after a model update or data shift, we know within 24 hours, before your users do."
Back it up with: Show the RAGAS metrics screen. Explain the 0.92 faithfulness threshold. Describe what happens when the alert fires. This is the answer that separates you from every other AI vendor.
Objection 3: "What happens when it makes a mistake?"
Response: "Three things. First: our human-in-the-loop architecture means no consequential action is taken without a human approving it. The AI recommends; your authorised people decide. Second: every AI decision is logged permanently with full context. If something goes wrong, we can trace exactly what the agent saw, what it reasoned, and what it recommended. Third: overrides are training signals. When your team overrides a recommendation, that feeds back into the evaluation dataset. The system gets better at the specific cases your team disagrees with."
Back it up with: Show the Temporal HITL code pattern. Show the audit log schema. Explain the override feedback loop. This answer addresses the real fear: liability.
Objection 4: "We're not ready; we need to sort out our data first"
Response: "That's often true, and sorting out data quality is actually part of what we do in the discovery phase. In our experience, the first agent we build forces clarity on data quality issues that have been invisible for years. We don't need perfect data to start; we need to understand what you have. The data quality issues that would block production become visible in Weeks 1–2, and we either work around them or help you fix them. The worst outcome is discovering them on Day 1 of Week 12."
Back it up with: Propose a paid 2-week discovery sprint. Low financial risk for them, high information value. Most organisations can approve £20–40K without board sign-off. Discovery converts to full engagement at ~70% rate.
The positioning shift: from vendor to partner
❌ Don't position as
"We build AI demos and prototypes." "We use Claude / OpenAI to build things." "We can build you a chatbot." These frame you as a commodity vendor competing on day rate.
✅ Position as
"We deliver production enterprise AI systems with defined evaluation frameworks, full observability, and governance built in. The demos show what's possible. The platform is what makes it real." Compete on reliability, not capability.
💡 The killer differentiator
"Every other AI vendor will show you a demo. We're the only ones who can also show you the evaluation framework that tells you if it's working." Evaluation infrastructure is the moat. Most AI shops don't have it.