6
Production Dimensions
100%
Of demos fail in prod
12 wk
From prototype to MVP
£400K
Typical first engagement
ARTlligence
Closes the gap
The real objection is not "this isn't impressive"
When a CTO says "cool prototype" they mean: I don't know how to go from this to something I can run my business on. They're not dismissing the value; they're identifying the gap between a compelling demo and a system they'd stake their operations on. Your job is to close that gap, credibly.
Why every AI prototype looks the same to a CTO
What they see
Impressive demo with hardcoded scenarios, simulated agent responses, mock data, and a beautiful UI that works perfectly for the 3 use cases you prepared.
→
What they need
A system connected to their actual data, handling their actual edge cases, failing gracefully, auditable, secure, and operated by their team.
What they see
Agents that respond in 2 seconds in a demo. Clean outputs. Every decision looks right.
→
What they need
Agents that handle 10,000 requests/day, degrade gracefully under load, retry on failure, and tell you when they're uncertain.
What they see
A live agent log that looks like decisions are being made. Hard to tell what's real vs simulated.
→
What they need
Complete observability: every LLM call traced, every decision logged with cost and latency, evaluation metrics tracked, and alerts when quality degrades.
The 6 Production Dimensions
Every dimension must be solved. Missing one breaks the system.
Agent Infrastructure: Durable orchestration, fault tolerance, stateful workflows that survive restarts
Real Integrations: Live connections to enterprise systems: SAP, Salesforce, SharePoint, ServiceNow
Evaluation Framework: Continuous quality measurement, because you need to know if it's right
LLMOps & Observability: Full trace capture, cost attribution, drift detection, alerting
Security & Governance: RBAC, audit trails, guardrails, human-in-the-loop approval gates
Production Deployment: Containerised, versioned, scalable, monitored, with rollback
The ARTlligence position
The 20 OS products prove we understand the domain and the value. The Platform Architecture is what closes the sale. We're not selling demos; we're selling a defined, deliverable path from prototype to production enterprise system.
What the prototypes have vs what production needs
| Dimension | Prototype (Current) | Production (Required) | Technology |
|---|---|---|---|
| Agent Orchestration | Hardcoded if/else, direct LLM calls | Temporal/LangGraph: durable, stateful, retryable workflows | Temporal, LangGraph |
| Data & Integrations | Mock data, simulated responses | Live MCP connectors to SAP, Salesforce, SharePoint, ServiceNow | MCP, REST APIs |
| State Management | In-memory, lost on restart | Persistent state in Redis/Postgres: workflows survive crashes | Redis, PostgreSQL |
| Message Queue | Direct agent-to-agent calls | Kafka/Redis queues: decoupled, buffered, replayable | Kafka, Redis Streams |
| Evaluation | No quality measurement | RAGAS continuous evaluation: faithfulness, relevancy, accuracy tracked | RAGAS, custom evals |
| Observability | Simulated log stream | Langfuse/LangSmith: every LLM call traced, cost attributed, latency tracked | Langfuse, LangSmith |
| Authentication & RBAC | None (open access) | SSO + RBAC: who can trigger which agents and see which outputs | Auth0, Entra ID |
| Audit Trail | Nothing logged permanently | Immutable audit log: every decision with full context, actor, timestamp | OpenTelemetry, SIEM |
| Guardrails | No input/output safety | NeMo Guardrails: PII detection, topic restriction, hallucination mitigation | NeMo, Lakera Guard |
| Human-in-the-Loop | Mentioned in UI only | Temporal signals: workflow pauses, sends approval request, waits for human | Temporal, webhooks |
| Scalability | Single thread, single user | Kubernetes horizontal scaling for 10,000+ concurrent agent tasks | K8s, Docker |
| Error Handling | Silent failures, no retries | Dead letter queues, exponential backoff, circuit breakers, fallback models | Resilience4j, custom |
| Cost Control | Unknown cost per operation | Token budget per agent, cost attribution per workflow, budget alerts | Langfuse, custom |
| Multi-tenancy | Single tenant only | Tenant isolation: separate data, agent config, and cost attribution per client | Custom, row-level security |
| Deployment | Netlify static HTML | CI/CD pipeline, versioned agents, blue-green deployment, rollback | GitHub Actions, ArgoCD |
The ARTlligence Platform Stack: 7 Layers
Every enterprise AI system is built on these 7 layers. The same stack powers all 20 OS products. Build the platform once, then deploy any OS on top of it.
Layer 7: Enterprise Presentation
The dashboards and UIs (what we've built). React / HTML. Calls the Agent API. Real-time updates via WebSocket. SSO-authenticated.
React / Next.js · Tailwind CSS · WebSocket · Auth0 / Entra SSO · shadcn/ui
↓
Layer 6: Agent API Gateway
FastAPI service. Authenticates requests. Validates RBAC. Routes to Temporal workflows. Returns structured results. Rate-limits per tenant.
FastAPI · JWT / OAuth2 · RBAC middleware · Rate limiting · Request validation
↓
Layer 5: Workflow Orchestration
Temporal for durable, stateful, retryable workflows. LangGraph for agentic reasoning loops. Agents are Activities inside Temporal Workflows: they can fail, retry, wait for signals (human approval), and recover from crashes.
Temporal.io · LangGraph · Google ADK · Signals (HITL) · Saga patterns · Compensation logic
↓
Layer 4: Agent Runtime
Individual agents as stateless Python/TypeScript functions. Input schema → reasoning → tool calls → output schema. NeMo Guardrails wrap every agent. RAGAS eval on every output. Cost budget enforced per agent call.
Python agents · NeMo Guardrails · RAGAS evaluation · Token budgets · Fallback models · Output validation
↓
Layer 3: Model & Tool Layer
LLM routing: Claude 3.5 Sonnet for reasoning, Haiku for classification, GPT-4o for multimodal. MCP for standardised tool connectivity. RAG retrieval from vector store. Structured tool calls with JSON schema validation.
Claude Sonnet 4 · GPT-4o · Gemini 1.5 Pro · MCP connectors · Pinecone / Qdrant · Model routing
↓
Layer 2: Data & Integration Layer
Enterprise connectors: SAP, Salesforce, SharePoint, ServiceNow, Jira, Confluence, custom ERPs. Kafka for real-time event streaming. PostgreSQL for structured state. Redis for caching and queues. S3/Blob for document storage.
Kafka · PostgreSQL · Redis · SAP connector · Salesforce API · SharePoint MCP · ServiceNow
↓
Layer 1: Observability & Security Foundation
OpenTelemetry for all traces. Langfuse for LLM observability. Every agent call: input, output, model, tokens, cost, latency. Immutable audit log. PII scrubbing before any LLM call. Anomaly alerts on cost/quality drift.
Langfuse · OpenTelemetry · Prometheus · Grafana · PII scrubbing · Audit log (immutable) · Alerting
Production agent design principles
Agent Contract Pattern
Every production agent has a defined contract, not just a prompt

    # Production agent contract (sketch assuming Pydantic models;
    # BaseAgent, ClaimType and Signal are defined elsewhere in the platform)
    from pydantic import BaseModel, Field

    class ClaimInput(BaseModel):
        # Defined inputs, validated at runtime
        claim_id: str
        claimant_id: str
        amount: float
        claim_type: ClaimType

    class FraudScore(BaseModel):
        # Defined outputs, validated before return
        score: float = Field(ge=0.0, le=1.0)  # 0.0–1.0
        signals: list[Signal]
        recommendation: str
        confidence: float
        requires_human: bool

    class FraudDetectionAgent(BaseAgent):
        input_schema = ClaimInput
        output_schema = FraudScore
        # Hard constraints
        max_tokens = 1500
        max_latency_ms = 3000
        fallback_model = "claude-haiku-4"
        requires_guardrail = True
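To make the contract idea concrete, here is a minimal, standard-library-only sketch of how an output contract might be enforced before a result leaves the agent. The field names mirror the contract above; `validate_output` and its error messages are hypothetical names chosen for illustration, not part of any existing ARTlligence code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FraudScore:
    """Output contract: field names mirror the agent contract above."""
    score: float                                  # expected range 0.0-1.0
    signals: list = field(default_factory=list)
    recommendation: str = ""
    confidence: float = 0.0
    requires_human: bool = False


def validate_output(raw: dict) -> FraudScore:
    """Hypothetical helper: reject any agent output that breaks the contract."""
    result = FraudScore(**raw)  # TypeError on unknown fields or a missing score
    for name in ("score", "confidence"):
        value = getattr(result, name)
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value!r}")
    if not result.recommendation:
        raise ValueError("recommendation must not be empty")
    return result


ok = validate_output({
    "score": 0.91,
    "signals": ["fire_claim_frequency"],
    "recommendation": "decline_and_refer_SIU",
    "confidence": 0.8,
    "requires_human": True,
})
```

An out-of-range score such as `1.7` raises `ValueError` instead of reaching the caller, which is the point of a contract: malformed output fails loudly at the agent boundary.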
The 5 Agentic Patterns
Choose the right pattern for each use case
Sequential: Agent A → Agent B → Agent C. Use for: compliance checks, document processing, report generation. Predictable, auditable, easy to test.
ReAct (Reasoning + Acting): Think → Act → Observe → Think. Use for: fraud detection, anomaly investigation, research. Handles uncertainty well.
Planning + Execution: Plan all steps first, then execute. Use for: production scheduling, logistics routing, complex multi-step tasks.
Multi-Agent Collaboration: Specialist agents with a coordinator. Use for: 100+ agent systems, parallel research, complex workflows requiring different expertise.
Reflection: Agent reviews its own output before returning. Use for: drafting, legal/medical content, any output where quality matters more than speed.
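Two of these patterns are simple enough to sketch in a few lines. The following is an illustration only: the stub lambdas stand in for real LLM calls, and `run_sequential` and `run_with_reflection` are hypothetical helper names, not functions from any framework named in this document.

```python
from typing import Callable

Agent = Callable[[str], str]


def run_sequential(agents: list[Agent], payload: str) -> str:
    """Sequential pattern: each agent's output is the next agent's input."""
    for agent in agents:
        payload = agent(payload)
    return payload


def run_with_reflection(draft: Agent, critique: Callable[[str], bool],
                        revise: Agent, task: str, max_rounds: int = 3) -> str:
    """Reflection pattern: review the output, revise until it passes or rounds run out."""
    output = draft(task)
    for _ in range(max_rounds):
        if critique(output):
            break
        output = revise(output)
    return output


# Stub agents standing in for LLM calls.
extract = lambda text: text.strip()
classify = lambda text: f"[invoice] {text}"
summarise = lambda text: text.upper()

pipeline_result = run_sequential([extract, classify, summarise], "  total due 40 GBP ")

reflected = run_with_reflection(
    draft=lambda task: "draft",
    critique=lambda out: out.endswith("!!"),  # accepts the output after two revisions
    revise=lambda out: out + "!",
    task="write summary",
)
```

The `max_rounds` cap matters in practice: a reflection loop without one can burn tokens indefinitely on an output the critic never accepts.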
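Human-in-the-Loop: The Right Way
Not a UI checkbox, but a durable workflow pause that waits for a real human decision

    # Temporal HITL pattern (Temporal Python SDK).
    # fraud_check, notify_investigator and settle are activities defined elsewhere.
    from datetime import timedelta
    from temporalio import workflow

    @workflow.defn
    class ClaimsWorkflow:
        def __init__(self) -> None:
            self.decision = None

        @workflow.signal
        def investigator_decision(self, decision: str) -> None:
            self.decision = decision  # set when the human sends the signal

        @workflow.run
        async def run(self, claim):
            opts = {"start_to_close_timeout": timedelta(minutes=5)}
            # Step 1: AI triage
            fraud_score = await workflow.execute_activity(fraud_check, claim, **opts)
            # Step 2: Human gate, workflow PAUSES
            if fraud_score > 0.7:
                await workflow.execute_activity(notify_investigator, args=[claim, fraud_score], **opts)
                # Workflow sleeps until the signal arrives; raises asyncio.TimeoutError
                # after 48h, which is where escalation would be handled
                await workflow.wait_condition(lambda: self.decision is not None, timeout=timedelta(hours=48))
            # Resumes when human sends signal
            return await workflow.execute_activity(settle, args=[claim, self.decision], **opts)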
Why Temporal: The workflow survives server restarts. If the investigator takes 3 days to respond, the workflow is still waiting, exactly where it left off. No polling loops, no lost state.
Audit trail: Every pause, every signal received, every decision is recorded in Temporal's event history. Fully auditable for regulatory purposes.
Escalation: Timeout signals trigger escalation workflows automatically. If there is no response in 48h, the workflow escalates to a senior investigator.
Not just approvals: Same pattern for any blocking action: data corrections, ambiguity resolution, policy exceptions. The workflow pauses and waits for the human input it needs.
Temporal: Why It's the Right Choice
Durable execution: your workflows survive any failure
Durability: If a server crashes mid-workflow, Temporal replays from the last committed event. The workflow continues exactly where it stopped. No data loss, no zombie tasks.
Visibility: Every workflow execution is visible in the Temporal UI: what state it's in, what activities have run, what's waiting. Production debugging becomes possible.
Retry logic: Activities retry automatically with exponential backoff. Transient failures (rate limits, network blips) are handled without any code. Permanent failures trigger compensating actions.
Timers: Schedule an action 6 weeks from now. Trigger SLAs. Escalate after timeout. All natively, without cron jobs or polling.
Versioning: Update workflow logic without breaking in-flight workflows. Deploy new agent versions alongside old ones during migration.
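For readers unfamiliar with exponential backoff, the sketch below shows the kind of schedule a retry policy computes. It is a plain-Python illustration of the technique, not Temporal's implementation; `TransientError`, `backoff_delays` and `call_with_retry` are hypothetical names, and the default interval, factor and cap are assumed values.

```python
import random
import time


class TransientError(Exception):
    """Stands in for rate limits, timeouts and other retryable failures."""


def backoff_delays(initial: float = 1.0, factor: float = 2.0,
                   maximum: float = 30.0, attempts: int = 5) -> list[float]:
    """Delay before each retry: initial * factor**n, capped at maximum."""
    return [min(initial * factor ** n, maximum) for n in range(attempts - 1)]


def call_with_retry(fn, *, attempts: int = 5, initial: float = 1.0,
                    sleep=time.sleep, jitter: float = 0.1):
    """Retry fn on TransientError; re-raise once attempts are exhausted."""
    delays = backoff_delays(initial=initial, attempts=attempts)
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: the caller must compensate or fall back
            # Small random jitter avoids many workers retrying in lockstep.
            sleep(delays[attempt] * (1 + random.uniform(0, jitter)))


calls = {"n": 0}


def flaky_llm_call():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "ok"


result = call_with_retry(flaky_llm_call, sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the example instant to run; in a real worker the default `time.sleep` (or the orchestrator's own timer) would apply.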
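LangGraph: Agentic Reasoning Loops
For agents that need to reason, not just execute

    # LangGraph inside a Temporal Activity.
    # FraudState (with a `suspicious` attribute) and the node functions are defined elsewhere.
    from langgraph.graph import StateGraph, START, END
    from langgraph.checkpoint.memory import MemorySaver

    def build_fraud_agent():
        graph = StateGraph(FraudState)
        # Nodes: each is a reasoning step
        graph.add_node("analyse_claim", analyse_claim)
        graph.add_node("check_history", check_history)
        graph.add_node("network_analysis", network_analysis)
        graph.add_node("score_and_report", score_report)
        graph.add_edge(START, "analyse_claim")
        # Conditional routing based on state
        graph.add_conditional_edges(
            "analyse_claim",
            lambda s: "network_analysis" if s.suspicious else "score_and_report",
        )
        # Remaining wiring is an assumption; the original sketch leaves it out
        graph.add_edge("network_analysis", "check_history")
        graph.add_edge("check_history", "score_and_report")
        graph.add_edge("score_and_report", END)
        # In-memory checkpointer for illustration; use a persistent one in production
        return graph.compile(checkpointer=MemorySaver())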
LangGraph is used inside Temporal Activities for reasoning loops. Temporal handles durability and HITL. LangGraph handles the agent's internal reasoning. Separation of concerns.
Message Queue Architecture: Why Agents Need Queues
Direct agent-to-agent calls create brittle systems. Queues create resilient ones.
Direct calls (prototype pattern)

    # What the demos do
    result = agent_a(input)
    result2 = agent_b(result)
    result3 = agent_c(result2)
    # If agent_b crashes → everything lost
    # If agent_b is slow → caller blocks
    # Can't retry, can't scale, can't replay

Queue-based (production pattern)

    # What production systems use
    kafka.publish("claims.new", claim_event)
    # Agent B consumes from queue
    # Independently scaled (10 instances)
    # Crash → message stays in queue
    # Slow → backpressure handled
    # Replay → re-process any message
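The behaviour described in those comments can be shown with an in-memory stand-in. This sketch uses Python's `queue.Queue` in place of Kafka, so it demonstrates the re-delivery and dead-letter idea only; it is not Kafka's API. `MAX_DELIVERIES`, `consume` and `score_claim` are hypothetical names.

```python
import queue

MAX_DELIVERIES = 3  # assumed limit before a message is parked for inspection


def consume(inbox: queue.Queue, dead_letters: list, handler) -> list:
    """Drain the inbox; failed messages are re-queued, then dead-lettered."""
    processed = []
    while not inbox.empty():
        message = inbox.get()
        try:
            processed.append(handler(message["body"]))
        except Exception as exc:
            message["deliveries"] += 1
            if message["deliveries"] >= MAX_DELIVERIES:
                dead_letters.append({**message, "error": str(exc)})
            else:
                inbox.put(message)  # the message survives the failure and is retried
    return processed


def score_claim(body: dict) -> str:
    """Stub consumer: fails on malformed claims."""
    if body["amount"] < 0:
        raise ValueError("negative amount")
    return f"{body['claim_id']}:scored"


inbox = queue.Queue()
for body in ({"claim_id": "CLM-1", "amount": 1200.0},
             {"claim_id": "CLM-2", "amount": -5.0}):
    inbox.put({"body": body, "deliveries": 0})

dead = []
done = consume(inbox, dead, score_claim)
```

The good claim is processed once; the bad one is attempted three times and then lands in `dead` with its error attached, rather than blocking the caller or disappearing silently.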
MCP is the integration game-changer
Model Context Protocol (MCP) is Anthropic's open standard for connecting AI to data sources and tools. Instead of building custom integrations for every enterprise system, MCP provides a standardised connector interface. One MCP server for SharePoint works with any agent. The ecosystem is growing fast: 200+ community MCP servers already built.
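MCP Connector Architecture
How MCP replaces custom integration code

    # MCP server for Salesforce (example, using FastMCP from the MCP Python SDK)
    import re
    from mcp.server.fastmcp import FastMCP

    salesforce_mcp = FastMCP("salesforce")

    @salesforce_mcp.tool()
    async def get_account(account_id: str):
        """Retrieve account details from Salesforce"""
        # Accept only 15- or 18-character Salesforce IDs so the value is safe to embed in the query
        if not re.fullmatch(r"[A-Za-z0-9]{15}(?:[A-Za-z0-9]{3})?", account_id):
            raise ValueError("invalid Salesforce account ID")
        # sf_client is the platform's Salesforce client, defined elsewhere.
        # SOQL needs an explicit field list; these three fields are examples.
        return await sf_client.query(
            f"SELECT Id, Name, Industry FROM Account WHERE Id = '{account_id}'"
        )

    # Agent uses the tool, no custom code needed (Agent is the platform's own wrapper)
    agent = Agent(
        tools=[salesforce_mcp, sharepoint_mcp, sap_mcp],
        # Agent automatically discovers available tools
        # and knows how to call them from schema
    )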
Enterprise Integration Catalogue
What's available via MCP today
| System | MCP Available | Auth Method |
|---|---|---|
| SharePoint Online | ✓ Ready | OAuth2 + Entra |
| Salesforce | ✓ Ready | OAuth2 |
| ServiceNow | ✓ Ready | API Key + OAuth |
| Jira / Confluence | ✓ Ready | API Token |
| SAP S/4HANA | Build | SAP BTP OAuth |
| SAP ERP (older) | Build | RFC / BAPI |
| Oracle ERP | Build | REST API |
| MS Dynamics 365 | ✓ Ready | Entra OAuth |
| PostgreSQL / SQL | ✓ Ready | Connection string |
| Custom REST APIs | Generic | Any |
Integration Security Principles
Every connector must follow these rules
Least privilege: MCP connectors are scoped to the minimum data required. A claims fraud agent can read claims; it cannot modify or delete. Permissions defined at connector level, not agent level.
No credentials in agent prompts: All credentials stored in secrets manager (Vault/AWS Secrets Manager). Injected at runtime. Never in code, never in prompts, never in logs.
Audit every tool call: Every MCP tool call logged: which agent, which tool, what parameters, what was returned, latency, cost. Data access auditable for compliance.
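A minimal sketch of how least privilege and per-call auditing can sit in one wrapper, assuming tools are plain Python callables. `ScopedConnector` is a hypothetical class for illustration and not part of the MCP SDK; a real connector would also record cost and pull credentials from a secrets manager.

```python
import time


class ScopedConnector:
    """Wraps tool functions and enforces an allow-list at connector level."""

    def __init__(self, name: str, tools: dict, allowed: set, audit_log: list):
        self.name = name
        self._tools = tools
        self._allowed = allowed
        self._audit = audit_log

    def call(self, agent: str, tool: str, **params):
        started = time.perf_counter()
        allowed = tool in self._allowed
        entry = {"agent": agent, "connector": self.name, "tool": tool,
                 "params": params, "allowed": allowed}
        try:
            if not allowed:
                raise PermissionError(f"{self.name}: '{tool}' is outside this connector's scope")
            entry["result"] = self._tools[tool](**params)
            return entry["result"]
        finally:
            entry["latency_ms"] = round((time.perf_counter() - started) * 1000, 3)
            self._audit.append(entry)  # logged whether the call succeeded or not


claims_db = {"CLM-0847": {"amount": 48000.0}}
audit_log = []
claims = ScopedConnector(
    name="claims",
    tools={"read_claim": lambda claim_id: claims_db[claim_id],
           "delete_claim": lambda claim_id: claims_db.pop(claim_id)},
    allowed={"read_claim"},  # read-only scope for the fraud agent
    audit_log=audit_log,
)

record = claims.call("fraud_agent", "read_claim", claim_id="CLM-0847")
try:
    claims.call("fraud_agent", "delete_claim", claim_id="CLM-0847")
except PermissionError as exc:
    denied = str(exc)
```

Because the scope lives on the connector, a prompt-injected agent asking for `delete_claim` is refused before the underlying function runs, and the refused attempt still appears in the audit log.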
The most neglected production requirement
LLM outputs degrade silently. A model update, a prompt change, or a subtle shift in input distribution can reduce quality by 30% without any error being thrown. Without evaluation infrastructure, you won't know until a client calls. Evaluation is not testing; it's continuous quality monitoring.
RAGAS: RAG Evaluation Framework
4 metrics that catch the ways RAG fails in production
Faithfulness (target: >0.92): Is every claim in the output supported by the retrieved context? Catches hallucinations, where the model invents facts not in the source. Most critical metric.
Answer Relevancy (target: >0.85): Does the answer actually address the question? Catches cases where the model retrieves relevant context but answers a different question.
Context Precision (target: >0.80): Is the retrieved context actually useful? Low precision means you're retrieving noise, and the model is working around irrelevant documents.
Context Recall (target: >0.80): Did retrieval find all the relevant information? Low recall means the model is answering from incomplete context: confident but wrong.
Golden Dataset: The Evaluation Foundation
Build this before you deploy anything to production

    # Golden dataset structure
    golden_dataset = [
        {
            "input": "Analyse claim CLM-0847 for fraud",
            "expected_output": {
                "fraud_score": 0.91,
                "signals": ["3rd fire claim 24mo", "assessor link"],
                "recommendation": "decline_and_refer_SIU"
            },
            "grading_criteria": {
                "score_range": (0.85, 1.0),
                "required_signals": ["fire_claim_frequency"],
                "acceptable_recommendations": ["decline_and_refer_SIU"]
            }
        }
        # 50–200 cases per agent
    ]
    # Run on every deployment (eval_score comes from the evaluation run)
    assert eval_score >= 0.92, "Deploy blocked"
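The structure above does not say how `eval_score` is computed. One plausible reading, shown here as an assumption, is the fraction of golden cases whose output satisfies every grading criterion. `grade`, `pass_rate` and `stub_agent` are hypothetical names, the second case is invented for the example, and signals are assumed to be machine-readable IDs.

```python
def grade(output: dict, criteria: dict) -> bool:
    """Pass only if score, signals and recommendation all meet the criteria."""
    low, high = criteria["score_range"]
    return (
        low <= output["fraud_score"] <= high
        and set(criteria["required_signals"]) <= set(output["signals"])
        and output["recommendation"] in criteria["acceptable_recommendations"]
    )


def pass_rate(agent, dataset: list) -> float:
    """Fraction of golden cases the agent passes."""
    passed = sum(grade(agent(case["input"]), case["grading_criteria"]) for case in dataset)
    return passed / len(dataset)


golden_cases = [
    {"input": "Analyse claim CLM-0847 for fraud",
     "grading_criteria": {"score_range": (0.85, 1.0),
                          "required_signals": ["fire_claim_frequency"],
                          "acceptable_recommendations": ["decline_and_refer_SIU"]}},
    {"input": "Analyse claim CLM-0001 for fraud",  # invented low-risk case
     "grading_criteria": {"score_range": (0.0, 0.2),
                          "required_signals": [],
                          "acceptable_recommendations": ["approve"]}},
]


def stub_agent(prompt: str) -> dict:
    """Stand-in for the real agent: flags every claim as fraud."""
    return {"fraud_score": 0.91,
            "signals": ["fire_claim_frequency", "assessor_link"],
            "recommendation": "decline_and_refer_SIU"}


eval_score = pass_rate(stub_agent, golden_cases)
deploy_allowed = eval_score >= 0.92
```

The stub passes the fraud case and fails the low-risk one, so the gate blocks it. This is why a golden dataset needs negative cases: an agent that flags everything looks perfect if the dataset only contains fraud.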
Continuous Evaluation Pipeline
Quality monitoring doesn't stop at deployment
Pre-deployment gate: Golden dataset evaluation runs automatically on every PR. Evaluation score below threshold → deployment blocked. No exceptions. This is your minimum quality bar.
Production sampling: 5% of live agent outputs evaluated continuously using RAGAS + LLM-as-judge. Score logged in Langfuse. Alert fires if 7-day average drops below threshold.
Human feedback loop: When humans override an AI recommendation, that's a training signal. Overrides collected, reviewed, and periodically used to update the golden dataset and retune prompts.
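The production-sampling step can be sketched as two small pieces: a deterministic 5% sampler and a 7-day rolling average with a threshold. This is an illustration under assumptions (hash-based sampling, daily mean scores); `is_sampled` and `RollingQualityMonitor` are hypothetical names, and in the stack described here the scores would be written to Langfuse rather than held in memory.

```python
import hashlib
from collections import deque


def is_sampled(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministic sampling: the same trace is always in or out of the sample."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000


class RollingQualityMonitor:
    """Tracks daily mean scores and flags a drop in the N-day average."""

    def __init__(self, threshold: float, window_days: int = 7):
        self.threshold = threshold
        self.daily_means = deque(maxlen=window_days)

    def record_day(self, scores: list) -> bool:
        """Add one day's sampled scores; return True if an alert should fire."""
        self.daily_means.append(sum(scores) / len(scores))
        window_full = len(self.daily_means) == self.daily_means.maxlen
        average = sum(self.daily_means) / len(self.daily_means)
        return window_full and average < self.threshold


monitor = RollingQualityMonitor(threshold=0.92)
healthy_week = [monitor.record_day([0.95, 0.94]) for _ in range(7)]
degraded = [monitor.record_day([0.80, 0.82]) for _ in range(3)]
```

With these numbers the alert fires on the second degraded day, not the first. A 7-day average smooths out noise but also delays detection, which is a reason to pair it with a faster per-day threshold for severe drops.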
What Langfuse Captures
The complete picture of every agent interaction
| Captured | Category |
|---|---|
| LLM call input/output | Every call |
| Model used · prompt version · temperature | Config |
| Input tokens · output tokens · cost | Cost |
| End-to-end latency · per-step latency | Perf |
| Tool calls made · tool call results | Tools |
| RAGAS scores · faithfulness · relevancy | Quality |
| User feedback · human override events | Feedback |
| Guardrail trigger events · blocked inputs | Safety |
| Session ID · user ID · tenant ID | Identity |
Alert Architecture
What triggers an alert in production
P0 (page immediately): Error rate >5% on any agent · Cost spike >3× daily baseline · RAGAS faithfulness below 0.85 · Any PII leak detected
P1 (notify in 30 min): P95 latency >10s · Human override rate >20% (signals AI recommendations degraded) · Evaluation score trending down 3 days
P2 (daily digest): Token cost increase >15% week-over-week · New tool call patterns · Unusual input distribution shifts
Weekly review: Model performance report · Cost attribution by agent · Human feedback summary · Evaluation trend analysis
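The P0 and P1 thresholds translate directly into rules. A minimal sketch, assuming a metrics snapshot is available as a dictionary; the key names and `classify_alerts` are hypothetical, and the 3-day trend rule is left out because it needs history rather than a snapshot.

```python
def classify_alerts(m: dict) -> dict:
    """Map a metrics snapshot to the P0/P1 conditions listed above."""
    p0 = []
    if m["error_rate"] > 0.05:
        p0.append("error rate above 5%")
    if m["daily_cost"] > 3 * m["baseline_daily_cost"]:
        p0.append("cost spike above 3x daily baseline")
    if m["faithfulness"] < 0.85:
        p0.append("faithfulness below 0.85")
    if m["pii_leak_detected"]:
        p0.append("PII leak detected")

    p1 = []
    if m["p95_latency_s"] > 10:
        p1.append("P95 latency above 10s")
    if m["override_rate"] > 0.20:
        p1.append("human override rate above 20%")
    return {"P0": p0, "P1": p1}


snapshot = {
    "error_rate": 0.01, "daily_cost": 950.0, "baseline_daily_cost": 300.0,
    "faithfulness": 0.90, "pii_leak_detected": False,
    "p95_latency_s": 12.5, "override_rate": 0.08,
}
alerts = classify_alerts(snapshot)
```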
Cost Architecture: The Hidden Production Challenge
LLM costs are the #1 reason enterprise AI projects fail at scale
Model tiering: Route to cheapest model that meets quality bar. Classification → Haiku (cheap). Reasoning → Sonnet (mid). Complex synthesis → Opus (expensive). 60–80% cost reduction on mixed workloads.
Semantic caching: Cache LLM responses for semantically similar inputs. Langfuse + Redis. Cache hit rate 30–50% on typical enterprise workloads. Cost reduction proportional to cache hit rate.
Token budgets: Hard token limits per agent per call. Context window management: summarise old context rather than passing full history. Prompt compression for repetitive patterns.
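Tiering and token budgets are both small pieces of routing logic. The sketch below assumes a fixed task-type-to-tier mapping and a single budget covering prompt plus output; `TIERS`, `route` and `enforce_budget` are hypothetical names, and the tier labels are placeholders rather than real model identifiers.

```python
TIERS = {  # task type -> model tier, following the routing described above
    "classification": "haiku",
    "reasoning": "sonnet",
    "synthesis": "opus",
}


def route(task_type: str) -> str:
    """Pick the cheapest tier assumed to meet the quality bar for this task type."""
    return TIERS.get(task_type, "sonnet")  # unknown task types default to the mid tier


def enforce_budget(prompt_tokens: int, max_output_tokens: int, budget: int = 1500) -> int:
    """Return the output-token allowance left after the prompt, or raise if none."""
    remaining = budget - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone exceeds the token budget; summarise the context first")
    return min(max_output_tokens, remaining)


model = route("classification")
allowance = enforce_budget(prompt_tokens=1100, max_output_tokens=800)
```

Raising when the prompt alone exceeds the budget, instead of silently truncating, forces the caller to summarise old context, which is the behaviour the token-budget rule asks for.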
NeMo Guardrails: Production Safety Layer
Every agent input and output passes through guardrails

    # guardrails config (Colang-style sketch; log_event and the return lines are simplified pseudocode)
    define flow check input
      # Detect PII before LLM sees it
      $pii = execute pii_detection(input)
      if $pii.detected:
        $input = execute redact_pii(input)
        log_event("pii_redacted", $pii.types)
      # Block prompt injection attempts
      $injection = execute injection_check(input)
      if $injection.detected:
        log_event("injection_blocked")
        return "Request blocked by safety policy"

    define flow check output
      # Verify no hallucinated citations
      $faith = execute faithfulness_check(output)
      if $faith.score < 0.85:
        log_event("low_faithfulness", $faith.score)
        return rerun_with_stricter_prompt()
RBAC Architecture
Who can do what, enforced at the API layer
| Role | Can View | Can Trigger | Can Approve |
|---|---|---|---|
| AI Consumer | Agent outputs | None | None |
| Operator | All outputs + traces | Standard workflows | None |
| Approver | Flagged items | Standard workflows | Assigned flags |
| Manager | Team scope | All workflows | All approvals |
| Admin | Everything + costs | Everything | Everything |
| AI Engineer | Traces + evals | Dev/test only | Eval gates |
RBAC is enforced at the API Gateway layer, not in the agent or UI. Every request carries a JWT. Permissions checked before any agent workflow is started. Tenant isolation: users can only see and act on their tenant's data.
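As a sketch of the gateway check, assume the JWT has already been verified and decoded into a `claims` dictionary. The role names follow the table above; the workflow kinds ("standard", "sensitive", "dev") and the `Forbidden` and `authorise_trigger` names are assumptions made for the example.

```python
ROLE_CAN_TRIGGER = {
    "ai_consumer": set(),
    "operator": {"standard"},
    "approver": {"standard"},
    "manager": {"standard", "sensitive"},
    "admin": {"standard", "sensitive", "dev"},
    "ai_engineer": {"dev"},
}


class Forbidden(Exception):
    """Raised by the gateway before any workflow is started."""


def authorise_trigger(claims: dict, workflow_kind: str, resource_tenant: str) -> None:
    """Check tenant isolation first, then role permissions; raise Forbidden on failure."""
    if claims.get("tenant_id") != resource_tenant:
        raise Forbidden("cross-tenant access denied")
    role = claims.get("role")
    if workflow_kind not in ROLE_CAN_TRIGGER.get(role, set()):
        raise Forbidden(f"role {role!r} may not trigger {workflow_kind!r} workflows")


operator = {"sub": "u-17", "role": "operator", "tenant_id": "acme"}
authorise_trigger(operator, "standard", resource_tenant="acme")  # allowed, returns None

rejections = []
for kind, tenant in (("sensitive", "acme"), ("standard", "globex")):
    try:
        authorise_trigger(operator, kind, resource_tenant=tenant)
    except Forbidden as exc:
        rejections.append(str(exc))
```

Unknown roles fall through to an empty permission set, so the default is deny.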
Audit Trail Architecture
Every AI decision is permanently logged, because it has to be
What's logged: Agent triggered (who, when, why), inputs provided, outputs generated, tool calls made, human approval decisions, overrides, final action taken. Immutable: cannot be edited or deleted.
Why it matters: The FCA's Consumer Duty expects firms to evidence good customer outcomes, which in practice means being able to explain how a decision was reached. GDPR Article 22 restricts solely automated decisions with legal or similarly significant effects and gives individuals the right to human intervention. The ICO expects organisations to be able to demonstrate who accessed personal data. Audit trail is not optional.
Storage: OpenTelemetry traces → Grafana Tempo. Agent decisions → append-only PostgreSQL table with cryptographic hash chain. SIEM integration for security events. Retention: 7 years for regulated industries.
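The hash chain mentioned under Storage is what makes tampering detectable: each entry's hash covers the previous entry's hash, so editing any record breaks every hash after it. A minimal in-memory sketch follows; a production version would store the same three fields in the append-only table. `append_entry` and `verify_chain` are hypothetical names.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list, record: dict) -> dict:
    """Append a record, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"record": record, "prev_hash": prev_hash, "hash": _digest(prev_hash, record)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited, removed or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


audit = []
append_entry(audit, {"actor": "fraud_agent", "action": "score", "claim": "CLM-0847", "score": 0.91})
append_entry(audit, {"actor": "j.smith", "action": "override", "claim": "CLM-0847"})
intact = verify_chain(audit)

audit[0]["record"]["score"] = 0.10  # simulate someone editing history
tampered_detected = not verify_chain(audit)
```

The chain only proves integrity relative to its latest hash, so that hash should be anchored somewhere the database administrator cannot rewrite, for example the SIEM.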
Container Architecture
Each layer deployed independently, so you scale what needs scaling

    # docker-compose.prod.yml (simplified)
    services:
      api-gateway:
        image: artlligence/api-gateway:latest
        deploy:
          replicas: 3  # horizontal scale
      fraud-detection-worker:
        image: artlligence/fraud-agent:v2.1.4
        deploy:
          replicas: 10  # scale with queue depth
        environment:
          MAX_TOKENS_PER_CALL: 1500
          FALLBACK_MODEL: claude-haiku-4
      temporal-worker:
        image: artlligence/temporal-worker:latest
        deploy:
          replicas: 5
      # Supporting services (definitions omitted here):
      #   langfuse          observability
      #   nemo-guardrails   safety layer
      #   redis             caching + queues
      #   postgres          state + audit log
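CI/CD Pipeline: Agent Deployment
Never deploy a degraded agent to production

    # .github/workflows/deploy-agent.yml (sketch; the scripts/ paths are placeholders)
    on:
      push:
        branches: [main]
    jobs:
      evaluate:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: pytest tests/eval/  # golden dataset
          # RAG quality gate: fails if faithfulness < 0.92 or answer_relevancy < 0.85
          - run: python scripts/ragas_gate.py --faithfulness 0.92 --answer-relevancy 0.85
          # Block deploy if evals fail
      deploy-canary:
        needs: evaluate  # only if eval passes
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: ./scripts/deploy.sh --traffic 5
          # Watch for 30 min; fail if error_rate >= 0.01 or p95_latency >= 5000ms
          - run: ./scripts/canary_check.sh --minutes 30 --max-error-rate 0.01 --max-p95-ms 5000
      deploy-full:
        needs: deploy-canary
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: ./scripts/deploy.sh --traffic 100
          # Old version kept for 1h rollback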
Cloud Deployment Options
Where to run depends on client requirements
Cloud-native (default): GCP Cloud Run + GKE, or AWS ECS + EKS, or Azure AKS. Managed Temporal Cloud. Langfuse Cloud. Fastest deployment, lowest ops burden. Most clients start here.
Private cloud / VPC: Deploy entire stack in client's VPC. LLM calls stay within their network using Vertex AI or Azure OpenAI. Required for financial services, government, healthcare with strict data residency.
On-premise / air-gapped: Local LLM models (Ollama / vLLM) for clients where data cannot leave premises. Performance trade-off vs cloud models. Defence, intelligence, critical national infrastructure.
12 weeks from signed contract to production MVP
This is the standard ARTlligence delivery timeline for any single OS product. After Week 12, the client has a production-ready system connected to their data, with full observability, governance, and ongoing improvement cycle. Subsequent OS products on the same platform are faster, because the platform is already built.
1. Weeks 1–2: Discovery & Data Architecture
Team: Solution Architect + 1 AI Engineer
Stakeholder interviews to define exact use cases and success metrics. Data source mapping: what exists, what's accessible, what quality it's in. Integration feasibility: SAP/Salesforce/SharePoint connectivity tested. RBAC and data governance requirements documented. Security and compliance baseline. Output: Technical Architecture Document + Integration Plan + Evaluation Criteria.
2. Weeks 3–4: Platform Foundation
Team: 2 AI Engineers + DevOps
Temporal cluster deployed. Langfuse observability live. NeMo Guardrails configured for sector. MCP connectors built and tested against live data. RBAC and SSO integrated with client identity provider. PostgreSQL audit schema defined. Kafka topics configured. First golden dataset built (50 cases) with client domain experts. CI/CD pipeline established.
3. Weeks 5–8: Agent Development
Team: 3 AI Engineers + 1 Domain Expert
Agents built one at a time, highest value first. Each agent: schema defined → prompt engineered → evaluation score >0.92 → HITL workflow → integration tested → code reviewed → deployed to staging. Target: 5–8 agents per week with quality gates. Weekly demo to stakeholders with live data. Evaluation score tracked; no agent ships below threshold. Human-in-the-loop flows tested with actual approvers.
4. Weeks 9–10: Integration & Load Testing
Team: 2 AI Engineers + QA
End-to-end workflow testing with production data volumes. Load testing: target throughput × 3. Failure injection: kill random services, verify recovery. Cost profiling: token cost per workflow documented. Security penetration test on agent API. Canary deployment to 5% of real traffic. Monitor evaluation scores on live data to confirm alignment with golden dataset.
5. Weeks 11–12: Production Launch & Handover
Team: Full team + Client operations team
Full production rollout. Runbook documentation. On-call escalation procedures. Client operations team trained on Temporal UI, Langfuse, and alert handling. Feedback collection mechanism live. 30-day post-launch hypercare: ARTlligence on-call for P0/P1 incidents. Ongoing improvement cycle agreed: monthly evaluation reviews, quarterly model updates, continuous golden dataset expansion.
Cost architecture for enterprise AI systems
Build Cost: Indicative Ranges
Typical ARTlligence engagement structure
| Engagement Type | Scope | Investment |
|---|---|---|
| Proof of Value | 2 agents · 4 weeks · single integration | £40K–£80K |
| Single OS MVP | 8–12 agents · 12 weeks · 3–5 integrations | £200K–£400K |
| Enterprise Platform | Full OS · 12+ agents · all integrations | £500K–£1.2M |
| Multi-OS Programme | 3+ OS products on shared platform | £1M–£3M+ |
LLM Running Cost: At Scale
Monthly operational cost examples (Claude Sonnet 4)
| Workload | Volume/day | Monthly LLM Cost |
|---|---|---|
| Insurance claims triage | 500 claims | £800–£2,000 |
| Document intelligence | 2,000 docs | £2,000–£5,000 |
| Customer service AI | 5,000 queries | £3,000–£8,000 |
| Full enterprise OS | 10,000+ ops | £8,000–£25,000 |
Model tiering (routing simple tasks to Haiku) and semantic caching typically reduce LLM costs by 50–70% vs worst-case estimates. Cost per output tracked to the penny in Langfuse.
ROI Framing for Clients
How to present the investment case: make it undeniable
Payback period: For InsuranceOS, fraud prevention of £2.1M/month against a £300K build cost = payback in 6 weeks. Frame the investment as a fraction of Year 1 value, not as a technology cost.
Cost avoidance vs revenue: Both matter. ManufacturingOS downtime reduction (£12M/year avoided) is harder to see than revenue but easier to quantify precisely. Use real production data from the discovery phase.
Ongoing value: Unlike a software licence, AI systems improve over time. The golden dataset grows. Models improve. Human feedback refines recommendations. Year 3 value > Year 1 value from the same investment.
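Because a CTO or CFO will redo the payback arithmetic, it is worth running it before the meeting. The sketch below uses simple payback (build cost divided by monthly benefit) with the InsuranceOS figures quoted above; `payback_weeks` is a hypothetical helper, and the second calculation is only a what-if reading of the same figure as annual rather than monthly.

```python
def payback_weeks(build_cost: float, monthly_benefit: float,
                  weeks_per_month: float = 52 / 12) -> float:
    """Simple payback: weeks of benefit needed to cover the build cost."""
    return build_cost / monthly_benefit * weeks_per_month


# Figures exactly as quoted: £300K build, £2.1M/month fraud prevention.
as_quoted = payback_weeks(300_000, 2_100_000)

# What-if: the same £2.1M read as a yearly figure (£175K/month).
if_annual = payback_weeks(300_000, 2_100_000 / 12)
```

Taken literally, the quoted figures give a payback of well under one week, while the annual reading gives a little over seven weeks. A 6-week claim therefore rests on assumptions not stated here, such as ramp-up time or partial realisation of the benefit, and those should be made explicit in the client-facing version.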
Team structure for production AI delivery
Solution Architect
1 per engagement · ARTlligence lead
Owns the technical architecture. Runs discovery. Defines integration patterns. Ensures platform decisions are sound at scale. Presents to client CTOs. Reviews all agent contracts before build. Accountable for delivery.
System design · LLM architecture · Enterprise integration · Client communication
AI Engineer
2–4 per engagement · core build team
Builds agents. Engineers prompts. Writes evaluation tests. Builds MCP connectors. Implements Temporal workflows. Owns RAGAS scores for their agents. Fixes quality issues found in production sampling.
Python / TypeScript · LangGraph · Temporal · Prompt engineering · Evaluation
MLOps / DevOps
1 per programme · shared across engagements
Owns platform infrastructure: Kubernetes, CI/CD, Langfuse, Temporal cluster, monitoring. Manages deployments, rollbacks, scaling. On-call for infrastructure incidents. Builds and maintains the deployment pipeline.
Kubernetes · Terraform · GitHub Actions · Grafana · Incident response
Skills to Build or Hire
The specific capability gaps that separate prototype shops from production delivery firms
| Skill | Why Critical | How to Build | Priority |
|---|---|---|---|
| Temporal.io | Durable workflow orchestration is non-negotiable at enterprise scale | Temporal.io docs + build a HITL workflow from scratch | P0 |
| RAGAS Evaluation | Without eval, you can't prove quality or detect degradation | RAGAS docs + build golden dataset for one agent | P0 |
| Langfuse LLMOps | Every enterprise client will ask "how do you monitor this in production" | Langfuse self-hosted + instrument one real agent | P0 |
| MCP Server Building | SAP and Oracle integrations require custom MCP servers | Build a custom MCP server for one internal tool | P1 |
| NeMo Guardrails | Every financial/healthcare/government client requires it | NeMo docs + configure for one sector use case | P1 |
| Enterprise Kubernetes | Production systems require proper container orchestration | GKE or EKS with autoscaling + deploy Temporal on it | P1 |
When a CTO says "cool prototype", here's exactly what to say
Don't defend the demo. Agree with them and then go further: "The demo is meant to show you what's possible, not what we'd ship. Here's the platform we use to turn that into something you'd bet your operations on."
The 4 objections and the exact responses
Objection 1: "This is just a demo; it's not connected to our real data"
Response: "Correct, and that's deliberate. The demo shows you the intelligence layer. The integration layer is our Week 3–4 work. We connect to your SAP / Salesforce / SharePoint via MCP, a standardised protocol that means we're not writing custom integration code for every system. We've done this connection before for [comparable client]. The first integration usually takes 2–3 weeks. After that, your agents are working on your actual data, in your actual environment."
Back it up with: Show the MCP connector architecture. Name the specific systems they use and confirm you have connectors for them. Offer a paid technical feasibility sprint (Weeks 1–2) before full commitment.
Objection 2: "How do I know it's giving correct answers?"
Response: "This is the right question, and it's why we build an evaluation framework before we deploy a single agent. We work with your domain experts to build a golden dataset: 50–200 cases with known-correct answers. Every agent must score above a threshold on this dataset before it ships to production. And we run RAGAS evaluation on 5% of live outputs continuously, so if quality degrades after a model update or data shift, we know within 24 hours, before your users do."
Back it up with: Show the RAGAS metrics screen. Explain the 0.92 faithfulness threshold. Describe what happens when the alert fires. This is the answer that separates you from every other AI vendor.
Objection 3: "What happens when it makes a mistake?"
Response: "Three things. First, our human-in-the-loop architecture means no consequential action is taken without a human approving it. The AI recommends; your authorised people decide. Second, every AI decision is logged permanently with full context. If something goes wrong, we can trace exactly what the agent saw, what it reasoned, and what it recommended. Third, overrides are training signals. When your team overrides a recommendation, that feeds back into the evaluation dataset. The system gets better at the specific cases your team disagrees with."
Back it up with: Show the Temporal HITL code pattern. Show the audit log schema. Explain the override feedback loop. This answer addresses the real fear: liability.
Objection 4: "We're not ready; we need to sort out our data first"
Response: "That's often true, and sorting out data quality is actually part of what we do in the discovery phase. In our experience, the first agent we build forces clarity on data quality issues that have been invisible for years. We don't need perfect data to start; we need to understand what you have. The data quality issues that would block production become visible in Weeks 1–2, and we either work around them or help you fix them. The worst outcome is discovering them on Day 1 of Week 12."
Back it up with: Propose a paid 2-week discovery sprint. Low financial risk for them, high information value. Most organisations can approve £20–40K without board sign-off. Discovery converts to full engagement at ~70% rate.
The positioning shift: from vendor to partner
Don't position as
"We build AI demos and prototypes." "We use Claude / OpenAI to build things." "We can build you a chatbot." These frame you as a commodity vendor competing on day rate.
Position as
"We deliver production enterprise AI systems with defined evaluation frameworks, full observability, and governance built in. The demos show what's possible. The platform is what makes it real." Compete on reliability, not capability.
The killer differentiator
"Every other AI vendor will show you a demo. We're the only ones who can also show you the evaluation framework that tells you if it's working." Evaluation infrastructure is the moat. Most AI shops don't have it.