Production Dimensions

100%

Of demos fail in prod

12 wk

From prototype to MVP

£400K

Typical first engagement

ARTlligence

Closes the gap

🎯

The real objection is not "this isn't impressive"

When a CTO says "cool prototype" they mean: I don't know how to go from this to something I can run my business on. They're not dismissing the value — they're identifying the gap between a compelling demo and a system they'd stake their operations on. Your job is to close that gap, credibly.

Why every AI prototype looks the same to a CTO

What they see

Impressive demo with hardcoded scenarios, simulated agent responses, mock data, and a beautiful UI that works perfectly for the 3 use cases you prepared.

→

What they need

A system connected to their actual data, handling their actual edge cases, failing gracefully, auditable, secure, and operated by their team.

What they see

Agents that respond in 2 seconds in a demo. Clean outputs. Every decision looks right.

→

What they need

Agents that handle 10,000 requests/day, degrade gracefully under load, retry on failure, and tell you when they're uncertain.

What they see

A live agent log that looks like decisions are being made. Hard to tell what's real vs simulated.

→

What they need

Complete observability: every LLM call traced, every decision logged with cost and latency, evaluation metrics tracked, and alerts when quality degrades.

📋 The 6 Production Dimensions

Every dimension must be solved. Missing one breaks the system.

🏗

Agent Infrastructure — Durable orchestration, fault tolerance, stateful workflows that survive restarts

🔌

Real Integrations — Live connections to enterprise systems: SAP, Salesforce, SharePoint, ServiceNow

📊

Evaluation Framework — Continuous quality measurement — you need to know if it's right

📡

LLMOps & Observability — Full trace capture, cost attribution, drift detection, alerting

🔐

Security & Governance — RBAC, audit trails, guardrails, human-in-the-loop approval gates

⚙️

Production Deployment — Containerised, versioned, scalable, monitored, with rollback

💡

The ARTlligence position

The 20 OS products prove we understand the domain and the value. The Platform Architecture is what closes the sale. We're not selling demos — we're selling a defined, deliverable path from prototype to production enterprise system.

What the prototypes have vs what production needs

Dimension	Prototype (Current)	Production (Required)	Technology
Agent Orchestration	Hardcoded if/else, direct LLM calls	Temporal/LangGraph: durable, stateful, retryable workflows	Temporal, LangGraph
Data & Integrations	Mock data, simulated responses	Live MCP connectors to SAP, Salesforce, SharePoint, ServiceNow	MCP, REST APIs
State Management	In-memory, lost on restart	Persistent state in Redis/Postgres: workflows survive crashes	Redis, PostgreSQL
Message Queue	Direct agent-to-agent calls	Kafka/Redis queues: decoupled, buffered, replayable	Kafka, Redis Streams
Evaluation	No quality measurement	RAGAS continuous evaluation: faithfulness, relevancy, accuracy tracked	RAGAS, Custom evals
Observability	Simulated log stream	Langfuse/LangSmith: every LLM call traced, cost attributed, latency tracked	Langfuse, LangSmith
Authentication & RBAC	None (open access)	SSO + RBAC: who can trigger which agents and see which outputs	Auth0, Entra ID
Audit Trail	Nothing logged permanently	Immutable audit log: every decision with full context, actor, timestamp	OpenTelemetry, SIEM
Guardrails	No input/output safety	NeMo Guardrails: PII detection, topic restriction, hallucination mitigation	NeMo Guardrails, Lakera Guard
Human-in-the-Loop	Mentioned in UI only	Temporal signals: workflow pauses, sends approval request, waits for human	Temporal, Webhooks
Scalability	Single thread, single user	Kubernetes horizontal scaling: 10,000+ concurrent agent tasks	K8s, Docker
Error Handling	Silent failures, no retries	Dead letter queues, exponential backoff, circuit breakers, fallback models	Resilience4j, Custom
Cost Control	Unknown cost per operation	Token budget per agent, cost attribution per workflow, budget alerts	Langfuse, Custom
Multi-tenancy	Single tenant only	Tenant isolation: separate data, agent config, and cost attribution per client	Custom, Row-level security
Deployment	Netlify static HTML	CI/CD pipeline, versioned agents, blue-green deployment, rollback	GitHub Actions, ArgoCD

🏛

The ARTlligence Platform Stack — 7 Layers

Every enterprise AI system is built on these 7 layers. The same stack powers all 20 OS products. Build the platform once — deploy any OS on top of it.

Layer 7 — Enterprise Presentation

The dashboards and UIs (what we've built). React / HTML. Calls the Agent API. Real-time updates via WebSocket. SSO-authenticated.

React / Next.js · Tailwind CSS · WebSocket · Auth0 / Entra SSO · shadcn/ui

↕

Layer 6 — Agent API Gateway

FastAPI service. Authenticates requests. Validates RBAC. Routes to Temporal workflows. Returns structured results. Rate-limits per tenant.

FastAPI · JWT / OAuth2 · RBAC middleware · Rate limiting · Request validation

↕

Layer 5 — Workflow Orchestration

Temporal for durable, stateful, retryable workflows. LangGraph for agentic reasoning loops. Agents are Activities inside Temporal Workflows — they can fail, retry, wait for signals (human approval), and recover from crashes.

Temporal.io · LangGraph · Google ADK · Signals (HITL) · Saga patterns · Compensation logic

↕

Layer 4 — Agent Runtime

Individual agents as stateless Python/TypeScript functions. Input schema → reasoning → tool calls → output schema. NeMo Guardrails wrap every agent. RAGAS evaluation on sampled outputs. Cost budget enforced per agent call.

Python agents · NeMo Guardrails · RAGAS evaluation · Token budgets · Fallback models · Output validation

↕

Layer 3 — Model & Tool Layer

LLM routing: Claude Sonnet 4 for reasoning, Haiku for classification, GPT-4o for multimodal. MCP for standardised tool connectivity. RAG retrieval from vector store. Structured tool calls with JSON schema validation.

Claude Sonnet 4 · GPT-4o · Gemini 1.5 Pro · MCP connectors · Pinecone / Qdrant · Model routing

↕

Layer 2 — Data & Integration Layer

Enterprise connectors: SAP, Salesforce, SharePoint, ServiceNow, Jira, Confluence, custom ERPs. Kafka for real-time event streaming. PostgreSQL for structured state. Redis for caching and queues. S3/Blob for document storage.

Kafka · PostgreSQL · Redis · SAP connector · Salesforce API · SharePoint MCP · ServiceNow

↕

Layer 1 — Observability & Security Foundation

OpenTelemetry for all traces. Langfuse for LLM observability. Every agent call: input, output, model, tokens, cost, latency. Immutable audit log. PII scrubbing before any LLM call. Anomaly alerts on cost/quality drift.

Langfuse · OpenTelemetry · Prometheus · Grafana · PII scrubbing · Audit log (immutable) · Alerting

Production agent design principles

📐 Agent Contract Pattern

Every production agent has a defined contract — not just a prompt

# Production agent contract (Pydantic sketch; BaseAgent, ClaimType
# and Signal are platform types defined elsewhere)
from pydantic import BaseModel, Field

class ClaimInput(BaseModel):
  # Defined inputs, validated at runtime
  claim_id: str
  claimant_id: str
  amount: float
  claim_type: ClaimType

class FraudScore(BaseModel):
  # Defined outputs, validated before return
  score: float = Field(ge=0.0, le=1.0)
  signals: list[Signal]
  recommendation: str
  confidence: float = Field(ge=0.0, le=1.0)
  requires_human: bool

class FraudDetectionAgent(BaseAgent):
  input_schema = ClaimInput
  output_schema = FraudScore
  # Hard constraints
  max_tokens = 1500
  max_latency_ms = 3000
  fallback_model = "claude-haiku-4"
  requires_guardrail = True

🔄 The 5 Agentic Patterns

Choose the right pattern for each use case

Sequential: Agent A → Agent B → Agent C. Use for: compliance checks, document processing, report generation. Predictable, auditable, easy to test.

ReAct (Reasoning + Acting): Think → Act → Observe → Think. Use for: fraud detection, anomaly investigation, research. Handles uncertainty well.

Planning + Execution: Plan all steps first, then execute. Use for: production scheduling, logistics routing, complex multi-step tasks.

Multi-Agent Collaboration: Specialist agents with a coordinator. Use for: 100+ agent systems, parallel research, complex workflows requiring different expertise.

Reflection: Agent reviews its own output before returning. Use for: drafting, legal/medical content, any output where quality matters more than speed.

⚡ Human-in-the-Loop — The Right Way

Not a UI checkbox — a durable workflow pause that waits for a real human decision

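# Temporal HITL pattern (Python SDK; fraud_check, notify_investigator
# and settle are activities defined elsewhere)
import asyncio
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class ClaimsWorkflow:
  def __init__(self) -> None:
    self.decision = None

  @workflow.signal
  def investigator_decision(self, decision: str) -> None:
    self.decision = decision

  @workflow.run
  async def run(self, claim):
    # Step 1: AI triage
    fraud_score = await workflow.execute_activity(
      fraud_check, claim, start_to_close_timeout=timedelta(minutes=5))

    # Step 2: Human gate, workflow PAUSES
    if fraud_score > 0.7:
      await workflow.execute_activity(
        notify_investigator, args=[claim, fraud_score],
        start_to_close_timeout=timedelta(minutes=1))
      try:
        # Workflow sleeps until the signal arrives
        await workflow.wait_condition(
          lambda: self.decision is not None, timeout=timedelta(hours=48))
      except asyncio.TimeoutError:
        self.decision = "escalate"  # no response within 48h

    # Resumes when human sends signal; "auto_settle" is an assumed
    # default for claims that never hit the human gate
    return await workflow.execute_activity(
      settle, args=[claim, self.decision or "auto_settle"],
      start_to_close_timeout=timedelta(minutes=5))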

Why Temporal: The workflow survives server restarts. If the investigator takes a day and a half to respond, the workflow is still waiting, exactly where it left off. No polling loops, no lost state.

Audit trail: Every pause, every signal received, every decision is recorded in Temporal's event history. Fully auditable for regulatory purposes.

Escalation: Timeout signals trigger escalation workflows automatically — if no response in 48h, the workflow escalates to a senior investigator.

Not just approvals: Same pattern for any blocking action — data corrections, ambiguity resolution, policy exceptions. The workflow pauses and waits for the human input it needs.
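The pause, signal, and timeout-then-escalate behaviour described above can be tried without a Temporal cluster. The following is a minimal in-memory sketch using `asyncio`; `ApprovalGate`, the `"escalate_to_senior"` value, and the millisecond timeouts are illustrative assumptions, and unlike Temporal this stand-in does not survive a process restart.

```python
import asyncio

class ApprovalGate:
    """In-memory stand-in for a durable signal wait (illustration only)."""

    def __init__(self) -> None:
        self._event = asyncio.Event()
        self.decision = None

    def signal(self, decision: str) -> None:
        # Called when the human responds
        self.decision = decision
        self._event.set()

    async def wait(self, timeout_s: float) -> str:
        # Block until a signal arrives, or escalate once the timeout expires
        try:
            await asyncio.wait_for(self._event.wait(), timeout=timeout_s)
            return self.decision
        except asyncio.TimeoutError:
            return "escalate_to_senior"

async def demo(respond: bool) -> str:
    gate = ApprovalGate()
    if respond:
        # The investigator answers shortly after being notified
        asyncio.get_running_loop().call_later(0.01, gate.signal, "approve")
    return await gate.wait(timeout_s=0.05)

print(asyncio.run(demo(respond=True)))
print(asyncio.run(demo(respond=False)))
```

The first run returns the investigator's decision; the second hits the timeout and takes the escalation path.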

🔄 Temporal — Why It's the Right Choice

Durable execution: your workflows survive any failure

Durability: If a server crashes mid-workflow, Temporal replays from the last committed event. The workflow continues exactly where it stopped. No data loss, no zombie tasks.

Visibility: Every workflow execution is visible in the Temporal UI — what state it's in, what activities have run, what's waiting. Production debugging becomes possible.

Retry logic: Activities retry automatically with exponential backoff. Transient failures (rate limits, network blips) are absorbed by the retry policy rather than by hand-written retry loops. Permanent failures surface to the workflow, which can then run compensating actions.

Timers: Schedule an action 6 weeks from now. Trigger SLAs. Escalate after timeout. All natively, without cron jobs or polling.

Versioning: Update workflow logic without breaking in-flight workflows. Deploy new agent versions alongside old ones during migration.
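To make "exponential backoff" concrete, here is a minimal sketch of the delay schedule between attempts. The parameter names mirror the fields of a Temporal retry policy (initial interval, backoff coefficient, maximum interval, maximum attempts); the function `backoff_schedule` is illustrative and not part of the Temporal SDK, and the values are assumptions.

```python
from datetime import timedelta

def backoff_schedule(initial: timedelta, coefficient: float,
                     maximum: timedelta, max_attempts: int) -> list[timedelta]:
    """Delays between attempts: initial * coefficient**n, capped at maximum."""
    delays = []
    delay = initial
    # max_attempts counts the first try, so there are max_attempts - 1 waits
    for _ in range(max_attempts - 1):
        delays.append(min(delay, maximum))
        delay = delay * coefficient
    return delays

schedule = backoff_schedule(
    initial=timedelta(seconds=1), coefficient=2.0,
    maximum=timedelta(seconds=10), max_attempts=6)
print([d.total_seconds() for d in schedule])
```

With these values the waits double from one second until they reach the ten-second cap.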

🔀 LangGraph — Agentic Reasoning Loops

For agents that need to reason, not just execute

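# LangGraph inside a Temporal Activity
# (FraudState and the four node functions are defined elsewhere;
# FraudState is assumed to expose a `suspicious` attribute)
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

def build_fraud_agent(checkpointer=None):
  graph = StateGraph(FraudState)

  # Nodes: each is a reasoning step
  graph.add_node("analyse_claim", analyse_claim)
  graph.add_node("check_history", check_history)
  graph.add_node("network_analysis", network_analysis)
  graph.add_node("score_and_report", score_report)

  # Assumed ordering: history lookup feeds the claim analysis
  graph.add_edge(START, "check_history")
  graph.add_edge("check_history", "analyse_claim")

  # Conditional routing based on state
  graph.add_conditional_edges(
    "analyse_claim",
    lambda s: "network_analysis"
              if s.suspicious else "score_and_report",
    ["network_analysis", "score_and_report"]
  )
  graph.add_edge("network_analysis", "score_and_report")
  graph.add_edge("score_and_report", END)
  # In-memory checkpointing unless a persistent checkpointer is passed in
  return graph.compile(checkpointer=checkpointer or MemorySaver())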

LangGraph is used inside Temporal Activities for reasoning loops. Temporal handles durability and HITL. LangGraph handles the agent's internal reasoning. Separation of concerns.

📨 Message Queue Architecture — Why Agents Need Queues

Direct agent-to-agent calls create brittle systems. Queues create resilient ones.

❌ Direct calls (prototype pattern)

# What the demos do
result = agent_a(input)
result2 = agent_b(result)
result3 = agent_c(result2)
# If agent_b crashes → everything lost
# If agent_b is slow → caller blocks
# Can't retry, can't scale, can't replay

✅ Queue-based (production pattern)

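# What production systems use (kafka-python; broker address is a placeholder)
import json
from kafka import KafkaProducer
producer = KafkaProducer(
  bootstrap_servers="localhost:9092",
  value_serializer=lambda event: json.dumps(event).encode("utf-8"))
producer.send("claims.new", claim_event)
producer.flush()  # wait until buffered messages are sent
# Agent B consumes from queue
# Independently scaled (10 instances)
# Crash → message stays in queue
# Slow → backpressure handled
# Replay → re-process any message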
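The consumer side of the same idea can be sketched with the standard library alone. This is a toy illustration under stated assumptions: `queue.Queue` stands in for the broker, `MAX_ATTEMPTS` is an invented retry limit, and `agent_b` is a stub. It shows a failing message being retried and then parked in a dead letter list while the other messages still get processed.

```python
import queue

MAX_ATTEMPTS = 3  # assumed retry limit for this sketch

def consume(inbox: queue.Queue, handler, dead_letters: list) -> list:
    """Drain the inbox; failed messages are re-queued, then dead-lettered."""
    results = []
    while not inbox.empty():
        message = inbox.get()
        try:
            results.append(handler(message["payload"]))
        except Exception as exc:
            message["attempts"] += 1
            if message["attempts"] < MAX_ATTEMPTS:
                inbox.put(message)  # the message is not lost, retry later
            else:
                dead_letters.append({**message, "error": str(exc)})
    return results

def agent_b(payload: dict) -> str:
    # Stub agent: one claim is permanently malformed
    if payload["claim_id"] == "CLM-BAD":
        raise ValueError("malformed claim")
    return f"scored {payload['claim_id']}"

inbox, dead = queue.Queue(), []
for claim_id in ("CLM-1", "CLM-BAD", "CLM-2"):
    inbox.put({"payload": {"claim_id": claim_id}, "attempts": 0})

results = consume(inbox, agent_b, dead)
print(results)
print(dead)
```

The bad claim never blocks the good ones, and it ends up somewhere a human can inspect it instead of vanishing.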

🔌

MCP is the integration game-changer

Model Context Protocol (MCP) is Anthropic's open standard for connecting AI to data sources and tools. Instead of building custom integrations for every enterprise system, MCP provides a standardised connector interface. One MCP server for SharePoint works with any agent. The ecosystem is growing fast — 200+ community MCP servers already built.

🏗 MCP Connector Architecture

How MCP replaces custom integration code

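# MCP server for Salesforce (example, official MCP Python SDK)
import re
from mcp.server.fastmcp import FastMCP

salesforce_mcp = FastMCP("salesforce")

@salesforce_mcp.tool()
async def get_account(account_id: str):
  """Retrieve account details from Salesforce"""
  # Salesforce IDs are 15 or 18 alphanumeric characters; reject anything
  # else so the value cannot inject SOQL
  if not re.fullmatch(r"[A-Za-z0-9]{15}|[A-Za-z0-9]{18}", account_id):
    raise ValueError("invalid Salesforce account id")
  # sf_client is a Salesforce API client configured elsewhere;
  # SOQL needs named fields, so the field list here is illustrative
  return await sf_client.query(
    f"SELECT Id, Name, Industry FROM Account WHERE Id = '{account_id}'"
  )

# Agent uses the tool, no custom code needed
# (Agent is the platform's own agent class, shown for illustration)
agent = Agent(
  tools=[salesforce_mcp, sharepoint_mcp, sap_mcp],
  # Agent automatically discovers available tools
  # and knows how to call them from schema
)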

📋 Enterprise Integration Catalogue

What's available via MCP today

System	MCP Available	Auth Method
SharePoint Online	✓ Ready	OAuth2 + Entra
Salesforce	✓ Ready	OAuth2
ServiceNow	✓ Ready	API Key + OAuth
Jira / Confluence	✓ Ready	API Token
SAP S/4HANA	Build	SAP BTP OAuth
SAP ERP (older)	Build	RFC / BAPI
Oracle ERP	Build	REST API
MS Dynamics 365	✓ Ready	Entra OAuth
PostgreSQL / SQL	✓ Ready	Connection string
Custom REST APIs	Generic	Any

🔐 Integration Security Principles

Every connector must follow these rules

Least privilege: MCP connectors are scoped to the minimum data required. A claims fraud agent can read claims — it cannot modify or delete. Permissions defined at connector level, not agent level.

No credentials in agent prompts: All credentials stored in secrets manager (Vault/AWS Secrets Manager). Injected at runtime. Never in code, never in prompts, never in logs.

Audit every tool call: Every MCP tool call logged: which agent, which tool, what parameters, what was returned, latency, cost. Data access auditable for compliance.
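As a rough illustration of the first and third rules, the sketch below wraps tool functions in a decorator that checks a per-agent scope and records every call. Everything here is hypothetical: `ALLOWED_TOOLS`, `AUDIT_LOG`, and the two stub tools are invented for the example, and a real connector would also log the returned data and cost and write to a durable sink.

```python
import functools
import time

AUDIT_LOG: list[dict] = []                      # stand-in for a real audit sink
ALLOWED_TOOLS = {"fraud_agent": {"read_claim"}}  # hypothetical connector scopes

def audited_tool(func):
    """Enforce the connector scope and record every call, allowed or not."""
    @functools.wraps(func)
    def wrapper(agent: str, **params):
        if func.__name__ not in ALLOWED_TOOLS.get(agent, set()):
            AUDIT_LOG.append({"agent": agent, "tool": func.__name__,
                              "params": params, "status": "denied"})
            raise PermissionError(f"{agent} may not call {func.__name__}")
        started = time.perf_counter()
        result = func(**params)
        AUDIT_LOG.append({"agent": agent, "tool": func.__name__,
                          "params": params, "status": "ok",
                          "latency_ms": (time.perf_counter() - started) * 1000})
        return result
    return wrapper

@audited_tool
def read_claim(claim_id: str) -> dict:
    return {"claim_id": claim_id, "amount": 1200.0}

@audited_tool
def delete_claim(claim_id: str) -> None:
    raise AssertionError("should never run for a read-only agent")

claim = read_claim("fraud_agent", claim_id="CLM-0847")
try:
    delete_claim("fraud_agent", claim_id="CLM-0847")
except PermissionError as exc:
    denied = str(exc)

print([entry["status"] for entry in AUDIT_LOG])
```

The denied call is logged as well, so attempted out-of-scope access is as visible as permitted access.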

⚠️

The most neglected production requirement

LLM outputs degrade silently. A model update, a prompt change, or a subtle shift in input distribution can reduce quality by 30% without any error being thrown. Without evaluation infrastructure, you won't know until a client calls. Evaluation is not testing — it's continuous quality monitoring.

📊 RAGAS — RAG Evaluation Framework

4 metrics that catch the ways RAG fails in production

Faithfulness (target: >0.92): Is every claim in the output supported by the retrieved context? Catches hallucinations — the model inventing facts not in the source. Most critical metric.

Answer Relevancy (target: >0.85): Does the answer actually address the question? Catches cases where the model retrieves relevant context but answers a different question.

Context Precision (target: >0.80): Is the retrieved context actually useful? Low precision means you're retrieving noise — the model is working around irrelevant documents.

Context Recall (target: >0.80): Did retrieval find all the relevant information? Low recall means the model is answering from incomplete context — confident but wrong.
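The four targets above translate directly into a gate. This sketch does not call the RAGAS library; it assumes the scores have already been computed and only checks them against the stated targets. The function name and the example scores are made up for illustration.

```python
# Targets taken from the four metrics above (a score must exceed its target)
TARGETS = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.80,
}

def failing_metrics(scores: dict[str, float]) -> dict[str, float]:
    """Return each metric that misses its target, with the shortfall."""
    return {
        name: round(target - scores.get(name, 0.0), 3)
        for name, target in TARGETS.items()
        if scores.get(name, 0.0) <= target
    }

# Hypothetical scores for one evaluation run
run = {"faithfulness": 0.95, "answer_relevancy": 0.81,
       "context_precision": 0.88, "context_recall": 0.79}
print(failing_metrics(run))
```

An empty result means every metric cleared its bar; anything else names the metric to investigate and by how much it fell short.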

🧪 Golden Dataset — The Evaluation Foundation

Build this before you deploy anything to production

# Golden dataset structure
golden_dataset = [
  {
    "input": "Analyse claim CLM-0847 for fraud",
    "expected_output": {
      "fraud_score": 0.91,
      "signals": ["3rd fire claim 24mo", "assessor link"],
      "recommendation": "decline_and_refer_SIU"
    },
    "grading_criteria": {
      "score_range": (0.85, 1.0),
      "required_signals": ["fire_claim_frequency"],
      "acceptable_recommendations": ["decline_and_refer_SIU"]
    }
  }
  # 50–200 cases per agent
]
# Run on every deployment
assert eval_score >= 0.92, "Deploy blocked"
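One way `eval_score` could be produced is to grade each agent output against its case's `grading_criteria` and take the pass rate. This is a sketch under assumptions: the `grade` function is hypothetical, and it assumes agents emit signals as machine-readable codes such as `fire_claim_frequency` so they can be matched exactly.

```python
def grade(output: dict, criteria: dict) -> bool:
    """True when the agent output satisfies every grading criterion."""
    low, high = criteria["score_range"]
    return (
        low <= output["fraud_score"] <= high
        and set(criteria["required_signals"]) <= set(output["signals"])
        and output["recommendation"] in criteria["acceptable_recommendations"]
    )

criteria = {
    "score_range": (0.85, 1.0),
    "required_signals": ["fire_claim_frequency"],
    "acceptable_recommendations": ["decline_and_refer_SIU"],
}
# Hypothetical agent outputs for the same golden case
outputs = [
    {"fraud_score": 0.91, "signals": ["fire_claim_frequency", "assessor_link"],
     "recommendation": "decline_and_refer_SIU"},
    {"fraud_score": 0.40, "signals": [], "recommendation": "approve"},
]
results = [grade(output, criteria) for output in outputs]
eval_score = sum(results) / len(results)  # fraction of cases passed
print(results, eval_score)
```

Grading on ranges and required signals, rather than exact string equality with `expected_output`, keeps the gate stable when the model rephrases a correct answer.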

📈 Continuous Evaluation Pipeline

Quality monitoring doesn't stop at deployment

Pre-deployment gate: Golden dataset evaluation runs automatically on every PR. Evaluation score below threshold → deployment blocked. No exceptions. This is your minimum quality bar.

Production sampling: 5% of live agent outputs evaluated continuously using RAGAS + LLM-as-judge. Score logged in Langfuse. Alert fires if 7-day average drops below threshold.

Human feedback loop: When humans override an AI recommendation, that's a training signal. Overrides collected, reviewed, and periodically used to update the golden dataset and retune prompts.
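The production sampling step can be sketched in a few lines. Hashing the trace ID gives a deterministic 5% sample (the same trace is always in or out), and the alert compares a 7-day mean against a threshold. The 0.92 threshold, the function names, and the daily scores are assumptions for illustration.

```python
import hashlib
from statistics import mean

def in_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically select about `rate` of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

def should_alert(daily_scores: list[float], threshold: float = 0.92,
                 window: int = 7) -> bool:
    """Alert when the mean of the last `window` daily scores is below threshold."""
    recent = daily_scores[-window:]
    return len(recent) == window and mean(recent) < threshold

sampled = sum(in_sample(f"trace-{i}") for i in range(20_000))
print(sampled / 20_000)
print(should_alert([0.95, 0.94, 0.93, 0.91, 0.90, 0.89, 0.88]))
```

A rolling mean smooths out single bad days, which is why the slow decline in the example still trips the alert even though no single day looks catastrophic.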

📡 What Langfuse Captures

The complete picture of every agent interaction

LLM call input/output	Every call
Model used · prompt version · temperature	Config
Input tokens · output tokens · cost	Cost
End-to-end latency · per-step latency	Perf
Tool calls made · tool call results	Tools
RAGAS scores · faithfulness · relevancy	Quality
User feedback · human override events	Feedback
Guardrail trigger events · blocked inputs	Safety
Session ID · user ID · tenant ID	Identity

🚨 Alert Architecture

What triggers an alert in production

P0 — Page immediately: Error rate >5% on any agent · Cost spike >3× daily baseline · RAGAS faithfulness below 0.85 · Any PII leak detected

P1 — Notify in 30 min: P95 latency >10s · Human override rate >20% (signals AI recommendations degraded) · Evaluation score trending down 3 days

P2 — Daily digest: Token cost increase >15% week-over-week · New tool call patterns · Unusual input distribution shifts

Weekly review: Model performance report · Cost attribution by agent · Human feedback summary · Evaluation trend analysis
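The P0/P1/P2 thresholds above can be expressed as a small classifier. The metric key names below are invented for this sketch; the thresholds are the ones listed, and only the cost rule is implemented for P2 because the other P2 triggers are not numeric.

```python
def classify(m: dict) -> str:
    """Map a metrics snapshot to the highest-priority alert level it triggers."""
    if (m["error_rate"] > 0.05 or m["cost_vs_baseline"] > 3.0
            or m["faithfulness"] < 0.85 or m["pii_leak"]):
        return "P0"
    if (m["p95_latency_s"] > 10 or m["override_rate"] > 0.20
            or m["eval_down_days"] >= 3):
        return "P1"
    if m["weekly_cost_increase"] > 0.15:
        return "P2"
    return "OK"

# Hypothetical snapshot of a healthy agent
healthy = {"error_rate": 0.01, "cost_vs_baseline": 1.1, "faithfulness": 0.94,
           "pii_leak": False, "p95_latency_s": 4.2, "override_rate": 0.08,
           "eval_down_days": 0, "weekly_cost_increase": 0.05}

print(classify(healthy))
print(classify({**healthy, "override_rate": 0.25}))
print(classify({**healthy, "override_rate": 0.25, "pii_leak": True}))
```

Checking the levels in priority order means a snapshot that trips several rules is always reported at its most severe level.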

💰 Cost Architecture — The Hidden Production Challenge

LLM costs are the #1 reason enterprise AI projects fail at scale

Model tiering: Route to cheapest model that meets quality bar. Classification → Haiku (cheap). Reasoning → Sonnet (mid). Complex synthesis → Opus (expensive). 60–80% cost reduction on mixed workloads.

Semantic caching: Cache LLM responses for semantically similar inputs. Langfuse + Redis. Cache hit rate 30–50% on typical enterprise workloads. Cost reduction proportional to cache hit rate.

Token budgets: Hard token limits per agent per call. Context window management — summarise old context rather than passing full history. Prompt compression for repetitive patterns.
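Model tiering and token budgets are both simple to sketch. The tier names follow the text above; the default tier for unknown task types, the word-count tokenizer, and the sample history are assumptions. Note that this sketch drops the oldest messages to fit the budget, which is a simpler stand-in for the summarisation described above.

```python
TIERS = {  # task type -> model tier, as described above
    "classification": "haiku",
    "reasoning": "sonnet",
    "synthesis": "opus",
}

def route(task_type: str) -> str:
    """Pick the tier for a task; unknown task types default to the mid tier."""
    return TIERS.get(task_type, "sonnet")

def fit_to_budget(messages: list[str], max_tokens: int,
                  count_tokens=lambda text: len(text.split())) -> list[str]:
    """Keep the most recent messages that fit inside the token budget.

    Whitespace word count stands in for a real tokenizer here.
    """
    kept, used = [], 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))

history = ["claim opened by customer", "assessor visit booked",
           "photos uploaded", "fraud flag raised on claim"]
print(route("classification"), route("unknown"))
print(fit_to_budget(history, max_tokens=8))
```

Enforcing the budget before the call, rather than after the invoice, is what turns a cost target into a hard limit.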

🛡 NeMo Guardrails — Production Safety Layer

Every agent input and output passes through guardrails

# guardrails config (Colang-style pseudocode; action names are illustrative)
define flow check input
  # Detect PII before LLM sees it
  $pii = execute pii_detection(input)
  if $pii.detected:
    $input = execute redact_pii(input)
    log_event("pii_redacted", $pii.types)

  # Block prompt injection attempts
  $injection = execute injection_check(input)
  if $injection.detected:
    log_event("injection_blocked")
    return "Request blocked by safety policy"

define flow check output
  # Verify no hallucinated citations
  $faith = execute faithfulness_check(output)
  if $faith.score < 0.85:
    log_event("low_faithfulness", $faith.score)
    return rerun_with_stricter_prompt()

🔑 RBAC Architecture

Who can do what — enforced at the API layer

Role	Can View	Can Trigger	Can Approve
AI Consumer	Agent outputs	None	None
Operator	All outputs + traces	Standard workflows	None
Approver	Flagged items	Standard workflows	Assigned flags
Manager	Team scope	All workflows	All approvals
Admin	Everything + costs	Everything	Everything
AI Engineer	Traces + evals	Dev/test only	Eval gates

RBAC is enforced at the API Gateway layer — not in the agent or UI. Every request carries a JWT. Permissions checked before any agent workflow is started. Tenant isolation: users can only see and act on their tenant's data.
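A gateway-side check along these lines might look like the sketch below. It assumes the JWT has already been verified and decoded into a `Claims` object; the role keys, the workflow classes (`standard`, `dev_test`), and the `"*"` wildcard are hypothetical encodings of the "Can Trigger" column in the role table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claims:
    """Claims assumed to be carried in an already-verified JWT."""
    user_id: str
    role: str
    tenant_id: str

# "Can Trigger" column of the role table; "*" means every workflow class
CAN_TRIGGER = {
    "ai_consumer": set(),
    "operator": {"standard"},
    "approver": {"standard"},
    "manager": {"*"},
    "admin": {"*"},
    "ai_engineer": {"dev_test"},
}

def may_trigger(claims: Claims, workflow_class: str, tenant_id: str) -> bool:
    """Allow only same-tenant requests whose role covers the workflow class."""
    if claims.tenant_id != tenant_id:
        return False  # tenant isolation comes first
    allowed = CAN_TRIGGER.get(claims.role, set())
    return "*" in allowed or workflow_class in allowed

operator = Claims("u1", "operator", "acme")
print(may_trigger(operator, "standard", "acme"))
print(may_trigger(operator, "standard", "globex"))
print(may_trigger(Claims("u2", "ai_consumer", "acme"), "standard", "acme"))
```

Unknown roles fall through to an empty permission set, so the check fails closed.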

📋 Audit Trail Architecture

Every AI decision is permanently logged — because it has to be

What's logged: Agent triggered (who, when, why), inputs provided, outputs generated, tool calls made, human approval decisions, overrides, final action taken. Immutable — cannot be edited or deleted.

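Why it matters: The FCA's Consumer Duty expects firms to evidence how their decisions lead to good customer outcomes, which in practice means being able to explain them. GDPR Article 22 gives individuals the right to human intervention in solely automated decisions that have legal or similarly significant effects. ICO requires data access logs. Audit trail is not optional.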

Storage: OpenTelemetry traces → Grafana Tempo. Agent decisions → append-only PostgreSQL table with cryptographic hash chain. SIEM integration for security events. Retention: 7 years for regulated industries.
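The "cryptographic hash chain" idea is easy to demonstrate in memory. In this sketch each entry's hash covers the previous entry's hash, so editing any earlier record invalidates everything after it. The record fields are invented for illustration, and a production version would live in the append-only PostgreSQL table rather than a Python list.

```python
import hashlib
import json

def append_entry(log: list[dict], record: dict) -> None:
    """Append a record whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    log.append({"record": record, "prev_hash": prev_hash, "hash": entry_hash})

def verify(log: list[dict]) -> bool:
    """Recompute every hash; an edited entry, or one removed from the
    middle of the log, breaks the chain."""
    prev_hash = "0" * 64
    for entry in log:
        body = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + body).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

audit_log: list[dict] = []
append_entry(audit_log, {"agent": "fraud", "claim": "CLM-0847", "score": 0.91})
append_entry(audit_log, {"actor": "investigator-7", "decision": "decline"})
print(verify(audit_log))
audit_log[0]["record"]["score"] = 0.10  # tamper with history
print(verify(audit_log))
```

Verification passes on the untouched log and fails after the edit. Detecting truncation of the newest entries needs an extra step, such as periodically publishing the latest hash to a separate system.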

🐳 Container Architecture

Each layer deployed independently — scale what needs scaling

# docker-compose.prod.yml (simplified)
services:
  api-gateway:
    image: artlligence/api-gateway:latest
    deploy:
      replicas: 3  # horizontal scale

  fraud-detection-worker:
    image: artlligence/fraud-agent:v2.1.4
    deploy:
      replicas: 10  # scale with queue depth
    environment:
      MAX_TOKENS_PER_CALL: 1500
      FALLBACK_MODEL: claude-haiku-4

  temporal-worker:
    image: artlligence/temporal-worker:latest
    deploy:
      replicas: 5

  # Supporting services (image and config omitted here):
  #   langfuse         observability
  #   nemo-guardrails  safety layer
  #   redis            caching + queues
  #   postgres         state + audit log

🚀 CI/CD Pipeline — Agent Deployment

Never deploy a degraded agent to production

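# .github/workflows/deploy-agent.yml (simplified; scripts/ paths are placeholders)
on:
  push:
    branches: [main]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/eval/   # golden dataset
      # RAG quality gate: exits non-zero if faithfulness < 0.92
      # or answer_relevancy < 0.85, which blocks the deploy
      - run: python scripts/ragas_gate.py

  deploy-canary:
    needs: evaluate   # only if eval passes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: scripts/deploy.sh --traffic 5   # deploy to 5% of traffic
      # Watch for 30 min; fail if error_rate >= 0.01 or p95_latency >= 5000ms
      - run: scripts/canary_check.sh

  deploy-full:
    needs: deploy-canary
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: scripts/deploy.sh --traffic 100   # roll out to 100% traffic
      # Old version kept for 1h rollback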

☁ Cloud Deployment Options

Where to run — depends on client requirements

Cloud-native (default): GCP Cloud Run + GKE, or AWS ECS + EKS, or Azure AKS. Managed Temporal Cloud. Langfuse Cloud. Fastest deployment, lowest ops burden. Most clients start here.

Private cloud / VPC: Deploy entire stack in client's VPC. LLM calls stay within their network using Vertex AI or Azure OpenAI. Required for financial services, government, healthcare with strict data residency.

On-premise / air-gapped: Local LLM models (Ollama / vLLM) for clients where data cannot leave premises. Performance trade-off vs cloud models. Defence, intelligence, critical national infrastructure.

📅

12 weeks from signed contract to production MVP

This is the standard ARTlligence delivery timeline for any single OS product. After Week 12, the client has a production-ready system connected to their data, with full observability, governance, and an ongoing improvement cycle. Subsequent OS products on the same platform are faster because the platform is already built.

Weeks 1–2: Discovery & Data Architecture

Team: Solution Architect + 1 AI Engineer

Stakeholder interviews to define exact use cases and success metrics. Data source mapping — what exists, what's accessible, what quality it's in. Integration feasibility: SAP/Salesforce/SharePoint connectivity tested. RBAC and data governance requirements documented. Security and compliance baseline. Output: Technical Architecture Document + Integration Plan + Evaluation Criteria.

Weeks 3–4: Platform Foundation

Team: 2 AI Engineers + DevOps

Temporal cluster deployed. Langfuse observability live. NeMo Guardrails configured for sector. MCP connectors built and tested against live data. RBAC and SSO integrated with client identity provider. PostgreSQL audit schema defined. Kafka topics configured. First golden dataset built (50 cases) with client domain experts. CI/CD pipeline established.

Weeks 5–8: Agent Development

Team: 3 AI Engineers + 1 Domain Expert

Agents built one at a time, highest value first. Each agent: schema defined → prompt engineered → evaluation score >0.92 → HITL workflow → integration tested → code reviewed → deployed to staging. Target: 2–3 agents per week with quality gates, which gives the 8–12 agents of a single OS MVP over these four weeks. Weekly demo to stakeholders with live data. Evaluation score tracked: no agent ships below threshold. Human-in-the-loop flows tested with actual approvers.

Weeks 9–10: Integration & Load Testing

Team: 2 AI Engineers + QA

End-to-end workflow testing with production data volumes. Load testing: target throughput × 3. Failure injection: kill random services, verify recovery. Cost profiling: token cost per workflow documented. Security penetration test on agent API. Canary deployment to 5% of real traffic. Monitor evaluation scores on live data — confirm alignment with golden dataset.

Weeks 11–12: Production Launch & Handover

Team: Full team + Client operations team

Full production rollout. Runbook documentation. On-call escalation procedures. Client operations team trained on Temporal UI, Langfuse, and alert handling. Feedback collection mechanism live. 30-day post-launch hypercare: ARTlligence on-call for P0/P1 incidents. Ongoing improvement cycle agreed: monthly evaluation reviews, quarterly model updates, continuous golden dataset expansion.

Cost architecture for enterprise AI systems

💷 Build Cost — Indicative Ranges

Typical ARTlligence engagement structure

Engagement Type	Scope	Investment
Proof of Value	2 agents · 4 weeks · single integration	£40K–£80K
Single OS MVP	8–12 agents · 12 weeks · 3–5 integrations	£200K–£400K
Enterprise Platform	Full OS · 12+ agents · all integrations	£500K–£1.2M
Multi-OS Programme	3+ OS products on shared platform	£1M–£3M+

💰 LLM Running Cost — At Scale

Monthly operational cost examples (Claude Sonnet 4)

Workload	Volume/day	Monthly LLM Cost
Insurance claims triage	500 claims	£800–£2,000
Document intelligence	2,000 docs	£2,000–£5,000
Customer service AI	5,000 queries	£3,000–£8,000
Full enterprise OS	10,000+ ops	£8,000–£25,000

Model tiering (routing simple tasks to Haiku) and semantic caching typically reduce LLM costs by 50–70% vs worst-case estimates. Cost per output tracked to the penny in Langfuse.

📈 ROI Framing for Clients

How to present the investment case — make it undeniable

Payback period: For InsuranceOS, fraud prevention of £2.1M/month against a £300K build cost pays back in under a week once the system runs at full rate (£300K ÷ £2.1M per month ≈ 0.14 months). Frame the investment as a fraction of Year 1 value, not as a technology cost.

Cost avoidance vs revenue: Both matter. ManufacturingOS downtime reduction (£12M/year avoided) is harder to see than revenue but easier to quantify precisely. Use real production data from the discovery phase.

Ongoing value: Unlike a software licence, AI systems improve over time. The golden dataset grows. Models improve. Human feedback refines recommendations. Year 3 value > Year 1 value from the same investment.

Team structure for production AI delivery

🏗 Solution Architect

1 per engagement · ARTlligence lead

Owns the technical architecture. Runs discovery. Defines integration patterns. Ensures platform decisions are sound at scale. Presents to client CTOs. Reviews all agent contracts before build. Accountable for delivery.

System design · LLM architecture · Enterprise integration · Client communication

🤖 AI Engineer

2–4 per engagement · core build team

Builds agents. Engineers prompts. Writes evaluation tests. Builds MCP connectors. Implements Temporal workflows. Owns RAGAS scores for their agents. Fixes quality issues found in production sampling.

Python / TypeScript · LangGraph · Temporal · Prompt engineering · Evaluation

🔧 MLOps / DevOps

1 per programme · shared across engagements

Owns platform infrastructure: Kubernetes, CI/CD, Langfuse, Temporal cluster, monitoring. Manages deployments, rollbacks, scaling. On-call for infrastructure incidents. Builds and maintains the deployment pipeline.

Kubernetes · Terraform · GitHub Actions · Grafana · Incident response

🎓 Skills to Build or Hire

The specific capability gaps that separate prototype shops from production delivery firms

Skill	Why Critical	How to Build	Priority
Temporal.io	Durable workflow orchestration is non-negotiable at enterprise scale	Temporal.io docs + build a HITL workflow from scratch	P0
RAGAS Evaluation	Without eval, you can't prove quality or detect degradation	RAGAS docs + build golden dataset for one agent	P0
Langfuse LLMOps	Every enterprise client will ask "how do you monitor this in production"	Langfuse self-hosted + instrument one real agent	P0
MCP Server Building	SAP and Oracle integrations require custom MCP servers	Build a custom MCP server for one internal tool	P1
NeMo Guardrails	Every financial/healthcare/government client requires it	NeMo docs + configure for one sector use case	P1
Enterprise Kubernetes	Production systems require proper container orchestration	GKE or EKS with autoscaling + deploy Temporal on it	P1

🎯

When a CTO says "cool prototype" — here's exactly what to say

Don't defend the demo. Agree with them and then go further. The demo is meant to show you what's possible — not what we'd ship. Here's the platform we use to turn that into something you'd bet your operations on.

The 4 objections and the exact responses

Objection 1: "This is just a demo — it's not connected to our real data"

Response: "Correct — and that's deliberate. The demo shows you the intelligence layer. The integration layer is our Week 3–4 work. We connect to your SAP / Salesforce / SharePoint via MCP — a standardised protocol that means we're not writing custom integration code for every system. We've done this connection before for [comparable client]. The first integration usually takes 2–3 weeks. After that, your agents are working on your actual data, in your actual environment."

Back it up with: Show the MCP connector architecture. Name the specific systems they use and confirm you have connectors for them. Offer a paid technical feasibility sprint (Weeks 1–2) before full commitment.

Objection 2: "How do I know it's giving correct answers?"

Response: "This is the right question, and it's why we build an evaluation framework before we deploy a single agent. We work with your domain experts to build a golden dataset: 50–200 cases with known-correct answers. Every agent must score above a threshold on this dataset before it ships to production. And we run RAGAS evaluation on 5% of live outputs continuously, so if quality degrades after a model update or data shift, we know within 24 hours, before your users do."

Back it up with: Show the RAGAS metrics screen. Explain the 0.92 faithfulness threshold. Describe what happens when the alert fires. This is the answer that separates you from every other AI vendor.

Objection 3: "What happens when it makes a mistake?"

Response: "Three things. First — our human-in-the-loop architecture means no consequential action is taken without a human approving it. The AI recommends; your authorised people decide. Second — every AI decision is logged permanently with full context. If something goes wrong, we can trace exactly what the agent saw, what it reasoned, and what it recommended. Third — overrides are training signals. When your team overrides a recommendation, that feeds back into the evaluation dataset. The system gets better at the specific cases your team disagrees with."

Back it up with: Show the Temporal HITL code pattern. Show the audit log schema. Explain the override feedback loop. This answer addresses the real fear: liability.

Objection 4: "We're not ready — we need to sort out our data first"

Response: "That's often true — and sorting out data quality is actually part of what we do in the discovery phase. In our experience, the first agent we build forces clarity on data quality issues that have been invisible for years. We don't need perfect data to start — we need to understand what you have. The data quality issues that would block production become visible in Weeks 1–2, and we either work around them or help you fix them. The worst outcome is discovering them on Day 1 of Week 12."

Back it up with: Propose a paid 2-week discovery sprint. Low financial risk for them, high information value. Most organisations can approve £20–40K without board sign-off. Discovery converts to full engagement at ~70% rate.

The positioning shift — from vendor to partner

❌ Don't position as

"We build AI demos and prototypes." "We use Claude / OpenAI to build things." "We can build you a chatbot." These frame you as a commodity vendor competing on day rate.

✅ Position as

"We deliver production enterprise AI systems with defined evaluation frameworks, full observability, and governance built in. The demos show what's possible. The platform is what makes it real." Compete on reliability, not capability.

💡 The killer differentiator

"Every other AI vendor will show you a demo. We're the only ones who can also show you the evaluation framework that tells you if it's working." Evaluation infrastructure is the moat. Most AI shops don't have it.

ARTlligence Platform: From Prototype to Production