Serverless Doesn't Mean Stateless: Engineering High-Performance AI Systems on AWS

Posted by

Cloudain Editorial Team

Cloud Architecture


How Cloudain built cost-efficient, high-performance AI systems using AWS serverless while maintaining state, context, and conversation memory across millions of interactions.

Author

Cloudain Cloud Engineering Team

Published

2025-11-04

Read Time

9 min read

Introduction

"Serverless" sounds simple: write code, deploy functions, let AWS handle the rest. But when you're building production AI systems that power Growain's marketing intelligence and Cloudain Platform's customer onboarding, reality hits fast.

The challenge: AI conversations need memory. Users expect context to persist. Responses must be fast. And costs can't spiral out of control.

This article explains how we engineered high-performance, stateful AI systems on AWS serverless infrastructure, achieving sub-second response times while cutting per-conversation costs by 67%.

The Serverless Promise vs. Reality

What Serverless Offers

AWS Lambda and serverless services promise:

  • No server management: Focus on code, not infrastructure
  • Automatic scaling: Handle 1 request or 1 million
  • Pay-per-use: Only charged for execution time
  • Built-in reliability: Multi-AZ deployment included

The Hidden Challenges

But production AI systems face unique hurdles:

1. Cold Starts

CODE
User: "What's my order status?"
System: [300ms cold start + 800ms model inference]
User: [frustrated by 1.1 second delay]

2. State Management

CODE
User: "Show me my recent purchases"
AI: [retrieves data]
User: "How much did I spend?"
AI: "I don't have context 300">from your previous message"

3. Cost Explosion

CODE
Month 1: $500 (testing)
Month 2: $2,800 (beta launch)
Month 3: $12,400 (production scale)
Month 4: [panic]

4. Latency Variability

CODE
Request 1: 250ms ✓
Request 2: 3,200ms ✗
Request 3: 180ms ✓
Request 4: 4,800ms ✗

These aren't theoretical; we hit every single one.

Scaling Cloudain's AI Load Across Brands

The Architecture Challenge

Cloudain runs multiple AI-powered products:

  • Growain: Marketing automation and campaign intelligence
  • Cloudain Platform: CRM, onboarding, and customer support
  • CoreFinOps: Financial operations and forecasting
  • MindAgain: Wellness conversations and mental health support
  • Securitain: Compliance automation and audit intelligence

Combined load:

  • 2.4M AI conversations per month
  • 18M individual messages
  • Peak: 850 requests/second
  • 24/7 global availability required

Infrastructure constraints:

  • Must stay serverless (no EC2 to manage)
  • Cost target: <$0.02 per conversation
  • P95 latency: <800ms
  • 99.9% availability

The Solution Architecture

CODE
┌─────────────────────────────────────────────────┐
│              CloudFront CDN                     │
│  Global edge locations for static assets       │
└────────────────┬────────────────────────────────┘
                 │
┌────────────────▼────────────────────────────────┐
│           API Gateway (REST/WebSocket)          │
│  • Request validation                           │
│  • Rate limiting via CoreCloud                  │
│  • JWT authentication                           │
└────────────────┬────────────────────────────────┘
                 │
        ┌────────┴────────┐
        ▼                 ▼
┌──────────────┐  ┌──────────────────┐
│ Lambda       │  │ Lambda            │
│ (Warm Pool)  │  │ (Provisioned)     │
│              │  │                   │
│ • Brand      │  │ • High-volume     │
│   routing    │  │   endpoints       │
│ • Context    │  │ • Sub-200ms       │
│   loading    │  │   guarantee       │
└──────┬───────┘  └────────┬──────────┘
       │                   │
       └─────────┬─────────┘
                 ▼
┌─────────────────────────────────────────────────┐
│          AWS Bedrock / OpenAI API               │
│  Model inference with streaming                 │
└────────────────┬────────────────────────────────┘
                 │
        ┌────────┴────────┐
        ▼                 ▼
┌──────────────┐  ┌──────────────────┐
│ DynamoDB     │  │ ElastiCache       │
│              │  │ (Redis)           │
│ • Long-term  │  │                   │
│   memory     │  │ • Session cache   │
│ • User prefs │  │ • Hot data        │
│ • Audit logs │  │ • Sub-5ms reads   │
└──────────────┘  └──────────────────┘

Token Throttling and Budget Alarms via CoreCloud

The Cost Problem

AI inference is expensive:

  • GPT-4: ~$0.03 per 1K tokens
  • Claude 3: ~$0.015 per 1K tokens
  • AWS Bedrock: ~$0.01 per 1K tokens

Average conversation: 3,000 tokens. Across 2.4M conversations per month, that works out to roughly $72,000/month at standard rates.

Unacceptable for a growing SaaS business.
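
For a quick sanity check, that arithmetic can be captured in a small helper; this is a minimal sketch, and the volumes and the ~$0.01-per-1K-token rate are the approximate figures above, not billing-exact numbers.

TYPESCRIPT
// Back-of-the-envelope inference cost estimate (approximate rates, not billing-exact)
function estimateMonthlyInferenceCost(
  tokensPerConversation: number,
  conversationsPerMonth: number,
  pricePer1kTokens: number
): number {
  const totalTokens = tokensPerConversation * conversationsPerMonth
  return (totalTokens / 1000) * pricePer1kTokens
}

// 3,000 tokens × 2.4M conversations at ~$0.01 per 1K tokens ≈ $72,000/month
const unoptimized = estimateMonthlyInferenceCost(3000, 2_400_000, 0.01)
console.log(`Estimated monthly inference spend: $${unoptimized.toLocaleString()}`)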

CoreCloud's Token Budget System

Per-User Limits:

TYPESCRIPT
300">interface UserTokenBudget {
  userId: string
  plan: &#39;free&#39; | &#39;professional&#39; | &#39;enterprise&#39;
  limits: {
    tokensPerDay: number
    tokensPerMonth: number
    maxContextWindow: number
  }
  usage: {
    tokensToday: number
    tokensThisMonth: number
    lastReset: Date
  }
}

// Example budget configuration
300">const budgets = {
  free: {
    tokensPerDay: 5000,
    tokensPerMonth: 100000,
    maxContextWindow: 4000
  },
  professional: {
    tokensPerDay: 50000,
    tokensPerMonth: 1000000,
    maxContextWindow: 8000
  },
  enterprise: {
    tokensPerDay: -1, // unlimited
    tokensPerMonth: -1,
    maxContextWindow: 32000
  }
}

Real-Time Throttling:

TYPESCRIPT
300">async 300">function checkTokenBudget(userId: string, estimatedTokens: number) {
  300">const budget = 300">await CoreCloud.getTokenBudget(userId)

  // Check daily limit
  300">if (budget.usage.tokensToday + estimatedTokens > budget.limits.tokensPerDay) {
    300">throw 300">new TokenLimitExceededError(&#39;Daily limit reached&#39;)
  }

  // Check monthly limit
  300">if (budget.usage.tokensThisMonth + estimatedTokens > budget.limits.tokensPerMonth) {
    300">throw 300">new TokenLimitExceededError(&#39;Monthly limit reached&#39;)
  }

  // Reserve tokens
  300">await CoreCloud.reserveTokens(userId, estimatedTokens)

  300">return 300">true
}

Brand-Level Budgets

Different products have different economics:

TYPESCRIPT
300">const brandBudgets = {
  growain: {
    // Marketing AI - higher token budgets
    tokensPerConversation: 5000,
    monthlyBudget: 50000000,
    alertThreshold: 0.8
  },

  mindagain: {
    // Wellness AI - empathetic, longer conversations
    tokensPerConversation: 8000,
    monthlyBudget: 30000000,
    alertThreshold: 0.75
  },

  corefinops: {
    // Financial AI - precision over length
    tokensPerConversation: 2000,
    monthlyBudget: 15000000,
    alertThreshold: 0.85
  }
}

Budget Alarms

CloudWatch Alarms via CoreCloud:

TYPESCRIPT
// Set up budget monitoring
300">await CloudWatch.putMetricAlarm({
  AlarmName: &#39;GrowainTokenBudget80Percent&#39;,
  MetricName: &#39;TokenUsage&#39;,
  Namespace: &#39;Cloudain/CoreCloud&#39;,
  Statistic: &#39;Sum&#39;,
  Period: 3600, // 1 hour
  EvaluationPeriods: 1,
  Threshold: brandBudgets.growain.monthlyBudget * 0.8,
  ComparisonOperator: &#39;GreaterThanThreshold&#39;,
  AlarmActions: [snsTopicArn],
  TreatMissingData: &#39;notBreaching&#39;
})

Result: 40% cost reduction through intelligent budgeting and throttling.

Memory TTL and Caching in AgenticCloud

The Memory Challenge

AI conversations need context:

CODE
User: "What&#39;s the weather?"
AI: "It&#39;s sunny, 72°F"
User: "Should I bring an umbrella?"
AI: [needs to remember previous weather response]

But storing every conversation indefinitely is expensive and unnecessary.

Multi-Tier Memory Architecture

Tier 1: Hot Cache (ElastiCache Redis)

  • Duration: Active session + 5 minutes
  • Data: Current conversation context
  • Latency: <5ms
  • Use case: Real-time conversations
TYPESCRIPT
300">interface SessionCache {
  sessionId: string
  userId: string
  brand: string
  context: {
    messages: Message[]
    userPreferences: object
    activeTools: string[]
  }
  ttl: number // 300 seconds after last activity
}

// Cache management
300">await Redis.setex(
  &#96;session:${sessionId}&#96;,
  300, // 5 minutes
  JSON.stringify(sessionContext)
)

Tier 2: Warm Storage (DynamoDB)

  • Duration: 7 days for free users, 90 days for paid
  • Data: Conversation history, summaries
  • Latency: <50ms
  • Use case: Recent conversation recall
TYPESCRIPT
300">interface ConversationHistory {
  conversationId: string
  userId: string
  brand: string
  messages: Message[]
  summary: string // AI-generated summary for quick context
  createdAt: number
  ttl: number // Auto-delete via DynamoDB TTL
}

// Set TTL based on user plan (DynamoDB TTL expects epoch seconds, not milliseconds)
const retentionDays = user.plan === 'free' ? 7 : 90
const ttl = Math.floor(Date.now() / 1000) + retentionDays * 24 * 60 * 60

await DynamoDB.putItem({
  TableName: 'ConversationHistory',
  Item: {
    ...conversationData,
    ttl: ttl
  }
})

Tier 3: Cold Storage (S3 + Glacier)

  • Duration: Enterprise users, indefinite
  • Data: Complete audit trail, analytics
  • Latency: Minutes to hours
  • Use case: Compliance, analytics (see the archival sketch below)
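
As an illustration, a finished conversation could be pushed to the archive tier roughly like this. The sketch assumes the same simplified SDK-style S3 client used later in this article; the object key layout and the direct-to-Glacier write are illustrative assumptions, not our exact pipeline.

TYPESCRIPT
// Archive a closed conversation to the cold tier (illustrative key layout)
async function archiveConversation(conversation: ConversationHistory) {
  await S3.putObject({
    Bucket: 'cloudain-conversation-archive',
    Key: `brand=${conversation.brand}/user=${conversation.userId}/${conversation.conversationId}.json`,
    Body: JSON.stringify(conversation),
    // Write straight to an archival storage class; retrieval takes minutes to hours
    StorageClass: 'GLACIER',
    ServerSideEncryption: 'aws:kms'
  })
}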

Smart Context Window Management

Problem: Loading full conversation history is slow and expensive.

Solution: Intelligent summarization and context pruning.

TYPESCRIPT
300">async 300">function loadContextWindow(conversationId: string, maxTokens: number) {
  // Get recent messages 300">from cache
  300">const recentMessages = 300">await Redis.lrange(
    &#96;conv:${conversationId}:messages&#96;,
    -10, // Last 10 messages
    -1
  )

  // Calculate token count
  300">let tokenCount = countTokens(recentMessages)

  // If under budget, 300">return as-is
  300">if (tokenCount <= maxTokens) {
    300">return recentMessages
  }

  // Otherwise, load summary 300">from DynamoDB
  300">const summary = 300">await DynamoDB.getItem({
    TableName: &#39;ConversationHistory&#39;,
    Key: { conversationId },
    ProjectionExpression: &#39;summary&#39;
  })

  // Combine summary with recent messages
  300">return [
    { role: &#39;system&#39;, content: summary },
    ...recentMessages.slice(-5) // Keep last 5 messages
  ]
}

Caching Strategy

Response Caching for Common Queries:

TYPESCRIPT
// Cache frequent responses
300">const cacheKey = &#96;response:${brand}:${normalizeQuery(userMessage)}&#96;

// Check cache first
300">const cached = 300">await Redis.get(cacheKey)
300">if (cached) {
  300">return JSON.parse(cached)
}

// Generate 300">new response
300">const response = 300">await generateAIResponse(userMessage)

// Cache for 1 hour
300">await Redis.setex(cacheKey, 3600, JSON.stringify(response))

300">return response

Cache Invalidation (see the sketch after this list):

  • User preferences change → Clear user cache
  • Brand config updates → Clear brand cache
  • Model update → Clear all cached responses
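
Those rules translate into prefix-based deletes against the cache. This is a minimal sketch, assuming an ioredis-style client behind the Redis object used above and illustrative key prefixes; a SCAN-based sweep avoids blocking the cache the way KEYS would.

TYPESCRIPT
// Delete all keys matching a prefix using SCAN (non-blocking, unlike KEYS)
async function invalidateByPrefix(prefix: string) {
  let cursor = '0'
  do {
    const [next, keys] = await Redis.scan(cursor, 'MATCH', `${prefix}*`, 'COUNT', 500)
    if (keys.length > 0) {
      await Redis.del(...keys)
    }
    cursor = next
  } while (cursor !== '0')
}

// User preferences changed → drop that user's cached context
await invalidateByPrefix(`user:${userId}:`)

// Brand config updated → drop that brand's cached responses
await invalidateByPrefix(`response:${brand}:`)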

Memory Cost Optimization Results

Before optimization:

  • 18M messages × 500 bytes = 9GB stored
  • DynamoDB cost: ~$2,250/month
  • ElastiCache cost: ~$800/month
  • Total: ~$3,050/month

After optimization:

  • Hot cache: 50K active sessions × 20KB = 1GB
  • Warm storage: 5M recent messages × 300 bytes = 1.5GB
  • Cold storage: S3 Glacier
  • Total: ~$450/month (85% reduction)

Lambda Performance Optimization

Cold Start Elimination

Problem: Lambda cold starts add 300-500ms latency.

Solution 1: Provisioned Concurrency

TYPESCRIPT
// For high-traffic endpoints
300">await Lambda.putProvisionedConcurrencyConfig({
  FunctionName: &#39;growain-chat-handler&#39;,
  ProvisionedConcurrentExecutions: 50 // Keep 50 instances warm
})

Cost: ~$200/month for 50 warm instances. Benefit: zero cold starts and a guaranteed <200ms response.

Solution 2: Lambda SnapStart (Java/Python)

Reduces cold starts from ~500ms to under 50ms by resuming functions from a cached snapshot of the initialized execution environment; a configuration sketch follows.
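
Enabling it is a per-function configuration change that takes effect on published versions. A minimal sketch, assuming the same SDK-style Lambda client used in the other snippets; the function name is illustrative.

TYPESCRIPT
// Enable SnapStart on the function; the cached snapshot is used by published versions
await Lambda.updateFunctionConfiguration({
  FunctionName: 'cloudain-platform-chat-handler', // hypothetical function name
  SnapStart: { ApplyOn: 'PublishedVersions' }
})

// Publish a new version so invocations resume from the snapshot instead of cold-initializing
await Lambda.publishVersion({ FunctionName: 'cloudain-platform-chat-handler' })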

Solution 3: Keep-Alive Pinging

TYPESCRIPT
// Ping Lambda every 5 minutes to keep warm
setInterval(300">async () => {
  300">await Lambda.invoke({
    FunctionName: &#39;mindagain-chat-handler&#39;,
    InvocationType: &#39;Event&#39;, // Async
    Payload: JSON.stringify({ 300">type: &#39;warmup&#39; })
  })
}, 5 * 60 * 1000)

Memory and CPU Optimization

Right-Sizing Lambda Memory:

TYPESCRIPT
// Memory affects both RAM and CPU
300">const lambdaConfigs = {
  &#39;lightweight-routing&#39;: {
    memory: 512, // MB
    timeout: 5 // seconds
  },
  &#39;ai-inference&#39;: {
    memory: 3008, // More CPU for faster processing
    timeout: 30
  },
  &#39;batch-processing&#39;: {
    memory: 10240, // Maximum
    timeout: 900 // 15 minutes
  }
}

Testing revealed:

  • 512MB: 800ms average response time
  • 1024MB: 450ms average response time
  • 3008MB: 200ms average response time

Sweet spot for our AI workloads: 3,008 MB, which buys roughly 2 vCPUs of proportional compute (the full 6-vCPU allocation only arrives at the 10,240 MB maximum). The configuration sketch below applies it.
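
Applying the measured sweet spot is a one-line configuration change per function. A sketch assuming the same SDK-style Lambda client; the routing function name is illustrative.

TYPESCRIPT
// Apply the measured sweet spot to the inference-heavy function
await Lambda.updateFunctionConfiguration({
  FunctionName: 'growain-chat-handler',
  MemorySize: 3008, // ~200ms average response in our tests
  Timeout: 30
})

// Keep lightweight routing functions small to avoid paying for unused memory
await Lambda.updateFunctionConfiguration({
  FunctionName: 'brand-router', // hypothetical routing function
  MemorySize: 512,
  Timeout: 5
})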

Concurrent Execution Limits

TYPESCRIPT
// Prevent runaway costs
300">await Lambda.putFunctionConcurrency({
  FunctionName: &#39;growain-chat-handler&#39;,
  ReservedConcurrentExecutions: 100 // Max 100 parallel executions
})

Reserved vs. Unreserved:

  • Reserved: Guaranteed capacity, prevents throttling
  • Unreserved: Shared pool, may throttle under load

DynamoDB Performance Tuning

Table Design for Low Latency

Partition Key Strategy:

TYPESCRIPT
// Good: Distributes load evenly
{
  PK: 'USER#user_789',
  SK: 'CONV#2025-11-04T10:30:00Z'
}

// Bad: Hot partition
{
  PK: 'BRAND#growain', // All requests hit same partition
  SK: 'CONV#...'
}

Global Secondary Indexes (GSI):

TYPESCRIPT
// Query by multiple access patterns
{
  TableName: 'Conversations',
  KeySchema: [
    { AttributeName: 'userId', KeyType: 'HASH' },
    { AttributeName: 'timestamp', KeyType: 'RANGE' }
  ],
  GlobalSecondaryIndexes: [
    {
      IndexName: 'BrandTimestampIndex',
      KeySchema: [
        { AttributeName: 'brand', KeyType: 'HASH' },
        { AttributeName: 'timestamp', KeyType: 'RANGE' }
      ]
    }
  ]
}

Read/Write Capacity Optimization

On-Demand vs. Provisioned:

TYPESCRIPT
// On-Demand: Pay per request (good for variable traffic)
{
  BillingMode: 'PAY_PER_REQUEST'
}

// Provisioned: Reserve capacity (good for predictable traffic)
{
  BillingMode: 'PROVISIONED',
  ProvisionedThroughput: {
    ReadCapacityUnits: 100,
    WriteCapacityUnits: 50
  }
}

Auto-Scaling:

TYPESCRIPT
300">await AutoScaling.registerScalableTarget({
  ServiceNamespace: &#39;dynamodb&#39;,
  ResourceId: &#39;table/Conversations&#39;,
  ScalableDimension: &#39;dynamodb:table:ReadCapacityUnits&#39;,
  MinCapacity: 50,
  MaxCapacity: 500
})

300">await AutoScaling.putScalingPolicy({
  PolicyName: &#39;DynamoDBReadAutoScaling&#39;,
  ServiceNamespace: &#39;dynamodb&#39;,
  ResourceId: &#39;table/Conversations&#39;,
  ScalableDimension: &#39;dynamodb:table:ReadCapacityUnits&#39;,
  PolicyType: &#39;TargetTrackingScaling&#39;,
  TargetTrackingScalingPolicyConfiguration: {
    TargetValue: 70.0, // 70% utilization
    PredefinedMetricSpecification: {
      PredefinedMetricType: &#39;DynamoDBReadCapacityUtilization&#39;
    }
  }
})

Batch Operations

Batch Reads:

TYPESCRIPT
// Instead of 100 individual GetItem calls
300">const items = 300">await DynamoDB.batchGetItem({
  RequestItems: {
    &#39;Conversations&#39;: {
      Keys: conversationIds.map(id => ({ conversationId: id }))
    }
  }
})

// Reduces latency 300">from 5,000ms to 150ms

Batch Writes:

TYPESCRIPT
// Write 25 items in one request
300">await DynamoDB.batchWriteItem({
  RequestItems: {
    &#39;AuditLogs&#39;: auditEvents.map(event => ({
      PutRequest: { Item: event }
    }))
  }
})

S3 and KMS Integration

Secure Data Storage

Encryption at Rest:

TYPESCRIPT
300">await S3.putBucketEncryption({
  Bucket: &#39;cloudain-conversation-archive&#39;,
  ServerSideEncryptionConfiguration: {
    Rules: [{
      ApplyServerSideEncryptionByDefault: {
        SSEAlgorithm: &#39;aws:kms&#39;,
        KMSMasterKeyID: kmsKeyId
      }
    }]
  }
})

Intelligent Tiering

Lifecycle Policies:

TYPESCRIPT
300">await S3.putBucketLifecycleConfiguration({
  Bucket: &#39;cloudain-conversation-archive&#39;,
  LifecycleConfiguration: {
    Rules: [
      {
        Id: &#39;ArchiveOldConversations&#39;,
        Status: &#39;Enabled&#39;,
        Transitions: [
          {
            Days: 90,
            StorageClass: &#39;STANDARD_IA&#39; // Infrequent Access
          },
          {
            Days: 365,
            StorageClass: &#39;GLACIER&#39; // Long-term archive
          }
        ]
      }
    ]
  }
})

Cost Impact:

  • Standard: $0.023/GB/month
  • IA: $0.0125/GB/month
  • Glacier: $0.004/GB/month

For 100GB of conversation data:

  • All Standard: $2.30/month
  • 90-day tiering: $0.60/month (74% savings)

Real-World Performance Metrics

Latency Breakdown

Growain Marketing Chat (P95):

CODE
API Gateway validation: 8ms
JWT verification (CoreCloud): 12ms
Cache check (Redis): 4ms
Context loading (DynamoDB): 38ms
AI inference (Bedrock): 520ms
Response formatting: 6ms
─────────────────────────────
Total: 588ms ✓

Cloudain Platform Support (P95):

CODE
Authentication: 15ms
Session lookup: 25ms
AI inference: 380ms
Audit logging: 8ms
─────────────────────────────
Total: 428ms ✓

Cost Per Conversation

Before Optimization:

CODE
Lambda execution: $0.008
AI inference: $0.035
DynamoDB: $0.004
Total: $0.047 per conversation
Monthly (2.4M): $112,800

After Optimization:

CODE
Lambda (provisioned): $0.002
AI inference (cached): $0.012
DynamoDB (on-demand): $0.001
Redis cache: $0.0003
Total: $0.0153 per conversation
Monthly (2.4M): $36,720

67% cost reduction

Monitoring and Observability

CloudWatch Dashboards

Key Metrics:

  • Lambda concurrent executions
  • API Gateway 4xx/5xx errors
  • DynamoDB throttling events
  • Token usage by brand
  • Average response time
  • Cache hit ratio

Custom Metrics

TYPESCRIPT
// Publish custom CloudWatch metrics
300">await CloudWatch.putMetricData({
  Namespace: &#39;Cloudain/AgenticCloud&#39;,
  MetricData: [{
    MetricName: &#39;AIInferenceLatency&#39;,
    Value: inferenceTime,
    Unit: &#39;Milliseconds&#39;,
    Dimensions: [
      { Name: &#39;Brand&#39;, Value: &#39;growain&#39; },
      { Name: &#39;Model&#39;, Value: &#39;bedrock-claude&#39; }
    ]
  }]
})

Alerts

TYPESCRIPT
// Alert on high latency
300">await CloudWatch.putMetricAlarm({
  AlarmName: &#39;HighAILatency&#39;,
  MetricName: &#39;AIInferenceLatency&#39;,
  Namespace: &#39;Cloudain/AgenticCloud&#39;,
  Statistic: &#39;Average&#39;,
  Period: 300,
  EvaluationPeriods: 2,
  Threshold: 1000, // 1 second
  ComparisonOperator: &#39;GreaterThanThreshold&#39;,
  AlarmActions: [snsTopicArn]
})

Lessons Learned

1. Provisioned Concurrency is Worth It

For high-traffic endpoints, the cost of provisioned concurrency ($200/month) is far less than the cost of poor user experience.

2. Memory != Cost (Sometimes)

Higher Lambda memory means faster execution, which can reduce total cost despite higher per-second rates.

3. Cache Everything (Intelligently)

But implement proper invalidation; a stale cache is worse than no cache.

4. DynamoDB TTL is Free

Use it aggressively for automatic data cleanup.
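
Enabling TTL is a one-time table setting that points DynamoDB at the expiry attribute; a minimal sketch using the ttl attribute from the conversation-history example above.

TYPESCRIPT
// Tell DynamoDB which attribute holds the expiry timestamp (epoch seconds)
await DynamoDB.updateTimeToLive({
  TableName: 'ConversationHistory',
  TimeToLiveSpecification: {
    AttributeName: 'ttl',
    Enabled: true
  }
})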

5. Monitor Token Usage Obsessively

AI costs can spiral fast. Real-time budgets are essential.

Conclusion

Serverless doesn't mean giving up on performance or state management. With the right architecture, combining Lambda, DynamoDB, ElastiCache, and intelligent caching, you can build AI systems that are:

  • Fast: sub-second response times
  • Scalable: millions of conversations per month
  • Cost-effective: 67% lower cost per conversation
  • Stateful: context maintained across sessions

The key lessons:

  • Use provisioned concurrency for critical paths
  • Implement multi-tier memory with TTLs
  • Cache aggressively with smart invalidation
  • Monitor and budget AI inference costs
  • Right-size Lambda memory based on actual performance

At Cloudain, this architecture powers Growain, MindAgain, CoreFinOps, and the Cloudain Platform, proving that serverless can handle enterprise AI workloads.

Build High-Performance AI on AWS

Ready to optimize your serverless AI architecture?

Schedule an Architecture Review →

Learn how CoreCloud and AgenticCloud can help you scale efficiently.

