What We're Building
This guide walks you through adding AI capabilities to an existing SaaS product. Unlike building AI-first applications from scratch, integrating AI into established products requires careful consideration of existing architecture, user workflows, and business models.
We'll cover the most common AI integration patterns that SaaS companies are adopting:
Streaming Chat
Real-time AI assistant with context awareness
Document Analysis
Extract insights from uploaded files
Content Generation
AI-powered reports and summaries
Usage Billing
Token-based metering and limits
By the end of this guide, you'll have production-ready patterns for integrating multiple AI providers, handling rate limits, streaming responses, and tracking AI usage for billing purposes.
Prerequisites
Before starting this guide, make sure you have the following:
- Existing SaaS application - A working product with authentication, database, and API infrastructure
- API keys - Claude API key (Anthropic) and/or OpenAI API key
- Node.js 18+ and experience with TypeScript
- PostgreSQL database for storing usage data and conversation history
- Redis instance for rate limiting and job queues (can use Upstash for serverless)
- Basic understanding of streaming responses and Server-Sent Events (SSE)
We recommend using both Claude and OpenAI APIs. This provides redundancy if one service has issues, and allows you to use the best model for each task (e.g., Claude for nuanced conversations, GPT-4 for structured outputs).
Tech Stack Specification
Here's the recommended technology stack for AI integration into existing SaaS products:
| Layer | Technology | Why This Choice |
|---|---|---|
| AI Integration | Claude API, OpenAI | Core AI capabilities with provider redundancy and model flexibility |
| Caching | Redis / Upstash | Response caching to reduce API costs and improve latency |
| Queue | BullMQ / Inngest | Background processing for long-running AI tasks without timeouts |
| Rate Limiting | Upstash | Usage management with sliding window algorithms per tier |
| Database | PostgreSQL | Usage tracking, conversation history, and cost analytics |
AI Integration Architecture
Here's how the AI components integrate with a typical SaaS architecture:
┌──────────────────────────────────────────────────────────────────────┐
│                        CLIENT (React/Next.js)                        │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│   │ Chat Widget  │    │  Doc Upload  │    │  AI Reports  │           │
│   └──────┬───────┘    └──────┬───────┘    └──────┬───────┘           │
│          │                   │                   │                   │
│          └───────────────────┼───────────────────┘                   │
│                              │ SSE / REST                            │
└──────────────────────────────┼───────────────────────────────────────┘
                               │
┌──────────────────────────────┼───────────────────────────────────────┐
│                        API LAYER (Node.js)                           │
│   ┌──────────────────────────────────────────────────────────────┐   │
│   │                     Rate Limiter (Redis)                     │   │
│   │      Per-user limits │ Per-org limits │ Global limits        │   │
│   └──────────────────────────────────────────────────────────────┘   │
│          │                   │                   │                   │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│   │  AI Router   │    │ Usage Tracker│    │  Job Queue   │           │
│   │  Claude/GPT  │    │ Tokens/Costs │    │    BullMQ    │           │
│   └──────┬───────┘    └──────┬───────┘    └──────┬───────┘           │
└──────────┼───────────────────┼───────────────────┼───────────────────┘
           │                   │                   │
┌──────────┼───────────────────┼───────────────────┼───────────────────┐
│          ▼                   ▼                   ▼                   │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│   │  Claude API  │    │  PostgreSQL  │    │    Redis     │           │
│   │  OpenAI API  │    │  Usage logs  │    │ Queue + Rate │           │
│   └──────────────┘    └──────────────┘    └──────────────┘           │
│                             DATA LAYER                               │
└──────────────────────────────────────────────────────────────────────┘
AI Agent Workflow
Here's how to leverage AI tools throughout this build to maximize productivity:
Project Setup with Claude Code
Use Claude Code to scaffold the AI integration layer based on your existing codebase structure:
# Example prompt for Claude Code
"Analyze my existing SaaS codebase and create an AI integration
layer that includes:
1. A unified AI client that supports both Claude and OpenAI
2. Rate limiting middleware using Redis
3. Usage tracking with PostgreSQL
4. Streaming response handlers for SSE
Follow my existing patterns for error handling and authentication.
Use TypeScript and match my current code style."
UI Generation with v0.dev
Generate chat interfaces and AI-powered UI components quickly:
When prompting v0.dev, describe your existing design system colors and patterns. Ask for "a chat widget that matches a dark SaaS dashboard with purple accent colors" to get components that integrate seamlessly.
Development with Cursor
Cursor excels at understanding API documentation. Paste Claude or OpenAI API docs directly and ask Cursor to generate type-safe client wrappers and error handling.
Step-by-Step Build Guide
Phase 1: AI Architecture Patterns for SaaS
Before writing code, establish your AI architecture patterns and database schema. These patterns apply to any SaaS product adding AI features, whether you're building chat, analysis, or generation capabilities:
-- AI usage tracking schema
CREATE TABLE ai_conversations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id),
org_id UUID REFERENCES organizations(id),
title VARCHAR(255),
context JSONB, -- Store app-specific context
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE ai_messages (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
conversation_id UUID REFERENCES ai_conversations(id),
role VARCHAR(20) NOT NULL, -- 'user', 'assistant', 'system'
content TEXT NOT NULL,
model VARCHAR(50), -- 'claude-3-opus', 'gpt-4', etc.
tokens_input INTEGER,
tokens_output INTEGER,
cost_cents DECIMAL(10,4),
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE ai_usage_daily (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID REFERENCES organizations(id),
date DATE NOT NULL,
total_tokens BIGINT DEFAULT 0,
total_requests INTEGER DEFAULT 0,
total_cost_cents DECIMAL(10,2) DEFAULT 0,
UNIQUE(org_id, date)
);
CREATE INDEX idx_ai_messages_conversation ON ai_messages(conversation_id);
CREATE INDEX idx_ai_usage_org_date ON ai_usage_daily(org_id, date);
Phase 2: Claude API Integration Setup
Start by initializing both provider clients and a shared model configuration. This thin layer makes it easy to switch models or add providers later; Phase 3 builds the unified client with automatic failover on top of it:
// lib/ai/providers.ts
import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';
// Initialize API clients
export const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY!,
});
export const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY!,
});
// Model configurations for different use cases
export const AI_MODELS = {
// Claude models - great for nuanced conversation
claude: {
fast: 'claude-3-haiku-20240307',
balanced: 'claude-3-sonnet-20240229',
powerful: 'claude-3-opus-20240229',
},
// OpenAI models - great for structured output
openai: {
fast: 'gpt-4o-mini',
balanced: 'gpt-4o',
powerful: 'gpt-4-turbo',
},
} as const;
// Helper to select model based on task complexity
export function selectModel(
provider: 'claude' | 'openai',
complexity: 'fast' | 'balanced' | 'powerful'
): string {
return AI_MODELS[provider][complexity];
}
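As a concrete illustration of how selectModel might be driven, here is a naive routing heuristic: keyword cues first, then prompt length. The estimateComplexity function is an assumption for illustration, not part of the API above, and the model tables are reproduced from providers.ts so the snippet stands alone:

```typescript
const AI_MODELS = {
  claude: {
    fast: 'claude-3-haiku-20240307',
    balanced: 'claude-3-sonnet-20240229',
    powerful: 'claude-3-opus-20240229',
  },
  openai: {
    fast: 'gpt-4o-mini',
    balanced: 'gpt-4o',
    powerful: 'gpt-4-turbo',
  },
} as const;

type Complexity = 'fast' | 'balanced' | 'powerful';

// Guess task complexity from the prompt itself (illustrative heuristic)
function estimateComplexity(prompt: string): Complexity {
  if (/analy[sz]e|compare|reason|step[- ]by[- ]step/i.test(prompt)) return 'powerful';
  if (prompt.length < 200) return 'fast';
  return 'balanced';
}

function selectModel(
  provider: 'claude' | 'openai',
  complexity: Complexity
): string {
  return AI_MODELS[provider][complexity];
}

// A short, simple prompt routes to the cheap tier
const model = selectModel('claude', estimateComplexity('Summarize this sentence.'));
// model === 'claude-3-haiku-20240307'
```

In production you would replace the regex heuristic with something informed by your actual workloads, but the shape of the routing stays the same.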
Phase 3: Streaming Responses Implementation
Streaming is essential for responsive AI chat. Users see tokens as they arrive instead of waiting for the complete response. This unified client supports both Claude and OpenAI with automatic fallback:
// lib/ai/unified-client.ts
import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
export interface ChatMessage {
role: 'user' | 'assistant' | 'system';
content: string;
}
export interface StreamCallbacks {
onToken: (token: string) => void;
onComplete: (usage: { input: number; output: number }) => void;
onError: (error: Error) => void;
}
export async function streamChat(
messages: ChatMessage[],
options: {
model?: string;
systemPrompt?: string;
maxTokens?: number;
},
callbacks: StreamCallbacks
): Promise<void> {
const model = options.model || 'claude-3-sonnet-20240229';
const isClaude = model.startsWith('claude');
try {
if (isClaude) {
await streamClaude(messages, options, callbacks);
} else {
await streamOpenAI(messages, options, callbacks);
}
} catch (error) {
// Fallback to alternate provider on failure
console.error(`Primary AI failed, attempting fallback`, error);
try {
if (isClaude) {
await streamOpenAI(messages, { ...options, model: 'gpt-4o' }, callbacks);
} else {
await streamClaude(messages, { ...options, model: 'claude-3-sonnet-20240229' }, callbacks);
}
} catch (fallbackError) {
callbacks.onError(fallbackError as Error);
}
}
}
async function streamClaude(
messages: ChatMessage[],
options: { systemPrompt?: string; maxTokens?: number; model?: string },
callbacks: StreamCallbacks
): Promise<void> {
const stream = await anthropic.messages.stream({
model: options.model || 'claude-3-sonnet-20240229',
max_tokens: options.maxTokens || 1024,
system: options.systemPrompt,
messages: messages.filter(m => m.role !== 'system').map(m => ({
role: m.role as 'user' | 'assistant',
content: m.content,
})),
});
for await (const event of stream) {
if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
callbacks.onToken(event.delta.text);
}
}
const finalMessage = await stream.finalMessage();
callbacks.onComplete({
input: finalMessage.usage.input_tokens,
output: finalMessage.usage.output_tokens,
});
}
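streamChat above calls a streamOpenAI counterpart that is never shown. Here is one possible sketch. Two deliberate deviations from the call sites above, made so the snippet is self-contained: the client is injected as a parameter (typed structurally, so no SDK import is needed), and token usage is read from the final chunk via `stream_options: { include_usage: true }`. In the app you would pass the module-level `openai` instance and keep the original three-argument signature.

```typescript
interface ChatMessage {
  role: 'user' | 'assistant' | 'system';
  content: string;
}

interface StreamCallbacks {
  onToken: (token: string) => void;
  onComplete: (usage: { input: number; output: number }) => void;
  onError: (error: Error) => void;
}

// Minimal structural slice of the OpenAI SDK surface this sketch touches
interface OpenAIChunk {
  choices: { delta: { content?: string | null } }[];
  usage?: { prompt_tokens: number; completion_tokens: number } | null;
}

interface OpenAILike {
  chat: {
    completions: {
      create(args: {
        model: string;
        max_tokens: number;
        messages: ChatMessage[];
        stream: true;
        stream_options: { include_usage: boolean };
      }): Promise<AsyncIterable<OpenAIChunk>>;
    };
  };
}

async function streamOpenAI(
  client: OpenAILike,
  messages: ChatMessage[],
  options: { systemPrompt?: string; maxTokens?: number; model?: string },
  callbacks: StreamCallbacks
): Promise<void> {
  // Unlike Claude, OpenAI takes the system prompt as an ordinary leading message
  const fullMessages: ChatMessage[] = options.systemPrompt
    ? [{ role: 'system', content: options.systemPrompt }, ...messages]
    : messages;

  const stream = await client.chat.completions.create({
    model: options.model || 'gpt-4o',
    max_tokens: options.maxTokens || 1024,
    messages: fullMessages,
    stream: true,
    stream_options: { include_usage: true }, // final chunk reports token usage
  });

  let usage = { input: 0, output: 0 };
  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content;
    if (token) callbacks.onToken(token);
    if (chunk.usage) {
      usage = { input: chunk.usage.prompt_tokens, output: chunk.usage.completion_tokens };
    }
  }
  callbacks.onComplete(usage);
}
```

The structural typing also makes the function trivially testable with a fake client, which is useful for exercising the fallback path in streamChat without burning tokens.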
Phase 4: Usage Tracking and Rate Limiting
Implement rate limiting before exposing AI endpoints. This protects your costs and ensures fair usage across customers. Use Upstash for serverless-friendly rate limiting:
// lib/ai/rate-limiter.ts
import { Redis } from '@upstash/redis';
import { Ratelimit } from '@upstash/ratelimit';
const redis = new Redis({
url: process.env.UPSTASH_REDIS_URL!,
token: process.env.UPSTASH_REDIS_TOKEN!,
});
// Different limits for different subscription tiers
export const rateLimiters = {
// Free tier: 10 requests per minute
free: new Ratelimit({
redis,
limiter: Ratelimit.slidingWindow(10, '1 m'),
prefix: 'ai:free',
}),
// Pro tier: 60 requests per minute
pro: new Ratelimit({
redis,
limiter: Ratelimit.slidingWindow(60, '1 m'),
prefix: 'ai:pro',
}),
// Enterprise: 200 requests per minute
enterprise: new Ratelimit({
redis,
limiter: Ratelimit.slidingWindow(200, '1 m'),
prefix: 'ai:enterprise',
}),
};
export async function checkRateLimit(
orgId: string,
tier: 'free' | 'pro' | 'enterprise'
): Promise<{ success: boolean; remaining: number }> {
const limiter = rateLimiters[tier];
const { success, remaining } = await limiter.limit(orgId);
return { success, remaining };
}
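To show how a route might consume checkRateLimit, here is a small sketch. The pure limitToResponse helper maps a limiter result onto an HTTP status and headers; the Retry-After value of 60 seconds is an assumption matching the one-minute sliding windows above, and the commented lines show the wiring into a Next.js-style handler (getOrgTier is a hypothetical lookup):

```typescript
interface LimitResult {
  success: boolean;
  remaining: number;
}

// Pure decision helper: translate a limiter result into response metadata
function limitToResponse(result: LimitResult): {
  status: number;
  headers: Record<string, string>;
} {
  if (!result.success) {
    // 60s Retry-After mirrors the 1-minute sliding window configured above
    return {
      status: 429,
      headers: { 'Retry-After': '60', 'X-RateLimit-Remaining': '0' },
    };
  }
  return {
    status: 200,
    headers: { 'X-RateLimit-Remaining': String(result.remaining) },
  };
}

// In a route handler (sketch):
// const { success, remaining } = await checkRateLimit(orgId, await getOrgTier(orgId));
// const { status, headers } = limitToResponse({ success, remaining });
// if (status === 429) return new Response('Rate limit exceeded', { status, headers });
```

Surfacing the remaining quota in a header lets the chat widget warn users before they hit the wall instead of failing mid-conversation.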
Track usage for billing and analytics with a dedicated usage tracker:
// lib/ai/usage-tracker.ts
import { db } from '@/lib/db';
interface UsageRecord {
orgId: string;
userId: string;
model: string;
tokensInput: number;
tokensOutput: number;
}
// Model pricing per 1K tokens (in cents)
const PRICING = {
'claude-3-opus': { input: 1.5, output: 7.5 },
'claude-3-sonnet': { input: 0.3, output: 1.5 },
'claude-3-haiku': { input: 0.025, output: 0.125 },
'gpt-4-turbo': { input: 1.0, output: 3.0 },
'gpt-4o': { input: 0.5, output: 1.5 },
};
export async function trackUsage(usage: UsageRecord): Promise<void> {
// Match full model IDs (e.g. 'claude-3-sonnet-20240229') against pricing key
// prefixes; unknown models are still recorded, at zero cost, rather than
// crashing the tracker on an undefined lookup
const pricingKey = Object.keys(PRICING).find(k => usage.model.startsWith(k));
const pricing = pricingKey
? PRICING[pricingKey as keyof typeof PRICING]
: { input: 0, output: 0 };
const costCents =
(usage.tokensInput / 1000) * pricing.input +
(usage.tokensOutput / 1000) * pricing.output;
// Update daily aggregates (upsert pattern)
await db.query(`
INSERT INTO ai_usage_daily (org_id, date, total_tokens, total_requests, total_cost_cents)
VALUES ($1, CURRENT_DATE, $2, 1, $3)
ON CONFLICT (org_id, date)
DO UPDATE SET
total_tokens = ai_usage_daily.total_tokens + $2,
total_requests = ai_usage_daily.total_requests + 1,
total_cost_cents = ai_usage_daily.total_cost_cents + $3
`, [
usage.orgId,
usage.tokensInput + usage.tokensOutput,
costCents
]);
}
export async function checkMonthlyLimit(
orgId: string,
limitCents: number
): Promise<{ withinLimit: boolean; used: number }> {
const result = await db.query(`
SELECT COALESCE(SUM(total_cost_cents), 0) as total
FROM ai_usage_daily
WHERE org_id = $1
AND date >= DATE_TRUNC('month', CURRENT_DATE)
`, [orgId]);
const used = parseFloat(result.rows[0].total);
return { withinLimit: used < limitCents, used };
}
Phase 5: Background AI Processing
Long-running AI tasks like document analysis or report generation should run in the background. The example below uses BullMQ over Redis; serverless-friendly alternatives like Inngest or QStash follow the same pattern. This avoids timeout issues and provides better UX:
// lib/ai/content-generator.ts
import { Queue, Worker } from 'bullmq';
import { anthropic } from './providers'; // client initialized in Phase 2
const connection = {
host: process.env.REDIS_HOST,
port: parseInt(process.env.REDIS_PORT || '6379')
};
// Queue for long-running AI generation tasks
export const contentQueue = new Queue('ai-content', { connection });
interface ContentJob {
type: 'report' | 'email' | 'summary';
data: Record<string, any>;
orgId: string;
userId: string;
webhookUrl?: string;
}
// Add a content generation job
export async function queueContentGeneration(
job: ContentJob
): Promise<string> {
const result = await contentQueue.add('generate', job, {
attempts: 3,
backoff: { type: 'exponential', delay: 1000 },
});
return result.id!;
}
// Worker to process content generation
const worker = new Worker('ai-content', async (job) => {
const { type, data, orgId, userId, webhookUrl } = job.data as ContentJob;
const templates = {
report: `Generate a detailed report based on: ${JSON.stringify(data)}`,
email: `Write a professional email: ${JSON.stringify(data)}`,
summary: `Create an executive summary: ${JSON.stringify(data)}`,
};
const response = await anthropic.messages.create({
model: 'claude-3-opus-20240229', // Use Opus for quality content
max_tokens: 4096,
messages: [{ role: 'user', content: templates[type] }],
});
const content = response.content[0].type === 'text'
? response.content[0].text
: '';
// Notify via webhook if provided
if (webhookUrl) {
await fetch(webhookUrl, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ jobId: job.id, content, type }),
});
}
return { content, tokensUsed: response.usage };
}, { connection });
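While the worker runs, clients usually poll a status endpoint. BullMQ exposes `Queue.getJob(id)` and `Job.getState()`; the pure helper below collapses BullMQ's state strings into a response shape an API might return (the exact shape is an assumption), with the queue wiring shown as comments:

```typescript
// BullMQ job states collapse into a small API-facing status
type BullState =
  | 'completed' | 'failed' | 'active'
  | 'waiting' | 'delayed' | 'unknown';

function mapJobState(state: BullState): { status: string; done: boolean } {
  switch (state) {
    case 'completed': return { status: 'done', done: true };
    case 'failed': return { status: 'failed', done: true };
    case 'active': return { status: 'processing', done: false };
    default: return { status: 'queued', done: false };
  }
}

// In a status route (sketch, using contentQueue from above):
// const job = await contentQueue.getJob(jobId);
// const state = job ? await job.getState() : 'unknown';
// return Response.json({ jobId, ...mapJobState(state as BullState) });
```

Polling every few seconds is fine at small scale; the webhookUrl path above is the better fit once job volume grows.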
Phase 6: Caching Strategies for Cost Optimization
Caching AI responses is one of the most impactful ways to reduce costs. Use Redis/Upstash to cache identical or similar queries:
// lib/ai/cache.ts - AI Response Caching
import { Redis } from '@upstash/redis';
import crypto from 'crypto';
const redis = new Redis({
url: process.env.UPSTASH_REDIS_URL!,
token: process.env.UPSTASH_REDIS_TOKEN!,
});
// Cache TTL in seconds (24 hours for most responses)
const CACHE_TTL = 86400;
// Generate a cache key from the prompt and context
function generateCacheKey(
prompt: string,
model: string,
context?: string
): string {
const input = `${model}:${prompt}:${context || ''}`;
return `ai:cache:${crypto.createHash('sha256').update(input).digest('hex')}`;
}
// Check cache before calling AI
export async function getCachedResponse(
prompt: string,
model: string,
context?: string
): Promise<string | null> {
const key = generateCacheKey(prompt, model, context);
return await redis.get(key);
}
// Store response in cache
export async function cacheResponse(
prompt: string,
model: string,
response: string,
context?: string,
ttl: number = CACHE_TTL
): Promise<void> {
const key = generateCacheKey(prompt, model, context);
await redis.setex(key, ttl, response);
}
// Wrapper that handles caching automatically
export async function withCache<T>(
cacheKey: { prompt: string; model: string; context?: string },
fetchFn: () => Promise<T>,
options: { ttl?: number; skipCache?: boolean } = {}
): Promise<T & { cached: boolean }> {
if (!options.skipCache) {
const cached = await getCachedResponse(
cacheKey.prompt,
cacheKey.model,
cacheKey.context
);
if (cached) {
return { ...JSON.parse(cached), cached: true };
}
}
const result = await fetchFn();
await cacheResponse(
cacheKey.prompt,
cacheKey.model,
JSON.stringify(result),
cacheKey.context,
options.ttl
);
return { ...result, cached: false };
}
Cache hit rates of 30-50% are common for SaaS products with repeated queries. Combined with model tiering (Haiku for simple tasks, Sonnet for complex), you can reduce AI costs by 60-80%.
Key caching strategies that work for any SaaS:
- Semantic caching: Cache similar queries, not just identical ones. Use embeddings to find semantically similar prompts.
- Context-aware caching: Include relevant context (user tier, date range, etc.) in the cache key for personalized but cacheable responses.
- Tiered TTLs: Use shorter TTLs (1 hour) for dynamic content, longer TTLs (24+ hours) for factual or reference content.
- Cache warming: Pre-generate responses for common queries during off-peak hours.
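One property worth checking for context-aware caching: keys must be deterministic for identical inputs but distinct across contexts, so personalized entries never leak between orgs. This standalone snippet reproduces generateCacheKey from the cache module to verify exactly that:

```typescript
import crypto from 'crypto';

// Reproduced from lib/ai/cache.ts so the check runs standalone
function generateCacheKey(prompt: string, model: string, context?: string): string {
  const input = `${model}:${prompt}:${context || ''}`;
  return `ai:cache:${crypto.createHash('sha256').update(input).digest('hex')}`;
}

const a = generateCacheKey('Summarize Q3 revenue', 'claude-3-haiku-20240307', 'org_123');
const b = generateCacheKey('Summarize Q3 revenue', 'claude-3-haiku-20240307', 'org_123');
const c = generateCacheKey('Summarize Q3 revenue', 'claude-3-haiku-20240307', 'org_456');
// a === b (same inputs), a !== c (different org context)
```

Anything that should differentiate responses, such as user tier or locale, must go into the context string, or two tenants will silently share cached answers.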
Common Issues and Solutions
Here are some common issues you might encounter and how to solve them:
Serverless Timeouts
Serverless platforms cap execution time, which clashes with slow AI responses. On Vercel, Node serverless functions time out after 10 seconds on the Hobby plan (higher limits are configurable on paid plans), while the Edge runtime (export const runtime = 'edge') can keep streaming well beyond that as long as the response starts promptly. Use the Edge runtime for streaming chat, and move heavy content generation to background jobs (see Phase 5).
Context Window Management
When conversations get long, you'll hit context limits. Implement a sliding window or summarization strategy:
// Rough heuristic: ~4 characters per token for English text
function estimateTokens(text: string): number {
return Math.ceil(text.length / 4);
}
// Trim conversation history to fit the context window
function trimConversation(
messages: ChatMessage[],
maxTokens: number = 100000
): ChatMessage[] {
let tokenCount = 0;
const recent: ChatMessage[] = [];
// Always keep the system message, and keep it first
const system = messages.find(m => m.role === 'system');
// Walk from most recent to oldest, respecting the token limit.
// Copy before reversing so the caller's array is not mutated.
for (const msg of [...messages].reverse()) {
if (msg.role === 'system') continue;
const msgTokens = estimateTokens(msg.content);
if (tokenCount + msgTokens > maxTokens) break;
recent.unshift(msg);
tokenCount += msgTokens;
}
return system ? [system, ...recent] : recent;
}
Controlling AI Costs
AI costs can spiral quickly. Combine rate limiting (Phase 4) with caching (Phase 6) for maximum savings:
- Hard limits per organization: Stop AI access when monthly spend exceeds plan limit
- Model tiering: Use Claude Haiku or GPT-4o-mini for simple tasks, reserve Opus for complex analysis
- Response caching: See Phase 6 for detailed caching patterns - aim for 30-50% cache hit rates
- Alert thresholds: Notify admins when usage hits 80% of limits
Next Steps
You now have a solid foundation for AI integration. Here's how to expand:
- Add RAG (Retrieval-Augmented Generation): Connect a vector database like Pinecone or pgvector to give AI access to your product's knowledge base
- Implement AI Agents: Allow AI to take actions in your app (create tasks, update records) with proper permission guards
- Build Custom Fine-Tuning: Use OpenAI's fine-tuning API to create models specialized for your domain
- Add Voice Interface: Integrate Whisper for voice input and ElevenLabs for AI-generated audio responses
- Create AI Marketplace: Let users create and share custom AI prompts/workflows within your platform
Adding AI to an existing SaaS product is complex. If you need expert guidance on architecture, implementation, or optimization, reach out for a consultation.
Follow the Vibe Coding Enthusiast
Follow JD for product updates on LinkedIn and personal takes on X.