
1. Concurrent Execution in Temporal Workflows

Problem

High concurrency in workflow executions causes excessive memory usage and potential resource exhaustion.

Solution

  • Vertical Scaling – Increase memory and CPU allocation on Fly.io to handle higher concurrency loads.
  • Improve Observability – Implement resource utilization tracking to identify optimal infrastructure configurations, then use those metrics to drive automated scale-up and scale-down (a minimal sampling sketch follows this list).
  • Cloud Migration Evaluation – Assess whether Fly.io provides sufficient observability and scalability. Consider migrating to a more robust cloud platform such as AWS EC2 or Google Cloud (GCP) if limitations are identified.
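
For the Improve Observability item above, a minimal sketch of in-process resource sampling that could feed scaling decisions; the helper name, interval, and console sink are placeholders, not existing code:

  // resource-metrics.ts – hypothetical helper: sample process resource usage on an interval
  export function startResourceSampling(intervalMs = 30_000) {
    return setInterval(() => {
      const mem = process.memoryUsage();
      const cpu = process.cpuUsage();
      // Swap console.log for the real metrics sink (Fly.io metrics, Prometheus, etc.)
      console.log(
        JSON.stringify({
          rssMb: Math.round(mem.rss / 1024 / 1024),
          heapUsedMb: Math.round(mem.heapUsed / 1024 / 1024),
          cpuUserMs: Math.round(cpu.user / 1000),
          cpuSystemMs: Math.round(cpu.system / 1000),
          at: new Date().toISOString(),
        }),
      );
    }, intervalMs);
  }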

2. Absence of a Dedicated Queue for Workflow Automation Tasks

Problem

Two main issues exist:
  1. Concurrent Request Overload – Without a dedicated queue processor, incoming requests can be lost or rejected when Temporal workflows reach capacity or are scaling up.
  2. Inefficient Queue Workaround – A PostgreSQL table is currently being used as a pseudo-queue for storing incoming workflow requests. This design introduces performance bottlenecks due to frequent read operations and the absence of cleanup logic for expired entries. As the table grows, Temporal workflow execution slows down when fetching pending requests.

Solution

  • Use Temporal Task Queues – Investigate whether Temporal Task Queues can handle all incoming tasks effectively.
  • Alternative: Supabase PGMQ – If Temporal queues are insufficient, adopt pgmq (available as a Supabase extension) as a lightweight, durable message queue built on Postgres.
  • Migration – Move the current PostgreSQL-based queue logic to the selected queueing system. Prefer using Temporal’s native Task Queues for tighter integration and reduced complexity.
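
If the Temporal Task Queue route is chosen, the moving parts are small: a worker polls a named task queue and producers start workflows on that queue. A minimal worker sketch, assuming the queue name and module paths shown here (they are illustrative, not existing code):

  // worker.ts – polls the task queue and executes workflow/activity tasks
  import { Worker } from "@temporalio/worker";
  import * as activities from "./activities";

  async function run() {
    const worker = await Worker.create({
      workflowsPath: require.resolve("./workflows"),
      activities,
      taskQueue: "workflow-automation", // illustrative queue name
      maxConcurrentActivityTaskExecutions: 10,
    });
    await worker.run();
  }

  run().catch((err) => {
    console.error(err);
    process.exit(1);
  });

Producers then call client.workflow.start(...) with the same taskQueue, as shown in the Stage 3 enqueue example later in this document.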

3. Lack of Caching

Problem

Temporal workflows frequently query PostgreSQL to retrieve workflow-related data, resulting in latency due to complex and repeated read operations.

Solution

  • Introduce In-Memory Caching – Implement Redis as a caching layer integrated with Supabase PostgreSQL. Redis will store frequently accessed data in memory, significantly reducing query latency and improving workflow throughput.
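
A minimal cache-aside sketch for the frequently accessed data mentioned above, using ioredis; the key names, TTL, and loader are assumptions, not existing code:

  import Redis from "ioredis";

  const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

  // Cache-aside helper: try Redis first, fall back to Postgres, then populate the cache
  export async function cached<T>(key: string, ttlSeconds: number, load: () => Promise<T>): Promise<T> {
    const hit = await redis.get(key);
    if (hit) return JSON.parse(hit) as T;

    const value = await load();
    await redis.set(key, JSON.stringify(value), "EX", ttlSeconds);
    return value;
  }

  // Example (hypothetical table): posting matrix rules change rarely, so a short TTL is a safe start
  // const rules = await cached(`posting-matrix:${organizationId}`, 600, () =>
  //   prisma.posting_matrix_rules.findMany({ where: { org_id: organizationId } }),
  // );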

4. Temporal Workflow Updates

Problem

Updating Temporal workflows may cause ongoing executions to hang or fail during deployment.

Solution

  • Consult Temporal Team – Engage with Temporal support to identify update-safe mechanisms or best practices (one such mechanism is sketched after this list).
  • Container Replication Strategy – If the issue originates from Fly.io deployment behavior, ensure that running containers can be replicated during updates to maintain continuity of workflow execution.
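
One update-safe mechanism Temporal documents for this situation is the patched()/deprecatePatch() API, which lets new worker code branch deterministically so in-flight executions keep replaying the old path. A minimal sketch; the workflow, activity names, and module path are illustrative:

  import { patched, proxyActivities } from "@temporalio/workflow";
  import type * as activities from "../activities"; // assumed path

  const { processDocument, processDocumentV2 } = proxyActivities<typeof activities>({
    startToCloseTimeout: "1 minute",
  });

  export async function exampleWorkflow(documentId: string) {
    if (patched("use-process-document-v2")) {
      // New executions (and replays that recorded this patch marker) take the new branch
      await processDocumentV2(documentId);
    } else {
      // Executions started before the deploy keep replaying the old branch
      await processDocument(documentId);
    }
  }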

5. Server-Sent Events (SSE) Limitations

Problem

The current SSE implementation on Fly.io performs adequately for MVP-level workloads but lacks scalability features such as caching, event replay, and state recovery for large-scale client connections.

Solution

  • Migrate to Supabase Realtime – Replace Fly.io SSE with Supabase Realtime, which provides built-in session handling, event persistence, and fault tolerance (a client-side sketch follows this list).
  • Decommission SSE on Fly.io – After migration, remove SSE instances from Fly.io, terminate related services, and update documentation to reflect the new architecture.
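
On the consumer side, replacing the SSE EventSource with a Supabase Realtime subscription could look roughly like this; the channel and event names are assumptions, not existing code:

  import { createClient } from "@supabase/supabase-js";

  const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

  // Subscribe to broadcast events for one workflow run (channel/event names are illustrative)
  export function subscribeToWorkflow(workflowId: string, onStatus: (payload: unknown) => void) {
    return supabase
      .channel(`workflow:${workflowId}`)
      .on("broadcast", { event: "status" }, (payload) => onStatus(payload))
      .subscribe();
  }

  // Cleanup when the client disconnects:
  // await supabase.removeChannel(channel);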

Implementation Plan & Analysis

Context: Accounting Queue Performance Analysis

Current State:
  • Throughput: ~22 documents/min (1 job = 1 document)
  • Worker concurrency: 3 documents at a time
  • DB connections: 15 (bottleneck)
  • Processing time: ~400ms per document
Target: 100x improvement → ~2,200 documents/min

Key Findings

1. Database Connections Are The Primary Bottleneck (60%)

Queue queries account for < 1% of processing time:
  • Queue operations: ~2ms per document
  • Actual accounting work: ~400ms per document (200x more time)
Note: Each job processes exactly 1 document (enforced by the @@unique([org_id, document_id]) constraint).

Connection Pooling Status:
  • Supavisor is enabled (Supabase’s connection pooler, replacement for PgBouncer)
  • DATABASE_URL uses transaction mode pooling (:6543?pgbouncer=true)
  • ⚠️ Backend pool size limited to 15 connections due to Nano compute size
  • 🎯 Supavisor can handle 1000s of clients but only opens 15 connections to Postgres
Implication: Switching to Temporal Task Queue alone won’t significantly improve performance. The Nano compute’s 15-connection backend limit must be addressed first.
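
For reference, the transaction-mode pooling described above boils down to how the connection URL is assembled; a sketch of the relevant pieces (the real logic lives in packages/database/src/direct-client.ts and may differ):

  // Sketch only – see packages/database/src/direct-client.ts for the actual implementation
  const connectionUrl = new URL(process.env.DATABASE_URL!); // Supavisor endpoint on port 6543

  // Transaction-mode pooling through Supavisor; Prisma needs pgbouncer=true in this mode
  connectionUrl.searchParams.set("pgbouncer", "true");

  // Keep Prisma's pool within the backend limit (15 on Nano; raised after the Stage 1 upgrade)
  connectionUrl.searchParams.set("connection_limit", "15");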

2. Horizontal Scaling Patterns

With Current Setup (PostgreSQL Queue):
Workers    Documents/Min    Bottleneck
1          22               Worker concurrency (3 at a time)
5          110              DB connections (15 limit)
10         150              DB connection exhaustion
20         150              ← No improvement (DB saturated)
With DB Upgrade + PostgreSQL Queue:
Workers    Documents/Min    Bottleneck
1          30               Worker concurrency
10         1,200            Queue table contention starts
20         2,000            Queue table contention increases
50         2,500            ← Scaling ceiling (queue hotspot)
With DB Upgrade + Temporal Task Queue:
Workers    Documents/Min    Bottleneck
1          30               Worker concurrency
10         1,500            None yet (linear scaling)
20         3,000            None yet (linear scaling)
50         7,000            DB CPU limit starts
100        10,000           ← DB CPU ceiling

Stage 1: Database & Caching (PRIORITY) - Week 1

Impact: 10-30x improvement | Cost: ~$146/mo | Effort: 2-3 days

This stage unlocks horizontal scaling for both queue approaches.

Actions:
  1. Upgrade Supabase Compute Size ($32-80/mo) ← CRITICAL BOTTLENECK
    • Current: Nano compute (15 connections) - This is limiting horizontal scaling!
    • Recommended: Small compute (60 connections) - $32/mo additional
    • For 100x scale: Medium compute (120 connections) - $80/mo additional
    • Go to: Project Settings → Database → Compute Size
    • After upgrade, update connection_limit in packages/database/src/direct-client.ts:
      connectionUrl.searchParams.set("connection_limit", "50"); // Was 15
      
    • ✅ Supavisor pooling is already enabled (verified via scripts/verify-supavisor.ts)
  2. Implement Redis Caching ($50/mo)
    • Cache posting matrix rules (100ms → 5ms)
    • Cache chart of accounts
    • Cache organization settings
    • Reduces processing time: 400ms → 150ms per document (2.6x faster)
  3. Optimize Worker Configuration
    // apps/temporal-workers/src/worker.ts
    maxConcurrentActivityTaskExecutions: 15, // Was 3
    maxConcurrentWorkflowTaskExecutions: 20, // Was 5
    
  4. Increase Batch Processing
    // sync-accounting-queue-workflow.ts
    batchSize: 100, // Was 50
    concurrencyLimit: 10, // Was 3
    
  5. Deploy Multiple Workers
    fly scale count 5 # Start with 5 instances
    
Expected Result: 22 → 600 documents/min (27x)

Stage 2: Horizontal Scaling Test - Week 2-3

Impact: 54x total | Cost: ~$274/mo | Effort: 1 week

Actions:
  1. Scale to 10 worker instances
  2. Add database indexes for queue fetching (see the sketch after this list)
  3. Monitor for queue table contention
  4. Implement cleanup job for old queue entries
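
A rough sketch of actions 2 and 4 above; the queue table, columns, and retention window are hypothetical since the plan does not name them:

  // Cleanup activity for old queue entries (table/column names are hypothetical)
  import { prisma } from "@packages/database"; // assumed shared Prisma client export

  const RETENTION_DAYS = 7;

  export async function cleanupAccountingQueueActivity() {
    const cutoff = new Date(Date.now() - RETENTION_DAYS * 24 * 60 * 60 * 1000);

    // Delete completed/failed entries older than the retention window
    const { count } = await prisma.accounting_queue.deleteMany({
      where: {
        status: { in: ["completed", "failed"] },
        updated_at: { lt: cutoff },
      },
    });

    return { deleted: count };
  }

  // A matching partial index keeps pending-job fetches fast as the table grows (raw SQL migration):
  // CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_accounting_queue_pending
  //   ON accounting_queue (created_at)
  //   WHERE status = 'pending';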
Expected Result: 600 → 1,200 documents/min (54x)

Cost Breakdown:
  • Supabase Medium Compute: +$80/mo (120 connections)
  • Redis caching: +$50/mo
  • Fly.io workers: Scale to 10 instances (4 CPU, 4GB): +$144/mo
    • Alternative: 5 larger instances (8 CPU, 8GB): +$144/mo (better CPU per job)
  • Total: ~$274/mo additional
Decision Point:
  • If the required throughput stays below 1,200 documents/min: STOP HERE
    • Keep PostgreSQL queue (simpler, good SQL observability)
    • Total additional cost: ~$274/mo
    • Complexity: Low
  • If hitting queue contention (need > 2,500 documents/min): Proceed to Stage 3

Stage 3: Temporal Task Queue Migration - Month 2 (ONLY IF NEEDED)

Impact: 136-227x | Cost: ~$574-1,054/mo | Effort: 3-4 weeks

When to do this:
  • ✅ Current system hitting >2,500 documents/min ceiling
  • ✅ Observing queue table contention in logs
  • ✅ Need to scale beyond 20 workers
  • ✅ Budget allows $800+/month
  • ✅ Team capacity for 3-week migration
Migration Strategy:
  1. Add Version Tracking to Documents (Maintains safety)
    model documents {
      accounting_processed_version Int?  // Track what was last processed
      accounting_ready_version     Int   // Already exists, increments on changes
    }
    
  2. Implement Optimistic Locking in Activity
    // Activity implementation (imports shown for completeness; the Prisma client path is assumed)
    import { prisma } from "@packages/database";

    export async function createFinancialEntriesActivity({
      documentId,
      organizationId,
    }) {
      try {
        return await prisma.$transaction(async (tx) => {
          // Read version at start
          const doc = await tx.documents.findUnique({
            where: { id: documentId },
            select: {
              accounting_ready_version: true,
              accounting_processed_version: true,
            },
          });

          if (!doc) {
            throw new Error(`Document ${documentId} not found`);
          }

          // Check if already processed (processed_version is nullable)
          if ((doc.accounting_processed_version ?? 0) >= doc.accounting_ready_version) {
            return { success: true, message: "Already processed" };
          }

          const currentVersion = doc.accounting_ready_version;

          // Create financial entries (arguments elided in this plan)
          await createFinancialEntries(/* ... */);

          // Verify version unchanged (optimistic lock)
          const docAfter = await tx.documents.findUnique({
            where: { id: documentId },
            select: { accounting_ready_version: true },
          });

          if (!docAfter || docAfter.accounting_ready_version !== currentVersion) {
            throw new Error("VERSION_MISMATCH"); // Roll back the transaction
          }

          // Mark as processed
          await tx.documents.update({
            where: { id: documentId },
            data: { accounting_processed_version: currentVersion },
          });

          return { success: true, data: { processedVersion: currentVersion } };
        });
      } catch (error) {
        // Surface version conflicts as a retryable result so the workflow below can back off and retry
        if (error instanceof Error && error.message === "VERSION_MISMATCH") {
          return { success: false, error: "VERSION_MISMATCH" };
        }
        throw error;
      }
    }
    
  3. Create New Temporal Workflow
    // Workflow implementation (imports shown for completeness; the activities module path is assumed)
    import * as workflow from "@temporalio/workflow";
    import type * as activityTypes from "../activities";

    const activities = workflow.proxyActivities<typeof activityTypes>({
      startToCloseTimeout: "2 minutes",
    });

    export async function processAccountingJob({ documentId, organizationId }) {
      const maxRetries = 3;
      let attempt = 0;

      while (attempt < maxRetries) {
        try {
          const result = await activities.createFinancialEntriesActivity({
            documentId,
            organizationId,
          });

          if (result.success) return;

          // Version conflict: the document changed mid-processing; back off briefly and retry
          if (result.error === "VERSION_MISMATCH") {
            await workflow.sleep("5 seconds");
            attempt++;
            continue;
          }

          throw new Error(result.error);
        } catch (error) {
          if (attempt >= maxRetries - 1) throw error;
          attempt++;
          await workflow.sleep("10 seconds");
        }
      }
    }
    
  4. Update Enqueue Logic
    export async function enqueueAccountingJob({ organizationId, documentId }) {
      const client = await getTemporalClient();
    
      await client.workflow.start("processAccountingJob", {
        taskQueue: "accounting-jobs",
        workflowId: `accounting-${documentId}`, // Deduplication
        args: [{ documentId, organizationId }],
        workflowIdReusePolicy: "ALLOW_DUPLICATE_FAILED_ONLY",
      });
    }
    
Expected Result: 1,200 → 5,000+ documents/min (227x)

Cost Breakdown:
  • Supabase Large Compute: +$220/mo (240 connections)
  • Redis caching: +$50/mo
  • Fly.io workers: 20-50 instances (4 CPU, 4GB): +$304-784/mo
    • Alternative: 10-20 larger instances (8 CPU, 8GB): +$304-608/mo
  • Total: ~$574-1,054/mo additional

Temporal Task Queue Decision Matrix

Use PostgreSQL Queue if:
  • ✅ Need < 2,500 documents/min
  • ✅ Team prefers SQL-based observability
  • ✅ Want to minimize migration complexity
  • ✅ Budget is $400-600/month
  • ✅ 10-20 workers sufficient
Migrate to Temporal Task Queue if:
  • ✅ Need > 3,000 documents/min
  • ✅ Plan to scale to 50+ workers
  • ✅ Observing queue table contention
  • ✅ Want “infinite” horizontal scaling (up to DB CPU limit)
  • ✅ Budget allows $800-1,500/month

Safety Guarantees

Both systems provide equivalent safety for version tracking:

PostgreSQL Queue (a rough sketch follows these lists):
  • Version stored in queue table
  • Checked before and after processing
  • Transaction-based atomicity
Temporal Task Queue:
  • Version stored in documents table
  • Optimistic locking in transaction
  • Retry on version mismatch
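
For the PostgreSQL queue side, the equivalent guard is a version check before and after processing, inside the same transaction that marks the queue row done. A rough sketch; the queue table and columns are hypothetical:

  import { prisma } from "@packages/database"; // assumed shared Prisma client export

  export async function processQueueEntry(queueId: string) {
    return prisma.$transaction(async (tx) => {
      const entry = await tx.accounting_queue.findUnique({ where: { id: queueId } });
      if (!entry || entry.status !== "pending") return { skipped: true };

      // Check the document version captured at enqueue time before doing any work
      const before = await tx.documents.findUnique({
        where: { id: entry.document_id },
        select: { accounting_ready_version: true },
      });
      if (!before || before.accounting_ready_version !== entry.enqueued_version) {
        await tx.accounting_queue.update({ where: { id: queueId }, data: { status: "stale" } });
        return { skipped: true };
      }

      // ... create financial entries ...

      // Re-check after processing; throwing rolls the whole transaction back
      const after = await tx.documents.findUnique({
        where: { id: entry.document_id },
        select: { accounting_ready_version: true },
      });
      if (!after || after.accounting_ready_version !== entry.enqueued_version) {
        throw new Error("VERSION_MISMATCH");
      }

      await tx.accounting_queue.update({ where: { id: queueId }, data: { status: "completed" } });
      return { success: true };
    });
  }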