
1. Concurrent Execution in Temporal Workflows

Problem

High concurrency in workflow executions causes excessive memory usage and potential resource exhaustion.

Solution

  • Vertical Scaling – Increase memory and CPU allocation on Fly.io to handle higher concurrency loads.
  • Improve Observability – Implement resource utilization tracking to identify optimal infrastructure configurations, then use those metrics to drive automated scale-up and scale-down (a minimal sampling sketch follows this list).
  • Cloud Migration Evaluation – Assess whether Fly.io provides sufficient observability and scalability. Consider migrating to a more robust cloud platform such as AWS EC2 or Google Cloud (GCP) if limitations are identified.
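
For the Improve Observability item above, a minimal sketch of in-process resource sampling that could feed scaling decisions; the helper name, interval, and console sink are placeholders, not existing code:

  // resource-metrics.ts – hypothetical helper: sample process resource usage on an interval
  export function startResourceSampling(intervalMs = 30_000) {
    return setInterval(() => {
      const mem = process.memoryUsage();
      const cpu = process.cpuUsage();
      // Swap console.log for the real metrics sink (Fly.io metrics, Prometheus, etc.)
      console.log(
        JSON.stringify({
          rssMb: Math.round(mem.rss / 1024 / 1024),
          heapUsedMb: Math.round(mem.heapUsed / 1024 / 1024),
          cpuUserMs: Math.round(cpu.user / 1000),
          cpuSystemMs: Math.round(cpu.system / 1000),
          at: new Date().toISOString(),
        }),
      );
    }, intervalMs);
  }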

2. Absence of a Dedicated Queue for Workflow Automation Tasks

Problem

Two main issues exist:
  1. Concurrent Request Overload – Without a dedicated queue processor, incoming requests can be lost or rejected when Temporal workflows reach capacity or are scaling up.
  2. Inefficient Queue Workaround – A PostgreSQL table is currently being used as a pseudo-queue for storing incoming workflow requests. This design introduces performance bottlenecks due to frequent read operations and the absence of cleanup logic for expired entries. As the table grows, Temporal workflow execution slows down when fetching pending requests.

Solution

  • Use Temporal Task Queues – Investigate whether Temporal Task Queues can handle all incoming tasks effectively.
  • Alternative: Supabase PGMQ – If Temporal queues are insufficient, adopt pgmq (available as a Supabase extension) as a lightweight, durable message queue built on Postgres.
  • Migration – Move the current PostgreSQL-based queue logic to the selected queueing system. Prefer using Temporal’s native Task Queues for tighter integration and reduced complexity.
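
If the Temporal Task Queue route is chosen, the moving parts are small: a worker polls a named task queue and producers start workflows on that queue. A minimal worker sketch, assuming the queue name and module paths shown here (they are illustrative, not existing code):

  // worker.ts – polls the task queue and executes workflow/activity tasks
  import { Worker } from "@temporalio/worker";
  import * as activities from "./activities";

  async function run() {
    const worker = await Worker.create({
      workflowsPath: require.resolve("./workflows"),
      activities,
      taskQueue: "workflow-automation", // illustrative queue name
      maxConcurrentActivityTaskExecutions: 10,
    });
    await worker.run();
  }

  run().catch((err) => {
    console.error(err);
    process.exit(1);
  });

Producers then call client.workflow.start(...) with the same taskQueue, as shown in the Stage 3 enqueue example later in this document.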

3. Lack of Caching

Problem

Temporal workflows frequently query PostgreSQL to retrieve workflow-related data, resulting in latency due to complex and repeated read operations.

Solution

  • Introduce In-Memory Caching – Implement Redis as a caching layer integrated with Supabase PostgreSQL. Redis will store frequently accessed data in memory, significantly reducing query latency and improving workflow throughput.
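
A minimal cache-aside sketch for the frequently accessed data mentioned above, using ioredis; the key names, TTL, and loader are assumptions, not existing code:

  import Redis from "ioredis";

  const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

  // Cache-aside helper: try Redis first, fall back to Postgres, then populate the cache
  export async function cached<T>(key: string, ttlSeconds: number, load: () => Promise<T>): Promise<T> {
    const hit = await redis.get(key);
    if (hit) return JSON.parse(hit) as T;

    const value = await load();
    await redis.set(key, JSON.stringify(value), "EX", ttlSeconds);
    return value;
  }

  // Example (hypothetical table): posting matrix rules change rarely, so a short TTL is a safe start
  // const rules = await cached(`posting-matrix:${organizationId}`, 600, () =>
  //   prisma.posting_matrix_rules.findMany({ where: { org_id: organizationId } }),
  // );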

4. Temporal Workflow Updates

Problem

Updating Temporal workflows may cause ongoing executions to hang or fail during deployment.

Solution

  • Consult Temporal Team – Engage with Temporal support to identify update-safe mechanisms or best practices (one such mechanism is sketched after this list).
  • Container Replication Strategy – If the issue originates from Fly.io deployment behavior, ensure that running containers can be replicated during updates to maintain continuity of workflow execution.
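
One update-safe mechanism Temporal documents for this situation is the patched()/deprecatePatch() API, which lets new worker code branch deterministically so in-flight executions keep replaying the old path. A minimal sketch; the workflow, activity names, and module path are illustrative:

  import { patched, proxyActivities } from "@temporalio/workflow";
  import type * as activities from "../activities"; // assumed path

  const { processDocument, processDocumentV2 } = proxyActivities<typeof activities>({
    startToCloseTimeout: "1 minute",
  });

  export async function exampleWorkflow(documentId: string) {
    if (patched("use-process-document-v2")) {
      // New executions (and replays that recorded this patch marker) take the new branch
      await processDocumentV2(documentId);
    } else {
      // Executions started before the deploy keep replaying the old branch
      await processDocument(documentId);
    }
  }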

5. Server-Sent Events (SSE) Limitations

Problem

The current SSE implementation on Fly.io performs adequately for MVP-level workloads but lacks scalability features such as caching, event replay, and state recovery for large-scale client connections.

Solution

  • Migrate to Supabase Realtime – Replace Fly.io SSE with Supabase Realtime, which provides built-in session handling, event persistence, and fault tolerance (a client-side sketch follows this list).
  • Decommission SSE on Fly.io – After migration, remove SSE instances from Fly.io, terminate related services, and update documentation to reflect the new architecture.
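
On the consumer side, replacing the SSE EventSource with a Supabase Realtime subscription could look roughly like this; the channel and event names are assumptions, not existing code:

  import { createClient } from "@supabase/supabase-js";

  const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

  // Subscribe to broadcast events for one workflow run (channel/event names are illustrative)
  export function subscribeToWorkflow(workflowId: string, onStatus: (payload: unknown) => void) {
    return supabase
      .channel(`workflow:${workflowId}`)
      .on("broadcast", { event: "status" }, (payload) => onStatus(payload))
      .subscribe();
  }

  // Cleanup when the client disconnects:
  // await supabase.removeChannel(channel);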

Implementation Plan & Analysis

Context: Accounting Queue Performance Analysis

Current State:
  • Throughput: ~22 documents/min (1 job = 1 document)
  • Worker concurrency: 3 documents at a time
  • DB connections: 15 (bottleneck)
  • Processing time: ~400ms per document
Target: 100x improvement → ~2,200 documents/min

Key Findings

1. Database Connections Are The Primary Bottleneck (60%)

Queue queries account for < 1% of processing time:
  • Queue operations: ~2ms per document
  • Actual accounting work: ~400ms per document (200x more time)
Note: Each job processes exactly 1 document (enforced by the @@unique([org_id, document_id]) constraint).

Connection Pooling Status:
  • Supavisor is enabled (Supabase’s connection pooler, replacement for PgBouncer)
  • DATABASE_URL uses transaction mode pooling (:6543?pgbouncer=true)
  • ⚠️ Backend pool size limited to 15 connections due to Nano compute size
  • 🎯 Supavisor can handle 1000s of clients but only opens 15 connections to Postgres
Implication: Switching to Temporal Task Queue alone won’t significantly improve performance. The Nano compute’s 15-connection backend limit must be addressed first.
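
For reference, the transaction-mode pooling described above boils down to how the connection URL is assembled; a sketch of the relevant pieces (the real logic lives in packages/database/src/direct-client.ts and may differ):

  // Sketch only – see packages/database/src/direct-client.ts for the actual implementation
  const connectionUrl = new URL(process.env.DATABASE_URL!); // Supavisor endpoint on port 6543

  // Transaction-mode pooling through Supavisor; Prisma needs pgbouncer=true in this mode
  connectionUrl.searchParams.set("pgbouncer", "true");

  // Keep Prisma's pool within the backend limit (15 on Nano; raised after the Stage 1 upgrade)
  connectionUrl.searchParams.set("connection_limit", "15");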

2. Horizontal Scaling Patterns

With Current Setup (PostgreSQL Queue):
Workers    Documents/Min    Bottleneck
1          22               Worker concurrency (3 at a time)
5          110              DB connections (15 limit)
10         150              DB connection exhaustion
20         150              ← No improvement (DB saturated)
With DB Upgrade + PostgreSQL Queue:
Workers    Documents/Min    Bottleneck
1          30               Worker concurrency
10         1,200            Queue table contention starts
20         2,000            Queue table contention increases
50         2,500            ← Scaling ceiling (queue hotspot)
With DB Upgrade + Temporal Task Queue:
Workers    Documents/Min    Bottleneck
1          30               Worker concurrency
10         1,500            None yet (linear scaling)
20         3,000            None yet (linear scaling)
50         7,000            DB CPU limit starts
100        10,000           ← DB CPU ceiling

Stage 1: Database & Caching (PRIORITY) - Week 1

Impact: 10-30x improvement | Cost: ~$146/mo | Effort: 2-3 days

This stage unlocks horizontal scaling for both queue approaches.

Actions:
  1. Upgrade Supabase Compute Size ($32-80/mo) ← CRITICAL BOTTLENECK
    • Current: Nano compute (15 connections) - This is limiting horizontal scaling!
    • Recommended: Small compute (60 connections) - $32/mo additional
    • For 100x scale: Medium compute (120 connections) - $80/mo additional
    • Go to: Project Settings → Database → Compute Size
    • After upgrade, update connection_limit in packages/database/src/direct-client.ts:
      connectionUrl.searchParams.set("connection_limit", "50"); // Was 15
      
    • ✅ Supavisor pooling is already enabled (verified via scripts/verify-supavisor.ts)
  2. Implement Redis Caching ($50/mo)
    • Cache posting matrix rules (100ms → 5ms)
    • Cache chart of accounts
    • Cache organization settings
    • Reduces processing time: 400ms → 150ms per document (2.6x faster)
  3. Optimize Worker Configuration
    // apps/temporal-workers/src/worker.ts
    maxConcurrentActivityTaskExecutions: 15, // Was 3
    maxConcurrentWorkflowTaskExecutions: 20, // Was 5
    
  4. Increase Batch Processing
    // sync-accounting-queue-workflow.ts
    batchSize: 100, // Was 50
    concurrencyLimit: 10, // Was 3
    
  5. Deploy Multiple Workers
    fly scale count 5 # Start with 5 instances
    
Expected Result: 22 → 600 documents/min (27x)

Stage 2: Horizontal Scaling Test - Week 2-3

Impact: 54x total | Cost: ~$274/mo | Effort: 1 week

Actions:
  1. Scale to 10 worker instances
  2. Add database indexes for queue fetching (see the sketch after this list)
  3. Monitor for queue table contention
  4. Implement cleanup job for old queue entries
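
A rough sketch of actions 2 and 4 above; the queue table, columns, and retention window are hypothetical since the plan does not name them:

  // Cleanup activity for old queue entries (table/column names are hypothetical)
  import { prisma } from "@packages/database"; // assumed shared Prisma client export

  const RETENTION_DAYS = 7;

  export async function cleanupAccountingQueueActivity() {
    const cutoff = new Date(Date.now() - RETENTION_DAYS * 24 * 60 * 60 * 1000);

    // Delete completed/failed entries older than the retention window
    const { count } = await prisma.accounting_queue.deleteMany({
      where: {
        status: { in: ["completed", "failed"] },
        updated_at: { lt: cutoff },
      },
    });

    return { deleted: count };
  }

  // A matching partial index keeps pending-job fetches fast as the table grows (raw SQL migration):
  // CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_accounting_queue_pending
  //   ON accounting_queue (created_at)
  //   WHERE status = 'pending';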
Expected Result: 600 → 1,200 documents/min (54x)

Cost Breakdown:
  • Supabase Medium Compute: +$80/mo (120 connections)
  • Redis caching: +$50/mo
  • Fly.io workers: Scale to 10 instances (4 CPU, 4GB): +$144/mo
    • Alternative: 5 larger instances (8 CPU, 8GB): +$144/mo (better CPU per job)
  • Total: ~$274/mo additional
Decision Point:
  • If the required throughput stays below 1,200 documents/min: STOP HERE
    • Keep PostgreSQL queue (simpler, good SQL observability)
    • Total additional cost: ~$274/mo
    • Complexity: Low
  • If hitting queue contention (need > 2,500 documents/min): Proceed to Stage 3

Stage 3: Temporal Task Queue Migration - Month 2 (ONLY IF NEEDED)

Impact: 136-227x | Cost: ~$574-1,054/mo | Effort: 3-4 weeks

When to do this:
  • ✅ Current system hitting >2,500 documents/min ceiling
  • ✅ Observing queue table contention in logs
  • ✅ Need to scale beyond 20 workers
  • ✅ Budget allows $800+/month
  • ✅ Team capacity for 3-week migration
Migration Strategy:
  1. Add Version Tracking to Documents (Maintains safety)
    model documents {
      accounting_processed_version Int?  // Track what was last processed
      accounting_ready_version     Int   // Already exists, increments on changes
    }
    
  2. Implement Optimistic Locking in Activity
    // Activity implementation (imports shown for completeness; the Prisma client path is assumed)
    import { prisma } from "@packages/database";

    export async function createFinancialEntriesActivity({
      documentId,
      organizationId,
    }) {
      try {
        return await prisma.$transaction(async (tx) => {
          // Read version at start
          const doc = await tx.documents.findUnique({
            where: { id: documentId },
            select: {
              accounting_ready_version: true,
              accounting_processed_version: true,
            },
          });

          if (!doc) {
            throw new Error(`Document ${documentId} not found`);
          }

          // Check if already processed (processed_version is nullable)
          if ((doc.accounting_processed_version ?? 0) >= doc.accounting_ready_version) {
            return { success: true, message: "Already processed" };
          }

          const currentVersion = doc.accounting_ready_version;

          // Create financial entries (arguments elided in this plan)
          await createFinancialEntries(/* ... */);

          // Verify version unchanged (optimistic lock)
          const docAfter = await tx.documents.findUnique({
            where: { id: documentId },
            select: { accounting_ready_version: true },
          });

          if (!docAfter || docAfter.accounting_ready_version !== currentVersion) {
            throw new Error("VERSION_MISMATCH"); // Roll back the transaction
          }

          // Mark as processed
          await tx.documents.update({
            where: { id: documentId },
            data: { accounting_processed_version: currentVersion },
          });

          return { success: true, data: { processedVersion: currentVersion } };
        });
      } catch (error) {
        // Surface version conflicts as a retryable result so the workflow below can back off and retry
        if (error instanceof Error && error.message === "VERSION_MISMATCH") {
          return { success: false, error: "VERSION_MISMATCH" };
        }
        throw error;
      }
    }
    
  3. Create New Temporal Workflow
    // Workflow implementation (imports shown for completeness; the activities module path is assumed)
    import * as workflow from "@temporalio/workflow";
    import type * as activityTypes from "../activities";

    const activities = workflow.proxyActivities<typeof activityTypes>({
      startToCloseTimeout: "2 minutes",
    });

    export async function processAccountingJob({ documentId, organizationId }) {
      const maxRetries = 3;
      let attempt = 0;

      while (attempt < maxRetries) {
        try {
          const result = await activities.createFinancialEntriesActivity({
            documentId,
            organizationId,
          });

          if (result.success) return;

          // Version conflict: the document changed mid-processing; back off briefly and retry
          if (result.error === "VERSION_MISMATCH") {
            await workflow.sleep("5 seconds");
            attempt++;
            continue;
          }

          throw new Error(result.error);
        } catch (error) {
          if (attempt >= maxRetries - 1) throw error;
          attempt++;
          await workflow.sleep("10 seconds");
        }
      }
    }
    
  4. Update Enqueue Logic
    export async function enqueueAccountingJob({ organizationId, documentId }) {
      const client = await getTemporalClient();
    
      await client.workflow.start("processAccountingJob", {
        taskQueue: "accounting-jobs",
        workflowId: `accounting-${documentId}`, // Deduplication
        args: [{ documentId, organizationId }],
        workflowIdReusePolicy: "ALLOW_DUPLICATE_FAILED_ONLY",
      });
    }
    
Expected Result: 1,200 → 5,000+ documents/min (227x)

Cost Breakdown:
  • Supabase Large Compute: +$220/mo (240 connections)
  • Redis caching: +$50/mo
  • Fly.io workers: 20-50 instances (4 CPU, 4GB): +$304-784/mo
    • Alternative: 10-20 larger instances (8 CPU, 8GB): +$304-608/mo
  • Total: ~$574-1,054/mo additional

Temporal Task Queue Decision Matrix

Use PostgreSQL Queue if:
  • ✅ Need < 2,500 documents/min
  • ✅ Team prefers SQL-based observability
  • ✅ Want to minimize migration complexity
  • ✅ Budget is $400-600/month
  • ✅ 10-20 workers sufficient
Migrate to Temporal Task Queue if:
  • ✅ Need > 3,000 documents/min
  • ✅ Plan to scale to 50+ workers
  • ✅ Observing queue table contention
  • ✅ Want “infinite” horizontal scaling (up to DB CPU limit)
  • ✅ Budget allows $800-1,500/month

Safety Guarantees

Both systems provide equivalent safety for version tracking:

PostgreSQL Queue (a rough sketch follows these lists):
  • Version stored in queue table
  • Checked before and after processing
  • Transaction-based atomicity
Temporal Task Queue:
  • Version stored in documents table
  • Optimistic locking in transaction
  • Retry on version mismatch
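
For the PostgreSQL queue side, the equivalent guard is a version check before and after processing, inside the same transaction that marks the queue row done. A rough sketch; the queue table and columns are hypothetical:

  import { prisma } from "@packages/database"; // assumed shared Prisma client export

  export async function processQueueEntry(queueId: string) {
    return prisma.$transaction(async (tx) => {
      const entry = await tx.accounting_queue.findUnique({ where: { id: queueId } });
      if (!entry || entry.status !== "pending") return { skipped: true };

      // Check the document version captured at enqueue time before doing any work
      const before = await tx.documents.findUnique({
        where: { id: entry.document_id },
        select: { accounting_ready_version: true },
      });
      if (!before || before.accounting_ready_version !== entry.enqueued_version) {
        await tx.accounting_queue.update({ where: { id: queueId }, data: { status: "stale" } });
        return { skipped: true };
      }

      // ... create financial entries ...

      // Re-check after processing; throwing rolls the whole transaction back
      const after = await tx.documents.findUnique({
        where: { id: entry.document_id },
        select: { accounting_ready_version: true },
      });
      if (!after || after.accounting_ready_version !== entry.enqueued_version) {
        throw new Error("VERSION_MISMATCH");
      }

      await tx.accounting_queue.update({ where: { id: queueId }, data: { status: "completed" } });
      return { success: true };
    });
  }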