1. Concurrent Execution in Temporal Workflows
Problem
High concurrency in workflow executions causes excessive memory usage and potential resource exhaustion.
Solution
- Vertical Scaling – Increase memory and CPU allocation on Fly.io to handle higher concurrency loads.
- Improve Observability – Implement resource utilization tracking to identify optimal infrastructure configurations, then introduce automated scaling up or down based on those metrics (see the sketch after this list).
- Cloud Migration Evaluation – Assess whether Fly.io provides sufficient observability and scalability. Consider migrating to a more robust cloud platform such as AWS EC2 or Google Cloud (GCP) if limitations are identified.
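As a starting point for resource utilization tracking, here is a minimal sketch of an in-process monitor for a Node.js worker; the module name, sampling interval, and metric names are illustrative assumptions, and the samples would be shipped to whatever metrics backend drives the scaling decision.

```ts
// resource-monitor.ts (hypothetical) — periodically samples memory and CPU so
// scaling decisions can be based on measured utilization rather than guesswork.
const SAMPLE_INTERVAL_MS = 30_000;

export function startResourceMonitor(report: (sample: Record<string, number>) => void): void {
  let lastCpu = process.cpuUsage();

  setInterval(() => {
    const mem = process.memoryUsage();
    const cpu = process.cpuUsage(lastCpu); // CPU time used since the previous sample
    lastCpu = process.cpuUsage();

    report({
      rssMb: Math.round(mem.rss / 1024 / 1024),
      heapUsedMb: Math.round(mem.heapUsed / 1024 / 1024),
      // CPU time over the interval, expressed as a fraction of one core
      cpuUtilization: (cpu.user + cpu.system) / 1000 / SAMPLE_INTERVAL_MS,
    });
  }, SAMPLE_INTERVAL_MS).unref();
}

// Example usage: emit structured logs that the metrics/alerting pipeline can pick up.
startResourceMonitor((sample) =>
  console.log(JSON.stringify({ metric: 'worker_resources', ...sample }))
);
```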
2. Absence of a Dedicated Queue for Workflow Automation Tasks
Problem
Two main issues exist:
- Concurrent Request Overload – Without a queue processor, incoming requests are lost or rejected when Temporal workflows reach capacity or are scaling up.
- Inefficient Queue Workaround – A PostgreSQL table is currently being used as a pseudo-queue for storing incoming workflow requests. This design introduces performance bottlenecks due to frequent read operations and the absence of cleanup logic for expired entries. As the table grows, Temporal workflow execution slows down when fetching pending requests.
Solution
- Use Temporal Task Queues – Investigate whether Temporal Task Queues can handle all incoming tasks effectively (see the sketch after this list).
- Alternative: Supabase PGMQ – If Temporal queues are insufficient, adopt pgmq in Supabase as a lightweight, durable message queue built on Postgres.
- Migration – Move the current PostgreSQL-based queue logic to the selected queueing system. Prefer using Temporal’s native Task Queues for tighter integration and reduced complexity.
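To inform that investigation, here is a minimal sketch of routing work through a Temporal Task Queue with the TypeScript SDK; the queue name, workflow name, and activity are hypothetical placeholders rather than the existing codebase.

```ts
// Producer + worker wiring for a Temporal Task Queue (names are illustrative).
import { Connection, Client } from '@temporalio/client';
import { Worker } from '@temporalio/worker';

const TASK_QUEUE = 'workflow-automation'; // hypothetical task queue name

// Worker process: polls the task queue and executes workflows/activities.
// Temporal buffers tasks durably, so requests are not dropped while workers scale.
export async function runWorker(): Promise<void> {
  const worker = await Worker.create({
    taskQueue: TASK_QUEUE,
    workflowsPath: require.resolve('./workflows'), // hypothetical workflow module
    activities: {
      async processDocument(documentId: string): Promise<string> {
        // ... actual accounting work would go here ...
        return documentId;
      },
    },
  });
  await worker.run();
}

// Producer: instead of inserting into a pseudo-queue table, start a workflow.
export async function enqueue(documentId: string): Promise<void> {
  const connection = await Connection.connect(); // defaults to localhost:7233
  const client = new Client({ connection });
  await client.workflow.start('processDocumentWorkflow', {
    taskQueue: TASK_QUEUE,
    workflowId: `doc-${documentId}`, // duplicate starts with the same ID are rejected
    args: [documentId],
  });
}
```

The key property relative to the current design is that Temporal persists queued tasks until a worker is available, which directly addresses the "requests lost while scaling up" failure mode.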
3. Lack of Caching
Problem
Temporal workflows frequently query PostgreSQL to retrieve workflow-related data, resulting in latency due to complex and repeated read operations.
Solution
- Introduce In-Memory Caching – Implement Redis as a caching layer integrated with Supabase PostgreSQL. Redis will store frequently accessed data in memory, significantly reducing query latency and improving workflow throughput.
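A minimal cache-aside sketch of the pattern described above, assuming ioredis as the client; the key naming, TTL, and the `getPostingMatrix` wrapper are illustrative, and the actual loader would be the existing Supabase/PostgreSQL query.

```ts
// Cache-aside read: try Redis first, fall back to Postgres, then populate the cache.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL!);
const TTL_SECONDS = 300; // tolerate slightly stale reference data for 5 minutes

export async function getPostingMatrix(
  orgId: string,
  loadFromDb: (orgId: string) => Promise<unknown> // existing PostgreSQL query
): Promise<unknown> {
  const key = `posting-matrix:${orgId}`;

  const cached = await redis.get(key);
  if (cached !== null) return JSON.parse(cached); // hit: no Postgres round trip

  const fresh = await loadFromDb(orgId); // miss: one Postgres round trip
  await redis.set(key, JSON.stringify(fresh), 'EX', TTL_SECONDS);
  return fresh;
}
```

The same wrapper applies to the chart of accounts and organization settings; explicit invalidation (deleting the key on write) would be needed wherever staleness is unacceptable.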
4. Temporal Workflow Updates
Problem
Updating Temporal workflows may cause ongoing executions to hang or fail during deployment.
Solution
- Consult Temporal Team – Engage with Temporal support to identify update-safe mechanisms or best practices, such as workflow patching/versioning (see the sketch after this list).
- Container Replication Strategy – If the issue originates from Fly.io deployment behavior, ensure that running containers can be replicated during updates to maintain continuity of workflow execution.
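Temporal's documented mechanism for changing workflow code without breaking in-flight executions is patching (workflow versioning); here is a minimal sketch using the TypeScript SDK's `patched` API, with the workflow and patch names as illustrative placeholders.

```ts
// Inside workflow code: gate changed logic behind a patch marker so that
// histories recorded by the old code still replay deterministically.
import { patched } from '@temporalio/workflow';

export async function processDocumentWorkflow(documentId: string): Promise<void> {
  if (patched('new-posting-logic')) {
    // Taken by executions that reach this point on the newly deployed worker.
    // await newPostingActivity(documentId); // hypothetical activity
  } else {
    // Preserved path for executions replaying histories from the old worker.
    // await legacyPostingActivity(documentId); // hypothetical activity
  }
}
```

Once no open executions depend on the old branch, the check can be replaced with `deprecatePatch('new-posting-logic')` and the dead branch removed.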
5. Server-Sent Events (SSE) Limitations
Problem
The current SSE implementation on Fly.io performs adequately for MVP-level workloads but lacks scalability features such as caching, event replay, and state recovery for large-scale client connections.
Solution
- Migrate to Supabase Realtime – Replace Fly.io SSE with Supabase Realtime, which provides built-in session handling, event persistence, and fault tolerance (see the sketch after this list).
- Decommission SSE on Fly.io – After migration, remove SSE instances from Fly.io, terminate related services, and update documentation to reflect the new architecture.
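For reference, a minimal sketch of what the client-side replacement could look like with supabase-js Realtime; the channel name and the `workflow_events` table are hypothetical, not the real schema.

```ts
// Subscribe to new workflow events via Supabase Realtime instead of an SSE stream.
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

const channel = supabase
  .channel('workflow-events') // illustrative channel name
  .on(
    'postgres_changes',
    { event: 'INSERT', schema: 'public', table: 'workflow_events' }, // hypothetical table
    (payload) => {
      // Equivalent of the old SSE onmessage handler
      console.log('workflow event', payload.new);
    }
  )
  .subscribe();

// On teardown: await supabase.removeChannel(channel);
```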
Implementation Plan & Analysis
Context: Accounting Queue Performance Analysis
Current State:
- Throughput: ~22 documents/min (1 job = 1 document)
- Worker concurrency: 3 documents at a time
- DB connections: 15 (bottleneck)
- Processing time: ~400ms per document
Key Findings
1. Database Connections Are The Primary Bottleneck (60%)
Queue queries account for < 1% of processing time:
- Queue operations: ~2ms per document
- Actual accounting work: ~400ms per document (200x more time)
- Duplicate processing of the same document is prevented by the `@@unique([org_id, document_id])` constraint.
Connection Pooling Status:
- ✅ Supavisor is enabled (Supabase’s connection pooler, replacement for PgBouncer)
- ✅ `DATABASE_URL` uses transaction mode pooling (`:6543?pgbouncer=true`)
- ⚠️ Backend pool size limited to 15 connections due to Nano compute size
- 🎯 Supavisor can handle 1000s of clients but only opens 15 connections to Postgres
2. Horizontal Scaling Patterns
With Current Setup (PostgreSQL Queue):
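One pattern that matters when several workers share a single PostgreSQL queue is how rows are claimed; below is a minimal sketch of a contention-safe fetch using `FOR UPDATE SKIP LOCKED`, with `workflow_queue` and its columns as hypothetical names rather than the actual schema.

```ts
// Claim the next pending job without blocking other workers: SKIP LOCKED makes
// each worker skip rows that another worker has already locked in its transaction.
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

export async function claimNextJob(): Promise<{ id: string; document_id: string } | null> {
  const rows = await prisma.$queryRaw<{ id: string; document_id: string }[]>`
    UPDATE workflow_queue
    SET status = 'processing', claimed_at = now()
    WHERE id = (
      SELECT id FROM workflow_queue
      WHERE status = 'pending'
      ORDER BY created_at
      FOR UPDATE SKIP LOCKED
      LIMIT 1
    )
    RETURNING id, document_id
  `;
  return rows[0] ?? null;
}
```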
Recommended Staged Approach
Stage 1: Database & Caching (PRIORITY) - Week 1
Impact: 10-30x improvement | Cost: ~$146/mo | Effort: 2-3 days
This stage unlocks horizontal scaling for both queue approaches.
Actions:
- Upgrade Supabase Compute Size ($32-80/mo) ← CRITICAL BOTTLENECK
  - Current: Nano compute (15 connections) - This is limiting horizontal scaling!
  - Recommended: Small compute (60 connections) - $32/mo additional
  - For 100x scale: Medium compute (120 connections) - $80/mo additional
  - Go to: Project Settings → Database → Compute Size
  - After upgrade, update `connection_limit` in `packages/database/src/direct-client.ts` (see the sketch after this list)
  - ✅ Supavisor pooling is already enabled (verified via `scripts/verify-supavisor.ts`)
- Implement Redis Caching ($50/mo)
  - Cache posting matrix rules (100ms → 5ms)
  - Cache chart of accounts
  - Cache organization settings
  - Reduces processing time: 400ms → 150ms per document (2.6x faster)
- Optimize Worker Configuration
- Increase Batch Processing
- Deploy Multiple Workers
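A sketch of what the `connection_limit` change could look like; the contents of `packages/database/src/direct-client.ts` are an assumption, and only the `pgbouncer`/`connection_limit` URL parameters are the actual Prisma knobs being referenced.

```ts
// packages/database/src/direct-client.ts (sketch) — cap per-process Postgres
// connections so that (workers × connection_limit) stays under the Supabase
// compute tier's pool size, e.g. 10 workers × 5 = 50 of 60 available connections.
import { PrismaClient } from '@prisma/client';

const CONNECTION_LIMIT = 5; // tune after the compute upgrade

const url = new URL(process.env.DATABASE_URL!); // Supavisor transaction-mode URL (:6543)
url.searchParams.set('pgbouncer', 'true');
url.searchParams.set('connection_limit', String(CONNECTION_LIMIT));

export const prisma = new PrismaClient({
  datasources: { db: { url: url.toString() } }, // datasource name 'db' is assumed
});
```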
Stage 2: Horizontal Scaling Test - Week 2-3
Impact: 54x total | Cost: ~$274/mo | Effort: 1 week
Actions:
- Scale to 10 worker instances
- Add database indexes for queue fetching
- Monitor for queue table contention
- Implement cleanup job for old queue entries (see the sketch at the end of this stage)
Cost breakdown:
- Supabase Medium Compute: +$80/mo (120 connections)
- Redis caching: +$50/mo
- Fly.io workers: Scale to 10 instances (4 CPU, 4GB): +$144/mo
- Alternative: 5 larger instances (8 CPU, 8GB): +$144/mo (better CPU per job)
- Total: ~$274/mo additional
Decision point:
- If throughput is sufficient (< 1,200 documents/min): STOP HERE
  - Keep PostgreSQL queue (simpler, good SQL observability)
  - Total additional cost: ~$274/mo
  - Complexity: Low
- If hitting queue contention (need > 2,500 documents/min): Proceed to Stage 3
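A sketch of the cleanup job mentioned in the Stage 2 actions; the `workflow_queue` table, its columns, and the retention window are hypothetical placeholders.

```ts
// Purge old completed/failed rows so fetching pending work stays fast as volume grows.
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();
const RETENTION_DAYS = 7; // keep a week of history for debugging, then delete

export async function cleanupQueue(): Promise<number> {
  const cutoff = new Date(Date.now() - RETENTION_DAYS * 24 * 60 * 60 * 1000);
  const deleted = await prisma.$executeRaw`
    DELETE FROM workflow_queue
    WHERE status IN ('completed', 'failed')
      AND updated_at < ${cutoff}
  `;
  return deleted; // number of rows removed
}

// Run on a schedule (e.g. a Temporal cron workflow or a scheduled Fly.io machine),
// not on the hot path of document processing.
```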
Stage 3: Temporal Task Queue Migration - Month 2 (ONLY IF NEEDED)
Impact: 136-227x | Cost: ~$574-1,054/mo | Effort: 3-4 weeks
When to do this:
- ✅ Current system hitting > 2,500 documents/min ceiling
- ✅ Observing queue table contention in logs
- ✅ Need to scale beyond 20 workers
- ✅ Budget allows $800+/month
- ✅ Team capacity for 3-week migration
Actions:
- Add Version Tracking to Documents (maintains safety)
- Implement Optimistic Locking in Activity (see the sketch at the end of this stage)
- Create New Temporal Workflow
- Update Enqueue Logic
Cost breakdown:
- Supabase Large Compute: +$220/mo (240 connections)
- Redis caching: +$50/mo
- Fly.io workers: 20-50 instances (4 CPU, 4GB): +$304-784/mo
- Alternative: 10-20 larger instances (8 CPU, 8GB): +$304-608/mo
- Total: ~$574-1,054/mo additional
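For the "Implement Optimistic Locking in Activity" step, a minimal sketch of how the version check could look inside a Temporal activity using Prisma; the `documents` table, its `version`/`status` columns, and the posting logic are illustrative assumptions.

```ts
// Temporal activity with optimistic locking: read the version, do the work, then
// write conditionally; a failed condition means another worker got there first.
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

export async function postDocumentActivity(documentId: string): Promise<void> {
  // Capture the version before doing any work.
  const docs = await prisma.$queryRaw<{ version: number }[]>`
    SELECT version FROM documents WHERE id = ${documentId}
  `;
  if (docs.length === 0) return; // document no longer exists; nothing to post
  const expectedVersion = docs[0].version;

  // ... compute journal entries for the document (omitted) ...

  // Conditional write: only succeeds if the document was not changed meanwhile.
  const updated = await prisma.$executeRaw`
    UPDATE documents
    SET status = 'posted', version = version + 1
    WHERE id = ${documentId} AND version = ${expectedVersion}
  `;

  if (updated === 0) {
    // Lost the race: throw so the activity's retry policy re-runs it, which
    // re-reads the fresh version on the next attempt.
    throw new Error(`Version conflict for document ${documentId}`);
  }
}
```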
Temporal Task Queue Decision Matrix
Use PostgreSQL Queue if:
- ✅ Need < 2,500 documents/min
- ✅ Team prefers SQL-based observability
- ✅ Want to minimize migration complexity
- ✅ Budget is $400-600/month
- ✅ 10-20 workers sufficient
Use Temporal Task Queues if:
- ✅ Need > 3,000 documents/min
- ✅ Plan to scale to 50+ workers
- ✅ Observing queue table contention
- ✅ Want “infinite” horizontal scaling (up to DB CPU limit)
- ✅ Budget allows $800-1,500/month
Safety Guarantees
Both systems provide equivalent safety for version tracking:
PostgreSQL Queue:
- Version stored in queue table
- Checked before and after processing
- Transaction-based atomicity
Temporal Task Queue:
- Version stored in documents table
- Optimistic locking in transaction
- Retry on version mismatch
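To make the PostgreSQL-queue variant concrete, a sketch of the before/after version check inside a single transaction; `workflow_queue`, `documents`, and their columns are hypothetical names.

```ts
// Complete a job only if the document still matches the version captured at
// enqueue time; the result write and the queue update commit atomically.
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

export async function completeJob(jobId: string): Promise<void> {
  await prisma.$transaction(async (tx) => {
    const [job] = await tx.$queryRaw<{ document_id: string; document_version: number }[]>`
      SELECT document_id, document_version FROM workflow_queue WHERE id = ${jobId} FOR UPDATE
    `;
    if (!job) return; // job already cleaned up

    // Re-check: has the document changed since the job was enqueued?
    const [doc] = await tx.$queryRaw<{ version: number }[]>`
      SELECT version FROM documents WHERE id = ${job.document_id}
    `;
    if (!doc || doc.version !== job.document_version) {
      // Stale job: flag it so it can be re-enqueued against the current version.
      await tx.$executeRaw`UPDATE workflow_queue SET status = 'stale' WHERE id = ${jobId}`;
      return;
    }

    await tx.$executeRaw`UPDATE documents SET status = 'posted' WHERE id = ${job.document_id}`;
    await tx.$executeRaw`UPDATE workflow_queue SET status = 'completed' WHERE id = ${jobId}`;
  });
}
```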