optima/CACHING.md
HoloPanio 6d935e7180 feat: Redis opportunity cache, CW API retry/logging, adaptive TTLs
- Add Redis-backed opportunity cache with background refresh (30s interval)
- Fix concurrency bug: use lazy thunks instead of eager promises for batching
- Add withCwRetry utility with exponential backoff for transient CW errors
- Add adaptive TTL algorithms (primary, sub-resource, products) based on opportunity activity
- Add include query param on GET /sales/opportunities/:id (notes,contacts,products)
- Add opt-in CW API logger (LOG_CW_API env var) with timestamped files in cw-api-logs/
- Add debug-scripts/analyze-cw-calls.py for API call analysis
- Add computeSubResourceCacheTTL and computeProductsCacheTTL algorithms with tests
- Increase CW API timeout from 15s to 30s
- Unblock cache refresh from startup chain (remove await)
- Prioritize recently updated opportunities in refresh cycle
- Add CACHING.md documentation
- Update API_ROUTES.md with caching details and include param
- Update copilot instructions to require CACHING.md sync
- Add dev:log script for CW API call logging during development
2026-03-02 23:23:24 -06:00


# Caching Architecture

This document describes the caching layer used in the Optima API, covering the Redis-backed opportunity cache, TTL algorithms, background refresh mechanics, retry logic, and debugging tools.


## Overview

The API caches expensive ConnectWise (CW) API responses in Redis to reduce latency and avoid CW rate limits. The primary cache layer is the opportunity cache (`src/modules/cache/opportunityCache.ts`), which proactively warms data for all non-closed opportunities on a background interval.

### Key design principles

- **Adaptive TTLs** — cache durations are computed dynamically based on how "hot" an opportunity is (recently updated = shorter TTL = fresher data).
- **Background refresh** — a 30-second interval scans all open opportunities and re-fetches only expired cache keys.
- **Bounded concurrency** — CW API calls are throttled via thunk-based batching to prevent overwhelming the upstream API.
- **Graceful degradation** — transient CW errors (timeouts, network failures) are caught, logged, and retried on the next cycle rather than crashing the process.
- **Priority ordering** — most recently updated opportunities are refreshed first so active deals get fresh data before stale ones.

## What is cached

Each non-closed opportunity can have up to 7 cached payloads in Redis:

| Cache Key Pattern | Data | Source |
|---|---|---|
| `opp:cw-data:{cwOpportunityId}` | Raw CW opportunity response | `GET /sales/opportunities/:id` |
| `opp:activities:{cwOpportunityId}` | CW activities array | `GET /sales/activities?conditions=opportunity/id=:id` |
| `opp:notes:{cwOpportunityId}` | CW notes array | `GET /sales/opportunities/:id/notes` |
| `opp:contacts:{cwOpportunityId}` | CW contacts array | `GET /sales/opportunities/:id/contacts` |
| `opp:products:{cwOpportunityId}` | Forecast + procurement products blob | `GET /sales/opportunities/:id/forecast` + `GET /procurement/products` |
| `opp:company-cw:{cw_CompanyId}` | Hydrated company + contacts blob | `GET /company/companies/:id` + contacts endpoints |
| `opp:site:{cwCompanyId}:{cwSiteId}` | Company site data | `GET /company/companies/:id/sites/:siteId` |
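
The key patterns above are templated on ConnectWise IDs. A tiny illustrative sketch of how such keys can be built — the helper names here are hypothetical; only the key patterns themselves come from the table:

```typescript
// Illustrative key builders for the Redis key patterns above.
// Function names are hypothetical; the real code may build keys inline.
const notesKey = (cwOpportunityId: number): string =>
  `opp:notes:${cwOpportunityId}`;

const productsKey = (cwOpportunityId: number): string =>
  `opp:products:${cwOpportunityId}`;

const siteKey = (cwCompanyId: number, cwSiteId: number): string =>
  `opp:site:${cwCompanyId}:${cwSiteId}`;
```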

## TTL Algorithms

Three algorithms compute cache TTLs. All share the same input signals:

- `closedFlag` — whether the opportunity is closed
- `closedDate` — when it was closed
- `expectedCloseDate` — projected close date (forward-looking signal)
- `lastUpdated` — last CW modification date (backward-looking signal)

### Primary TTL (`computeCacheTTL`)

File: `src/modules/algorithms/computeCacheTTL.ts`

Used for: opportunity CW data, activities, company CW data.

| # | Condition | TTL | Human |
|---|---|---|---|
| 1a | Closed > 30 days ago | `null` | Do not cache |
| 1b | Closed within 30 days | 900,000 ms | 15 minutes |
| 2 | `expectedCloseDate` or `lastUpdated` within 5 days | 30,000 ms | 30 seconds |
| 3 | `expectedCloseDate` or `lastUpdated` within 14 days | 60,000 ms | 60 seconds |
| 4 | Everything else | 900,000 ms | 15 minutes |

Rules are evaluated top-to-bottom; first match wins.
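
The rule table reads as straight-line code. A minimal sketch, assuming an input shape matching the signals listed above — the actual field names and date handling in `computeCacheTTL.ts` may differ:

```typescript
// Sketch of the primary TTL rules above. The input shape and field names
// are assumptions based on the documented signals, not copied from the
// real computeCacheTTL.ts.
interface TtlInput {
  closedFlag: boolean;
  closedDate?: Date;
  expectedCloseDate?: Date;
  lastUpdated?: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

function withinDays(date: Date | undefined, days: number, now: Date): boolean {
  if (!date) return false;
  return Math.abs(now.getTime() - date.getTime()) <= days * DAY_MS;
}

// Returns a TTL in milliseconds, or null for "do not cache".
// Rules are evaluated top-to-bottom; first match wins.
function computeCacheTTL(input: TtlInput, now = new Date()): number | null {
  if (input.closedFlag) {
    // Rules 1a/1b: closed opportunities.
    return withinDays(input.closedDate, 30, now) ? 900_000 : null;
  }
  if (withinDays(input.expectedCloseDate, 5, now) || withinDays(input.lastUpdated, 5, now)) {
    return 30_000; // Rule 2: hot, 30 seconds
  }
  if (withinDays(input.expectedCloseDate, 14, now) || withinDays(input.lastUpdated, 14, now)) {
    return 60_000; // Rule 3: warm, 60 seconds
  }
  return 900_000; // Rule 4: everything else, 15 minutes
}
```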

### Sub-Resource TTL (`computeSubResourceCacheTTL`)

File: `src/modules/algorithms/computeSubResourceCacheTTL.ts`

Used for: notes, contacts.

| # | Condition | TTL | Human |
|---|---|---|---|
| 1a | Closed > 30 days ago | `null` | Do not cache |
| 1b | Closed within 30 days | 300,000 ms | 5 minutes |
| 2 | `expectedCloseDate` or `lastUpdated` within 5 days | 60,000 ms | 60 seconds |
| 3 | `expectedCloseDate` or `lastUpdated` within 14 days | 120,000 ms | 2 minutes |
| 4 | Everything else | 300,000 ms | 5 minutes |

### Products TTL (`computeProductsCacheTTL`)

File: `src/modules/algorithms/computeProductsCacheTTL.ts`

Used for: forecast + procurement products.

| # | Condition | TTL | Human |
|---|---|---|---|
| 1 | Status is Won/Lost/Pending Won/Pending Lost | `null` | No cache |
| 2 | Main cache TTL is `null` | `null` | No cache |
| 3 | `lastUpdated` within 3 days | 15,000 ms | 15 seconds |
| 4 | Everything else | 1,800,000 ms | 30 minutes |

Products on terminal-status opportunities are never proactively cached. Non-hot products use a lazy on-demand cache — they're fetched when requested and cached for 30 minutes.
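
A sketch of the products rules under the same assumptions — the status values and parameter names are illustrative, not copied from `computeProductsCacheTTL.ts`:

```typescript
// Sketch of the products TTL rules above. Status strings and parameter
// names are illustrative assumptions.
type OppStatus = "Won" | "Lost" | "Pending Won" | "Pending Lost" | "Open";

const TERMINAL = new Set<OppStatus>(["Won", "Lost", "Pending Won", "Pending Lost"]);
const DAY_MS = 24 * 60 * 60 * 1000;

function computeProductsCacheTTL(
  status: OppStatus,
  mainCacheTTL: number | null, // result of the primary computeCacheTTL
  lastUpdated: Date | undefined,
  now = new Date(),
): number | null {
  if (TERMINAL.has(status)) return null; // Rule 1: terminal status, never cache
  if (mainCacheTTL === null) return null; // Rule 2: defer to the primary algorithm
  if (lastUpdated && Math.abs(now.getTime() - lastUpdated.getTime()) <= 3 * DAY_MS) {
    return 15_000; // Rule 3: hot, 15 seconds
  }
  return 1_800_000; // Rule 4: everything else, 30 minutes
}
```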

### Site TTL

Sites use a fixed TTL of 30 minutes (1,800,000 ms). Site/address data rarely changes. Sites are not proactively warmed by the background refresh — they are populated lazily on the first detail-view request.


## Background Refresh

Function: `refreshOpportunityCache()` in `src/modules/cache/opportunityCache.ts`

Interval: every 30 seconds, triggered from `src/index.ts`.

### Refresh cycle

1. **Query DB** — fetch all non-closed opportunities plus recently closed ones (within 30 days), ordered by `cwLastUpdated DESC` (most recently active first).
2. **Batch EXISTS check** — use a single Redis pipeline to check which cache keys already exist (5 `EXISTS` commands per opportunity: oppCwData, activities, notes, contacts, products).
3. **Build thunk list** — for each opportunity with missing keys, push a thunk (lazy function) into the task list. No HTTP requests fire at this point.
4. **Execute with bounded concurrency** — process thunks in batches of `CONCURRENCY` (currently 6), with a `BATCH_DELAY_MS` (currently 250 ms) pause between batches. Each thunk is only invoked inside the batch loop.
5. **Emit events** — `cache:opportunities:refresh:started` and `cache:opportunities:refresh:completed` events are emitted for the event debugger.

### Concurrency control

The thunk pattern is critical. Previously, tasks were pushed as already-executing promises (`refreshTasks.push(fetchAndCache(...))`), which meant all HTTP requests fired simultaneously regardless of the batching loop. The fix was changing the array type from `Promise<void>[]` to `(() => Promise<void>)[]` so requests only start when explicitly invoked: `batch.map((fn) => fn())`.
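
The batching loop can be sketched as follows; `runThunks` is an illustrative name, and only the `CONCURRENCY` and `BATCH_DELAY_MS` values come from this document:

```typescript
// Sketch of thunk-based bounded concurrency. Because tasks are lazy
// functions, no request starts until its batch is invoked below.
const CONCURRENCY = 6;
const BATCH_DELAY_MS = 250;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runThunks(tasks: Array<() => Promise<void>>): Promise<void> {
  for (let i = 0; i < tasks.length; i += CONCURRENCY) {
    const batch = tasks.slice(i, i + CONCURRENCY);
    // Thunks are only invoked here, so at most CONCURRENCY requests
    // are ever in flight at once.
    await Promise.all(batch.map((fn) => fn()));
    if (i + CONCURRENCY < tasks.length) await sleep(BATCH_DELAY_MS);
  }
}
```

Pushing `fetchAndCache(...)` directly would start the request at push time; pushing `() => fetchAndCache(...)` defers it to the batch loop.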

### Current tuning

| Parameter | Value | Effect |
|---|---|---|
| `CONCURRENCY` | 6 | Max simultaneous CW API requests per batch |
| `BATCH_DELAY_MS` | 250 | Milliseconds between batches |
| Refresh interval | 30 seconds | How often the full sweep runs |

At these settings, a full sweep of ~500 expired keys completes in ~1-2 minutes with zero CW errors and ~230ms median latency.


## Retry Logic (`withCwRetry`)

File: `src/modules/cw-utils/withCwRetry.ts`

Wraps CW API calls with exponential-backoff retries on transient errors.

### Retryable errors

- `ECONNABORTED` (timeout)
- `ECONNRESET`
- `ETIMEDOUT`
- `ECONNREFUSED`
- `ERR_NETWORK`
- `ENETUNREACH`
- HTTP 5xx server errors

### Default configuration

| Parameter | Default | Description |
|---|---|---|
| `maxAttempts` | 3 | Total attempts, including the first |
| `baseDelayMs` | 1,000 | Delay before the first retry (doubles each retry: 1s → 2s → 4s) |
| `label` | (none) | Optional tag for log messages |

### Usage

```typescript
import { withCwRetry } from "./withCwRetry";

const response = await withCwRetry(
  () => connectWiseApi.get(`/company/companies/${id}`),
  { label: `fetchCompany#${id}`, maxAttempts: 3, baseDelayMs: 1_500 },
);
```

Non-transient errors (404, 400, etc.) are re-thrown immediately without retry.
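
For reference, a condensed sketch of what such a wrapper can look like, assuming axios-style errors that carry a `code` and an optional `response.status`; the real `withCwRetry.ts` may differ in structure and logging:

```typescript
// Sketch of an exponential-backoff retry wrapper for transient CW errors.
// The error-shape assumptions (err.code, err.response.status) follow axios
// conventions; the real withCwRetry.ts may differ.
const TRANSIENT_CODES = new Set([
  "ECONNABORTED", "ECONNRESET", "ETIMEDOUT",
  "ECONNREFUSED", "ERR_NETWORK", "ENETUNREACH",
]);

interface RetryOptions {
  maxAttempts?: number; // total attempts, including the first
  baseDelayMs?: number; // doubles after each failed attempt
  label?: string;       // optional tag for log messages
}

function isTransient(err: unknown): boolean {
  const e = err as { code?: string; response?: { status?: number } };
  if (e?.code && TRANSIENT_CODES.has(e.code)) return true;
  const status = e?.response?.status;
  return typeof status === "number" && status >= 500; // HTTP 5xx
}

async function withCwRetry<T>(
  fn: () => Promise<T>,
  { maxAttempts = 3, baseDelayMs = 1_000, label = "" }: RetryOptions = {},
): Promise<T> {
  let delay = baseDelayMs;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Non-transient errors (404, 400, ...) are re-thrown immediately.
      if (!isTransient(err) || attempt >= maxAttempts) throw err;
      console.warn(`[withCwRetry]${label ? ` ${label}` : ""} attempt ${attempt} failed, retrying in ${delay}ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay *= 2; // 1s → 2s → 4s with the defaults
    }
  }
}
```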


## CW API Logger

File: `src/modules/cw-utils/cwApiLogger.ts`

An Axios interceptor that logs every CW API call to a JSONL file. Logging is opt-in — set the `LOG_CW_API` environment variable to enable it. Each process start creates a new timestamped file in the `cw-api-logs/` directory (e.g., `cw-api-logs/2026-03-02T14-30-05.123Z.jsonl`).

### Enabling logging

```shell
# Via the dev:log shorthand script
bun run dev:log

# Or manually with any command
LOG_CW_API=1 bun run dev
```

### Log entry fields

| Field | Type | Description |
|---|---|---|
| `timestamp` | string (ISO-8601) | When the request completed |
| `method` | string | HTTP method |
| `url` | string | Request URL (relative or absolute) |
| `baseURL` | string | Axios baseURL |
| `status` | number \| null | HTTP status (null on network error) |
| `durationMs` | number | Wall-clock time in milliseconds |
| `error` | string \| null | Error code + message, if any |
| `timeout` | number | Configured timeout in ms |
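
Each log line is one JSON object followed by a newline. A minimal sketch of shaping an entry from the fields above — `formatCwLogEntry` is a hypothetical name; the real interceptor assembles and appends equivalent lines:

```typescript
// Sketch of shaping one JSONL line from the documented log fields.
// formatCwLogEntry is a hypothetical helper name, not from cwApiLogger.ts.
interface CwLogEntry {
  timestamp: string;     // ISO-8601, when the request completed
  method: string;
  url: string;
  baseURL: string;
  status: number | null; // null on network error
  durationMs: number;
  error: string | null;  // error code + message, if any
  timeout: number;       // configured timeout in ms
}

function formatCwLogEntry(entry: CwLogEntry): string {
  // One JSON object per line, so the file can be parsed line by line.
  return JSON.stringify(entry) + "\n";
}
```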

### Analysis

Run the analyzer against the most recent log file:

```shell
bun run utils:analyze_cw
```

Or point it at a specific file:

```shell
python3 debug-scripts/analyze-cw-calls.py cw-api-logs/2026-03-02T14-30-05.123Z.jsonl
```

This executes `debug-scripts/analyze-cw-calls.py`, which produces:

- Overview (total calls, error rate, time span)
- Duration statistics (min, max, mean, p50, p90, p95, p99, distribution histogram)
- Error breakdown by type and endpoint
- Top 20 slowest calls
- Per-endpoint stats (count, errors, mean, p50, p95, max, total time)
- Timeline (per-minute throughput and errors)
- Concurrency hotspot detection
- Summary with recommendations

To clear all logs:

```shell
rm -rf cw-api-logs/
```

## Cache Invalidation

Mutation endpoints invalidate the relevant cache keys so the next read fetches fresh data from CW:

| Mutation | Cache invalidated |
|---|---|
| Create/update/delete note | `opp:notes:{cwOpportunityId}` via `invalidateNotesCache()` |
| Create/update/delete contact | `opp:contacts:{cwOpportunityId}` via `invalidateContactsCache()` |
| Add/update/resequence products | `opp:products:{cwOpportunityId}` via `invalidateProductsCache()` |
| Refresh opportunity | All keys for that opportunity (via re-fetch) |
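
The invalidation pattern itself is a keyed delete. A standalone sketch, with an in-memory `Map` standing in for Redis so the example runs without a server — the real helpers presumably delete the same keys from Redis:

```typescript
// Invalidation pattern sketch. The Map stands in for Redis so this runs
// standalone; the real invalidateNotesCache() presumably issues a Redis
// delete on the same key instead.
const cache = new Map<string, string>();

function invalidateNotesCache(cwOpportunityId: number): void {
  cache.delete(`opp:notes:${cwOpportunityId}`);
}

// After a note mutation, the next read misses the cache and re-fetches
// fresh data from CW.
cache.set("opp:notes:42", JSON.stringify([{ id: 1, text: "call back Monday" }]));
invalidateNotesCache(42);
```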

## ConnectWise API Configuration

The shared Axios instance (`connectWiseApi`) is configured in `src/constants.ts`:

| Setting | Value | Purpose |
|---|---|---|
| `baseURL` | `https://ttscw.totaltech.net/v4_6_release/apis/3.0/` | CW API base |
| `timeout` | 30,000 ms (30s) | Per-request timeout |
| Logger | `attachCwApiLogger()` | Writes JSONL call logs to `cw-api-logs/` when `LOG_CW_API` is set |

## Architecture diagram

```
src/index.ts
  │
  ├─ setInterval(refreshOpportunityCache, 30s)
  │
  └─► src/modules/cache/opportunityCache.ts
        │
        ├─ prisma.opportunity.findMany(orderBy: cwLastUpdated DESC)
        ├─ redis.pipeline().exists(...)  ← batch key check
        │
        ├─ Build thunk list (lazy functions)
        │
        └─ Execute thunks with CONCURRENCY=6, DELAY=250ms
             │
             ├─► fetchAndCacheOppCwData()      ─► opportunityCw.fetch()
             ├─► fetchAndCacheActivities()     ─► activityCw.fetchByOpportunityDirect()
             ├─► fetchAndCacheNotes()          ─► opportunityCw.fetchNotes()
             ├─► fetchAndCacheContacts()       ─► opportunityCw.fetchContacts()
             ├─► fetchAndCacheProducts()       ─► opportunityCw.fetchProducts() + fetchProcurementProducts()
             ├─► fetchAndCacheCompanyCwData()  ─► fetchCwCompanyById() + contacts
             └─► fetchAndCacheSite()           ─► fetchCompanySite() (lazy only)
                   │
                   └─► connectWiseApi.get(...)  ← withCwRetry + cwApiLogger interceptors
                         │
                         └─► Redis SET with computed TTL
```

## File reference

| File | Purpose |
|---|---|
| `src/modules/cache/opportunityCache.ts` | Cache read/write helpers, background refresh logic |
| `src/modules/algorithms/computeCacheTTL.ts` | Primary adaptive TTL algorithm |
| `src/modules/algorithms/computeSubResourceCacheTTL.ts` | Sub-resource (notes, contacts) TTL algorithm |
| `src/modules/algorithms/computeProductsCacheTTL.ts` | Products TTL algorithm |
| `src/modules/cw-utils/withCwRetry.ts` | Retry wrapper with exponential backoff |
| `src/modules/cw-utils/cwApiLogger.ts` | Axios interceptor for JSONL call logging |
| `src/modules/cw-utils/fetchCompany.ts` | Company fetch with retry |
| `src/constants.ts` | CW Axios instance config (timeout, logger) |
| `src/index.ts` | Refresh interval registration |
| `debug-scripts/analyze-cw-calls.py` | CW API call analysis script |