feat: Redis opportunity cache, CW API retry/logging, adaptive TTLs
- Add Redis-backed opportunity cache with background refresh (30s interval) - Fix concurrency bug: use lazy thunks instead of eager promises for batching - Add withCwRetry utility with exponential backoff for transient CW errors - Add adaptive TTL algorithms (primary, sub-resource, products) based on opportunity activity - Add include query param on GET /sales/opportunities/:id (notes,contacts,products) - Add opt-in CW API logger (LOG_CW_API env var) with timestamped files in cw-api-logs/ - Add debug-scripts/analyze-cw-calls.py for API call analysis - Add computeSubResourceCacheTTL and computeProductsCacheTTL algorithms with tests - Increase CW API timeout from 15s to 30s - Unblock cache refresh from startup chain (remove await) - Prioritize recently updated opportunities in refresh cycle - Add CACHING.md documentation - Update API_ROUTES.md with caching details and include param - Update copilot instructions to require CACHING.md sync - Add dev:log script for CW API call logging during development
This commit is contained in:
+297
@@ -0,0 +1,297 @@
|
||||
# Caching Architecture
|
||||
|
||||
This document describes the caching layer used in the Optima API, covering the Redis-backed opportunity cache, TTL algorithms, background refresh mechanics, retry logic, and debugging tools.
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
The API caches expensive ConnectWise (CW) API responses in **Redis** to reduce latency and avoid CW rate limits. The primary cache layer is the **opportunity cache** (`src/modules/cache/opportunityCache.ts`), which proactively warms data for all non-closed opportunities on a background interval.
|
||||
|
||||
### Key design principles
|
||||
|
||||
- **Adaptive TTLs** — cache durations are computed dynamically based on how "hot" an opportunity is (recently updated = shorter TTL = fresher data).
|
||||
- **Background refresh** — a 30-second interval scans all open opportunities and re-fetches only expired cache keys.
|
||||
- **Bounded concurrency** — CW API calls are throttled via thunk-based batching to prevent overwhelming the upstream API.
|
||||
- **Graceful degradation** — transient CW errors (timeouts, network failures) are caught, logged, and retried on the next cycle rather than crashing the process.
|
||||
- **Priority ordering** — most recently updated opportunities are refreshed first so active deals get fresh data before stale ones.
|
||||
|
||||
---
|
||||
|
||||
## What is cached
|
||||
|
||||
Each non-closed opportunity can have up to 7 cached payloads in Redis:
|
||||
|
||||
| Cache Key Pattern | Data | Source |
|
||||
|---|---|---|
|
||||
| `opp:cw-data:{cwOpportunityId}` | Raw CW opportunity response | `GET /sales/opportunities/:id` |
|
||||
| `opp:activities:{cwOpportunityId}` | CW activities array | `GET /sales/activities?conditions=opportunity/id=:id` |
|
||||
| `opp:notes:{cwOpportunityId}` | CW notes array | `GET /sales/opportunities/:id/notes` |
|
||||
| `opp:contacts:{cwOpportunityId}` | CW contacts array | `GET /sales/opportunities/:id/contacts` |
|
||||
| `opp:products:{cwOpportunityId}` | Forecast + procurement products blob | `GET /sales/opportunities/:id/forecast` + `GET /procurement/products` |
|
||||
| `opp:company-cw:{cw_CompanyId}` | Hydrated company + contacts blob | `GET /company/companies/:id` + contacts endpoints |
|
||||
| `opp:site:{cwCompanyId}:{cwSiteId}` | Company site data | `GET /company/companies/:id/sites/:siteId` |
|
||||
|
||||
---
|
||||
|
||||
## TTL Algorithms
|
||||
|
||||
Three algorithms compute cache TTLs. All share the same input signals:
|
||||
|
||||
- `closedFlag` — whether the opportunity is closed
|
||||
- `closedDate` — when it was closed
|
||||
- `expectedCloseDate` — projected close date (forward-looking signal)
|
||||
- `lastUpdated` — last CW modification date (backward-looking signal)
|
||||
|
||||
### Primary TTL (`computeCacheTTL`)
|
||||
|
||||
**File:** `src/modules/algorithms/computeCacheTTL.ts`
|
||||
|
||||
Used for: opportunity CW data, activities, company CW data.
|
||||
|
||||
| # | Condition | TTL | Human |
|
||||
|---|---|---|---|
|
||||
| 1a | Closed > 30 days ago | `null` | Do not cache |
|
||||
| 1b | Closed within 30 days | 900,000 ms | 15 minutes |
|
||||
| 2 | `expectedCloseDate` or `lastUpdated` within **5 days** | 30,000 ms | 30 seconds |
|
||||
| 3 | `expectedCloseDate` or `lastUpdated` within **14 days** | 60,000 ms | 60 seconds |
|
||||
| 4 | Everything else | 900,000 ms | 15 minutes |
|
||||
|
||||
Rules are evaluated top-to-bottom; first match wins.
|
||||
|
||||
### Sub-Resource TTL (`computeSubResourceCacheTTL`)
|
||||
|
||||
**File:** `src/modules/algorithms/computeSubResourceCacheTTL.ts`
|
||||
|
||||
Used for: notes, contacts.
|
||||
|
||||
| # | Condition | TTL | Human |
|
||||
|---|---|---|---|
|
||||
| 1a | Closed > 30 days ago | `null` | Do not cache |
|
||||
| 1b | Closed within 30 days | 300,000 ms | 5 minutes |
|
||||
| 2 | Within **5 days** | 60,000 ms | 60 seconds |
|
||||
| 3 | Within **14 days** | 120,000 ms | 2 minutes |
|
||||
| 4 | Everything else | 300,000 ms | 5 minutes |
|
||||
|
||||
### Products TTL (`computeProductsCacheTTL`)
|
||||
|
||||
**File:** `src/modules/algorithms/computeProductsCacheTTL.ts`
|
||||
|
||||
Used for: forecast + procurement products.
|
||||
|
||||
| # | Condition | TTL | Human |
|
||||
|---|---|---|---|
|
||||
| 1 | Status is Won/Lost/Pending Won/Pending Lost | `null` | No cache |
|
||||
| 2 | Main cache TTL is `null` | `null` | No cache |
|
||||
| 3 | `lastUpdated` within **3 days** | 15,000 ms | 15 seconds |
|
||||
| 4 | Everything else | 1,800,000 ms | 30 minutes |
|
||||
|
||||
Products on terminal-status opportunities are never proactively cached. Non-hot products use a **lazy on-demand** cache — they're fetched when requested and cached for 30 minutes.
|
||||
|
||||
### Site TTL
|
||||
|
||||
Sites use a fixed TTL of **30 minutes** (1,800,000 ms). Site/address data rarely changes. Sites are **not** proactively warmed by the background refresh — they are populated lazily on the first detail-view request.
|
||||
|
||||
---
|
||||
|
||||
## Background Refresh
|
||||
|
||||
**Function:** `refreshOpportunityCache()` in `src/modules/cache/opportunityCache.ts`
|
||||
|
||||
**Interval:** Every 30 seconds, triggered from `src/index.ts`.
|
||||
|
||||
### Refresh cycle
|
||||
|
||||
1. **Query DB** — fetch all non-closed opportunities + recently closed (within 30 days), ordered by `cwLastUpdated DESC` (most recently active first).
|
||||
2. **Batch EXISTS check** — use a single Redis pipeline to check which cache keys already exist (5 EXISTS commands per opportunity: oppCwData, activities, notes, contacts, products).
|
||||
3. **Build thunk list** — for each opportunity with missing keys, push a **thunk** (lazy function) into the task list. No HTTP requests fire at this point.
|
||||
4. **Execute with bounded concurrency** — process thunks in batches of `CONCURRENCY` (currently **6**), with a `BATCH_DELAY_MS` (currently **250ms**) pause between batches. Each thunk is only invoked inside the batch loop.
|
||||
5. **Emit events** — `cache:opportunities:refresh:started` and `cache:opportunities:refresh:completed` events are emitted for the event debugger.
|
||||
|
||||
### Concurrency control
|
||||
|
||||
The thunk pattern is critical. Previously, tasks were pushed as already-executing promises (`refreshTasks.push(fetchAndCache(...))`), which meant all HTTP requests fired simultaneously regardless of the batching loop. The fix was changing the array type from `Promise<void>[]` to `(() => Promise<void>)[]` so requests only start when explicitly invoked: `batch.map((fn) => fn())`.
|
||||
|
||||
### Current tuning
|
||||
|
||||
| Parameter | Value | Effect |
|
||||
|---|---|---|
|
||||
| `CONCURRENCY` | 6 | Max simultaneous CW API requests per batch |
|
||||
| `BATCH_DELAY_MS` | 250 | Milliseconds between batches |
|
||||
| Refresh interval | 30 seconds | How often the full sweep runs |
|
||||
|
||||
At these settings, a full sweep of ~500 expired keys completes in ~1-2 minutes with zero CW errors and ~230ms median latency.
|
||||
|
||||
---
|
||||
|
||||
## Retry Logic (`withCwRetry`)
|
||||
|
||||
**File:** `src/modules/cw-utils/withCwRetry.ts`
|
||||
|
||||
Wraps CW API calls with exponential backoff retry on transient errors.
|
||||
|
||||
### Retryable errors
|
||||
|
||||
- `ECONNABORTED` (timeout)
|
||||
- `ECONNRESET`
|
||||
- `ETIMEDOUT`
|
||||
- `ECONNREFUSED`
|
||||
- `ERR_NETWORK`
|
||||
- `ENETUNREACH`
|
||||
- HTTP 5xx server errors
|
||||
|
||||
### Default configuration
|
||||
|
||||
| Parameter | Default | Description |
|
||||
|---|---|---|
|
||||
| `maxAttempts` | 3 | Total attempts including the first |
|
||||
| `baseDelayMs` | 1,000 | Delay before first retry (doubles each retry: 1s → 2s → 4s) |
|
||||
| `label` | — | Optional tag for log messages |
|
||||
|
||||
### Usage
|
||||
|
||||
```ts
|
||||
import { withCwRetry } from "./withCwRetry";
|
||||
|
||||
const response = await withCwRetry(
|
||||
() => connectWiseApi.get(`/company/companies/${id}`),
|
||||
{ label: `fetchCompany#${id}`, maxAttempts: 3, baseDelayMs: 1_500 },
|
||||
);
|
||||
```
|
||||
|
||||
Non-transient errors (404, 400, etc.) are re-thrown immediately without retry.
|
||||
|
||||
---
|
||||
|
||||
## CW API Logger
|
||||
|
||||
**File:** `src/modules/cw-utils/cwApiLogger.ts`
|
||||
|
||||
Axios interceptor that logs every CW API call to a JSONL file. Logging is **opt-in** — set the `LOG_CW_API` environment variable to enable it. Each process start creates a new timestamped file in the `cw-api-logs/` directory (e.g., `cw-api-logs/2026-03-02T14-30-05.123Z.jsonl`).
|
||||
|
||||
### Enabling logging
|
||||
|
||||
```bash
|
||||
# Via the dev:log shorthand script
|
||||
bun run dev:log
|
||||
|
||||
# Or manually with any command
|
||||
LOG_CW_API=1 bun run dev
|
||||
```
|
||||
|
||||
### Log entry fields
|
||||
|
||||
| Field | Type | Description |
|
||||
|---|---|---|
|
||||
| `timestamp` | string (ISO-8601) | When the request completed |
|
||||
| `method` | string | HTTP method |
|
||||
| `url` | string | Request URL (relative or absolute) |
|
||||
| `baseURL` | string | Axios baseURL |
|
||||
| `status` | number \| null | HTTP status (null on network error) |
|
||||
| `durationMs` | number | Wall-clock time in milliseconds |
|
||||
| `error` | string \| null | Error code + message, if any |
|
||||
| `timeout` | number | Configured timeout in ms |
|
||||
|
||||
### Analysis
|
||||
|
||||
Run the analyzer script to analyze the most recent log file:
|
||||
|
||||
```bash
|
||||
bun run utils:analyze_cw
|
||||
```
|
||||
|
||||
Or specify a particular file:
|
||||
|
||||
```bash
|
||||
python3 debug-scripts/analyze-cw-calls.py cw-api-logs/2026-03-02T14-30-05.123Z.jsonl
|
||||
```
|
||||
|
||||
This executes `debug-scripts/analyze-cw-calls.py` which produces:
|
||||
|
||||
- Overview (total calls, error rate, time span)
|
||||
- Duration statistics (min, max, mean, p50, p90, p95, p99, distribution histogram)
|
||||
- Error breakdown by type and endpoint
|
||||
- Top 20 slowest calls
|
||||
- Per-endpoint stats (count, errors, mean, p50, p95, max, total time)
|
||||
- Timeline (per-minute throughput and errors)
|
||||
- Concurrency hotspot detection
|
||||
- Summary with recommendations
|
||||
|
||||
To clear all logs:
|
||||
|
||||
```bash
|
||||
rm -rf cw-api-logs/
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Cache Invalidation
|
||||
|
||||
Mutation endpoints invalidate the relevant cache keys so the next read fetches fresh data from CW:
|
||||
|
||||
| Mutation | Cache invalidated |
|
||||
|---|---|
|
||||
| Create/update/delete note | `opp:notes:{cwOpportunityId}` via `invalidateNotesCache()` |
|
||||
| Create/update/delete contact | `opp:contacts:{cwOpportunityId}` via `invalidateContactsCache()` |
|
||||
| Add/update/resequence products | `opp:products:{cwOpportunityId}` via `invalidateProductsCache()` |
|
||||
| Refresh opportunity | All keys for that opportunity (via re-fetch) |
|
||||
|
||||
---
|
||||
|
||||
## ConnectWise API Configuration
|
||||
|
||||
The shared Axios instance (`connectWiseApi`) is configured in `src/constants.ts`:
|
||||
|
||||
| Setting | Value | Purpose |
|
||||
|---|---|---|
|
||||
| `baseURL` | `https://ttscw.totaltech.net/v4_6_release/apis/3.0/` | CW API base |
|
||||
| `timeout` | 30,000 ms (30s) | Per-request timeout |
|
||||
| Logger | `attachCwApiLogger()` | Writes to `cw-api-calls.jsonl` |
|
||||
|
||||
---
|
||||
|
||||
## Architecture diagram
|
||||
|
||||
```
|
||||
src/index.ts
|
||||
│
|
||||
├─ setInterval(refreshOpportunityCache, 30s)
|
||||
│
|
||||
└─► src/modules/cache/opportunityCache.ts
|
||||
│
|
||||
├─ prisma.opportunity.findMany(orderBy: cwLastUpdated DESC)
|
||||
├─ redis.pipeline().exists(...) ← batch key check
|
||||
│
|
||||
├─ Build thunk list (lazy functions)
|
||||
│
|
||||
└─ Execute thunks with CONCURRENCY=6, DELAY=250ms
|
||||
│
|
||||
├─► fetchAndCacheOppCwData() ─► opportunityCw.fetch()
|
||||
├─► fetchAndCacheActivities() ─► activityCw.fetchByOpportunityDirect()
|
||||
├─► fetchAndCacheNotes() ─► opportunityCw.fetchNotes()
|
||||
├─► fetchAndCacheContacts() ─► opportunityCw.fetchContacts()
|
||||
├─► fetchAndCacheProducts() ─► opportunityCw.fetchProducts() + fetchProcurementProducts()
|
||||
├─► fetchAndCacheCompanyCwData() ─► fetchCwCompanyById() + contacts
|
||||
└─► fetchAndCacheSite() ─► fetchCompanySite() (lazy only)
|
||||
│
|
||||
└─► connectWiseApi.get(...) ← withCwRetry + cwApiLogger interceptors
|
||||
│
|
||||
└─► Redis SET with computed TTL
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## File reference
|
||||
|
||||
| File | Purpose |
|
||||
|---|---|
|
||||
| `src/modules/cache/opportunityCache.ts` | Cache read/write helpers, background refresh logic |
|
||||
| `src/modules/algorithms/computeCacheTTL.ts` | Primary adaptive TTL algorithm |
|
||||
| `src/modules/algorithms/computeSubResourceCacheTTL.ts` | Sub-resource (notes, contacts) TTL algorithm |
|
||||
| `src/modules/algorithms/computeProductsCacheTTL.ts` | Products TTL algorithm |
|
||||
| `src/modules/cw-utils/withCwRetry.ts` | Retry wrapper with exponential backoff |
|
||||
| `src/modules/cw-utils/cwApiLogger.ts` | Axios interceptor for JSONL call logging |
|
||||
| `src/modules/cw-utils/fetchCompany.ts` | Company fetch with retry |
|
||||
| `src/constants.ts` | CW Axios instance config (timeout, logger) |
|
||||
| `src/index.ts` | Refresh interval registration |
|
||||
| `debug-scripts/analyze-cw-calls.py` | CW API call analysis script |
|
||||
Reference in New Issue
Block a user