feat: Redis opportunity cache, CW API retry/logging, adaptive TTLs

- Add Redis-backed opportunity cache with background refresh (30s interval) - Fix concurrency bug: use lazy thunks instead of eager promises for batching - Add withCwRetry utility with exponential backoff for transient CW errors - Add adaptive TTL algorithms (primary, sub-resource, products) based on opportunity activity - Add include query param on GET /sales/opportunities/:id (notes,contacts,products) - Add opt-in CW API logger (LOG_CW_API env var) with timestamped files in cw-api-logs/ - Add debug-scripts/analyze-cw-calls.py for API call analysis - Add computeSubResourceCacheTTL and computeProductsCacheTTL algorithms with tests - Increase CW API timeout from 15s to 30s - Unblock cache refresh from startup chain (remove await) - Prioritize recently updated opportunities in refresh cycle - Add CACHING.md documentation - Update API_ROUTES.md with caching details and include param - Update copilot instructions to require CACHING.md sync - Add dev:log script for CW API call logging during development
2026-03-02 23:23:24 -06:00
parent fe71248e88
commit 6d935e7180
33 changed files with 2634 additions and 176 deletions
@@ -0,0 +1,297 @@
+# Caching Architecture
+
+This document describes the caching layer used in the Optima API, covering the Redis-backed opportunity cache, TTL algorithms, background refresh mechanics, retry logic, and debugging tools.
+
+---
+
+## Overview
+
+The API caches expensive ConnectWise (CW) API responses in **Redis** to reduce latency and avoid CW rate limits. The primary cache layer is the **opportunity cache** (`src/modules/cache/opportunityCache.ts`), which proactively warms data for all non-closed opportunities on a background interval.
+
+### Key design principles
+
+- **Adaptive TTLs** — cache durations are computed dynamically based on how "hot" an opportunity is (recently updated = shorter TTL = fresher data).
+- **Background refresh** — a 30-second interval scans all open opportunities and re-fetches only expired cache keys.
+- **Bounded concurrency** — CW API calls are throttled via thunk-based batching to prevent overwhelming the upstream API.
+- **Graceful degradation** — transient CW errors (timeouts, network failures) are caught, logged, and retried on the next cycle rather than crashing the process.
+- **Priority ordering** — most recently updated opportunities are refreshed first so active deals get fresh data before stale ones.
+
+---
+
+## What is cached
+
+Each non-closed opportunity can have up to 7 cached payloads in Redis:
+
+| Cache Key Pattern | Data | Source |
+|---|---|---|
+| `opp:cw-data:{cwOpportunityId}` | Raw CW opportunity response | `GET /sales/opportunities/:id` |
+| `opp:activities:{cwOpportunityId}` | CW activities array | `GET /sales/activities?conditions=opportunity/id=:id` |
+| `opp:notes:{cwOpportunityId}` | CW notes array | `GET /sales/opportunities/:id/notes` |
+| `opp:contacts:{cwOpportunityId}` | CW contacts array | `GET /sales/opportunities/:id/contacts` |
+| `opp:products:{cwOpportunityId}` | Forecast + procurement products blob | `GET /sales/opportunities/:id/forecast` + `GET /procurement/products` |
+| `opp:company-cw:{cw_CompanyId}` | Hydrated company + contacts blob | `GET /company/companies/:id` + contacts endpoints |
+| `opp:site:{cwCompanyId}:{cwSiteId}` | Company site data | `GET /company/companies/:id/sites/:siteId` |
+
+---
+
+## TTL Algorithms
+
+Three algorithms compute cache TTLs. All share the same input signals:
+
+- `closedFlag` — whether the opportunity is closed
+- `closedDate` — when it was closed
+- `expectedCloseDate` — projected close date (forward-looking signal)
+- `lastUpdated` — last CW modification date (backward-looking signal)
+
+### Primary TTL (`computeCacheTTL`)
+
+**File:** `src/modules/algorithms/computeCacheTTL.ts`
+
+Used for: opportunity CW data, activities, company CW data.
+
+| # | Condition | TTL | Human |
+|---|---|---|---|
+| 1a | Closed > 30 days ago | `null` | Do not cache |
+| 1b | Closed within 30 days | 900,000 ms | 15 minutes |
+| 2 | `expectedCloseDate` or `lastUpdated` within **5 days** | 30,000 ms | 30 seconds |
+| 3 | `expectedCloseDate` or `lastUpdated` within **14 days** | 60,000 ms | 60 seconds |
+| 4 | Everything else | 900,000 ms | 15 minutes |
+
+Rules are evaluated top-to-bottom; first match wins.
+
+### Sub-Resource TTL (`computeSubResourceCacheTTL`)
+
+**File:** `src/modules/algorithms/computeSubResourceCacheTTL.ts`
+
+Used for: notes, contacts.
+
+| # | Condition | TTL | Human |
+|---|---|---|---|
+| 1a | Closed > 30 days ago | `null` | Do not cache |
+| 1b | Closed within 30 days | 300,000 ms | 5 minutes |
+| 2 | Within **5 days** | 60,000 ms | 60 seconds |
+| 3 | Within **14 days** | 120,000 ms | 2 minutes |
+| 4 | Everything else | 300,000 ms | 5 minutes |
+
+### Products TTL (`computeProductsCacheTTL`)
+
+**File:** `src/modules/algorithms/computeProductsCacheTTL.ts`
+
+Used for: forecast + procurement products.
+
+| # | Condition | TTL | Human |
+|---|---|---|---|
+| 1 | Status is Won/Lost/Pending Won/Pending Lost | `null` | No cache |
+| 2 | Main cache TTL is `null` | `null` | No cache |
+| 3 | `lastUpdated` within **3 days** | 15,000 ms | 15 seconds |
+| 4 | Everything else | 1,800,000 ms | 30 minutes |
+
+Products on terminal-status opportunities are never proactively cached. Non-hot products use a **lazy on-demand** cache — they're fetched when requested and cached for 30 minutes.
+
+### Site TTL
+
+Sites use a fixed TTL of **30 minutes** (1,800,000 ms). Site/address data rarely changes. Sites are **not** proactively warmed by the background refresh — they are populated lazily on the first detail-view request.
+
+---
+
+## Background Refresh
+
+**Function:** `refreshOpportunityCache()` in `src/modules/cache/opportunityCache.ts`
+
+**Interval:** Every 30 seconds, triggered from `src/index.ts`.
+
+### Refresh cycle
+
+1. **Query DB** — fetch all non-closed opportunities + recently closed (within 30 days), ordered by `cwLastUpdated DESC` (most recently active first).
+2. **Batch EXISTS check** — use a single Redis pipeline to check which cache keys already exist (5 EXISTS commands per opportunity: oppCwData, activities, notes, contacts, products).
+3. **Build thunk list** — for each opportunity with missing keys, push a **thunk** (lazy function) into the task list. No HTTP requests fire at this point.
+4. **Execute with bounded concurrency** — process thunks in batches of `CONCURRENCY` (currently **6**), with a `BATCH_DELAY_MS` (currently **250ms**) pause between batches. Each thunk is only invoked inside the batch loop.
+5. **Emit events** — `cache:opportunities:refresh:started` and `cache:opportunities:refresh:completed` events are emitted for the event debugger.
+
+### Concurrency control
+
+The thunk pattern is critical. Previously, tasks were pushed as already-executing promises (`refreshTasks.push(fetchAndCache(...))`), which meant all HTTP requests fired simultaneously regardless of the batching loop. The fix was changing the array type from `Promise<void>[]` to `(() => Promise<void>)[]` so requests only start when explicitly invoked: `batch.map((fn) => fn())`.
+
+### Current tuning
+
+| Parameter | Value | Effect |
+|---|---|---|
+| `CONCURRENCY` | 6 | Max simultaneous CW API requests per batch |
+| `BATCH_DELAY_MS` | 250 | Milliseconds between batches |
+| Refresh interval | 30 seconds | How often the full sweep runs |
+
+At these settings, a full sweep of ~500 expired keys completes in ~1-2 minutes with zero CW errors and ~230ms median latency.
+
+---
+
+## Retry Logic (`withCwRetry`)
+
+**File:** `src/modules/cw-utils/withCwRetry.ts`
+
+Wraps CW API calls with exponential backoff retry on transient errors.
+
+### Retryable errors
+
+- `ECONNABORTED` (timeout)
+- `ECONNRESET`
+- `ETIMEDOUT`
+- `ECONNREFUSED`
+- `ERR_NETWORK`
+- `ENETUNREACH`
+- HTTP 5xx server errors
+
+### Default configuration
+
+| Parameter | Default | Description |
+|---|---|---|
+| `maxAttempts` | 3 | Total attempts including the first |
+| `baseDelayMs` | 1,000 | Delay before first retry (doubles each retry: 1s → 2s → 4s) |
+| `label` | — | Optional tag for log messages |
+
+### Usage
+
+```ts
+import { withCwRetry } from "./withCwRetry";
+
+const response = await withCwRetry(
+  () => connectWiseApi.get(`/company/companies/${id}`),
+  { label: `fetchCompany#${id}`, maxAttempts: 3, baseDelayMs: 1_500 },
+);
+```
+
+Non-transient errors (404, 400, etc.) are re-thrown immediately without retry.
+
+---
+
+## CW API Logger
+
+**File:** `src/modules/cw-utils/cwApiLogger.ts`
+
+Axios interceptor that logs every CW API call to a JSONL file. Logging is **opt-in** — set the `LOG_CW_API` environment variable to enable it. Each process start creates a new timestamped file in the `cw-api-logs/` directory (e.g., `cw-api-logs/2026-03-02T14-30-05.123Z.jsonl`).
+
+### Enabling logging
+
+```bash
+# Via the dev:log shorthand script
+bun run dev:log
+
+# Or manually with any command
+LOG_CW_API=1 bun run dev
+```
+
+### Log entry fields
+
+| Field | Type | Description |
+|---|---|---|
+| `timestamp` | string (ISO-8601) | When the request completed |
+| `method` | string | HTTP method |
+| `url` | string | Request URL (relative or absolute) |
+| `baseURL` | string | Axios baseURL |
+| `status` | number \| null | HTTP status (null on network error) |
+| `durationMs` | number | Wall-clock time in milliseconds |
+| `error` | string \| null | Error code + message, if any |
+| `timeout` | number | Configured timeout in ms |
+
+### Analysis
+
+Run the analyzer script to analyze the most recent log file:
+
+```bash
+bun run utils:analyze_cw
+```
+
+Or specify a particular file:
+
+```bash
+python3 debug-scripts/analyze-cw-calls.py cw-api-logs/2026-03-02T14-30-05.123Z.jsonl
+```
+
+This executes `debug-scripts/analyze-cw-calls.py` which produces:
+
+- Overview (total calls, error rate, time span)
+- Duration statistics (min, max, mean, p50, p90, p95, p99, distribution histogram)
+- Error breakdown by type and endpoint
+- Top 20 slowest calls
+- Per-endpoint stats (count, errors, mean, p50, p95, max, total time)
+- Timeline (per-minute throughput and errors)
+- Concurrency hotspot detection
+- Summary with recommendations
+
+To clear all logs:
+
+```bash
+rm -rf cw-api-logs/
+```
+
+---
+
+## Cache Invalidation
+
+Mutation endpoints invalidate the relevant cache keys so the next read fetches fresh data from CW:
+
+| Mutation | Cache invalidated |
+|---|---|
+| Create/update/delete note | `opp:notes:{cwOpportunityId}` via `invalidateNotesCache()` |
+| Create/update/delete contact | `opp:contacts:{cwOpportunityId}` via `invalidateContactsCache()` |
+| Add/update/resequence products | `opp:products:{cwOpportunityId}` via `invalidateProductsCache()` |
+| Refresh opportunity | All keys for that opportunity (via re-fetch) |
+
+---
+
+## ConnectWise API Configuration
+
+The shared Axios instance (`connectWiseApi`) is configured in `src/constants.ts`:
+
+| Setting | Value | Purpose |
+|---|---|---|
+| `baseURL` | `https://ttscw.totaltech.net/v4_6_release/apis/3.0/` | CW API base |
+| `timeout` | 30,000 ms (30s) | Per-request timeout |
+| Logger | `attachCwApiLogger()` | Writes to `cw-api-calls.jsonl` |
+
+---
+
+## Architecture diagram
+
+```
+src/index.ts
+  │
+  ├─ setInterval(refreshOpportunityCache, 30s)
+  │
+  └─► src/modules/cache/opportunityCache.ts
+        │
+        ├─ prisma.opportunity.findMany(orderBy: cwLastUpdated DESC)
+        ├─ redis.pipeline().exists(...)  ← batch key check
+        │
+        ├─ Build thunk list (lazy functions)
+        │
+        └─ Execute thunks with CONCURRENCY=6, DELAY=250ms
+             │
+             ├─► fetchAndCacheOppCwData()      ─► opportunityCw.fetch()
+             ├─► fetchAndCacheActivities()     ─► activityCw.fetchByOpportunityDirect()
+             ├─► fetchAndCacheNotes()          ─► opportunityCw.fetchNotes()
+             ├─► fetchAndCacheContacts()       ─► opportunityCw.fetchContacts()
+             ├─► fetchAndCacheProducts()       ─► opportunityCw.fetchProducts() + fetchProcurementProducts()
+             ├─► fetchAndCacheCompanyCwData()  ─► fetchCwCompanyById() + contacts
+             └─► fetchAndCacheSite()           ─► fetchCompanySite() (lazy only)
+                   │
+                   └─► connectWiseApi.get(...)  ← withCwRetry + cwApiLogger interceptors
+                         │
+                         └─► Redis SET with computed TTL
+```
+
+---
+
+## File reference
+
+| File | Purpose |
+|---|---|
+| `src/modules/cache/opportunityCache.ts` | Cache read/write helpers, background refresh logic |
+| `src/modules/algorithms/computeCacheTTL.ts` | Primary adaptive TTL algorithm |
+| `src/modules/algorithms/computeSubResourceCacheTTL.ts` | Sub-resource (notes, contacts) TTL algorithm |
+| `src/modules/algorithms/computeProductsCacheTTL.ts` | Products TTL algorithm |
+| `src/modules/cw-utils/withCwRetry.ts` | Retry wrapper with exponential backoff |
+| `src/modules/cw-utils/cwApiLogger.ts` | Axios interceptor for JSONL call logging |
+| `src/modules/cw-utils/fetchCompany.ts` | Company fetch with retry |
+| `src/constants.ts` | CW Axios instance config (timeout, logger) |
+| `src/index.ts` | Refresh interval registration |
+| `debug-scripts/analyze-cw-calls.py` | CW API call analysis script |