SSH and VPS
PM2 needs --update-env to pick up env file changes
From legacy section: Competitive Audit Automation (v1.1/v1.2)
Pattern: Added `PAGESPEED_API_KEY=...` to `/opt/evolve/config/competitive-audit.env`, ran `pm2 restart competitive-audit`, and expected the new env var to be visible to the running process. It wasn't: PM2 keeps the parent shell env from when the process was first started, so a plain restart reuses the old environment. Subsequent audits still hit the unauthenticated quota.
Rule: After any env-file change for a PM2-managed process, restart with `pm2 restart {name} --update-env`. Verify the new var is reachable at runtime with `node -e "import('./src/config.js').then(m => console.log('key set:', !!m.config.pagespeed?.apiKey))"`. Do not rely on `pm2 env` for this check: it shows only PM2-managed vars, not what `dotenv.config()` loads inside the Node process.
Date: 2026-04-24
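The one-off `node -e` check in the rule above can be generalized into a small presence check that never prints secret values. This is a minimal sketch, not code from the app: the function names `missingEnvVars` and `logEnvPresence` are hypothetical, and it assumes the variables of interest end up in `process.env` (which is where `dotenv.config()` puts them).

```javascript
// Return the names of required variables that are absent or blank.
// `env` defaults to the live process environment; pass a plain object to test.
function missingEnvVars(required, env = process.env) {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || String(value).trim() === "";
  });
}

// Hypothetical startup check: logs presence only, never the value itself.
// Returns true when every required variable is set.
function logEnvPresence(required, env = process.env, log = console.log) {
  const missing = missingEnvVars(required, env);
  for (const name of required) {
    log(`${name} set: ${!missing.includes(name)}`);
  }
  return missing.length === 0;
}
```

Calling `logEnvPresence(["PAGESPEED_API_KEY"])` once at boot would make a missed `--update-env` visible in the first lines of the PM2 log instead of surfacing later as quota errors.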
PM2 restart mid-Claude-stream silently strands audits in generating_report
From legacy section: Competitive Audit Automation (v1.1/v1.2)
Pattern: Three Medium audits (ca726b91, f7506ade, 00cb45a7) all reached `generating_report` and got stuck there because Jim restarted PM2 mid-pipeline to deploy two changes (the website-cleaner code change and the `EMAIL_OVERRIDE_TO` env update). Claude streaming state is in-memory; killing the process drops the stream. The audit row stays in `generating_report` until the 40-minute stuck-watchdog catches it. There is no automatic retry and no surfaced error during the 40-minute window, just silence.
Rule: Before any `pm2 restart competitive-audit`, query the DB for in-flight audits: `sqlite3 /opt/evolve/data/competitive-audit/app.db "SELECT id, status FROM audits WHERE status NOT IN ('completed','failed','refunded')"`. If anything is mid-pipeline, either (a) wait for it to finish or (b) accept that you'll need to retry it via `/api/internal/retry/:id?from=publish` afterward (this preserves LD scan data and only re-runs Claude, saving ~10 min plus LD credits per audit). Long-term: ship the resumable-Claude future-improvement so restarts auto-recover.
Date: 2026-04-25
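The pre-restart guard above can be sketched as a pure decision function over the rows returned by the in-flight query. This is illustrative only: `inFlightAudits`, `retryPath`, and `restartPlan` are hypothetical names, and the terminal statuses are taken from the SQL filter in the rule, on the assumption that the list is complete.

```javascript
// Terminal statuses, copied from the NOT IN (...) filter in the rule above.
const TERMINAL_STATUSES = new Set(["completed", "failed", "refunded"]);

// rows: [{ id, status }], e.g. parsed from the sqlite3 query output.
function inFlightAudits(rows) {
  return rows.filter((row) => !TERMINAL_STATUSES.has(row.status));
}

// Build the retry path named in the rule; the host is deployment-specific.
function retryPath(auditId) {
  return `/api/internal/retry/${encodeURIComponent(auditId)}?from=publish`;
}

// Decide whether a restart is safe and list the retries needed if it is not.
function restartPlan(rows) {
  const inFlight = inFlightAudits(rows);
  return {
    safeToRestart: inFlight.length === 0,
    retryAfterRestart: inFlight.map((row) => retryPath(row.id)),
  };
}
```

If `safeToRestart` is false, option (a) is to wait and re-run the query; option (b) is to restart anyway and then call each path in `retryAfterRestart`.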
claude.js promptCache is per-tier and lives for the whole pm2 boot; pm2 restart is the only invalidation
From legacy section: Competitive Audit Automation (v1.1/v1.2)
Pattern: `loadSystemPrompt(tier)` in `claude.js:18` populates a module-level `promptCache` Map keyed by normalized tier name, on the first call per tier per pm2 boot. The user's mental model was "prompts load fresh per call", which is wrong for warm processes. After deploying a prompt-file change via scp, the only way the running process picks it up is `pm2 restart`. With pm2 staying up across multiple deploy iterations, this is invisible and very easy to miss. It confused the timeline reconstruction during this morning's diagnosis (was the wrapper-fixed prompt loaded by the run that was missing the wrapper?).
Rule: Any prompt-file change requires `pm2 restart competitive-audit` before the next audit can pick it up. Document the cache and its invalidation in any deploy runbook for this app. Future-improvement candidate: watch the prompts directory and bust the cache on mtime change so scp-and-go works without an explicit restart, removing one footgun. The in-flight-audit guard from the existing PM2-restart-mid-stream lesson still applies: query the DB before restarting.
Date: 2026-04-25
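The mtime-based future improvement could look roughly like the sketch below. It is not the code in `claude.js`: `createPromptLoader` and `pathForTier` are hypothetical, the tier normalization (trim plus lowercase) is an assumption, and the file-system functions are injected so the logic can be tested without touching disk. In the app they would be `statSync` and `readFileSync` from `node:fs`.

```javascript
// Sketch: per-tier prompt cache that reloads a file when its mtime changes.
// statSync/readFileSync are injected (node:fs versions in real use);
// pathForTier is a hypothetical mapping from tier name to prompt file path.
function createPromptLoader({ statSync, readFileSync, pathForTier }) {
  const cache = new Map(); // normalized tier -> { mtimeMs, text }

  return function loadSystemPrompt(tier) {
    const key = String(tier).trim().toLowerCase(); // assumed normalization
    const file = pathForTier(key);
    const { mtimeMs } = statSync(file);

    const hit = cache.get(key);
    if (hit && hit.mtimeMs === mtimeMs) return hit.text; // file unchanged

    // First load, or the file was replaced (e.g. by scp): re-read and re-cache.
    const text = readFileSync(file, "utf8");
    cache.set(key, { mtimeMs, text });
    return text;
  };
}
```

The trade-off is one `stat` call per prompt load in exchange for scp-and-go deploys; a directory watcher would avoid the per-call `stat` but adds more moving parts.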
Rapid sshd polling triggers fail2ban; HTTP polling endpoints sidestep the lockout
From legacy section: Competitive Audit Automation (v1.1/v1.2)
Pattern: During the morning checklist, polling Howell's status via `ssh evolve@... sqlite3 ...` every 60s for ~10 min triggered fail2ban (or some IP-block layer) on port 22 specifically. Port 443 stayed open the entire time and the host was fine; sshd was the only thing rejecting connections. `fail2ban-client unban --all` didn't immediately recover access (the iptables chain may persist past the unban, or the block was at a separate layer). Lost ~30 min waiting for the lockout to clear.
Rule: When polling audit status during retries, use the existing `GET /api/audit/:id/status` HTTPS endpoint instead of ssh+sqlite. Same data, no ssh round-trips, no fail2ban risk. Reference: `infra/vps/apps/competitive-audit/src/routes/audit.js:376`. Reserve ssh for things that genuinely need shell access (running CLI scripts, log inspection); the audit pipeline exposes everything else over HTTPS.
Date: 2026-04-25
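A polling loop against the HTTPS endpoint might look like the following sketch. Several things here are assumptions rather than facts from the app: the response is assumed to be JSON with a `status` field, the terminal statuses are borrowed from the in-flight SQL filter used elsewhere in these notes, and `pollAuditStatus`, `statusUrl`, and the option names are hypothetical. It relies on the global `fetch` available in Node 18+.

```javascript
// Assumed terminal statuses (borrowed from the in-flight audit SQL filter).
const TERMINAL = new Set(["completed", "failed", "refunded"]);

// Build the status URL for the documented GET /api/audit/:id/status route.
function statusUrl(baseUrl, auditId) {
  return new URL(`/api/audit/${encodeURIComponent(auditId)}/status`, baseUrl).toString();
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll until the audit reaches a terminal status or attempts run out.
async function pollAuditStatus(baseUrl, auditId, options = {}) {
  const { intervalMs = 60_000, maxAttempts = 30, fetchImpl = globalThis.fetch } = options;
  let last = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    const res = await fetchImpl(statusUrl(baseUrl, auditId));
    if (!res.ok) throw new Error(`status endpoint returned HTTP ${res.status}`);
    const body = await res.json();
    last = body.status; // assumed response field
    if (TERMINAL.has(last)) return { status: last, attempts: attempt };
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  return { status: last, attempts: maxAttempts, timedOut: true };
}
```

The default 60-second interval matches the cadence that tripped the ssh lockout, which is harmless here because nothing touches port 22.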
Slot-gate wait time eats stuck-watchdog budget: refresh updated_at on slot acquire
From legacy section: Competitive Audit Automation (v1.1/v1.2)
Pattern: Putnam Place (paid Full audit) was created at 03:57 UTC. The pipeline set status=`generating_report` at 03:59 (which bumped `updated_at`). Then it waited 36 min in the slot-gate behind a slow Ianniello research call. It acquired the slot at 04:35 and started Claude work. At 04:53 (54 min after the `generating_report` status was set), the stuck-watchdog measured `updated_at < now - 50 min` and marked Putnam `failed`, even though Claude had only been working for 18 min. Putnam happened to recover because the orchestrator's later `setState({completed})` clobbered the watchdog's stale UPDATE, but that was sheer luck. Without the race-win, a paying buyer would have been silently dropped.
Rule: When a slot-gate or queue introduces wait time before real work begins, the work-time-bounded watchdog must not measure from queue entry. Fix shape: `acquireSlot()` writes a fresh `updated_at` to the audit row right when the slot is granted, so the watchdog measures only the time Claude has actually been running. Implementation: pass `auditId` through `withClaudeSlot(fn, context)` from the report functions; the slot-gate runs `UPDATE audits SET updated_at = datetime('now') WHERE id = ?`. Best-effort: never block the slot acquisition on DB latency. Reference: `services/slot-gate.js:refreshAuditTimestamp`.
Date: 2026-04-26
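The fix shape above can be illustrated with a small in-memory slot gate. This is a sketch under assumptions, not the contents of `services/slot-gate.js`: `createSlotGate` and the FIFO waiter queue are hypothetical, and `refreshAuditTimestamp` is injected so that in the app it could run the `UPDATE` from the rule while the sketch stays free of DB code.

```javascript
// Sketch of a slot gate that refreshes the audit timestamp on slot grant.
// refreshAuditTimestamp(auditId) is injected; everything else is illustrative.
function createSlotGate({ max = 1, refreshAuditTimestamp }) {
  let active = 0;
  const waiters = []; // FIFO queue of resolve callbacks

  function acquire() {
    if (active < max) {
      active += 1;
      return Promise.resolve();
    }
    return new Promise((resolve) => waiters.push(resolve));
  }

  function release() {
    const next = waiters.shift();
    if (next) next(); // hand the slot directly to the next waiter
    else active -= 1;
  }

  // Best-effort: start the refresh but never wait on it or let it throw.
  function bestEffort(fn) {
    try {
      Promise.resolve(fn()).catch(() => {});
    } catch {
      // synchronous failure is ignored on purpose
    }
  }

  return async function withClaudeSlot(fn, context = {}) {
    await acquire();
    try {
      if (context.auditId !== undefined) {
        bestEffort(() => refreshAuditTimestamp(context.auditId));
      }
      return await fn();
    } finally {
      release();
    }
  };
}
```

With this shape, Putnam's `updated_at` would have been reset at 04:35, so a 50-minute watchdog would not have fired until 05:25 rather than at 04:53.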
slot-gate "queued: N" semantics: N is wait-position, NOT slot-empty count
From legacy section: Competitive Audit Automation (v1.1/v1.2)
Pattern: While diagnosing Rock Academy's status, saw the log line `"claude slot-gate: waiting for free slot","queued":0,"max":1` and incorrectly inferred "slot is empty, audit is acquiring immediately." It actually means "0 audits ahead of me in the wait queue (I'm next), max 1 in slot, currently waiting." The misread caused me to declare the Rock Academy retry "failed" before realizing it was still actively running its Claude call.
Rule: `withClaudeSlot()`'s `queued: N` log field is the position in the queue at the moment the audit STARTED waiting. N=0 means "I'm next when the current slot-holder releases." It does NOT mean the slot is empty. To determine whether a specific audit is currently in the slot, look for the matching `"acquired slot after wait"` event with `waited_ms` set; until that event fires, the audit is queued. Don't infer slot-empty state from `queued: 0` alone.
Date: 2026-04-27
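The reading rule above can be encoded as a small log scan. This is a sketch with loud assumptions: it treats the log as JSON lines, and the `msg` and `auditId` field names are hypothetical (only `queued`, `max`, `waited_ms`, and the two message texts appear in these notes). The function name `slotStateFor` is also hypothetical.

```javascript
// Decide whether an audit is still queued or has acquired the slot,
// based only on its own slot-gate log events (assumed JSON lines).
function slotStateFor(logLines, auditId) {
  let state = "unknown";
  for (const line of logLines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip non-JSON lines
    }
    // `auditId` and `msg` are assumed field names, not confirmed by the app.
    if (!entry || entry.auditId !== auditId || typeof entry.msg !== "string") continue;

    if (entry.msg.includes("waiting for free slot")) {
      state = "queued"; // true for any value of entry.queued, including 0
    } else if (entry.msg.includes("acquired slot after wait") && typeof entry.waited_ms === "number") {
      state = "in-slot";
    }
  }
  return state;
}
```

The sketch deliberately ignores the `queued` value when classifying, which is the point of the lesson. It does not track slot release, so "in-slot" means only that the acquire event has been seen.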
VPS deploy path is /opt/evolve/repo/Evolve-Agency/infra/vps/apps/competitive-audit/, NOT /opt/evolve/apps/competitive-audit/
From legacy section: SEO NEO / Workbook
Pattern: Deploying a one-file fix to `services/ghl.js`. The README at `infra/vps/apps/competitive-audit/README.md` says to rsync to `/opt/evolve/apps/competitive-audit/`. Ran the rsync and got `change_dir failed: No such file or directory (2)`: the path doesn't exist on the server. Checked `pm2 jlist` and found the actual cwd: `/opt/evolve/repo/Evolve-Agency/infra/vps/apps/competitive-audit/`. The service is running from the cloned git repo path, NOT the symlink-stripped `/opt/evolve/apps/` path the README references. The README and `bin/deploy-verify.mjs` are both stale on this.
Rule: When deploying to the competitive-audit VPS, the canonical path is `/opt/evolve/repo/Evolve-Agency/infra/vps/apps/competitive-audit/`. Verify before rsync: `ssh evolve@2.24.192.235 "pm2 jlist | python3 -c 'import json,sys; [print(p[\"name\"], p[\"pm2_env\"][\"cwd\"]) for p in json.load(sys.stdin)]'"`. That output is the source of truth.
Why: Confirmed 2026-05-11 during the GHL retry deploy. The README path did not exist; the real path is `/opt/evolve/repo/Evolve-Agency/infra/vps/apps/competitive-audit/`. The README and `bin/deploy-verify.mjs` both need to be updated as a follow-up.
How to apply: For any rsync/scp/ssh against the competitive-audit VPS, use `/opt/evolve/repo/Evolve-Agency/infra/vps/apps/competitive-audit/<path>` until the README and deploy script are corrected. Once corrected, this lesson can be retired.
Date: 2026-05-12
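For a scripted version of the verify-before-rsync step, the same `pm2 jlist` fields used by the Python one-liner above (`name` and `pm2_env.cwd`) can be checked in JavaScript. This is a minimal sketch: `cwdByProcessName` and `checkDeployPath` are hypothetical helpers, and the input is assumed to be the raw stdout of `pm2 jlist` captured by whatever runs the ssh command.

```javascript
// jlistJson: raw stdout of `pm2 jlist` (a JSON array of process descriptors).
function cwdByProcessName(jlistJson) {
  const processes = JSON.parse(jlistJson);
  return new Map(processes.map((p) => [p.name, p.pm2_env.cwd]));
}

// Compare the path you are about to deploy to against the running cwd.
function checkDeployPath(jlistJson, processName, intendedPath) {
  const actual = cwdByProcessName(jlistJson).get(processName);
  if (actual === undefined) {
    return { ok: false, reason: `no PM2 process named ${processName}` };
  }
  const normalize = (p) => p.replace(/\/+$/, ""); // ignore trailing slashes
  if (normalize(actual) === normalize(intendedPath)) {
    return { ok: true, cwd: actual };
  }
  return { ok: false, cwd: actual, reason: `running cwd is ${actual}` };
}
```

A check like this could sit in `bin/deploy-verify.mjs` once that script is corrected, so a stale README path fails before rsync runs rather than during it.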