SSH and VPS

PM2 needs --update-env to pick up env file changes

From legacy section: Competitive Audit Automation (v1.1/v1.2)

Pattern: Added PAGESPEED_API_KEY=... to /opt/evolve/config/competitive-audit.env, ran pm2 restart competitive-audit, and expected the new env var to be visible to the running process. It wasn't: PM2 keeps the parent shell env from when the process was first started. Subsequent audits still hit the unauthenticated quota.

Rule: After any env-file change for a PM2-managed process, restart with pm2 restart {name} --update-env. Verify the new var is reachable at runtime with node -e "import('./src/config.js').then(m => console.log('key set:', !!m.config.pagespeed?.apiKey))". Note that pm2 env shows only PM2-managed vars, not what dotenv.config() loads inside the Node process.

Date: 2026-04-24
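The runtime check in the rule above can be generalized into a small helper that compares the keys in an env file against the environment a process actually sees. This is a minimal sketch, not the app's code: parseEnvKeys and missingEnvKeys are hypothetical names, and the parser handles only simple KEY=value lines (with an optional leading export).

```javascript
// Hypothetical helper: extract variable names from simple KEY=value lines.
function parseEnvKeys(envFileText) {
  return envFileText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line && !line.startsWith("#") && line.includes("="))
    .map((line) => line.replace(/^export\s+/, "").split("=")[0].trim());
}

// Keys defined in the file but absent (or empty) in the given environment.
function missingEnvKeys(envFileText, env = process.env) {
  return parseEnvKeys(envFileText).filter((key) => !env[key]);
}

// Example: the file defines two keys, but the process only sees one of them.
const fileText =
  "# audit config\nPAGESPEED_API_KEY=abc123\nEMAIL_OVERRIDE_TO=ops@example.com\n";
const seenByProcess = { EMAIL_OVERRIDE_TO: "ops@example.com" };
console.log(missingEnvKeys(fileText, seenByProcess)); // [ 'PAGESPEED_API_KEY' ]
```

Run inside the managed process (for example from a startup log line), a non-empty result would suggest the restart did not pick up the file change.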


PM2 restart mid-Claude-stream silently strands audits in generating_report

From legacy section: Competitive Audit Automation (v1.1/v1.2)

Pattern: Three Medium audits (ca726b91, f7506ade, 00cb45a7) all reached generating_report and got stuck there because Jim restarted PM2 to deploy two code changes (website-cleaner + EMAIL_OVERRIDE_TO env update) mid-pipeline. Claude streaming state is in-memory; a process kill drops the stream. The audit row stays in generating_report until the 40-minute stuck-watchdog catches it. No automatic retry. No surfaced error during the 40-min window, just silence.

Rule: Before any pm2 restart competitive-audit, query the DB for in-flight audits: sqlite3 /opt/evolve/data/competitive-audit/app.db "SELECT id, status FROM audits WHERE status NOT IN ('completed','failed','refunded')". If anything is mid-pipeline, either (a) wait for it to finish or (b) accept that you'll need to retry it via /api/internal/retry/:id?from=publish afterward (this preserves LD scan data and only re-runs Claude, saving ~10 min + LD credits per audit). Long-term: ship the resumable-Claude future-improvement so restarts auto-recover.

Date: 2026-04-25
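The pre-restart check can be sketched as a pure function over audit rows. This is an illustration under assumptions: restartPlan and inFlightAudits are hypothetical names, the rows are assumed to have the { id, status } shape returned by the query in the rule, and the example ids are placeholders. Only the three terminal statuses and the retry path come from the notes.

```javascript
// Terminal statuses, taken from the NOT IN (...) list in the rule's query.
const TERMINAL_STATUSES = new Set(["completed", "failed", "refunded"]);

// rows: [{ id, status }] (assumed shape of the audits query result).
function inFlightAudits(rows) {
  return rows.filter((row) => !TERMINAL_STATUSES.has(row.status));
}

// Hypothetical decision helper to run before `pm2 restart competitive-audit`.
function restartPlan(rows) {
  const inFlight = inFlightAudits(rows);
  return {
    safeToRestart: inFlight.length === 0,
    // If you restart anyway, these are the retry calls to make afterward.
    retryPaths: inFlight.map((row) => `/api/internal/retry/${row.id}?from=publish`),
  };
}

const plan = restartPlan([
  { id: "example-1", status: "generating_report" },
  { id: "example-2", status: "completed" },
]);
console.log(plan.safeToRestart, plan.retryPaths);
```

Keeping the decision separate from the database access means the same function works whether the rows come from the sqlite3 CLI or from an internal endpoint.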


claude.js promptCache is per-tier and lives for the whole pm2 boot cycle; pm2 restart is the only invalidation

From legacy section: Competitive Audit Automation (v1.1/v1.2)

Pattern: loadSystemPrompt(tier) in claude.js:18 populates a module-level promptCache Map, keyed by normalized tier name, on the first call per tier per pm2 boot. The user's mental model was "prompts load fresh per-call", which is wrong for warm processes. After deploying a prompt-file change via scp, the only way the running process picks it up is pm2 restart. With pm2 staying up across multiple deploy iterations, this is invisible and very easy to miss. It confused the timeline reconstruction during this morning's diagnosis (was the wrapper-fixed prompt loaded by the run that was missing the wrapper?).

Rule: Any prompt-file change requires pm2 restart competitive-audit before the next audit can pick it up. Document the cache + invalidation in any deploy runbook for this app. Future-improvement candidate: watch the prompts directory and bust the cache on mtime change so scp-and-go works without an explicit restart, removing one footgun. The in-flight-audit guard from the existing PM2-restart-mid-stream lesson still applies: query the DB before restarting.

Date: 2026-04-25
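One possible shape for the mtime-based invalidation mentioned as a future improvement is sketched below. It is not the current claude.js implementation. It assumes one prompt file per tier at <promptsDir>/<tier>.md (a hypothetical layout) and checks the file's mtime on each call instead of running a directory watcher.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// tier -> { mtimeMs, text }
const promptCache = new Map();

// Assumption: prompts live at <promptsDir>/<normalized tier>.md.
function loadSystemPrompt(tier, promptsDir) {
  const key = tier.trim().toLowerCase();
  const file = path.join(promptsDir, `${key}.md`);
  // One stat per call; cheap compared with a Claude request.
  const { mtimeMs } = fs.statSync(file);
  const cached = promptCache.get(key);
  // Compare for equality, not "newer than", so a file copied with a
  // preserved (older) timestamp still invalidates the cache.
  if (cached && cached.mtimeMs === mtimeMs) return cached.text;
  const text = fs.readFileSync(file, "utf8");
  promptCache.set(key, { mtimeMs, text });
  return text;
}
```

With this shape an scp of a changed prompt file would be picked up by the next audit without a restart. An audit already in progress would keep whatever prompt text it loaded earlier.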


Rapid sshd polling triggers fail2ban; HTTP polling endpoints sidestep the lockout

From legacy section: Competitive Audit Automation (v1.1/v1.2)

Pattern: During the morning checklist, polling Howell's status via ssh evolve@... sqlite3 ... every 60s for ~10 min triggered fail2ban (or some IP-block layer) on port 22 specifically. Port 443 stayed open the entire time and the host was fine; ssh was the only thing rejected. fail2ban-client unban --all didn't immediately restore access (the iptables chain may persist past the unban, or the block was at a separate layer). Lost ~30 min waiting for the lockout to clear.

Rule: When polling audit status during retries, use the existing GET /api/audit/:id/status HTTPS endpoint instead of ssh+sqlite. Same data, no ssh round-trips, no fail2ban risk. Reference: infra/vps/apps/competitive-audit/src/routes/audit.js:376. Reserve ssh for things that genuinely need shell access (running CLI scripts, log inspection); the audit pipeline exposes everything else over HTTPS.

Date: 2026-04-25
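A polling loop against the status endpoint might look like the sketch below. Several things here are assumptions rather than facts from the codebase: the response is assumed to be JSON with a status field, baseUrl is a placeholder, and the terminal statuses are borrowed from the restart-guard query elsewhere on this page. The fetcher is injected so the loop can be exercised without a network.

```javascript
// Statuses treated as terminal (assumed to match the restart-guard query).
const TERMINAL = new Set(["completed", "failed", "refunded"]);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// fetchStatus: async () => status string. Injected so the loop is testable.
async function pollUntilTerminal(fetchStatus, { intervalMs = 60_000, maxPolls = 60 } = {}) {
  for (let i = 0; i < maxPolls; i++) {
    const status = await fetchStatus();
    if (TERMINAL.has(status)) return status;
    await sleep(intervalMs);
  }
  throw new Error(`still in flight after ${maxPolls} polls`);
}

// Assumption: the endpoint returns JSON shaped like { "status": "..." }.
// Uses the global fetch available in Node 18+.
function httpStatusFetcher(baseUrl, auditId) {
  return async () => {
    const res = await fetch(`${baseUrl}/api/audit/${auditId}/status`);
    if (!res.ok) throw new Error(`status endpoint returned ${res.status}`);
    return (await res.json()).status;
  };
}
```

Usage would be along the lines of pollUntilTerminal(httpStatusFetcher(baseUrl, id)), which keeps every request on port 443 and never touches sshd.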


Slot-gate wait time eats stuck-watchdog budget — refresh updated_at on slot acquire

From legacy section: Competitive Audit Automation (v1.1/v1.2)

Pattern: Putnam Place (paid Full audit) was created at 03:57 UTC. The pipeline set status=generating_report at 03:59 (which bumped updated_at). Then it waited 36 min in the slot-gate behind a slow Ianniello research call. It acquired the slot at 04:35 and started Claude work. At 04:53 (54 min after the generating_report status was set), the stuck-watchdog measured updated_at < now - 50 min and marked Putnam failed, even though Claude had only been working for 18 min. Putnam happened to recover because the orchestrator's later setState({completed}) clobbered the watchdog's stale UPDATE, but that was sheer luck. Without the race-win, a paying buyer would have been silently dropped.

Rule: When a slot-gate or queue introduces wait time before real work begins, a work-time-bounded watchdog must not measure from queue entry. Fix shape: acquireSlot() writes a fresh updated_at to the audit row right when the slot is granted, so the watchdog measures only "time Claude has been actually running." Implementation: pass auditId through withClaudeSlot(fn, context) from the report functions; the slot-gate runs UPDATE audits SET updated_at = datetime('now') WHERE id = ?. Best-effort: never block the slot acquisition on DB latency. Reference: services/slot-gate.js:refreshAuditTimestamp.

Date: 2026-04-26
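The fix shape described in the rule can be sketched as a minimal slot gate. This is an illustration, not the contents of services/slot-gate.js: createSlotGate is a hypothetical factory, and refreshAuditTimestamp is injected here so the example has no database dependency (in the real service it would run the UPDATE quoted above).

```javascript
// Minimal counting gate. refreshAuditTimestamp(auditId) is injected; in the
// service it would run:
//   UPDATE audits SET updated_at = datetime('now') WHERE id = ?
function createSlotGate({ max = 1, refreshAuditTimestamp }) {
  let active = 0;
  const waiters = [];

  function acquire() {
    if (active < max) {
      active++;
      return Promise.resolve();
    }
    return new Promise((resolve) => waiters.push(resolve));
  }

  function release() {
    const next = waiters.shift();
    if (next) next(); // hand the slot straight to the next waiter
    else active--;
  }

  return async function withClaudeSlot(fn, context = {}) {
    await acquire();
    try {
      if (context.auditId) {
        // Best-effort: fire and forget, so DB latency or errors never
        // delay or fail the work that just acquired the slot.
        Promise.resolve()
          .then(() => refreshAuditTimestamp(context.auditId))
          .catch(() => {});
      }
      return await fn();
    } finally {
      release();
    }
  };
}
```

Because the refresh is triggered only after acquire() resolves, time spent waiting in the queue no longer counts against the watchdog's budget.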


slot-gate "queued: N" semantics — N is wait-position, NOT slot-empty count

From legacy section: Competitive Audit Automation (v1.1/v1.2)

Pattern: While diagnosing Rock Academy's status, saw log line "claude slot-gate: waiting for free slot","queued":0,"max":1 and incorrectly inferred "slot is empty, audit is acquiring immediately." It actually means "0 audits ahead of me in the wait queue (I'm next), max 1 in slot, currently waiting." The misread caused me to declare the Rock Academy retry "failed" before realizing it was still actively running its Claude call.

Rule: withClaudeSlot()'s queued: N log field is the position in the queue at the moment the audit STARTED waiting. N=0 means "I'm next when the current slot-holder releases"; it does NOT mean the slot is empty. To determine whether a specific audit is currently in the slot, look for the matching "acquired slot after wait" event with waited_ms set; until that event fires, the audit is queued. Don't infer slot-empty state from queued: 0 alone.

Date: 2026-04-27
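The reading rule above can be encoded as a small classifier over parsed log events. The log shape is an assumption: each event is taken to be an object with msg and auditId fields plus optional queued and waited_ms, and the real logs may not carry an audit id on these lines. Only the two message strings and the waited_ms field come from the notes.

```javascript
// Message fragments quoted in the lesson above.
const WAITING = "waiting for free slot";
const ACQUIRED = "acquired slot after wait";

// events: parsed log objects in chronological order (assumed shape:
// { msg, auditId, queued?, waited_ms? }).
function slotState(events, auditId) {
  let state = "unknown";
  for (const event of events) {
    if (event.auditId !== auditId) continue;
    const msg = String(event.msg || "");
    if (msg.includes(WAITING)) {
      // queued: 0 still means "waiting, next in line", not "slot is empty".
      state = "queued";
    } else if (msg.includes(ACQUIRED) && event.waited_ms !== undefined) {
      state = "in_slot";
    }
  }
  return state;
}

const sample = [
  { auditId: "example-1", msg: "claude slot-gate: waiting for free slot", queued: 0, max: 1 },
];
console.log(slotState(sample, "example-1")); // queued
```

The classifier deliberately ignores the value of queued, since the position in the queue says nothing about whether the slot is currently occupied.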


VPS deploy path is /opt/evolve/repo/Evolve-Agency/infra/vps/apps/competitive-audit/, NOT /opt/evolve/apps/competitive-audit/

From legacy section: SEO NEO / Workbook

Pattern: Deploying a one-file fix to services/ghl.js. The README at infra/vps/apps/competitive-audit/README.md says to rsync to /opt/evolve/apps/competitive-audit/. Ran the rsync and got change_dir failed: No such file or directory (2); the path doesn't exist on the server. Checked pm2 jlist and found the actual cwd: /opt/evolve/repo/Evolve-Agency/infra/vps/apps/competitive-audit/. The service is running from the cloned git repo path, NOT the symlink-stripped /opt/evolve/apps/ path the README references. README and bin/deploy-verify.mjs are both stale on this.

Rule: When deploying to the competitive-audit VPS, the canonical path is /opt/evolve/repo/Evolve-Agency/infra/vps/apps/competitive-audit/. Verify before rsync: ssh evolve@2.24.192.235 "pm2 jlist | python3 -c 'import json,sys; [print(p[\"name\"], p[\"pm2_env\"][\"cwd\"]) for p in json.load(sys.stdin)]'". The cwd that pm2 reports is the source of truth.

Why: Confirmed 2026-05-11 during the GHL retry deploy. The README path did not exist on the server; the real path is /opt/evolve/repo/Evolve-Agency/infra/vps/apps/competitive-audit/. README + bin/deploy-verify.mjs both need to be updated as a follow-up.

How to apply: For any rsync/scp/ssh against the competitive-audit VPS, use /opt/evolve/repo/Evolve-Agency/infra/vps/apps/competitive-audit/<path> until the README + deploy script are corrected. Once corrected, this lesson can be retired.

Date: 2026-05-12
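The same verification can be sketched as a guard that a deploy script runs before rsync. This is a hypothetical helper, not the contents of bin/deploy-verify.mjs. It reads the same name and pm2_env.cwd fields as the one-liner in the rule and assumes the pm2 jlist output has already been captured as a JSON string.

```javascript
// Map process name -> cwd from `pm2 jlist` JSON output.
function cwdByProcessName(jlistJson) {
  const result = {};
  for (const proc of JSON.parse(jlistJson)) {
    result[proc.name] = proc.pm2_env.cwd;
  }
  return result;
}

// Hypothetical guard: throw before rsync if the target is not the running cwd.
function assertDeployPath(jlistJson, name, targetPath) {
  const stripTrailingSlash = (p) => p.replace(/\/+$/, "");
  const actual = cwdByProcessName(jlistJson)[name];
  if (actual === undefined) {
    throw new Error(`no pm2 process named ${name}`);
  }
  if (stripTrailingSlash(actual) !== stripTrailingSlash(targetPath)) {
    throw new Error(`deploy path mismatch: pm2 cwd is ${actual}, target is ${targetPath}`);
  }
  return actual;
}

// Example with a hand-written jlist payload (only the fields used here).
const jlist = JSON.stringify([
  {
    name: "competitive-audit",
    pm2_env: { cwd: "/opt/evolve/repo/Evolve-Agency/infra/vps/apps/competitive-audit" },
  },
]);
console.log(cwdByProcessName(jlist)["competitive-audit"]);
```

Failing fast on a mismatch would have surfaced the stale README path as a clear error rather than an rsync change_dir failure.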