Everything on this page is published for educational and informational purposes only. Nothing here is investment, financial, legal, tax, or trading advice, a recommendation to buy or sell any security or contract, or a solicitation of any kind. Trading futures, options, equities, and crypto involves substantial risk of loss and is not suitable for every investor. Past performance — including any backtests, demos, or examples shown — does not guarantee future results. Consult a licensed professional before acting on anything you read here.
The model goes live. Now the real work starts.
Production trading is not a problem of clever signals. It is a problem of running a small operation that survives outages, drift, regime change, and the noisy alert channel at 3am. This module covers the runbooks every desk repeats, the six incident classes that account for almost all production pain, the recovery tree that keeps you from improvising during one, the twelve monitors that earn their alerts, and the twenty operating gates that decide whether you sleep.
Ten procedures. Every one repeated. None improvised when it matters.
Deploys, rollbacks, incident response, drift response, broker failover. Each has a trigger, the exact steps in order, and a verify check at the end. Filter by category.
Ten runbook entries cover the procedures every desk repeats. Each one carries a trigger, the exact steps, and a verify check. Document yours; do not improvise live.
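One way to keep a runbook from being improvised is to store it as data: a trigger, an ordered list of steps, and a verify check that runs last. The sketch below assumes that shape. The `Runbook` class, the `rollback-model` entry, and the in-memory `state` dict are hypothetical illustrations, not the module's actual runbooks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Runbook:
    """A documented procedure: trigger, ordered steps, final verify check."""
    name: str
    category: str                      # e.g. "deploy", "rollback", "incident"
    trigger: str                       # the condition that starts the runbook
    steps: List[Callable[[], None]]    # executed strictly in order
    verify: Callable[[], bool]         # must return True for the run to count

    def run(self) -> bool:
        for step in self.steps:
            step()
        return self.verify()


# Hypothetical in-memory state standing in for a real deployment.
state = {"version": "v2", "orders_paused": False}


def pause_orders() -> None:
    state["orders_paused"] = True


def restore_previous_version() -> None:
    state["version"] = "v1"


def resume_orders() -> None:
    state["orders_paused"] = False


rollback = Runbook(
    name="rollback-model",
    category="rollback",
    trigger="PnL divergence after a deploy",
    steps=[pause_orders, restore_previous_version, resume_orders],
    verify=lambda: state["version"] == "v1" and not state["orders_paused"],
)

ok = rollback.run()
```

Keeping the verify check inside the same object as the steps means a run cannot be declared finished without it.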
Six classes of incident. Severity decides whether you can throttle or must stop.
Broker outage, feature drift, PnL divergence, stale data, config drift, risk breach. Each card opens to the four-step playbook: detection, immediate action, containment, postmortem. The split between P0 and P1 is the line between stop and slow.
Six categories cover ~95% of real incidents in production trading. Severity decides whether you can throttle or must stop. The playbook decides what comes next.
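The stop-versus-slow split can be sketched as a lookup from incident class to severity. The six class names and the four playbook steps come from the text above; which class is P0 and which is P1 is an assumption made here for illustration, as is the choice to treat an unknown class as P0.

```python
from enum import Enum


class Severity(Enum):
    P0 = "stop"       # must stop trading
    P1 = "throttle"   # may slow down instead of stopping


# The four-step playbook every incident card opens to.
PLAYBOOK_STEPS = ("detection", "immediate action", "containment", "postmortem")

# ASSUMPTION: this severity assignment is illustrative only; the source
# does not say which incident class belongs to which severity.
INCIDENT_SEVERITY = {
    "broker outage": Severity.P0,
    "risk breach": Severity.P0,
    "stale data": Severity.P0,
    "PnL divergence": Severity.P1,
    "feature drift": Severity.P1,
    "config drift": Severity.P1,
}


def immediate_action(incident_class: str) -> str:
    """Return "stop" or "throttle"; unknown classes default to stopping."""
    return INCIDENT_SEVERITY.get(incident_class, Severity.P0).value
```

Defaulting unknown classes to P0 is the conservative choice: an incident nobody has classified yet is handled as a stop until someone decides otherwise.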
Six symptoms. One tree per symptom. No improvisation.
Pick a symptom; answer the questions; arrive at one of five actions: kill-switch, throttle, rollback, investigate, resume. The tree is opinionated by design — it refuses to skip reconciliation, never silently relaxes a risk limit, and never resumes on hope.
Example symptom: the order route looks healthy but acknowledgements are silent.
First question: is the broker status page green?
Six entry points cover the symptoms most desks see in a quarter. The tree never lets you skip reconciliation, never relaxes a risk limit silently, and never resumes on hope.
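A recovery tree of this kind can be sketched as nested yes/no questions whose leaves are one of the five actions. Only the first question below comes from the text; the reconciliation question and every branch outcome are hypothetical, chosen to show how a tree can refuse to skip reconciliation.

```python
from dataclasses import dataclass
from typing import Dict, Union

# The five terminal actions named in the text.
ACTIONS = {"kill-switch", "throttle", "rollback", "investigate", "resume"}


@dataclass
class Question:
    text: str
    yes: "Node"
    no: "Node"


Node = Union[Question, str]

# Symptom: order route looks healthy but acknowledgements are silent.
# ASSUMPTION: everything after the first question is illustrative.
SILENT_ACKS = Question(
    text="Is the broker status page green?",
    yes=Question(
        text="Do positions reconcile against the broker's records?",
        yes="investigate",
        no="kill-switch",
    ),
    no="kill-switch",
)


def decide(tree: Node, answers: Dict[str, bool]) -> str:
    """Walk the tree; an unanswered question raises instead of being skipped."""
    node = tree
    while isinstance(node, Question):
        if node.text not in answers:
            raise KeyError(f"unanswered question: {node.text}")
        node = node.yes if answers[node.text] else node.no
    if node not in ACTIONS:
        raise ValueError(f"unknown action: {node}")
    return node


action = decide(SILENT_ACKS, {"Is the broker status page green?": False})
```

Because `decide` raises on a missing answer, the reconciliation question cannot be bypassed by leaving it out.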
Twelve monitors. Four layers. The split decides whether you sleep.
Infrastructure, data, model, PnL — each with metrics, thresholds, channels, and frequencies that have actually paid for themselves. Pageable alerts wake on-call; Slack alerts queue; dashboards inform decisions but never wake anyone.
| Layer | Threshold | Channel | Frequency |
|---|---|---|---|
| Data | > 2x bar interval, any symbol | Page on-call | every 5s |
| Data | PSI > 0.25 on any top-10 feature | Slack #alerts | daily |
| Data | Bar count mismatch primary vs. backup | Slack #alerts | hourly |
| Infrastructure | Healthcheck fails 2x in 60s | Page on-call | every 30s |
| Infrastructure | p95 > 500ms for 3 windows | Page on-call | every 10s |
| Infrastructure | > 1000 unprocessed | Slack #alerts | every 1m |
| Model | KS-stat vs. baseline > 0.15 | Slack #alerts | hourly |
| Model | Top-5 set changes by ≥ 2 features | Slack #alerts | weekly |
| Model | p99 > 50% of decay horizon | Page on-call | every 1m |
| PnL | Outside 95% band 3 days in a row | Page on-call | daily |
| PnL | Realised - expected > 2 bps for 5 days | Slack #alerts | daily |
| PnL | > 25% of daily PnL from one name | Slack #alerts | daily |
Twelve monitors across infrastructure, data, model, and PnL. Pageable alerts wake on-call; Slack alerts queue. The split decides whether you sleep.
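As one concrete instance, the PSI monitor in the table can be sketched with the standard Population Stability Index formula, the sum over bins of (actual - expected) * ln(actual / expected). The 0.25 threshold and the Slack channel come from the table; the pre-binned proportions, the epsilon floor, and the function names are assumptions for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

PSI_THRESHOLD = 0.25  # from the monitor table


def psi(expected: Sequence[float], actual: Sequence[float],
        eps: float = 1e-6) -> float:
    """Population Stability Index over pre-binned proportions.

    `expected` and `actual` are per-bin shares that each sum to 1.
    ASSUMPTION: empty bins are floored at `eps` to avoid log(0).
    """
    if len(expected) != len(actual):
        raise ValueError("bin counts differ")
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def route_psi_alert(
    feature_psi: Dict[str, float],
) -> Tuple[Optional[str], List[str]]:
    """Return the channel and the breaching features, or (None, [])."""
    breached = sorted(n for n, v in feature_psi.items() if v > PSI_THRESHOLD)
    if breached:
        return "Slack #alerts", breached
    return None, []


# Hypothetical feature names and bin shares.
scores = {
    "spread": psi([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]),
    "volume": psi([0.5, 0.5], [0.9, 0.1]),
}
channel, features = route_psi_alert(scores)
```

An unchanged distribution scores exactly zero, so only the shifted `volume` feature is queued to Slack; nothing here pages on-call, matching the table's channel for this monitor.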
Twenty gates. Five frequencies — pre-launch through quarterly.
Pre-launch runs once. Daily runs every session. Weekly grooms tickets and audits alert noise. Monthly drills DR and recalibrates the cost model. Quarterly retires the models that should be retired.
Twenty items across five cadences. The pre-launch list runs once before any capital moves; the daily list runs every session. Audit cadence-by-cadence — not item-by-item.
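Auditing by cadence rather than by item can be sketched as grouping gate results under the five cadences and reporting failures per group. The cadence names and the example gate names are taken from the text; the `Gate` class and the pass/fail values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

CADENCES = ("pre-launch", "daily", "weekly", "monthly", "quarterly")


@dataclass(frozen=True)
class Gate:
    cadence: str
    name: str
    passed: bool


def audit_by_cadence(gates: Iterable[Gate]) -> Dict[str, List[str]]:
    """Map each cadence to the names of its failing gates."""
    failures: Dict[str, List[str]] = {c: [] for c in CADENCES}
    for gate in gates:
        if gate.cadence not in failures:
            raise ValueError(f"unknown cadence: {gate.cadence}")
        if not gate.passed:
            failures[gate.cadence].append(gate.name)
    return failures


# ASSUMPTION: pass/fail values are made up for the example.
gates = [
    Gate("weekly", "audit alert noise", True),
    Gate("monthly", "DR drill", False),
    Gate("monthly", "recalibrate cost model", True),
    Gate("quarterly", "retire stale models", False),
]
report = audit_by_cadence(gates)
```

Every cadence appears in the report even when it has no failures, so an empty list reads as a cadence that was audited and found clean.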
Take the whole module offline. 32 pages. Free. No login.
The full Operations, CLI & Recovery module — every runbook, every incident class, every monitor, every cadence gate — in one printable PDF you can keep at the desk.
This is one of six modules in the Nexural Automation curriculum. The library page maps every module, shows the dependency graph, and links the master 14-page Curriculum Index PDF.