Everything on this page is published for educational and informational purposes only. Nothing here is investment, financial, legal, tax, or trading advice, a recommendation to buy or sell any security or contract, or a solicitation of any kind. Trading futures, options, equities, and crypto involves substantial risk of loss and is not suitable for every investor. Past performance — including any backtests, demos, or examples shown — does not guarantee future results. Consult a licensed professional before acting on anything you read here.
Garbage in, lies out. The data layer decides whether your edge is real.
Most backtests that fail in production fail at the data layer, not the strategy. This module covers the vendors worth using, the point-in-time pipeline that prevents look-ahead bias, the corporate-action math that keeps history continuous, the storage formats that survive at scale, and the twelve checks every research pipeline must run before you trust a single backtest result.
Eight vendors. Pick by asset class, price, and PIT quality.
The point-in-time (PIT) column is the one that matters most for honest research: vendors marked "point-in-time" give you survivorship-free history; vendors marked "no PIT" do not.
| Vendor | Asset classes | Coverage | Price (tier) | PIT | Best for |
|---|---|---|---|---|---|
| Yahoo Finance | Equities, FX, Crypto | EOD bars, adjusted close, undocumented coverage gaps | Free, unofficial (free) | no PIT | Tutorials and toy examples. Never production. |
| Norgate Data | Equities, Futures | US/AU/CA equities EOD with full delisting history, since 1980s | $30 – $80 / mo (low) | point-in-time | Daily-bar backtests that need survivorship-free history. |
| IEX Cloud | Equities | US equities consolidated tape, since 2017 | $9 – $499 / mo (low) | partial PIT | Hobbyist research and low-volume production. |
| Alpaca Market Data | Equities, Crypto | US equities (SIP) + crypto, real-time and historical | $0 – $99 / mo (low) | no PIT | Live strategy execution where data and broker are unified. |
| Interactive Brokers | Equities, Futures, Options, FX | Global equities, futures, FX (via TWS / API) | Subscription bundles (low) | no PIT | Multi-asset live trading; not for serious historical backtesting. |
| Polygon.io | Equities, Options, Crypto, FX | US equities + options tick, real-time, since 2003 | $29 – $1,999 / mo (mid) | partial PIT | Retail to mid-frequency US strategies needing real-time. |
| Databento | Equities, Futures, Options | MBO/MBP tick, all US exchanges, since 2018 | $0.10 – $5 / GB usage (high) | point-in-time | Microstructure research and HFT backtesting. |
| CRSP | Equities | US equities daily + monthly, since 1925, academic gold standard | Institutional license (high) | point-in-time | Academic research and regulatory-grade backtests. |
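To make "survivorship-free" concrete, here is a minimal sketch of how a research universe might be rebuilt as of a past date when the vendor supplies delisting history. The `Listing` record, the `universe_as_of` helper, and the tickers and dates are all hypothetical; no vendor's actual schema is implied.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Listing:
    symbol: str
    listed: date
    delisted: Optional[date]  # None means the symbol is still trading


def universe_as_of(listings: list[Listing], as_of: date) -> list[str]:
    """Return symbols tradable on `as_of`, including names that delisted later."""
    return sorted(
        item.symbol
        for item in listings
        if item.listed <= as_of and (item.delisted is None or as_of < item.delisted)
    )


# Hypothetical tickers and dates, for illustration only.
LISTINGS = [
    Listing("AAA", date(2001, 3, 1), None),
    Listing("BBB", date(2005, 6, 15), date(2012, 9, 30)),  # delists after 2010
    Listing("CCC", date(2015, 1, 2), None),                # not yet listed in 2010
]

universe_2010 = universe_as_of(LISTINGS, date(2010, 1, 4))
```

A source without delisting history would silently drop `BBB` from the 2010 universe, which is exactly the survivorship bias the PIT column warns about.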
Five stages. Raw vendor bytes → research-ready features.
Each layer is append-only: a downstream layer reads from the layer above it and writes only to its own, never modifying upstream data. That immutability is what makes a backtest reproducible six months later.
Download raw vendor files (CSV, Parquet, FIX dumps) into an immutable bucket: S3, Backblaze, or even versioned local disk. Always store the vendor's original bytes alongside their SHA256 hash and the timestamp of the pull. Never modify raw files; every downstream stage reads from this layer and writes to a new one.
Output: raw/{vendor}/{date}.csv.gz + SHA256 manifest
Common mistake: editing the raw file in place. If you discover a bias later, you have no audit trail. The raw layer must be append-only.
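The raw stage can be sketched in a few lines of standard-library Python. This is a sketch under assumptions, not the module's own implementation: the function names `store_raw` and `verify_raw`, the `manifest.jsonl` file name, and the choice to hash the uncompressed vendor payload are all illustrative.

```python
import gzip
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def store_raw(root: Path, vendor: str, day: str, payload: bytes) -> dict:
    """Write vendor bytes once to raw/{vendor}/{day}.csv.gz and log them in a manifest."""
    target = root / "raw" / vendor / f"{day}.csv.gz"
    if target.exists():
        # Append-only: refuse to overwrite a file that was already ingested.
        raise FileExistsError(f"{target} already ingested")
    target.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(target, "wb") as fh:
        fh.write(payload)
    entry = {
        "path": str(target.relative_to(root)),
        # Assumption: the hash covers the vendor's bytes before compression.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "pulled_at": datetime.now(timezone.utc).isoformat(),
    }
    # The manifest is append-only too: one JSON line per ingested file.
    with open(root / "raw" / "manifest.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def verify_raw(root: Path, entry: dict) -> bool:
    """Re-hash the stored file and compare it with the manifest entry."""
    with gzip.open(root / entry["path"], "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest() == entry["sha256"]
```

Running `verify_raw` over every manifest entry before a backtest is one inexpensive way to confirm that nothing in the raw layer has been edited since it was pulled.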
Splits, dividends, spinoffs. The math that keeps history continuous.
Six common actions. Each has a definition, a back-adjustment recipe, and a real-world example with numbers. Always back-adjust history; never forward-adjust the present.
In an N-for-1 stock split, the issuer replaces each existing share with N shares. Each share's price divides by N. Total market cap is unchanged. The holder ends up with N times as many shares.
Divide all historical prices BEFORE the ex-date by N. Multiply all historical volumes BEFORE the ex-date by N. Result: a continuous, comparable price series across the split.
AAPL closed at $499.23 on 2020-08-28 (pre-split) and opened at $127.58 on 2020-08-31 (post-split, 4:1). After back-adjustment, pre-split prices are divided by 4: 499.23 → 124.81. Now a moving average crossing the split date works.
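The split recipe above can be sketched in plain Python. The helper name `back_adjust_split` and the tuple layout are illustrative assumptions; the prices come from the AAPL example, while the volumes are made up so the volume adjustment is visible.

```python
from datetime import date


def back_adjust_split(bars, ex_date, ratio):
    """Rescale every bar before ex_date for an N-for-1 split.

    bars: list of (date, price, volume) tuples. A new list is returned;
    the input is left untouched.
    """
    adjusted = []
    for day, price, volume in bars:
        if day < ex_date:
            # History before the ex-date: price / N, volume * N.
            adjusted.append((day, price / ratio, volume * ratio))
        else:
            adjusted.append((day, price, volume))
    return adjusted


bars = [
    (date(2020, 8, 28), 499.23, 1_000_000),  # pre-split close; volume is made up
    (date(2020, 8, 31), 127.58, 4_000_000),  # post-split open; volume is made up
]
adjusted = back_adjust_split(bars, ex_date=date(2020, 8, 31), ratio=4)
print(round(adjusted[0][1], 2))  # 124.81
```

Returning a new list rather than mutating the input keeps the adjusted series in its own layer, consistent with the append-only pipeline described earlier.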
Common (~50/yr in S&P 500)
CSV is the wrong answer. Parquet is the right one. Mostly.
Five formats benchmarked on 1 year and 10 years of US-equity minute bars, with pros and cons for each.
Parquet (snappy), extension .parquet
Pros: Columnar, typed, compressed. Predicate pushdown. Read by pandas, Spark, DuckDB, Polars.
Cons: Not human-readable. Slight write overhead vs CSV. Schema evolution requires care.
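Benchmark numbers are representative of a single M2 Mac mini reading from local NVMe. Cloud object stores (S3, R2) add roughly 50-200 ms of cold-start latency per read regardless of format, so columnar formats, which fetch only the columns a query needs and move fewer bytes over the network, win by an even wider margin there.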
Twelve checks. If you cannot tick all twelve, you do not have a backtest.
Four categories: completeness, correctness, alignment, and survivorship. Each check comes with a one-line code recipe.
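The twelve checks themselves are not reproduced in this section, so the sketch below is only a hypothetical illustration of what checks in the completeness, alignment, and correctness categories might look like. The function names, the dict-based bar layout, and the sample values are assumptions, not the module's recipes.

```python
from datetime import date


def check_no_duplicate_dates(bars):
    """Completeness: each date appears at most once."""
    days = [bar["date"] for bar in bars]
    return len(days) == len(set(days))


def check_dates_sorted(bars):
    """Alignment: bars arrive in chronological order."""
    days = [bar["date"] for bar in bars]
    return days == sorted(days)


def check_ohlc_sane(bars):
    """Correctness: prices are positive and low <= open, close <= high."""
    return all(
        0 < bar["low"] <= min(bar["open"], bar["close"])
        and max(bar["open"], bar["close"]) <= bar["high"]
        for bar in bars
    )


def failed_checks(bars, checks):
    """Return the names of every check that does not pass."""
    return [check.__name__ for check in checks if not check(bars)]


CHECKS = [check_no_duplicate_dates, check_dates_sorted, check_ohlc_sane]

# Made-up bars for illustration; the second one has high < close.
sample = [
    {"date": date(2024, 1, 2), "open": 10.0, "high": 11.0, "low": 9.5, "close": 10.5},
    {"date": date(2024, 1, 3), "open": 10.5, "high": 10.6, "low": 10.1, "close": 10.9},
]
print(failed_checks(sample, CHECKS))  # ['check_ohlc_sane']
```

Collecting failures by name rather than stopping at the first one makes it easier to see at a glance which of the checks a pipeline still cannot tick.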
Take the whole module offline. 28 pages. Free. No login.
The full Data Engineering module — every vendor note, every pipeline stage, every corporate-action recipe, every storage benchmark, every quality check — in one printable PDF.
This is one of six modules in the Nexural Automation curriculum. The library page maps every module, shows the dependency graph, and links the master 14-page Curriculum Index PDF.