// educational content · not financial advice

Everything on this page is published for educational and informational purposes only. Nothing here is investment, financial, legal, tax, or trading advice, a recommendation to buy or sell any security or contract, or a solicitation of any kind. Trading futures, options, equities, and crypto involves substantial risk of loss and is not suitable for every investor. Past performance — including any backtests, demos, or examples shown — does not guarantee future results. Consult a licensed professional before acting on anything you read here.

// module · data engineering

Garbage in, lies out. The data layer decides whether your edge is real.

Most backtests that fail in production fail at the data layer, not the strategy. This module covers the vendors worth using, the point-in-time pipeline that prevents look-ahead bias, the corporate-action math that keeps history continuous, the storage formats that survive at scale, and the twelve checks every research pipeline must run before you trust a single backtest result.

// 01 · vendors

Eight vendors. Pick by asset class, price, and PIT quality.

The point-in-time (PIT) column is the one that matters most for honest research: vendors marked "point-in-time" give you survivorship-free history; vendors marked "no PIT" do not, and "partial PIT" vendors sit in between.

Vendor	Asset classes	Coverage	Price	Tier	PIT	Best for
Yahoo Finance	Equities, FX, Crypto	EOD bars, adjusted close, undocumented coverage gaps	Free (unofficial)	free	no PIT	Tutorials and toy examples. Never production.
Norgate Data	Equities, Futures	US/AU/CA equities EOD with full delisting history, since the 1980s	$30 – $80 / mo	low	point-in-time	Daily-bar backtests that need survivorship-free history.
IEX Cloud	Equities	US equities consolidated tape, since 2017	$9 – $499 / mo	low	partial PIT	Hobbyist research and low-volume production.
Alpaca Market Data	Equities, Crypto	US equities (SIP) + crypto, real-time and historical	$0 – $99 / mo	low	no PIT	Live strategy execution where data and broker are unified.
Interactive Brokers	Equities, Futures, Options, FX	Global equities, futures, FX (via TWS / API)	Subscription bundles	low	no PIT	Multi-asset live trading; not for serious historical backtesting.
Polygon.io	Equities, Options, Crypto, FX	US equities + options tick, real-time, since 2003	$29 – $1,999 / mo	mid	partial PIT	Retail to mid-frequency US strategies needing real-time.
Databento	Equities, Futures, Options	MBO/MBP tick, all US exchanges, since 2018	$0.10 – $5 / GB usage	high	point-in-time	Microstructure research and HFT backtesting.
CRSP	Equities	US equities daily + monthly, since 1925, academic gold standard	Institutional license	high	point-in-time	Academic research and regulatory-grade backtests.
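To make the PIT column concrete, here is a minimal sketch of what "survivorship-free" means in practice: the tradable universe is reconstructed as of a given date, so tickers that were later delisted still appear. The `Listing` record, the `universe_as_of` helper, and the tickers AAA/BBB/CCC are all hypothetical illustrations, not any vendor's schema.

```python
from datetime import date
from typing import NamedTuple, Optional


class Listing(NamedTuple):
    ticker: str
    listed: date
    delisted: Optional[date]  # None means still trading


def universe_as_of(listings, as_of):
    """Tickers that were tradable on `as_of`, including ones delisted later."""
    return sorted(
        item.ticker
        for item in listings
        if item.listed <= as_of
        and (item.delisted is None or as_of < item.delisted)
    )


# Hypothetical listing history (made-up tickers and dates).
LISTINGS = [
    Listing("AAA", date(2001, 1, 2), None),
    Listing("BBB", date(2005, 6, 1), date(2012, 3, 15)),  # delisted later
    Listing("CCC", date(2015, 9, 1), None),
]

print(universe_as_of(LISTINGS, date(2010, 1, 4)))  # ['AAA', 'BBB']
print(universe_as_of(LISTINGS, date(2020, 1, 2)))  # ['AAA', 'CCC']
```

A dataset without delisting history can only answer the second query; a backtest that starts in 2010 with today's ticker list would silently skip BBB.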

// 02 · point-in-time pipeline

Five stages. Raw vendor bytes → research-ready features.

Each layer is append-only: a downstream layer reads from the layers above it and writes only to its own layer, never modifying anything upstream. That immutability is what makes a backtest reproducible six months later.

Raw ingest

Download raw vendor files (CSV, Parquet, FIX dumps) into an immutable bucket: S3, Backblaze, or even versioned local disk. Always store the vendor's original bytes alongside their SHA256 hash and the timestamp at which you pulled them. Never modify raw files; every downstream stage reads from this layer and writes to a new one.

Output

raw/{vendor}/{date}.csv.gz + SHA256 manifest

Common pitfall

Editing the raw file in place. If you discover a bias later, you have no audit trail. The raw layer must be append-only.
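The hash-and-timestamp manifest described above could be sketched as follows, using only the standard library. The JSON-lines manifest layout and the function names `record_ingest` and `verify` are assumptions made for illustration; the source only specifies that a SHA256 manifest accompanies the raw files.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large vendor dumps need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_ingest(raw_file: Path, manifest: Path) -> dict:
    """Append one manifest line for raw_file; earlier lines are never rewritten."""
    entry = {
        "file": raw_file.name,
        "sha256": sha256_of_file(raw_file),
        "bytes": raw_file.stat().st_size,
        "pulled_at": datetime.now(timezone.utc).isoformat(),
    }
    with manifest.open("a", encoding="utf-8") as fh:  # append mode only
        fh.write(json.dumps(entry) + "\n")
    return entry


def verify(raw_file: Path, manifest: Path) -> bool:
    """True if the file's current hash matches its most recent manifest entry."""
    latest = None
    with manifest.open(encoding="utf-8") as fh:
        for line in fh:
            entry = json.loads(line)
            if entry["file"] == raw_file.name:
                latest = entry
    return latest is not None and latest["sha256"] == sha256_of_file(raw_file)
```

Running `verify` before every downstream stage is one way to catch the in-place edit described in the pitfall: a modified raw file no longer matches its recorded hash.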

// 03 · corporate actions

Splits, dividends, spinoffs. The math that keeps history continuous.

Six common actions. Each card has the definition, the back-adjustment recipe, and a real-world example with numbers. Always back-adjust history; never forward-adjust the present.

Forward split

Definition

In an N-for-1 forward split, the issuer replaces each existing share with N shares. Each share's price divides by N. Total market cap is unchanged, and each holder ends up with N times as many shares.

Back-adjustment

Divide all historical prices BEFORE the ex-date by N. Multiply all historical volumes BEFORE the ex-date by N. Result: a continuous, comparable price series across the split.

Example

AAPL closed at $499.23 on 2020-08-28 (pre-split) and opened at $127.58 on 2020-08-31 (post-split, 4:1). After back-adjustment, pre-split prices are divided by 4: 499.23 → 124.81. Now a moving average crossing the split date works.
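The back-adjustment recipe can be sketched in a few lines of plain Python. The bar layout (dicts with `date`, `price`, `volume`) and the function name `back_adjust_split` are assumptions for illustration. The two prices come from the example above; the volumes are made-up placeholders.

```python
from datetime import date


def back_adjust_split(bars, ex_date, ratio):
    """Return new bars with pre-ex-date prices divided by `ratio`
    and pre-ex-date volumes multiplied by `ratio`.

    Bars on or after the ex-date are copied unchanged.
    The input list is not modified.
    """
    adjusted = []
    for bar in bars:
        if bar["date"] < ex_date:
            adjusted.append({
                "date": bar["date"],
                "price": bar["price"] / ratio,
                "volume": bar["volume"] * ratio,
            })
        else:
            adjusted.append(dict(bar))
    return adjusted


bars = [
    # Pre-split close from the example; volume is a placeholder.
    {"date": date(2020, 8, 28), "price": 499.23, "volume": 1_000},
    # Post-split open from the example; volume is a placeholder.
    {"date": date(2020, 8, 31), "price": 127.58, "volume": 4_000},
]

adjusted = back_adjust_split(bars, ex_date=date(2020, 8, 31), ratio=4)
print(round(adjusted[0]["price"], 2))  # 124.81
print(adjusted[0]["volume"])           # 4000
```

Returning new bars rather than editing in place keeps the unadjusted series available, in line with the append-only rule for upstream layers.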

Frequency

Common (~50/yr in S&P 500)

// 04 · storage formats

CSV is the wrong answer. Parquet is the right one. Mostly.

Five formats benchmarked on 1 year and 10 years of US-equity minute bars.

[Benchmark chart: US equity 1-minute bars. Metrics: disk size (GB, lower is better) and cold read time (seconds, lower is better).]

Parquet (snappy)

.parquet

Pros

Columnar, typed, compressed. Predicate pushdown. Read by pandas, Spark, DuckDB, Polars.

Cons

Not human-readable. Slight write overhead vs CSV. Schema evolution requires care.

Benchmark numbers are representative of a single M2 Mac mini reading from local NVMe. Cloud object stores (S3, R2) add roughly 50 to 200 ms of cold-start latency per read regardless of format. Because a columnar reader fetches only the columns and row groups it needs, columnar formats win by even more there.

// 05 · quality checks

Twelve checks. If you cannot tick all twelve, you do not have a backtest.

Four categories: completeness, correctness, alignment, and survivorship. Each check comes with a one-line code recipe to implement in your pipeline.
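The twelve checks themselves are not listed here, so the following is only a hypothetical sketch of what one check in each of three categories (completeness, correctness, alignment) might look like on daily bars. The function names and the bar layout are assumptions, not the module's own recipes.

```python
from datetime import date, timedelta


def missing_weekdays(dates):
    """Completeness: weekdays between the first and last date with no bar.

    Exchange holidays are not handled, so they will show up as gaps.
    """
    have = set(dates)
    day, last = min(have), max(have)
    gaps = []
    while day <= last:
        if day.weekday() < 5 and day not in have:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


def bad_ohlc_rows(bars):
    """Correctness: indexes of bars violating 0 < low <= open, close <= high."""
    bad = []
    for i, bar in enumerate(bars):
        body_low = min(bar["open"], bar["close"])
        body_high = max(bar["open"], bar["close"])
        if not (0 < bar["low"] <= body_low and body_high <= bar["high"]):
            bad.append(i)
    return bad


def is_strictly_increasing(dates):
    """Alignment: True when there are no duplicate or out-of-order dates."""
    return all(a < b for a, b in zip(dates, dates[1:]))


# Made-up sample: Wednesday 2024-01-03 is missing.
sample_dates = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)]
print(missing_weekdays(sample_dates))        # [datetime.date(2024, 1, 3)]
print(is_strictly_increasing(sample_dates))  # True
```

A survivorship check needs delisting data from the vendor and is not sketched here.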

// 06 · printable companion

Take the whole module offline. 28 pages. Free. No login.

The full Data Engineering module — every vendor note, every pipeline stage, every corporate-action recipe, every storage benchmark, every quality check — in one printable PDF.

// see the library · Module 04

This is one of six modules in the Nexural Automation curriculum. The library page maps every module, shows the dependency graph, and links the master 14-page Curriculum Index PDF.

Corporate action types: forward split, reverse split, cash dividend, stock dividend, spinoff, cash merger / acquisition.