// educational content · not financial advice

Everything on this page is published for educational and informational purposes only. Nothing here is investment, financial, legal, tax, or trading advice, a recommendation to buy or sell any security or contract, or a solicitation of any kind. Trading futures, options, equities, and crypto involves substantial risk of loss and is not suitable for every investor. Past performance — including any backtests, demos, or examples shown — does not guarantee future results. Consult a licensed professional before acting on anything you read here.

// module · data engineering

Garbage in, lies out. The data layer decides whether your edge is real.

Most backtests that fail in production fail at the data layer, not the strategy. This module covers the vendors worth using, the point-in-time pipeline that prevents look-ahead bias, the corporate-action math that keeps history continuous, the storage formats that survive at scale, and the twelve checks every research pipeline must run before you trust a single backtest result.

vendors compared: 8
pipeline stages: 5
corp action types: 6
quality checks: 12
// 01 · vendors

Eight vendors. Pick by asset class, price, and PIT quality.

Pick by asset class and price tier. The point-in-time (PIT) column is the one that matters most for honest research: vendors marked "point-in-time" give you survivorship-free history; vendors marked "no PIT" do not.

Yahoo Finance
Asset classes: Equities, FX, Crypto
Coverage: EOD bars, adjusted close, undocumented coverage gaps
Price: Free (unofficial)
Tier: free
PIT: no PIT
Best for: Tutorials and toy examples. Never production.

Norgate Data
Asset classes: Equities, Futures
Coverage: US/AU/CA equities EOD with full delisting history, since 1980s
Price: $30 – $80 / mo
Tier: low
PIT: point-in-time
Best for: Daily-bar backtests that need survivorship-free history.

IEX Cloud
Asset classes: Equities
Coverage: US equities consolidated tape, since 2017
Price: $9 – $499 / mo
Tier: low
PIT: partial PIT
Best for: Hobbyist research and low-volume production.

Alpaca Market Data
Asset classes: Equities, Crypto
Coverage: US equities (SIP) + crypto, real-time and historical
Price: $0 – $99 / mo
Tier: low
PIT: no PIT
Best for: Live strategy execution where data and broker are unified.

Interactive Brokers
Asset classes: Equities, Futures, Options, FX
Coverage: Global equities, futures, FX (via TWS / API)
Price: Subscription bundles
Tier: low
PIT: no PIT
Best for: Multi-asset live trading; not for serious historical backtesting.

Polygon.io
Asset classes: Equities, Options, Crypto, FX
Coverage: US equities + options tick, real-time, since 2003
Price: $29 – $1,999 / mo
Tier: mid
PIT: partial PIT
Best for: Retail to mid-frequency US strategies needing real-time.

Databento
Asset classes: Equities, Futures, Options
Coverage: MBO/MBP tick, all US exchanges, since 2018
Price: $0.10 – $5 / GB usage
Tier: high
PIT: point-in-time
Best for: Microstructure research and HFT backtesting.

CRSP
Asset classes: Equities
Coverage: US equities daily + monthly, since 1925, academic gold standard
Price: Institutional license
Tier: high
PIT: point-in-time
Best for: Academic research and regulatory-grade backtests.
// 02 · point-in-time pipeline

Five stages. Raw vendor bytes → research-ready features.

Each layer is append-only: a downstream layer reads from the layers above it and writes only to its own, never modifying upstream data. That immutability is what makes a backtest reproducible six months later.

Raw ingest

Download raw vendor files (CSV, Parquet, FIX dumps) into an immutable bucket — S3, Backblaze, or even versioned local disk. Always store the vendor's original bytes alongside the SHA256 hash and the timestamp you pulled. Never modify raw files — every downstream stage reads from this layer and writes to a new one.

Output
raw/{vendor}/{date}.csv.gz + SHA256 manifest
Common pitfall

Editing the raw file in place. If you discover a bias later, you have no audit trail. The raw layer must be append-only.
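
The raw-ingest rule above can be sketched in a few lines of Python. This is a minimal illustration, not the module's own implementation: the function names `ingest_raw` and `verify_raw`, the `manifest.jsonl` file name, and the local-disk root are all assumptions, and the payload is assumed to be the file exactly as the vendor delivered it (already gzipped).

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(payload: bytes, vendor: str, date: str, root: Path) -> dict:
    """Write the vendor's bytes once, then append a manifest entry (hash + pull time)."""
    target = root / "raw" / vendor / f"{date}.csv.gz"
    target.parent.mkdir(parents=True, exist_ok=True)

    # Mode "xb" raises FileExistsError instead of overwriting,
    # so an existing raw file can never be edited in place.
    with open(target, "xb") as fh:
        fh.write(payload)

    entry = {
        "path": target.relative_to(root).as_posix(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "pulled_at": datetime.now(timezone.utc).isoformat(),
    }
    # Hypothetical manifest layout: one JSON object per line, opened in append mode.
    with open(target.parent / "manifest.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def verify_raw(entry: dict, root: Path) -> bool:
    """Re-hash the stored file and compare it with the manifest entry."""
    data = (root / entry["path"]).read_bytes()
    return hashlib.sha256(data).hexdigest() == entry["sha256"]
```

Calling `ingest_raw` a second time for the same vendor and date fails loudly rather than replacing the file. A corrected vendor file would therefore need its own path (for example a suffix with the pull time), which is a design choice this sketch leaves open.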

// 03 · corporate actions

Splits, dividends, spinoffs. The math that keeps history continuous.

Six common actions. Each card has the definition, the back-adjustment recipe, and a real-world example with numbers. Always back-adjust history; never forward-adjust the present.

Definition

In an N-for-1 stock split, each existing share becomes N shares. Each share's price divides by N, so total market cap is unchanged. The holder ends up with N times as many shares.

Back-adjustment

Divide all historical prices BEFORE the ex-date by N. Multiply all historical volumes BEFORE the ex-date by N. Result: a continuous, comparable price series across the split.

Example

AAPL closed at $499.23 on 2020-08-28 (pre-split) and opened at $127.58 on 2020-08-31 (post-split, 4-for-1). After back-adjustment, pre-split prices are divided by 4: 499.23 → 124.81. A moving average that spans the split date now sees a continuous series instead of a false 75% drop.

Frequency

Common (~50/yr in S&P 500)
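
The split recipe above can be sketched as follows. The `Bar` type and `back_adjust_split` function are illustrative names, not part of any vendor API. The two prices come from the AAPL example; the volumes are made-up placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Bar:
    day: date
    price: float
    volume: float


def back_adjust_split(bars: list[Bar], ex_date: date, ratio: float) -> list[Bar]:
    """Return new bars: before ex_date, price is divided by ratio and volume multiplied by it."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    adjusted = []
    for bar in bars:
        if bar.day < ex_date:
            adjusted.append(Bar(bar.day, bar.price / ratio, bar.volume * ratio))
        else:
            # Bars on or after the ex-date are already in post-split terms.
            adjusted.append(bar)
    return adjusted


bars = [
    Bar(date(2020, 8, 28), 499.23, 1_000_000),  # pre-split close; volume is a placeholder
    Bar(date(2020, 8, 31), 127.58, 4_000_000),  # post-split open; volume is a placeholder
]
adjusted = back_adjust_split(bars, ex_date=date(2020, 8, 31), ratio=4)
print(round(adjusted[0].price, 2))  # 124.81
```

The function returns a new list instead of mutating its input, which mirrors the append-only rule from the pipeline section: the unadjusted series stays available for audit.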

// 04 · storage formats

CSV is the wrong answer. Parquet is the right one. Mostly.

Five formats benchmarked on 1 year and 10 years of US-equity minute bars, to show how each scales with dataset size. Each format is listed with its pros and cons.

Benchmark: US equity 1-minute bars. Metrics: disk size (GB, lower = better) and cold read (seconds, lower = better).

Parquet (snappy) · .parquet
Pros

Columnar, typed, compressed. Predicate pushdown. Read by pandas, Spark, DuckDB, Polars.

Cons

Not human-readable. Slight write overhead vs CSV. Schema evolution requires care.

Benchmark numbers are representative for a single M2 Mac mini reading from local NVMe. Cloud object stores (S3, R2) add ~50-200ms cold-start latency per read regardless of format. A columnar format fetches only the columns and row groups a query needs, so it wins by an even wider margin there.

// 05 · quality checks

Twelve checks. If you cannot tick all twelve, you do not have a backtest.

Four categories: completeness, correctness, alignment, and survivorship. Implement each check in your pipeline; every check comes with a one-line code recipe.
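
As a rough illustration of what one check per category might look like, here is a sketch in plain Python. These four functions are hypothetical examples chosen for this sketch, not the module's twelve checks, and the row layout (dicts with `open`, `high`, `low`, `close` keys) is an assumption.

```python
from datetime import date


def missing_sessions(bar_days: set[date], expected_days: set[date]) -> set[date]:
    """Completeness: expected trading days that have no bar."""
    return expected_days - bar_days


def bad_ohlc_rows(rows: list[dict]) -> list[dict]:
    """Correctness: rows with a non-positive price, or high/low not bounding open and close."""
    bad = []
    for r in rows:
        prices = (r["open"], r["high"], r["low"], r["close"])
        bounded = (
            r["low"] <= min(r["open"], r["close"])
            and r["high"] >= max(r["open"], r["close"])
        )
        if min(prices) <= 0 or not bounded:
            bad.append(r)
    return bad


def out_of_order(days: list[date]) -> list[date]:
    """Alignment: entries that duplicate or precede the entry before them."""
    return [b for a, b in zip(days, days[1:]) if b <= a]


def has_delisted_names(last_seen: dict[str, date], dataset_end: date) -> bool:
    """Survivorship: True if at least one symbol stops trading before the dataset ends."""
    return any(d < dataset_end for d in last_seen.values())
```

A universe in which `has_delisted_names` returns False over a multi-year window is a warning sign that the history only contains today's survivors.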

// 06 · printable companion

Take the whole module offline. 28 pages. Free. No login.

The full Data Engineering module — every vendor note, every pipeline stage, every corporate-action recipe, every storage benchmark, every quality check — in one printable PDF.

// see the library · Module 04

This is one of six modules in the Nexural Automation curriculum. The library page maps every module, shows the dependency graph, and links the master 14-page Curriculum Index PDF.
