# Databento data catalog

Tutorial for PoseiTrader, a high-performance algorithmic trading platform and event-driven backtester.

We are currently working on this tutorial.
## Overview
This tutorial will walk through how to set up a Posei Parquet data catalog with various Databento schemas.
## Prerequisites

- Python 3.11+ installed
- JupyterLab or similar installed (`pip install -U jupyterlab`)
- PoseiTrader latest release installed (`pip install -U posei_trader`)
- databento Python client library installed to make data requests (`pip install -U databento`)
- Databento account
## Requesting data

We'll use a Databento historical client for the rest of this tutorial. You can either initialize one by passing your Databento API key to the constructor, or implicitly use the `DATABENTO_API_KEY` environment variable (as shown).
```python
import databento as db

client = db.Historical()  # This will use the DATABENTO_API_KEY environment variable (recommended best practice)
```
It's important to note that every historical streaming request from `timeseries.get_range` will incur a cost (even for the same data), therefore we need to:
- Know and understand the cost prior to making a request
- Not make requests for the same data more than once (not efficient)
- Persist the responses to disk by writing zstd compressed DBN files (so that we don't have to request again)
We can use the `metadata.get_cost` endpoint from the Databento API to get a quote on the cost prior to each request. Each request sequence will first request the cost of the data, and then make a request only if the data doesn't already exist on disk.

Note the response returned is in USD, displayed as fractional cents.
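As a minimal sketch, this quote-then-download pattern can be wrapped in a small helper. The `request_if_missing` name and structure below are illustrative, not part of either library:

```python
from pathlib import Path


def request_if_missing(client, path: Path, **request_params) -> Path:
    """Quote the cost (free), then download only if the file isn't already on disk."""
    cost_usd = client.metadata.get_cost(**request_params)
    print(f"Quoted cost: {cost_usd} USD")
    if not path.exists():
        client.timeseries.get_range(path=path, **request_params)
    return path
```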
The following request is only for a small amount of data (the same data used in the Medium article *Building high-frequency trading signals in Python with Databento and sklearn*), just to demonstrate the basic workflow.
```python
from pathlib import Path

from databento import DBNStore
```
We'll prepare a directory for the raw Databento DBN format data, which we'll use for the rest of the tutorial.
```python
DATABENTO_DATA_DIR = Path("databento")
DATABENTO_DATA_DIR.mkdir(exist_ok=True)
```
```python
# Request cost quote (USD) - this endpoint is 'free'
client.metadata.get_cost(
    dataset="GLBX.MDP3",
    symbols=["ES.n.0"],
    stype_in="continuous",
    schema="mbp-10",
    start="2023-12-06T14:30:00",
    end="2023-12-06T20:30:00",
)
```
Use the historical API to request the data used in the Medium article.
```python
path = DATABENTO_DATA_DIR / "es-front-glbx-mbp10.dbn.zst"

if not path.exists():
    # Request data
    client.timeseries.get_range(
        dataset="GLBX.MDP3",
        symbols=["ES.n.0"],
        stype_in="continuous",
        schema="mbp-10",
        start="2023-12-06T14:30:00",
        end="2023-12-06T20:30:00",
        path=path,  # <-- Passing a `path` parameter will ensure the data is written to disk
    )
```
Inspect the data by reading it from disk and converting it to a `pandas.DataFrame`:
```python
data = DBNStore.from_file(path)

df = data.to_df()
df
```
## Write to data catalog
```python
import shutil
from pathlib import Path

from posei_trader.adapters.databento.loaders import DatabentoDataLoader
from posei_trader.model import InstrumentId
from posei_trader.persistence.catalog import ParquetDataCatalog
```

```python
CATALOG_PATH = Path.cwd() / "catalog"

# Clear if it already exists
if CATALOG_PATH.exists():
    shutil.rmtree(CATALOG_PATH)
CATALOG_PATH.mkdir()

# Create a catalog instance
catalog = ParquetDataCatalog(CATALOG_PATH)
```
Now that we've prepared the data catalog, we need a `DatabentoDataLoader`, which we'll use to decode and load the data into Posei objects.
```python
loader = DatabentoDataLoader()
```
Next, we'll load Rust pyo3 objects to write to the catalog by setting `as_legacy_cython=False` (we could use legacy Cython objects, but pyo3 objects are slightly more efficient when the data is simply being written to the catalog).

We also pass an `instrument_id`, which is not required but makes data loading faster since symbology mapping can be skipped.
```python
path = DATABENTO_DATA_DIR / "es-front-glbx-mbp10.dbn.zst"
instrument_id = InstrumentId.from_str("ES.n.0")  # This should be the raw symbol

depth10 = loader.from_dbn_file(
    path=path,
    instrument_id=instrument_id,
    as_legacy_cython=False,
)
```
```python
# Write data to catalog (this currently takes ~20 seconds, or ~250,000 rows per second, for MBP-10)
catalog.write_data(depth10)
```
```python
# Test reading from catalog
depths = catalog.order_book_depth10()

len(depths)
```
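You can also narrow the query to a window of interest. The `instrument_ids`, `start`, and `end` parameters below are assumptions based on the catalog's other query methods (such as `trade_ticks`, used later in this tutorial), so check the API reference for the exact signature:

```python
# Query a narrower window from the catalog (parameter names assumed,
# mirroring the other catalog query methods)
subset = catalog.order_book_depth10(
    instrument_ids=[instrument_id],
    start="2023-12-06T15:00:00",
    end="2023-12-06T16:00:00",
)
len(subset)
```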
## Preparing a month of AAPL trades
Now we'll expand on this workflow by preparing a month of AAPL trades on the Nasdaq exchange using the Databento `trades` schema, which will translate to Posei `TradeTick` objects.
```python
# Request cost quote (USD) - this endpoint is 'free'
client.metadata.get_cost(
    dataset="XNAS.ITCH",
    symbols=["AAPL"],
    schema="trades",
    start="2024-01",
)
```
When requesting historical data with the Databento `Historical` data client, ensure you pass a `path` parameter to write the data to disk.
```python
path = DATABENTO_DATA_DIR / "aapl-xnas-202401.trades.dbn.zst"

if not path.exists():
    # Request data
    client.timeseries.get_range(
        dataset="XNAS.ITCH",
        symbols=["AAPL"],
        schema="trades",
        start="2024-01",
        path=path,  # <-- Passing a `path` parameter
    )
```
Inspect the data by reading it from disk and converting it to a `pandas.DataFrame`:
```python
data = DBNStore.from_file(path)

df = data.to_df()
df
```
We'll use an `InstrumentId` of `"AAPL.XNAS"`, where XNAS is the ISO 10383 MIC (Market Identifier Code) for the Nasdaq venue.
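As a quick illustration of this symbol-plus-venue convention (assuming `InstrumentId` exposes `symbol` and `venue` accessors; verify against the API reference):

```python
from posei_trader.model import InstrumentId

# An InstrumentId combines a raw symbol with an ISO 10383 MIC for the venue
instrument_id = InstrumentId.from_str("AAPL.XNAS")
print(instrument_id.symbol)  # AAPL
print(instrument_id.venue)   # XNAS
```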
While passing an `instrument_id` to the loader isn't strictly necessary, it speeds up data loading by eliminating the need for symbology mapping. Additionally, setting the `as_legacy_cython` option to `False` further optimizes the process since we'll be writing the loaded data to the catalog. Although we could use legacy Cython objects, pyo3 objects are more efficient for this purpose.
```python
instrument_id = InstrumentId.from_str("AAPL.XNAS")

trades = loader.from_dbn_file(
    path=path,
    instrument_id=instrument_id,
    as_legacy_cython=False,
)
```
Here we'll organize our data as a file per month. This is an arbitrary choice; a file per day could be just as valid.

It may also be a good idea to create a function which can return the correct `basename_template` value for a given chunk of data, as sketched below.
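A minimal sketch of such a helper, assuming each loaded object exposes a nanosecond `ts_event` timestamp (as the pyo3 data types above do); the `monthly_basename` name is illustrative:

```python
import pandas as pd


def monthly_basename(chunk) -> str:
    """Return a YYYY-MM basename derived from the chunk's first event timestamp."""
    first_ts = pd.Timestamp(chunk[0].ts_event, unit="ns", tz="UTC")
    return first_ts.strftime("%Y-%m")
```

With a helper like this, the write below could become `catalog.write_data(trades, basename_template=monthly_basename(trades))`.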
```python
# Write data to catalog
catalog.write_data(trades, basename_template="2024-01")
```

```python
trades = catalog.trade_ticks([instrument_id])

len(trades)
```