# Databento data catalog

Tutorial for PoseiTrader, a high-performance algorithmic trading platform and event-driven backtester.

We are currently working on this tutorial.
## Overview
This tutorial will walk through how to set up a Posei Parquet data catalog with various Databento schemas.
## Prerequisites

- Python 3.11+ installed
- JupyterLab or similar installed (`pip install -U jupyterlab`)
- PoseiTrader latest release installed (`pip install -U posei_trader`)
- databento Python client library installed to make data requests (`pip install -U databento`)
- Databento account
## Requesting data

We'll use a Databento historical client for the rest of this tutorial. You can either initialize one by passing your Databento API key to the constructor, or implicitly use the `DATABENTO_API_KEY` environment variable (as shown).
```python
import databento as db

client = db.Historical()  # This will use the DATABENTO_API_KEY environment variable (recommended best practice)
```
It's important to note that every historical streaming request from `timeseries.get_range` will incur a cost (even for the same data), therefore we need to:
- Know and understand the cost prior to making a request
- Not make requests for the same data more than once (not efficient)
- Persist the responses to disk by writing zstd compressed DBN files (so that we don't have to request again)
We can use the `metadata.get_cost` endpoint from the Databento API to get a quote on the cost prior to each request. Each request sequence will first request the cost of the data, and then make a request only if the data doesn't already exist on disk.

Note the response returned is in USD, displayed as fractional cents.
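As a minimal sketch, this quote-then-download pattern can be wrapped in a small helper. The `request_if_missing` name and structure below are illustrative, not part of either library:

```python
from pathlib import Path


def request_if_missing(client, path: Path, **request_params) -> Path:
    """Quote the cost (free), then download only if the file isn't already on disk."""
    cost_usd = client.metadata.get_cost(**request_params)
    print(f"Quoted cost: {cost_usd} USD")
    if not path.exists():
        client.timeseries.get_range(path=path, **request_params)
    return path
```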
The following request is only for a small amount of data (the same data used in the Medium article *Building high-frequency trading signals in Python with Databento and sklearn*), just to demonstrate the basic workflow.
```python
from pathlib import Path

from databento import DBNStore
```
We'll prepare a directory for the raw Databento DBN format data, which we'll use for the rest of the tutorial.
```python
DATABENTO_DATA_DIR = Path("databento")
DATABENTO_DATA_DIR.mkdir(exist_ok=True)
```
```python
# Request cost quote (USD) - this endpoint is 'free'
client.metadata.get_cost(
    dataset="GLBX.MDP3",
    symbols=["ES.n.0"],
    stype_in="continuous",
    schema="mbp-10",
    start="2023-12-06T14:30:00",
    end="2023-12-06T20:30:00",
)
```
Use the historical API to request the data used in the Medium article.
```python
path = DATABENTO_DATA_DIR / "es-front-glbx-mbp10.dbn.zst"

if not path.exists():
    # Request data
    client.timeseries.get_range(
        dataset="GLBX.MDP3",
        symbols=["ES.n.0"],
        stype_in="continuous",
        schema="mbp-10",
        start="2023-12-06T14:30:00",
        end="2023-12-06T20:30:00",
        path=path,  # <-- Passing a `path` parameter will ensure the data is written to disk
    )
```
Inspect the data by reading it from disk and converting it to a `pandas.DataFrame`:
```python
data = DBNStore.from_file(path)

df = data.to_df()
df
```
## Write to data catalog
```python
import shutil
from pathlib import Path

from posei_trader.adapters.databento.loaders import DatabentoDataLoader
from posei_trader.model import InstrumentId
from posei_trader.persistence.catalog import ParquetDataCatalog
```

```python
CATALOG_PATH = Path.cwd() / "catalog"

# Clear if it already exists
if CATALOG_PATH.exists():
    shutil.rmtree(CATALOG_PATH)
CATALOG_PATH.mkdir()

# Create a catalog instance
catalog = ParquetDataCatalog(CATALOG_PATH)
```
Now that we've prepared the data catalog, we need a `DatabentoDataLoader`, which we'll use to decode and load the data into Posei objects.
```python
loader = DatabentoDataLoader()
```
Next, we'll load Rust pyo3 objects to write to the catalog by setting `as_legacy_cython=False` (we could use legacy Cython objects, but pyo3 objects are slightly more efficient when the data is simply being written to the catalog).

We also pass an `instrument_id`, which is not required but makes data loading faster since symbology mapping can be skipped.
```python
path = DATABENTO_DATA_DIR / "es-front-glbx-mbp10.dbn.zst"
instrument_id = InstrumentId.from_str("ES.n.0")  # This should be the raw symbol

depth10 = loader.from_dbn_file(
    path=path,
    instrument_id=instrument_id,
    as_legacy_cython=False,
)
```
```python
# Write data to catalog (this currently takes ~20 seconds, or ~250,000 rows per second, for MBP-10)
catalog.write_data(depth10)
```
```python
# Test reading from catalog
depths = catalog.order_book_depth10()

len(depths)
```
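You can also narrow the query to a window of interest. The `instrument_ids`, `start`, and `end` parameters below are assumptions based on the catalog's other query methods (such as `trade_ticks`, used later in this tutorial), so check the API reference for the exact signature:

```python
# Query a narrower window from the catalog (parameter names assumed,
# mirroring the other catalog query methods)
subset = catalog.order_book_depth10(
    instrument_ids=[instrument_id],
    start="2023-12-06T15:00:00",
    end="2023-12-06T16:00:00",
)
len(subset)
```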
## Preparing a month of AAPL trades
Now we'll expand on this workflow by preparing a month of AAPL trades on the Nasdaq exchange using the Databento `trades` schema, which will translate to Posei `TradeTick` objects.
```python
# Request cost quote (USD) - this endpoint is 'free'
client.metadata.get_cost(
    dataset="XNAS.ITCH",
    symbols=["AAPL"],
    schema="trades",
    start="2024-01",
)
```
When requesting historical data with the Databento `Historical` data client, ensure you pass a `path` parameter to write the data to disk.
```python
path = DATABENTO_DATA_DIR / "aapl-xnas-202401.trades.dbn.zst"

if not path.exists():
    # Request data
    client.timeseries.get_range(
        dataset="XNAS.ITCH",
        symbols=["AAPL"],
        schema="trades",
        start="2024-01",
        path=path,  # <-- Passing a `path` parameter
    )
```
Inspect the data by reading it from disk and converting it to a `pandas.DataFrame`:
```python
data = DBNStore.from_file(path)

df = data.to_df()
df
```
We'll use an `InstrumentId` of `"AAPL.XNAS"`, where XNAS is the ISO 10383 MIC (Market Identifier Code) for the Nasdaq venue.
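As a quick illustration of this symbol-plus-venue convention (assuming `InstrumentId` exposes `symbol` and `venue` accessors; verify against the API reference):

```python
from posei_trader.model import InstrumentId

# An InstrumentId combines a raw symbol with an ISO 10383 MIC for the venue
instrument_id = InstrumentId.from_str("AAPL.XNAS")
print(instrument_id.symbol)  # AAPL
print(instrument_id.venue)   # XNAS
```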
While passing an `instrument_id` to the loader isn't strictly necessary, it speeds up data loading by eliminating the need for symbology mapping. Additionally, setting the `as_legacy_cython` option to `False` further optimizes the process since we'll be writing the loaded data to the catalog. Although we could use legacy Cython objects, pyo3 objects are more efficient for this purpose.
```python
instrument_id = InstrumentId.from_str("AAPL.XNAS")

trades = loader.from_dbn_file(
    path=path,
    instrument_id=instrument_id,
    as_legacy_cython=False,
)
```
Here we'll organize our data as a file per month. This is an arbitrary choice; a file per day could be just as valid.

It may also be a good idea to create a function which can return the correct `basename_template` value for a given chunk of data, as sketched below.
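A minimal sketch of such a helper, assuming each loaded object exposes a nanosecond `ts_event` timestamp (as the pyo3 data types above do); the `monthly_basename` name is illustrative:

```python
import pandas as pd


def monthly_basename(chunk) -> str:
    """Return a YYYY-MM basename derived from the chunk's first event timestamp."""
    first_ts = pd.Timestamp(chunk[0].ts_event, unit="ns", tz="UTC")
    return first_ts.strftime("%Y-%m")
```

With a helper like this, the write below could become `catalog.write_data(trades, basename_template=monthly_basename(trades))`.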
```python
# Write data to catalog
catalog.write_data(trades, basename_template="2024-01")
```

```python
trades = catalog.trade_ticks([instrument_id])

len(trades)
```