Extending Readers

Add support for a new GNSS data format by implementing the GNSSDataReader abstract base class. canvod-readers uses the ABC pattern to enforce a consistent contract — any reader that passes the checklist below can be used anywhere GNSSDataReader is accepted.


Implementation Checklist

  •   1. Inherit correctly


    class MyReader(GNSSDataReader): inherit from GNSSDataReader only. It is already a Pydantic BaseModel with an fpath field and file validation built in, so no second parent is needed.

  •   2. Implement abstract methods


    Properties: file_hash, start_time, end_time, systems, num_satellites. Methods: to_ds(), iter_epochs(). num_epochs has a default that counts epochs via iter_epochs(); override it for O(1) access if your format stores the count.

  •   3. Use DatasetBuilder (recommended)


    Use DatasetBuilder in your to_ds() to construct the output Dataset. It handles coordinate arrays, frequency resolution, dtype enforcement, and calls validate_dataset() automatically.

  •   4. Write tests


    Test structure, file hash, error paths, and the validation round-trip. Aim for >90 % coverage.
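
The checklist relies on the ABC pattern, which the following minimal sketch illustrates using only the standard library. ReaderContract, IncompleteReader, and ToyReader are hypothetical stand-ins, not canvod-readers classes; the real GNSSDataReader is also a Pydantic model and has more abstract members than shown here.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator


class ReaderContract(ABC):
    """Hypothetical stand-in for GNSSDataReader (illustration only)."""

    @property
    @abstractmethod
    def systems(self) -> list[str]:
        """Constellation codes present in the file."""

    @abstractmethod
    def iter_epochs(self) -> Iterator[dict]:
        """Yield one decoded epoch at a time."""

    @property
    def num_epochs(self) -> int:
        # Default: O(n) count obtained by exhausting the iterator.
        return sum(1 for _ in self.iter_epochs())


class IncompleteReader(ReaderContract):
    # Forgets iter_epochs(), so the class stays abstract.
    @property
    def systems(self) -> list[str]:
        return ["G"]


class ToyReader(ReaderContract):
    @property
    def systems(self) -> list[str]:
        return ["G", "E"]

    def iter_epochs(self) -> Iterator[dict]:
        for i in range(3):
            yield {"index": i}


# Instantiating a class with missing abstract members raises TypeError.
try:
    IncompleteReader()
    instantiation_failed = False
except TypeError:
    instantiation_failed = True

reader = ToyReader()
```

In this sketch the incomplete reader fails at instantiation time rather than later inside a pipeline, and ToyReader gets num_epochs for free from the base class.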


Contract Constants

The output Dataset contract is defined by importable constants in canvod.readers.base — these are the single source of truth:

from canvod.readers.base import (
    REQUIRED_DIMS,       # ("epoch", "sid")
    REQUIRED_COORDS,     # {name: dtype, ...}
    REQUIRED_ATTRS,      # {"Created", "Software", "Institution", "File Hash"}
    DEFAULT_REQUIRED_VARS,  # ["SNR"]
)

Use validate_dataset(ds) to check any Dataset against them. It collects all violations and raises a single ValueError listing every problem.
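
The "collect everything, raise once" behaviour can be sketched as follows. This is not the canvod-readers implementation: check_contract is a hypothetical helper that works on plain Python containers, and the required names are copied from the comments in the import example above.

```python
def check_contract(
    dims,
    attrs,
    data_vars,
    required_dims=("epoch", "sid"),
    required_attrs=frozenset({"Created", "Software", "Institution", "File Hash"}),
    required_vars=("SNR",),
):
    """Hypothetical checker: gather every violation, then raise a single error."""
    problems = []
    for dim in required_dims:
        if dim not in dims:
            problems.append(f"missing dimension: {dim}")
    for name in sorted(required_attrs - set(attrs)):
        problems.append(f"missing attribute: {name}")
    for var in required_vars:
        if var not in data_vars:
            problems.append(f"missing data variable: {var}")
    if problems:
        raise ValueError(
            "Dataset contract violations:\n  - " + "\n  - ".join(problems)
        )


# A dataset-like description with a wrong dimension name and three missing attrs.
try:
    check_contract(
        dims=("time", "sid"),
        attrs={"Created": "2024-01-01"},
        data_vars=["SNR"],
    )
    message = ""
except ValueError as exc:
    message = str(exc)
```

Reporting all problems in one exception means a reader author can fix every contract violation in a single pass instead of rerunning after each failure.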


Step-by-Step Implementation

Step 1 — Reader Class

GNSSDataReader is a Pydantic BaseModel with fpath: Path and file validation built in. You only need to set reader-specific config:

from pydantic import ConfigDict
from canvod.readers.base import GNSSDataReader

class MyFormatReader(GNSSDataReader):
    """Reader for My Custom Format."""

    model_config = ConfigDict(frozen=True)   # no arbitrary_types needed
    # no fpath field needed — inherited from GNSSDataReader

Step 2 — File Hash

from canvod.readers.gnss_specs.utils import file_hash

class MyFormatReader(GNSSDataReader):
    ...

    @property
    def file_hash(self) -> str:
        """16-character SHA-256 prefix of the file — used for deduplication."""
        return file_hash(self.fpath)
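
For orientation, a 16-character SHA-256 prefix could be computed as in the sketch below. sha256_prefix is a hypothetical helper written for this example; the actual canvod.readers.gnss_specs.utils.file_hash may differ in chunking or other details.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_prefix(path: Path, length: int = 16, chunk_size: int = 1 << 20) -> str:
    """Hypothetical helper: hex SHA-256 of a file, truncated to `length` chars."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large observation files never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.dat"
    sample.write_bytes(b"content")
    prefix = sha256_prefix(sample)
```

Because the hash depends only on the file bytes, the same file always yields the same prefix, which is what makes it usable as a deduplication key.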

Step 3 — Metadata Properties

from datetime import datetime

class MyFormatReader(GNSSDataReader):
    ...

    @property
    def start_time(self) -> datetime:
        return self._parse_start_time()

    @property
    def end_time(self) -> datetime:
        return self._parse_end_time()

    @property
    def systems(self) -> list[str]:
        return self._parse_systems()   # e.g. ["G", "E"]

    # num_epochs has a default (iterates via iter_epochs);
    # override for O(1) if your format stores the count in the header.

    @property
    def num_satellites(self) -> int:
        return self._count_satellites()

Step 4 — Epoch Iterator

from collections.abc import Iterator

class MyFormatReader(GNSSDataReader):
    ...

    def iter_epochs(self) -> Iterator:
        """Lazily yield one epoch at a time — keep memory bounded."""
        with self.fpath.open("rb") as f:
            self._skip_header(f)
            for raw in self._raw_epoch_generator(f):
                yield self._decode_epoch(raw)
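
To make the lazy-iteration idea concrete, here is a self-contained sketch over an invented binary layout. The 4-byte header, the record structure, and iter_records are all assumptions made for this example; they do not describe any real GNSS format or canvod-readers helper.

```python
import struct
import tempfile
from collections.abc import Iterator
from pathlib import Path

HEADER_SIZE = 4                # hypothetical fixed-size header
RECORD = struct.Struct("<If")  # hypothetical record: uint32 index, float32 SNR


def iter_records(path: Path) -> Iterator[tuple[int, float]]:
    """Yield one record at a time; only one record is held in memory."""
    with path.open("rb") as f:
        f.seek(HEADER_SIZE)  # skip the header
        while True:
            raw = f.read(RECORD.size)
            if len(raw) < RECORD.size:
                return  # end of file (a truncated trailing record is dropped)
            yield RECORD.unpack(raw)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.dat"
    payload = b"".join(
        RECORD.pack(i, snr) for i, snr in enumerate([40.0, 41.5, 43.0])
    )
    sample.write_bytes(b"MYFT" + payload)
    records = list(iter_records(sample))
```

The generator reads a fixed-size slice per step, so memory use stays constant regardless of file length; materialising with list() is done here only to inspect the result.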

Step 5 — Dataset Conversion with DatasetBuilder

DatasetBuilder handles coordinate assembly, frequency resolution, dtype enforcement, and validation — so your to_ds() stays simple:

import xarray as xr
from canvod.readers.builder import DatasetBuilder

class MyFormatReader(GNSSDataReader):
    ...

    def to_ds(
        self,
        keep_data_vars: list[str] | None = None,
        **kwargs,
    ) -> xr.Dataset:
        builder = DatasetBuilder(self)
        for epoch in self.iter_epochs():
            ei = builder.add_epoch(epoch.timestamp)
            for obs in epoch.observations:
                sig = builder.add_signal(
                    sv=obs.sv, band=obs.band, code=obs.code
                )
                builder.set_value(ei, sig, "SNR", obs.snr_value)
        return builder.build(
            keep_data_vars=keep_data_vars,
            extra_attrs={"Source Format": "My Custom Format"},
        )

Manual Dataset construction (advanced)

If you need more control than DatasetBuilder provides, you can construct the Dataset manually using SignalIDMapper and validate_dataset():

import numpy as np
import xarray as xr
from canvod.readers.gnss_specs.signals import SignalIDMapper
from canvod.readers.gnss_specs.metadata import SNR_METADATA, COORDS_METADATA
from canvod.readers.base import validate_dataset

class MyFormatReader(GNSSDataReader):
    ...

    def to_ds(
        self,
        keep_data_vars: list[str] | None = None,
        **kwargs,
    ) -> xr.Dataset:
        all_epochs = list(self.iter_epochs())
        mapper = SignalIDMapper()

        # Build the SID index, coordinate arrays, and data arrays here,
        # including `snr`, a 2-D array shaped (epoch, sid) used below.
        # (see existing readers for full example)

        ds = xr.Dataset(
            data_vars={"SNR": (("epoch", "sid"), snr, SNR_METADATA)},
            coords={...},
            attrs={**self._build_attrs(), "Source Format": "My Custom Format"},
        )

        # MANDATORY — validate before returning
        validate_dataset(ds, required_vars=keep_data_vars)
        return ds

Step 6 — to_ds_and_auxiliary() (optional)

If your format embeds metadata beyond observations (like SBF embeds satellite geometry), override to_ds_and_auxiliary() to collect both datasets in a single file scan:

def to_ds_and_auxiliary(
    self,
    keep_data_vars: list[str] | None = None,
    **kwargs,
) -> tuple[xr.Dataset, dict[str, xr.Dataset]]:
    obs_ds = ...   # build obs dataset
    meta_ds = ...  # build metadata dataset
    return obs_ds, {"my_format_meta": meta_ds}

The default implementation calls to_ds() and returns an empty dict.


Validation Requirements

# Dimensions
assert "epoch" in ds.dims
assert "sid"   in ds.dims

# Coordinates
from canvod.readers.base import REQUIRED_COORDS

# REQUIRED_COORDS = {
#     "epoch":       "datetime64[ns]",
#     "sid":         "object",     # string
#     "sv":          "object",
#     "system":      "object",
#     "band":        "object",
#     "code":        "object",
#     "freq_center": "float32",    # must be float32
#     "freq_min":    "float32",
#     "freq_max":    "float32",
# }

# Attributes
from canvod.readers.base import REQUIRED_ATTRS

# REQUIRED_ATTRS = {
#     "Created",
#     "Software",
#     "Institution",
#     "File Hash",    # for storage deduplication
# }

# Data variables: SNR is required by default
assert "SNR" in ds.data_vars

# All variables must be (epoch, sid)
for var in ds.data_vars:
    assert ds[var].dims == ("epoch", "sid")

Testing

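# tests/test_my_format_reader.py
# `real_test_file` is assumed to be a pytest fixture you provide
# (for example in conftest.py) that returns the path to a sample file.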
import pytest
from pathlib import Path
from my_package.readers import MyFormatReader

class TestMyFormatReader:

    def test_file_hash_is_deterministic(self, tmp_path):
        f = tmp_path / "test.dat"
        f.write_bytes(b"content")
        reader = MyFormatReader(fpath=f)
        assert reader.file_hash == reader.file_hash
        assert len(reader.file_hash) == 16

    def test_dataset_dimensions(self, real_test_file):
        ds = MyFormatReader(fpath=real_test_file).to_ds()
        assert "epoch" in ds.dims
        assert "sid"   in ds.dims

    def test_dataset_variables(self, real_test_file):
        ds = MyFormatReader(fpath=real_test_file).to_ds()
        assert "SNR" in ds.data_vars

    def test_sid_dimensions(self, real_test_file):
        ds = MyFormatReader(fpath=real_test_file).to_ds()
        for var in ds.data_vars:
            assert ds[var].dims == ("epoch", "sid")

    def test_file_hash_in_attrs(self, real_test_file):
        ds = MyFormatReader(fpath=real_test_file).to_ds()
        assert "File Hash" in ds.attrs


@pytest.mark.integration
def test_full_pipeline(real_test_file):
    reader = MyFormatReader(fpath=real_test_file)
    ds = reader.to_ds(keep_data_vars=["SNR"])

    # Filter GPS only
    gps = ds.where(ds.system == "G", drop=True)
    assert len(gps.sid) > 0

    # Sanity-check values
    assert float(gps.SNR.mean()) > 0


from canvod.readers.base import validate_dataset

def test_validation_passes(real_test_file):
    ds = MyFormatReader(fpath=real_test_file).to_ds()
    # validate_dataset() is already called inside to_ds() —
    # this test verifies it didn't raise
    validate_dataset(ds)   # should not raise

Audit Integration

Adding a new reader is not complete until the audit suite can validate its output against existing readers. This ensures intra-validation — that your new reader produces scientifically consistent results when processing the same observation data.

  1. Add a Tier 1 comparison — compare your reader's output against an existing reader on shared test data (same receiver, same time window). Follow the pattern in canvod.audit.runners.sbf_vs_rinex:

    from canvod.audit.runners import audit_sbf_vs_rinex
    
    # Your equivalent: audit_myformat_vs_rinex(...)
    result = audit_sbf_vs_rinex(sbf_store="...", rinex_store="...")
    assert result.passed
    
  2. Define tolerances: SNR should be bit-identical if the underlying data is the same. Angular values (θ, φ) may differ if ephemeris sources differ. Use the appropriate ToleranceTier (EXACT, NUMERICAL, or SCIENTIFIC) and document expected differences.

  3. Add integration tests in packages/canvod-audit/tests/test_integration.py — verify dataset structure, shared observables, and value ranges against real data from the test submodule.

  4. Run the full audit after implementation:

    just test-audit
    

See the Audit Suite for the full tier system.
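
The idea behind tiered tolerances can be sketched as below. The enum here is a local stand-in that only reuses the tier names mentioned above, and the numeric thresholds are invented for illustration; consult the Audit Suite for the values canvod.audit actually applies.

```python
import math
from enum import Enum


class ToleranceTier(Enum):
    """Local stand-in reusing the tier names; not the canvod.audit class."""

    EXACT = "exact"
    NUMERICAL = "numerical"
    SCIENTIFIC = "scientific"


# Hypothetical absolute tolerances, chosen only for this example.
ABS_TOL = {
    ToleranceTier.NUMERICAL: 1e-9,
    ToleranceTier.SCIENTIFIC: 1e-2,
}


def values_agree(a: float, b: float, tier: ToleranceTier) -> bool:
    """Compare two scalars under the given tier."""
    if tier is ToleranceTier.EXACT:
        return a == b  # bit-identical, e.g. SNR from the same raw data
    return math.isclose(a, b, rel_tol=0.0, abs_tol=ABS_TOL[tier])


snr_exact = values_agree(42.0, 42.0, ToleranceTier.EXACT)
rounding_exact = values_agree(42.0, 42.0 + 1e-12, ToleranceTier.EXACT)
rounding_numerical = values_agree(42.0, 42.0 + 1e-12, ToleranceTier.NUMERICAL)
angle_scientific = values_agree(1.000, 1.005, ToleranceTier.SCIENTIFIC)
angle_numerical = values_agree(1.000, 1.005, ToleranceTier.NUMERICAL)
```

Choosing the strictest tier that the data can satisfy keeps regressions visible: a quantity that should be bit-identical is never hidden behind a loose threshold.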


Common Pitfalls

Wrong dtype for frequency coordinates

# WRONG — float64 fails the dtype check
freq_center = np.array([...], dtype=np.float64)

# CORRECT — DatasetBuilder handles this automatically
freq_center = np.array([...], dtype=np.float32)

Skipping validation

# WRONG — missing mandatory validation
def to_ds(self, **kwargs) -> xr.Dataset:
    ds = self._build_dataset()
    return ds   # ← will silently produce invalid datasets downstream

# CORRECT — DatasetBuilder.build() calls validate_dataset() for you
def to_ds(self, **kwargs) -> xr.Dataset:
    builder = DatasetBuilder(self)
    # ... add epochs, signals, values ...
    return builder.build()  # validates automatically

Wrong dimension names

# WRONG
data_vars={"SNR": (("time", "signal"), data)}

# CORRECT — DatasetBuilder uses the right names automatically
data_vars={"SNR": (("epoch", "sid"), data)}

Registering with ReaderFactory

from canvodpy import ReaderFactory
from my_package.readers import MyFormatReader

# Register by name
ReaderFactory.register("my_format", MyFormatReader)

# Create by name
reader = ReaderFactory.create("my_format", fpath="file.dat")

For RINEX files, ReaderFactory.create_from_file(path) auto-detects v2/v3 from the file header. Custom binary formats should use the name-based API above.
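
Name-based registration generally follows a registry pattern like the one sketched here. ReaderRegistry and ToyReader are hypothetical and written only for this example; they show the idea behind register() and create(), not how ReaderFactory is implemented.

```python
from dataclasses import dataclass


class ReaderRegistry:
    """Hypothetical name-to-class registry (illustration only)."""

    _readers: dict[str, type] = {}

    @classmethod
    def register(cls, name: str, reader_cls: type) -> None:
        if name in cls._readers:
            raise ValueError(f"reader {name!r} is already registered")
        cls._readers[name] = reader_cls

    @classmethod
    def create(cls, name: str, **kwargs):
        try:
            reader_cls = cls._readers[name]
        except KeyError:
            known = sorted(cls._readers)
            raise ValueError(f"unknown reader {name!r}; known: {known}") from None
        # Keyword arguments are forwarded unchanged to the reader constructor.
        return reader_cls(**kwargs)


@dataclass(frozen=True)
class ToyReader:
    fpath: str


ReaderRegistry.register("toy", ToyReader)
reader = ReaderRegistry.create("toy", fpath="file.dat")

try:
    ReaderRegistry.create("missing", fpath="file.dat")
    unknown_rejected = False
except ValueError:
    unknown_rejected = True
```

Looking readers up by name decouples calling code from concrete reader classes, which is why a custom binary format only needs one register() call to become available.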