Icechunk Storage¶
Icechunk is a cloud-native transactional storage format for multidimensional arrays — Git-like versioning meets Zarr v3.
- **Versioned Writes**: every `commit()` produces an immutable snapshot with a hash-addressable ID. Roll back to any prior state with a single line.
- **ACID Transactions**: multiple writes are atomic — either all succeed or none are persisted. No partial writes, no corrupt chunks, no reader/writer races.
- **Cloud-Native**: local filesystem for development; S3, MinIO, or Cloudflare R2 for production. Zero code changes to switch backends.
- **Zarr v3 Chunks**: Zstd-compressed chunks, O(1) epoch-range reads, compatible with `xarray.open_zarr()` out of the box.
Why Icechunk over plain Zarr?¶
| Feature | Icechunk | Zarr v3 | NetCDF4 | HDF5 |
|---|---|---|---|---|
| Version control | ✅ | ❌ | ❌ | ❌ |
| Cloud-native | ✅ | ✅ | ❌ | ❌ |
| Atomic transactions | ✅ | ❌ | ❌ | ❌ |
| Chunked arrays | ✅ | ✅ | ✅ | ✅ |
| Deduplication | ✅ | ❌ | ❌ | ❌ |
Storage Structure¶
```
stores/
  rosalia/
    rinex/
      .icechunk/   # Repository metadata + snapshots
      data/        # SHA-256 addressed chunk files
      refs/        # Branch heads
    vod/
      .icechunk/
      data/
      refs/
```
Chunk Strategy¶
The default chunk shape is tuned for daily GNSS time series:
```python
chunk_strategy = {"epoch": 34560, "sid": -1}
```
| Dimension | Value | Rationale |
|---|---|---|
| `epoch` | 34560 | ≈ 24 h at 2.5 s cadence — aligned to daily processing granularity |
| `sid` | −1 (unlimited) | All signal IDs in one chunk — VOD computes across all signals simultaneously |
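The epoch value follows directly from the cadence stated above — a quick sanity check (the names are illustrative, not part of canvod):

```python
# One day of epochs at the 2.5 s receiver cadence from the table above.
SECONDS_PER_DAY = 86400
CADENCE_S = 2.5

epochs_per_day = int(SECONDS_PER_DAY / CADENCE_S)
assert epochs_per_day == 34560  # matches the default "epoch" chunk size
```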
For a typical 72-SID dataset at 1 Hz:
```python
# float32, 24 h × 72 SIDs at 1 Hz
bytes_per_chunk = 86400 * 72 * 4  # ≈ 24 MB uncompressed

# Zstd level 5 typically achieves 4–8× on GNSS float data,
# i.e. roughly 3–6 MB per chunk on disk.
```
Override per read call — does not affect on-disk layout:
```python
ds = reader.read(
    time_range=("2024-01-01", "2024-01-31"),
    chunks={"epoch": 3600, "sid": -1},  # 1-hour lazy chunks in memory
)
```
Configuration¶
```yaml
# config/processing.yaml
icechunk:
  compression_algorithm: zstd
  compression_level: 5
  inline_threshold: 512
  get_concurrency: 1

  # Manifest preloading — loads coordinate manifests into memory at session open.
  # Worth enabling once stores grow beyond a few hundred commits.
  # manifest_preload_enabled: false
  # manifest_preload_max_refs: 100000000
  # manifest_preload_pattern: "epoch|sid"
```
| Key | Default | Description |
|---|---|---|
| `compression_algorithm` | `zstd` | Icechunk internal compression — `zstd`, `lz4`, or `gzip` |
| `compression_level` | `5` | Compressor level (1 = fast, 22 = max for zstd) |
| `inline_threshold` | `512` | Bytes below which chunks are stored inline in the manifest |
| `get_concurrency` | `1` | Concurrent partial-value reads (increase for S3/GCS) |
| `manifest_preload_enabled` | `false` | Pre-load coordinate manifests into memory at session open |
| `manifest_preload_max_refs` | `100000000` | Cap on chunk refs preloaded |
| `manifest_preload_pattern` | `"epoch\|sid"` | Regex for arrays to preload |
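A minimal sketch of how these defaults and bounds could be enforced before opening a store; `validate_icechunk_config` is a hypothetical helper, not part of canvod:

```python
# Defaults mirror the table above; the validation bounds are assumptions.
DEFAULTS = {
    "compression_algorithm": "zstd",
    "compression_level": 5,
    "inline_threshold": 512,
    "get_concurrency": 1,
}

def validate_icechunk_config(cfg: dict) -> dict:
    """Merge user overrides onto defaults and reject out-of-range values."""
    merged = {**DEFAULTS, **cfg}
    if merged["compression_algorithm"] not in {"zstd", "lz4", "gzip"}:
        raise ValueError(f"unknown compressor: {merged['compression_algorithm']}")
    if not 1 <= merged["compression_level"] <= 22:
        raise ValueError("compression_level must be in 1..22")
    if merged["inline_threshold"] < 0 or merged["get_concurrency"] < 1:
        raise ValueError("inline_threshold/get_concurrency out of range")
    return merged
```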
Usage¶
```python
from canvod.store import MyIcechunkStore

# Open or create (filesystem)
store = MyIcechunkStore("/data/stores/rosalia/rinex")

# Open existing (read-only)
store = MyIcechunkStore("/data/stores/rosalia/rinex", read_only=True)
```
```python
from canvod.site import Site

site = Site("Rosalia")

# Append one day of observations → creates snapshot
snapshot_id = site.rinex_store.append_dataset(
    ds,
    receiver_name="canopy_01",
)
print(f"Snapshot: {snapshot_id[:8]}")

# List all snapshots on main branch
history = site.rinex_store.list_snapshots()
for snap in history:
    print(snap.id[:8], snap.message, snap.written_at)

# Open a specific historical version
ds_v1 = site.rinex_store.read(
    receiver_name="canopy_01",
    time_range=("2024-01-01", "2024-01-31"),
    snapshot=history[-1].id,
)

ds = site.rinex_store.read(
    receiver_name="canopy_01",
    time_range=("2024-01-01", "2024-06-30"),
)

# Lazily loaded — only reads chunks covering the range
print(ds.epoch.values[[0, -1]])
```
Cloud Deployment¶
```python
# No code change — set the store path to an S3 URI
store = MyIcechunkStore("s3://my-bucket/rosalia/rinex")
```
Configure credentials via environment variables or instance roles:
```bash
export AWS_DEFAULT_REGION=eu-central-1
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
```
```python
import os

# MinIO
os.environ["AWS_ENDPOINT_URL"] = "https://minio.example.com"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"

store = MyIcechunkStore("s3://canvod-data/rosalia/rinex")
```

```python
import os

# Cloudflare R2
os.environ["AWS_ENDPOINT_URL"] = "https://<account_id>.r2.cloudflarestorage.com"
os.environ["AWS_ACCESS_KEY_ID"] = "<r2_access_key>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<r2_secret_key>"

store = MyIcechunkStore("s3://canvod-data/rosalia/rinex")
```
**Local → Cloud:** Switch from filesystem to S3 by changing the `store_path` string — no other code changes required.
Deduplication¶
canvod-store uses SHA-256 file hashes to skip re-ingesting the same file:
```python
# In MyIcechunkStore.append_dataset()
if self._file_already_ingested(ds.attrs["File Hash"]):
    log.info("file_skipped", hash=ds.attrs["File Hash"][:8])
    return None

# Otherwise write + record hash
snapshot = self._write_and_commit(ds, ...)
self._record_ingested_hash(ds.attrs["File Hash"])
return snapshot
```
**Hash source:** The `"File Hash"` attribute is set by the reader (`SbfReader.file_hash` / `Rnxv3Obs.file_hash`) — a 16-character SHA-256 prefix of the raw file. Re-submitting the same file is therefore always detected and skipped.