Icechunk Storage¶
Icechunk is a cloud-native transactional storage format for multidimensional arrays — Git-like versioning meets Zarr v3.
- **Versioned Writes**: every `commit()` produces an immutable snapshot with a hash-addressable ID. Roll back to any prior state with a single line.
- **ACID Transactions**: multiple writes are atomic — either all succeed or none are persisted. No partial writes, no corrupt chunks, no reader/writer races.
- **Cloud-Native**: local filesystem for development; S3, MinIO, or Cloudflare R2 for production. Zero code changes to switch backends.
- **Zarr v3 Chunks**: Zstd-compressed chunks, O(1) epoch-range reads, compatible with `xarray.open_zarr()` out of the box.
Why Icechunk over plain Zarr?¶
| Feature | Icechunk | Zarr v3 | NetCDF4 | HDF5 |
|---|---|---|---|---|
| Version control | ✅ | ❌ | ❌ | ❌ |
| Cloud-native | ✅ | ✅ | ❌ | ❌ |
| Atomic transactions | ✅ | ❌ | ❌ | ❌ |
| Chunked arrays | ✅ | ✅ | ✅ | ✅ |
| Deduplication | ✅ | ❌ | ❌ | ❌ |
Storage Structure¶
```
stores/
  rosalia/
    rinex/
      .icechunk/   # Repository metadata + snapshots
      data/        # SHA-256 addressed chunk files
      refs/        # Branch heads
    vod/
      .icechunk/
      data/
      refs/
```
Chunk Strategy¶
The default chunk shape is tuned for daily GNSS time series:
```python
chunk_strategy = {"epoch": 34560, "sid": -1}
```
| Dimension | Value | Rationale |
|---|---|---|
| `epoch` | 34560 | ≈ 24 h at 2.5 s cadence — aligned to daily processing granularity |
| `sid` | −1 (unlimited) | All signal IDs in one chunk — VOD computes across all signals simultaneously |
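The epoch value follows directly from the cadence stated above — a quick sanity check (the names are illustrative, not part of canvod):

```python
# One day of epochs at the 2.5 s receiver cadence from the table above.
SECONDS_PER_DAY = 86400
CADENCE_S = 2.5

epochs_per_day = int(SECONDS_PER_DAY / CADENCE_S)
assert epochs_per_day == 34560  # matches the default "epoch" chunk size
```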
For a typical 72-SID dataset at 1 Hz:
```python
# float32, 24 h × 72 SIDs at 1 Hz
bytes_per_chunk = 86400 * 72 * 4  # ≈ 24 MB uncompressed

# Zstd level 5 typically achieves 4–8× on GNSS float data,
# i.e. roughly 3–6 MB per chunk on disk.
```
Override per read call — does not affect on-disk layout:
```python
ds = reader.read(
    time_range=("2024-01-01", "2024-01-31"),
    chunks={"epoch": 3600, "sid": -1},  # 1-hour lazy chunks in memory
)
```
Configuration¶
```yaml
# config/processing.yaml
icechunk:
  compression_algorithm: zstd
  compression_level: 5
  inline_threshold: 512
  get_concurrency: 1

  # Manifest preloading — loads coordinate manifests into memory at session open.
  # Worth enabling once stores grow beyond a few hundred commits.
  # manifest_preload_enabled: false
  # manifest_preload_max_refs: 100000000
  # manifest_preload_pattern: "epoch|sid"
```
| Key | Default | Description |
|---|---|---|
| `compression_algorithm` | `zstd` | Icechunk internal compression — `zstd`, `lz4`, or `gzip` |
| `compression_level` | `5` | Compressor level (1 = fast, 22 = max for zstd) |
| `inline_threshold` | `512` | Bytes below which chunks are stored inline in the manifest |
| `get_concurrency` | `1` | Concurrent partial-value reads (increase for S3/GCS) |
| `manifest_preload_enabled` | `false` | Pre-load coordinate manifests into memory at session open |
| `manifest_preload_max_refs` | `100000000` | Cap on chunk refs preloaded |
| `manifest_preload_pattern` | `"epoch\|sid"` | Regex for arrays to preload |
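A minimal sketch of how these defaults and bounds could be enforced before opening a store; `validate_icechunk_config` is a hypothetical helper, not part of canvod:

```python
# Defaults mirror the table above; the validation bounds are assumptions.
DEFAULTS = {
    "compression_algorithm": "zstd",
    "compression_level": 5,
    "inline_threshold": 512,
    "get_concurrency": 1,
}

def validate_icechunk_config(cfg: dict) -> dict:
    """Merge user overrides onto defaults and reject out-of-range values."""
    merged = {**DEFAULTS, **cfg}
    if merged["compression_algorithm"] not in {"zstd", "lz4", "gzip"}:
        raise ValueError(f"unknown compressor: {merged['compression_algorithm']}")
    if not 1 <= merged["compression_level"] <= 22:
        raise ValueError("compression_level must be in 1..22")
    if merged["inline_threshold"] < 0 or merged["get_concurrency"] < 1:
        raise ValueError("inline_threshold/get_concurrency out of range")
    return merged
```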
Usage¶
```python
from canvod.store import MyIcechunkStore

# Open or create (filesystem)
store = MyIcechunkStore("/data/stores/rosalia/rinex")

# Open existing (read-only)
store = MyIcechunkStore("/data/stores/rosalia/rinex", read_only=True)
```
```python
from canvod.site import Site

site = Site("Rosalia")

# Append one day of observations → creates snapshot
snapshot_id = site.rinex_store.append_dataset(
    ds,
    receiver_name="canopy_01",
)
print(f"Snapshot: {snapshot_id[:8]}")

# List all snapshots on main branch
history = site.rinex_store.list_snapshots()
for snap in history:
    print(snap.id[:8], snap.message, snap.written_at)

# Open a specific historical version
ds_v1 = site.rinex_store.read(
    receiver_name="canopy_01",
    time_range=("2024-01-01", "2024-01-31"),
    snapshot=history[-1].id,
)

ds = site.rinex_store.read(
    receiver_name="canopy_01",
    time_range=("2024-01-01", "2024-06-30"),
)

# Lazily loaded — only reads chunks covering the range
print(ds.epoch.values[[0, -1]])
```
Cloud Deployment¶
```python
# No code change — set the store path to an S3 URI
store = MyIcechunkStore("s3://my-bucket/rosalia/rinex")
```
Configure credentials via environment variables or instance roles:
```bash
export AWS_DEFAULT_REGION=eu-central-1
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
```
```python
import os

# MinIO
os.environ["AWS_ENDPOINT_URL"] = "https://minio.example.com"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"

store = MyIcechunkStore("s3://canvod-data/rosalia/rinex")
```

```python
import os

# Cloudflare R2
os.environ["AWS_ENDPOINT_URL"] = "https://<account_id>.r2.cloudflarestorage.com"
os.environ["AWS_ACCESS_KEY_ID"] = "<r2_access_key>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<r2_secret_key>"

store = MyIcechunkStore("s3://canvod-data/rosalia/rinex")
```
**Local → Cloud:** Switch from filesystem to S3 by changing the `store_path` string — no other code changes required.
Deduplication¶
canvod-store uses SHA-256 file hashes to skip re-ingesting the same file:
```python
# In MyIcechunkStore.append_dataset()
if self._file_already_ingested(ds.attrs["File Hash"]):
    log.info("file_skipped", hash=ds.attrs["File Hash"][:8])
    return None

# Otherwise write + record hash
snapshot = self._write_and_commit(ds, ...)
self._record_ingested_hash(ds.attrs["File Hash"])
return snapshot
```
**Hash source:** The `"File Hash"` attribute is set by the reader (`SbfReader.file_hash` / `Rnxv3Obs.file_hash`) — a 16-character SHA-256 prefix of the raw file. Re-submitting the same file is therefore always detected and skipped.