Preprocessing Pipeline
Before interpolation, auxiliary data must be restructured to match the (epoch × sid) layout of RINEX obs Datasets. This pipeline bridges the sv indexing of SP3/CLK files and the sid indexing of the obs Dataset.
The Dimension Mismatch Problem
SP3 / CLK files
Indexed by satellite vehicle (sv):
sp3_data.dims
# {'epoch': 96, 'sv': 32}
GPS G01 has one entry — one position per epoch.
RINEX obs Dataset
Indexed by signal ID (sid):
rinex_data.dims
# {'epoch': 2880, 'sid': 384}
GPS G01 maps to ~20 Signal IDs (L1C, L2W, L5Q, …).
Satellite position is frequency-independent (all signals leave the same antenna), so the sv → sid mapping simply replicates each position across all signal IDs for that satellite.
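The replication described above can be sketched with plain NumPy indexing. This is only an illustration on made-up coordinates, not the code behind map_aux_sv_to_sid(); the array and variable names are hypothetical, and the signal IDs use the "{SV}|{BAND}|{CODE}" format described under Step 1.

```python
import numpy as np

# Toy SP3-like X coordinates with shape (epoch, sv); the values are made up.
svs = ["G01", "G02"]
x_by_sv = np.array([
    [1.0, 2.0],
    [1.5, 2.5],
    [2.0, 3.0],
])

# Signal IDs to expand to; the satellite is the text before the first "|".
sids = ["G01|L1|C", "G01|L2|W", "G02|L1|C"]

# For every signal ID, find the column of its parent satellite.
sv_column = {sv: i for i, sv in enumerate(svs)}
columns = [sv_column[sid.split("|")[0]] for sid in sids]

# Fancy indexing copies each satellite column once per signal ID:
# (epoch, sv) -> (epoch, sid).
x_by_sid = x_by_sv[:, columns]

print(x_by_sid.shape)  # (3, 3)
```

Both G01 columns of x_by_sid hold the same values as the single G01 column of x_by_sv, which is the property the mapping relies on.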
Pipeline Functions
| Function | Purpose | Input dims | Output dims |
|---|---|---|---|
| preprocess_aux_for_interpolation() | Minimal prep for interpolation | sv: 32 | sid: 384 |
| prep_aux_ds() | Full prep for Icechunk storage | sv: 32 | sid: ~2000 |
| map_aux_sv_to_sid() | Step 1: expand sv → sid | sv: 32 | sid: 384 |
| pad_to_global_sid() | Step 2: pad to all constellations | sid: 384 | sid: ~2000 |
| normalize_sid_dtype() | Step 3: convert to object dtype | n/a | dtype fixed |
| strip_fillvalue() | Step 4: remove _FillValue attrs | n/a | attrs cleaned |
Step 1 — map_aux_sv_to_sid()
Each satellite position is replicated across all its signal IDs:
from canvod.auxiliary.preprocessing import map_aux_sv_to_sid
sp3_sid = map_aux_sv_to_sid(sp3_data)
# G01 position is identical for every signal ID of G01
sp3_sid["X"].sel(sid="G01|L1|C") # 12 345 678.9 m
sp3_sid["X"].sel(sid="G01|L2|W") # 12 345 678.9 m (same)
sp3_sid["X"].sel(sid="G01|L5|I") # 12 345 678.9 m (same)
Signal IDs generated for GPS G01 (~20 total):
G01|L1|C G01|L1|L G01|L1|P G01|L1|S G01|L1|W G01|L1|X G01|L1|Y
G01|L2|C G01|L2|D G01|L2|L G01|L2|M G01|L2|P G01|L2|S G01|L2|W
G01|L2|X G01|L2|Y G01|L5|I G01|L5|Q G01|L5|X
Signal ID format: "{SV}|{BAND}|{CODE}" — e.g. "G01|L1|C", "E08|E5a|Q".
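To make the format concrete, the sketch below splits a Signal ID into its three fields and joins them back. The names SignalId, parse_sid and format_sid are hypothetical helpers written for this illustration; they are not part of canvod.

```python
from typing import NamedTuple


class SignalId(NamedTuple):
    """The three fields of a "{SV}|{BAND}|{CODE}" string."""
    sv: str
    band: str
    code: str


def parse_sid(sid: str) -> SignalId:
    """Split a Signal ID on "|" and check that it has exactly three fields."""
    parts = sid.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected 'SV|BAND|CODE', got {sid!r}")
    return SignalId(*parts)


def format_sid(sv: str, band: str, code: str) -> str:
    """Build a Signal ID string from its fields."""
    return f"{sv}|{band}|{code}"


parsed = parse_sid("E08|E5a|Q")
print(parsed.sv, parsed.band, parsed.code)  # E08 E5a Q
```

Taking the sv field of a parsed Signal ID gives the key needed to look up the matching SP3/CLK entry.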
Step 2 — pad_to_global_sid()
Pads the Dataset to include all possible Signal IDs across all supported constellations (~1 987 total). Required for Icechunk storage, where sequentially appended datasets must share the same coordinate space.
from canvod.auxiliary.preprocessing import pad_to_global_sid
sp3_global = pad_to_global_sid(sp3_sid)
sp3_global.sizes["sid"] # 1987
Missing SIDs are filled with NaN — they carry no observations and are stripped during VOD computation.
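The padding idea can be illustrated with NumPy on a tiny, made-up coordinate space. The four-entry global_sids list stands in for the full global list, and all names here are hypothetical; whether pad_to_global_sid() works this way internally is an assumption.

```python
import numpy as np

# Stand-in for the global Signal ID list (the real one has far more entries).
global_sids = ["E08|E5a|Q", "G01|L1|C", "G01|L2|W", "G02|L1|C"]

# Data that exists for only two of those Signal IDs, shape (epoch, sid).
local_sids = ["G01|L1|C", "G02|L1|C"]
x_local = np.array([
    [1.0, 2.0],
    [1.5, 2.5],
])

# Start from an all-NaN array covering the global coordinate space ...
x_global = np.full((x_local.shape[0], len(global_sids)), np.nan)

# ... and copy each local column into its slot in the global layout.
position = {sid: i for i, sid in enumerate(global_sids)}
target_columns = [position[sid] for sid in local_sids]
x_global[:, target_columns] = x_local

print(x_global.shape)  # (2, 4)
```

On an xarray Dataset, Dataset.reindex(sid=global_sids) produces the same kind of result, since labels missing from the original are filled with NaN by default.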
Step 3 — normalize_sid_dtype()
Converts the sid coordinate to object dtype for Zarr/Icechunk compatibility. Fixed-length Unicode string types (<U12, etc.) cause dtype conflicts when datasets from different files are concatenated and appended.
from canvod.auxiliary.preprocessing import normalize_sid_dtype
ds = normalize_sid_dtype(ds)
ds["sid"].dtype # dtype('O')
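The dtype conflict is easy to reproduce with NumPy alone. The sketch below is an illustration with two one-element arrays, not the body of normalize_sid_dtype(): the width of a fixed-length Unicode dtype depends on the longest string in the array, so two files can end up with different sid dtypes.

```python
import numpy as np

# NumPy picks the string width from the longest element in each array.
short = np.array(["G01|L1|C"])    # fixed-width Unicode, 8 characters
longer = np.array(["E08|E5a|Q"])  # fixed-width Unicode, 9 characters

print(short.dtype == longer.dtype)  # False

# Casting to object dtype removes the width from the type entirely.
short_obj = short.astype(object)
longer_obj = longer.astype(object)

print(short_obj.dtype == longer_obj.dtype)  # True

# Concatenating object arrays keeps the object dtype.
combined = np.concatenate([short_obj, longer_obj])
```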
Step 4 — strip_fillvalue()
Removes _FillValue attributes that conflict with Icechunk's internal missing-data handling. NaN is the standard missing-value marker throughout the pipeline.
from canvod.auxiliary.preprocessing import strip_fillvalue
ds = strip_fillvalue(ds)
# No variable in ds.data_vars has a "_FillValue" encoding key
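In xarray, a _FillValue can live in a variable's attrs or in its encoding. The sketch below removes it from both places on a toy Dataset. The helper name drop_fillvalue is hypothetical, and whether strip_fillvalue() clears both locations is an assumption rather than documented behaviour.

```python
import numpy as np
import xarray as xr


def drop_fillvalue(ds: xr.Dataset) -> xr.Dataset:
    """Return a copy of ds with "_FillValue" removed from attrs and encoding."""
    out = ds.copy()
    for var in out.variables.values():  # data variables and coordinates
        var.attrs.pop("_FillValue", None)
        var.encoding.pop("_FillValue", None)
    return out


# Toy Dataset with a _FillValue in both attrs and encoding (made-up values).
toy = xr.Dataset(
    {"X": (("epoch",), np.array([1.0, np.nan]), {"_FillValue": -9999.0, "units": "m"})}
)
toy["X"].encoding["_FillValue"] = -9999.0

clean = drop_fillvalue(toy)
print("_FillValue" in clean["X"].attrs)     # False
print("_FillValue" in clean["X"].encoding)  # False
```

Other attributes, such as units in this example, are left in place.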
Scientific Note
Replicating satellite positions across signal IDs is scientifically valid:
Position is frequency-independent
All signals originate from the same satellite antenna phase centre. IGS SP3 final products have ~1 cm accuracy, with antenna offset corrections already applied. Replication copies each position unchanged, so it introduces no additional error.