Over the past week or so, I've been prototyping components of the end-to-end DRC pipeline based on interesting discussions about the KPF and UHE pipelines.
My goal with this prototype was to create a sort of straw man with a specific set of technologies and architectural choices to ground our planning/design of the production PANOSETI DRC pipeline.
I'll briefly describe this prototype system for reference:
**3-Server Cluster in the RAL @ Berkeley**

- Wei, Ben, and I set up a cluster of 3 Dell 1U R410 servers running Ubuntu 24.04, connected to a 1GbE switch, and installed in the RAL server room (Berkeley).
- Two of the servers each have three 2TB HDDs in a RAIDZ1 configuration; the third server has a 500GB SSD for metadata caching plus a RAID0 array consisting of a 3TB HDD and a 2TB HDD. (The disks come from Dan's previous projects.)
- This hardware is about 14 years old, so we should expect much better performance on a more modern system.
**BeeGFS Distributed Filesystem**
- The combined 11.6TiB of HDD and SSD capacity is presented to clients as a single data volume mounted at /mnt/beegfs and managed by the BeeGFS parallel file system.
- BeeGFS can stripe files across HDDs and SSDs, enabling high I/O throughput.
- BeeGFS also supports storage pools, which let us build a tiered hierarchy: a small, fast SSD pool in front of a large, slow HDD pool for secondary storage (see the small inspection sketch after this list).
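As a rough illustration, here is a minimal Python sketch of how pipeline tooling could inspect the stripe layout (and, in recent BeeGFS versions, the storage pool) of a path on the BeeGFS volume. It assumes the standard beegfs-ctl client utility is installed; the path is just an example.

```python
# Hedged sketch: query BeeGFS entry info (stripe pattern, chunk size, targets,
# storage pool) for a path on /mnt/beegfs using the beegfs-ctl client tool.
import subprocess

def beegfs_entry_info(path: str) -> str:
    """Return the raw output of `beegfs-ctl --getentryinfo` for `path`."""
    result = subprocess.run(
        ["beegfs-ctl", "--getentryinfo", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(beegfs_entry_info("/mnt/beegfs/data"))
```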
**Unified Data Directory: /mnt/beegfs/data**

- Like the Keck pipeline, our raw data will be archived in a level-0 data product directory (/mnt/beegfs/data/L0) by rsync processes belonging to a special "l0-copiers" group, which has exclusive write access to this directory. Other groups can read this data but cannot write to it, which protects the raw data from buggy file I/O code (a sketch of this directory convention follows after this list).
- Note: BeeGFS allows us to set access control lists (ACLs) on directories and any new files/subdirectories created within them, so we could enforce these rules automatically.
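To make the directory convention concrete, here is a minimal sketch of how pipeline code could map an L0 input to its L1 output location. The helper, file names, and extensions are hypothetical, not part of the prototype; the point is that pipeline code only ever reads from L0 and writes under L1.

```python
from pathlib import Path

# Hypothetical layout constants for the unified data directory described above.
BEEGFS_DATA = Path("/mnt/beegfs/data")
L0_DIR = BEEGFS_DATA / "L0"
L1_DIR = BEEGFS_DATA / "L1"

def l1_path_for(l0_product: Path, suffix: str = ".zarr") -> Path:
    """Map a raw L0 product to the corresponding L1 output path.

    Pipeline stages never write under L0 (enforced by group permissions/ACLs);
    derived products go under L1 with a mirrored directory structure.
    """
    relative = l0_product.relative_to(L0_DIR)
    return (L1_DIR / relative).with_suffix(suffix)

# Example (hypothetical file name):
#   /mnt/beegfs/data/L0/obs_20250101/module_3.pff
#   -> /mnt/beegfs/data/L1/obs_20250101/module_3.zarr
print(l1_path_for(L0_DIR / "obs_20250101" / "module_3.pff"))
```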
**Dask + Zarr --> mini-HPC Cluster**
- The 3-server cluster plus one additional server provides a pool of 64 cores to a Dask SSHCluster, whose scheduler distributes pipeline jobs across the workers.
- Each server runs a beegfs-client service, allowing them to directly read input data products from /mnt/beegfs/data/L0 and write output data products to /mnt/beegfs/data/L1+.
- The pipeline works entirely with Zarr-formatted files so that Dask can build computation graphs with a high degree of parallelism over disjoint chunks of data (see the sketch after this list).
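Here is a minimal sketch of what a stage running on this mini-HPC cluster could look like. The host names, thread counts, variable names, and file paths are assumptions for illustration, not the actual prototype configuration.

```python
from dask.distributed import SSHCluster, Client
import xarray as xr

# First host runs the scheduler; the rest run workers. Every host mounts
# /mnt/beegfs via beegfs-client, so workers all see the same data paths.
cluster = SSHCluster(
    ["drc0", "drc0", "drc1", "drc2", "drc3"],  # scheduler + 4 worker hosts (hypothetical names)
    worker_options={"nthreads": 16},           # ~64 cores total across workers
    scheduler_options={"dashboard_address": ":8787"},
)
client = Client(cluster)

# Open an L0 Zarr store lazily; Dask builds a task graph over its chunks.
ds = xr.open_zarr("/mnt/beegfs/data/L0/obs_20250101/module_3.zarr")

# Example stage: per-pixel baseline subtraction, computed chunk-by-chunk in
# parallel on the workers, then written out as an L1 product.
baseline = ds["images"].mean(dim="time")
l1 = (ds["images"] - baseline).to_dataset(name="images_baseline_subtracted")
l1.to_zarr("/mnt/beegfs/data/L1/obs_20250101/module_3.zarr", mode="w")
```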

**Zarr + xarray data model**
- Pipeline stages could take an xarray object backed by a Zarr store as one of their inputs. This xarray object could provide an object-oriented, NumPy-like interface to all of our observational data.
- Zarr group/array hierarchies can be traversed with xarray using Python's dot notation.
- We can define the hierarchy, so this data model would be extensible for new data products. For further customization, we could subclass xarray classes and add custom methods.
- xarray automatically creates labeled N-D arrays from Zarr files, which we could use to attach coordinates such as timestamps and temperatures to our image data (see the sketch below).
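To show what this data model could look like in practice, here is a minimal sketch that builds a labeled dataset, writes it to a chunked Zarr store on the BeeGFS volume, and reopens it with label-based selection. The variable names, coordinates, and attributes are placeholders, not a finalized schema.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2025-01-01T06:00:00", periods=100, freq="25ms")

ds = xr.Dataset(
    data_vars={
        # Labeled image cube with dimensions (time, y, x)
        "images": (("time", "y", "x"), np.zeros((100, 32, 32), dtype=np.uint16)),
        # Per-frame housekeeping stored alongside the images
        "detector_temp": (("time",), np.full(100, 25.0)),
    },
    coords={"time": times},
    attrs={"module_id": 3},
)

# Write a chunked Zarr store on the BeeGFS volume (path is an example).
ds.chunk({"time": 20}).to_zarr("/mnt/beegfs/data/L1/example.zarr", mode="w")

# Reopen lazily; variables are reachable with dot notation, and label-based
# selection picks out, e.g., all frames in a one-second window.
ds2 = xr.open_zarr("/mnt/beegfs/data/L1/example.zarr")
subset = ds2.images.sel(time=slice("2025-01-01T06:00:00", "2025-01-01T06:00:01"))
```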
**Simple End-to-end Tests**