Presenter: Lucas Sterzinger (University of California, Davis)
Description:
Many organizations are moving their data to cloud-hosted object storage, which allows them greater flexibility in cost, dataset size, access, and security. For multi-dimensional data, the Zarr format has emerged as a popular cloud storage format, with consolidated metadata and data chunks stored in separate objects that allow efficient parallel access. NetCDF4/HDF5 files have been a community standard for decades and remain extremely popular; however, they do not have consolidated metadata. Without consolidated metadata, accessing this data requires many small reads, resulting in poor performance on the cloud. Transforming the vast existing NetCDF4/HDF5 data archives would require substantial computational resources and create a duplicate of each dataset, doubling storage requirements and complicating data version control, provenance, and archive protocols. A potential solution to this problem is to create a consolidated metadata file containing the byte-range locations of the data chunks and use it to access the NetCDF4/HDF5 data. ReferenceFileSystem, a new part of the Intake group’s fsspec (local and remote filesystem interfaces for Python) project, performs this task by creating a JSON file that allows a NetCDF4/HDF5 file to look like a file system. The data can then be read efficiently using the Zarr library directly. Using data from the GOES-East satellite hosted on Amazon Web Services, we demonstrate the effectiveness of this approach and provide a pathway to improving data access for the vast existing NetCDF4/HDF5 data archives.
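The byte-range idea described above can be illustrated with a toy sketch that uses only the Python standard library. It is not the fsspec or Kerchunk implementation: the file layout, the `temperature` variable name, and the helper functions are all hypothetical, and the index is only loosely modelled on a mapping from chunk key to `[file, offset, length]`. The point is that, once such an index exists, each chunk can be fetched with a single ranged read and without parsing the original file's internal metadata.

```python
import json
import os
import struct
import tempfile


def write_toy_file(path, chunks):
    """Write a header followed by raw chunks; return [offset, length] per chunk.

    Hypothetical stand-in for an existing archive file whose chunks sit at
    arbitrary byte positions behind format-specific metadata.
    """
    locations = []
    with open(path, "wb") as f:
        f.write(b"TOYHEADER")  # 9 bytes standing in for in-file metadata
        for chunk in chunks:
            payload = struct.pack(f"<{len(chunk)}d", *chunk)  # little-endian float64
            locations.append([f.tell(), len(payload)])
            f.write(payload)
    return locations


def build_references(path, locations, var="temperature"):
    """Build a JSON-serialisable index: chunk key -> [file, offset, length]."""
    return {
        f"{var}/{i}": [path, offset, length]
        for i, (offset, length) in enumerate(locations)
    }


def read_chunk(refs, key):
    """Fetch one chunk with a single ranged read, using only the index."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read(length)
    return list(struct.unpack(f"<{length // 8}d", raw))


def demo():
    """Create a toy file, index it, round-trip the index via JSON, read chunk 1."""
    chunks = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        locations = write_toy_file(path, chunks)
        # The index is plain JSON, so it can be stored apart from the data file.
        refs = json.loads(json.dumps(build_references(path, locations)))
        values = read_chunk(refs, "temperature/1")
    return refs, values


if __name__ == "__main__":
    refs, values = demo()
    print(values)
```

In the workflow the abstract describes, the index would instead point at objects in cloud storage and be read through fsspec and Zarr; the local `read_chunk` helper here only stands in for that step.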
More Information: https://lucassterzinger.com/2022-osm-poster/
Full list of Authors
- Lucas Sterzinger (University of California, Davis)
- Martin Durant (Anaconda)
- Richard Signell (Woods Hole Coastal and Marine Science Center, USGS)
- Chelle Gentemann (Farallon Institute)
- Kevin Paul (National Center for Atmospheric Research)
- Julia Kent (National Center for Atmospheric Research)
Kerchunk: Cloud-performant reading of NetCDF4/HDF5/Grib2 using the Zarr library
Category
Scientific Session > OD - Ocean Data Science, Analytics, and Management > OD12 Big Data for a Big Ocean 2022
Description
Presentation Preference: Either
Supporting Program: None
Student or Professional? I am a Student