The NOAA Integrated Ocean Observing System (IOOS) National Data Management and Cyberinfrastructure (DMAC) system is tasked with providing public access to data and products generated through observation and modeling of the nation's oceans, coasts, and Great Lakes. The standards-based approach established by IOOS has been successfully adopted by IOOS partners and stakeholders and has provided a consistent framework for both submitting and accessing data. However, the esoteric nature of the required data formats, tools, and services presents a barrier for many users. Rapid advances in oceanographic and meteorological technologies offer new ways to lower this barrier and to give DMAC users greater efficiency and stability. The Reaching for the Cloud (RFC) initiative leveraged one such advancement, cloud computing, which offers a clear way to build upon current data management procedures and metadata standards with comparable cloud-native frameworks that are increasingly accessible, approachable, and cost-effective. This work led to key recommendations and a roadmap outlining how IOOS and its Regional Associations (RAs) can transition towards a service-based cloud ecosystem that will increase the use of IOOS data and promote connection to other disciplines and stakeholders. Over the course of the three-year effort, we identified requirements for this transition, demonstrated its value with a series of functional prototypes, determined associated usage and cost metrics, and developed recommendations for governance and operations. This work not only moves the community towards more accessible data; it also considers how to reduce costs, improve data discoverability, and, most importantly, allow the IOOS National DMAC system to keep pace with rapid advancements in oceanographic technologies.
Through stakeholder outreach early in the project, we identified the management of gridded numerical model data (oceanographic and meteorological) as the key area on which to focus our efforts. Based on the current workflow for managing model data, we defined a series of prototypes to demonstrate the elements of that workflow: data ingest, storage and discovery, processing and analysis, and presentation. Starting with data ingest, we looked at how to identify, store, and enrich the data so that it is more easily accessible, and ultimately focused on using Zarr and kerchunk to optimize data storage on the cloud. Converging on a specific data format is fraught with challenges, but we found a middle road with kerchunk, which indexes the native gridded data (NetCDF/GRIB) into the Zarr specification for fast selection of specific byte ranges. We use a common notification and queueing pattern, in which S3 data stores emit SNS/SQS notifications, to perform the indexing as data becomes available. We also generate 30-day and model-run aggregations, which provide a virtual view of the data as a single dataset even though it may physically be composed of many different files. We then addressed common use cases such as cloud-native scientific processing workflows and web-based visualization. The kerchunk process makes data access from the cloud more efficient, but using that data still requires considerable domain knowledge, dependencies, and engineering. Although model data is now freely available on the cloud, it is hard to claim that the data is truly FAIR given inherent data challenges such as differing projections, formats, and storage schemes, coupled with the infrastructure challenges of scaling, distribution, and storage. To address those challenges, we serve the data through a data broker layer named Xpublish. Xpublish is a modern, open-source software package for serving multidimensional array-oriented gridded scientific data.
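The byte-range indexing idea behind kerchunk can be illustrated with a minimal sketch. It uses only the Python standard library and does not call kerchunk itself; the function names `build_references` and `read_chunk`, the variable name `temp`, and the flat binary file are hypothetical stand-ins for a real NetCDF/GRIB file. Kerchunk's reference format maps each Zarr chunk key to a `[url, offset, length]` triple; the Zarr metadata entries (`.zarray`, `.zattrs`) that a real reference set also carries are omitted here.

```python
import json
import os
import tempfile


def build_references(path, chunk_nbytes, var_name="temp"):
    """Index a file into kerchunk-style references (illustrative only).

    Each chunk key maps to [location, byte offset, byte length], so a
    reader can fetch one chunk without scanning the whole file.
    """
    size = os.path.getsize(path)
    refs = {}
    for i, offset in enumerate(range(0, size, chunk_nbytes)):
        length = min(chunk_nbytes, size - offset)
        refs[f"{var_name}/{i}"] = [path, offset, length]
    return {"version": 1, "refs": refs}


def read_chunk(index, key):
    """Fetch a single chunk by seeking directly to its byte range."""
    location, offset, length = index["refs"][key]
    with open(location, "rb") as f:
        f.seek(offset)
        return f.read(length)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for a native model output file.
        data_path = os.path.join(tmp, "model_output.bin")
        with open(data_path, "wb") as f:
            f.write(bytes(range(10)))

        index = build_references(data_path, chunk_nbytes=4)
        print(json.dumps(index, indent=2))   # the index is plain JSON
        print(read_chunk(index, "temp/1"))   # reads only bytes 4..7
```

Against object storage, the same lookup would become an HTTP range request rather than a local seek, which is why the original files can stay in their native format. An aggregation of many files is, under the same assumptions, one reference set whose entries point at different source files.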
Xpublish is designed to be a self-serve data access point that greatly simplifies retrieving and processing data on the cloud. The complexity of cloud system design is encapsulated in Xpublish, so users and operators have minimal technical complexity to manage. We have prototyped several applications of Xpublish, such as map tiling and data subsetting. We have tested these services on various cloud infrastructure platforms and explored the nuances of choices such as serverless versus managed deployment and infrastructure as code (IaC). This exploratory work has given RPS and our clients a clear path forward for working with scientific data in the cloud. The RFC initiative has resulted in the creation of several new open-source communities and tools to aid in the formatting and serving of multidimensional array-oriented gridded data. We will describe these new tools, where they fit in the ecosystem, and how they can be used to start serving data from the cloud.
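As an illustration of what a map-tiling request implies for data subsetting, the sketch below converts a tile address into the geographic bounding box that would need to be selected from a gridded dataset. It assumes the widely used XYZ tile scheme in Web Mercator, which the abstract does not specify, and the function name `tile_bounds` is hypothetical. It is not Xpublish code and uses only the Python standard library.

```python
import math


def tile_bounds(x, y, z):
    """Return (west, south, east, north) in degrees for XYZ tile x/y/z.

    Assumes the common Web Mercator XYZ convention: tile (0, 0) is the
    top-left tile and each zoom level z has 2**z tiles per axis.
    """
    n = 2 ** z

    def lon(tile_x):
        return tile_x / n * 360.0 - 180.0

    def lat(tile_y):
        # Inverse Web Mercator projection for the tile's horizontal edge.
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * tile_y / n))))

    return (lon(x), lat(y + 1), lon(x + 1), lat(y))


if __name__ == "__main__":
    west, south, east, north = tile_bounds(1, 1, 1)
    print(west, south, east, north)
```

A tiling service would then select only the grid cells that fall inside this box, resample them to the tile's pixel dimensions, and render an image, so each request touches a small part of the dataset.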