SkyhookDM is an extension of Ceph for scalable storage of tables and for offloading common data management operations on them, including selection, projection, aggregation, indexing, and user-defined functions. Its goal is to transparently scale out these operations across many storage servers, leveraging the scale-out and availability properties of Ceph while significantly reducing the CPU cycles and interconnect bandwidth spent on unnecessary data transfers. The SkyhookDM architecture is also designed to transparently optimize for future storage devices, which are expected to become increasingly heterogeneous and specialized.
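The idea of offloading can be illustrated with a small conceptual sketch. It does not use Ceph or any SkyhookDM code: the `StorageServer` class, its `scan` and `partial_sum` methods, and the table columns are all hypothetical stand-ins, chosen only to show how running selection, projection, and aggregation next to the data reduces what has to cross the network.

```python
# Conceptual sketch only: plain Python objects stand in for storage servers.
from dataclasses import dataclass


@dataclass
class StorageServer:
    """Hypothetical stand-in for a storage server holding one table partition."""
    rows: list

    def scan(self, predicate, columns):
        # Selection and projection run next to the data, so only
        # matching rows and requested columns leave the server.
        return [{c: r[c] for c in columns} for r in self.rows if predicate(r)]

    def partial_sum(self, predicate, column):
        # Aggregation is offloaded too: the server returns a single number.
        return sum(r[column] for r in self.rows if predicate(r))


def make_partition(start, stop):
    # Each row carries a bulky "payload" column the query never needs.
    return [{"id": i, "energy": i * 1.5, "payload": "x" * 100}
            for i in range(start, stop)]


def wanted(row):
    return row["energy"] > 2900.0


servers = [StorageServer(make_partition(0, 1000)),
           StorageServer(make_partition(1000, 2000))]

# Without offloading: the client pulls every row and column, then filters.
fetched_all = [r for s in servers for r in s.rows]
local_result = [{"id": r["id"], "energy": r["energy"]}
                for r in fetched_all if wanted(r)]

# With offloading: each server filters and projects before sending anything.
offloaded_result = [r for s in servers for r in s.scan(wanted, ["id", "energy"])]

# Partial aggregates from each server are combined on the client.
total_energy = sum(s.partial_sum(wanted, "energy") for s in servers)

print(len(fetched_all), "rows transferred without offloading")
print(len(offloaded_result), "rows transferred with offloading")
```

Both paths produce the same result, but in the offloaded path the unused `payload` column and the non-matching rows never leave the servers, and the aggregate is reduced to one value per server.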

Tables are stored in the Apache Arrow serialization format, which supports column-based processing. The SkyhookDM client developed for HEP (high-energy physics) exposes the Arrow Dataset API through a new Dataset class called RadosDataset. This extension is intended to become part of Apache Arrow. Other SkyhookDM clients include a foreign data wrapper for PostgreSQL as well as Python clients for SQL.

SkyhookDM is currently an incubator project at the Center for Research on Open Source Software at the University of California, Santa Cruz.

Team

Presentations

Publications