Overview

SkyhookDM is an extension of Ceph for scalable table storage and for offloading common data management operations on those tables, including selection, projection, aggregation, and indexing, as well as user-defined functions. Its goal is to transparently scale out data management operations across many storage servers, leveraging the scale-out and availability properties of Ceph while significantly reducing the CPU cycles and interconnect bandwidth spent on unnecessary data transfers. The SkyhookDM architecture is also designed to transparently optimize for future storage devices of increasing heterogeneity and specialization. All data moved from the Ceph OSDs to the client is in Apache Arrow format.
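
To make the bandwidth argument concrete, here is a minimal toy sketch in plain Python of selection and projection running on the storage side instead of on the client. It is not SkyhookDM code: the `StorageServer` class, its methods, and the `values_moved` helper are hypothetical names invented for this illustration, and in-memory lists stand in for Ceph OSDs.

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    """Hypothetical stand-in for one storage server holding a table partition."""
    rows: list  # each row is a dict of column name -> value

    def scan_all(self):
        # No offloading: every row and every column leaves the server.
        return list(self.rows)

    def scan_pushdown(self, predicate, columns):
        # Offloading: selection and projection run where the data lives,
        # so only matching rows and requested columns leave the server.
        return [{c: r[c] for c in columns} for r in self.rows if predicate(r)]


def values_moved(rows):
    """Count the cell values that crossed the (simulated) network."""
    return sum(len(r) for r in rows)


def expensive(row):
    """Selection predicate used by both scan strategies."""
    return row["price"] > 20.0


servers = [
    StorageServer([
        {"id": 1, "price": 10.0, "region": "us"},
        {"id": 2, "price": 25.5, "region": "eu"},
    ]),
    StorageServer([
        {"id": 3, "price": 7.25, "region": "us"},
        {"id": 4, "price": 40.0, "region": "eu"},
    ]),
]
columns = ["id", "price"]

# Client-side processing: pull everything, then filter and project locally.
pulled = [r for s in servers for r in s.scan_all()]
client_result = [{c: r[c] for c in columns} for r in pulled if expensive(r)]

# Storage-side processing: each server filters and projects before replying.
pushed = [r for s in servers for r in s.scan_pushdown(expensive, columns)]

print(values_moved(pulled), values_moved(pushed))
```

Both strategies produce the same result rows; the difference is how many values have to travel from the servers to the client to get there.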

We extended the Arrow Dataset API with a new FileFormat API called RadosParquetFileFormat, which transparently pushes scan operations down to the storage layer. This puts underutilized resources on the storage servers to work and speeds up query execution by moving only the required data over the network. This extension is intended to eventually become part of Apache Arrow. Other SkyhookDM clients include a foreign data wrapper for PostgreSQL and Python clients for SQL.

SkyhookDM is currently an incubator project at the Center for Research on Open Source Software at the University of California, Santa Cruz.

Architecture

SkyhookDM Architecture

Announcements

  • February 2021 - SkyhookDM v0.1.0 released!
  • February 2021 - Guide for getting started with SkyhookDM.

Team

Presentations

Publications