
SkyhookDM is an extension of Ceph for the scalable storage of tables and for offloading common data management operations on them, including selection, projection, aggregation, and indexing, as well as user-defined functions. The goal of SkyhookDM is to transparently scale out data management operations across many storage servers, leveraging the scale-out and availability properties of Ceph, while significantly reducing the CPU cycles and interconnect bandwidth spent on unnecessary data transfers. The SkyhookDM architecture is also designed to transparently optimize for future storage devices of increasing heterogeneity and specialization.
Tables are stored in the Apache Arrow serialization format, which supports column-based processing. The SkyhookDM client developed for HEP exposes the Arrow Dataset API with a new Dataset class called RadosDataset. This extension is intended to become part of Apache Arrow. Other SkyhookDM clients include a foreign data wrapper for PostgreSQL as well as Python clients for SQL.
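The offloading idea described above can be illustrated with a minimal, self-contained sketch. It does not use Ceph, Arrow, or the SkyhookDM API: `StorageServer`, `scan`, `partial_sum`, and the sample columns (`event`, `pt`, `eta`) are hypothetical names chosen for illustration. The sketch only compares how many values cross the "network" when selection and projection run at the client versus at the storage servers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]
Predicate = Callable[[Row], bool]


@dataclass
class StorageServer:
    """Hypothetical stand-in for one storage server holding a table partition."""
    rows: List[Row]

    def read_all(self) -> List[Row]:
        # Baseline: ship every row and every column to the client.
        return [dict(row) for row in self.rows]

    def scan(self, columns: List[str], predicate: Predicate) -> List[Row]:
        # Offloaded selection and projection: filter rows and drop
        # unneeded columns before anything leaves the server.
        return [{c: row[c] for c in columns} for row in self.rows if predicate(row)]

    def partial_sum(self, column: str, predicate: Predicate) -> float:
        # Offloaded aggregation: return one number instead of rows.
        return sum(row[column] for row in self.rows if predicate(row))


def values_transferred(rows: List[Row]) -> int:
    """Count the individual values that were sent to the client."""
    return sum(len(row) for row in rows)


def make_servers() -> List[StorageServer]:
    # Made-up sample data: 12 rows with 3 columns, split over 3 servers.
    servers = []
    for s in range(3):
        partition = [
            {"event": float(4 * s + i), "pt": 10.0 * (4 * s + i), "eta": 0.5}
            for i in range(4)
        ]
        servers.append(StorageServer(partition))
    return servers


def high_pt(row: Row) -> bool:
    return row["pt"] >= 80.0


servers = make_servers()
columns = ["event", "pt"]

# Client-side processing: fetch everything, then filter and project locally.
fetched = [row for srv in servers for row in srv.read_all()]
client_side = [{c: row[c] for c in columns} for row in fetched if high_pt(row)]

# Offloaded processing: each server filters and projects its own partition.
offloaded = [row for srv in servers for row in srv.scan(columns, high_pt)]

# Offloaded aggregation: the client only combines per-server partial sums.
total_pt = sum(srv.partial_sum("pt", high_pt) for srv in servers)

print("values moved, client-side:", values_transferred(fetched))
print("values moved, offloaded:  ", values_transferred(offloaded))
print("sum of pt over selected rows:", total_pt)
```

Both paths produce the same result rows, but the offloaded path moves only the selected rows and requested columns, and the aggregation moves a single number per server. In a real deployment the savings would depend on the selectivity of the query and the layout of the data.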
SkyhookDM is currently an incubator project at the Center for Research in Open Source Software at the University of California, Santa Cruz.
Team
- Jeff LeFevre
- Ivo Jimenez
- esmaeil-mirvakili
- JayjeetAtGithub
- Xiaowei (Aaron) Chu
- Carlos Maltzahn
Presentations
- 2 Aug 2020 - "The value of open source to universities: UC Santa Cruz tests the water", Carlos Maltzahn, Interview for a Linux Professional Institute Blog Post by Andy Oram
- 30 Jun 2020 - "The Ceph Project", Carlos Maltzahn, UC Berkeley Cloud Meetup 015
- 11 Jun 2020 - "Some lessons learned from creating and using the Ceph open source storage system", Carlos Maltzahn, BCS Open Source Specialist Group: Open source software for scientific and parallel computing
- 5 Jun 2020 - "How $2 Million Dollars Helped Build CROSS with Dr. Carlos Maltzahn", Carlos Maltzahn, Sustain Podcast
- 26 May 2020 - "Skyhook Data Management: programmable object storage for databases", Jeff LeFevre, Fujitsu Labs
- 15 May 2020 - "Industry-supported seeding of developer communities around university research prototypes", Carlos Maltzahn, OpenDP Community Meeting
- 27 Feb 2020 - "SkyhookDM: Programmable Storage for Datasets", Carlos Maltzahn, IRIS-HEP Poster Session
- 24 Feb 2020 - "Scaling databases and file APIs with programmable Ceph object storage", Carlos Maltzahn, 2020 Linux Storage and Filesystems Conference (Vault’20, co-located with FAST’20 and NSDI’20)
- 19 Nov 2019 - "Panel presentation on Enabling Data Services for HPC", Carlos Maltzahn, Enabling Data Services for HPC (BoF at SC19)
- 5 Nov 2019 - "Mapping datasets to object storage", Jeff LeFevre, CHEP 2019
- 24 Oct 2019 - "Education, research, and technology transfer in open source software: new possibilities for universities", Carlos Maltzahn, École Polytechnique Fédérale de Lausanne (EPFL)
- 21 Oct 2019 - "Education, research, and technology transfer in open source software: new possibilities for universities", Carlos Maltzahn, Friedrich-Alexander Universität, Erlangen-Nürnberg
- 19 Oct 2019 - "Center for Research in Open Source Software", Carlos Maltzahn, Google Summer of Code Mentor Summit
- 3 Oct 2019 - "Skyhook Data Management: Scaling Databases and Applications with Open Source Extensible Storage", Jeff LeFevre, CROSS Research Symposium 2019
- 15 Aug 2019 - "Update on the Center for Research in Open Source Software", Carlos Maltzahn, Seminar at New Mexico Consortium, Los Alamos
- 20 Jun 2019 - "MBWU: Benefit Quantification for Data Access Function Offloading", Carlos Maltzahn, HPC I/O in the Data Center Workshop (HPC-IODC 2019)
- 24 Apr 2019 - "Skyhook: Programmable Object Storage for Analysis", Jeff LeFevre, IRIS-HEP Topical Meetings
- 13 Mar 2019 - "How to Leverage Research Universities", Carlos Maltzahn, Linux Foundation Open Source Leadership Summit (OSLS 2019)
- 26 Feb 2019 - "Skyhook: programmable storage for databases", Carlos Maltzahn, Vault'19
- 26 Feb 2019 - "Skyhook: programmable storage for databases", Jeff LeFevre, Vault'19
- 25 Jan 2019 - "Programmable Storage Systems: For I/O that doesn’t fit under the rug", Carlos Maltzahn, Seminar at Amazon AWS
- 14 Dec 2018 - "IN53A-04: Reproducible, Automated and Portable Computational and Data Science Experimentation Pipelines with Popper (with Ivo Jimenez)", Carlos Maltzahn, IN53A: Enabling Transparency and Reproducibility in Geoscience Through Practical Provenance and Cloud-Based Workflows I (AGU Fall Meeting)
- 11 Dec 2018 - "Programmable Storage Systems: For I/O that doesn’t fit under the rug", Carlos Maltzahn, Seminar at VMware
- 16 Nov 2017 - "SkyhookDB - Leveraging object storage toward database elasticity in the cloud", Jeff LeFevre, DOMA Workshop 2017 (Flatiron Institute)
Publications
- Reproducible, Scalable Benchmarks for SkyhookDM using Popper, Chakraborty J. IRIS-HEP Summer 2020 Fellowship Report (12 Oct 2020).
- Scale-out Edge Storage Systems with Embedded Storage Nodes to Get Better Availability and Cost-Efficiency At the Same Time, Jianshen Liu, Matthew Leon Curry, Carlos Maltzahn, and Philip Kufeldt, 3rd USENIX Workshop on Hot Topics in Edge Computing (HotEdge ’20), Santa Clara, CA, June 25-26 2020 (26 May 2020).
- SkyhookDM: Data Processing in Ceph with Programmable Storage, Jeff LeFevre and Carlos Maltzahn, USENIX ;login: Magazine (12 May 2020).
- Is big data performance reproducible in modern cloud networks?, Alexandru Uta, Alexandru Custura, Dmitry Duplyakin, Ivo Jimenez, Jan Rellermeyer, Carlos Maltzahn, Robert Ricci, and Alexandru Iosup, 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’20), Santa Clara, CA, February 25-27 2020 (26 Feb 2020).
- Scaling databases and file APIs with programmable Ceph object storage, Jeff LeFevre and Carlos Maltzahn, 2020 Linux Storage and Filesystems Conference (Vault'20, co-located with FAST'20 and NSDI'20), Santa Clara, CA, February 24-25 2020 (24 Feb 2020).
- Popper 2.0: A Container-native Workflow Execution Engine For Testing Complex Applications and Validating Scientific Claims, Jayjeet Chakraborty, Ivo Jimenez, Carlos Maltzahn, Arshul Mansoori, and Quincy Wofford, Poster at 2020 Exascale Computing Project Annual Meeting, Houston, TX, February 3-7, 2020 (03 Feb 2020).
- Towards Physical Design Management in Storage Systems, Kathryn Dahlgren, Jeff LeFevre, Ashay Shirwadkar, Ken Iizawa, Aldrin Montana, Peter Alvaro, Carlos Maltzahn, 4th International Parallel Data Systems Workshop (PDSW 2019, co-located with SC’19), Denver, CO, November 18, 2019. (18 Nov 2019) [NSF PAR].
- SkyhookDM: Mapping Scientific Datasets to Programmable Storage, Aaron Chu, Jeff LeFevre, Carlos Maltzahn, Aldrin Montana, Peter Alvaro, Dana Robinson, and Quincey Koziol, arXiv:2007.01789 [cs.DS] (Submitted to CHEP 2019) (08 Nov 2019).
- Reproducible Computer Network Experiments: A Case Study Using Popper, Andrea David, Mariette Souppe, Ivo Jimenez, Katia Obraczka, Sam Mansfield, Kerry Veenstra, Carlos Maltzahn, 2nd International Workshop on Practical Reproducible Evaluation of Computer Systems (P-RECS, co-located with HPDC’19), Phoenix, AZ, June 24, 2019. (24 Jun 2019).
- MBWU: Benefit Quantification for Data Access Function Offloading, Jianshen Liu, Philip Kufeldt, Carlos Maltzahn, HPC I/O in the Data Center Workshop (HPC-IODC 2019, co-located with ISC-HPC 2019), Frankfurt, Germany, June 20, 2019. (20 Jun 2019).
- Skyhook: Programmable storage for databases, Jeff LeFevre, Noah Watkins, Michael Sevilla, and Carlos Maltzahn, 2019 Linux Storage and Filesystems Conference (Vault'19, co-located with FAST'19), Santa Clara, CA, February 25-26 2019 (25 Feb 2019).
- Spotting Black Swans With Ease: The Case for a Practical Reproducibility Platform, Ivo Jimenez, Carlos Maltzahn, 1st Workshop on Reproducible, Customizable and Portable Workflows for HPC (ResCuE-HPC’18, co-located with SC’18), Dallas, TX, November 11, 2018. (11 Nov 2018).
- Taming performance variability, Aleksander Maricq, Dmitry Duplyakin, Ivo Jimenez, Carlos Maltzahn, Ryan Stutsman, and Robert Ricci, 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18), Carlsbad, CA, October 8-10, 2018. (08 Oct 2018).