Skyhook Data Management

Overview

The HL-LHC will challenge existing storage systems in terms of both data volume and velocity: analysis datasets will grow significantly, and users will demand higher event rates to improve time-to-insight. The SkyhookDM project is an investment in computational storage, built on the insight that the data organization of HEP files is highly structured. The service recognizes the layout of files and “pushes down” structured queries from client to server, taking advantage of the computational capacity of the storage hardware and significantly reducing data movement.

Combining data management with storage also creates the opportunity for new services that avoid dataset copies and thereby save significant storage space. Data management-enabled storage systems can provide views that combine parts of multiple datasets: for HEP this means that columns from one table can be combined with columns from a different table without creating copies. For this to work, these storage systems need to store sufficient metadata and naming conventions for datasets, which makes them a natural place for maintaining this metadata and serving it to other tools in convenient formats.

Skyhook Data Management is an extension of the Ceph open source distributed storage system for the scalable storage of tables and for offloading common data management operations on them, including selection, projection, aggregation, and indexing, as well as user-defined functions. The goal of SkyhookDM is to transparently scale out data management operations across many storage servers, leveraging the scale-out and availability properties of Ceph while significantly reducing the CPU cycles and interconnect bandwidth spent on unnecessary data transfers. The SkyhookDM architecture is also designed to transparently optimize for future storage devices of increasing heterogeneity and specialization. All data movement from the Ceph OSDs to the client happens in Apache Arrow format.
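
To illustrate what this looks like from the client side, here is a minimal sketch of querying a Skyhook-managed dataset through the Arrow Dataset API. The import location and constructor arguments of SkyhookFileFormat, the dataset path, and the column names are illustrative and may differ across Arrow and Skyhook releases.

    import pyarrow.dataset as ds
    from pyarrow.dataset import SkyhookFileFormat  # import location may vary by release

    # Point the Arrow Dataset API at a CephFS-backed directory of Parquet
    # files; SkyhookFileFormat routes scans to the object-class plugins in
    # the Ceph OSDs instead of pulling raw bytes back to the client.
    dataset = ds.dataset(
        "file:///mnt/cephfs/nyc-taxi",  # hypothetical CephFS mount
        format=SkyhookFileFormat("parquet", "/etc/ceph/ceph.conf"),
    )

    # The projection and filter are evaluated inside the storage servers;
    # only the matching Arrow record batches travel over the network.
    table = dataset.to_table(
        columns=["fare_amount", "trip_distance"],
        filter=ds.field("trip_distance") > 20.0,
    )
    print(table.num_rows)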

SkyhookDM is currently an incubator project at the Center for Research on Open Source Software at the University of California Santa Cruz.

Salient Features

  • Enables pushing down filters, projections, and compute operations to the storage backend, minimizing data transfer over the network and enabling linear scalability.

  • Allows storing data in Parquet files, minimizing disk I/O through predicate and projection pushdown.

  • Allows writing files via a POSIX filesystem interface (see the sketch after this list).

  • Minimal deployment overhead via either Rook or ceph-deploy.

  • Plugs seamlessly into the Arrow Dataset API and leverages all its functionality, such as dataset discovery and partition pruning.

  • Works with the latest Apache Arrow and Ceph releases.
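
Because SkyhookDM operates on plain Parquet files, populating a dataset can be as simple as writing through a CephFS mount, as sketched below; the mount point, file name, and columns are made up for the example.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small Arrow table and write it as Parquet through the POSIX
    # interface of CephFS; Skyhook can later scan the same file in place.
    table = pa.table({
        "event_id": [1, 2, 3],
        "pt": [25.3, 40.1, 13.7],
    })
    pq.write_table(table, "/mnt/cephfs/dataset/part-0.parquet")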

Architecture

[Figure: SkyhookDM architecture]

In the storage layer, we extend the Ceph object store with plugins built using the Object Class SDK that allow scanning objects containing Parquet data inside the Ceph OSDs. We use the Apache Arrow framework to build the data processing logic in the plugins. On the client side, we extend CephFS with a SkyhookDirectObjectAccess API that allows invoking Object Class methods on RADOS objects to perform query operations. We expose our implementation by creating a new FileFormat in Apache Arrow, called SkyhookFileFormat, that uses the SkyhookDirectObjectAccess API to offload Parquet file scanning to the storage layer.
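
Underneath, this relies on Ceph’s object-class dispatch: a client asks the OSD hosting an object to run a named method of a registered class against it. The sketch below shows that call path with the stock python-rados binding; the pool, object, class, and method names are hypothetical stand-ins for what the SkyhookDirectObjectAccess API does internally.

    import rados

    # Connect to the Ceph cluster and open an I/O context on the data pool.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("cephfs_data")  # pool name is illustrative

    # Ioctx.execute() runs a registered object-class method on the OSD that
    # stores the object, so the computation happens next to the data. The
    # "skyhook" class and "scan_op" method below are hypothetical.
    request = b"serialized scan request: filter + projection"
    ret, reply = ioctx.execute("obj.0000", "skyhook", "scan_op", request)

    # 'reply' would carry the scan result, e.g. an Arrow IPC buffer.
    ioctx.close()
    cluster.shutdown()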

Performance Evaluation

[Figure: Query latency of the Parquet and Skyhook file formats at 1%, 10%, and 100% selectivity]

We compare the query latency of filtering a 1.2-billion-row dataset via the Parquet and Skyhook file formats at 1%, 10%, and 100% row selectivity. As the plot above shows, Parquet performance does not improve when scaling out from 4 to 16 nodes because it remains bottlenecked on the client’s CPUs. Skyhook, on the other hand, distributes CPU consumption among the storage nodes and scales out almost linearly.
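
A comparison along these lines might be scripted as below: the same scan is timed once with the stock Parquet format (scan on the client) and once with the Skyhook format (scan offloaded to the OSDs). The dataset path, filter, and SkyhookFileFormat arguments are assumptions carried over from the earlier sketch.

    import time

    import pyarrow.dataset as ds
    from pyarrow.dataset import SkyhookFileFormat  # import location may vary by release

    PATH = "file:///mnt/cephfs/nyc-taxi"  # hypothetical CephFS-backed dataset

    formats = {
        "parquet": "parquet",  # scan executes on the client
        "skyhook": SkyhookFileFormat("parquet", "/etc/ceph/ceph.conf"),  # scan on the OSDs
    }

    for name, fmt in formats.items():
        dataset = ds.dataset(PATH, format=fmt)
        start = time.perf_counter()
        table = dataset.to_table(filter=ds.field("trip_distance") > 20.0)
        print(f"{name}: {table.num_rows} rows in {time.perf_counter() - start:.2f} s")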

[Figure: CPU utilization for Parquet (top) and Skyhook (bottom)]

The two plots above show how Parquet (top) stays bottlenecked on the client CPU, while Skyhook (bottom) distributes CPU usage across the storage nodes and allows scaling out.

Ongoing Work

  • Working on deploying Skyhook in the UNL and SSL clusters.

  • Working on making the Coffea-Skyhook integration more user-friendly.

  • Working on joining HEP datasets using DuckDB/Arrow/Fugue (see the sketch after this list).

  • Building middleware to allow writing Parquet files containing NanoEvents from ServiceX to SkyhookDM via CephFS.
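
As a rough illustration of the DuckDB/Arrow join direction, the snippet below registers two Arrow tables with DuckDB and joins them on an event identifier; the tables and the join key are invented for the example.

    import duckdb
    import pyarrow as pa

    # Two hypothetical per-event tables sharing an event_id key.
    events = pa.table({"event_id": [1, 2, 3], "met": [45.0, 12.5, 80.2]})
    muons = pa.table({"event_id": [1, 3], "n_muons": [2, 1]})

    con = duckdb.connect()
    con.register("events", events)
    con.register("muons", muons)

    # The join runs inside DuckDB and the result comes back as Arrow.
    joined = con.execute(
        "SELECT e.event_id, e.met, m.n_muons "
        "FROM events e JOIN muons m USING (event_id)"
    ).arrow()
    print(joined)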
