The sheer volume of events expected at the HL-LHC will heighten the need for rapid data reduction and the ability to pull individual columns from complex datasets into high-speed data analysis facilities.

ServiceX is a data extraction and delivery service. Users provide a dataset identifier and a selection statement that specifies filters and columns. ServiceX brings up parallel workers to open the files in the dataset, uses experiment-approved frameworks to extract the data, and stores it in popular columnar file formats for easy analysis with familiar tooling.
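
As a quick illustration, the sketch below shows what such a request could look like from the Python servicex frontend. The dataset DID, the selection string, and the exact method name are illustrative assumptions, not a verbatim API reference.

    # A minimal sketch of a ServiceX request, assuming the Python `servicex`
    # frontend's ServiceXDataset interface; the DID and selection string are
    # illustrative placeholders.
    from servicex import ServiceXDataset

    ds = ServiceXDataset("mc16_13TeV:some.rucio.dataset")  # hypothetical Rucio DID
    # The selection is a qastle expression, normally produced by a higher-level
    # translator such as func_adl (see "How It Works" below).
    data = ds.get_data_awkward("(call EventDataset)")  # placeholder selection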

Benefits

  • ServiceX can be co-located with datasets to provide fast and efficient data reduction strategies.
  • The interface is easy to learn and the service is highly reliable, making it easy for analyzers to get their job done without needing to learn complex libraries or closely monitor hand-written batch jobs.
  • Extracts data from experiment-specific data formats using approved frameworks.
  • Data can be provided as popular and powerful Awkward Arrays, which integrate with the ecosystem of analysis tools within IRIS-HEP.
  • Results can be written to an object store or to SkyHook volumes.
  • Results are cached locally to allow straightforward reuse of data.
  • Transformers run out of tagged Docker images to allow for precise reproducibility.
  • Works seamlessly as a data backend to the Coffea and TRExFitter analysis tools.

Performance

Recent testing used a 10 TB xAOD input sample from which we requested 100 columns from 7 collections (~30% of the file). We were able to scale up to 1,000 workers on the River SSL cluster and obtain results in less than 30 minutes, with an output rate in excess of 300 MB/s.

How It Works

The dataset lookup service is extensible and developers can create new services to meet the needs of a specific community. Currently, datasets can be requested using a Rucio DID or as a reference to a CERN OpenData dataset. Users can also provide a set of accessible file URLs.
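
For illustration, the assignments below show the three forms such a request might take; the exact string formats are assumptions rather than examples copied from the documentation.

    # Illustrative dataset identifiers for the lookup service; the exact
    # string formats here are assumptions.
    rucio_did = "mc16_13TeV:mc16_13TeV.123456.sample.DAOD_PHYS"  # Rucio scope:name
    open_data = "cernopendata://1507"  # hypothetical CERN OpenData record reference
    file_urls = ["root://eospublic.cern.ch//eos/opendata/sample/file1.root"]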

The files located by the lookup service are passed to transformer workers, which are spun up to filter data and extract columns. These transformer workers run out of Docker images based on experiment-approved images. Their operation is controlled by a low-level selection language called Qastle, which allows for filtering of events as well as data projections and simple calculations of derived values.
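
The snippet below is illustrative only: it sketches with Awkward Array the kind of event filter, column projection, and derived-value calculation a transformer performs. The real transformer code is generated per request.

    # Not actual transformer code; a conceptual sketch of its operations.
    import awkward as ak

    events = ak.Array({
        "met": [10.0, 42.0, 55.0],
        "jet_pt": [[25.0, 31.0], [40.0], [62.0, 48.0, 33.0]],
    })

    selected = events[events.met > 25.0]       # event-level filter
    jet_pt = selected.jet_pt                   # column projection
    lead_pt = ak.max(selected.jet_pt, axis=1)  # simple derived value per event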

Qastle is useful for concisely communicating these transform requests; however, it is not intended for end users. Instead, translators transpile high-level selection languages into Qastle. There are currently transpilers for func_adl and for the TCut selection language.
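
As a sketch of the high-level experience, a func_adl query might look like the following; the source class and the exact chaining methods are assumptions in the func_adl style rather than a verbatim API reference.

    # A hedged sketch of a func_adl selection; ServiceXSourceUpROOT and the
    # method names are assumptions about the frontend API.
    from func_adl_servicex import ServiceXSourceUpROOT

    src = ServiceXSourceUpROOT("user.sample.dataset", "Events")  # hypothetical DID/tree
    data = (
        src.Where(lambda e: e.met > 25.0)           # event filter
           .Select(lambda e: {"jet_pt": e.jet_pt})  # column projection
           .AsAwkwardArray()                        # transpiled to Qastle and submitted
           .value()
    )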

Architecture

ServiceX uses a micro-service architecture that runs on a Kubernetes cluster. Users interact with the service via a REST interface exposed by a Flask app, which also orchestrates the other components.
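
To make the Flask app's role concrete, the sketch below shows what a raw REST submission might look like; the endpoint path and JSON fields are assumptions about the API, and most users would go through the Python client instead.

    # A hedged sketch of a raw REST request; endpoint and fields are assumed.
    import requests

    body = {
        "did": "mc16_13TeV:some.rucio.dataset",  # hypothetical dataset DID
        "selection": "(call EventDataset)",      # placeholder qastle selection
        "result-destination": "object-store",
        "result-format": "parquet",
    }
    resp = requests.post(
        "https://servicex.example.org/servicex/transformation",  # hypothetical host
        json=body,
    )
    print(resp.json())  # e.g. the id assigned to this transform request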

The Qastle queries are translated into event framework code by code generator services. These can generate C++ or Python uproot code, which is executed by the transformers.
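
The sketch below illustrates the shape of the Python/uproot code a generator service might emit for a transformer; the tree name, columns, and cut are placeholders, since real generated code is specialized to each request.

    # Illustrative only: names and the selection here are placeholders.
    import awkward as ak
    import uproot

    def transform(file_path: str, out_path: str) -> None:
        with uproot.open(file_path) as f:
            arrays = f["Events"].arrays(["met", "jet_pt"])  # requested columns only
        selected = arrays[arrays.met > 25.0]                # generated event filter
        ak.to_parquet(selected, out_path)                   # columnar output format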

The transformers are launched as an auto-scaling job by the Flask app. Each worker is fed from a transactional RabbitMQ queue.
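
A minimal sketch of that consumption loop follows, assuming pika and a JSON message carrying a file reference; the queue name and message schema are assumptions.

    # Hedged sketch of a transformer worker consuming from RabbitMQ.
    import json
    import pika

    def process_file(task: dict) -> None:
        """Run the generated transform on one input file (elided here)."""

    def on_message(channel, method, properties, body):
        task = json.loads(body)  # e.g. {"file-path": "root://host//path.root"}
        process_file(task)
        # Acknowledge only after success, so an unacknowledged message from a
        # failed worker is redelivered to another transformer.
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq"))
    channel = connection.channel()
    channel.basic_consume(queue="transform-requests", on_message_callback=on_message)
    channel.start_consuming()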

Reference

Full documentation is available at the ServiceX ReadTheDocs pages.

ServiceX poster presented at CHEP 2019

ServiceX talk at HSF DAWG – DOMA Access meeting

Keep in Touch

Sign up for the ServiceX Mailing List to be informed of new releases of the system.
