ServiceX is a data extraction and delivery service. Users provide a dataset identifier and a selection statement that specifies filters and columns. ServiceX brings up parallel workers to open the files in the dataset and uses experiment-approved frameworks to extract the data and store it in popular columnar file formats for easy analysis with familiar tooling.
- The service can be co-located with datasets to provide fast and efficient data reduction.
- The interface is easy to learn and the service is highly reliable, so analyzers can get their work done without learning complex libraries or babysitting batch jobs.
- Extracts data from experiment-specific data formats using approved frameworks.
- Data can be delivered as popular and powerful Awkward Arrays.
- Results are cached to allow for easy re-analysis.
- Transformers run from tagged Docker images to allow for precise reproducibility.
- Works seamlessly as a data backend to the Coffea and TRExFitter analysis tools.
In recent testing on a 10 TB xAOD input sample, we requested 100 columns from 7 collections (~30% of the file). We scaled up to 1,000 workers on the River SSL cluster and obtained results in under 30 minutes, with an output rate in excess of 300 MB/s.
How It Works
The dataset lookup service is extensible and developers can create new services to meet the needs of a specific community. Currently, datasets can be requested using a Rucio DID or as a reference to a CERN OpenData dataset. Users can also provide a set of accessible file URLs.
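The extensibility described above can be pictured as a set of pluggable resolvers, one per identifier scheme. The sketch below is illustrative only (class and method names are hypothetical, not the actual ServiceX lookup-service API); adding support for a new community means adding one more resolver class.

```python
from typing import List, Protocol

class DatasetResolver(Protocol):
    """Interface a community-specific lookup backend implements (hypothetical)."""
    def can_resolve(self, identifier: str) -> bool: ...
    def resolve(self, identifier: str) -> List[str]: ...

class FileListResolver:
    """Handles user-supplied, directly accessible file URLs."""
    def can_resolve(self, identifier: str) -> bool:
        return identifier.startswith(("root://", "http://", "https://"))

    def resolve(self, identifier: str) -> List[str]:
        return [identifier]

class RucioResolver:
    """Placeholder for a Rucio DID (scope:name) lookup."""
    def can_resolve(self, identifier: str) -> bool:
        return ":" in identifier and "//" not in identifier

    def resolve(self, identifier: str) -> List[str]:
        raise NotImplementedError("would query Rucio for file replicas")

def locate(identifier: str, resolvers) -> List[str]:
    # Try each registered resolver in turn; a new dataset source is
    # supported by registering one more resolver, nothing else changes.
    for resolver in resolvers:
        if resolver.can_resolve(identifier):
            return resolver.resolve(identifier)
    raise ValueError(f"no resolver for {identifier!r}")

print(locate("root://host.example//store/file.root",
             [RucioResolver(), FileListResolver()]))
```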
The files located by the lookup service are passed to transformer workers, which are spun up to filter data and extract columns. These transformer workers run from Docker images based on experiment-approved images. Their operation is controlled by a low-level selection language called Qastle, which allows for filtering of events as well as data projections and simple calculations of derived values.
Qastle is useful for concisely communicating these transform requests; however, it is not intended for end users. Instead, translators transpile high-level selection languages into it. There are currently transpilers for func_adl and for the TCut language.
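Qastle queries are lisp-style s-expressions, which makes them compact to transmit and trivial to parse. The tiny parser below illustrates that shape; the example query is illustrative of Qastle's style, not a verified ServiceX query.

```python
# Minimal s-expression parser, enough to show Qastle's lisp-like form.
def parse(text):
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(i):
        if tokens[i] == "(":
            node, i = [], i + 1
            while tokens[i] != ")":
                child, i = read(i)
                node.append(child)
            return node, i + 1       # skip the closing ")"
        return tokens[i], i + 1      # atom

    node, _ = read(0)
    return node

# Illustrative query: select one attribute from each event.
query = "(Select (EventDataset) (lambda (list e) (attr e 'met')))"
print(parse(query))
```

The nested-list result is what a transpiler for a high-level language like func_adl would emit, and what the code generators consume.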
ServiceX uses a micro-service architecture that runs on a Kubernetes cluster. Users interact with the service via a REST interface provided by a Flask app, which also orchestrates the other components.
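A transform request submitted to that REST interface can be pictured as a JSON document pairing a dataset identifier with a selection. The field names below are illustrative only, not the actual ServiceX REST schema:

```python
import json

# Hypothetical transform-request body (field names are illustrative,
# not the real ServiceX REST schema).
request = {
    "did": "user.example:example.dataset",        # Rucio DID (scope:name)
    "selection": "(Select (EventDataset) (attr e 'met'))",  # Qastle-style text
    "result-format": "parquet",                   # desired columnar output
    "image": "example/transformer:1.0.0",         # tagged Docker image
}
payload = json.dumps(request)
print(payload)
```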
The Qastle queries are translated into event-framework code by code-generator services, which can generate C++ or Python uproot code that is executed by the transformers.
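The code-generation step can be sketched as a function that turns a parsed selection into transformer source text. This is a conceptual illustration only (the real generators emit complete transformer code from Qastle; the function and the `input_file`/`output_file` names it assumes are hypothetical):

```python
# Illustrative sketch of the Python/uproot code-generation idea:
# turn a tree name, column list, and optional cut into source text.
def generate_uproot_code(tree, columns, cut=None):
    lines = [
        "import uproot",
        "import awkward as ak",
        # input_file / output_file are assumed to be bound by the
        # transformer harness that runs the generated code.
        f"arrays = uproot.open(input_file)[{tree!r}].arrays({columns!r})",
    ]
    if cut:
        lines.append(f"arrays = arrays[{cut}]")
    lines.append("ak.to_parquet(arrays, output_file)")
    return "\n".join(lines)

src = generate_uproot_code("Events", ["met_pt", "jet_pt"], "arrays.met_pt > 50")
print(src)
```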
The transformers are launched as an auto-scaling job by the Flask app. Each worker is fed from a transactional RabbitMQ queue.
Full documentation is available at servicex.readthedocs
Team
- Andrew Eckart
- Ben Galewsky
- Rob Gardner
- Lindsey Gray
- Mark Neubauer
- Jim Pivarski
- Ilija Vukotic
- Gordon Watts
- Marc Weinberg
Presentations
- 9 Jul 2021 - "Using Python, coffea, and ServiceX to Rediscover the Higgs. Twice.", Gordon Watts, PyHEP 2021
- 16 Oct 2020 - "ServiceX Front End Status", Gordon Watts, ServiceX Meeting
- 2 Oct 2020 - "ServiceX Front End Status", Gordon Watts, ServiceX Meeting
- 8 Jul 2020 - "ServiceX and Kubernetes", Ben Galewsky, WLCG Grid Deployment Board meeting
- 26 May 2020 - "Parallel Sessions: Plans and Goals", Gordon Watts, IRIS-HEP Team Retreat
- 4 Nov 2019 - "A Distributed, Caching, Columnar Data Delivery Service(X)", Ben Galewsky, CHEP 2019 Conference
- 19 Jun 2019 - "ServiceX", Ben Galewsky, Analysis Systems Topical Workshop
Publications
- B. Galewsky, R. Gardner, L. Gray, M. Neubauer, J. Pivarski, M. Proffitt, I. Vukotic, G. Watts, M. Weinberg, "ServiceX: A Distributed, Caching, Columnar Data Delivery Service", EPJ Web Conf. 245, 04043 (2020). DOI: 10.1051/epjconf/202024504043 (16 Nov 2020).
- G. Watts, "Snowmass 2021 Letter of Interest: Analysis Ecosystem at the HL-LHC", Snowmass 2021 Letter of Interest (10 Sep 2020).