The sheer volume of HL-LHC events will emphasize the need for rapid data reduction and the ability to pull individual columns from complex datasets into high-speed data analysis facilities.
ServiceX is a data extraction and delivery service. Users provide a dataset identifier and a selection statement that specifies filters and columns. ServiceX brings up parallel workers to open the files in the dataset, uses experiment-approved frameworks to extract the data, and stores it in popular columnar file formats for easy analysis with familiar tooling.
- ServiceX can be co-located with datasets to provide fast and efficient data reduction strategies.
- The interface is easy to learn and the processes are extremely reliable, so analyzers can get their job done without learning complex libraries or closely monitoring hand-written batch jobs.
- Extracts data from experiment-specific data formats using approved frameworks.
- Data can be provided as popular and powerful Awkward Arrays, which integrate with the ecosystem of analysis tools within IRIS-HEP.
- Results can be written to an object store or to SkyHook Volumes.
- Results are cached locally to allow for straightforward reuse of data.
- Transformers are run out of tagged Docker images to allow for precise reproducibility.
- Works seamlessly as a data backend to the Coffea and TRExFitter analysis tools.
In a recent test on a 10 TB xAOD input sample, we requested 100 columns from 7 collections (~30% of the file). We were able to scale up to 1,000 workers on the River SSL Cluster and get results in less than 30 minutes. The output rate was in excess of 300 MB/s.
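To put those figures in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes) and that only the requested ~30% of the input had to be read within the 30-minute bound. Neither assumption is stated in the test description, so the results are order-of-magnitude estimates only.

```python
# Rough estimate of the read rates implied by the scale test quoted above.
# ASSUMPTIONS (not stated in the text): decimal units, and only the
# requested fraction of the input is actually read by the workers.
INPUT_BYTES = 10e12          # 10 TB xAOD input sample
REQUESTED_FRACTION = 0.30    # ~30% of the file content was requested
WALL_SECONDS = 30 * 60       # upper bound: "less than 30 minutes"
WORKERS = 1_000              # peak number of workers

# Volume of data that the workers needed to read.
requested_bytes = INPUT_BYTES * REQUESTED_FRACTION

# Because the run took *less* than 30 minutes, these are lower bounds.
min_read_rate = requested_bytes / WALL_SECONDS
min_read_rate_per_worker = min_read_rate / WORKERS

print(f"requested data:            {requested_bytes / 1e12:.1f} TB")
print(f"aggregate read rate (min): {min_read_rate / 1e9:.2f} GB/s")
print(f"per-worker read rate (min): {min_read_rate_per_worker / 1e6:.2f} MB/s")
```

Under these assumptions each worker only needs to sustain a modest read rate, which suggests the value of the service comes from wide parallelism rather than from fast individual workers.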
How It Works
The dataset lookup service is extensible and developers can create new services to meet the needs of a specific community. Currently, datasets can be requested using a Rucio DID or as a reference to a CERN OpenData dataset. Users can also provide a set of accessible file URLs.
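The three ways of naming a dataset can be modeled as a small tagged union. The sketch below is illustrative only: the class names, the `describe` helper, and the `rucio://` and `cernopendata://` prefixes are assumptions made for this example, not the service's actual request schema.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class RucioDataset:
    """A dataset named by a Rucio DID of the form 'scope:name'."""
    did: str


@dataclass(frozen=True)
class OpenDataDataset:
    """A dataset named by a CERN OpenData record number."""
    record_id: int


@dataclass(frozen=True)
class FileList:
    """An explicit set of accessible file URLs."""
    urls: Tuple[str, ...]


DatasetSpec = Union[RucioDataset, OpenDataDataset, FileList]


def describe(spec: DatasetSpec) -> dict:
    """Turn a dataset spec into a (hypothetical) request fragment."""
    if isinstance(spec, RucioDataset):
        scope, sep, name = spec.did.partition(":")
        if not (sep and scope and name):
            raise ValueError(f"expected 'scope:name', got {spec.did!r}")
        return {"did": f"rucio://{spec.did}"}       # prefix is an assumption
    if isinstance(spec, OpenDataDataset):
        return {"did": f"cernopendata://{spec.record_id}"}  # assumption
    if isinstance(spec, FileList):
        if not spec.urls:
            raise ValueError("file list must not be empty")
        return {"file-list": list(spec.urls)}       # key name is an assumption
    raise TypeError(f"unsupported dataset spec: {type(spec).__name__}")


print(describe(RucioDataset("example_scope:example_dataset")))
print(describe(FileList(("root://host.example.org//data/a.root",))))
```

Keeping each lookup style as its own type mirrors the extensibility described above: a new lookup service would add one more class and one more branch.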
The files located by the lookup service are passed to transformer workers, which are spun up to filter data and extract columns. These transformer workers run out of Docker images that are based on experiment-approved images. Their operation is controlled by a low-level selection language called Qastle. This language allows for filtering of events as well as data projections and simple calculations of derived values.
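The three kinds of operation a selection can express (filtering, projection, and simple derived values) can be illustrated in plain Python. This is not Qastle and not transformer code; the event layout, the `transform` function, and the 30 GeV threshold are all made up for the example.

```python
import math

# Toy events: each has a list of jets with transverse momentum and
# pseudorapidity. The layout is hypothetical.
events = [
    {"run": 1, "jets": [{"pt": 52.0, "eta": 0.4}, {"pt": 18.0, "eta": -1.9}]},
    {"run": 1, "jets": [{"pt": 12.0, "eta": 2.1}]},
    {"run": 2, "jets": [{"pt": 75.0, "eta": -0.2}, {"pt": 41.0, "eta": 1.1}]},
]


def transform(events, min_pt=30.0):
    """Filter jets and events, project one column, derive another."""
    columns = {"jet_pt": [], "jet_pz": []}
    for event in events:
        # Object-level filter: keep only jets above the threshold.
        jets = [jet for jet in event["jets"] if jet["pt"] > min_pt]
        # Event-level filter: drop events with no surviving jets.
        if not jets:
            continue
        # Projection: keep only the requested column.
        columns["jet_pt"].append([jet["pt"] for jet in jets])
        # Derived value: longitudinal momentum, pz = pt * sinh(eta).
        columns["jet_pz"].append(
            [jet["pt"] * math.sinh(jet["eta"]) for jet in jets]
        )
    return columns


result = transform(events)
print(result["jet_pt"])
```

The output is one jagged list per column, which is the shape that columnar formats and Awkward Arrays are designed to hold.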
Qastle is useful for concisely communicating these transform requests; however, it is not intended for end users. Instead, translators transpile high-level selection languages into Qastle. Transpilers currently exist for func_adl and for the TCut language.
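To give a feel for what such a transpiler does, the following minimal sketch walks a Python lambda with the standard `ast` module and emits a compact, Lisp-like text form. The output syntax and the `transpile` and `to_sexpr` names are invented for this illustration; they are not the real Qastle grammar or the func_adl implementation.

```python
import ast

# Operator tables for the small subset of Python handled by this sketch.
_BINOPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}
_CMPOPS = {ast.Gt: ">", ast.Lt: "<", ast.GtE: ">=", ast.LtE: "<=",
           ast.Eq: "==", ast.NotEq: "!="}


def to_sexpr(node):
    """Render a Python AST node as an illustrative s-expression string."""
    if isinstance(node, ast.Expression):
        return to_sexpr(node.body)
    if isinstance(node, ast.Lambda):
        args = " ".join(arg.arg for arg in node.args.args)
        return f"(lambda (list {args}) {to_sexpr(node.body)})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    if isinstance(node, ast.Attribute):
        return f"(attr {to_sexpr(node.value)} '{node.attr}')"
    if isinstance(node, ast.Call):
        parts = [to_sexpr(node.func)] + [to_sexpr(a) for a in node.args]
        return "(call " + " ".join(parts) + ")"
    if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
        op = _BINOPS[type(node.op)]
        return f"({op} {to_sexpr(node.left)} {to_sexpr(node.right)})"
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in _CMPOPS):
        op = _CMPOPS[type(node.ops[0])]
        return f"({op} {to_sexpr(node.left)} {to_sexpr(node.comparators[0])})"
    raise NotImplementedError(f"unsupported syntax: {type(node).__name__}")


def transpile(source: str) -> str:
    """Parse a Python expression and return its s-expression text."""
    return to_sexpr(ast.parse(source, mode="eval"))


print(transpile("lambda j: j.pt() / 1000.0 > 30"))
```

A compact text form like this is easy to send over the wire and to parse on the server, which is why a low-level language works well as the exchange format even though people rarely write it by hand.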
ServiceX uses a microservice architecture that runs on a Kubernetes cluster. Users interact with the service via a REST interface provided by a Flask app. The app also provides orchestration of the other components.
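As a sketch of what a client-side call to such a REST interface could look like, the snippet below builds (but does not send) a JSON POST request using only the standard library. The host name, the endpoint path, and every field name in the payload are assumptions for illustration; consult the ServiceX documentation for the actual API.

```python
import json
import urllib.request

# Hypothetical deployment URL; not a real ServiceX instance.
SERVICEX_URL = "https://servicex.example.org"


def build_transform_request(did, selection, workers=10):
    """Build a POST request describing a transformation.

    The endpoint path and all payload keys are assumed names.
    """
    payload = {
        "did": did,                             # dataset identifier
        "selection": selection,                 # low-level selection text
        "result-format": "parquet",             # columnar output format
        "result-destination": "object-store",   # where results are written
        "workers": workers,                     # requested parallelism
    }
    return urllib.request.Request(
        url=f"{SERVICEX_URL}/servicex/transformation",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_transform_request(
    did="rucio://example_scope:example_dataset",
    selection="(hypothetical selection text)",
)
# Inspect the request without contacting any server.
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

In practice users would normally go through a client library rather than constructing requests by hand; the point here is only that a single small JSON document carries the dataset identifier and the selection.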
The Qastle queries are translated into event framework code by code generator services. These can generate C++ or Python Uproot code which is executed by the transformers.
The transformers are launched as an auto-scaling job by the Flask app. Each worker is fed from a transactional RabbitMQ queue.
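The worker-and-queue pattern can be imitated in a single process with the standard library. In this sketch `queue.Queue` and threads stand in for RabbitMQ and the transformer pods, and a file that fails is put back on the queue, which approximates a message being redelivered when it is not acknowledged. The `run_workers` function, the retry limit, and the file URLs are all hypothetical.

```python
import queue
import threading


def run_workers(file_urls, process, n_workers=4, max_attempts=3):
    """Process each URL with a pool of workers, retrying on failure."""
    work = queue.Queue()
    for url in file_urls:
        work.put((url, 1))          # (file, attempt number)

    results, failures = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:        # sentinel: no more work
                work.task_done()
                return
            url, attempt = item
            try:
                value = process(url)
            except Exception as exc:
                if attempt < max_attempts:
                    # Not acknowledged: hand the file back to the queue.
                    work.put((url, attempt + 1))
                else:
                    with lock:
                        failures[url] = repr(exc)
            else:
                with lock:
                    results[url] = value
            finally:
                # Mark this delivery as handled (success, retry, or give up).
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    work.join()                     # wait until every delivery is handled
    for _ in threads:
        work.put(None)              # tell each worker to exit
    for thread in threads:
        thread.join()
    return results, failures


results, failures = run_workers(
    ["root://host.example.org//a.root", "root://host.example.org//b.root"],
    process=len,                    # stand-in for the real transformation
    n_workers=2,
)
print(results, failures)
```

Because a failed file returns to the queue before its delivery is marked as handled, no file is silently dropped if a worker hits a transient error, which is the property a transactional queue provides.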
Full documentation is available on the ServiceX ReadTheDocs pages.
ServiceX talk at HSF DAWG – DOMA Access meeting
Keep in Touch
Sign up for the ServiceX Mailing List to be informed of new releases of the system.
- Andrew Eckart
- Ben Galewsky
- Rob Gardner
- Lindsey Gray
- Mark Neubauer
- Jim Pivarski
- Ilija Vukotic
- Gordon Watts
- Marc Weinberg
- Michael Johnson
- 15 Sep 2022 - "End-to-end physics analysis with Open Data: the Analysis Grand Challenge", Alexander Held, PyHEP 2022 (virtual) Workshop
- 23 May 2022 - "Analysis user experience with the Python HEP ecosystem", Jim Pivarski, Analysis Ecosystems Workshop II
- 25 Apr 2022 - "From data delivery to statistical inference with CMS Open Data", Alexander Held, IRIS-HEP AGC Tools 2022 Workshop
- 30 Nov 2021 - "ServiceX - Making everything Columnar", Gordon Watts, ACAT 2021
- 3 Nov 2021 - "From data delivery to statistical inference: ServiceX, coffea, cabinetry & pyhf", Alexander Held, IRIS-HEP AGC Tools 2021 Workshop
- 9 Jul 2021 - "Using Python, coffea, and ServiceX to Rediscover the Higgs. Twice.", Gordon Watts, PyHEP 2021
- 26 Oct 2020 - "Data Selection & Delivery for Analysis: ServiceX & funcADL", Mason Proffitt, Future Analysis Systems and Facilities
- 16 Oct 2020 - "ServiceX Front End Status", Gordon Watts, ServiceX Meeting
- 2 Oct 2020 - "ServiceX Front End Status", Gordon Watts, ServiceX Meeting
- 8 Jul 2020 - "ServiceX and Kubernetes", Ben Galewsky, LHCG Grid Deployment Board meeting
- 26 May 2020 - "Parallel Sessions: Plans and Goals", Gordon Watts, IRIS-HEP Team Retreat
- 4 Nov 2019 - "A Distributed, Caching, Columnar Data Delivery Service(X)", Ben Galewsky, CHEP 2019 Conference
- 19 Jun 2019 - "ServiceX", Ben Galewsky, Analysis Systems Topical Workshop
- Towards Real-World Applications of ServiceX, an Analysis Data Transformation System, K. Choi, A. Eckart, B. Galewsky, R. Gardner, M. Neubauer, P. Onyisi, M. Proffitt, I. Vukotic and G. Watts (23 Aug 2021).
- ServiceX A Distributed, Caching, Columnar Data Delivery Service, B. Galewsky, R. Gardner, L. Gray, M. Neubauer, J. Pivarski, M. Proffitt, I. Vukotic, G. Watts, M. Weinberg EPJ Web Conf. 245 04043 (2020) DOI: 10.1051/epjconf/202024504043 (16 Nov 2020).
- Snowmass 2021 Letter of Interest: Analysis Ecosystem at the HL-LHC, G. Watts Snowmass 2021 Letter of Interest (10 Sep 2020).