HEP Tables
Archived
hep-tables presents the physicist with a uniform interface for data query and histogramming. It coordinates access to services ranging from data delivery to local distributed clusters, removing the need for the user to write this boilerplate and interface code.
- Data is fetched from `ServiceX`.
- That data is processed by `coffea` and similar tools using `awkward` array.
- The data arriving back from `ServiceX` is distributed to DASK for faster processing.
- Fits into the same ecosystem that tools like `pyhf` and [cabinetry](https://iris-hep.org/projects/cabinetry.html) inhabit.
Further, it does this with a fairly straightforward array-like syntax:
- Everything from the initial dataset to the final histogram is specified in a coherent and unified way.
- Syntax is inspired by `pandas` and `numpy` array syntax (see the sketch after this list).
- Supports awkward array usage as well. Awkward array is the standard language for manipulating the jagged, event-structured data common in HEP.
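As a point of reference, the kind of numpy/pandas-style jagged-array selection this syntax targets looks like the following when written directly against `awkward` on toy data. This is plain awkward array code, not the hep-tables API itself; the field name `jet_pt` and the 30 GeV cut are made up for illustration.

```python
import awkward as ak

# Toy jagged "events": one variable-length list of jet pT values (GeV) per event.
events = ak.Array({
    "jet_pt": [[52.0, 31.5, 8.2], [17.0], [95.3, 40.1]],
})

# numpy/pandas-style mask: keep jets with pT > 30 GeV; the per-event structure is preserved.
good_jet_pt = events.jet_pt[events.jet_pt > 30]

# Flatten across events, e.g. to fill a histogram.
print(ak.to_list(ak.flatten(good_jet_pt)))  # [52.0, 31.5, 95.3, 40.1]
```

hep-tables aims to let the user write the selection in this style once, while the data fetch, distribution, and histogramming behind it are handled by the services listed above.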
Features
- Basic array processing features
- Lambda variable capture to allow for multi-object relationships (see the sketch after this list)
- Basic histogramming
- Uses ServiceX, `awkward`, and `coffea` as back ends
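To make the multi-object-relationship feature concrete, here is the same kind of relationship written directly in plain `awkward` (not the hep-tables lambda syntax itself): every jet in an event is related to every electron, and close electrons are counted per jet. The toy fields and the ΔR < 0.4 cut are illustrative assumptions.

```python
import awkward as ak
import numpy as np

# Toy events with jets and electrons, each carrying eta/phi for a ΔR-style match.
events = ak.Array({
    "jets": [
        [{"eta": 0.10, "phi": 0.20}, {"eta": -1.0, "phi": 2.0}],
        [{"eta": 0.50, "phi": -0.50}],
    ],
    "electrons": [
        [{"eta": 0.15, "phi": 0.25}],
        [{"eta": 2.00, "phi": 1.00}],
    ],
})

# All jet-electron pairs per event; nested=True keeps one sub-list of pairs per jet.
pairs = ak.cartesian({"j": events.jets, "e": events.electrons}, nested=True)
deta = pairs.j.eta - pairs.e.eta
dphi = (pairs.j.phi - pairs.e.phi + np.pi) % (2 * np.pi) - np.pi
dr = np.sqrt(deta**2 + dphi**2)

# For each jet, count the electrons within ΔR < 0.4.
n_close = ak.sum(dr < 0.4, axis=-1)
print(ak.to_list(n_close))  # [[1, 0], [0]]
```

The lambda-capture feature is meant to express this kind of per-jet, per-electron relationship inside the uniform query syntax rather than in hand-written combinatorics.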
Road map
It is best to check the repositories listed below for the most recent status, but some planned future projects include:
- Integration of `coffea` as a backend processor to default to multi-CPU/processor work.
- Ability to run `numba` (or `numba`-like) code.
- Ability to run C++ code.
- Running in a facility, with the user needing only a very simple, lightweight Python front-end package.
- Add `skyhook` as a backend for caching and fast processing close to the data.
Status
At the moment this is a prototype package. Its development is being driven by the requirements of an analysis in ATLAS. The first version has been implemented, and we are now taking a step back to understand how best to drive this work forward.
Repositories and links
- Some initial documentation exists in the form of a tour to show off what it can do.
- Three packages make up this project currently:
  - `dataframe_expressions` - The user-facing API; it converts array expressions into ASTs, which the other packages then interpret in order to execute or act on the user's intent. Includes support for leaf referencing, slicing, lambda functions, and `numpy` integration (a minimal illustration of this expression-capture idea follows this list).
  - `hep_tables` - Interprets a `dataframe` expression and converts it to `func_adl` to be executed on a ServiceX `DatasetSource`. It can only interpret as much as `func_adl` (or ServiceX) can do: return data from the service.
  - `hl_tables` - A plug-in architecture that allows multiple back ends for execution. Currently supports `hep_tables` to run data fetches and basic queries, and also an immediate `awkward` array processor. The array processor can generate histograms, among other things.
- The three packages are being reorganized such that `hep_tables` will be the high-level package, `hl_tables` will be retired, and plug-ins will be built in separate repositories.
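For flavor, the expression-capture idea behind `dataframe_expressions` can be sketched in a few lines of plain Python: operator overloads record what the user wrote as an AST instead of evaluating it, and a backend interprets that tree later. The class below is a hypothetical, stripped-down illustration of the technique, not the actual `dataframe_expressions` implementation.

```python
import ast

class DataFrame:
    """Hypothetical stand-in: records operations as a Python AST instead of executing them."""

    def __init__(self, node):
        self.node = node

    def __getattr__(self, name):
        # df.jets  ->  Attribute(df, 'jets')
        return DataFrame(ast.Attribute(value=self.node, attr=name, ctx=ast.Load()))

    def __gt__(self, other):
        # df.jets.pt > 30  ->  Compare(jets.pt, Gt, 30)
        return DataFrame(ast.Compare(left=self.node, ops=[ast.Gt()],
                                     comparators=[ast.Constant(value=other)]))

    def __getitem__(self, selection):
        # df.jets[mask]  ->  Subscript(jets, mask)
        return DataFrame(ast.Subscript(value=self.node, slice=selection.node, ctx=ast.Load()))


df = DataFrame(ast.Name(id="df", ctx=ast.Load()))
good_jet_pt = df.jets[df.jets.pt > 30].pt

# The captured expression tree, ready for a backend to interpret.
print(ast.dump(good_jet_pt.node))
```

A backend such as `hep_tables` then walks a tree like this and translates it into a `func_adl` query for ServiceX, while the immediate `awkward` backend evaluates it directly on local arrays.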
Team
Presentations
- 19 May 2021 - "hep_tables - Heterogeneous Array Programming for HEP", Gordon Watts, CHEP 2021
- 26 Oct 2020 - "An Integrated Data Query Pipeline: HEPTables", Gordon Watts, IRIS-HEP Future Analysis Systems and Facilities Blueprint Workshop
- 8 May 2020 - "hep_tables: An Introduction", Gordon Watts, Coffea Developers Meeting
Publications
- hep_tables: Heterogeneous Array Programming for HEP, Gordon Watts, EPJ Web Conf. 251, 03061 (2021) (21 Mar 2021) [NSF PAR].