Coffea-Casa Analysis Facility

The HL-LHC era will bring more than an order-of-magnitude increase in event counts for analysts. The increased data volume will force physicists to adopt new methods and approaches; what fit comfortably on a laptop for the LHC will require a distributed system for the next generation.

Coffea-Casa is a prototype analysis facility that provides services for “low latency columnar analysis”, enabling rapid processing of data in a column-wise fashion. This provides an interactive experience and quick initial results while still scaling out to full-sized datasets.

These services, based on the Dask parallelism library and Jupyter notebooks, aim to dramatically lower the time for analysis and provide an easily scalable and user-friendly computational environment that will simplify and accelerate the delivery of particle physics measurements. The facility is built on top of a Kubernetes cluster and integrates dedicated resources along with resources allocated via fairshare through the local HTCondor system. In addition to the user-facing interfaces such as Dask, the facility also manages access control through a common single-sign-on authentication & authorization for data access (the data access strategy aligns with the new authorization technologies used by OSG-LHC).
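The scale-out model described above follows the futures pattern that Dask's distributed scheduler generalizes across a cluster: chunks of a dataset are submitted as tasks, and partial results are gathered and combined. A minimal sketch of that pattern, using the standard library's thread pool rather than the facility's actual Dask deployment so the example is self-contained (the chunk data and the `process_chunk` helper are illustrative, not part of the coffea-casa API):

```python
# Sketch of the submit/gather futures pattern that Dask generalizes;
# on the facility, tasks would be shipped to HTCondor-provisioned workers.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(pts):
    """Toy per-chunk analysis step: count events passing a pT cut."""
    return sum(1 for pt in pts if pt > 30.0)

# Pretend each sublist is one chunk of a larger input dataset.
chunks = [[10.0, 45.0, 33.0], [50.0, 12.0], [31.0, 29.0, 60.0]]

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(process_chunk, c) for c in chunks]
    passing = sum(f.result() for f in futures)

print(passing)  # -> 5
```

The same user code can scale from a handful of local threads to hundreds of Dask workers without restructuring, which is what makes the notebook experience interactive at large data volumes.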

After authentication (e.g., via the CERN SSO), the user is presented with a Jupyter notebook interface that can be populated with code from a Git repository specified by the user. When the notebook is executed, the processing automatically scales out to available resources (such as the Nebraska Tier-2 facility for the SSL instance at Nebraska), giving the user transparent interactive access to a large computing resource. The CMS instance of the facility has access to the entire CMS data set, thanks to the global data federation and local caches. It supports the Coffea framework, which provides a declarative programming interface that treats the data in its natural columnar form.
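To illustrate the columnar style that Coffea builds on: each physics quantity lives in one contiguous array, and selections are whole-array operations rather than per-event loops. This sketch uses plain NumPy with made-up values to stay self-contained (Coffea itself uses awkward-array to handle jagged event structure):

```python
# Hypothetical columnar selection: quantities are arrays ("columns"),
# not fields scattered across per-event objects.
import numpy as np

muon_pt = np.array([12.0, 41.5, 33.2, 7.8, 55.0])
muon_eta = np.array([0.3, -1.2, 2.6, 0.1, -0.4])

# Cuts are vectorized operations over whole columns, which is what
# enables the low-latency, interactive turnaround.
selected = muon_pt[(muon_pt > 30.0) & (np.abs(muon_eta) < 2.4)]

print(selected.tolist())  # -> [41.5, 55.0]
```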

An important feature is access to a “column service” like ServiceX; if a user is working with a compact data format (such as CMS NanoAOD or ATLAS PHYSLITE) that is missing a data element that the user needs, the facility can be used to serve that “column” from a remote site. This allows only the compact data formats to be stored locally and augmented only as needed, a critical strategy for CMS and ATLAS to control the costs of storage in the HL-LHC era.
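The column-augmentation idea can be sketched as a join keyed on an event identifier: the compact format is stored locally, and a missing column is fetched on demand and attached. Everything here is a toy stand-in, not the ServiceX API; the event IDs, column name, and `fetch_remote_column` helper are invented for illustration:

```python
# Toy sketch of serving a missing "column" to a compact local format.
local_events = {
    "event_id": [101, 102, 103],
    "jet_pt": [55.0, 82.1, 47.3],
}

def fetch_remote_column(event_ids):
    """Stand-in for a ServiceX-style request that serves a missing column."""
    remote = {101: 2, 102: 0, 103: 1}  # e.g. number of b-tagged jets
    return [remote[i] for i in event_ids]

# Attach the served column; only the compact format needs local storage.
local_events["n_btag"] = fetch_remote_column(local_events["event_id"])

print(local_events["n_btag"])  # -> [2, 0, 1]
```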

Generic design schema of Coffea-Casa analysis facility

Core software components and other plugins developed and used in the design of the Coffea-Casa analysis facility:

Core components used in design of Coffea-Casa analysis facility



Contact us: GitHub Discussion

More information can be found in the corresponding repository:

Recent accomplishments and plans

Recent accomplishments:

  • The CMS facility, deployed at the Nebraska Tier-2 center, is accommodating its first CMS users: try it! More than 150 users have used the CMS facility over the last years.

CMS Coffea-Casa JupyterLab interface with a Dask Labextension-powered cluster

  • For non-CMS users, we have enabled the Opendata Coffea-Casa facility: try it! More than 60 users have used the Opendata facility over the last years.

Opendata Coffea-Casa login web page giving a user access to the JupyterLab interface with a Dask Labextension-powered cluster

Future plans for 2023:

  • Recruit more physics analysis groups to use the facility.
  • Benchmark various software components and packages deployed at Coffea-Casa analysis facilities at the University of Nebraska-Lincoln and at the University of Chicago.
  • Prepare and execute the Analysis Grand Challenge at the Coffea-Casa analysis facilities deployed at the University of Nebraska-Lincoln and the University of Chicago, as well as at other facilities.

Recent videos and tutorials