Open IRIS-HEP fellow projects

This page lists a number of known software R&D projects of interest to IRIS-HEP researchers. (This page will be updated from time to time, so check back and reload to see if new projects have been added.) Contact the mentors for more information about any of these projects! Be sure you have read the guidelines.

  • Geometric Machine Learning for Particle Tracking: Particle track reconstruction (trajectory finding) is a critical software component for meeting the physics goals of the HL-LHC. This project uses geometric machine learning methods (particularly graph neural networks) to address this problem. There are several possible areas a fellow could contribute to: (a) extending work on ‘one-shot’ tracking architectures that use object condensation or instance segmentation approaches to identify tracks and extract track parameters in a single algorithm, (b) contributing to an ongoing study of equivariant GNNs for tracking by modifying an existing Lorentz-invariant network to work with tracking data, implementing rotationally equivariant networks for tracking, or studying explainability methods for GNNs to characterize the impact of including invariance/equivariance, and (c) accelerating these tracking methods by modifying existing OpenCL implementations of the different components of GNN tracking (graph construction, GNN inference, and clustering) and linking them together into a full pipeline. The specific project will be developed with the fellow. For all projects, knowledge of Python is necessary; for projects (a) and (b), experience with PyTorch is helpful but not required, while for project (c) some experience with C++ and/or FPGAs is advantageous. (Contact(s): Savannah Thais )
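To make the "graph construction" step of the GNN tracking pipeline concrete, here is a minimal stdlib-only sketch: hits on adjacent detector layers are connected when their azimuthal angles agree within a window. The hit coordinates, layer numbering, and cut value are all invented for illustration; real pipelines use full 3D hit positions and far more refined selections.

```python
# Hypothetical toy hits: (hit_id, layer, phi). Real tracking data carries
# full 3D coordinates; this sketch only illustrates graph construction.
hits = [
    (0, 0, 0.10), (1, 0, 1.50),
    (2, 1, 0.12), (3, 1, 1.48),
    (4, 2, 0.15), (5, 2, 1.45),
]

def build_edges(hits, max_dphi=0.1):
    """Connect hits on adjacent detector layers whose azimuthal angles
    agree within max_dphi (a simple geometric compatibility cut)."""
    edges = []
    for i, (id_a, layer_a, phi_a) in enumerate(hits):
        for id_b, layer_b, phi_b in hits[i + 1:]:
            if layer_b - layer_a == 1 and abs(phi_b - phi_a) <= max_dphi:
                edges.append((id_a, id_b))
    return edges

edges = build_edges(hits)
print(edges)  # pairs of hit ids linking compatible hits on consecutive layers
```

The resulting edge list is what a GNN would then classify into true track segments and spurious connections.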
  • Prototyping of a US-CMS Data Lake: A federated data concept (“Data Lake”) has been proposed by US-CMS as a scalable solution that enables HL-LHC scaling of the US-CMS data distribution infrastructure. In particular, it would lower operational costs and allow better network capacity management while maintaining overall CPU efficiency. This project involves the implementation and performance evaluation of a Data Lake prototype, using resources at the San Diego Supercomputing Center (SDSC) and Pacific Research Platform (PRP). (Contact(s): Diego Davila Frank Wuerthwein )
  • Exploring the FAIR principles for preservation of UFO models: The FAIR (Findable, Accessible, Interoperable, Reusable) principles represent a set of standards that allow transparent and tractable progression in preservation of scientific data, tools, and software. One important set of tools used for Monte Carlo simulation of new physics models are the so-called Universal FeynRules Output (UFO) models, which are used in conjunction with event simulators like MadGraph. This project will be dedicated to developing a set of community standards in close association with theorists, phenomenologists, and experimentalists to address FAIR preservation of such models. The project will encompass development of tools and CI-enabled repositories to allow a version-controlled and well-documented preservation of these UFO models that can then be easily and reliably used by physicists. (Contact(s): Avik Roy Mark Neubauer Matthew Feickert )
  • Reading (and possibly writing) RNTuples in Uproot: Uproot is a Python library that reads and writes ROOT files, the file format for nearly all particle physics data. RNTuple is a new, fully columnar data format for ROOT, which is intended as the eventual replacement for TTree (paper, plans, and spec). Thanks to its columnar structure, data structures of any complexity can be zero-copy converted (apart from decompression) into Awkward Arrays. In this project, the successful candidate would develop the interface code in Uproot to read RNTuple data into Awkward Arrays. If time permits, the candidate may attempt to implement RNTuple writing, as well as RNTuple reading in sister projects in Julia. (Contact(s): Jim Pivarski )
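The "zero-copy" claim rests on the columnar layout shared by RNTuple and Awkward Array: a jagged array is stored as one flat content buffer plus an offsets buffer, so an event view is just a slice, never a copy. A minimal stdlib sketch of that layout (the values are invented; real readers hand binary buffers directly to Awkward):

```python
# Per-event particle pT values stored columnar-style: one flat "content"
# buffer plus an "offsets" buffer, the kind of pair an RNTuple reader
# would pass to Awkward Array. Values are made up for illustration.
content = [23.1, 11.4, 8.9, 47.0, 15.2]   # all values, flattened
offsets = [0, 2, 2, 5]                     # event i spans offsets[i]:offsets[i+1]

def event(i):
    """View event i via a slice of the shared buffer (no per-event copy
    of the underlying data layout is needed)."""
    return content[offsets[i]:offsets[i + 1]]

print(event(0))  # [23.1, 11.4]
print(event(1))  # []  (an empty event)
print(event(2))  # [8.9, 47.0, 15.2]
```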
  • Efficient implementation of algorithms to reconstruct charged particle trajectories: A Common Tracking Software (ACTS) is a general tracking software toolkit for High Energy Physics experiments. The ACTS collaboration has launched several R&D lines (vecmem, detray, and traccc) for GPU acceleration by parallelizing the track reconstruction. vecmem is a memory management toolkit which provides users with a convenient GPU interface. detray is a geometry builder which translates the CPU geometry into a GPU-compatible one. traccc incorporates the other R&D lines to demonstrate a GPU tracking pipeline which includes hit clusterization, seed finding, and Kalman filtering. We seek candidates who can contribute to the traccc project. Examples of possible projects are (1) the development of traccc algorithms, (2) the measurement and optimization of tracking performance, and (3) the acceleration of existing algorithms by using a caching allocator or a multi-threaded environment. A suitable project will be assigned to the fellow based on their skill set and research interests. The successful candidate needs C++ programming skills, while experience with GPU APIs such as CUDA and SYCL will be advantageous. (Contact(s): Beomki Yeo Heather Gray )
  • Metrics to define user activities and engagement on the various coffea-casa Analysis Facility deployments: coffea-casa is a prototype analysis facility (AF), which provides services for “low latency columnar analysis”, enabling rapid processing of data in a column-wise fashion. These services, based on Dask and Jupyter notebooks, aim to dramatically lower the time needed for analysis and provide an easily scalable and user-friendly computational environment that will simplify, facilitate, and accelerate the delivery of HEP results. The goal of the project is to define a set of user engagement metrics, collected from JupyterHub and other AF services, as well as from the underlying infrastructure (e.g. Kubernetes), and available through Elasticsearch. Expected results are the development of the various metrics, a data collection infrastructure for them, and possibly visualization dashboards. (Contact(s): Brian Bockelman Ken Bloom Oksana Shadura )
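As a flavor of what such metrics could look like, here is a stdlib-only sketch that aggregates mock session records into a few engagement numbers. The record shape, user names, and timestamps are invented; a real implementation would pull these from JupyterHub logs via Elasticsearch.

```python
from collections import defaultdict

# Hypothetical session records as they might be harvested from JupyterHub
# logs: (user, session start, session end) in epoch seconds. All values
# are invented for illustration.
sessions = [
    ("alice", 0, 3600),
    ("alice", 8000, 8900),
    ("bob", 100, 700),
]

def engagement_metrics(sessions):
    """Aggregate raw session records into simple per-deployment metrics."""
    per_user = defaultdict(list)
    for user, start, end in sessions:
        per_user[user].append(end - start)
    return {
        "active_users": len(per_user),
        "sessions_per_user": {u: len(d) for u, d in per_user.items()},
        "mean_session_seconds": sum(e - s for _, s, e in sessions) / len(sessions),
    }

metrics = engagement_metrics(sessions)
print(metrics)
```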
  • New powerful axis types and accumulators for Boost.Histogram: Boost.Histogram is a flexible high-performance library in C++. It is designed as an extensible modular system made from orthogonal parts. Axis classes are one of these components, which convert user values into indices. Boost.Histogram already supports the common axis types, but interesting new ones are still missing, like a composite axis and axes for hexagonal bins, generic polygons, or HEALPix. Also interesting would be support for rational numbers in a regular axis. Another orthogonal component of Boost.Histogram is the accumulators. Common accumulators are supported, but the library lacks an efficiency accumulator and a collector. If time permits, we can also look into deeper optimizations of the library. (Contact(s): Henry Schreiner Hans Dembinski )
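The core contract of an axis component is simply value → index. A conceptual stdlib Python sketch of a regular axis (the real Boost.Histogram axes are C++ templates with metadata, growth, and transforms; this only mirrors the indexing contract, with underflow/overflow conventions chosen here for illustration):

```python
class RegularAxis:
    """Conceptual sketch of what a histogram axis does: map a user value
    to a bin index, with underflow/overflow handled out of band."""
    def __init__(self, bins, start, stop):
        self.bins, self.start, self.stop = bins, start, stop

    def index(self, value):
        if value < self.start:
            return -1              # underflow
        if value >= self.stop:
            return self.bins       # overflow
        width = (self.stop - self.start) / self.bins
        return int((value - self.start) / width)

axis = RegularAxis(10, 0.0, 1.0)
print(axis.index(0.05))   # 0
print(axis.index(0.95))   # 9
print(axis.index(-0.5))   # -1 (underflow)
print(axis.index(2.0))    # 10 (overflow)
```

A new axis type (hexagonal, polygonal, HEALPix) would implement this same value-to-index mapping for a different geometry, which is what keeps the components orthogonal.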
  • Enabling support for MiniAOD Transformer for ServiceX Data Delivery Service: ServiceX is a distributed, cloud-native application that extracts columnar data from HEP event data and delivers it to an analyst. The func_adl data query language is used to tell ServiceX how to extract the data (the columns, simple cuts, etc.). The func_adl data query language has two backends that are currently part of ServiceX: one based on C++ for ATLAS and CMS data, and one based on columnar processing using uproot and awkward arrays. The C++ backend currently runs only on the ATLAS binary format (xAOD) and the CMS binary format (CMS AOD). This project will modify the C++ backend to also run on CMS MiniAOD binary files (available publicly as part of the Run 2 CMS Open Data release). The MiniAOD transformer is an important ingredient for a physics analysis workflow envisioned in the Analysis Grand Challenge. (Contact(s): Gordon Watts Ben Galewsky Oksana Shadura Alexander Held )
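To give a feel for the fluent Select/Where style that func_adl exposes, here is a toy stdlib sketch. The real func_adl captures Python lambdas as ASTs and ships them to a ServiceX backend for translation into C++ or columnar code; this toy version just runs locally on plain lists, and the mini-events are invented.

```python
class Query:
    """Toy sketch of a func_adl-style fluent query over a list of events."""
    def __init__(self, events):
        self.events = list(events)

    def Where(self, predicate):
        # Keep only events passing the cut.
        return Query(e for e in self.events if predicate(e))

    def Select(self, projection):
        # Extract a value (column) from each surviving event.
        return Query(projection(e) for e in self.events)

    def value(self):
        return self.events

# Invented mini-events: each carries a list of jet pT values (GeV).
events = [{"jet_pt": [55.0, 20.0]}, {"jet_pt": [10.0]}, {"jet_pt": [90.0]}]

leading = (Query(events)
           .Where(lambda e: max(e["jet_pt"]) > 40.0)   # simple cut
           .Select(lambda e: max(e["jet_pt"]))          # column extraction
           .value())
print(leading)  # [55.0, 90.0]
```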
  • Benchmarking of prototype analysis system components: The Analysis Grand Challenge of IRIS-HEP focuses on performing a high energy physics analysis at scale, including all relevant features encountered by analyzers in this context. It is performed using tools and technologies developed within both IRIS-HEP and the broader community, making use of the Python ecosystem and the required cyberinfrastructure to run at scale. This project will build on a first round of preliminary benchmarking: the prospective fellow will use pieces of an example physics analysis to study the performance of different system components in more detail. Fellows are expected to have prior Python experience and interest in working with a diverse stack of analysis tools available in the ecosystem. (Contact(s): Oksana Shadura Alexander Held )
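A minimal sketch of the per-component measurement pattern such a study might use: time a component several times and report the median wall-clock duration, so one noisy run does not skew the result. The toy workload here is a stand-in; real benchmarks would run realistic analysis payloads.

```python
import statistics
import time

def benchmark(fn, repeat=5):
    """Run fn several times and return the median wall-clock duration,
    which is more robust to stragglers than a single measurement."""
    durations = []
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - t0)
    return statistics.median(durations)

# Toy stand-in for an analysis step, e.g. filling a histogram.
def toy_step():
    sum(x * x for x in range(10_000))

median_s = benchmark(toy_step)
print(f"median: {median_s:.6f} s")
```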
  • Muon Collider Tracking Software: A muon collider has been proposed as a possible path for future high-energy physics. The design of a detector for a muon collider has to cope with a large rate of beam-induced background, resulting in an unprecedentedly large multiplicity of particles entering the detector that are unrelated to the main muon-muon collision. The algorithms used for charged particle reconstruction (tracking) need to cope with such “noise” and be able to successfully reconstruct the trajectories of the particles of interest, which results in a very large combinatorial problem that challenges the approaches adopted so far. The project consists of two possible phases. In the first one, we will investigate how the tracking algorithms can be improved by utilizing directional information from specially-arranged silicon-detector layers. In the second phase, if time allows, we will focus on the optimization of the algorithm and the detector layout in order to take full advantage of such information. (Contact(s): Sergo Jindariani Simone Pagan Griso )
  • Developing tutorials with LLVM/Cling: LLVM/Cling enables a truly interactive C++ coding environment, whether in a standalone shell or a Jupyter notebook environment. Experience with other programming languages demonstrates that this sort of interactive environment is an excellent tool for learning. The project goal is to modernize current tutorial materials, and to create new tutorials based on current best practices in C++ using the interactive capabilities of Cling to provide an engaging student experience. Students are expected to have prior C++ experience and interest in explaining modern C++ constructs to others. (Contact(s): David Lange Vassil Vassilev )
  • Implementing the Hierarchical Cluster Trellis data structure in Julia: Hierarchical clustering is a fundamental task often used to discover meaningful structures in data. Specifically, jet reconstruction at the Large Hadron Collider (LHC) involves a hierarchical clustering of the jet constituents that hit the detector. We define the MAP hierarchy as the maximum likelihood hierarchical clustering given a dataset, e.g. the binary tree that maximizes the likelihood. Current algorithms are approximate and greedy (they locally maximize the clustering likelihood). Exactly inferring the MAP hierarchy and finding the partition function (the marginal likelihood over all binary trees) by enumerating all hierarchical clusterings over N elements is exceptionally difficult because the number of hierarchies grows extremely rapidly, as (2N-3)!!. To overcome this computational burden, we introduced in arXiv:2002.11661 a cluster trellis data structure that computes these quantities in O(3^N) time, without having to iterate over each possible hierarchy. While this reduces the super-exponential complexity to exponential, the current implementation, ClusterTrellis, is in Python. Thus, implementing this algorithm in Julia would make it faster and feasible for the larger datasets that typically appear in data analyses at the LHC. We would also like to take advantage of Julia’s automatic differentiation capabilities. (Contact(s): Kyle Cranmer Sebastian Macaluso )
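The gap between (2N-3)!! enumeration and the trellis's O(3^N) cost is easy to see numerically; a short stdlib check of the counting claim:

```python
def double_factorial(n):
    """n!! = n * (n-2) * (n-4) * ... down to 1 (for odd n)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def num_hierarchies(n):
    """Number of binary-tree hierarchies over n leaves: (2n - 3)!!"""
    return double_factorial(2 * n - 3)

# Compare brute-force enumeration count against the trellis's 3^N scale.
for n in (5, 10, 15):
    print(n, num_hierarchies(n), 3 ** n)
```

Already at N = 10 there are 34,459,425 hierarchies versus 3^10 = 59,049 trellis operations, which is why exact MAP inference is only practical with the trellis.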
  • Arrow-Native Storage with SkyhookDM and Coffea processor Integration: Apache Arrow is a columnar in-memory format for seamless data transfer between different big data systems; it mitigates the need for serializing and deserializing data and has native abstractions for use in big data storage systems. We aim to convert SkyhookDM into an Arrow-native storage system by utilizing the object class SDK provided by Ceph to add a layer on its storage side, built with the Arrow C++ SDK, that allows querying and processing of tabular datasets stored as objects in Apache Arrow format. The Coffea framework is a Python package that provides several recipes and utilities to aid in the development of HEP analyses following the columnar analysis paradigm. The goal is to integrate Coffea with SkyhookDM to achieve Dask- or Spark-based parallel computations on ROOT files stored as objects in Ceph OSDs. (Contact(s): Jeff LeFevre Ivo Jimenez Carlos Maltzahn )
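The benefit of storage-side, columnar processing can be sketched in a few lines of stdlib Python: the "storage side" evaluates a predicate on a single column and materializes only matching rows, rather than shipping every row to the client for deserialization. Column names and values here are invented.

```python
# Toy columnar table: a dict of equal-length columns, mimicking the
# Arrow layout. Values are invented for illustration.
table = {
    "event": [1, 2, 3, 4],
    "n_muons": [0, 2, 1, 3],
}

def scan_with_pushdown(table, column, predicate):
    """Filter on one column 'in storage' and materialize only the rows
    that pass, mimicking what a storage-side Arrow layer would do."""
    keep = [i for i, v in enumerate(table[column]) if predicate(v)]
    return {name: [col[i] for i in keep] for name, col in table.items()}

result = scan_with_pushdown(table, "n_muons", lambda n: n >= 2)
print(result)  # {'event': [2, 4], 'n_muons': [2, 3]}
```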
  • Integrating MadAnalysis and pyhf into RECAST_cli and REANA: RECAST is a framework for systematic reinterpretation of analyses prepared at the LHC using computational workflows. The workflows can run locally, or on REANA, a cloud-native distributed application. It consists of public and private web services, the “Frontend” and “Control Plane” components respectively, for managing reinterpretation requests from the phenomenology community. A RECAST_cli (command line interface) for truth reinterpretation, based on the Rivet package, has been previously developed. In its current state, the truth reinterpretation runs only on the user’s local machine. The goal of this project is threefold: to add new backends to the RECAST_cli besides Rivet (e.g. MadAnalysis), to add alternative statistical tools (e.g. pyhf), and to run the workflows on REANA in the cloud. (Contact(s): Kyle Cranmer Matthew Feickert )
  • CMS RECAST example & documentation: RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. Recently, IRIS-HEP and the HEP Software Foundation (HSF) supported an analysis preservation bootcamp at CERN teaching these tools. Furthermore, the ATLAS experiment is now actively using RECAST. We seek a member of CMS to incorporate a CMS analysis into the system with support from IRIS-HEP, REANA, and RECAST developers. (Contact(s): Kyle Cranmer )
  • A pyhf converter for binned likelihood models in CMS Combine: Binned likelihood models based on template histograms are ubiquitous in both ATLAS and CMS. Within ATLAS the HistFactory tool is used widely (sometimes from a higher-level tool like HistFitter or TRExFitter). Within CMS the Combine tool is widely used. Both produce RooFit workspaces. Recently, the HistFactory specification was implemented in a pure Python environment called pyhf, which can take advantage of GPU acceleration, automatic differentiation, etc. via backends like TensorFlow, PyTorch, and JAX. In addition, the pyhf model uses a JSON schema which has benefits for digital publishing and reinterpretation. We seek a fellow to develop a converter for binned template likelihoods from the CMS Combine syntax to the pyhf specification and to develop tools to perform comparisons between the two models. (Contact(s): Kyle Cranmer Alexander Held Matthew Feickert )
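A stdlib sketch of the converter's core step: mapping template histograms and their normalization systematics into pyhf's channels/samples/modifiers JSON schema. The input format below is a simplified, invented stand-in for a parsed Combine datacard (not the real Combine syntax); the output shape follows pyhf's published schema, though a real converter must handle many more modifier types and conventions.

```python
def combine_to_pyhf(channel_name, processes):
    """Sketch: turn a simplified, parsed datacard (invented shape) into a
    pyhf-style specification dict with channels, samples, and modifiers."""
    samples = []
    for name, info in processes.items():
        # Map log-normal rate uncertainties to pyhf 'normsys' modifiers.
        modifiers = [
            {"name": syst, "type": "normsys",
             "data": {"lo": 1.0 - size, "hi": 1.0 + size}}
            for syst, size in info.get("lnN", {}).items()
        ]
        if info.get("signal"):
            # Free signal-strength parameter, analogous to Combine's 'r'.
            modifiers.append({"name": "mu", "type": "normfactor", "data": None})
        samples.append({"name": name, "data": info["template"],
                        "modifiers": modifiers})
    return {"channels": [{"name": channel_name, "samples": samples}]}

spec = combine_to_pyhf("SR", {
    "signal": {"template": [5.0, 2.0], "signal": True},
    "bkg": {"template": [50.0, 30.0], "lnN": {"lumi": 0.025}},
})
print(spec["channels"][0]["samples"][1]["modifiers"])
```

The comparison tools would then fit both the original Combine workspace and the generated pyhf spec to check that the two models agree.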
  • Automating multi-container tests: The Open Science Grid (OSG) offers an integrated software stack used by the High-Energy Physics community to meet their computing needs. As container images and orchestration technologies grow in popularity, the OSG is exploring software distribution and deployment through these methods, and needs testing coverage to be able to develop its software with confidence. We seek a fellow to join us in developing a nightly testing pipeline that verifies end-to-end functionality of a multi-container OSG site. (Contact(s): Brian Lin )
  • Interactive C++ for ML: Cling is an interactive C++ interpreter, built on top of the Clang and LLVM compiler infrastructure. Cling realizes the read-eval-print loop (REPL) concept, in order to enable rapid application development. Implemented as a small extension to LLVM and Clang, the interpreter reuses their strengths, such as concise and expressive compiler diagnostics. The LLVM-based C++ interpreter has enabled interactive C++ coding environments, whether in a standalone shell or a Jupyter notebook environment in xeus-cling. In this project we aim to demonstrate that interactive C++ is useful with data analysis tools outside of the field of HEP. For example, prototype implementations for PyTorch have been successfully completed. See this link. The project deliverables are: (1) demonstrate that we can build and use PyTorch programs with Cling in a Jupyter notebook; (2) develop several non-trivial ML tutorials using interactive C++ in Jupyter; (3) experiment with automatic differentiation via Clad on PyTorch code. Candidate requirements: experience with C++; familiarity with Jupyter notebooks and Conda is desirable; experience with ML; interest in exploring the intersection of data science and interactive C++. (Contact(s): David Lange Vassil Vassilev )