Open IRIS-HEP fellow projects


This page lists a number of known software R&D projects of interest to IRIS-HEP researchers. (This page will be updated from time to time, so check back and reload to see if new projects have been added.) Contact the mentors for more information about any of these projects! Be sure you have read the guidelines.

  • Optimization of a Higgs to Diphoton NanoAOD (HiggsDNA) framework: The Higgs to Diphoton NanoAOD framework is a Python-based analysis framework for Higgs to diphoton physics in CMS which utilizes columnar packages like awkward array. Depending on interest and skill set, a student could perform performance-optimization studies of the code, develop new tools for visualizing and summarizing its outputs, or develop tools for training and integrating machine learning algorithms within the framework. The student should have experience with version control tools (GitLab/GitHub), Linux, and Python (especially numpy, xgboost, or tensorflow). (Contact(s): Indara Suarez )
  • AutoDQM - an automated web-based tool for data quality monitoring in the CMS detector: AutoDQM is an automated web-based tool for data quality monitoring in the CMS detector which assists L3 shifters in identifying anomalous detector behavior during data-taking. AutoDQM_ML is an accompanying Python-based machine learning toolkit which provides users with tools to train and assess unsupervised machine learning algorithms which can then be used within AutoDQM. A student could develop tools for streamlining the integration of trained ML algorithms between AutoDQM_ML and AutoDQM, develop software for analysis and optimization of hyperparameters of the machine learning algorithms (PCAs and autoencoders), or implement new unsupervised ML algorithms in the framework. The student should have experience with version control tools (GitLab/GitHub), Linux, and Python. Machine learning experience is a plus as well. (Contact(s): Indara Suarez )
  • Geometric Machine Learning for Particle Tracking: Particle track reconstruction (trajectory finding) is a critical software component for meeting the physics goals of the HL-LHC. This project uses geometric machine learning methods (particularly graph neural networks) to address this problem. There are several possible areas a fellow could contribute to: (a) extending work on ‘one-shot’ tracking architectures that use object condensation or instance segmentation approaches to identify tracks and extract track parameters in a single algorithm, (b) contributing to an ongoing study of equivariant GNNs for tracking by modifying an existing Lorentz-invariant network to work with tracking data, implementing rotationally equivariant networks for tracking, or studying explainability methods for GNNs to characterize the impact of including invariance/equivariance, and (c) accelerating these tracking methods by modifying existing OpenCL implementations of the different components of GNN tracking (graph construction, GNN inference, and clustering) and linking them together into a full pipeline. The specific project will be developed with the fellow. For all projects knowledge of Python is necessary; for projects (a) and (b), experience with PyTorch is helpful but not required, while for project (c) some experience with C++ and/or FPGAs is advantageous. (Contact(s): Savannah Thais )
  • Prototyping of a US-CMS Data Lake: A federated data concept (“Data Lake”) has been proposed by US-CMS as a scalable solution that enables HL-LHC scaling of the US-CMS data distribution infrastructure. In particular, it would lower operational costs and allow better network capacity management while maintaining overall CPU efficiency. This project involves the implementation and performance evaluation of a Data Lake prototype, using resources at the San Diego Supercomputing Center (SDSC) and Pacific Research Platform (PRP). (Contact(s): Diego Davila Frank Wuerthwein )
  • Exploring the FAIR principles for preservation of UFO models: The FAIR (Findable, Accessible, Interoperable, Reusable) principles represent a set of standards that allow transparent and tractable progress in the preservation of scientific data, tools, and software. One important set of tools used for Monte Carlo simulation of new physics models are the so-called Universal FeynRules Output (UFO) models, which are used in conjunction with event simulators like MadGraph. This project will be dedicated to developing a set of community standards, in close association with theorists, phenomenologists, and experimentalists, to address FAIR preservation of such models. The project will encompass development of tools and CI-enabled repositories to allow a version-controlled and well-documented preservation of these UFO models that can then be easily and reliably used by physicists. (Contact(s): Mark Neubauer Matthew Feickert Hans Dembinski )
  • Efficient implementation of algorithms to reconstruct charged-particle trajectories: A Common Tracking Software (ACTS) is a general tracking software toolkit for High Energy Physics experiments. The ACTS collaboration has launched several R&D lines (vecmem, detray, and traccc) for GPU acceleration by parallelizing the track reconstruction. vecmem is a memory management toolkit which provides users with a convenient GPU interface. detray is a geometry builder which translates the CPU geometry into a GPU one. traccc incorporates the other R&D lines to demonstrate a GPU tracking pipeline which includes hit clusterization, seed finding, and Kalman filtering. We seek candidates who can contribute to the traccc project. Examples of possible projects are (1) the development of traccc algorithms, (2) the measurement and optimization of tracking performance, and (3) the acceleration of existing algorithms by using a caching allocator or realizing a multi-threaded environment. A suitable project will be assigned to the fellow based on their skill set and research interests. Successful candidates need C++ programming skills, while experience with GPU APIs such as CUDA and SYCL will be advantageous. (Contact(s): Beomki Yeo Heather Gray )
  • Enabling support for MiniAOD Transformer for ServiceX Data Delivery Service: ServiceX is a distributed, cloud-native application that extracts columnar data from HEP event data and delivers it to an analyst. The func_adl data query language is used to tell ServiceX how to extract the data (the columns, simple cuts, etc.). The func_adl data query language has two backends that are currently part of ServiceX - one based on C++ for ATLAS and CMS data, and one based on columnar processing using uproot and awkward arrays. The C++ backend currently runs only on the ATLAS binary format (xAOD) and the CMS binary format (CMS AOD). This project will modify the C++ backend to also run on CMS MiniAOD binary files (available publicly as part of the Run 2 CMS Open Data release). The MiniAOD transformer is an important ingredient for the physics analysis workflow envisioned in the Analysis Grand Challenge. (Contact(s): Gordon Watts Ben Galewsky Oksana Shadura Alexander Held )
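The chained Select/SelectMany/Where style of query that func_adl uses can be illustrated with a small pure-Python sketch. This is a toy model of the query pattern only, not the actual func_adl API; the class and the event layout are invented for illustration:

```python
class Query:
    """Toy chainable query over in-memory data, illustrating the
    Select/SelectMany/Where pattern of func_adl-like query languages.
    (Illustrative sketch only; not the real func_adl API.)"""

    def __init__(self, data):
        self.data = list(data)

    def Where(self, pred):
        # Keep only items passing the predicate (a "cut")
        return Query(x for x in self.data if pred(x))

    def Select(self, f):
        # Transform each item (pick out a column)
        return Query(f(x) for x in self.data)

    def SelectMany(self, f):
        # Flatten a nested collection across items
        return Query(y for x in self.data for y in f(x))

    def value(self):
        return self.data

# Hypothetical events, each holding a list of jet pT values;
# extract the pT of every jet above a 30 GeV cut, across all events.
events = [{"jets": [45.0, 22.5]}, {"jets": [80.1]}, {"jets": []}]
pts = (Query(events)
       .SelectMany(lambda e: e["jets"])  # flatten jets across events
       .Where(lambda pt: pt > 30.0)      # simple cut
       .value())
print(pts)  # [45.0, 80.1]
```

In the real system the lambdas are not executed locally: func_adl translates the query into code that runs in the backend (C++ or uproot/awkward) against the experiment's binary formats.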
  • Metrics to define user activities and engagement on the various coffea-casa Analysis Facility deployments: coffea-casa is a prototype analysis facility (AF), which provides services for “low latency columnar analysis”, enabling rapid processing of data in a column-wise fashion. These services, based on Dask and Jupyter notebooks, aim to dramatically lower the time to analysis and provide an easily-scalable and user-friendly computational environment that will simplify, facilitate, and accelerate the delivery of HEP results. The goal of the project is to define a set of user engagement metrics, collected from JupyterHub and other AF services, as well as from the underlying infrastructure (e.g. Kubernetes), and available through Elasticsearch. Expected results are the development of the various metrics, a data collection infrastructure for them, and possibly visualization dashboards. (Contact(s): Brian Bockelman Ken Bloom Oksana Shadura )
  • New powerful axis types and accumulators for Boost.Histogram: Boost.Histogram is a flexible high-performance library in C++. It is designed as an extensible modular system made from orthogonal parts. Axis classes are one of these components; they convert user values into indices. Boost.Histogram already supports the common axis types, but interesting new ones are still missing, like a composite axis, an axis for hexagonal bins, for generic polygons, or for HEALPix. It would also be interesting to add support for rational numbers in a regular axis. Another orthogonal component of Boost.Histogram is the accumulators. Common accumulators are supported, but the library lacks an efficiency accumulator and a collector. If time permits, we can also look into deeper optimizations of the library. (Contact(s): Henry Schreiner Hans Dembinski )
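The core job of an axis class, turning a user value into a bin index, can be sketched in a few lines. This is a toy model of a regular (uniform) axis written in Python for clarity, not Boost.Histogram's C++ implementation; the function name and conventions are illustrative:

```python
def regular_axis_index(value, n_bins, start, stop):
    """Map a value to a bin index, as a regular (uniform) axis does.

    Returns -1 for underflow and n_bins for overflow, mirroring the
    underflow/overflow convention common to histogram libraries.
    (Toy sketch, not the Boost.Histogram API.)
    """
    if value < start:
        return -1          # underflow bin
    if value >= stop:
        return n_bins      # overflow bin
    # Linear transform onto n_bins equal-width bins
    return int(n_bins * (value - start) / (stop - start))

# A 4-bin axis over [0, 2): bin edges at 0.0, 0.5, 1.0, 1.5, 2.0
print(regular_axis_index(0.7, 4, 0.0, 2.0))   # 1
print(regular_axis_index(-1.0, 4, 0.0, 2.0))  # -1 (underflow)
print(regular_axis_index(3.0, 4, 0.0, 2.0))   # 4 (overflow)
```

New axis types in the project (hexagonal bins, polygons, HEALPix) replace only this value-to-index transform; the rest of the histogram machinery stays unchanged, which is what makes the design orthogonal.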
  • Benchmarking of prototype analysis system components: The Analysis Grand Challenge of IRIS-HEP focuses on performing a high energy physics analysis at scale, including all relevant features encountered by analyzers in this context. It is performed using tools and technologies developed within both IRIS-HEP and the broader community, making use of the Python ecosystem and the required cyberinfrastructure to run at scale. This project builds on a first round of preliminary benchmarking: the prospective fellow will use pieces of an example physics analysis to study the performance of different system components in more detail. Fellows are expected to have prior Python experience and interest in working with a diverse stack of analysis tools available in the ecosystem. (Contact(s): Oksana Shadura Alexander Held )
  • Muon Collider Tracking Software: A muon collider has been proposed as a possible path for future high-energy physics. The design of a detector for a muon collider has to cope with a large rate of beam-induced background, resulting in an unprecedentedly large multiplicity of particles entering the detector that are unrelated to the main muon-muon collision. The algorithms used for charged particle reconstruction (tracking) need to cope with such “noise” and be able to successfully reconstruct the trajectories of the particles of interest, which results in a very large combinatorial problem that challenges the approaches adopted so far. The project consists of two possible phases. In the first one, we will investigate how the tracking algorithms can be improved by utilizing directional information from specially-arranged silicon-detector layers. In the second phase, if time allows, we will focus on the optimization of the algorithm and the detector layout in order to take full advantage of such information. (Contact(s): Sergo Jindariani Simone Pagan Griso )
  • Developing tutorials with LLVM/Cling: LLVM/Cling enables a truly interactive C++ coding environment, whether in a standalone shell or a Jupyter notebook environment. Experience with other programming languages demonstrates that this sort of interactive environment is an excellent tool for learning. The project goal is to modernize current tutorial materials, and to create new tutorials based on current best practices in C++ using the interactive capabilities of Cling to provide an engaging student experience. Students are expected to have prior C++ experience and interest in explaining modern C++ constructs to others. (Contact(s): David Lange Vassil Vassilev )
  • Implementing the Hierarchical Cluster Trellis data structure in Julia: Hierarchical clustering is a fundamental task often used to discover meaningful structures in data. Specifically, jet reconstruction at the Large Hadron Collider (LHC) involves a hierarchical clustering of the jet constituents that hit the detector. We define the MAP hierarchy as the maximum likelihood hierarchical clustering given a dataset, i.e. the binary tree that maximizes the likelihood. Current algorithms are approximate and greedy (they locally maximize the clustering likelihood). Exactly performing inference on the MAP hierarchy and finding the partition function (the marginal likelihood over all binary trees) by enumerating all hierarchical clusterings over N elements is exceptionally difficult because the number of hierarchies grows extremely rapidly, as (2N-3)!!. To overcome this computational burden, we introduced in arXiv:2002.11661 a cluster trellis data structure that computes these quantities in O(3^N) time, without having to iterate over each possible hierarchy. While this reduces the super-exponential complexity to exponential, the current implementation, ClusterTrellis, is in Python. Thus, implementing this algorithm in Julia would make it faster and feasible for the larger datasets that typically appear in data analyses at the LHC. We would also like to take advantage of Julia’s automatic differentiation capabilities. (Contact(s): Kyle Cranmer Sebastian Macaluso )
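The (2N-3)!! growth quoted above is easy to check with a few lines of Python. This is a back-of-the-envelope count, independent of the trellis implementation; the function name is illustrative:

```python
def num_binary_hierarchies(n):
    """Number of distinct binary hierarchical clusterings of n leaves: (2n-3)!!
    (the product of odd numbers 1 * 3 * 5 * ... * (2n-3))."""
    if n < 2:
        return 1
    result = 1
    for k in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        result *= k
    return result

for n in (2, 3, 5, 10):
    print(n, num_binary_hierarchies(n))
# 2 -> 1, 3 -> 3, 5 -> 105, 10 -> 34459425
```

Even at N = 10 there are already 34,459,425 hierarchies, while the trellis needs on the order of 3^10 = 59,049 operations, which is the gap the project aims to exploit at larger N via a faster Julia implementation.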
  • Arrow-Native Storage with SkyhookDM and Coffea processor Integration: Apache Arrow is a columnar in-memory format for seamless data transfer between different big data systems; it mitigates the need for serializing and deserializing data and has native abstractions for use in big data storage systems. We aim to convert SkyhookDM into an Arrow-native storage system by utilizing the Object Class SDK provided by Ceph to add a layer on its storage side, using the Arrow C++ SDK, that allows querying and processing of tabular datasets stored as objects in the Apache Arrow format on the storage side. The Coffea framework is a Python package that provides several recipes and utilities to aid in the development of HEP analyses following the columnar analysis paradigm. The goal is to integrate Coffea with SkyhookDM to achieve Dask- or Spark-based parallel computations on ROOT files stored as objects in Ceph OSDs. (Contact(s): Jeff LeFevre Ivo Jimenez Carlos Maltzahn )
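The columnar idea underlying Arrow can be sketched in plain Python by contrasting row-wise and column-wise layouts of the same records. This is illustrative only; Arrow's real format is a typed, contiguous binary layout, and the field names here are invented:

```python
# Row-wise: one record (dict) per event
rows = [
    {"pt": 45.0, "eta": 0.5},
    {"pt": 80.1, "eta": -1.2},
    {"pt": 22.5, "eta": 2.0},
]

# Column-wise: one contiguous sequence per field, as in Arrow's model
columns = {
    "pt":  [r["pt"] for r in rows],
    "eta": [r["eta"] for r in rows],
}

# A per-column operation touches only the data it needs,
# which is what makes pushing such queries into the storage
# layer (as SkyhookDM aims to do) efficient.
print(max(columns["pt"]))  # 80.1
```

Storing tables in this column-wise form is what lets a storage-side layer answer "give me column pt where pt > 30" without reading or deserializing the other columns.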
  • Integrating MadAnalysis and pyhf into RECAST_cli and REANA: RECAST is a framework for systematic reinterpretation of analyses prepared at the LHC using computational workflows. The workflows can run locally, or on REANA, a cloud-native distributed application. It consists of public and private web services, the “Frontend” and “Control Plane” components respectively, for managing reinterpretation requests from the phenomenology community. A RECAST_cli (command line interface) for truth reinterpretation, based on the Rivet package, has been previously developed. In its current state, the truth reinterpretation runs only on the user’s local machine. The goal of this project is threefold: add new backends to the RECAST_cli besides Rivet (e.g. MadAnalysis), add alternative statistical tools (e.g. pyhf), and run the workflows on REANA in the cloud. (Contact(s): Kyle Cranmer Matthew Feickert )
  • CMS RECAST example & documentation: RECAST is a platform for systematic reinterpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. Recently, IRIS-HEP and the HEP Software Foundation (HSF) supported an analysis preservation bootcamp at CERN teaching these tools. Furthermore, the ATLAS experiment is now actively using RECAST. We seek a member of CMS to incorporate a CMS analysis into the system with support from IRIS-HEP, REANA, and RECAST developers. (Contact(s): Kyle Cranmer )
  • Scaling HTTP caching with Kubernetes: The Open Science Grid (OSG) offers an integrated software stack and infrastructure used by the High-Energy Physics community to meet their computational needs. Frontier Squid is part of this software stack and acts as an HTTP proxy, caching requests to improve network usage. We seek a fellow to turn our existing single cache deployment into one that can scale horizontally using the same underlying storage for its cache. (Contact(s): Brian Lin Mátyás (Mat) Selmeci )
  • Securing container images for infrastructure services: The Open Science Grid (OSG) offers an integrated software stack used by the High-Energy Physics community to meet their computing needs. As container images and orchestration technologies grow in popularity, the OSG has started to distribute images through the Harbor container registry. We seek a fellow who will investigate image vulnerability scanning tools and configurations to improve the security posture of our images. (Contact(s): Brian Lin Mátyás (Mat) Selmeci )
  • Add network measuring to data federation caches: The Open Science Grid (OSG) offers an integrated software stack and infrastructure used by the High-Energy Physics community to meet their computational needs. To prepare for the orders of magnitude of more data from the High-Luminosity Large Hadron Collider (HL-LHC), experiments have begun deploying distributed data federations made up of caches and origins, to efficiently transfer and reuse data. We seek a fellow to join us in adding network performance tooling to our XCache Kubernetes deployments. (Contact(s): Brian Lin Mátyás (Mat) Selmeci )
  • Continuous Deployment on Kubernetes for OSG-LHC containerized services: The Open Science Grid (OSG) offers an integrated software stack and infrastructure used by the High-Energy Physics community to meet their computational needs. As part of this infrastructure, the OSG deploys central services that are verified in a testing environment before being moved to production. We seek a fellow who will join us in designing a Kubernetes-based continuous deployment model to automate production service updates. (Contact(s): Brian Lin Mátyás (Mat) Selmeci )
  • Interactive C++ for ML: Cling is an interactive C++ interpreter, built on top of the Clang and LLVM compiler infrastructure. Cling realizes the read-eval-print loop (REPL) concept, in order to enable rapid application development. Implemented as a small extension to LLVM and Clang, the interpreter reuses their strengths, such as concise and expressive compiler diagnostics. The LLVM-based C++ interpreter has enabled interactive C++ coding environments, whether in a standalone shell or a Jupyter notebook environment in xeus-cling. In this project we aim to demonstrate that interactive C++ is useful with data analysis tools outside of the field of HEP. For example, prototype implementations for PyTorch have been successfully completed. See this link. The project deliverables are: demonstrate that we can build and use PyTorch programs with Cling in a Jupyter notebook; develop several non-trivial ML tutorials using interactive C++ in Jupyter; and experiment with automatic differentiation via Clad of PyTorch code. Candidate requirements: experience with C++, Jupyter notebooks, and Conda is desirable; experience with ML; and interest in exploring the intersection of data science and interactive C++. (Contact(s): David Lange Vassil Vassilev )
  • Automating multi-container tests: The Open Science Grid (OSG) offers an integrated software stack used by the High-Energy Physics community to meet their computing needs. As container images and orchestration technologies grow in popularity, the OSG is exploring software distribution and deployment through these methods, and needs testing coverage to be able to develop its software with confidence. We seek a fellow to join us in developing a nightly testing pipeline that verifies end-to-end functionality of a multi-container OSG site. (Contact(s): Brian Lin )