Open IRIS-HEP fellow projects

This page lists a number of known software R&D projects of interest to IRIS-HEP researchers. (This page will be updated from time to time, so check back and reload to see if new projects have been added.) Contact the mentors for more information about any of these projects! Be sure you have read the guidelines.

  • Accelerating Uproot with AwkwardForth: Uproot is a Python library that reads and writes ROOT files, the file format for nearly all particle physics data. (Over an exabyte of data is stored in the ROOT format.) As described in this talk, Uproot can read data types with a columnar layout quickly, but data types with a record-oriented layout are hundreds of times slower. The same talk describes a solution: generating AwkwardForth code to read the data, rather than generating Python code, where AwkwardForth is a dialect of Forth specialized for deserializing record-oriented data into columnar data. A successful candidate would add routines in Python that generate AwkwardForth code to deserialize C++ objects into Awkward Arrays—a very multilingual experience! The successful candidate would also monitor performance: adding these routines is expected to speed up deserialization of types like std::vector<std::vector<float>> by over 100× (see talk and the accompanying paper). (Contact(s): Jim Pivarski Ianna Osborne )
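The record-oriented-to-columnar transformation at the heart of this project is easy to see in pure Python. The sketch below is a toy, not Uproot's actual deserialization code: it assumes a made-up byte layout (a length prefix per vector, big-endian float32 values) rather than real ROOT serialization, and it turns a stream resembling a std::vector<std::vector<float>> into the offsets-and-content form that Awkward Arrays use. An AwkwardForth program performs this same kind of loop, but in fast Forth bytecode instead of interpreted Python:

```python
import struct

def deserialize_vector_vector_float(data: bytes):
    """Toy deserializer: hypothetical layout of a uint32 outer length,
    then per inner vector a uint32 length followed by that many
    big-endian float32 values. Output is columnar: offsets + content."""
    pos = 0
    (n_outer,) = struct.unpack_from(">I", data, pos); pos += 4
    offsets = [0]   # offsets[i]:offsets[i+1] delimits inner vector i
    content = []    # one flat array of all floats
    for _ in range(n_outer):
        (n_inner,) = struct.unpack_from(">I", data, pos); pos += 4
        content.extend(struct.unpack_from(">%df" % n_inner, data, pos))
        pos += 4 * n_inner
        offsets.append(len(content))
    return offsets, content

# Build a payload for [[1.0, 2.0], [3.0]] and deserialize it.
payload = struct.pack(">I", 2)
payload += struct.pack(">I", 2) + struct.pack(">2f", 1.0, 2.0)
payload += struct.pack(">I", 1) + struct.pack(">f", 3.0)
offsets, content = deserialize_vector_vector_float(payload)
# offsets == [0, 2, 3], content == [1.0, 2.0, 3.0]
```

The per-element loop is exactly what makes record-oriented reading slow in Python and fast once compiled, which is the motivation for generating AwkwardForth instead.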
  • Muon Collider Tracking Software: A muon collider has been proposed as a possible path for future high-energy physics. The design of a detector for a muon collider has to cope with a large rate of beam-induced background, resulting in an unprecedentedly large multiplicity of particles entering the detector that are unrelated to the main muon-muon collision. The algorithms used for charged-particle reconstruction (tracking) need to cope with such “noise” and be able to successfully reconstruct the trajectories of the particles of interest, which results in a very large combinatorial problem that challenges the approaches adopted so far. The project consists of two possible phases. In the first, we will investigate how the tracking algorithms can be improved by utilizing directional information from specially arranged silicon-detector layers. In the second phase, if time allows, we will focus on optimizing the algorithm and the detector layout in order to take full advantage of such information. (Contact(s): Sergo Jindariani Simone Pagan Griso )
  • Developing tutorials with LLVM/Cling: LLVM/Cling enables a truly interactive C++ coding environment, whether in a standalone shell or a Jupyter notebook environment. Experience with other programming languages demonstrates that this sort of interactive environment is an excellent tool for learning. The project goal is to modernize current tutorial materials, and to create new tutorials based on current best practices in C++ using the interactive capabilities of Cling to provide an engaging student experience. Students are expected to have prior C++ experience and interest in explaining modern C++ constructs to others. (Contact(s): David Lange Vassil Vassilev )
  • Implementing the Hierarchical Cluster Trellis data structure in Julia: Hierarchical clustering is a fundamental task often used to discover meaningful structures in data. Specifically, jet reconstruction at the Large Hadron Collider (LHC) involves a hierarchical clustering of the jet constituents that hit the detector. We define the MAP hierarchy as the maximum likelihood hierarchical clustering given a dataset, i.e. the binary tree that maximizes the likelihood. Current algorithms are approximate and greedy (they locally maximize the clustering likelihood). Exactly performing inference on the MAP hierarchy and finding the partition function (the marginal likelihood over all binary trees) by enumerating all hierarchical clusterings over N elements is exceptionally difficult because the number of hierarchies grows extremely rapidly, as (2N-3)!!. To overcome this computational burden, we introduced in arXiv:2002.11661 a cluster trellis data structure that computes these quantities in O(3^N) time, without iterating over each possible hierarchy. While this reduces the super-exponential complexity to exponential, the current implementation, ClusterTrellis, is written in Python. Implementing this algorithm in Julia would therefore make it faster and feasible for the larger datasets that typically appear in data analyses at the LHC. We would also like to take advantage of Julia’s automatic differentiation capabilities. (Contact(s): Kyle Cranmer Sebastian Macaluso )
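To see why brute-force enumeration is hopeless while the trellis is not, one can compare the (2N-3)!! count of hierarchies against the trellis's 3^N cost. A small illustrative sketch (not part of ClusterTrellis itself):

```python
def num_hierarchies(n):
    """Number of distinct binary-tree hierarchies over n leaves: (2n-3)!!,
    the double factorial 1 * 3 * 5 * ... * (2n-3)."""
    result = 1
    for k in range(2 * n - 3, 0, -2):
        result *= k
    return result

# Super-exponential enumeration vs. the trellis's exponential 3^n cost:
for n in (5, 10, 20):
    print(n, num_hierarchies(n), 3 ** n)
```

Already for N = 10 there are 34,459,425 hierarchies versus 3^10 = 59,049 trellis operations, and the gap widens double-factorially from there.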
  • Adding HTTP backend to legacy data transfer server: In preparation for the High-Luminosity Large Hadron Collider (LHC), the High-Energy Physics community has begun to use HTTP as its future bulk data transfer protocol. To assist LHC sites with the transition to HTTP, the Open Science Grid (OSG) would like to develop and deploy a new backend to the legacy data transfer server (based on Globus GridFTP) that can read and write over HTTP. We seek a fellow to help us prototype an implementation and demonstrate the ability to provide transfer connectivity between a source GridFTP and a destination HTTP server. (Contact(s): Brian Lin )
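The end-to-end shape of such a transfer path can be sketched with the Python standard library alone: an HTTP endpoint that accepts PUT and serves GET, and a client that pushes bytes to it and reads them back. This is a toy stand-in to show the protocol flow only (no GridFTP, no authentication, hypothetical paths), not OSG or Globus code:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

STORE = {}  # in-memory object store standing in for the backend's storage

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        STORE[self.path] = self.rfile.read(length)
        self.send_response(201)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        body = STORE.get(self.path, b"")
        self.send_response(200 if self.path in STORE else 404)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# "Transfer": write bytes over HTTP, then read them back.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("PUT", "/store/user/file.root", body=b"payload bytes")
conn.getresponse().read()           # drain the PUT response
conn.request("GET", "/store/user/file.root")
data = conn.getresponse().read()    # the bytes we stored
server.shutdown()
```

The real project replaces the dictionary with the GridFTP server's data path, but the PUT/GET round trip above is the connectivity being prototyped.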
  • Swift for TensorFlow in HEP: Recently, Google announced Swift for TensorFlow (Swift4TF), a next-generation system for deep learning and differentiable computing. We would like to explore Swift4TF in the context of HEP, build an ML model using one of the CMS use cases, and benchmark it against traditional Python-based frameworks (PyTorch, Keras+TF, etc.). In particular, we would like to understand whether a Swift4TF compiler-optimized model can outperform Python-based training on specific hardware resources (GPUs, TPUs), and whether it can provide a significant performance boost at the inference phase for different HEP-specific ML models. (Contact(s): David Lange Valentin Kuznetsov )
  • Arrow-Native Storage with SkyhookDM and Coffea processor Integration: Apache Arrow is a columnar in-memory format for seamless data transfer between different big data systems; it mitigates the need for serializing and deserializing data and has native abstractions for use in big data storage systems. We aim to convert SkyhookDM into an Arrow-native storage system by utilizing the object-class SDK provided by Ceph to add a layer on the storage side, built with the Arrow C++ SDK, that allows querying and processing of tabular datasets stored as objects in the Apache Arrow format. The Coffea framework is a Python package that provides several recipes and utilities to aid in the development of HEP analyses following the columnar analysis paradigm. The goal is to integrate Coffea with SkyhookDM to achieve Dask- or Spark-based parallel computation on ROOT files stored as objects in Ceph OSDs. (Contact(s): Jeff LeFevre Ivo Jimenez Carlos Maltzahn )
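The row-versus-column distinction that motivates this design can be shown in a few lines of plain Python (field names and values are made up; SkyhookDM itself would hold real Arrow arrays in C++):

```python
# Row-oriented records, as a record-oriented store would hold them.
rows = [
    {"pt": 25.3, "eta": 0.5, "charge": 1},
    {"pt": 40.1, "eta": -1.2, "charge": -1},
    {"pt": 13.7, "eta": 2.1, "charge": 1},
]

# Columnar layout in the spirit of Arrow: one contiguous array per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A projection + selection pushed down to storage touches only the
# columns it needs, never the full records:
selected_pt = [pt for pt, q in zip(columns["pt"], columns["charge"])
               if q > 0]
# selected_pt == [25.3, 13.7]
```

Executing this kind of query inside the Ceph object class, on data already laid out column-by-column, is what "processing in the storage side" means here.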
  • Integrating MadAnalysis and pyhf into RECAST_cli and REANA: RECAST is a framework for systematic reinterpretation of analyses prepared at the LHC using computational workflows. The workflows can run locally or on REANA, a cloud-native distributed application. RECAST consists of public and private web services, the “Frontend” and “Control Plane” components respectively, for managing reinterpretation requests from the phenomenology community. A RECAST_cli (command-line interface) for truth-level reinterpretation, based on the Rivet package, has been previously developed. In its current state, the truth-level reinterpretation runs only on the user’s local machine. The goal of this project is threefold: add new backends to the RECAST_cli besides Rivet (e.g. MadAnalysis), add alternative statistical tools (e.g. pyhf), and run the workflows on REANA in the cloud. (Contact(s): Kyle Cranmer Matthew Feickert )
  • CMS RECAST example & documentation: RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. Recently, IRIS-HEP and the HEP Software Foundation (HSF) supported an analysis preservation bootcamp at CERN teaching these tools. Furthermore, the ATLAS experiment is now actively using RECAST. We seek a member of CMS to incorporate a CMS analysis into the system with support from IRIS-HEP, REANA, and RECAST developers. (Contact(s): Kyle Cranmer )
  • Translating analyses into prototype analysis systems: This project is situated in the Analysis Systems focus area. An existing realistic physics analysis will be translated into new prototype analysis systems. The new implementation will be benchmarked and compared to existing traditional approaches. An example analysis for this is the search for Higgs boson decays to four leptons with CMS open data. This channel significantly contributed to the Higgs boson discovery in 2012. In this project, it will be translated to a new approach, making use of tools like coffea and awkward-array. (Contact(s): )
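The columnar analysis paradigm these tools implement can be sketched in plain Python (a toy with made-up values, standing in for the jagged-array operations that coffea and awkward-array provide):

```python
# Jagged event data: one list of lepton pTs (GeV) per event.
lep_pt = [[28.0, 12.0, 7.0], [45.0], [33.0, 21.0, 18.0, 9.0]]

# Columnar-style cut: keep leptons with pt > 10 GeV in every event at
# once, then build an event mask requiring at least two surviving leptons.
selected = [[pt for pt in evt if pt > 10.0] for evt in lep_pt]
mask = [len(evt) >= 2 for evt in selected]
n_pass = sum(mask)  # events surviving the selection
```

In a real translation these whole-column expressions replace the per-event loop of a traditional ROOT macro, which is where the performance and readability gains come from.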
  • Active learning for more efficient generation of Monte Carlo for exclusion plots: An active learning algorithm called excursion has been developed to make smart choices for the parameters of signal Monte Carlo (e.g. for a SUSY scan), requiring many fewer signal samples to produce a quality exclusion contour. The project would be to interface excursion with either the ATLAS or CMS production system and REANA/RECAST. Ideally, the excursion algorithm would run on a machine with a GPU and be upgraded to include advances in scalable Gaussian processes (e.g. GPyTorch). (Contact(s): )
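The idea can be illustrated with a 1D toy. This is not the excursion package (which uses Gaussian-process models over the full parameter space); here a simple bisection-style acquisition rule stands in for the uncertainty-based one, and true_excluded is a hypothetical stand-in for an expensive Monte Carlo production plus limit computation:

```python
def true_excluded(x):
    """Hypothetical stand-in for an expensive MC + limit computation:
    points with x < 1.5 are excluded in this toy."""
    return x < 1.5

def acquire(candidates, labeled):
    """Query the unlabeled candidate closest to the current boundary
    estimate (midpoint of the closest excluded/allowed pair)."""
    excluded = max(x for x, e in labeled if e)
    allowed = min(x for x, e in labeled if not e)
    target = 0.5 * (excluded + allowed)
    pool = [c for c in candidates if c not in dict(labeled)]
    return min(pool, key=lambda c: abs(c - target))

candidates = [0.1 * i for i in range(31)]   # a 1D "scan" grid of 31 points
labeled = [(0.0, true_excluded(0.0)), (3.0, true_excluded(3.0))]
for _ in range(5):                          # only 5 expensive evaluations
    x = acquire(candidates, labeled)
    labeled.append((x, true_excluded(x)))

excluded = max(x for x, e in labeled if e)
allowed = min(x for x, e in labeled if not e)
boundary = 0.5 * (excluded + allowed)       # ~1.45, near the true 1.5
```

Seven evaluations in total locate the exclusion boundary to within the grid spacing, instead of running Monte Carlo at all 31 grid points; excursion applies the same economy in higher-dimensional SUSY parameter spaces.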
  • A pyhf converter for binned likelihood models in CMS Combine: Binned likelihood models based on template histograms are ubiquitous in both ATLAS and CMS. Within ATLAS, the HistFactory tool is used widely (sometimes from a higher-level tool like HistFitter or TRExFitter). Within CMS, the Combine tool is widely used. Both produce RooFit workspaces. Recently, the HistFactory specification was implemented in a pure-Python environment called pyhf, which can take advantage of GPU acceleration, automatic differentiation, etc. via backends like TensorFlow, PyTorch, and JAX. In addition, the pyhf model uses a JSON schema, which has benefits for digital publishing and reinterpretation. We seek a fellow to develop a converter for binned template likelihoods from the CMS Combine syntax to the pyhf specification and to develop tools to perform comparisons between the two models. (Contact(s): Kyle Cranmer Alexander Held )
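For reference, the target of such a converter is the pyhf JSON workspace format. A minimal single-channel example (with made-up yields) looks like this; a Combine-to-pyhf converter would emit a document of this shape from a Combine datacard:

```python
import json

# A minimal HistFactory-style workspace in the pyhf JSON format: one
# channel, a signal sample scaled by the POI "mu", and a fixed background.
workspace = {
    "channels": [
        {
            "name": "singlechannel",
            "samples": [
                {
                    "name": "signal",
                    "data": [5.0, 10.0],
                    "modifiers": [
                        {"name": "mu", "type": "normfactor", "data": None}
                    ],
                },
                {"name": "background", "data": [50.0, 60.0],
                 "modifiers": []},
            ],
        }
    ],
    "observations": [{"name": "singlechannel", "data": [53.0, 65.0]}],
    "measurements": [
        {"name": "Measurement",
         "config": {"poi": "mu", "parameters": []}}
    ],
    "version": "1.0.0",
}

# The serialized JSON is the publishable, reinterpretable artifact.
serialized = json.dumps(workspace, indent=2)
```

Because the whole model is a plain JSON document, comparing the converted model against the original Combine workspace reduces to evaluating both likelihoods on the same parameter points.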
  • Automating multi-container tests: The Open Science Grid (OSG) offers an integrated software stack used by the High-Energy Physics community to meet their computing needs. As container images and orchestration technologies grow in popularity, the OSG is exploring software distribution and deployment through these methods and needs testing coverage to be able to develop its software with confidence. We seek a fellow to join us in developing a nightly testing pipeline that verifies end-to-end functionality of a multi-container OSG site. (Contact(s): Brian Lin )