Open IRIS-HEP fellow projects

Contact the mentors for more information about any of these projects!

  • Matrix Element Method with Deep Learning and Automatic Differentiation : The Matrix Element Method (MEM) is a powerful technique for high energy physics, but it is computationally expensive. Recent work on approximating MEM calculations with deep learning regression has shown a dramatic reduction in computation time. The goal of this fellowship project would be to explore ways to improve upon the established deep learning techniques through contributions to the JAX-based pyMELA framework. If time allows, the project could also study the effects of using automatic differentiation to improve the interpolation of grid points. (Contact(s): Mark Neubauer, Matthew Feickert )
  • Swift for TensorFlow in HEP : Recently, Google announced Swift for TensorFlow (Swift4TF), a project positioned as a next-generation system for deep learning and differentiable computing. We would like to explore Swift4TF in the context of HEP, build an ML model using one of the CMS use cases, and benchmark it against traditional Python-based frameworks (PyTorch, Keras+TF, etc.). In particular, we would like to understand whether a model optimized by the Swift4TF compiler can outperform Python-based training on specific hardware resources (GPUs, TPUs), and whether it can provide a significant performance boost in the inference phase across different HEP-specific ML models. (Contact(s): David Lange )
  • Arrow-Native Storage with SkyhookDM and Coffea processor Integration : Apache Arrow is a columnar in-memory format for seamless data transfer between different big data systems. It mitigates the need to serialize and deserialize data and has native abstractions for use in big data storage systems. We aim to convert SkyhookDM into an Arrow-native storage system by using the Object class SDK provided by Ceph together with the Arrow C++ SDK to add a layer on the storage side, which allows tabular datasets stored as objects in Apache Arrow format to be queried and processed there. The Coffea framework is a Python package that provides several recipes and utilities to aid the development of HEP analyses following the columnar analysis paradigm. The goal is to integrate Coffea with SkyhookDM to achieve Dask- or Spark-based parallel computations on ROOT files stored as objects in Ceph OSDs. (Contact(s): )
  • Floating point error evaluation with Clad : Estimating floating point computation errors is as important as the computation itself. Accurate error estimation requires processing the code, arithmetic operations, and assignments for each input variable and each dependent intermediate variable. It is virtually impossible to make an accurate error estimate by hand, and in the cases where it is possible, doing so can make the code less readable and maintainable. One approach is to apply automatic differentiation (AD) techniques. AD decomposes the computation graph into atomic operations, to which the rules of differential calculus are then applied to produce a derivative. Adding a set of extra rules to estimate the floating point error of automatically differentiated functions is straightforward.
    Clad is a source transformation AD tool for C++. It is based on the LLVM compiler infrastructure and is implemented as a plugin for the C++ compiler Clang, which allows Clad to be transparently integrated into the compilation phase and to reuse large parts of the compiler itself. Clad relies on Clang’s parsing and code generation functionality and can differentiate complicated C++ constructs in both forward and reverse mode. It is available as a standalone Clang plugin that, when attached to the compiler, produces derivatives in the compilation phase.
    The project aims to develop a generic error estimation framework that is not bound to a particular error approximation model. It should allow users to select their preferred estimation logic and should automatically generate functions augmented with code for the specified error estimator. (Contact(s): David Lange, Vassil Vassilev )
  • A pyhf converter for binned likelihood models in CMS Combine : Binned likelihood models based on template histograms are ubiquitous in both ATLAS and CMS. Within ATLAS the HistFactory tool is widely used (sometimes through a higher-level tool like HistFitter or TRExFitter). Within CMS the Combine tool is widely used. Both produce RooFit workspaces. Recently, the HistFactory specification was implemented in a pure Python environment called pyhf, which can take advantage of GPU acceleration, automatic differentiation, etc. via backends like TensorFlow, PyTorch, and JAX. In addition, the pyhf model uses a JSON schema, which has benefits for digital publishing and reinterpretation. We seek a fellow to develop a converter for binned template likelihoods from the CMS Combine syntax to the pyhf specification and to develop tools for comparing the two models. (Contact(s): )
  • Integrating MadAnalysis and pyhf into RECAST_cli and REANA : RECAST is a framework for systematic reinterpretation of analyses prepared at the LHC using computational workflows. The workflows can run locally or on REANA, a cloud-native distributed application. It consists of public and private web services, the “Frontend” and “Control Plane” components respectively, for managing reinterpretation requests from the phenomenology community. A RECAST_cli (command line interface) for truth reinterpretation, based on the Rivet package, has been developed previously. In its current state, the truth reinterpretation runs only on the user’s local machine. The goal of this project is threefold: add new backends to the RECAST_cli besides Rivet (e.g. MadAnalysis), add alternative statistical tools (e.g. pyhf), and run the workflows on REANA in the cloud. (Contact(s): )
  • Reading CMS Run 1/2 miniAOD files with ServiceX and func_adl : ServiceX is a distributed, cloud-native application that extracts columnar data from HEP event data and delivers it to an analyst. The func_adl data query language is used to tell ServiceX how to extract the data (which columns, which simple cuts, etc.). func_adl has two backends that are currently part of ServiceX: one based on C++ for ATLAS data and one based on columnar processing using uproot and awkward array. The C++ backend currently runs only on the ATLAS binary format, xAOD. The backend is written in Python. This project will modify the C++ backend to also run on CMS miniAOD binary files (Run 1/Run 2). (Contact(s): )
  • Translating analyses into prototype analysis systems : This project is situated in the Analysis Systems focus area. An existing realistic physics analysis will be translated into new prototype analysis systems. The new implementation will be benchmarked and compared to existing traditional approaches. An example analysis for this is the search for Higgs boson decays to four leptons with CMS open data. This channel significantly contributed to the Higgs boson discovery in 2012. In this project, it will be translated to a new approach, making use of tools like coffea and awkward-array. (Contact(s): )
  • Graph Methods for Particle Tracking : Particle track reconstruction is a critical software component for meeting the physics goals of the HL-LHC. This project uses graph-based learning methods to address this problem; the overall goal is to do accelerated track-seeding at the trigger level by using Graph Neural Networks to construct tracklets in the ATLAS/CMS pixel detectors and implementing the algorithms on dedicated FPGAs. There are two areas within this work that a fellow could contribute to (and this project could support two separate fellows depending on interest): The first is on the Machine Learning side by implementing, training, and evaluating additional GNN architectures and graph construction methods. The second is on the hardware and acceleration side by working on translating the graph components of these algorithms efficiently onto FPGAs. (Contact(s): )
  • Exploring Floating Point Compression of HEP data with zfp : Streams of floating point data representing observed or simulated data are notoriously difficult to compress. Compression algorithms often work by looking for repetition of data within a bytestream. Floating point data, however, often contains additional noise beyond the accuracy of the underlying instrument or simulation; this removes sources of repetitive data. By accepting data loss beyond a certain threshold, and by taking advantage of the known structure of floating point representations, floating point compression libraries such as zfp can manipulate the floating point values to make them highly compressible. Some HEP data formats, like CMS’s MiniAOD format, already use rudimentary techniques (e.g., zeroing out part of the mantissa) to make data more compressible when the accuracy of the input data is known. We seek a fellow interested in taking data samples from CMS MiniAOD and comparing the performance (CPU cost, compression ratios, memory) of zfp with these existing techniques. By the end of the project, we aim to characterize the potential space savings in MiniAOD from using these libraries and to have a prototype version of a MiniAOD file that is zfp-compressed. (Contact(s): )
  • Probabilistic Programming for COVID : The PPX protocol developed for the Etalumis project allows an external inference algorithm to control the execution of a simulator. Originally the simulator was the HEP event generator Sherpa. The same tools are now being used to control epidemiological simulators for inference and planning (see here and here). We seek a fellow to join the effort on PPX / probprog and COVID simulators. The project deliverable would involve pedagogical examples and documentation to help other epidemiologists do the same with their own simulators. (Contact(s): )
  • Contributions to Pythonic jet tools for Scikit-HEP : Scikit-HEP has a tool called pyjet, which provides an interface between FastJet and NumPy. The tool was developed prior to Uproot and Awkward Array, which extend columnar (NumPy-like) analysis to High Energy Physics. As of version 3.3, FastJet itself ships with a Python interface, but it was not designed with columnar analysis in mind. Pyjet, on the other hand, is incomplete: it lacks some functionality related to jet substructure. This project would first investigate the design considerations of the existing Python bindings with Scikit-HEP’s Uproot and Awkward Array in mind and then contribute to improving the jet tools for the Scikit-HEP ecosystem. (Contact(s): )
  • Active learning for more efficient generation of Monte Carlo for exclusion plots : An active learning algorithm called excursion has been developed to make smart choices for the parameters of signal Monte Carlo (e.g. for a SUSY scan), so that many fewer signal samples are required to produce a quality exclusion contour. The project would be to interface excursion with either the ATLAS or CMS production system and REANA/RECAST. Ideally, the excursion algorithm would run on a machine with a GPU and be upgraded to include advances in scalable Gaussian Processes (e.g. GPyTorch). (Contact(s): )
  • CMS RECAST example & documentation : RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. Recently, IRIS-HEP and the HEP Software Foundation (HSF) supported an analysis preservation bootcamp at CERN teaching these tools. Furthermore, the ATLAS experiment is now actively using RECAST. We seek a member of CMS to incorporate a CMS analysis into the system with support from IRIS-HEP, REANA, and RECAST developers. (Contact(s): )
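
As an illustration of the idea behind the "Floating point error evaluation with Clad" project above, the following is a minimal sketch in Python rather than C++. It does not use Clad and does not reflect Clad's API. It records a computation on a small tape, runs reverse-mode AD to obtain the adjoint of every input and intermediate value, and sums a per-value error contribution. The names Var, topological_order, and estimate_error are hypothetical, and the default first-order model |adjoint * value| * epsilon is an assumption chosen for illustration. The model is passed in as a parameter to mirror the project's goal of not being bound to one error approximation model.

```python
import sys

EPS = sys.float_info.epsilon  # machine epsilon for IEEE 754 double precision


class Var:
    """A value recorded on a tape, linked to the values it was computed from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local partial derivative)
        self.adjoint = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def topological_order(output):
    """Return every node reachable from output, parents before children."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    return order


def first_order_model(value, adjoint):
    """Assumed model: each value contributes |d(output)/d(value) * value| * EPS."""
    return abs(adjoint * value) * EPS


def estimate_error(output, model=first_order_model):
    """Run a reverse pass, then sum the model's contribution over all nodes."""
    order = topological_order(output)
    for node in order:
        node.adjoint = 0.0
    output.adjoint = 1.0
    # Children are processed before their parents, so each adjoint is
    # complete by the time it is propagated further.
    for node in reversed(order):
        for parent, partial in node.parents:
            parent.adjoint += node.adjoint * partial
    return sum(model(node.value, node.adjoint) for node in order)


# Usage: f(x, y) = x * y + x
x, y = Var(3.0), Var(4.0)
f = x * y + x
error_bound = estimate_error(f)
```

Swapping in a different callable for model changes the estimation logic without touching the differentiation code, which is the separation the project description asks for.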
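
The "zeroing out the mantissa" technique mentioned in the zfp project above can be sketched in a few lines of Python. This is not zfp and not the MiniAOD implementation. It is an illustrative stand-in that clears the low bits of IEEE 754 single-precision values and measures the effect on a general-purpose compressor (zlib is used here as an assumption). The function names, the synthetic Gaussian samples, and the choice of keeping 10 mantissa bits are all hypothetical.

```python
import random
import struct
import zlib


def truncate_mantissa(value, keep_bits):
    """Zero all but the top keep_bits of the 23-bit float32 mantissa."""
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 0 and 23")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    # Keep sign, exponent, and the leading mantissa bits; clear the rest.
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    (result,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return result


def compressed_size(values):
    """Size in bytes of the values stored as float32 and compressed with zlib."""
    payload = struct.pack(f"<{len(values)}f", *values)
    return len(zlib.compress(payload, 6))


# Synthetic stand-in for a column of measured quantities (not real CMS data).
rng = random.Random(42)
samples = [rng.gauss(50.0, 10.0) for _ in range(10_000)]
truncated = [truncate_mantissa(v, 10) for v in samples]

full_size = compressed_size(samples)
truncated_size = compressed_size(truncated)
print(f"full precision: {full_size} bytes, truncated: {truncated_size} bytes")
```

Keeping 10 mantissa bits bounds the relative truncation error by roughly 2^-10, so the number of bits kept would in practice be chosen from the known accuracy of the input data.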