Caching Data for LHC Analysis



Significant portions of LHC analysis use the same datasets, running over each dataset several times. Cache-based approaches are therefore an opportunity to improve the efficiency of CPU use (via reduced latency) and of the network (via reduced WAN traffic). We are investigating the use of regional caches to store, on demand, certain datasets relevant to analysis use cases. The aim of the caches is to speed up overall analysis and reduce overall network resource consumption, both of which are predicted to be significant challenges in the HL-LHC era as data volumes and event counts increase.

In Southern California, the UCSD CMS Tier-2 and Caltech CMS Tier-2 joined forces to create and maintain a regional cache, commonly referred to as the “CMS SoCal cache”, that benefits all Southern California CMS researchers. The SoCal cache was augmented by a joint project with ESnet, which integrated a caching server into the SoCal cache. The server is deployed at the ESnet point of presence in Sunnyvale, CA, but is managed by staff at UCSD through the PRP project’s Kubernetes-based Nautilus cluster.

A recent ESnet study examined the network savings of the SoCal cache. The study analyzed the XRootD monitoring records from the XCache servers and showed a factor of 3 reduction in network bandwidth over the analyzed period.

Network utilization savings

Network utilization reduction ratio in terms of (a) number of accesses and (b) volume transferred.
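As a rough illustration of how such a reduction ratio could be computed from per-access monitoring records, the sketch below assumes each record carries a hit/miss flag and a byte count. The record layout and field names are hypothetical, and the ratio is assumed to be everything served to clients divided by what had to be fetched over the WAN on a miss; the study's exact definition is not given here, so this is only one plausible reading.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Access:
    """One cache access. Hypothetical layout, not the XRootD record schema."""
    hit: bool     # True if served from the cache, False if fetched from the origin
    nbytes: int   # bytes delivered to the client


def reduction_ratios(accesses: Iterable[Access]) -> Tuple[float, float]:
    """Return (ratio by number of accesses, ratio by volume).

    Assumed definition: total served to clients / portion that crossed the WAN.
    """
    total_n = total_b = miss_n = miss_b = 0
    for a in accesses:
        total_n += 1
        total_b += a.nbytes
        if not a.hit:
            miss_n += 1
            miss_b += a.nbytes
    if miss_n == 0 or miss_b == 0:
        raise ValueError("no miss traffic recorded; ratio is undefined")
    return total_n / miss_n, total_b / miss_b


# Made-up sample: 6 accesses (2 misses), 450 bytes served (150 bytes on misses).
sample = [
    Access(False, 100), Access(True, 100), Access(True, 100),
    Access(False, 50), Access(True, 50), Access(True, 50),
]
by_count, by_volume = reduction_ratios(sample)
```

With this made-up sample both ratios come out to 3, meaning the cache served three times as much as it pulled over the WAN.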

The aforementioned study also demonstrated that accesses to the cache are evenly distributed among the different servers that make up the SoCal cache.

SoCal hits and misses
Distribution of (a) misses and (b) hits in the SoCal cache

The above shows the distribution of hits and misses among servers in the SoCal cache.

The IRIS-HEP team worked with CMS to set up a monitoring page showing the popularity of the analyzed data in the SoCal cache. The page provides guidance on how the popularity of files in the namespace evolves.

CMS data popularity

The above shows the distribution of accesses in terms of volume of the CMS analysis tasks by data campaign.

Open Science Data Federation (OSDF)

Similarly to the LHC experiments, the OSG has deployed a set of caches and origins that serve both public and authenticated data from diverse experiments and individual researchers. The following image shows the location of the different origins and caches that make up the federation.

OSDF map
Open Science Data Federation

Location of the different caches and origins within the OSDF.

For more information on how to join the OSDF, please visit the following link.

Monitoring improvements

During the past couple of years, a significant amount of effort was dedicated to understanding and resolving the issues affecting the collection of XRootD monitoring data. A first study, XRootD Monitoring Validation, was done to understand the data loss and found that the cause was a common UDP issue known as “UDP packet fragmentation”. A second study, XRootD Monitoring Scale Validation, was carried out to find the limitations of the monitoring collector when used at a higher scale.
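To make the fragmentation issue concrete, the following sketch estimates when a UDP datagram is split into several IP fragments. It assumes IPv4 without header options and a standard Ethernet MTU of 1500 bytes; real paths may have a smaller MTU, and the helper names are illustrative only. Because a datagram is discarded if any one of its fragments is lost, larger monitoring packets are more exposed to loss.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def max_unfragmented_payload(mtu: int = 1500) -> int:
    """Largest UDP payload that fits in a single IPv4 packet for the given MTU."""
    return mtu - IPV4_HEADER - UDP_HEADER


def will_fragment(payload_len: int, mtu: int = 1500) -> bool:
    """True if a UDP payload of this size must be split into IP fragments."""
    return payload_len > max_unfragmented_payload(mtu)


def fragment_count(payload_len: int, mtu: int = 1500) -> int:
    """Number of IPv4 fragments needed to carry the datagram.

    Every fragment except the last carries a multiple of 8 bytes of data,
    and the UDP header travels as part of the fragmented IP payload.
    """
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil((payload_len + UDP_HEADER) / per_fragment)


limit = max_unfragmented_payload()   # 1472 bytes for a 1500-byte MTU
needs_split = will_fragment(4000)    # a 4000-byte monitoring packet is fragmented
pieces = fragment_count(4000)        # and travels as 3 fragments
```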

As a result of the first study mentioned above, a new component called the shoveler was introduced into the monitoring infrastructure to prevent data loss due to UDP packet fragmentation. As depicted in the next figure, this lightweight component uses a secure and reliable channel to communicate the monitoring data from the XRootD servers to the central monitoring collector operated by OSG.

Shoveler diagram
The shoveler

The shoveler is deployed in between the XRootD server(s) and the XRootD Collector to ensure a reliable channel.
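The general pattern behind such a component can be sketched as follows: accept monitoring datagrams close to the server, buffer them, and forward them over a transport that confirms delivery. This is not the shoveler's actual implementation. The `Shovel` class, the `send` callable standing in for the reliable channel, and the port number are all hypothetical names chosen for illustration.

```python
import socket
from collections import deque
from typing import Callable, Deque


class Shovel:
    """Buffers monitoring packets and forwards them over a reliable channel.

    `send` is a hypothetical stand-in for the reliable transport; it must
    return True only once the packet has been acknowledged.
    """

    def __init__(self, send: Callable[[bytes], bool], max_buffered: int = 10000):
        self._send = send
        # When the buffer is full, the oldest packet is dropped first.
        self._queue: Deque[bytes] = deque(maxlen=max_buffered)

    def enqueue(self, packet: bytes) -> None:
        self._queue.append(packet)

    def flush(self) -> int:
        """Forward buffered packets in order, stopping at the first failure.

        Returns the number of packets delivered; undelivered packets stay queued.
        """
        delivered = 0
        while self._queue:
            if not self._send(self._queue[0]):
                break
            self._queue.popleft()
            delivered += 1
        return delivered

    def pending(self) -> int:
        return len(self._queue)


def serve_udp(shovel: Shovel, host: str = "127.0.0.1", port: int = 9993) -> None:
    """Receive datagrams locally and hand them to the shovel (port is illustrative)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            packet, _addr = sock.recvfrom(65535)
            shovel.enqueue(packet)
            shovel.flush()
```

Keeping the UDP hop local to the server and retrying over the acknowledged channel means a temporary outage of the upstream link delays packets rather than losing them, up to the buffer limit.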

Finally, software improvements to the OSG collector have enabled us to start collecting and analyzing g-stream data, which is the XRootD monitoring stream that includes cache-specific events. The figure below shows an example of the g-stream data being collected from the caches in the OSDF.

GRACC g-stream
XRootD g-stream monitoring data

An example of the g-stream data collected from the caches in the OSDF.


Currently XCache is distributed by the OSG Software Stack (authored by the OSG-LHC area) both in the form of native packages for RedHat Enterprise Linux (RPMs) and container (Docker) images. Repositories of note include: