Caching Analysis Data

Significant portions of LHC analysis use the same datasets, running over each dataset several times. Hence, we can use cache-based approaches to improve the efficiency of CPU use (via reduced latency) and of the network (via reduced WAN traffic). We are investigating the use of regional caches to store certain datasets on demand.

In Southern California, the UCSD CMS Tier-2 and Caltech CMS Tier-2 joined forces to create and maintain a regional cache, commonly referred to as the “CMS SoCal cache”, that benefits all Southern California CMS researchers.

Later, ESnet approached the SoCal CMS group to integrate a caching server into the SoCal cache. The server is deployed at the ESnet PoP in Sunnyvale, CA, but it is managed by UCSD via the PRP Kubernetes cluster.

A recent study, led by ESnet, of the network savings of the SoCal cache was carried out by analyzing the XRootD monitoring records from the XCache servers. The results showed a factor of 3 reduction in network bandwidth over the analyzed period.

Network utilization savings

Network utilization reduction ratio in terms of (a) number of accesses and (b) volume transferred.
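
As a rough illustration of how such ratios could be derived from per-access records, the sketch below assumes the reduction ratio is the total demand divided by the part that had to be fetched over the WAN (the misses). Both that definition and the record layout (`Access`, `server`, `bytes_read`, `hit`) are assumptions made for this example; they are not taken from the ESnet study or from the XRootD monitoring schema, and the server names and numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Access:
    """One hypothetical cache access record (not the real XRootD monitoring schema)."""
    server: str      # made-up cache server name
    bytes_read: int  # bytes delivered to the client for this access
    hit: bool        # True if served from the cache, False if fetched over the WAN


def reduction_ratios(accesses):
    """Return (access_ratio, volume_ratio).

    Assumed definition: total demand divided by the demand that missed the
    cache, counted once by number of accesses and once by bytes.
    """
    total_n = len(accesses)
    total_bytes = sum(a.bytes_read for a in accesses)
    miss_n = sum(1 for a in accesses if not a.hit)
    miss_bytes = sum(a.bytes_read for a in accesses if not a.hit)
    # With no misses nothing crossed the WAN, so the ratio is unbounded.
    access_ratio = total_n / miss_n if miss_n else float("inf")
    volume_ratio = total_bytes / miss_bytes if miss_bytes else float("inf")
    return access_ratio, volume_ratio


# Illustrative data only: 6 accesses, 2 of which are misses.
sample = [
    Access("cache-1", 100, False),
    Access("cache-1", 100, True),
    Access("cache-2", 100, True),
    Access("cache-2", 300, False),
    Access("cache-1", 300, True),
    Access("cache-2", 300, True),
]
access_ratio, volume_ratio = reduction_ratios(sample)
print(access_ratio, volume_ratio)
```

Under this assumed definition, a ratio of 3 means that only one third of the requested accesses (or bytes) had to be served from outside the cache.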

The same study also showed that accesses to the cache are evenly distributed among the servers that make up the SoCal cache.

SoCal hits and misses
Distribution of (a) misses and (b) hits in the SoCal cache.

The above shows the distribution of hits and misses among the servers that make up the SoCal cache.

We also engaged with CMS to have a monitoring page that shows the popularity of the analyzed data. This helps us consider changes to the namespace definition for what we cache.

CMS data popularity

The above shows the distribution of accesses, in terms of volume, of the CMS analysis tasks by data campaign.


Currently, XCache is distributed by the OSG both as RPMs and as Docker images. The following are the corresponding repositories where the base code can be found: