Publications

Max Aehle, Mihaly Novak, Vassil Vassilev, Nicolas R. Gauger, et. al., Optimization Using Pathwise Algorithmic Derivatives of Electromagnetic Shower Simulations (2024).

Among the well-known methods to approximate derivatives of expectancies computed by Monte-Carlo simulations, averages of pathwise derivatives are often the easiest one to apply. Computing them via algorithmic differentiation typically does not require major manual analysis and rewriting of the code, even for very complex programs like simulations of particle-detector interactions in high-energy physics. However, the pathwise derivative estimator can be biased if there are discontinuities in the program, which may diminish its value for applications.
This work integrates algorithmic differentiation into the electromagnetic shower simulation code HepEmShow based on G4HepEm, allowing us to study how well pathwise derivatives approximate derivatives of energy depositions in a sampling calorimeter with respect to parameters of the beam and geometry. We found that when multiple scattering is disabled in the simulation, means of pathwise derivatives converge quickly to their expected values, and these are close to the actual derivatives of the energy deposition. Additionally, we demonstrate the applicability of this novel gradient estimator for stochastic gradient-based optimization in a model example.

Kim Liegeois, Brian Kelley, Eric Phipps, Sivasankaran Rajamanickam and Vassil Vassilev, Performance Portable Gradient Computations Using Source Transformation (2024).

Derivative computation is a key component of optimization, sensitivity analysis, uncertainty quantification, and nonlinear solvers. Automatic differentiation (AD) is a powerful technique for evaluating such derivatives, and in recent years, has been integrated into programming environments such as Jax, PyTorch, and TensorFlow to support derivative computations needed for training of machine learning models, resulting in widespread use of these technologies. The C++ language has become the de facto standard for scientific computing due to numerous factors, yet language complexity has made the adoption of AD technologies for C++ difficult, hampering the incorporation of powerful differentiable programming approaches into C++ scientific simulations. This is exacerbated by the increasing emergence of architectures such as GPUs, which have limited memory capabilities and require massive thread-level concurrency. Portable scientific codes rely on domain specific programming models such as Kokkos making AD for such codes even more complex.
In this paper, we will investigate source transformation-based automatic differentiation using Clad to automatically generate portable and efficient gradient computations of Kokkos-based code. We discuss the modifications of Clad required to differentiate Kokkos abstractions. We will illustrate the feasibility of our proposed strategy by comparing the wall-clock time of the generated gradient code with the wall-clock time of the input function on different cutting edge GPU architectures such as NVIDIA H100, AMD MI250x, and Intel Ponte Vecchio GPU. For these three architectures and for the considered example, evaluating up to 10,000 entries of the gradient only took up to 2.17 times the wall-clock time of evaluating the input function.

Garima Singh, Baidyanath Kundu, Harshitha Menon, Alexander Penev, et. al., Fast And Automatic Floating Point Error Analysis With CHEF-FP 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 608 1018-1028 (2023).

As we reach the limit of Moore’s Law, researchers are exploring different paradigms to achieve unprecedented performance. Approximate Computing (AC), which relies on the ability of applications to tolerate some error in the results to trade-off accuracy for performance, has shown significant promise. Despite the success of AC in domains such as Machine Learning, its acceptance in High-Performance Computing (HPC) is limited due to its stringent requirement of accuracy. We need tools and techniques to identify regions of the code that are amenable to approximations and their impact on the application output quality so as to guide developers to employ selective approximation. To this end, we propose CHEF-FP, a flexible, scalable, and easy-to-use source-code transformation tool based on Automatic Differentiation (AD) for analysing approximation errors in HPC applications. CHEF-FP uses Clad, an efficient AD tool built as a plugin to the Clang compiler and based on the LLVM compiler infrastructure, as a backend and utilizes its AD abilities to evaluate approximation errors in C++ code. CHEF-FP works at the source level by injecting error estimation code into the generated adjoints. This enables the error-estimation code to undergo compiler optimizations resulting in improved analysis time and reduced memory usage. We also provide theoretical and architectural augmentations to source code transformation-based AD tools to perform FP error analysis. In this paper, we primarily focus on analyzing errors introduced by mixed-precision AC techniques, the most popular approximate technique in HPC. We also show the applicability of our tool in estimating other kinds of errors by evaluating our tool on codes that use approximate functions. Moreover, we demonstrate the speedups achieved by CHEF-FP during analysis time as compared to the existing state-of-the-art tool as a result of its ability to generate and insert approximation error estimate code directly into the derivative source. The generated code also becomes a candidate for better compiler optimizations contributing to lesser runtime performance overhead.

Baidyanath Kundu, Vassil Vassilev, Wim Lavrijsen, Efficient and Accurate Automatic Python Bindings with cppyy & Cling (2023).

The simplicity of Python and the power of C++ force stark choices on a scientific software stack. There have been multiple developments to mitigate language boundaries by implementing language bindings, but the impedance mismatch between the static nature of C++ and the dynamic one of Python hinders their implementation; examples include the use of user-defined Python types with templated C++ and advanced memory management. The development of the C++ interpreter Cling has changed the way we can think of language bindings as it provides an incremental compilation infrastructure available at runtime. That is, Python can interrogate C++ on demand, and bindings can be lazily constructed at runtime. This automatic binding provision requires no direct support from library authors and offers better performance than alternative solutions, such as PyBind11. ROOT pioneered this approach with PyROOT, which was later enhanced with its successor, cppyy. However, until now, cppyy relied on the reflection layer of ROOT, which is limited in terms of provided features and performance. This paper presents the next step for language interoperability with cppyy, enabling research into uniform cross-language execution environments and boosting optimization opportunities across language boundaries. We illustrate the use of advanced C++ in Numba-accelerated Python through cppyy. We outline a path forward for re-engineering parts of cppyy to use upstream LLVM components to improve performance and sustainability. We demonstrate cppyy purely based on a C++ reflection library, InterOp, which offers interoperability primitives based on Cling and Clang-Repl.

Garima Singh, Jonas Rembser, Lorenzo Moneta, David Lange, et. al., Automatic Differentiation of Binned Likelihoods With Roofit and Clad (2023).

RooFit is a toolkit for statistical modeling and fitting used by most experiments in particle physics. Just as data sets from next-generation experiments grow, processing requirements for physics analysis become more computationally demanding, necessitating performance optimizations for RooFit. One possibility to speed-up minimization and add stability is the use of Automatic Differentiation (AD). Unlike for numerical differentiation, the computation cost scales linearly with the number of parameters, making AD particularly appealing for statistical models with many parameters. In this paper, we report on one possible way to implement AD in RooFit. Our approach is to add a facility to generate C++ code for a full RooFit model automatically. Unlike the original RooFit model, this generated code is free of virtual function calls and other RooFit-specific overhead. In particular, this code is then used to produce the gradient automatically with Clad. Clad is a source transformation AD tool implemented as a plugin to the clang compiler, which automatically generates the derivative code for input C++ functions. We show results demonstrating the improvements observed when applying this code generation strategy to HistFactory and other commonly used RooFit models.

Ioana Ifrim, Vassil Vassilev, David J Lange, GPU Accelerated Automatic Differentiation With Clad arXiv preprint arXiv:2203.06139 (2022).

Automatic Differentiation (AD) is instrumental for science and industry. It is a tool to evaluate the derivative of a function specified through a computer program. The range of AD application domain spans from Machine Learning to Robotics to High Energy Physics. Computing gradients with the help of AD is guaranteed to be more precise than the numerical alternative and have a low, constant factor more arithmetical operations compared to the original function. Moreover, AD applications to domain problems typically are computationally bound. They are often limited by the computational requirements of high-dimensional parameters and thus can benefit from parallel implementations on graphics processing units (GPUs). Clad aims to enable differential analysis for C/C++ and CUDA and is a compiler-assisted AD tool available both as a compiler extension and in ROOT. Moreover, Clad works as a plugin extending the Clang compiler; as a plugin extending the interactive interpreter Cling; and as a Jupyter kernel extension based on xeus-cling. We demonstrate the advantages of parallel gradient computations on GPUs with Clad. We explain how to bring forth a new layer of optimization and a proportional speed up by extending Clad to support CUDA. The gradients of well-behaved C++ functions can be automatically executed on a GPU. The library can be easily integrated into existing frameworks or used interactively. Furthermore, we demonstrate the achieved application performance improvements, including (~10x) in ROOT histogram fitting and corresponding performance gains from offloading to GPUs.

Marco Foco, Max Rietmann, Vassil Vassilev, Michael Wong, et. al., P2072R0: Differentiable programming for C++ (2020).

Derivatives are vital a wide variety of computing applications, including numerical optimization, solution of nonlinear equations, sensitivity analysis, and nonlinear inverse problems. Virtually every process could be described with a mathematical function. A mathematical function can be thought of as an association between elements from different sets. Derivatives can track how a varying quantity depends on another quantity, for example how the position of a planet as the time varies. Derivatives and gradients allows us to explore the properties of a function and thus the described process as a whole. Also, gradients are an essential component in gradient-based optimization methods, that have become more and more important in recent years, in particular with its application training of (Deep) Neural Networks. Derivatives can be computed numerically, but unfortunately the finite differences methods are problematic due to the approximation operated and the finite precision of floating point values used, and so the implementation of the method faces precision and round off problems which can affect the overall precision of the computation. This problem becomes worse with higher order derivatives.

Javier López-Gómez, Javier Fernández, David del Rio Astorga, Vassil Vassilev, et. al., Relaxing the one definition rule in interpreted C++ (2020).

Most implementations of the C++ programming language generate binary executable code. However, interpreted execution of C++ sources has its own use cases as the Cling interpreter from CERN’s ROOT project has shown. Some limitations are derived from the ODR (One Definition Rule) that rules out multiple definitions of entities within a single translation unit (TU). ODR is there to ensure uniform view of a given C++ entity across translation units. Ensuring uniform view of C++ entities helps when producing ABI compatible binaries. Interpreting C++ presumes a single ever-growing translation unit that define away some of the ODR use-cases. Therefore, it may well be desirable to relax the ODR and, consequently, to support the ability of developers to override any existing definition for a given declaration. This approach is especially well-suited for iterative prototyping. In this paper, we extend Cling, a Clang/LLVM …

Martin Vassilev, Vassil Vassilev, Alexander Penev, IDD – A Platform Enabling Differential Debugging Cybernetics and Information Technologies 20 53-67 (2020).

Debugging is a very time consuming task which is not well supported by existing tools. The existing methods do not provide tools enabling optimal developers’ productivity when debugging regressions in complex systems. In this paper we describe a possible solution aiding differential debugging. The differential debugging technique performs analysis of the regressed system and identifying the cause of the unexpected behavior by comparing to a previous version of the same system. The prototype, idd, inspects two versions of the executable – a baseline and a regressed version. The interactive debugging session runs side by side both executables and allows to examine and to compare various internal states. The architecture can work with multiple information sources comparing data from different tools. We also show how idd can detect performance regressions using information from third-party performance facilities. We illustrate how in practice we can quickly discover regressions in large systems such as the clang compiler.

Yuka Takahashi, Oksana Shadura, Vassil Vassilev, Migrating large codebases to C++ Modules Journal of Physics: Conference Series 1525 012051 (2020).

ROOT has several features which interact with libraries and require implicit header inclusion. This can be triggered by reading or writing data on disk, or user actions at the prompt. Often, the headers are immutable, and reparsing is redundant. C++ Modules are designed to minimize the reparsing of the same header content by providing an efficient on-disk representation of C++ Code. ROOT has released a C++ Modules-aware technology preview, which intends to become the default for the ROOT 6.20 release.

Vassil Vassilev, David Lange, Malik Shahzad Muzzafar, Mircho Rodozov, et. al., C++ Modules in ROOT and Beyond arXiv preprint arXiv:2004.06507 (2020).

C++ Modules come in C++ 20 to fix the long-standing build scalability problems in the language. They provide an io-efficient, on-disk representation capable to reduce build times and peak memory usage. ROOT employs the C++ modules technology further in the ROOT dictionary system to improve its performance and reduce the memory footprint.ROOT with C++ Modules was released as a technology preview in fall 2018, after intensive development during the last few years. The current state is ready for production, however, there is still room for performance optimizations. In this talk, we show the roadmap for making the technology default in ROOT. We demonstrate a global module indexing optimization which allows reducing the memory footprint dramatically for many workflows. We will report user feedback on the migration to ROOT with C++ Modules.

Kim Albertsson Brann, Sitong An, Vassil Vassilev, Enric Tejedor Saavedra, et. al., arXiv: Software Challenges For HL-LHC Data Analysis (2020).

The high energy physics community is discussing where investment is needed to prepare software for the HL-LHC and its unprecedented challenges. The ROOT project is one of the central software players in high energy physics since decades. From its experience and expectations, the ROOT team has distilled a comprehensive set of areas that should see research and development in the context of data analysis software, for making best use of HL-LHC’s physics potential. This work shows what these areas could be, why the ROOT team believes investing in them is needed, which gains are expected, and where related work is ongoing. It can serve as an indication for future research proposals and cooperations.

Vassil Vassilev, Aleksandr Efremov, Oksana Shadura, Automatic Differentiation in ROOT arXiv preprint arXiv:2004.04435 (2020).

In mathematics and computer algebra, automatic differentiation (AD) is a set of techniques to evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.), elementary functions (exp, log, sin, cos, etc.) and control flow statements. AD takes source code of a function as input and produces source code of the derived function. By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program.This paper presents AD techniques available in ROOT, supported by Cling, to produce derivatives of arbitrary C/C++ functions through implementing source code transformation and employing the chain rule of differential calculus in both forward mode and reverse mode. We explain its current integration for gradient computation in TFormula. We demonstrate the correctness and performance improvements in ROOT’s fitting algorithms.

Oksana Shadura, Brian Paul Bockelman, Vassil Vassilev, Extending ROOT through Modules EPJ Web of Conferences 214 05011 (2019).

The ROOT software framework is foundational for the HEP ecosystem, providing multiple capabilities such as I/O, a C++ interpreter, GUI, and math libraries. It uses object-oriented concepts and build-time components to layer between them. We believe that a new layering formalism will benefit the ROOT user community. We present the modularization strategy for ROOT which aims to build upon the existing source components, making available the dependencies and other metadata outside of the build system, and allow post-install additions on top of existing installation as well as in the ROOT runtime environment. Components can be grouped into packages and made available from repositories in order to provide a post-install step of missing packages. This feature implements a mechanism for the more comprehensive software ecosystem and makes it available even from a minimal ROOT installation. As part of …

Oksana Shadura, Vassil Vassilev, Brian Paul Bockelman, Continuous Performance Benchmarking Framework for ROOT EPJ Web of Conferences 214 05003 (2019).

Foundational software libraries such as ROOT are under intense pressure to avoid software regression, including performance regressions. Continuous performance benchmarking, as a part of continuous integration and other code quality testing, is an industry best-practice to understand how the performance of a software product evolves. We present a framework, built from industry best practices and tools, to help to understand ROOT code performance and monitor the efficiency of the code for several processor architectures. It additionally allows historical performance measurements for ROOT I/O, vectorization and parallelization sub-systems.

Y Takahashi, V Vassilev, O Shadura, R Isemann, Optimizing Frameworks’ Performance Using C++ Modules Aware ROOT EPJ Web of Conferences 214 02011 (2019).

ROOT is a data analysis framework broadly used in and outside of High Energy Physics (HEP). Since HEP software frameworks always strive for performance improvements, ROOT was extended with experimental support of runtime C++ Modules. C++ Modules are designed to improve the performance of C++ code parsing. C++ Modules offers a promising way to improve ROOT’s runtime performance by saving the C++ header parsing time which happens during ROOT runtime. This paper presents the results and challenges of integrating C++ Modules into ROOT.

V Vassilev, Optimizing ROOT’s Performance Using C++ Modules Journal of Physics: Conference Series 898 1742-6596 (2016).

ROOT comes with a C++ compliant interpreter cling. Cling needs to understand the content of the libraries in order to interact with them. Exposing the full shared library descriptors to the interpreter at runtime translates into increased memory footprint. ROOT’s exploratory programming concepts allow implicit and explicit runtime shared library loading. It requires the interpreter to load the library descriptor. Re-parsing of descriptors’ content has a noticeable effect on the runtime performance. Present state-of-art lazy parsing technique brings the runtime performance to reasonable levels but proves to be fragile and can introduce correctness issues. An elegant solution is to load information from the descriptor lazily and in a non-recursive way.The LLVM community advances its C++ Modules technology providing an io-efficient, on-disk representation capable to reduce build times and peak memory usage. The feature …

V Vassilev, Native Language Integrated Queries with CppLINQ in C++ Journal of Physics: Conference Series 608 012030 (2015).

Programming language evolution brought to us the domain-specific languages (DSL). They proved to be very useful for expressing specific concepts, turning into a vital ingredient even for general-purpose frameworks. Supporting declarative DSLs (such as SQL) in imperative languages (such as C++) can happen in the manner of language integrated query (LINQ).

V Vassilev, M Vassilev, A Penev, L Moneta, et. al., Clad—automatic differentiation using Clang and LLVM Journal of Physics: Conference Series 608 012055 (2015).

Differentiation is ubiquitous in high energy physics, for instance in minimization algorithms and statistical analysis, in detector alignment and calibration, and in theory. Automatic differentiation (AD) avoids well-known limitations in round-offs and speed, which symbolic and numerical differentiation suffer from, by transforming the source code of functions. We will present how AD can be used to compute the gradient of multi-variate functions and functor objects. We will explain approaches to implement an AD tool. We will show how LLVM, Clang and Cling (ROOT’s C++ 11 interpreter) simplifies creation of such a tool. We describe how the tool could be integrated within any framework. We will demonstrate a simple proof-of-concept prototype, called Clad, which is able to generate n-th order derivatives of C++ functions and other language constructs. We also demonstrate how Clad can offload laborious computations …

Violeta Ilieva, Vassil Vassilev, clad - Automatic Differentiation with Clang (2013).

Differentiation is the process of finding a derivative, which measures how a function changes as its input changes. Derivatives can be evaluated with machine precision accuracy through a method called automatic differentiation. Unlike other methods for differentiation, including numerical and symbolic, automatic differentiation yields exact derivatives even of complicated functions at relatively low processing and storage costs.

V Vassilev, Ph Canal, A Naumann, P Russo, Cling–the new interactive interpreter for ROOT 6 (2012).

Cling is an interactive C++ interpreter, built on top of Clang and LLVM compiler infrastructure. Like its predecessor Cint, Cling realizes the read-print-evaluate-loop concept, in order to leverage rapid application development. Implemented as a small extension to LLVM and Clang, the interpreter reuses their strengths such as the praised concise and expressive compiler diagnostics. We show how to match the interpreter concept to the compiler library and generalize common set of requirements for building up an interactive interpreter. We reason the design and implementation decisions as solution to the challenge of implementing interpreter behaviour as an extension of the compiler library. We present the new features, eg how C++11 will come to Cling and how CINT-specific extensions are being adopted. We clarify the state of integration in the ROOT framework and the induced change set. We explain how ROOT …

Related Publications