Among the well-known methods to approximate derivatives of expectations
computed by Monte Carlo simulations, averages of pathwise derivatives are
often the easiest to apply. Computing them via algorithmic
differentiation typically does not require major manual analysis and
rewriting of the code, even for very complex programs like simulations of
particle-detector interactions in high-energy physics. However, the pathwise
derivative estimator can be biased if there are discontinuities in the
program, which may diminish its value for applications.
This work integrates algorithmic differentiation into the electromagnetic
shower simulation code HepEmShow based on G4HepEm, allowing us to study how
well pathwise derivatives approximate derivatives of energy depositions in a
sampling calorimeter with respect to parameters of the beam and geometry. We
found that when multiple scattering is disabled in the simulation, means of
pathwise derivatives converge quickly to their expected values, and these
are close to the actual derivatives of the energy deposition. Additionally,
we demonstrate the applicability of this novel gradient estimator for
stochastic gradient-based optimization in a model example.
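As a minimal sketch of the estimator discussed above (a toy model, not the HepEmShow setup), the following Python snippet averages per-sample derivatives for two hypothetical integrands of a standard normal variable z. For the smooth integrand (theta + z)^2 the average approaches the true derivative 2*theta. For the step-like integrand, which is 1 when z < theta and 0 otherwise, every sample derivative is zero, so the estimator misses the true derivative, which equals the normal density at theta. All function names here are illustrative assumptions.

```python
import math
import random


def pathwise_mean(sample_derivative, theta, n, seed=0):
    """Average d f(theta, z) / d theta over n standard normal draws z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += sample_derivative(theta, rng.gauss(0.0, 1.0))
    return total / n


def d_smooth(theta, z):
    # f = (theta + z)**2, so E[f] = theta**2 + 1 and dE[f]/dtheta = 2*theta.
    return 2.0 * (theta + z)


def d_step(theta, z):
    # f = 1 if z < theta else 0 is piecewise constant in theta,
    # so its derivative is 0 for almost every sample z.
    return 0.0


theta = 0.5
smooth_estimate = pathwise_mean(d_smooth, theta, 100_000)
step_estimate = pathwise_mean(d_step, theta, 100_000)
# True derivative of E[f] = Phi(theta) for the step integrand: the normal pdf.
true_step_derivative = math.exp(-0.5 * theta**2) / math.sqrt(2.0 * math.pi)

print("smooth:", smooth_estimate, "expected", 2.0 * theta)
print("step:  ", step_estimate, "expected", true_step_derivative)
```

The second case illustrates the kind of bias that discontinuities can introduce: the estimate stays at zero no matter how many samples are drawn.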
Derivative computation is a key component of optimization, sensitivity
analysis, uncertainty quantification, and nonlinear solvers. Automatic
differentiation (AD) is a powerful technique for evaluating such
derivatives, and in recent years, has been integrated into programming
environments such as Jax, PyTorch, and TensorFlow to support derivative
computations needed for training of machine learning models, resulting in
widespread use of these technologies. The C++ language has become the de
facto standard for scientific computing due to numerous factors, yet
language complexity has made the adoption of AD technologies for C++
difficult, hampering the incorporation of powerful differentiable
programming approaches into C++ scientific simulations. This is exacerbated
by the increasing emergence of architectures such as GPUs, which have
limited memory capabilities and require massive thread-level
concurrency. Portable scientific codes rely on domain-specific programming
models such as Kokkos, making AD for such codes even more complex.
In this paper, we will investigate source transformation-based automatic
differentiation using Clad to automatically generate portable and efficient
gradient computations of Kokkos-based code. We discuss the modifications of
Clad required to differentiate Kokkos abstractions. We will illustrate the
feasibility of our proposed strategy by comparing the wall-clock time of the
generated gradient code with the wall-clock time of the input function on
different cutting-edge GPU architectures such as NVIDIA H100, AMD MI250x,
and Intel Ponte Vecchio GPU. For these three architectures and for the
considered example, evaluating up to 10,000 entries of the gradient only
took up to 2.17 times the wall-clock time of evaluating the input function.
As we reach the limit of Moore’s Law, researchers are exploring different
paradigms to achieve unprecedented performance. Approximate Computing (AC),
which relies on the ability of applications to tolerate some error in the
results to trade off accuracy for performance, has shown significant promise.
Despite the success of AC in domains such as Machine Learning, its
acceptance in High-Performance Computing (HPC) is limited due to the
stringent accuracy requirements of HPC. We need tools and techniques to identify
regions of the code that are amenable to approximation, and to assess their impact on
the application output quality, so as to guide developers to employ selective
approximation. To this end, we propose CHEF-FP, a flexible, scalable, and
easy-to-use source-code transformation tool based on Automatic
Differentiation (AD) for analysing approximation errors in HPC applications.
CHEF-FP uses Clad, an efficient AD tool built as a plugin to the Clang
compiler and based on the LLVM compiler infrastructure, as a backend
and utilizes its AD abilities to evaluate approximation errors in C++
code. CHEF-FP works at the source level by injecting error estimation
code into the generated adjoints. This enables the error-estimation code
to undergo compiler optimizations resulting in improved analysis time and
reduced memory usage. We also provide theoretical and architectural
augmentations to source code transformation-based AD tools to perform FP
error analysis. In this paper, we primarily focus on analyzing errors
introduced by mixed-precision AC techniques, the most popular approximation
technique in HPC. We also show the applicability of our tool in estimating
other kinds of errors by evaluating our tool on codes that use approximate
functions. Moreover, we demonstrate the speedups achieved by CHEF-FP during
analysis time as compared to the existing state-of-the-art tool as a result
of its ability to generate and insert approximation error estimate code
directly into the derivative source. The generated code also becomes a
candidate for better compiler optimizations, contributing to lower
runtime performance overhead.
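The idea of using adjoints for error estimation can be illustrated with a first-order Taylor model. The sketch below is an assumption-laden toy and not CHEF-FP's implementation: the function `f`, its hand-written adjoint `f_grad`, and the helper `to_float32` are hypothetical, and only the error from storing the inputs in single precision is modeled (per-operation rounding is ignored). The estimated output error is the sum of |df/dx_i| times the representation error of x_i.

```python
import math
import struct


def to_float32(v):
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def f(x, y):
    return x * y + math.sin(x)


def f_grad(x, y):
    """Hand-written adjoint of f: returns (df/dx, df/dy)."""
    return y + math.cos(x), x


def estimated_demotion_error(x, y):
    """First-order estimate of the output error if x and y are stored as float32."""
    dx, dy = f_grad(x, y)
    return abs(dx) * abs(x - to_float32(x)) + abs(dy) * abs(y - to_float32(y))


x, y = 0.1, 3.7
estimate = estimated_demotion_error(x, y)
actual = abs(f(to_float32(x), to_float32(y)) - f(x, y))

print("estimated error:", estimate)
print("actual error:   ", actual)
```

Because the estimate sums absolute contributions, it bounds the observed error up to second-order terms, which is the property that makes such estimates useful for deciding which variables can be demoted.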
The simplicity of Python and the power of C++ force stark choices on a scientific software
stack. There have been multiple developments to mitigate language boundaries by implementing
language bindings, but the impedance mismatch between the static nature of C++ and the
dynamic one of Python hinders their implementation; examples include the use of user-defined
Python types with templated C++ and advanced memory management.
The development of the C++ interpreter Cling has changed the way we can think of language
bindings as it provides an incremental compilation infrastructure available at runtime.
That is, Python can interrogate C++ on demand, and bindings can be lazily constructed at
runtime. This automatic binding provision requires no direct support from library authors
and offers better performance than alternative solutions, such as PyBind11. ROOT pioneered
this approach with PyROOT, which was later enhanced with its successor, cppyy. However,
until now, cppyy relied on the reflection layer of ROOT, which is limited in terms of
provided features and performance.
This paper presents the next step for language interoperability with cppyy, enabling
research into uniform cross-language execution environments and boosting optimization
opportunities across language boundaries. We illustrate the use of advanced C++ in
Numba-accelerated Python through cppyy. We outline a path forward for re-engineering
parts of cppyy to use upstream LLVM components to improve performance and sustainability.
We demonstrate cppyy purely based on a C++ reflection library, InterOp, which offers
interoperability primitives based on Cling and Clang-Repl.
RooFit is a toolkit for statistical modeling and fitting used by most experiments
in particle physics. As data sets from next-generation experiments grow,
processing requirements for physics analysis become more computationally demanding,
necessitating performance optimizations for RooFit. One possibility to speed up
minimization and add stability is the use of Automatic Differentiation (AD). Unlike
numerical differentiation, whose computation cost scales linearly with the number
of parameters, reverse-mode AD computes the full gradient at a cost that is largely
independent of the parameter count, making AD particularly appealing for statistical
models with many parameters. In this paper, we report on one possible way to implement AD in RooFit.
Our approach is to add a facility to generate C++ code for a full RooFit model
automatically. Unlike the original RooFit model, this generated code is free of virtual
function calls and other RooFit-specific overhead. In particular, this code is then
used to produce the gradient automatically with Clad. Clad is a source transformation
AD tool implemented as a plugin to the clang compiler, which automatically generates
the derivative code for input C++ functions. We show results demonstrating the
improvements observed when applying this code generation strategy to HistFactory and
other commonly used RooFit models.
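To make the cost argument concrete, here is a small Python sketch, unrelated to the RooFit or Clad code base, that contrasts a forward finite-difference gradient (one extra function evaluation per parameter) with a toy reverse-mode AD (one forward pass plus one backward sweep). The `Var` class, the `backward` routine, and the `model` function are hypothetical illustrations using operator overloading, whereas Clad works by source transformation.

```python
class Var:
    """Graph node storing a value, its parents and the local partial derivatives."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, d self / d parent)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate adjoints from the output to every input in one reverse sweep."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before their consumers

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


def model(params):
    """Toy objective: sum of products of cyclically neighbouring parameters."""
    n = len(params)
    total = params[0] * params[1]
    for i in range(1, n):
        total = total + params[i] * params[(i + 1) % n]
    return total


def fd_gradient(func, values, h=1e-6):
    """Forward differences: needs len(values) + 1 function evaluations."""
    base = func(values)
    evaluations = 1
    grad = []
    for i in range(len(values)):
        shifted = list(values)
        shifted[i] += h
        grad.append((func(shifted) - base) / h)
        evaluations += 1
    return grad, evaluations


values = [0.1 * (i + 1) for i in range(8)]
fd_grad, fd_evaluations = fd_gradient(model, values)

params = [Var(v) for v in values]
backward(model(params))  # a single forward pass and a single backward sweep
ad_grad = [p.grad for p in params]
```

For this objective the exact partial derivative with respect to parameter i is the sum of its two cyclic neighbours, which both gradients reproduce; only the finite-difference version needs a number of evaluations that grows with the parameter count.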
Automatic Differentiation (AD) is instrumental for science and industry.
It is a tool to evaluate the derivative of a function specified through a
computer program. AD application domains range from Machine
Learning to Robotics to High Energy Physics. Computing gradients with
the help of AD is more precise than the numerical
alternative and requires only a small constant factor more arithmetic operations
than the original function. Moreover, AD applications to domain
problems typically are compute-bound. They are often limited by
the computational requirements of high-dimensional parameter spaces and thus can
benefit from parallel implementations on graphics processing units (GPUs).
Clad aims to enable differential analysis for C/C++ and CUDA and is a
compiler-assisted AD tool available both as a compiler extension and in ROOT.
Moreover, Clad works as a plugin extending the Clang compiler; as a plugin
extending the interactive interpreter Cling; and as a Jupyter kernel
extension based on xeus-cling. We demonstrate the advantages of parallel
gradient computations on GPUs with Clad. We explain how to bring forth a new
layer of optimization and a proportional speed up by extending Clad to
support CUDA. The gradients of well-behaved C++ functions can be
automatically executed on a GPU. The library can be easily integrated into
existing frameworks or used interactively. Furthermore, we demonstrate the
achieved application performance improvements, including roughly 10x in ROOT
histogram fitting, and corresponding performance gains from offloading to GPUs.
Derivatives are vital for a wide variety of computing applications, including
numerical optimization, solution of nonlinear equations, sensitivity
analysis, and nonlinear inverse problems. Virtually every process can be
described with a mathematical function, which can be
thought of as an association between elements of different sets.
Derivatives track how a varying quantity depends on another quantity,
for example how the position of a planet changes as time varies. Derivatives and
gradients allow us to explore the properties of a function and thus the
described process as a whole. Gradients are also an essential component of
gradient-based optimization methods, which have become more and more important
in recent years, in particular through their application to the training of (deep)
neural networks. Derivatives can be computed numerically, but finite
difference methods are problematic: they combine the truncation error of the
approximation with the finite precision of floating-point values, so the
implementation faces precision and round-off problems that
can affect the overall precision of the computation. This problem becomes
worse with higher-order derivatives.
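The trade-off between truncation and round-off error can be seen in a short, self-contained Python experiment. It is an illustrative sketch with arbitrarily chosen step sizes: the derivative of sin at x = 1 is approximated by forward differences, and the second derivative by a central second difference.

```python
import math


def forward_difference(f, x, h):
    """First-derivative approximation with truncation error of order h."""
    return (f(x + h) - f(x)) / h


def second_difference(f, x, h):
    """Second-derivative approximation; the division by h*h amplifies round-off."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


x = 1.0
exact_first = math.cos(x)
exact_second = -math.sin(x)

first_errors = {
    h: abs(forward_difference(math.sin, x, h) - exact_first)
    for h in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15)
}
second_error = abs(second_difference(math.sin, x, 1e-8) - exact_second)

for h, err in first_errors.items():
    print(f"h = {h:.0e}  first-derivative error = {err:.3e}")
print(f"h = 1e-08  second-derivative error = {second_error:.3e}")
```

Large steps suffer from truncation error and very small steps from cancellation, so the error is smallest at an intermediate step size. The same step that works well for the first derivative gives a poor second derivative, which is the degradation with derivative order mentioned above.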
Most implementations of the C++ programming language generate binary
executable code. However, interpreted execution of C++ sources has its own
use cases, as the Cling interpreter from CERN’s ROOT project has shown. Some
limitations are derived from the ODR (One Definition Rule), which rules out
multiple definitions of entities within a single translation unit (TU). ODR
is there to ensure a uniform view of a given C++ entity across translation
units. Ensuring a uniform view of C++ entities helps when producing ABI-compatible
binaries. Interpreting C++ presumes a single ever-growing
translation unit that defines away some of the ODR use cases. Therefore, it
may well be desirable to relax the ODR and, consequently, to support the
ability of developers to override any existing definition for a given
declaration. This approach is especially well-suited for iterative
prototyping. In this paper, we extend Cling, a Clang/LLVM …
Debugging is a very time-consuming task that is not well supported by
existing tools. The existing methods do not provide tools that enable optimal
developer productivity when debugging regressions in complex systems. In
this paper we describe a possible solution aiding differential debugging.
The differential debugging technique analyzes the regressed
system and identifies the cause of the unexpected behavior by comparing it to
a previous version of the same system. The prototype, idd, inspects two
versions of the executable: a baseline and a regressed version. The
interactive debugging session runs both executables side by side and allows
the user to examine and compare various internal states. The architecture can work
with multiple information sources, comparing data from different tools. We
also show how idd can detect performance regressions using information from
third-party performance facilities. We illustrate how, in practice, we can
quickly discover regressions in large systems such as the clang compiler.
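The core comparison step of differential debugging can be sketched in a few lines of Python. This is a hypothetical toy and does not reflect how idd is implemented or used: two versions of a routine record their internal state after each step, and a helper reports the first point at which the recorded states diverge.

```python
def running_sum_baseline(values):
    """Baseline version: records the running total after every step."""
    trace, total = [], 0
    for v in values:
        total += v
        trace.append(("total", total))
    return trace


def running_sum_regressed(values):
    """Regressed version with a deliberately injected (hypothetical) bug."""
    trace, total = [], 0
    for v in values:
        total += abs(v)  # the regression: sign information is lost
        trace.append(("total", total))
    return trace


def first_divergence(baseline_states, regressed_states):
    """Return (step, baseline_state, regressed_state) at the first mismatch, else None."""
    for step, (b, r) in enumerate(zip(baseline_states, regressed_states)):
        if b != r:
            return step, b, r
    return None


data = [3, 1, -2, 5]
divergence = first_divergence(running_sum_baseline(data),
                              running_sum_regressed(data))
print(divergence)
```

Pinpointing the first diverging state narrows the search to the step that consumed the negative input, which is where the two versions start to behave differently.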
ROOT has several features which interact with libraries and require implicit
header inclusion. This can be triggered by reading or writing data on disk,
or user actions at the prompt. Often, the headers are immutable, and
reparsing is redundant. C++ Modules are designed to minimize the reparsing
of the same header content by providing an efficient on-disk representation
of C++ code. ROOT has released a C++ Modules-aware technology preview, which
is intended to become the default for the ROOT 6.20 release.
Vassil Vassilev, David Lange, Malik Shahzad Muzzafar, Mircho Rodozov, et al., C++ Modules in ROOT and Beyond, arXiv preprint arXiv:2004.06507 (2020).
C++ Modules come in C++20 to fix the long-standing build scalability
problems in the language. They provide an I/O-efficient, on-disk
representation capable of reducing build times and peak memory usage. ROOT
employs the C++ Modules technology further in the ROOT dictionary system to
improve its performance and reduce the memory footprint. ROOT with C++ Modules
was released as a technology preview in fall 2018, after intensive
development during the last few years. The current state is ready for
production; however, there is still room for performance optimizations. In
this talk, we show the roadmap for making the technology default in ROOT.
We demonstrate a global module indexing optimization which allows reducing
the memory footprint dramatically for many workflows. We will report user
feedback on the migration to ROOT with C++ Modules.
The high energy physics community is discussing where investment is needed
to prepare software for the HL-LHC and its unprecedented challenges. The ROOT
project has been one of the central software players in high energy physics for
decades. From its experience and expectations, the ROOT team has distilled a
comprehensive set of areas that should see research and development in the
context of data analysis software, to make the best use of HL-LHC’s physics
potential. This work shows what these areas could be, why the ROOT team
believes investing in them is needed, which gains are expected, and where
related work is ongoing. It can serve as an indication for future research
proposals and cooperations.
In mathematics and computer algebra, automatic differentiation (AD) is
a set of techniques to evaluate the derivative of a function specified by a
computer program. AD exploits the fact that every computer program, no matter
how complicated, executes a sequence of elementary arithmetic operations
(addition, subtraction, multiplication, division, etc.), elementary functions
(exp, log, sin, cos, etc.) and control flow statements. AD takes source code
of a function as input and produces source code of the derived function. By
applying the chain rule repeatedly to these operations, derivatives of
arbitrary order can be computed automatically, accurately to working
precision, and using at most a small constant factor more arithmetic
operations than the original program. This paper presents AD techniques
available in ROOT, supported by Cling, to produce derivatives of arbitrary
C/C++ functions through implementing source code transformation and
employing the chain rule of differential calculus in both forward mode and
reverse mode. We explain its current integration for gradient computation in
TFormula. We demonstrate the correctness and performance improvements in
ROOT’s fitting algorithms.
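The repeated application of the chain rule described above can be illustrated with forward-mode AD based on dual numbers. The following Python sketch uses operator overloading rather than the source code transformation described in the abstract, and the names `Dual`, `dsin`, and `derivative` are hypothetical. Each value carries its derivative along, and every elementary operation updates both.

```python
import math


class Dual:
    """A value together with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def lift(x):
    """Treat plain numbers as constants, i.e. with zero derivative."""
    return x if isinstance(x, Dual) else Dual(x)


def dsin(x):
    # Chain rule for an elementary function: sin(u)' = cos(u) * u'
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def f(x):
    return x * x * dsin(x) + 3.0 * x


def derivative(func, x0):
    """Seed the input with derivative 1 and read the derivative off the output."""
    return func(Dual(x0, 1.0)).deriv


x0 = 1.2
ad_value = derivative(f, x0)
analytic = 2.0 * x0 * math.sin(x0) + x0 * x0 * math.cos(x0) + 3.0
print(ad_value, analytic)
```

The result agrees with the analytic derivative to working precision, without choosing a step size as finite differences would require.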
Oksana Shadura, Brian Paul Bockelman, Vassil Vassilev, Extending ROOT through Modules, EPJ Web of Conferences 214, 05011 (2019).
The ROOT software framework is foundational for the HEP ecosystem, providing
multiple capabilities such as I/O, a C++ interpreter, GUI, and math
libraries. It uses object-oriented concepts and build-time components to
layer between them. We believe that a new layering formalism will benefit
the ROOT user community. We present the modularization strategy for ROOT
which aims to build upon the existing source components, making available
the dependencies and other metadata outside of the build system, and allow
post-install additions on top of an existing installation as well as in the ROOT
runtime environment. Components can be grouped into packages and made
available from repositories so that missing packages can be installed in a
post-install step. This feature implements a mechanism for a more
comprehensive software ecosystem and makes it available even from a minimal
ROOT installation. As part of …
Foundational software libraries such as ROOT are under intense pressure to
avoid software regression, including performance regressions. Continuous
performance benchmarking, as a part of continuous integration and other code
quality testing, is an industry best practice for understanding how the performance
of a software product evolves. We present a framework, built from industry best
practices and tools, to help understand ROOT code performance and monitor
the efficiency of the code for several processor architectures. It additionally
allows historical performance measurements for ROOT I/O, vectorization and
parallelization sub-systems.
ROOT is a data analysis framework broadly used in and outside of High Energy
Physics (HEP). Since HEP software frameworks always strive for performance
improvements, ROOT was extended with experimental support of runtime C++
Modules. C++ Modules are designed to improve the performance of C++ code
parsing. C++ Modules offer a promising way to improve ROOT’s runtime
performance by saving the C++ header parsing time incurred during ROOT
runtime. This paper presents the results and challenges of integrating C++
Modules into ROOT.
ROOT comes with a C++-compliant interpreter, Cling. Cling needs to understand
the content of the libraries in order to interact with them. Exposing the
full shared library descriptors to the interpreter at runtime translates
into an increased memory footprint. ROOT’s exploratory programming concepts
allow implicit and explicit runtime shared library loading, which requires the
interpreter to load the library descriptor. Re-parsing of descriptors’
content has a noticeable effect on the runtime performance. The present
state-of-the-art lazy parsing technique brings the runtime performance to
reasonable levels but proves to be fragile and can introduce correctness
issues. An elegant solution is to load information from the descriptor lazily
and in a non-recursive way. The LLVM community advances its C++ Modules
technology, providing an I/O-efficient, on-disk representation capable of
reducing build times and peak memory usage. The feature …
The evolution of programming languages brought us domain-specific languages
(DSLs). They have proved to be very useful for expressing specific concepts,
turning into a vital ingredient even for general-purpose frameworks.
Supporting declarative DSLs (such as SQL) in imperative languages (such as
C++) can happen in the manner of language integrated query (LINQ).
Differentiation is ubiquitous in high energy physics, for instance in
minimization algorithms and statistical analysis, in detector alignment and
calibration, and in theory. Automatic differentiation (AD) avoids well-known
limitations in round-offs and speed, which symbolic and numerical
differentiation suffer from, by transforming the source code of functions. We
will present how AD can be used to compute the gradient of multi-variate
functions and functor objects. We will explain approaches to implement an AD
tool. We will show how LLVM, Clang and Cling (ROOT’s C++11 interpreter)
simplify the creation of such a tool. We describe how the tool could be
integrated within any framework. We will demonstrate a simple proof-of-concept
prototype, called Clad, which is able to generate n-th order derivatives of
C++ functions and other language constructs. We also demonstrate how Clad can
offload laborious computations …
Differentiation is the process of finding a derivative, which measures
how a function changes as its input changes. Derivatives can be evaluated
with machine precision accuracy through a method called automatic
differentiation. Unlike other methods for differentiation, including
numerical and symbolic, automatic differentiation yields exact derivatives
even of complicated functions at relatively low processing and storage costs.
Cling is an interactive C++ interpreter, built on top of the Clang and LLVM
compiler infrastructure. Like its predecessor CINT, Cling realizes the
read-eval-print loop (REPL) concept in order to enable rapid application
development. Implemented as a small extension to LLVM and Clang, the
interpreter reuses their strengths, such as the praised concise and expressive
compiler diagnostics. We show how to match the interpreter concept to the
compiler library and generalize a common set of requirements for building
an interactive interpreter. We explain the design and implementation
decisions as solutions to the challenge of implementing interpreter behaviour
as an extension of the compiler library. We present the new features, e.g.
how C++11 will come to Cling and how CINT-specific extensions are being
adopted. We clarify the state of integration in the ROOT framework and the
induced change set. We explain how ROOT …