As we reach the limit of Moore’s Law, researchers are exploring different paradigms to achieve unprecedented performance. Approximate Computing (AC), which relies on the ability of applications to tolerate some error in the results to trade-off accuracy for performance, has shown significant promise. Despite the success of AC in domains such as Machine Learning, its acceptance in High-Performance Computing (HPC) is limited due to its stringent requirement of accuracy. We need tools and techniques to identify regions of the code that are amenable to approximations and their impact on the application output quality so as to guide developers to employ selective approximation. To this end, we propose CHEF-FP, a flexible, scalable, and easy-to-use source-code transformation tool based on Automatic Differentiation (AD) for analysing approximation errors in HPC applications.
CHEF-FP uses Clad, an efficient AD tool built as a plugin to the Clang compiler and based on the LLVM compiler infrastructure, as a backend and utilizes its AD abilities to evaluate approximation errors in C++ code. CHEF-FP works at the source level by injecting error estimation code into the generated adjoints. This enables the error-estimation code to undergo compiler optimizations resulting in improved analysis time and reduced memory usage. We also provide theoretical and architectural augmentations to source code transformation-based AD tools to perform FP error analysis.
In this talk, we primarily focus on analyzing errors introduced by mixed-precision AC techniques, the most popular approximate technique in HPC. We also show the applicability of our tool in estimating other kinds of errors by evaluating our tool on codes that use approximate functions. Moreover, we demonstrate the speedups achieved by CHEF-FP during analysis time as compared to the existing state-of-the-art tool as a result of its ability to generate and insert approximation error estimate code directly into the derivative source. The generated code also becomes a candidate for better compiler optimizations contributing to lesser runtime performance overhead.
With the growing datasets of current and next-generation High-Energy and Nuclear Physics (HEP/NP) experiments, statistical analysis has become more computationally demanding. These increasing demands elicit improvements and modernizations in existing statistical analysis software. One way to address these issues is to improve parameter estimation performance and numeric stability using automatic differentiation (AD). AD’s computational efficiency and accuracy is superior to the preexisting numerical differentiation techniques and offers significant performance gains when calculating the derivatives of functions with a large number of inputs, making it particularly appealing for statistical models with many parameters. For such models, many HEP/NP experiments use RooFit, a toolkit for statistical modeling and fitting that is part of ROOT.
In this talk, we report on the effort to support the AD of RooFit likelihood functions. Our approach is to extend RooFit with a tool that generates overhead-free C++ code for a full likelihood function built from RooFit functional models. Gradients are then generated using Clad, a compiler-based source-code-transformation AD tool, using this C++ code. We present our results from applying AD to the entire minimization pipeline and profile likelihood calculations of several RooFit and HistFactory models at the LHC-experiment scale. We show significant reductions in calculation time and memory usage for the minimization of such likelihood functions. We also elaborate on this approach’s current limitations and explain our plans for the future.
Over the last decade the C++ programming language has evolved significantly into safer, easier to learn and better supported by tools general purpose programming language capable of extracting the last bit of performance from bare metal. The emergence of technologies such as LLVM and Clang have advanced tooling support for C++ and its ecosystem grew qualitatively. C++ has an important role in the field of scientific computing as the language design principles promote efficiency, reliability and backward compatibility - a vital tripod for any long-lived codebase. Other ecosystems such as Python have prioritized better usability and safety while making some tradeoffs on efficiency and backward compatibility. That has led developers to believe that there is a binary choice between performance and usability.
In this talk we would like to present the advancements in the C++ ecosystem; its relevance for scientific computing and beyond; and foreseen challenges. The talk introduces three major components for data science - interpreted C++; automatic language bindings; and differentiable programming. We outline how these components help Python and C++ ecosystems interoperate making a little compromise on either performance or usability. We elaborate on a future hybrid Python/C++ differentiable programming analysis framework which might accelerate science discovery in HEP by amplifying the power and physics sensitivity of data analyses into end-to-end differentiable pipelines.
RooFit is a toolkit for statistical modeling and fitting used by most experiments in particle physics. Just as data sets from next-generation experiments grow, processing requirements for physics analysis become more computationally demanding, necessitating performance optimizations for RooFit. One possibility to speed-up minimization and add stability is the use of automatic differentiation (AD). Unlike for numerical differentiation, the computation cost scales linearly with the number of parameters, making AD particularly appealing for statistical models with many parameters. In this talk, we report on one possible way to implement AD in RooFit. Our approach is to add a facility to generate C++ code for a full RooFit model automatically. Unlike the original RooFit model, this generated code is free of virtual function calls and other RooFit-specific overhead. In particular, this code is then used to produce the gradient automatically with Clad. Clad is a source transformation AD tool implemented as a plugin to the clang compiler, which automatically generates the derivative code for input C++ functions. We show results demonstrating the improvements observed when applying this code generation strategy to HistFactory and other commonly used RooFit models.
HistFactory is the subcomponent of RooFit that implements binned likelihood models with probability densities based on histogram templates. These models frequently have a very large number of free parameters, and are thus an interesting first target for AD support in RooFit.
The simplicity of Python and the power of C++ provide a hard choice for a scientific software stack. There have been multiple developments to mitigate the hard language boundaries by implementing language bindings. The static nature of C++ and the dynamic nature of Python are problematic for bindings provided by library authors and in particular features such as template instantiations with user-defined types or more advanced memory management.
The development of the C++ interpreter Cling has changed the way we can think of language bindings as it provides an incremental compilation infrastructure available at runtime. That is, Python can interrogate C++ on demand and fetch only the necessary information. This way of automatic binding provision requires no binding support by the library authors and offers better performance than Pybind11. This approach pioneered in ROOT with PyROOT and later was enhanced with its successor Cppyy. However, until now, Cppyy relied on the reflection layer of ROOT which is limited in terms of provided features and performance.
In this talk we show how basing Cppyy purely on Cling yields better correctness, performance and installation simplicity. We illustrate more advanced language interoperability of Numba-accelerated Python code capable of calling C++ functionality via Cppyy. We outline a path forward for integrating the reflection layer in LLVM upstream which will contribute to the project sustainability and will foster greater user adoption. We demonstrate usage of Cppyy through Cling’s LLVM mainline version Clang-Repl
The scientific community using Python has developed several ways to accelerate Python codes. One popular technology is Numba, a Just-in-time (JIT) compiler that translates a subset of Python and NumPy code into fast machine code using LLVM. We have extended Numba’s integration with LLVM’s intermediate representation (IR) to enable the use of C++ kernels and connect them to Numba accelerated codes. Such a multilanguage setup is also commonly used to achieve performance or to interface with external bare-metal libraries. In addition, Numba users will be able to write the performance-critical codes in C++ and use them easily at native speed.
This work relies on high-performance, dynamic, bindings between Python and C++. Cppyy, which is the basis of PyROOT’s interfaces to C++ libraries. Cppyy uses Cling, an incremental C++ interpreter, to generate on-demand bindings of required entities and connect them with the Python interpreter. This environment is uniquely positioned to enable the use of C++ from Numba in a fast and automatic way.
In this talk, we demonstrate using C++ from Numba through Cppyy. We show our approach which extends Cppyy to match the object typing and lowering models of Numba and the necessary additions to the reflection layers to generate IR from Python objects. The uniform LLVM runtime allows optimizations such as inlining which can in the future remove the C++ function call overhead. We discuss other optimizations such as lazily instantiated C++ templates based on input data. The talk also briefly outlines the non-negligible, Numba-introduced JIT overhead and possible ways to optimize it. Since this is built as a Cppyy extension Numba supports all bindings automatically without any user intervention.
Automatic Differentiation is a powerful technique to evaluate the derivative of a function specified by a computer program. Thanks to the ROOT interpreter, Cling, this technique is available in ROOT for computing gradients and Hessian matrices of multi-dimensional functions. We will present the current integration of this tool in the ROOT Mathematical libraries for computing gradients of functions that can then be used in numerical algorithms. For example, we demonstrate the correctness and performance improvements in ROOT’s fitting algorithms. We will show also how gradient and Hessian computation via AD is integrated in the main ROOT minimization algorithm Minuit. We will show also the present plans to integrate the Automatic Differentiation in the RooFit modelling package for obtaining gradients of the full model that can be used for fitting and other statistical studies.
Floating-point errors are a testament to the finite nature of computing and if left uncontrolled they can have catastrophic results. As such, for high-precision computing applications, quantifying these uncertainties becomes imperative. There have been significant efforts to mitigate such errors by either extending the underlying floating-point precision, using alternate compensation algorithms or estimating them using a variety of statistical and non-statistical methods. A prominent method of dynamic floating-point error estimation is using Automatic Differentiation (AD). However, most state-of-the-art AD-based estimation software requires manually adapting or annotating the source code by some amount. Moreover, operator overloading AD based error estimation tools call for multiple gradient recomputations to report errors over a large variety of inputs and suffer from all the shortcomings of the underlying operator overloading strategy such as reduced efficiency. In this work, we propose a customizable way to use AD to synthesize source code for estimating uncertainties arising from floating-point arithmetic in C/C++ applications.
Our work presents an automatic error annotation framework that can be used in conjunction with custom user defined error models. We also present our progress with error estimation on GPU applications.
Automatic Differentiation (AD) is instrumental for science and industry. It is a tool to evaluate the derivative of a function specified through a computer program. The range of AD application domain spans from Machine Learning to Robotics to High Energy Physics. Computing gradients with the help of AD is guaranteed to be more precise than the numerical alternative and have at most a constant factor (4) more arithmetical operations compared to the original function. Moreover, AD applications to domain problems typically are computationally bound. They are often limited by the computational requirements of high-dimensional transformation parameters and thus can greatly benefit from parallel implementations on graphics processing units (GPUs).
Clad aims to enable differentiable analysis for C/C++ and CUDA and is a compiler-assisted AD tool available both as a compiler extension and in ROOT. Moreover, Clad works as a compiler plugin extending the Clang compiler; as a plugin extending the interactive interpreter Cling; and as a Jupyter kernel extension based on xeus-cling.
In this talk, we demonstrate the advantages of parallel gradient computations on graphics processing units (GPUs) with Clad. We explain how to bring forth a new layer of optimisation and a proportional speed up by extending the usage of Clad for CUDA. The gradients of well-behaved C++ functions can be automatically executed on a GPU. Thus, across the spectrum of fields, researchers can reuse their existing models and have workloads scheduled on parallel processors without the need to optimize their computational kernels. The library can be easily integrated into existing frameworks or used interactively, and provides optimal performance. Furthermore, we demonstrate the achieved application performance improvements, including (~10x) in ROOT histogram fitting and corresponding performance gains from offloading to GPUs.
The design of LLVM and Clang enables them to be used as libraries, and has led to the creation of an entire compiler-assisted ecosystem of tools. The relatively friendly codebase of Clang and advancements in the JIT infrastructure in LLVM further enable research into different methods for processing C++ by blurring the boundary between compile time and runtime. Challenges include incremental compilation and fitting compile/link time optimizations into a more dynamic environment.
Incremental compilation pipelines process code chunk-by-chunk by building an ever-growing translation unit. Code is then lowered into the LLVM IR and subsequently run by the LLVM JIT. Such a pipeline allows creation of efficient interpreters. The interpreter enables interactive exploration and makes the C++ language more user friendly. The incremental compilation mode is used by the interactive C++ interpreter, Cling, initially developed to enable interactive high-energy physics analysis in a C++ environment. Cling is now used for interactive development in Jupyter Notebooks (via xeus-cling), dynamic python bindings (via cppyy) and interactive CUDA development.
In this talk we dive into the details of implementing incremental
compilation with Clang. We outline a path forward for
Clang-Repl which is
built with the experience gained in Cling and is now available in mainstream
llvm. We describe how the new Orc JIT infrastructure allows us to naturally
perform static optimizations at runtime, and enables linker voodoo to make
the compiler/interpreter boundaries disappear. We explain the potential of
having a compiler-as-a-service architecture in the context of automatic
language interoperability for Python and beyond.
Clad enables automatic differentiation (AD) for C++ algorithms through source-to-source transformation. It is based on LLVM compiler infrastructure and as a Clang compiler plugin. Different from other tools, Clad manipulates the high-level code representation (the AST) rather than implementing its own C++ parser and does not require modifications to existing code bases. This methodology is both easier to adopt and potentially more performant than other approaches. Having full access to the Clang compiler’s internals means that Clad is able to follow the high-level semantics of algorithms and can perform domain-specific optimisations; automatically generate code (re-targeting C++) on accelerator hardware with appropriate scheduling; and has a direct connection to compiler diagnostics engine and thus producing precise and expressive diagnostics positioned at desired source locations.
In this talk, we showcase the above mentioned advantages through examples and outline Clad’s features, applications and support extensions. We describe the challenges coming from supporting automatic differentiation of broader C++ and present how Clad can compute derivatives of functions, member functions, functors and lambda expressions. We show the newly added support of array differentiation which provides the basis utility for CUDA support and parallelisation of gradient computation. Moreover, we will demo different interactive use-cases of Clad, either within a Jupyter environment as a kernel extension based on xeus-cling or within a gpu-cpu environment where the gradient computation can be accelerated through gpu-code produced by Clad and run with the help of the Cling interpreter.
C++ is used for many numerically intensive applications. A combination of performance and solid backward compatibility has led to its use for many research software codes over the past 20 years. Despite its power, C++ is often seen as difficult to learn and not well suited with rapid application development. The long edit-compile-run cycle is a large impediment to exploration and prototyping during development.
Cling has emerged as a recognized capability that enables interactivity, dynamic interoperability and rapid prototyping capabilities for C++ developers. Cling is an interactive C++ interpreter, built on top of the Clang and LLVM compiler infrastructure. The interpreter enables interactive exploration and makes the C++ language more welcoming for research. Cling supports the full C++ feature set including the use of templates, lambdas, and virtual inheritance.Cling’s field of origin is the field of high energy physics where it facilitates the processing of scientific data. The interpreter was an essential part of the software tools of the LHC experimental program and was part of the software used to detect the gravitational waves of the LIGO experiment. Interactive C++ has proven to be useful for other communities. The Cling ecosystem includes dynamic bindings tools to languages including python, D and Julia (cppyy); C++ in Jupyter Notebooks (xeus-cling); interactive CUDA; and automatic differentiation on the fly (clad).
This talk outlines key properties of interactive C++ such as execution results, entity redefinition, error recovery and code undo. It demonstrates the capability enabled by an interactive C++ platform in the context of data science. For example, the use of eval-style programming, C++ in Jupyter notebooks and CUDA C++. We talk about design and implementation challenges and go beyond just interpreting C++. We showcase template instantiation on demand, language interoperability on demand and bridging compiled and interpreted code. We show how to easily build new capabilities using the Cling infrastructure through developing an automatic differentiation plugin for C++ and CUDA.
Mathematical derivatives are vital components of many computing algorithms including: neural networks, numerical optimization, Bayesian inference, nonlinear equation solvers, physics simulations, sensitivity analysis, and nonlinear inverse problems. Derivatives track the rate of change of an output parameter with respect to an input parameter, such as how much reducing an individuals’ carbon footprint will impact the Earth’s temperature. Derivatives (and generalizations such as gradients, jacobians, hessians, etc) allow us to explore the properties of a function and better describe the underlying process as a whole. In recent years, the use of gradient-based optimizations such as training neural networks have become widespread, leading to many languages making differentiation a first-class citizen.
Derivatives can be computed numerically, but unfortunately the accumulation of floating-point errors and high-computational complexity presents several challenges. These problems become worse with higher order derivatives and more parameters to differentiate.
Many derivative-based algorithms require gradients, or the computation of the derivative of an output parameter with respect to many input parameters. As such, developing techniques for computing gradients that are scalable in the number of input parameters is crucial for the performance of such algorithms. This paper describes a broad set of domains where scalable derivative computations are essential. We make an overview of the major techniques in computing derivatives, and finally, we introduce the flagman of computational differential calculus – algorithmic (also known as automatic) differentiation (AD). AD makes clever use of the ‘nice’ mathematical properties of the chained rule and generative programming to solve the scalability issues by inverting the dependence on the number of input variables to the number of output variables. AD provides mechanisms to augment the regular function computation with instructions calculating its derivatives.
Differentiable programming is a programming paradigm in which the programs can be differentiated throughout, usually via automatic differentiation. This talk introduces the differentiable programming paradigm in terms of C++. It shows its applications in science as applicable for data science and industry. The authors make an overview of the existing tools in the area and the two common implementation approaches – template metaprogramming and custom parsers. We demonstrate implementation pros and cons and propose compiler toolchain-based implementation either in clang AST or LLVM IR. We would like to briefly outline our current efforts in standardization of that feature.
RooFit-AD Plan of Work
G Singh at the Weekly Compiler Research Meetings (1 February 2023) Link to slides
A numba extension for cppyy / PyROOT
B Kundu at the Parallelism, Performance and Programming model meeting (1 September 2022) Slides, Notebook
Automatic Differentiation in C++
V Vassilev and M Foco (NVIDIA) at the Prague 2020 ISO C++ WG21 meeting (10 February 2020) Link to slides
Automatic Differentiation in ROOT
O Shadura at the Computing in High Energy Physics (CHEP 2019, Adeliade) (5 November 2019) Link to slides
Migrating large codebases to C++ Modules
Y Takahashi at the 19th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2019 Saas-Fee) (13 March 2019) Link to slides
Optimizing Frameworks Performance Using C++ Modules-Aware ROOT
Y Takahashi at the Computing in High Energy Physics (CHEP 2018, Sofia) (10 July 2018) Link to poster