Open Projects

Clad as First-Class Citizen in LibTorch

#clad#automatic-differentiation#compiler-technology#libtorch#pytorch#machine-learning#root-framework#scientific-computing#performance-optimization#c++#gsoc#gsoc-26

This project will design, implement, benchmark, and integrate a proof-of-concept that uses Clad (compiler-based automatic differentiation) as a first-class gradient engine in LibTorch (the C++ API of PyTorch). The goal is to demonstrate how ROOT users can run high-performance, pure-C++ machine-learning training and inference pipelines, without relying on Python. The project will result in a working prototype that integrates Clad-generated backward routines into LibTorch via torch::autograd::Function or custom ATen operators.

Recent efforts have extended the ROOT framework with modern machine-learning capabilities. In particular, a ROOT Users Workshop 2025 contribution by Meyer-Conde et al. demonstrates the use of LibTorch directly inside ROOT for gravitational-wave data analysis [1]. Their “ROOT+” prototype library augments ROOT with advanced features such as complex tensor arithmetic on CPU/GPU and modern I/O mechanisms (HTTP, Kafka), while relying on LibTorch for ML training and inference. In practice, this enables ROOT to load and execute neural networks (via ONNX or LibTorch) entirely in C++, and to combine them seamlessly with ROOT’s data-processing tools such as RDataFrame and TMVA, all within a single environment.

In parallel, recent work in the Compiler Research community has demonstrated that Clad-generated gradients can match and even outperform PyTorch autograd on CPU when carefully optimized [2]. These results motivate a deeper exploration of compiler-driven automatic differentiation as a backend for machine learning frameworks. Building on both efforts, this project will culminate in a ROOT integration demo (for example, a simplified gravitational-wave analysis workflow) and a reproducible benchmarking suite comparing Clad-based gradients with PyTorch autograd for realistic HEP and GW workloads.

This project is expected to deliver tangible performance and usability benefits for machine-learning workflows in ROOT. By offloading gradient computation to Clad’s compiler-generated routines, meaningful speedups are expected for CPU-bound training workloads; prior results report speedups over PyTorch autograd on CPU [2]. This makes the approach particularly attractive for offline HEP and gravitational-wave analyses, where CPU efficiency is often a limiting factor. In addition, the project will enable fully native C++ machine-learning workflows in ROOT, allowing users to define, train, and evaluate models without Python dependencies and to integrate ML tightly with existing C++ analysis code, ROOT I/O, and data pipelines. The Clad-enhanced LibTorch backend will complement ROOT’s existing ML ecosystem (TMVA, SOFIE, ONNX-based inference, and RDataFrame), providing a flexible “best-of-both-worlds” solution that combines modern deep-learning frameworks with ROOT’s mature analysis infrastructure. Beyond the immediate prototype, this work will establish a foundation for future research on compiler-driven optimizations such as kernel fusion, reduced memory traffic, and eventual GPU acceleration.

Task ideas and expected results

Create a small C++ demo that defines a simple neural network (e.g. an MLP) and uses Clad to generate its derivative functions. Integrate this with LibTorch by wrapping the Clad-generated gradient code as a custom torch::autograd::Function or operator. This follows the strategy outlined in the Clad-PyTorch project. The result is a model that uses LibTorch tensors for the forward pass but Clad-generated code for the backward pass.
Measure training (forward+backward) performance on CPU for representative tasks (e.g. MNIST or a simple GW signal classification). Compare Clad-derived gradients vs PyTorch autograd. Focus on performance: optimize memory layout and avoid dynamic allocations to maximize throughput.
Adapt the working prototype into the ROOT framework. For example, incorporate it into a ROOT macro or plugin so that one can run C++ ML code under root.exe or in PyROOT. Provide examples using ROOT’s data structures (TTrees, RDataFrame) feeding into the Clad-empowered model. Investigate loading pretrained models (via ONNX or TorchScript) and whether Clad can backpropagate through them.
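The wrapping pattern in the first task pairs a forward computation with an externally supplied backward routine. The following is a minimal pure-Python sketch of that contract only: it uses neither LibTorch nor Clad, the names `ExternalGradOp`, `forward`, `backward`, and `finite_difference` are hypothetical, and the hand-written `backward` merely stands in for code that Clad would generate. A central finite difference serves as an independent check of the gradient.

```python
import math


def forward(w, b, x):
    """Tiny 'network': a single scalar neuron y = tanh(w * x + b)."""
    return math.tanh(w * x + b)


def backward(w, b, x, grad_out):
    """Hand-written stand-in for a generated backward routine.

    Returns (dL/dw, dL/db) given grad_out = dL/dy.
    """
    y = math.tanh(w * x + b)
    dz = grad_out * (1.0 - y * y)  # d tanh(z)/dz = 1 - tanh(z)^2
    return dz * x, dz


class ExternalGradOp:
    """Pairs a forward function with an externally supplied backward."""

    def __init__(self, fwd, bwd):
        self.fwd = fwd
        self.bwd = bwd
        self.saved = None

    def __call__(self, *inputs):
        self.saved = inputs  # keep the inputs for the backward pass
        return self.fwd(*inputs)

    def grad(self, grad_out=1.0):
        return self.bwd(*self.saved, grad_out)


def finite_difference(f, args, index, h=1e-6):
    """Central difference of f with respect to args[index]."""
    lo, hi = list(args), list(args)
    lo[index] -= h
    hi[index] += h
    return (f(*hi) - f(*lo)) / (2.0 * h)


inputs = (0.5, -0.1, 2.0)  # w, b, x
op = ExternalGradOp(forward, backward)
y = op(*inputs)
dw, db = op.grad()
```

Comparing `dw` and `db` against `finite_difference(forward, inputs, 0)` and `finite_difference(forward, inputs, 1)` is the same kind of sanity check a benchmark suite could run before timing anything.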

CartopiaX: Enhancing a Next-Generation Platform for Computational Cancer Biology

#computational-biology#computational-oncology#agent-based-simulation#scientific-computing#high-performance-computing#parallel-programming#c++#python#gpu#reproducibility#gsoc#gsoc-26

CartopiaX is an emerging simulation and modeling platform designed to support computational cancer research through large-scale, agent-based biological simulations. The project builds on modern high-performance scientific computing practices and leverages technologies inspired by platforms such as BioDynaMo to model tumor growth, tissue microenvironments, cell-cell interactions, and diffusion of signaling molecules. It reimplements, using BioDynaMo, the agent-based model presented in “In silico study of heterogeneous tumour-derived organoid response to CAR T-cell therapy”.

CartopiaX aims to provide a flexible research environment that enables computational scientists and domain biologists to collaboratively design, execute, and analyze large-scale biological simulations. The platform combines high-performance C++ simulation kernels with user-friendly interfaces and scripting capabilities to enable rapid experimentation and reproducible research workflows.

Currently, CartopiaX provides a performant core simulation engine but still requires improvements in usability, extensibility, and performance portability to support wider adoption in computational oncology and systems biology communities.

This project invites contributors to explore improvements that help integrate, extend, and deploy CartopiaX for real-world research applications. Students are encouraged to propose approaches that enhance developer productivity, accessibility for domain scientists, and computational performance.

Possible areas of exploration

Easy integration: A possible direction focuses on improving the usability of CartopiaX by developing more intuitive ways for researchers to configure and run simulations. Currently, simulations rely heavily on static configuration files and parameter definitions. Students may explore designing graphical or web-based interfaces that allow researchers to interactively define experiments, create structured configuration systems using formats such as YAML or JSON, and develop reusable experiment templates. This direction aims to make CartopiaX more accessible to domain scientists who may not have extensive programming experience while improving reproducibility and workflow management.

Flexibility: A potential direction involves extending CartopiaX through Python integration to support flexible and rapid scientific experimentation. Many researchers in computational biology prefer Python due to its strong ecosystem for data analysis and prototyping. Students may investigate technologies such as cppyy to enable seamless interaction between the high-performance C++ simulation core and Python. This could allow scientists to define cell behaviors, simulation rules, or analysis pipelines directly in Python while preserving the performance advantages of the C++ backend. This area provides opportunities to work on language interoperability and mixed-language scientific workflows.

HPC: A third direction explores improving the performance and scalability of CartopiaX by identifying and optimizing computational bottlenecks within the simulation engine. Agent-based biological simulations frequently involve expensive processes such as diffusion modeling and large-scale cell interaction calculations. Students may explore profiling the simulation engine, investigating GPU acceleration strategies for diffusion solvers or other parallelizable components, and developing benchmarking tools to evaluate performance improvements. This direction is particularly suited for students interested in high-performance computing and parallel programming techniques.
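As an illustration of the structured-configuration idea in the usability direction, here is a small sketch of a validated JSON experiment template. Everything in it is an assumption made for illustration: the `ExperimentConfig` class, its field names, and the template values are hypothetical and do not describe CartopiaX's actual parameters.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical experiment description; field names are illustrative."""

    name: str
    num_tumor_cells: int
    num_car_t_cells: int
    time_step_minutes: float
    total_minutes: float
    seed: int = 0

    def __post_init__(self):
        # Reject obviously invalid experiments before any simulation starts.
        if self.num_tumor_cells < 0 or self.num_car_t_cells < 0:
            raise ValueError("cell counts must be non-negative")
        if self.time_step_minutes <= 0 or self.total_minutes <= 0:
            raise ValueError("time values must be positive")

    @property
    def num_steps(self):
        return int(round(self.total_minutes / self.time_step_minutes))


def load_config(text):
    """Parse a JSON document; unknown keys raise TypeError."""
    return ExperimentConfig(**json.loads(text))


# A reusable template that a GUI or web form could fill in.
TEMPLATE = """
{
  "name": "baseline",
  "num_tumor_cells": 1000,
  "num_car_t_cells": 200,
  "time_step_minutes": 6.0,
  "total_minutes": 1440.0,
  "seed": 42
}
"""

config = load_config(TEMPLATE)
```

Storing the seed and all parameters in one validated document is one simple way to make a run reproducible from its configuration alone.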

Task ideas and expected results

Agent-based Large-Scale Antimatter Simulation

#biodynamo#agent-based-simulation#computational-physics#high-performance-computing#parallel-programming#c++#scientific-simulation#antimatter#containerization#visualization

Description: Deliver a self-contained BioDynaMo module and research prototype that enables validated, reproducible simulations of charged antiparticle ensembles in Penning-trap-like geometries at scales beyond existing demonstrations. The project generalizes prior BioDynaMo Penning-trap work into a reusable, documented, and scalable module suitable for antimatter-motivated studies and other charged-particle systems.

The student will extend BioDynaMo with a focused set of features (pluginized force models, neighbor search tuned for charged particles, elastic runtime hooks, and analysis/visualization pipelines), validate the models on canonical testcases (single-particle motion, small plasma modes), and demonstrate scaling and scientific workflows up to the largest feasible size within available resources.

BioDynaMo already provides an agent/plugin API, parallel execution (OpenMP), and visualization hooks (ParaView/VTK). A prior intern report demonstrates a Penning-trap proof-of-concept and identifies directions for extension (custom forces, multi-scale runs, hierarchical models, CI, containerization) [1].

The high-level goals are:

Engineering

Implement a BioDynaMo plugin module (“AntimatterKernel”) optimized for charged-particle workloads, including SoA-compatible data layouts, spatial decomposition, and an efficient neighbor search.
Enable elastic and reproducible execution via containerized workflows and runtime configuration for local, HPC, or cloud environments.
Provide performance instrumentation and a small, well-documented benchmark suite integrated with BioDynaMo’s tooling.

Physics / Scientific

Implement physics components as BioDynaMo plugins: Penning-trap external fields, Coulomb interactions (pairwise with documented extension points for approximations), stochastic annihilation handling, and basic species support.
Validate against analytic and reference scenarios (single-particle trapping, basic plasma oscillation modes), with clearly stated assumptions and limits.
Perform a limited parameter sweep (e.g. density, magnetic field, trap voltage) at increasing scale to explore collective behavior observable within accessible regimes.
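For the single-particle validation case, the textbook frequencies of an ideal quadrupole Penning trap give an analytic reference. The sketch below computes the free cyclotron, axial, modified cyclotron, and magnetron angular frequencies and relies on the identity ω₊² + ω₋² + ω_z² = ω_c², which holds for the ideal trap. The function name `trap_frequencies` and the example field, voltage, and trap-size values are assumptions chosen for illustration, not parameters of the prior BioDynaMo work.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # C (exact SI value)
PROTON_MASS = 1.67262192369e-27      # kg (CODATA 2018); used for the antiproton


def trap_frequencies(charge, mass, b_field, trap_voltage, trap_size):
    """Angular frequencies (rad/s) of one particle in an ideal Penning trap.

    trap_size is the characteristic trap dimension d of the ideal
    quadrupole potential; trap_voltage is the applied potential V0.
    """
    omega_c = abs(charge) * b_field / mass
    omega_z = math.sqrt(abs(charge) * trap_voltage / (mass * trap_size ** 2))
    discriminant = omega_c ** 2 - 2.0 * omega_z ** 2
    if discriminant <= 0.0:
        raise ValueError("no radial confinement: need omega_c^2 > 2 omega_z^2")
    root = math.sqrt(discriminant)
    omega_plus = 0.5 * (omega_c + root)   # modified cyclotron motion
    omega_minus = 0.5 * (omega_c - root)  # magnetron motion
    return omega_c, omega_z, omega_plus, omega_minus


# Illustrative antiproton example: 1 T field, 10 V, 5 mm trap dimension.
omega_c, omega_z, omega_plus, omega_minus = trap_frequencies(
    charge=-ELEMENTARY_CHARGE,
    mass=PROTON_MASS,
    b_field=1.0,
    trap_voltage=10.0,
    trap_size=5.0e-3,
)

# Identity that a simulated trajectory's fitted frequencies should satisfy.
invariance_lhs = omega_plus ** 2 + omega_minus ** 2 + omega_z ** 2
```

A simulated single-particle trajectory could be checked by extracting its three oscillation frequencies and comparing them with these values within a stated tolerance.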

Task ideas and expected results

A BioDynaMo plugin/module implementing charged-particle dynamics suitable for antimatter-motivated simulations.
A set of validated physics testcases reproducing canonical scenarios, with documented assumptions and limitations.
A scalable and reproducible simulation workflow, including performance instrumentation and example benchmark configurations.
Elastic execution artifacts (containers and run scripts) enabling consistent execution across local, HPC, and cloud systems.
Analysis and visualization pipelines producing scientifically meaningful observables (e.g. density profiles, energy spectra, annihilation maps).
A public open-source release with documentation and a short technical report or draft publication suitable for a workshop or conference.

Enable GPU support and Python Interoperability via a Plugin System

#xeus#xeus-cpp#clang#clang-repl#jupyter#gpu#cuda#python#plugins

Xeus-Cpp integrates Clang-Repl with the Xeus protocol via CppInterOp, providing a powerful platform for C++ development within Jupyter Notebooks.

This project aims to introduce a plugin system for magic commands (cell, line, etc.), enabling a more modular and maintainable approach to extend Xeus-Cpp. Traditionally, magic commands introduce additional code and dependencies directly into the Xeus-Cpp kernel, increasing its complexity and maintenance burden. By offloading this functionality to a dedicated plugin library, we can keep the core kernel minimal while ensuring extensibility. This approach allows new magic commands to be developed, packaged, and deployed independently—eliminating the need to rebuild and release Xeus-Cpp for each new addition.

Initial groundwork has already been laid with the Xplugin library, and this project will build upon that foundation. The goal is to clearly define magic command compatibility across different platforms while ensuring seamless integration. A key objective is to reimplement existing features, such as the LLM cell magic and the in-development Python magic, as plugins. This will not only improve modularity within Xeus-Cpp but also enable these features to be used in other Jupyter kernels.

As an extended goal, we aim to develop a new plugin for GPU execution, leveraging CUDA or OpenMP to support high-performance computing workflows within Jupyter.

Task ideas and expected results

Move the currently implemented magics into plugins built on xplugin
Complete the ongoing work on the Python interoperability magic
Implement a test suite for the plugins
Extended: Enable execution on GPU using CUDA or OpenMP
Optional: Extend the magics for the wasm use case (xeus-cpp-lite)
Present the work at the relevant meetings and conferences
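The separation between a minimal kernel and independently packaged magics can be pictured as a registry that the kernel consults before compiling a cell. The sketch below is a language-neutral illustration in Python, not the xplugin or Xeus-Cpp API: `MagicRegistry`, `register_cell_magic`, `dispatch`, and the `count_lines` magic are all hypothetical names.

```python
class MagicRegistry:
    """Maps cell-magic names to handlers supplied by plugins."""

    def __init__(self):
        self._cell_magics = {}

    def register_cell_magic(self, name, handler):
        if name in self._cell_magics:
            raise ValueError(f"cell magic already registered: {name}")
        self._cell_magics[name] = handler

    def dispatch(self, cell_source):
        """Run the matching magic, or return None for ordinary code cells."""
        first_line, _, body = cell_source.partition("\n")
        if not first_line.startswith("%%"):
            return None  # not a magic: the kernel would compile it as C++
        name, _, args = first_line[2:].strip().partition(" ")
        handler = self._cell_magics.get(name)
        if handler is None:
            raise KeyError(f"unknown cell magic: {name}")
        return handler(args.strip(), body)


def count_lines_magic(args, body):
    """Toy plugin: report the label and the number of lines in the cell."""
    return args, len(body.splitlines())


registry = MagicRegistry()
registry.register_cell_magic("count_lines", count_lines_magic)

result = registry.dispatch("%%count_lines demo\nint a = 1;\nint b = 2;")
plain = registry.dispatch("int c = 3;")
```

With this shape, adding a magic only touches the plugin that registers it, which is the property the plugin system is meant to provide.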

Consolidate and advance the GPU infrastructure in Clad

#clad#gpu#cuda#gsoc#gsoc-26

Clad is a Clang-based automatic differentiation (AD) plugin for C++. Over the past years, several efforts have explored GPU support in Clad, including differentiation of CUDA code, partial support for the Thrust API, and prototype integrations with larger applications such as XSBench, LULESH, a tiny raytracer in the Clad repository, and LLM training examples (including work carried out last year). While these efforts demonstrate feasibility, they are fragmented across forks and student branches, are inconsistently tested, and lack reproducible benchmarking.

This project aims to consolidate and strengthen Clad’s GPU infrastructure. The focus is on upstreaming existing work, improving correctness and consistency of CUDA and Thrust support, and integrating Clad with realistic GPU-intensive codebases. A key goal is to establish reliable benchmarks and CI coverage: if current results are already good, they should be documented and validated; if not, the implementation should be optimized further so that Clad is a practical AD solution for real-world GPU applications.

Task ideas and expected results

Recover, reproduce, and upstream past Clad+GPU work, including prior student projects and LLM training prototypes.
Integrate Clad with representative GPU applications such as XSBench, LULESH, and the in-tree tiny raytracer, ensuring correct end-to-end differentiation.
Establish reproducible benchmarks for these codebases and compare results with other AD tools (e.g. Enzyme) where feasible.
Reduce reliance on atomic operations, improve accumulation strategies, and add support for additional GPU primitives and CUDA/Thrust features.
Add unit and integration tests and enable GPU-aware CI to catch correctness and performance regressions.
Improve user-facing documentation and examples for CUDA and Thrust usage.
Present intermediate and final results at relevant project meetings and conferences.

Enable automatic differentiation of OpenMP programs with Clad

#clad#openmp#gsoc#gsoc-26

Clad is an automatic differentiation (AD) Clang plugin for C++. Given the C++ source code of a mathematical function, it can automatically generate C++ code for computing derivatives of the function. Clad is useful in powering statistical analysis and uncertainty assessment applications. OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran.

This project aims to develop infrastructure in Clad to support the differentiation of programs that contain OpenMP primitives.

Task ideas and expected results

Extend the pragma handling support
List the most commonly used OpenMP concurrency primitives and prepare a plan for how they should be handled in both forward and reverse accumulation in Clad
Add support for concurrency primitives in Clad’s forward and reverse mode automatic differentiation.
Add proper tests and documentation.
Present the work at the relevant meetings and conferences.
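One issue such a plan has to address is that a value read by every iteration of a parallel loop turns into an adjoint that every iteration writes to in the reverse pass. A common remedy is to give each worker private adjoint accumulators and combine them in a final reduction. The sketch below illustrates that idea with Python threads; it is not how Clad handles OpenMP, and `loss`, `partial_adjoint`, and `grad_parallel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def loss(w, xs):
    """Forward computation: a parallelizable sum that reads the shared w."""
    return sum(w * x * x for x in xs)


def partial_adjoint(w, chunk, d_loss):
    """Reverse pass over one chunk using private accumulators only."""
    d_w = 0.0
    d_xs = []
    for x in chunk:
        d_w += d_loss * x * x             # contribution to the shared adjoint
        d_xs.append(d_loss * 2.0 * w * x)  # per-element adjoint, no sharing
    return d_w, d_xs


def grad_parallel(w, xs, num_workers=4):
    """Gradient of loss with respect to w and xs, one chunk per worker."""
    size = max(1, (len(xs) + num_workers - 1) // num_workers)
    chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = list(pool.map(lambda c: partial_adjoint(w, c, 1.0), chunks))
    d_w = sum(part[0] for part in parts)  # reduction over private copies
    d_xs = [d for part in parts for d in part[1]]
    return d_w, d_xs


d_w, d_xs = grad_parallel(0.5, [1.0, 2.0, 3.0, 4.0, 5.0])
```

The reduction over private copies plays the role that an OpenMP `reduction` clause would play in generated C++ code, and it avoids atomic updates on the shared adjoint.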

Enhancing LLM Training with Clad for efficient differentiation

#clad#llm#ai#machine-learning#automatic-differentiation#cpp#optimization

This project aims to leverage Clad, an automatic differentiation (AD) plugin for Clang, to optimize large language model (LLM) training primarily in C++. Automatic differentiation is a crucial component of deep learning training, enabling efficient computation of gradients for optimization algorithms such as stochastic gradient descent (SGD). While most modern LLM frameworks rely on Python-based ecosystems, their heavy reliance on interpreted code and dynamic computation graphs can introduce performance bottlenecks. By integrating Clad into C++-based deep learning pipelines, we can enable high-performance differentiation at the compiler level, reducing computational overhead and improving memory efficiency. This will allow developers to build more optimized training workflows without sacrificing flexibility or precision.

Beyond performance improvements, integrating Clad with LLM training in C++ opens new possibilities for deploying AI models in resource-constrained environments, such as embedded systems and HPC clusters, where minimizing memory footprint and maximizing computational efficiency are critical. Additionally, this work will bridge the gap between modern deep learning research and traditional scientific computing by providing a more robust and scalable AD solution for physics-informed machine learning models. By optimizing the differentiation process at the compiler level, this project has the potential to enhance both research and production-level AI applications, aligning with compiler-research.org’s broader goal of advancing computational techniques for scientific discovery.

Task ideas and expected results

Develop a simplified LLM setup in C++
Apply Clad to compute gradients for selected layers and loss functions
Enhance Clad to support it if necessary, and prepare performance benchmarks
Increase the model complexity to cover larger projects such as llama
Iterate on bug fixing and benchmarking
Develop tests to ensure correctness, numerical stability, and efficiency
Document the approach, implementation details, and performance gains
Present progress and findings at relevant meetings and conferences

Integrate Clad in PyTorch and compare the gradient execution times

#clad#pytorch#python#cuda#benchmarking#automatic-differentiation#gpu

PyTorch is a popular machine learning framework that includes its own automatic differentiation engine, while Clad is a Clang plugin for automatic differentiation that performs source-to-source transformation at compile time to generate functions that compute derivatives.

This project aims to integrate Clad-generated functions into PyTorch using its C++ API and expose them to a Python workflow. The goal is to compare the execution times of gradients computed by Clad with those computed by PyTorch’s native autograd system. Special attention will be given to CUDA-enabled gradient computations, as PyTorch also offers GPU acceleration capabilities.

Task ideas and expected results

Incorporate Clad’s API components (such as clad::array and clad::tape) into PyTorch using its C++ API
Pass Clad-generated derivative functions to PyTorch and expose them to Python
Perform benchmarks comparing the execution times and performance of Clad-derived gradients versus PyTorch’s autograd
Automate the integration process
Thoroughly document the integration process and the benchmark results, and identify potential bottlenecks in Clad’s execution
Present the work at the relevant meetings and conferences.

Enable automatic differentiation of C++ STL concurrency primitives in Clad

#clad#cpp#stl#concurrency#multithreading#automatic-differentiation

Clad is an automatic differentiation (AD) Clang plugin for C++. Given the C++ source code of a mathematical function, it can automatically generate C++ code for computing derivatives of the function. This project focuses on enabling automatic differentiation of code that uses C++ concurrency features such as std::thread, std::mutex, atomic operations and more. This will allow users to fully utilize their CPU resources.

Task ideas and expected results

Explore C++ concurrency primitives and prepare a report detailing the associated challenges and the features that can feasibly be supported within the given timeframe.
Add concurrency primitives support in Clad’s forward-mode automatic differentiation.
Add concurrency primitives support in Clad’s reverse-mode automatic differentiation.
Add proper tests and documentation.
Present the work at the relevant meetings and conferences.

Interactive Differential Debugging - Intelligent Auto-Stepping and Tab-Completion

#debugging#idd#gdb#lldb#regression#tooling#ci

Differential debugging is a time-consuming task that is not well supported by existing tools. Existing state-of-the-art tools do not consider a baseline (working) version while debugging regressions in complex systems, which often forces developers to do manually what could be automated.

The differential debugging technique analyzes a regressed system and identifies the cause of unexpected behaviors by comparing it to a previous version of the same system. The idd tool inspects two versions of the executable - a baseline and a regressed version. The interactive debugging session runs both executables side-by-side, allowing the users to inspect and compare various internal states.

This project aims to implement intelligent stepping (debugging) and tab completions of commands. IDD should be able to execute until a stack frame or variable diverges between the two versions of the system, then drop to the debugger. This may be achieved by introducing new IDD-specific commands. IDD should be able to tab complete the underlying GDB/LLDB commands. The contributor is also expected to set up the necessary CI infrastructure to automate the testing process of IDD.
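The core of the auto-stepping idea, running both versions in lockstep until their observable state differs, can be sketched independently of any debugger. In the sketch below each trace is a list of state snapshots; in a real session these would come from GDB or LLDB after each step. The snapshot layout (a dictionary with `frame` and variable values) and the function `first_divergence` are assumptions made for illustration and are not part of IDD.

```python
from itertools import zip_longest


def first_divergence(baseline_trace, regressed_trace):
    """Return (step, baseline_state, regressed_state) at the first mismatch.

    Returns None when both traces are identical. If one trace ends early,
    the missing state is reported as None.
    """
    pairs = zip_longest(baseline_trace, regressed_trace)
    for step, (base_state, regr_state) in enumerate(pairs):
        if base_state != regr_state:
            return step, base_state, regr_state
    return None


# Hypothetical snapshots recorded after each step of the two executables.
baseline = [
    {"frame": "main", "total": 0},
    {"frame": "accumulate", "total": 3},
    {"frame": "accumulate", "total": 6},
]
regressed = [
    {"frame": "main", "total": 0},
    {"frame": "accumulate", "total": 3},
    {"frame": "accumulate", "total": 7},
]

divergence = first_divergence(baseline, regressed)
```

An IDD-specific command built on this idea would stop at the reported step and hand control back to the user in both debugger sessions.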

Task ideas and expected results

Enable stream capture
Enable IDD-specific commands to execute until diverging stack or variable value.
Enable tab completion of commands.
Set up CI infrastructure to automate testing IDD.
Present the work at the relevant meetings and conferences.

Implement CppInterOp API exposing memory, ownership and thread safety information

#cppinterop#cppyy#clang-repl#cling#interoperability#ast#jit

Incremental compilation pipelines process code chunk-by-chunk by building an ever-growing translation unit. Code is then lowered into the LLVM IR and subsequently run by the LLVM JIT. Such a pipeline allows creation of efficient interpreters. The interpreter enables interactive exploration and makes the C++ language more user friendly. The incremental compilation mode is used by the interactive C++ interpreter, Cling, initially developed to enable interactive high-energy physics analysis in a C++ environment.

Clang and LLVM provide access to C++ from other programming languages, but currently only expose the declared public interfaces of such C++ code, even when they have parsed implementation details directly. Both the high-level and the low-level program representations have enough information to capture and expose more of such details to improve language interoperability. Examples include details of memory management, ownership transfer, thread safety, externalized side-effects, etc. For example, if memory is allocated and returned, the caller needs to take ownership; if a function is pure, it can be elided; if a call provides access to a data member, it can be reduced to an address lookup.

The goal of this project is to develop APIs for CppInterOp that are capable of extracting such information from the AST or from JIT-ed code and exposing it, and to use them in cppyy (Python-C++ language bindings) as an exemplar. If time permits, extend the work to persist this information across translation units and use it on code compiled with Clang.

Task ideas and expected results

Collect and categorize possible exposed interop information kinds
Write one or more facilities to extract necessary implementation details
Design a language-independent interface to expose this information
Integrate the work in clang-repl and Cling
Implement and demonstrate its use in cppyy as an exemplar
Present the work at the relevant meetings and conferences.

Implement and improve an efficient, layered tape with prefetching capabilities

#clad#data-structures#performance#memory-management#gpu#hpc

In mathematics and computer algebra, automatic differentiation (AD) is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. Automatic differentiation is an alternative to symbolic differentiation and numerical differentiation (the method of finite differences). Clad is based on Clang, which provides the necessary facilities for code transformation. The AD library can differentiate non-trivial functions, find partial derivatives for trivial cases, and has good unit test coverage.

The most heavily used entity in AD is a stack-like data structure called a tape. Its first-in last-out access pattern, which naturally occurs in the storage of intermediate values for reverse-mode AD, lends itself to asynchronous storage. Asynchronous prefetching of values during the reverse pass allows checkpoints deeper in the stack to be stored furthest away in the memory hierarchy. Checkpointing provides a mechanism to parallelize segments of a function that can be executed on independent cores. Inserting checkpoints in these segments using separate tapes keeps memory local and avoids sharing memory between cores. We will research techniques for local parallelization of the gradient reverse pass, and extend them to achieve better scalability and/or lower constant overheads on CPUs and potentially accelerators. We will evaluate techniques for efficient memory use, such as multi-level checkpointing support. Combining already developed techniques will allow executing gradient segments across different cores or in heterogeneous computing systems. These techniques must be robust and user-friendly, and must minimize required application code and build system changes.

This project aims to improve the efficiency of the Clad tape and generalize it into a tool-agnostic facility that could be used outside of Clad as well.

Task ideas and expected results

Optimize the current tape by avoiding re-allocation on resize in favor of using connected slabs of arrays
Enhance existing benchmarks to demonstrate the efficiency of the new tape
Make the tape thread-safe
Implement a multilayer tape stored in memory and on disk
[Stretch goal] Support CPU-GPU transfer of the tape
[Stretch goal] Add infrastructure to enable checkpointing offload to the new tape
[Stretch goal] Performance benchmarks
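The first task, growing the tape by linking fixed-size slabs instead of reallocating and copying, can be sketched as follows. This is a conceptual illustration, not Clad's tape implementation: the class `SlabTape`, its `slab_size` parameter, and the helper `grad_repeated_square` are hypothetical. The usage example stores intermediate values in a forward pass and pops them in reverse order to accumulate a derivative.

```python
class SlabTape:
    """Stack that grows by appending fixed-size slabs; stored values never move."""

    def __init__(self, slab_size=4):
        self.slab_size = slab_size
        self.slabs = []  # each slab is a preallocated fixed-size list
        self.count = 0   # number of values currently on the tape

    def push(self, value):
        slab_index, offset = divmod(self.count, self.slab_size)
        if slab_index == len(self.slabs):
            # Allocate one new slab; existing slabs are left untouched.
            self.slabs.append([None] * self.slab_size)
        self.slabs[slab_index][offset] = value
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("pop from empty tape")
        self.count -= 1
        slab_index, offset = divmod(self.count, self.slab_size)
        return self.slabs[slab_index][offset]


def grad_repeated_square(x, repeats, tape):
    """Derivative of y = x^(2^repeats), computed in reverse mode via the tape."""
    y = x
    for _ in range(repeats):  # forward pass: record each intermediate value
        tape.push(y)
        y = y * y
    d = 1.0
    for _ in range(repeats):  # reverse pass: values come back last-in first-out
        y_prev = tape.pop()
        d *= 2.0 * y_prev
    return d


tape = SlabTape(slab_size=2)
gradient = grad_repeated_square(1.5, 3, tape)  # d/dx of x^8 at x = 1.5
```

Emptied slabs are kept for reuse in this sketch; a layered tape could instead hand the deepest slabs to a slower storage tier and prefetch them during the reverse pass.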

Enabling CUDA compilation on Cppyy-Numba generated IR

#cppyy#numba#cuda#llvm#ir#gpu#python

Cppyy is an automatic, run-time, Python-C++ bindings generator, for calling C++ from Python and Python from C++. Initial support has been added that allows Cppyy to hook into the high-performance Python compiler Numba, which compiles looped code containing C++ objects/methods/functions defined via Cppyy into fast machine code. Since Numba compiles the code in loops into machine code, it crosses the language barrier just once and avoids large slowdowns accumulating from repeated calls between the two languages. Numba uses its own lightweight version of the LLVM compiler toolkit (llvmlite) that generates an intermediate code representation (LLVM IR), which is also supported by the Clang compiler, itself capable of compiling CUDA C++ code.

The project aims to demonstrate Cppyy’s capability to provide CUDA paradigms to Python users without any compromise in performance. Upon successful completion, a possible proof of concept could look like the following code snippet:

import cppyy
import cppyy.numba_ext
import numba

cppyy.cppdef('''
__global__ void MatrixMul(float* A, float* B, float* out) {
    // kernel logic for matrix multiplication
}
''')

@numba.njit
def run_cuda_mul(A, B, out):
    # Allocate memory for input and output arrays on GPU
    # Define grid and block dimensions
    # Launch the kernel
    MatrixMul[griddim, blockdim](d_A, d_B, d_out)

        

Task ideas and expected results

Add support for declaration and parsing of Cppyy-defined CUDA code on the Numba extension.
Design and develop a CUDA compilation and execution mechanism.
Prepare proper tests and documentation.

Cppyy STL/Eigen - Automatic conversion and plugins for Python based ML-backends

#cppyy#stl#eigen#jax#cutlass#numpy#machine-learning

Cppyy is an automatic, run-time, Python-C++ bindings generator, for calling C++ from Python and Python from C++. Cppyy uses pythonized wrappers of useful classes from libraries like STL and Eigen that allow the user to utilize them on the Python side. Current support covers container types in STL like std::vector, std::map, and std::tuple and the Matrix-based classes in Eigen/Dense. These cppyy objects can be plugged into idiomatic expressions that expect Python builtin-types. This behaviour is achieved by growing Pythonic methods like __len__ while also retaining their C++ methods like size.
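The idea of growing Pythonic methods on top of retained C++ methods can be shown without cppyy itself. In the sketch below, `FakeVector` is a plain Python stand-in for a bound C++ container that exposes `size()` and `at()`, and `pythonize` is a hypothetical helper that adds `__len__` and `__getitem__` forwarding to them; neither name is part of cppyy.

```python
class FakeVector:
    """Stand-in for a bound C++ container exposing size() and at()."""

    def __init__(self, items):
        self._items = list(items)

    def size(self):
        return len(self._items)

    def at(self, index):
        # Bounds-checked access; IndexError also terminates Python iteration.
        if not 0 <= index < len(self._items):
            raise IndexError(index)
        return self._items[index]


def pythonize(cls):
    """Add Python protocol methods that forward to the C++-style ones."""
    if hasattr(cls, "size") and not hasattr(cls, "__len__"):
        cls.__len__ = lambda self: self.size()
    if hasattr(cls, "at") and not hasattr(cls, "__getitem__"):
        cls.__getitem__ = lambda self, index: self.at(index)
    return cls


pythonize(FakeVector)

v = FakeVector([3, 1, 2])
length = len(v)        # works through the added __len__
ordered = sorted(v)    # iteration falls back to __getitem__ with 0, 1, 2, ...
still_cpp = v.size()   # the original C++-style method is retained
```

Because `sorted` only needs the sequence protocol, the same object can be passed to code that expects a Python builtin type.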

Efficient and automatic conversion between C++ and Python is essential for high-performance cross-language support. Such conversion eliminates overheads arising from iterative initialization, such as Eigen’s comma initializer. This opens up new avenues for the utilization of Cppyy’s bindings in tools that perform numerical operations for transformations or optimization.

The on-demand C++ infrastructure wrapped by idiomatic Python enables new techniques in ML tools like JAX/CUTLASS. This project allows the C++ infrastructure to be plugged in, in service of users seeking high-performance library primitives that are unavailable in Python.

Task ideas and expected results

Extend STL support for std::vectors of arbitrary dimensions
Improve the initialization approach for Eigen classes
Develop a streamlined interconversion mechanism between Python builtin-types, numpy.ndarray, and STL/Eigen data structures
Implement experimental plugins that perform basic computational operations in frameworks like JAX
Work on integrating these plugins with toolkits like CUTLASS that utilise the bindings to provide a Python API

Broaden the Scope for the Floating-Point Error Estimation Framework in Clad

#clad#floating-point#numerical-stability#benchmarking#error-estimation

In mathematics and computer algebra, automatic differentiation (AD) is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. Automatic differentiation is an alternative to symbolic differentiation and numerical differentiation (the method of finite differences). Clad is based on Clang, which provides the necessary facilities for code transformation. The AD library can differentiate non-trivial functions, find partial derivatives for trivial cases, and has good unit test coverage.

Clad can also annotate given source code with floating-point error estimation code. This allows Clad to compute floating-point errors in the given function on the fly, to reason about the numerical stability of the function, and to analyze the sensitivity of the variables involved.
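To make the link between derivatives and error estimates concrete, the sketch below uses a first-order model in which demoting an input x to single precision changes the result by at most about |∂f/∂x · x| · ε, with ε the unit roundoff of binary32. This model is an assumption chosen for illustration and is not claimed to be Clad's exact error model; the function names and the hand-written gradient are likewise hypothetical. The estimate is compared with the error actually observed when each input is rounded to float32.

```python
import struct

FLOAT32_UNIT_ROUNDOFF = 2.0 ** -24  # max relative rounding error of binary32


def to_float32(value):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def f(x, y):
    return x * y + x


def grad_f(x, y):
    """Hand-written gradient standing in for an AD-generated one."""
    return (y + 1.0, x)


def estimated_demotion_errors(values, grads):
    """First-order estimate of the output error from demoting each input."""
    return [abs(g * v) * FLOAT32_UNIT_ROUNDOFF for v, g in zip(values, grads)]


def observed_demotion_errors(func, values):
    """Actual output change when one input at a time is rounded to float32."""
    exact = func(*values)
    errors = []
    for i in range(len(values)):
        demoted = list(values)
        demoted[i] = to_float32(demoted[i])
        errors.append(abs(func(*demoted) - exact))
    return errors


point = (0.1, 0.3)
estimates = estimated_demotion_errors(point, grad_f(*point))
observed = observed_demotion_errors(f, point)
```

Inputs whose estimate stays below an application's tolerance are candidates for storage in lower precision, which is the kind of report the example program in this section is expected to produce.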

The idea behind this project is to develop benchmarks and improve the floating-point error estimation framework as necessary. The project should also find compelling real-world use cases for the tool and investigate the possibility of performing lossy compression with it.

On successful completion of the project, the framework should have a sufficiently large set of benchmarks and example usages. Moreover, the framework should be able to run the following code as expected:

#include <iostream>
#include "clad/Differentiator/Differentiator.h"

// Some complicated function made up of doubles.
double someFunc(double F1[], double F2[], double V3[], double COUP1, double COUP2)
{
  double cI = 1;
  double TMP3;
  double TMP4;
  TMP3 = (F1[2] * (F2[4] * (V3[2] + V3[5]) + F2[5] * (V3[3] + cI * (V3[4]))) +
  F1[3] * (F2[4] * (V3[3] - cI * (V3[4])) + F2[5] * (V3[2] - V3[5])));
  TMP4 = (F1[4] * (F2[2] * (V3[2] - V3[5]) - F2[3] * (V3[3] + cI * (V3[4]))) +
  F1[5] * (F2[2] * (-V3[3] + cI * (V3[4])) + F2[3] * (V3[2] + V3[5])));
  return (-1.) * (COUP2 * (+cI * (TMP3) + 2. * cI * (TMP4)) + cI * (TMP3 *
  COUP1));
}

int main() {
  auto df = clad::estimate_error(someFunc);
  // This call should generate a report to decide
  // which variables can be downcast to a float.
  df.execute(args...);
}

        

Task ideas and expected results

The project consists of the following tasks:

Add at least 5 benchmarks and compare the framework’s correctness and performance against them.
Compile at least 3 real-world examples that are complex enough to demonstrate the capabilities of the framework.
Solve any general-purpose issues that come up with Clad during the process.
Prepare demos and carry out development needed for lossy compression.

Improve robustness of dictionary to module lookups in ROOT

#root#cern#cpp-modules#cmssw#dictionary#io

The LHC smashes groups of protons together at close to the speed of light: 40 million times per second and with seven times the energy of the most powerful accelerators built up to now. Many of these will just be glancing blows, but some will be head-on collisions and very energetic. When this happens, some of the energy of the collision is turned into mass, and previously unobserved, short-lived particles, which could give clues about how Nature behaves at a fundamental level, fly out and into the detector. Our work includes the experimental discovery of the Higgs boson, which led to the award of a Nobel Prize for the underlying theory that predicted the Higgs boson as an important piece of the Standard Model of particle physics.

CMS is a particle detector that is designed to see a wide range of particles and phenomena produced in high-energy collisions in the LHC. Like a cylindrical onion, different layers of detectors measure the different particles, and use this key data to build up a picture of events at the heart of the collision. CMSSW is a collection of software for the CMS experiment. It is responsible for the collection and processing of information about the particle collisions at the detector. CMSSW uses the ROOT framework to provide support for data storage and processing. ROOT relies on Cling, Clang and LLVM to automatically build an efficient I/O representation of the necessary C++ objects. The I/O properties of each object are described in a compilable C++ file called a dictionary. ROOT’s I/O dictionary system relies on C++ modules to reduce its overall memory footprint.

The few runtime failures in the modules integration builds of CMSSW are due to dictionaries that cannot be found in the modules system. These dictionaries are present, as the mainstream system is able to find them using a broader search. The modules setup in ROOT needs to be extended with a dictionary extension that tracks dictionary<->module mappings for C++ entities that introduce synonyms rather than declarations (for example, using MyVector = std::vector<A<B>>, where the dictionaries of A and B are elsewhere).

Task ideas and expected results

The project consists of the following tasks:

For an alias declaration of the form using MyVector = std::vector<A<B>>, we should store its ODRHash in the respective dictionary file as a number attached to a special variable that can be retrieved at symbol scanning time.
Track down the test failures of CMSSW and check if the proposed implementation works.
Develop tutorials and documentation.
Present the work at the relevant meetings and conferences.

Enhance the incremental compilation error recovery in clang and clang-repl

#clang#clang-repl#incremental-compilation#error-recovery#jit

The Clang compiler is part of the LLVM compiler infrastructure and supports various languages such as C, C++, ObjC and ObjC++. The design of LLVM and Clang enables them to be used as libraries, and has led to the creation of an entire compiler-assisted ecosystem of tools. The relatively friendly codebase of Clang and advancements in the JIT infrastructure in LLVM further enable research into different methods for processing C++ by blurring the boundary between compile time and runtime. Challenges include incremental compilation and fitting compile/link time optimizations into a more dynamic environment.

Incremental compilation pipelines process code chunk-by-chunk by building an ever-growing translation unit. Code is then lowered into the LLVM IR and subsequently run by the LLVM JIT. Such a pipeline allows creation of efficient interpreters. The interpreter enables interactive exploration and makes the C++ language more user friendly. The incremental compilation mode is used by the interactive C++ interpreter, Cling, initially developed to enable interactive high-energy physics analysis in a C++ environment.

Our group is working to incorporate, and possibly redesign, parts of Cling in Clang mainline through a new tool, clang-repl. The project aims to enhance error recovery when users type C++ at the clang-repl prompt.

Task ideas and expected results

There are several tasks to improve the current rudimentary state of the error recovery:

Extend the test coverage for error recovery
Find and fix cases where there are bugs
Implement template instantiation error recovery support
Implement argument-dependent lookup (ADL) recovery support

Implement missing C++26 features in Clang

#cppalliance-fellow#cppalliance-fellow-26#clang#c++26#standards#parsing#sema#diagnostics#codegen

This project proposes a candidate-driven program to complete Clang’s support for selected C++26-era language papers and closely related C proposals. The project provides a vehicle for applicants to propose a standards paper that is not yet implemented in Clang (or to pick from a short curated list) and then implement that paper within the scope of the project. Work may touch parsing, semantic analysis, diagnostics, preprocessing, or code generation; when appropriate, a clang-tidy/clang-sa prototype will be offered first so behavior can be validated opt-in before moving into the default compiler.

Candidates are encouraged to propose compact, well-scoped papers (syntax sugar, diagnostics, small semantic clarifications, preprocessing or embedding features) that have clear testability and low ABI risk. If a candidate does not have their own paper, the following example papers are suggested starting points:

Introduce storage-class specifiers for compound literals (C23 example).
if declarations for C (C2y), based on C++ practice.
#embed / binary resource inclusion (P1967), a C++ proposal that maps to a C-style inclusion concept.
“Preprocessing is never undefined” / better preprocessing diagnostics (P2843, diagnostics-focused).

Care must be taken to avoid breaking existing code, destabilizing diagnostics that downstream users depend on, or adding noisy warnings by default. All changes will be incremental, conservative by default, and backed by targeted regression tests and example-driven validation. Where a change could be noisy or controversial, an opt-in clang-tidy/clang-sa prototype or command-line flag will be used initially.

Task ideas and expected results

A concise intake form and selection rubric for candidate papers (paper skeleton + required example test cases).
A prioritized backlog of selected papers to implement in this cycle.
For each accepted paper: an implementation plan (scoped tasks), incremental PRs, and a mentor assigned from the Clang team.
Reference Clang implementations for selected papers (parser + Sema + diagnostics + CodeGen as needed), or clang-tidy/clang-sa prototypes when opt-in behavior is preferred.
Regression test suites (lit tests, compile-and-run where applicable) and CI integration proving correctness and no regressions on existing test suites.
Upstreamed PRs with review iteration, and a short implementation summary per paper (what changed, why, and any remaining limitations or interactions).
Onboarding docs for future candidates: the minimal paper skeleton, example tests, and a short “how to implement a language paper in Clang” checklist.

Success Criteria

Implemented papers match the intent and precise wording of the submitted standards text (or a documented, reviewers-approved interpretation).
All new tests pass in Clang CI; no regressions on the existing LLVM/Clang test suite; no unacceptable performance or memory regressions on representative large codebases.
At least one implemented paper per funding sprint is accepted upstream as a Clang change or as a well-documented clang-tidy prototype with clear adoption guidance.
New diagnostics (if any) are low-noise in practice; fix-its are safe and validated by compile-and-run tests where applicable.
Positive reviewer and community feedback indicating the implementations are correct, maintainable, and useful.

Testing and Verification

Require a minimal reproducer and at least 5 short lit-style test cases submitted with each candidate paper (good/bad/borderline/edge cases).
Add lit tests under clang/test (clang/test/CodeGen where codegen is touched); add diagnostics tests verifying message text and fix-its when applicable.
For changes that affect preprocessing, templates, or macros, include macro-heavy test cases and negative tests to avoid false positives.
For fixes touching runtime/codegen behaviour, include runtime tests (where feasible) to validate observable behavior and ABI stability.
For changes that add analysis or dataflow: include microbenchmarks and memory checks to ensure no unacceptable compile-time or peak memory regressions.
Run the new tests and a selected subset of the LLVM/Clang test suite and perfbench on representative OSS projects as part of the PR pipeline.

Difficulty: 4/10; Expected timeline: 6 months FTE (program setup + 2-4 small/medium paper implementations, or equivalent aggregated effort; individual paper scope will affect schedule).

Improve Clang Performance

#cppalliance-fellow#cppalliance-fellow-26#clang#performance#benchmarking#profiling#memory-optimization#codegen

This project aims to systematically improve Clang’s performance by identifying, understanding, and addressing common compilation bottlenecks across real-world codebases. While Clang is highly capable, performance regressions and inefficiencies can accumulate over time due to increasing language complexity, new features, and evolving usage patterns. These issues can impact compile time, peak memory usage, and overall memory pressure, especially in large projects.

The work is divided into two major phases. First, we will benchmark Clang across a representative set of open-source projects and compiler releases to identify performance hot spots and regressions. This includes analyzing trends over time to determine where performance has degraded and correlating regressions with specific upstream changes. Second, we will deeply investigate identified bottlenecks across the major compilation phases (lexing, parsing, semantic analysis, and code generation) and design targeted improvements to address them.

Care needs to be taken to avoid changes that compromise correctness, diagnostic quality, or maintainability. Improvements will be data-driven, incremental, and supported by strong evidence from benchmarks and profiling. Where appropriate, fixes will prioritize broadly impactful improvements rather than narrowly optimized corner cases.

Improving Clang’s performance reduces build times, lowers resource consumption, and improves developer productivity across the ecosystem. By identifying root causes of regressions and addressing systemic inefficiencies, this project helps ensure Clang scales effectively with modern C++ workloads.

Task ideas and expected results

A benchmarking framework and methodology for evaluating Clang performance across releases.
A curated set of real-world open-source benchmarks covering different compilation phases.
A documented analysis of identified performance bottlenecks and regressions.
Targeted performance improvements addressing key issues in lexing, parsing, semantic analysis, and/or code generation.
Memory usage and peak memory pressure optimizations where applicable.
A summary report documenting regressions found, fixes implemented, and measurable improvements achieved.

Success Criteria

The project is successful if the identified performance issues are reproducible, well-understood, and meaningfully improved without introducing correctness regressions. All changes should be validated by benchmarks, pass existing and new tests, and be accepted upstream. Measurable improvements in compile time and/or memory usage on representative codebases, along with positive upstream feedback, indicate success.

Testing and Verification

We should:

Run performance benchmarks across multiple Clang releases to detect regressions.
Use profiling tools to attribute time and memory usage to specific compiler components.
Add regression benchmarks or tests where appropriate to prevent future slowdowns.
Validate improvements on large, real-world codebases.
Ensure that compile-time performance gains do not introduce excessive memory usage or correctness issues.

Difficulty: 3/10; Expected timeline: 6 months FTE

Process High-Impact clang-tidy and Clang Static Analyzer Requests

#cppalliance-fellow#cppalliance-fellow-26#clang#clang-tidy#static-analyzer#diagnostics#tooling#regression-tests

This project proposes to systematically process and resolve the highest-impact open requests in clang-tidy and the Clang Static Analyzer. The clang-tidy issue tracker contains a large backlog of reports covering false positives, false negatives, invalid fix-it hints, performance regressions, and requests for new checks. Many of these issues are well-motivated and affect real-world adoption, but remain unresolved due to limited maintainer time rather than technical difficulty.

The focus of this project is to triage these requests, identify those with the highest user impact, and implement conservative, well-tested fixes or improvements. Work will prioritize issues that reduce noise, fix incorrect or unsafe fix-its, close obvious coverage gaps, or address performance and scalability problems in commonly used checks. Where appropriate, fixes will apply both to clang-tidy and to the Clang Static Analyzer, depending on the nature of the analysis and the desired default behavior.

Care must be taken to avoid introducing new false positives, breaking existing workflows, or increasing analysis cost. All changes will be driven by minimal reproducer test cases, include targeted regression tests, and follow existing Clang and LLVM contribution guidelines. Fix-it hints will only be added or modified when correctness can be guaranteed.

By reducing long-standing friction points and improving the reliability of existing checks, this work improves trust in Clang-based tooling and directly benefits a broad segment of the C and C++ ecosystem without requiring users to adopt new tools or configurations.

Task ideas and expected results

A triaged and prioritized list of high-impact open clang-tidy and Clang Static Analyzer issues.
Implementations addressing a selected set of high-impact requests (e.g., false positives, false negatives, invalid fix-its, or performance issues).
Improved or corrected fix-it hints where safe and appropriate.
Targeted regression tests for each resolved issue.
Documentation or issue updates explaining the resolution and any remaining limitations.
A short summary documenting which issues were addressed and why they were prioritized.

Success Criteria

The resolved issues are reproducible, correctly fixed, and accepted upstream. The changes reduce real-world false positives or incorrect behavior without introducing regressions, pass all existing and new tests, and receive positive feedback from users and reviewers indicating improved usefulness and reliability of clang-tidy and the Clang Static Analyzer.

Testing and Verification

We will add minimal reproducer-based regression tests for each issue, include negative tests to prevent spurious diagnostics, validate fix-it correctness on representative examples, and ensure that analysis performance and memory usage do not regress on large codebases.

Difficulty: 4/10; Expected timeline: 6 months FTE

Enhance Clang Diagnostics

#cppalliance-fellow#cppalliance-fellow-26#clang#diagnostics#clang-tidy#fixit#developer-experience

This project proposes to improve the clarity, usefulness, and robustness of Clang’s diagnostics so that developers can more easily understand and fix errors and warnings in C++ code. Many existing diagnostics are technically correct but difficult to interpret, especially for newcomers or when dealing with complex language features. This project focuses on refining wording, adding contextual information, introducing high-value new diagnostics, and providing reliable fix-it hints where safe and appropriate. We will also upstream mature clang-tidy checks into Clang itself when they are broadly applicable.

Beyond immediate improvements, this work establishes a foundation for more systematic diagnostic quality improvements, such as developing informal style guidelines and identifying recurring sources of confusion. Care must be taken to avoid false positives, excessive verbosity, or breaking users who rely on stable diagnostic output. Changes will therefore be incremental, well-tested, and conservative by default.

Clearer diagnostics reduce debugging time, lower the barrier to entry for C++, and improve developer productivity across the ecosystem. By upstreaming proven checks and improving default behavior, the benefits reach all Clang users without requiring additional tools or configuration.

Task ideas and expected results

A curated set of rewritten diagnostics with clearer wording and added context.
New default diagnostics for subtle or silent semantic issues.
Improved and expanded fix-it hints.
Migration of selected clang-tidy checks into Clang proper.
A list of clang-tidy diagnostics that can be moved to Clang, including but not limited to bugprone-return-const-ref-from-parameter, bugprone-raw-memory-call-on-non-trivial-type, bugprone-multiple-new-in-one-expression, bugprone-multiple-statement-macro, and bugprone-use-after-move.
Implementation of diagnostics for silent change of semantics during third party code changes such as https://godbolt.org/z/E9Me1djP6
A short summary documenting which diagnostics were improved and why.

Success Criteria

The enhanced diagnostics trigger correctly in intended scenarios, introduce no significant false positives or regressions, pass all existing and new tests, and are accepted upstream with positive feedback indicating improved clarity and usefulness.

Testing and Verification

We should add targeted regression tests for each diagnostic change, validate fix-its on representative examples, perform negative testing to prevent spurious warnings, and ensure compile-time performance remains stable on large codebases.

Difficulty: 5/10; Expected timeline: 6 months FTE

On Demand Parsing in Clang

#cppalliance-fellow#cppalliance-fellow-26#clang#parsing#lazy-parsing#scalability#memory-optimization#templates#cling

Clang, like any C++ compiler, parses a sequence of characters as they appear, linearly. The linear character sequence is then turned into tokens and AST before lowering to machine code. In many cases the end-user code uses a small portion of the C++ entities from the entire translation unit but the user still pays the price for compiling all of the redundancies.

This project proposes to process C++ entities that are heavy to compile when they are used rather than eagerly. This approach is already adopted in Clang’s CodeGen, where it allows Clang to produce code only for what is being used. On-demand compilation is expected to significantly reduce peak memory during compilation and improve compile time for translation units that use their contents sparsely. In addition, it would have a significant impact on interactive C++, where header inclusion essentially becomes a no-op and entities are parsed only on demand.

The Cling interpreter implements a very naive but efficient cross-translation unit lazy compilation optimization which scales across hundreds of libraries in the field of high-energy physics.

  // A.h
  #include <string>
  #include <vector>
  template <class T, class U = int> struct AStruct {
    void doIt() { /*...*/ }
    const char* data;
    // ...
  };

  template<class T, class U = AStruct<T>>
  inline void freeFunction() { /* ... */ }
  inline void doit(unsigned N = 1) { /* ... */ }

  // Main.cpp
  #include "A.h"
  int main() {
    doit();
    return 0;
  }

        

This pathological example expands to 37253 lines of code to process. Cling builds an index (which it calls an autoloading map) that contains only forward declarations of these C++ entities. These forward declarations amount to 3000 lines of code.

The index looks like:

  // A.h.index
  namespace std{inline namespace __1{template <class _Tp, class _Allocator> class __attribute__((annotate("$clingAutoload$vector")))  __attribute__((annotate("$clingAutoload$A.h")))  __vector_base;
    }}
  ...
  template <class T, class U = int> struct __attribute__((annotate("$clingAutoload$A.h"))) AStruct;

        

Upon requiring the complete type of an entity, Cling includes the relevant header file to get it. There are several trivial workarounds to deal with default arguments and default template arguments as they now appear on the forward declaration and then the definition. You can read more here.

Although the implementation could not be called a reference implementation, it shows that the Parser and the Preprocessor of Clang are relatively stateless and can be used to process character sequences which are not linear in their nature. In particular namespace-scope definitions are relatively easy to handle and it is not very difficult to return to namespace-scope when we lazily parse something. For other contexts such as local classes we will have lost some essential information such as name lookup tables for local entities. However, these cases are probably not very interesting as the lazy parsing granularity is probably worth doing only for top-level entities.

Such implementation can help with already existing issues in the standard such as CWG2335, under which the delayed portions of classes get parsed immediately when they’re first needed, if that first usage precedes the end of the class. That should give good motivation to upstream all the operations needed to return to an enclosing scope and parse something.

Implementation approach:

Upon seeing a tag definition during parsing we could create a forward declaration, record the token sequence and mark it as a lazy definition. Later upon complete type request, we could re-position the parser to parse the definition body. We already skip some of the template specializations in a similar way [commit, commit].

Another approach is for every lazily parsed entity to record its token stream, and to change the Toks stored on LateParsedDeclarations to optionally refer to a subsequence of the externally stored token sequence instead of storing its own sequence (or maybe change CachedTokens so it can do that transparently). One of the challenges would be that we currently modify the cached tokens list to append an “eof” token, but it should be possible to handle that in a different way.

In some cases, a class definition can affect its surrounding context in a few ways you’ll need to be careful about here:

1) struct X appearing inside the class can introduce the name X into the enclosing context.

2) static inline declarations can introduce global variables with non-constant initializers that may have arbitrary side-effects.

For point (2), there’s a more general problem: parsing any expression can trigger a template instantiation of a class template that has a static data member with an initializer that has side-effects. Unlike the above two cases, I don’t think there’s any way we can correctly detect and handle such cases by some simple analysis of the token stream; actual semantic analysis is required to detect such cases. But perhaps if they happen only in code that is itself unused, it wouldn’t be terrible for Clang to have a language mode that doesn’t guarantee that such instantiations actually happen.
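
Both effects can be observed in a few lines of standard C++. In the sketch below, a member declaration using an elaborated type specifier makes the name X visible at namespace scope, and a static inline data member runs an initializer with a side effect even though the member is never used. The names Outer, X, noteInit and initCount are hypothetical; the check on initCount assumes the common behaviour of running such initializers before main, which the standard leaves implementation-defined for inline variables.

```cpp
#include <cassert>
#include <iostream>
#include <type_traits>

int initCount = 0;
int noteInit() { return ++initCount; }

struct Outer {
  // (1) `struct X` here declares X in the enclosing namespace, not in Outer.
  struct X *link = nullptr;
  // (2) Initializer with a side effect; it runs although `tag` is unused.
  static inline int tag = noteInit();
};

// Defines the namespace-scope X that the member declaration introduced.
struct X {
  int id = 7;
};

static_assert(std::is_same_v<decltype(Outer::link), ::X *>,
              "the member refers to the namespace-scope X");

int main() {
  Outer outer;
  X target;
  outer.link = &target;
  assert(outer.link->id == 7);
  assert(initCount == 1); // assumes initialization before main
  std::cout << initCount << ' ' << outer.link->id << '\n';
}
```

A lazy parser that skipped the body of Outer would have to account for both effects without parsing it, which is the difficulty described above.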

An alternative and more efficient implementation could be to make the lookup tables range-based, but we do not even have a prototype proving that this is a feasible approach.

Task ideas and expected results

A prototype deferring parsing of non-templated functions and classes.
Initial support for lazy parsing of class and struct definitions.
Benchmark results comparing memory usage and compile time.
A design document or RFC describing the approach and trade-offs.
A stretch prototype extending the mechanism to templates.

Success Criteria

The prototype shows measurable reductions in memory usage or compile time on representative workloads, preserves correct semantics, passes Clang’s test suite, and receives constructive community feedback through the RFC process.

Testing and Verification

We should add regression tests covering lazy parsing behavior, validate correctness on pathological and real-world examples, benchmark against baseline builds, and ensure deferred parsing is triggered correctly when entities are referenced.

Difficulty: 10/10; Expected timeline: 6 months FTE

Optimize Usage of Source Locations in Clang Modules

#cppalliance-fellow#cppalliance-fellow-26#clang#modules#source-locations#memory-optimization#diagnostics

This project proposes to reduce source-location memory pressure in modular builds by reusing source-location allocations for duplicated inputs, extending the lifetime of Clang’s 32-bit source-location representation. In large modular builds, repeated inclusion of the same headers across modules can quickly exhaust available offsets.

Rather than immediately switching to a more invasive 64-bit representation, this project explores reusing existing allocations through interval mapping and careful coordination with module loading. This approach introduces complexity in diagnostics and include-stack reconstruction, so correctness and transparency are key concerns.
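
A toy model of the reuse idea is sketched below: a fixed-size offset space hands out slabs for inputs, returns the existing slab when the same input is seen again, and maps an offset back to its owning input with an interval lookup. The class OffsetSpace, its methods and the limit of 1000 are hypothetical and chosen for illustration; the sketch assumes the key identifies the input's content uniquely and does not model Clang's actual source manager.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <string>

class OffsetSpace {
public:
  explicit OffsetSpace(std::uint32_t limit) : limit_(limit) {}

  // Base offset for an input. An input seen before reuses its slab instead
  // of consuming new offsets. Returns nullopt when the space is exhausted.
  std::optional<std::uint32_t> reserve(const std::string &key,
                                       std::uint32_t size) {
    if (auto it = baseByKey_.find(key); it != baseByKey_.end())
      return it->second;
    if (size == 0 || size > limit_ - next_)
      return std::nullopt;
    const std::uint32_t base = next_;
    next_ += size;
    slabs_.emplace(base, Slab{size, key});
    baseByKey_.emplace(key, base);
    return base;
  }

  // Interval lookup: which input owns the given offset, if any?
  std::optional<std::string> owner(std::uint32_t offset) const {
    auto it = slabs_.upper_bound(offset); // first slab starting after offset
    if (it == slabs_.begin())
      return std::nullopt;
    --it;
    if (offset - it->first < it->second.size)
      return it->second.key;
    return std::nullopt;
  }

  std::uint32_t used() const { return next_; }

private:
  struct Slab {
    std::uint32_t size;
    std::string key;
  };
  std::uint32_t limit_;
  std::uint32_t next_ = 0;
  std::map<std::uint32_t, Slab> slabs_;             // base offset -> slab
  std::map<std::string, std::uint32_t> baseByKey_;  // input -> base offset
};

int main() {
  OffsetSpace space(1000);
  const auto first = space.reserve("vector", 400);
  const auto other = space.reserve("string", 300);
  const auto again = space.reserve("vector", 400); // duplicate input
  assert(first && other && again && *first == *again);
  assert(space.used() == 700); // the duplicate consumed no new offsets
  std::cout << space.used() << ' ' << *space.owner(450) << '\n';
}
```

In this model a reused slab has a single owner even when several modules include the same input, which hints at why include-stack reconstruction for diagnostics becomes harder.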

If successful, the work avoids a disruptive global change, reduces memory usage, and improves Clang’s scalability for modern modular C++ codebases.

Task ideas and expected results

An interval-mapping mechanism to detect and reuse source-location slabs.
Updates to module loading and deserialization to enable reuse.
A prototype demonstrating reduced duplication in multi-module builds.
Measurements showing memory savings.
Documentation of diagnostic implications and mitigations.

Success Criteria

Duplicated module inputs no longer cause proportional growth in source-location allocations, modular builds complete without exhaustion, diagnostics remain correct in tested cases, and the design is accepted or constructively reviewed upstream.

Testing and Verification

We should reproduce known problematic module scenarios, compare allocation statistics before and after changes, run regression tests with emphasis on diagnostics, and validate behavior across different module load orders.

Difficulty: 6/10; Expected timeline: 6 months FTE

Consistent Error Recovery Infrastructure

#cppalliance-fellow#cppalliance-fellow-26#clang#error-recovery#repl#interactive#robustness

This project proposes to improve Clang’s error recovery, particularly for interactive and incremental use cases such as clang-repl. Today, invalid or incomplete C++ input can easily leave the compiler in an unrecoverable state, limiting its usefulness for exploration, teaching, and rapid prototyping.

This project strengthens recovery across templates, ADL failures, and name collisions, while improving performance and crash resilience. Error recovery must be powerful without hiding real problems or destabilizing compiler state, so improvements will be carefully scoped and heavily tested.

Better recovery makes interactive C++ tooling more practical, supports educational workflows, and provides shared infrastructure for future incremental and REPL-based tools built on Clang.

Task ideas and expected results

Expanded test coverage for error recovery scenarios.
Fixes for known recovery bugs.
Recovery support for templates, ADL, and name collisions.
Optional bump-allocator support for recovery paths.
Improved crash resilience and value printing in clang-repl.

Success Criteria

The compiler and REPL recover from common errors without crashing, continue accepting input after failures, pass all tests, and demonstrate clearly improved interactive usability.

Testing and Verification

We should simulate interactive sessions with invalid input, add targeted regression tests for recovery paths, fuzz error cases to detect crashes, and ensure normal compilation performance is unaffected. In particular, we should be able to include a test case in Clang, undo it, and include it again without any errors being produced.

Difficulty: 7/10; Expected timeline: 6 months FTE