On Demand Parsing in Clang
#clang #compiler #parsing #performance #memory-optimization #cling
Clang, like any C++ compiler, parses a sequence of characters as they appear,
linearly. The linear character sequence is then turned into tokens and an AST
before being lowered to machine code. In many cases the end-user code uses only
a small portion of the C++ entities in the translation unit, but the user
still pays the price of compiling all the unused ones.
This project proposes to process C++ entities that are expensive to compile
when they are first used rather than eagerly. Clang’s CodeGen already takes
this approach, which lets Clang produce code only for what is being used.
On-demand compilation is expected to significantly reduce peak memory use
during compilation and to improve compile time for translation units that use
their contents sparsely. In addition, it would have a significant impact on
interactive C++, where header inclusion essentially becomes a no-op and
entities are parsed only on demand.
The Cling interpreter implements a very naive but efficient cross-translation
unit lazy compilation optimization which scales across hundreds of libraries
in the field of high-energy physics.
// A.h
#include <string>
#include <vector>
template <class T, class U = int> struct AStruct {
  void doIt() { /*...*/ }
  const char* data;
  // ...
};
template<class T, class U = AStruct<T>>
inline void freeFunction() { /* ... */ }
inline void doit(unsigned N = 1) { /* ... */ }
// Main.cpp
#include "A.h"
int main() {
  doit();
  return 0;
}
This pathological example expands to 37253 lines of code to process. Cling
builds an index (which it calls an autoloading map) that contains only
forward declarations of these C++ entities. The index is 3000 lines of code
and looks like:
// A.h.index
namespace std{inline namespace __1{template <class _Tp, class _Allocator> class __attribute__((annotate("$clingAutoload$vector"))) __attribute__((annotate("$clingAutoload$A.h"))) __vector_base;
}}
...
template <class T, class U = int> struct __attribute__((annotate("$clingAutoload$A.h"))) AStruct;
When the complete type of an entity is required, Cling includes the relevant
header file to get it. There are several trivial workarounds to deal with
default arguments and default template arguments, as they now appear both on
the forward declaration and on the definition. You can read more here.
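The conflict exists because C++ lets a default template argument be specified only once per template in a translation unit. The following is a minimal sketch of that constraint using `AStruct` from the example above. Keeping the default on the forward declaration and dropping it from the definition is an assumption made for illustration, not a description of how Cling resolves it.

```cpp
#include <type_traits>

// Forward declaration as it might appear in the index: it carries the default.
template <class T, class U = int> struct AStruct;

// The definition seen later must not repeat "= int". Repeating it would be a
// redefinition of the default template argument and would fail to compile.
template <class T, class U> struct AStruct {
  void doIt() {}
  const char* data = nullptr;
};

// The default from the forward declaration still applies to the definition.
static_assert(std::is_same<AStruct<char>, AStruct<char, int>>::value,
              "default argument comes from the forward declaration");
```

The same rule applies to default function arguments such as `doit(unsigned N = 1)`: only one declaration in a scope may supply the default.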
Although the implementation could not be called a reference implementation,
it shows that the Parser and the Preprocessor of Clang are relatively stateless
and can be used to process character sequences which are not linear in
nature. In particular, namespace-scope definitions are relatively easy to
handle, and it is not very difficult to return to namespace scope when we
lazily parse something. For other contexts, such as local classes, we will have
lost some essential information, such as the name lookup tables for local
entities. However, these cases are probably not very interesting, as lazy
parsing is probably worth doing only at the granularity of top-level entities.
Such an implementation can help with existing issues in the standard such
as CWG2335, under which the delayed portions of classes get parsed immediately
when they’re first needed, if that first usage precedes the end of the class.
That should give good motivation to upstream all the operations needed to
return to an enclosing scope and parse something.
Implementation approach:
Upon seeing a tag definition during parsing, we could create a forward
declaration, record the token sequence and mark it as a lazy definition. Later,
upon a complete type request, we could re-position the parser to parse the
definition body. We already skip some of the template specializations in a
similar way [commit, commit].
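The record-then-parse-on-request idea can be sketched with a toy model. Everything below (`ToyParser`, `LazyTag`, `requireCompleteType`) is hypothetical and is not a Clang API: tokens are plain strings, and "parsing" a body only counts data member declarations. It is meant to show the control flow under those assumptions, not a design for Clang.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Token = std::string; // toy token: just its spelling

struct LazyTag {
  std::size_t bodyBegin = 0;  // index of the first token after '{'
  std::size_t bodyEnd = 0;    // index of the matching '}'
  bool complete = false;      // has the body been parsed yet?
  std::size_t numMembers = 0; // filled in when the body is parsed
};

class ToyParser {
public:
  explicit ToyParser(std::vector<Token> toks) : Toks(std::move(toks)) {}

  // First pass: on "struct Name {", register a forward declaration, remember
  // where the body lives, and skip to the matching '}' without parsing it.
  void parseTopLevel() {
    for (std::size_t i = 0; i + 2 < Toks.size(); ++i) {
      if (Toks[i] != "struct" || Toks[i + 2] != "{")
        continue;
      LazyTag tag;
      tag.bodyBegin = i + 3;
      std::size_t depth = 1, j = tag.bodyBegin;
      for (; j < Toks.size() && depth > 0; ++j) {
        if (Toks[j] == "{") ++depth;
        else if (Toks[j] == "}") --depth;
      }
      tag.bodyEnd = j - 1;
      Tags[Toks[i + 1]] = tag;
      i = j - 1; // resume scanning after the skipped body
    }
  }

  // On demand: re-position onto the recorded range and "parse" it once.
  // Parsing here only counts ';' tokens at brace depth 0 inside the body.
  const LazyTag& requireCompleteType(const std::string& name) {
    LazyTag& tag = Tags.at(name);
    if (!tag.complete) {
      std::size_t depth = 0;
      for (std::size_t i = tag.bodyBegin; i < tag.bodyEnd; ++i) {
        if (Toks[i] == "{") ++depth;
        else if (Toks[i] == "}") --depth;
        else if (Toks[i] == ";" && depth == 0) ++tag.numMembers;
      }
      tag.complete = true;
      ++BodiesParsed;
    }
    return tag;
  }

  bool isDeclared(const std::string& name) const { return Tags.count(name) != 0; }
  std::size_t bodiesParsed() const { return BodiesParsed; }

private:
  std::vector<Token> Toks;
  std::unordered_map<std::string, LazyTag> Tags;
  std::size_t BodiesParsed = 0;
};
```

After `parseTopLevel()`, every struct is known by name but no body has been processed. Only the bodies that are later passed to `requireCompleteType` are parsed, and each one at most once.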
Another approach is for every lazily parsed entity to record its token stream,
and to change the Toks stored on LateParsedDeclarations so that they can
optionally refer to a subsequence of the externally stored token sequence
instead of storing their own sequence (or maybe change CachedTokens so it can
do that transparently). One of the challenges would be that we currently modify
the cached tokens list to append an “eof” token, but it should be possible to
handle that in a different way.
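One way to picture tokens that refer to a subsequence of an externally stored sequence is a non-owning range over a shared buffer, with a cursor that reports the end of the range itself instead of relying on an appended "eof" token. The sketch below uses hypothetical names (`TokenBuffer`, `TokenRange`, `RangeCursor`) and does not reflect the actual layout of Clang's CachedTokens.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Token { std::string spelling; }; // toy token

// One buffer owns every token of the translation unit.
using TokenBuffer = std::vector<Token>;

// A late-parsed entity stores a [begin, end) range into the shared buffer
// instead of a private copy of its tokens. The buffer must outlive the range.
class TokenRange {
public:
  TokenRange(const TokenBuffer& buf, std::size_t begin, std::size_t end)
      : Buf(&buf), Begin(begin), End(end) {
    assert(begin <= end && end <= buf.size());
  }
  std::size_t size() const { return End - Begin; }
  const Token& operator[](std::size_t i) const { return (*Buf)[Begin + i]; }

private:
  const TokenBuffer* Buf;
  std::size_t Begin, End;
};

// The cursor knows where the range ends, so nothing has to be appended to
// (or later removed from) the shared buffer to mark the end of input.
class RangeCursor {
public:
  explicit RangeCursor(TokenRange r) : Range(r) {}
  bool atEnd() const { return Pos == Range.size(); }
  const Token& next() {
    assert(!atEnd());
    return Range[Pos++];
  }

private:
  TokenRange Range;
  std::size_t Pos = 0;
};

// Example consumer: join the spellings of a range with single spaces.
inline std::string spell(TokenRange r) {
  std::string out;
  for (RangeCursor c(r); !c.atEnd();) {
    if (!out.empty()) out += ' ';
    out += c.next().spelling;
  }
  return out;
}
```

Because the range never mutates the buffer, several lazily parsed entities can share one token store, including nested entities whose ranges overlap.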
A class definition can affect its surrounding context in a few ways that need
care here:
1) struct X appearing inside the class can introduce the name X into the enclosing context.
2) static inline declarations can introduce global variables with non-constant initializers
that may have arbitrary side-effects.
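Both effects can be reproduced in a few lines. The sketch below assumes C++17 (for `static inline` data members); `Outer`, `X`, `bump` and `initCount` are hypothetical names chosen for illustration.

```cpp
#include <cassert>

// Hypothetical counter that makes the initializer's side effect observable.
inline int& initCount() { static int n = 0; return n; }
inline int bump() { return ++initCount(); }

struct Outer {
  // (1) X has not been declared before, so this elaborated type specifier
  // declares X in the enclosing namespace, not as a member of Outer.
  struct X* link = nullptr;

  // (2) C++17 inline static member: a variable with a non-constant
  // initializer that runs at program start-up even if Outer is never used.
  static inline int registered = bump();
};

// Compiles only because parsing Outer's body introduced the name X into
// the enclosing scope. X stays incomplete, which is fine for a pointer.
inline X* noLink() { return nullptr; }
```

If the body of `Outer` were skipped and never parsed, `noLink` would fail to compile because `X` would be unknown, and the call to `bump()` would silently not happen.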
For point (2), there’s a more general problem: parsing any expression can trigger
a template instantiation of a class template that has a static data member with
an initializer that has side-effects. Unlike the above two cases, I don’t think
there’s any way we can correctly detect and handle such cases by some simple analysis
of the token stream; actual semantic analysis is required to detect such cases. But
perhaps if they happen only in code that is itself unused, it wouldn’t be terrible
for Clang to have a language mode that doesn’t guarantee that such instantiations
actually happen.
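A small example shows why a scan of the token stream is not enough. In the sketch below (hypothetical names `Registrar`, `neverCalled`, `instantiationEffects`), nothing in the tokens of `neverCalled` hints at a side effect. Yet parsing its body is what instantiates the static data member, whose initializer then runs even though the function is never called.

```cpp
#include <cassert>

// Hypothetical counter that makes the side effect observable.
inline int& instantiationEffects() { static int n = 0; return n; }

template <class T> struct Registrar {
  // The definition below is instantiated only for those T where something
  // odr-uses Registrar<T>::token.
  static int token;
};
template <class T> int Registrar<T>::token = ++instantiationEffects();

// Never called. Semantic analysis of this body still instantiates
// Registrar<int>::token, so its initializer becomes part of the program.
int neverCalled() { return Registrar<int>::token; }
```

A language mode that parses `neverCalled` lazily would never instantiate `Registrar<int>::token`, so the increment would not take place. This is the kind of behavior change the paragraph above suggests might be acceptable when the triggering code is itself unused.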
An alternative and more efficient implementation could be to make the lookup
tables range-based, but we do not even have a prototype proving that this is a
feasible approach.
Task ideas and expected results
- Design and implementation of on-demand compilation for non-templated functions
- Support non-templated structs and classes
- Run performance benchmarks on relevant codebases and prepare a report
- Prepare a community RFC document
- [Stretch goal] Support templates
The successful candidate should commit to regular participation in weekly
meetings, deliver presentations, and contribute blog posts as requested.
Additionally, they should demonstrate the ability to navigate the
community process with patience and understanding.