C++ Language Interoperability Layer

Overview

The C++ programming language is used for many numerically intensive scientific applications. A combination of performance and solid backward compatibility has led to its use for many research software codes over the past 20 years. Despite its power, C++ is often seen as difficult to learn and inconsistent with rapid application development. Exploration and prototyping are slowed down by the long edit-compile-run cycles during development. Exploratory programming is an effective way to gain a deeper understanding of a project’s requirements, reduce the complexity of a problem, and provide early validation of the system’s design and implementation. This is among the strengths of Python and a major design goal of newer languages such as Julia, D and Swift.

Two of the most widely used languages by researchers are C++ and Python [1]. Python has grown steadily as a language of choice for data science and application control. The interactive nature of Python and its many available libraries make it an excellent choice for scripting tasks and code prototyping. However, the native computational performance of Python is mediocre. Python includes functionality for replacing the most critical components of a processing kernel with implementations in C. This functionality is insufficient to fully cover many scientific use cases because crossing the language boundary is expensive due to limitations in current tools.

This document describes key aspects of language interoperability with C++ using an automated binding approach. The primary initial focus is to support automatic binding to and from Python; however, the approach is generic enough to fit other languages such as D and Julia. The document shows a prototype using the proposed language interoperability layer capable of instantiating a C++ template from within Python.

Background

The most widely used Python binding tools are static binding generators with their own parsers, such as SWIG [2] and SIP [3]. They require user code, written in a custom steering language, to drive the generation process because of their limited parsers and insufficient type introspection. This custom language allows the user to fill in the missing introspection information or to work around unsupported features in the parsers. These tools have limited C++ support and require many manual interventions.

There is another set of tools, such as Boost.Python [4] and Pybind11 [5]. They are C++ APIs built on top of the Python C API, which has limited capabilities and hides important details such as C++-specific memory management patterns (e.g., reference counting). These tools have better C++ support but still require manual intervention; for example, the library author needs to write code that defines the needed language bindings.

Despite the continuous improvements in the tools that generate language bindings, they share a fundamental design limitation: they are primarily designed for use by programmers, or by library authors who want to provide language bindings.

The rise of C++ interpreters (notably Cling [6]) lifted this limitation by allowing on-demand automatic bindings to be built for any well-behaved C++ package. On-demand bindings bring improved performance and ease of development and use. They enable efficient code development and debugging, while also bringing a production runtime benefit. In particular, cppyy [9] uses the introspection information available in the C++ interpreter to generate bindings on the fly, and is currently the only fully dynamic Python binding generator. One of Julia’s C++ interoperability packages, Cxx.jl, takes a similar approach [7]. While existing binders work during library development, cppyy and Cxx.jl work at runtime and can bind code from arbitrary libraries.

The common binding approaches can be classified into static and on-demand (or dynamic) bindings. Static bindings (e.g., those generated with Pybind11) are implemented as a series of invocations of Python APIs to create the needed run-time structures. Static bindings are often inefficient because the generated layer lives partly in C++ and partly in Python. In some cases they need redundant trampoline functions to match object inheritance semantics, as illustrated below.
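
As an illustration, the sketch below shows a hand-written static binding in the style of Pybind11, closely following the trampoline pattern from its documentation (assuming pybind11 2.6 or newer); the Animal and PyAnimal names are purely illustrative.

#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

// A C++ interface that Python code should be able to subclass.
struct Animal {
  virtual ~Animal() = default;
  virtual std::string go(int n_times) = 0;
};

// The "trampoline" class forwards virtual calls back into Python.
struct PyAnimal : Animal {
  using Animal::Animal;
  std::string go(int n_times) override {
    PYBIND11_OVERRIDE_PURE(std::string, Animal, go, n_times);
  }
};

// The binding itself is a series of explicit invocations of the binding API.
PYBIND11_MODULE(example, m) {
  py::class_<Animal, PyAnimal>(m, "Animal")
      .def(py::init<>())
      .def("go", &Animal::go);
}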

Dynamic approaches suffer from a common misconception that building something dynamically is necessarily slow. In practice, the performance bottleneck of a dynamic system is the interpreted language environment that needs to access the C++ code. In fact, the cppyy approach shows better performance than Pybind11 [8], [9]. The cppyy approach can be further improved by replacing its interoperability layer, which is based on parsing strings; we propose a more efficient implementation in this document.

State of the Art in Dynamic Binding Implementations

The dynamic bindings approach requires a compiler instance which is alive during the entire execution of the program. This way it can provide the necessary information about C++ types on demand. In addition, the compiler should be able to handle multiple requests, which means it should provide some minimal support for incremental compilation. That is, the compiler should work as a library, which can be interfaced via an API and used as a service. Very few compilers are designed to work this way; one such compiler is Clang [10].

Cling and, more recently, Clang-Repl use Clang as a library to implement incremental compilation. Incremental compilation enables interpreter-like tools while keeping an active compiler instance at runtime. For example, Cling implements a facility, LookupHelper, illustrated in Listing 1, which takes a (possibly qualified) C++ name and checks whether a declaration with that qualified name already exists.

[cling] struct S{};
[cling] cling::LookupHelper& LH = gCling->getLookupHelper()
(cling::LookupHelper &) @0x7fcba3c0bfc0
[cling] auto D = LH.findScope("std::vector<S>",
				 cling::LookupHelper::DiagSetting::NoDiagnostics)
(const clang::Decl *) 0x1216bdcd8
[cling] D->getDeclKindName()
(const char *) "ClassTemplateSpecialization"

Listing 1.

In this particular case, findScope instantiates the template and returns its Clang AST representation. Template instantiation on demand addresses the common library problem of template combinatorial explosion. Template instantiation on demand, combined with the conversion of textual qualified C++ names into entity meta-information, has proven to be a very powerful mechanism aiding data serialization and language interoperability.

This approach is very powerful as it allows us to pass in strings and convert them into the appropriate compiler representation. One of its drawbacks is implementation complexity. In the example above, findScope requires a reentrant (recursive) parser, which can be much slower than using unqualified lookup facilities. Additionally, in the case of a failed lookup, the internal compiler data structures, which are designed to be immutable, make it very hard to reclaim the memory used.

Essential Features

This document is mostly inspired by the extensive experience we have accumulated via cppyy, but the usefulness of some of the features is reaffirmed by recent talks about D and Julia interoperability [11], [12].

C++ is a complex language with a complicated in-memory object layout. A feature-complete reflection-based interoperability mechanism must allow for:

  • Introspection – the ability of the program to examine itself. The program should be able to answer the general question “What am I”?
  • Reflection – the ability for a program to modify itself (including its behaviour and its state).

The introspection operation is usually a read-only operation. It generally checks for entity existence or properties. It must answer questions such as: “Is that declaration a namespace?”; “How many base classes does this class have?”; “What is the memory layout?”; and “Which is the selected overload candidate for these arguments?”. Technically, it is usually implementable as a parallel operation that does not require locks. The reflection operation modifies the program’s behavior and/or state. It needs to mutate the internal program representation or the values describing the program’s state in memory. It must be able to generate entities with particular properties or instantiate templates with a specific set of types.
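
To make the distinction concrete, a minimal sketch of how such a split could surface in a C-style interface is shown below; every name in it is hypothetical and only meant to mirror the questions above.

#include <cstddef>

using OpaqueCxxDecl = void*;

// Introspection: read-only queries, implementable without locks.
bool   Cxx_IsNamespace(OpaqueCxxDecl D);   // "Is that declaration a namespace?"
size_t Cxx_GetNumBases(OpaqueCxxDecl D);   // "How many base classes does this class have?"
size_t Cxx_GetSizeOf(OpaqueCxxDecl D);     // a (simplified) memory-layout query

// Reflection: operations that mutate the program representation and state,
// and therefore require synchronization.
OpaqueCxxDecl Cxx_InstantiateTemplate(OpaqueCxxDecl Tmpl, const char* Args);
void*         Cxx_CreateObject(OpaqueCxxDecl Record);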

This document does not focus on terminology and boundaries between where the passive part (introspection) ends and the active part (reflection) begins. This boundary is blurry in the C++ language as in certain cases the compiler is required to instantiate a template in order to decide which parsing rule it should later pick. In such a context the introspection relies on reflection.

In the rest of the document, we enumerate, non-exhaustively, the features that are important for language interoperability. We describe key concepts of the C++ language in a way that bridges them semantically with another language (e.g., Python, D, Swift, or Julia). We propose a C++ language interoperability layer (LIL) which bridges C++ to another language using a set of reusable utilities. We propose an interface organized for use from parallel systems, where only the reflection operations that mutate the AST should require locks.

Modeling Unqualified Lookup

In many scenarios, information about the details of an entity is accumulated during parsing. Parsing needs to know some information about the underlying C++ constructs to be successful. Consider the Python construct val = std.vector[int]((1,2,3)), where we need to know that std, and later vector, are known to the system, but without needing further detail during parsing. That is, it is sufficient if the interface returns whether the entity is known. Then, at the end of the statement, the interop layer can collect all the information and instruct the template instantiation system to try to instantiate std::vector<int> and construct it from {1, 2, 3}.
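
A rough staging of that statement, expressed with the prototype C API introduced later in Listing 2, could look as follows; the call sequence is illustrative only (the Appendix A prototype implements instantiation for member function templates, but the interface shape is the same).

// Staging of the Python statement `val = std.vector[int]((1, 2, 3))`.
Decl_t Std = Clang_LookupName("std");            // existence check only; no details needed yet
Decl_t Vec = Clang_LookupName("vector", Std);    // "vector" is known within std
// Only at the end of the statement is everything known:
Decl_t VecOfInt = Clang_InstantiateTemplate(Std, "vector", "int");
void*  Val      = Clang_CreateObject(VecOfInt);  // construction from (1, 2, 3) omitted here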

In fact, for certain scenarios an approximate lookup is sufficient. For example, the type int in Python might designate int8_t, uint8_t, short, unsigned short, int, unsigned int, long, unsigned long, long long, unsigned long long, int64_t, uint64_t, etc. In such cases, we should be able to express that we are looking up an integral type rather than the concrete type int.

C++ lookup rules are position dependent. That is, the results can be vastly different depending on the lexical scope or the position in the translation unit at which a lookup is performed. Until now, we have assumed that lookups are performed in the global scope, as if the parser were positioned at the current end of the translation unit. We may need to extend the design so that lookups also work correctly within a given lexical scope. The performance of a lookup is usually constant; however, processing the returned results can be challenging because of the ever-growing translation unit.

Modeling Templates. Instantiation

One of the key features of LIL is its ability to instantiate C++ templates. It must contain interfaces that allow instantiating a C++ template with concrete types. Continuing the example of std.vector[int]((1,2,3)), the C++ templated class std::vector should be instantiated with type int and then constructed with an initializer_list holding 3 values: 1, 2 and 3.

This process is more challenging when supporting type systems that are less strict than C++’s. For example, in Python int corresponds to an integral type rather than a concrete type. As template instantiation can fail due to minor type mismatches, such as missing const/volatile (cv) qualification of the types, the only option is to start guessing at the correct template arguments and adjusting the supplied information. This combinatorial process is often slow and cumbersome. Instead, LIL should report back why a template instantiation failed, including the deduction failure information (see clang::TemplateSpecCandidateSet and clang::DeductionFailureInfo), allowing for a more efficient template instantiation algorithm.
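
One possible shape for such an interface is sketched below; all of the type and function names are hypothetical and serve only to show how the deduction failure information could be surfaced.

#include <string>
#include <vector>

using OpaqueCxxDecl = void*;

// What a candidate looked like and why its deduction failed
// (distilled from clang::DeductionFailureInfo).
struct CxxInstantiationFailure {
  std::string CandidateSignature;
  std::string Reason;
};

struct CxxInstantiationResult {
  OpaqueCxxDecl Instantiation = nullptr;          // valid on success
  std::vector<CxxInstantiationFailure> Failures;  // populated on failure
};

CxxInstantiationResult InstantiateTemplate(OpaqueCxxDecl Scope,
                                           const char* Name,
                                           const char* Args);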

Modeling expression templates is also challenging from a performance standpoint. The type tree of an expression template is accumulated and only folded upon assignment or when an operation returning a concrete value is used. This can be problematic for interoperability because the type tree keeps growing and the folding operation often has no hint about the final type. An efficient design for expression templates is still being investigated.
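
The sketch below, under simplified assumptions (the Vec and Add types are made up for illustration), shows why the type tree keeps growing until an assignment folds it.

struct Vec { double data[3]; };

// Each operation produces a new node type instead of a value.
template <class L, class R>
struct Add { L lhs; R rhs; };

Add<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }
template <class L, class R>
Add<Add<L, R>, Vec> operator+(const Add<L, R>& a, const Vec& b) { return {a, b}; }

int main() {
  Vec a{}, b{}, c{};
  auto e = a + b + c;  // decltype(e) is Add<Add<Vec, Vec>, Vec>: the tree keeps growing
  (void)e;             // only an assignment to Vec (not shown) would fold it into a value
}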

Modeling Overload Sets

In many cases a query to LIL will yield ambiguous results. One example is the lookup of an overloaded function. LIL should provide an API to inspect the overload candidates and possibly apply a different set of overload resolution rules to select between them (see clang::OverloadCandidateSet).
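
A minimal sketch of what such an inspection interface might look like is given below; the names are hypothetical, and the OpaqueCxxLookupResult type anticipates the one used later in Listing 5.

#include <string>
#include <vector>

using OpaqueCxxDecl = void*;
using OpaqueCxxLookupResult = void*;

struct CxxOverload {
  OpaqueCxxDecl Fn;        // the candidate function declaration
  std::string Signature;   // human-readable form, useful for diagnostics
};

// Enumerate the candidates of an ambiguous lookup...
std::vector<CxxOverload> GetOverloads(OpaqueCxxLookupResult R);
// ...and select one of them using caller-supplied (possibly non-C++) rules.
OpaqueCxxDecl SelectOverload(OpaqueCxxLookupResult R, const char* ArgTypes);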

Diagnostics

Communication to and from C++ happens via code passed as strings, or by passing internal representations of C++ entities to the compiler API [13]. In the first case, if there is an inconsistency in the code passed as a string, the compiler issues an on-screen diagnostic. In the second case, the API call can produce an on-screen diagnostic as well as a flag that programmatically signals failure. In both cases LIL should capture such diagnostics and make them available to the caller. The resulting C++ diagnostics can be overwhelming: LIL should also provide diagnostic filters and the ability to suppress diagnostics entirely.
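
As an illustration, a possible capture-oriented diagnostics interface is sketched below; all names are hypothetical.

#include <string>
#include <vector>

enum class CxxDiagLevel { Note, Warning, Error };

struct CxxDiagnostic {
  CxxDiagLevel Level;
  std::string Message;
};

// Diagnostics produced while processing Code are returned to the caller
// instead of being printed to the terminal.
std::vector<CxxDiagnostic> ParseWithDiagnostics(const char* Code);

// Filtering and suppression: diagnostics below MinLevel are dropped.
void SetMinimumDiagLevel(CxxDiagLevel MinLevel);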

LIL by itself is rather complex and looks opaque from the end-user perspective. This makes providing good debugging hints to developers challenging. The layer should have logging capabilities including logging levels to facilitate this.

Modeling Semantics. Challenges

Every language defines rules according to its design principles. For successful languages, the principles resonate with specific communities and application domains, and different approaches, and therefore different languages, work better in different areas. An interoperability layer is a bridge between two languages as well as between their design principles.

The C++ type system has evolved over the years and has many performance-related aspects. For example, const qualification takes part in overload selection. That is, two member functions with the same name can return two different (covariant) types depending on whether the object they are called on is const-qualified. Some languages, such as Python, do not have the notion of const (or const qualification) and therefore cannot map both concepts well. For example, the standard C++ overload resolution rules may resolve to a const overload, but that will not be useful for Python. Therefore, the layer should provide a facility to select the non-const overload when mapping to languages such as Python which have no notion of constness.

Python does not always clearly distinguish between a lookup for reading and a lookup for writing. 1D array access such as x[0] is simple enough (__getitem__ vs. __setitem__), but 2D access such as x[0][0] = 0 is a __getitem__ followed by a __setitem__. Thus, in the second case, the first __getitem__ must bind to a non-const variant. This case can be handled by a relatively simple heuristic in LIL that selects the const overload only when there is nothing else, as the sketch below illustrates.
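
A minimal C++ sketch of such a const/non-const overload pair (the Grid and Row types are made up for illustration):

#include <cstddef>
#include <vector>

struct Row {
  std::vector<int> data;
  int& operator[](std::size_t i) { return data[i]; }              // non-const: allows writes
  const int& operator[](std::size_t i) const { return data[i]; }  // const: read-only access
};

struct Grid {
  std::vector<Row> rows;
  Row& operator[](std::size_t i) { return rows[i]; }
  const Row& operator[](std::size_t i) const { return rows[i]; }
};

// For the Python expression x[0][0] = 0, the first __getitem__ must bind to the
// non-const Grid::operator[]; otherwise the returned Row cannot be written to.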

Another example of a semantic concept that is hard to model is the type alias. For example, int8_t is typically a typedef of signed char. The compiler’s type desugaring will present LIL with signed char, but the semantic intent of this type is to behave like an integer. The layer should preserve the type sugar or allow type resugaring after template instantiation [14].
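
A small illustration of the problem (the Sample struct is made up; the static_assert holds on common platforms but is not guaranteed by the standard):

#include <cstdint>
#include <type_traits>

// On typical targets int8_t is an alias of signed char; desugaring keeps the
// representation but loses the "small integer" intent the alias expressed.
static_assert(std::is_same<std::int8_t, signed char>::value, "common-platform assumption");

struct Sample { std::int8_t flag; };
// A binder that only sees the desugared `signed char` is tempted to map `flag`
// to a one-character string in Python, whereas the intent is a small integer.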

In many cases the layer can successfully translate concepts, but there will always be cases where language design principles contradict each other. For example, the design principles governing temporary object lifetimes in Swift and C++ differ fundamentally, and full interoperability is not possible. C++ has very explicit reference semantics, including rvalue references (and ref-qualified member functions) and smart pointers. LIL will expose these to the language binding, as the mapping to the target language is expected to differ from one target language to another.

Proof of concept

Here we show a prototype that demonstrates the template instantiation process. The prototype is based on ctypes and introduces several simplifications for expository purposes. Its main goal is to show that template instantiation on demand is implementable. The full prototype code of LIL is available in Appendix A.

// Type aliases to make the core clearer.
typedef void* Decl_t;
typedef unsigned long FnAddr_t;
extern "C" {
  // Parses C++ input.
  void Clang_Parse(const char* Code);
  // Looks up a name in a given context.
  Decl_t Clang_LookupName(const char* Name, Decl_t Context = 0);
  // Creates an object of the given type and returns the allocated memory.
  void* Clang_CreateObject(Decl_t RecordDecl);
  // Instantiates a template within a given context
  Decl_t Clang_InstantiateTemplate(Decl_t Context, const char* Name, const char* Args);
  // Returns the address of the compiled function.
  FnAddr_t Clang_GetFunctionAddress(Decl_t D);
}

Listing 2.

The high-level goal of the demonstrator is to parse C++ input (via Clang_Parse), lazily find the C++ entities required for interoperability (via Clang_LookupName), create C++ objects at runtime (via Clang_CreateObject), instantiate templates (via Clang_InstantiateTemplate), and provide the low-level callable (via Clang_GetFunctionAddress).

# template_instantiate_demo.py
import ctypes

libInterop = ctypes.CDLL("./libInterOp.so")
# tell ctypes which function to call and what are the expected in/out types.
_cpp_compile = libInterop.Clang_Parse
_cpp_compile.argtypes = [ctypes.c_char_p]

def cpp_compile(arg):
	return _cpp_compile(arg.encode("ascii"))

# define some classes to play with
cpp_compile(r"""\
void* operator new(__SIZE_TYPE__, void* __p);
extern "C" int printf(const char*,...);
class A {};
class C {};
class B {
public:
	template<typename T, typename S, typename U>
	static void callme(T, S, U*) { printf(" callme in B! \n"); }
};
""")
...
# initialize our C++ interoperability layer wrapper
gIL = InterOpLayerWrapper()

if __name__ == '__main__':
 # create a couple of types to play with
  A = type('A', (), {
	'handle'  : gIL.get_scope('A'),
	'__new__' : cpp_allocate
  })
  h = gIL.get_scope('B')
  B = type('B', (A,), {
	'handle'  : h,
	'__new__' : cpp_allocate,
	'callme'  : TemplateWrapper(h, 'callme')
  })
  C = type('C', (), {
	'handle'  : gIL.get_scope('C'),
	'__new__' : cpp_allocate
  })
  # call templates
  a = A()
  b = B()
  c = C()

  # explicit template instantiation
  b.callme['A, int, C*'](a, 42, c)

  # implicit template instantiation
  b.callme(a, 42, c)


#Output: python3  template_instantiate_demo.py
# callme in B!
# callme in B!

Listing 3.

Listing 3 defines a Python function (cpp_compile) which processes C++ code that contains classes A, B and C, and a template named callme. Next, it creates Python wrapper objects for them and connects them via handles to LIL. Each Python operation, such as construction (via __init__) or calling (via __call__ or __getitem__), is tracked and connected to the C++ side to mutate the object states on both ends. For example, running A() sends a request to LIL to allocate memory for the C++ class A and construct it. The C++ objects are always allocated on the heap, as the native stack is set up and used by the Python interpreter and there is no foreseeable benefit in using stack-allocated C++ objects. The handle attribute is then associated with the corresponding C++ named entity. Class B has a template function callme which is managed by the TemplateWrapper object. That object is responsible for finding and instantiating the template with the requested types upon a function call.

The memory management assumes that the Python wrapper owns the C++ object; otherwise the C++ runtime would need to destroy the object and then send a notification back. The C++ object is deallocated when the reference count reaches zero if CPython is used, or upon the garbage collector’s instruction if PyPy is used. Relying on the host’s memory management fits naturally into the programming model either way.

Implementation Approaches

As C++ is a complex language with a complicated in-memory object layout, the full reflection information about the language entities is overwhelming and should be approached using a layered system. These layers can be roughly defined by:

  • What kind of entity is that? What is the name of that entity if any?
  • What is the data members list of this entity?
  • How can on-disk read and write information be generated for this C++ entity?

A flexible LIL should be able to abstract away unnecessary details and provide them only when the actual implementation requires them. Most of the C++ details are stored in a clang::Decl, which is too verbose for many use cases. Ultimately, the implementation should provide several abstraction layers (likely using the facade pattern).

Clang and LLVM make no promises of backward compatibility, which allows their implementations to evolve freely. This means that downstream tools (those which are not updated against every change in LLVM) have higher maintenance costs due to the migrations needed when switching to new versions. Even if LIL lives in the LLVM mainline, its users will most often be downstream tools. In order to reduce maintenance costs and make transitions to future versions of the layer easier, we aim for a design that offers greater resilience to ABI breakage. That is, its API surface should be small, and volatile parts should be passed as opaque types. LibClang has taken a similar approach by introducing CXCursor and has been praised for being relatively stable. Unfortunately, we cannot simply extend LibClang, as it uses a visitor-based mechanism to build the cursors, and that is not a very efficient operation.

Instead, we can wrap the clang::Decl in a facade class that hides the redundant detail. For instance:

enum EntityKind {
  NamedDecl, TagDecl, NamespaceDecl, // ...
};

using OpaqueCxxDecl = void*;

struct CxxEntity { // meant to expose parts of clang::Decl
  EntityKind Kind;
  std::string Name;
  std::string NameAsWritten;
  std::string Value;
  OpaqueCxxDecl Details; // pimpl: the underlying clang::Decl
};

std::string getQualifiedNameAsString(OpaqueCxxDecl CxxD);

Listing 4.

Listing 4 defines a lightweight wrapper entity, CxxEntity, which exposes information about the kind of the C++ entity, its eventual name and its value. The OpaqueCxxDecl member, Details, hides the implementation details using the pimpl pattern, aiming to provide a stable API.

In a similar way, Listing 5 shows the corresponding name lookup syntax.

using OpaqueCxxLookupResult = void*;
struct CxxLookupResult {
  std::vector<CxxEntity> Decls;
  OpaqueCxxLookupResult Details; // the underlying clang::LookupResult
};

CxxLookupResult R = cpp::lookup("std");
CxxEntity StdNamespace = R.Decls[0];
CxxLookupResult VecR = cpp::lookup("vector", StdNamespace);

...
void DiagnoseAmbiguousOverloads(OpaqueCxxLookupResult R, ...);

Listing 5.

Conclusion

The demand for cross-language interoperability has been increasing, especially in the field of data science. Different systems need to interoperate with C++ codebases to different extents. The straightforward approach of generating wrappers statically has proven to be reliable, but it is limited in the features it supports and requires a library author or maintainer to provide the bindings. Advancements in the area of C++ interpreters have enabled a more automatic approach to bindings without sacrificing performance.

This document outlines a path forward to automatic creation of bindings on demand, describing several theoretical and practical challenges. The proposed approach is supported by a prototype which allows Python to interoperate with C++ code, instantiate a template and execute it. The document explains how to improve the API stability of the language interoperability layer.

The implementation of a language interoperability layer is challenging, but it can serve as a baseline for C++ cross-language interoperability beyond just Python. A similar approach is being tested for D and Julia.

Acknowledgement

We would like to thank the people who contributed to this document through comments, suggestions and design reviews, in particular Wim Lavrijsen (LBL), Axel Naumann (CERN), David Lange (Princeton), Ioana Ifrim (Princeton), and Bernhard Manfred Gruber (CERN, CASUS, TU Dresden).

References

[1] Results of US Research Software Sustainability Institute (URSSI) programming language survey, private communication from Dan Katz (UIUC/NCSA), and https://spectrum.ieee.org/computing/software/the-2017-top-programming-languages (Visited August 2021).

[2] Beazley, David M. “SWIG: An Easy to Use Tool for Integrating Scripting Languages with C and C++.” Tcl/Tk Workshop. Vol. 43. 1996.

[3] SIP software home page, https://www.riverbankcomputing.com/software/sip/intro (Visited September 2021).

[4] Boost.Python Reference Manual (for software version 1.72), https://www.boost.org/doc/libs/1_65_1/libs/python/doc/html/reference/index.html (Visited August 2021).

[5] Pybind11 project homepage, http://pybind11.readthedocs.io/ (Visited August 2021).

[6] Vasilev, Vasil, et al. “Cling–the new interactive interpreter for ROOT 6.” Journal of Physics: Conference Series. Vol. 396. No. 5. IOP Publishing, 2012.

[7] Cxx.jl GitHub repository, https://github.com/JuliaInterop/Cxx.jl (Visited September 2021)

[8] Wim Lavrijsen, cppyy, presentation, https://compiler-research.org/meetings/#caas_02Sep2021 (Visited September 2021)

[9] Cppyy Philosophy, https://cppyy.readthedocs.io/en/latest/philosophy.html#run-time-v-s-compile-time (Visited August 2021)

[10] Clang: a C language family frontend for LLVM, https://clang.llvm.org/, (Visited August 2021)

[11] Alexandru Militaru, Calling C++ libraries from a D-written DSL: A cling/cppyy-based approach, presentation, https://compiler-research.org/meetings/#caas_04Feb2021, (Visited August 2021)

[12] Keno Fischer, A brief history of Cxx.jl, https://compiler-research.org/meetings/#caas_05Aug2021, (Visited August 2021)

[13] Template instantiation not happening in indirectly called function, https://bitbucket.org/wlav/cppyy/issues/369/template-instantiation-not-happening-in (Visited August 2021)

[14] Extend clang AST to provide information for the type as written in template instantiations, https://llvm.org/OpenProjects.html#clang-template-instantiation-sugar (Visited August 2021)

Appendix A

Complete code for InterpreterUtils.cpp.

#include "InterpreterUtils.h"

#include "clang/AST/Mangle.h"
#include "clang/Interpreter/Interpreter.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Sema/Lookup.h"
#include "clang/Sema/TemplateDeduction.h"

#include "llvm/Support/TargetSelect.h"

#include <memory>
#include <vector>
#include <sstream>

using namespace clang;

static std::unique_ptr<clang::Interpreter> CreateInterpreter() {
  std::vector<const char *> ClangArgv = {"-Xclang", "-emit-llvm-only"};
  auto CI = llvm::cantFail(IncrementalCompilerBuilder::create(ClangArgv));
  return llvm::cantFail(Interpreter::create(std::move(CI)));
}

struct LLVMInitRAII {
  LLVMInitRAII() {
	llvm::InitializeNativeTarget();
	llvm::InitializeNativeTargetAsmPrinter();
  }
  ~LLVMInitRAII() {llvm::llvm_shutdown();}
} LLVMInit;

auto Interp = CreateInterpreter().release();

static LookupResult LookupName(Sema &SemaRef, const char* Name) {
  ASTContext &C = SemaRef.getASTContext();
  DeclarationName DeclName = &C.Idents.get(Name);
  LookupResult R(SemaRef, DeclName, SourceLocation(), Sema::LookupOrdinaryName);
  SemaRef.LookupName(R, SemaRef.TUScope);
  assert(!R.empty());
  return R;
}

Decl_t Clang_LookupName(const char* Name, Decl_t Context /*=0*/) {
  return LookupName(Interp->getCompilerInstance()->getSema(), Name).getFoundDecl();
}

FnAddr_t Clang_GetFunctionAddress(Decl_t D) {
  clang::NamedDecl *ND = static_cast<clang::NamedDecl*>(D);
  clang::ASTContext &C = ND->getASTContext();
  std::unique_ptr<MangleContext> MangleC(C.createMangleContext());
  std::string mangledName;
  llvm::raw_string_ostream RawStr(mangledName);
  MangleC->mangleName(ND, RawStr);
  auto Addr = Interp->getSymbolAddress(RawStr.str(), /*IsMangled=*/true);
  if (!Addr)
	return 0;
  return *Addr;
}

void * Clang_CreateObject(Decl_t RecordDecl) {
  clang::TypeDecl *TD = static_cast<clang::TypeDecl*>(RecordDecl);
  std::string Name = TD->getQualifiedNameAsString();
  const clang::Type *RDTy = TD->getTypeForDecl();
  clang::ASTContext &C = Interp->getCompilerInstance()->getASTContext();
  size_t size = C.getTypeSizeInChars(RDTy).getQuantity(); // size in bytes (getTypeSize returns bits)
  void * loc = malloc(size);

  // Tell the interpreter to call the default ctor with this memory. Synthesize:
  // new (loc) ClassName;
  static unsigned counter = 0;
  std::stringstream ss;
  ss << "auto _v" << counter++ << " = " << "new ((void*)" << loc << ")" << Name << "();";

  // ParseAndExecute returns an llvm::Error which evaluates to true on failure.
  if (llvm::Error Err = Interp->ParseAndExecute(ss.str())) {
    llvm::consumeError(std::move(Err));
    return nullptr;
  }

  return loc;
}


void Clang_Parse(const char* Code) {
  llvm::cantFail(Interp->Parse(Code));
}

/// auto f = &B::callme<A, int, C*>;
Decl_t Clang_InstantiateTemplate(Decl_t Scope, const char* Name, const char* Args) {
  static unsigned counter = 0;
  std::stringstream ss;
  NamedDecl *ND = static_cast<NamedDecl*>(Scope);
  // Args is empty.
  // FIXME: Here we should call Sema::DeduceTemplateArguments (for fn addr) and
  // extend it such that if the substitution is unsuccessful to get out the list
  // of failed candidates, eg TemplateSpecCandidateSet.
  ss << "auto _t" << counter++ << " = &" << ND->getNameAsString() << "::"
 	<< Name;
  llvm::StringRef ArgList = Args;
  if (!ArgList.empty())
	ss << '<' << Args << '>';
  ss  << ';';
  auto PTU1 = &llvm::cantFail(Interp->Parse(ss.str()));
  llvm::cantFail(Interp->Execute(*PTU1));

  //PTU1->TUPart->dump();

  VarDecl *VD = static_cast<VarDecl*>(*PTU1->TUPart->decls_begin());
  UnaryOperator *UO = llvm::cast<UnaryOperator>(VD->getInit());
  return llvm::cast<DeclRefExpr>(UO->getSubExpr())->getDecl();
}

Complete code for template_instantiate_demo.py

import ctypes

libInterop = ctypes.CDLL("./libInterOp.so")
_cpp_compile = libInterop.Clang_Parse
_cpp_compile.restype = ctypes.c_int
_cpp_compile.argtypes = [ctypes.c_char_p]

def cpp_compile(arg):
	return _cpp_compile(arg.encode("ascii"))

# define some classes to play with
cpp_compile(r"""\
void* operator new(__SIZE_TYPE__, void* __p);
extern "C" int printf(const char*,...);
class A {};
class C {};
class B {
public:
	template<typename T, typename S, typename U>
	static void callme(T, S, U*) { printf(" callme in B! \n"); }
};
""")

class InterOpLayerWrapper:
  # Responsible for providing a python wrapper over the interop layer.
  _get_scope = libInterop.Clang_LookupName
  _get_scope.restype = ctypes.c_size_t
  _get_scope.argtypes = [ctypes.c_char_p]

  _construct = libInterop.Clang_CreateObject
  _construct.restype = ctypes.c_void_p
  _construct.argtypes = [ctypes.c_size_t]

  _get_template_ct = libInterop.Clang_InstantiateTemplate
  _get_template_ct.restype = ctypes.c_size_t
  _get_template_ct.argtypes = [ctypes.c_size_t, ctypes.c_char_p, ctypes.c_char_p]

  def _get_template(self, scope, name, args):
    return self._get_template_ct(scope, name.encode("ascii"), args.encode("ascii"))

  def get_scope(self, name):
    return self._get_scope(name.encode("ascii"))

  def get_template(self, scope, name, tmpl_args = [], tpargs = []):
    if tmpl_args:
      # Instantiation is explicit from full name
      full_name = name + '<' + ', '.join([a for a in tmpl_args]) + '>'
      meth = self._get_template(scope, full_name, '')
    elif tpargs:
      # Instantiation is implicit from argument types
      meth = self._get_template(scope, name, ', '.join([a.__name__ for a in tpargs]))
    return CallCPPFunc(meth)

  def construct(self, cpptype):
    return self._construct(cpptype)

class TemplateWrapper:
  # Responsible for finding a template which matches the arguments.
  def __init__(self, scope, name):
    self._scope = scope
    self._name  = name

  def __getitem__(self, *args, **kwds):
    # Look up the template and return the overload.
    return gIL.get_template(
      self._scope, self._name, tmpl_args = args)

  def __call__(self, *args, **kwds):
    # Keyword arguments are not supported for this demo.
    assert not kwds

    # Construct the template arguments from the types and find the overload.
    ol = gIL.get_template(
      self._scope, self._name, tpargs = [type(a) for a in args])

    # Call actual method.
    ol(*args, **kwds)


class CallCPPFunc:
  # Responsible for calling low-level function pointers.
  _get_funcptr = libInterop.Clang_GetFunctionAddress
  _get_funcptr.restype = ctypes.c_void_p
  _get_funcptr.argtypes = [ctypes.c_size_t]

  def __init__(self, func):
    # In real life this would normally go through the interop layer to know
    # whether to pass pointer, reference, or value of which type etc.
    proto = ctypes.CFUNCTYPE(None, ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p)
    self._funcptr = proto(self._get_funcptr(func))

  def __call__(self, *args, **kwds):
    # See the comment above.
    a0 = ctypes.cast(args[0].cppobj, ctypes.POINTER(ctypes.c_void_p))
    a1 = args[1]
    a2 = args[2].cppobj
    return self._funcptr(a0, a1, a2)

gIL = InterOpLayerWrapper()

def cpp_allocate(proxy):
  pyobj = object.__new__(proxy)
  proxy.__init__(pyobj)
  pyobj.cppobj = gIL.construct(proxy.handle)
  return pyobj


if __name__ == '__main__':
  # create a couple of types to play with
  A = type('A', (), {
	'handle'  : gIL.get_scope('A'),
	'__new__' : cpp_allocate
  })
  h = gIL.get_scope('B')
  B = type('B', (A,), {
	'handle'  : h,
	'__new__' : cpp_allocate,
	'callme'  : TemplateWrapper(h, 'callme')
  })
  C = type('C', (), {
	'handle'  : gIL.get_scope('C'),
	'__new__' : cpp_allocate
  })

  # call templates
  a = A()
  b = B()
  c = C()

  # explicit template instantiation
  b.callme['A, int, C*'](a, 42, c)

  # implicit template instantiation
  b.callme(a, 42, c)