Enhance and Develop GeneROOT Infrastructure
Introduction
My name is Jeffrey Zhang, and I’m a third-year B.S. undergraduate student studying Physics at Nagoya University, Japan. I’ll be working on extending the GeneROOT infrastructure, building directly on the foundation laid by Aditya Pandey during his GSoC 2025 work on using ROOT in genome sequencing.
Mentors: Martin Vassilev, Vassil Vassilev, Aaron Jomy
Overview
Large-scale biological data, such as a fully sequenced human genome, typically occupies $\sim$500 GB. Analyzing such datasets for research involves data volumes that exceed petabytes. Handling data at this scale requires a highly robust underlying software infrastructure. To meet this challenge, the GeneROOT project draws on CERN’s extensive expertise in managing massive physics datasets through its columnar-based ROOT software framework. The GeneROOT project aims to adapt this framework specifically for processing biological data.
During the 2025 GeneROOT GSoC project, Aditya Pandey established the RNTuple data model for genome sequences. It currently supports region queries, conversion from SAM to RNTuple, and a benchmark comparison against the industry-standard CRAM format on a single test sample HG00154 from the 1000 Genomes Project.
However, the results reveal several limitations. The benchmark suite relies on a single low-coverage sample with hard-coded file paths, which is insufficient for a credible comparison with tools such as SAMtools and CRAM. In terms of performance, RNTuple’s index lookup itself performs a linear scan that does not scale to production-sized datasets. In terms of functionality, RAMtools cannot currently export records back to SAM, has no merge operation to complement the chromosome splitter, no sort, and no statistics tools. These gaps leave RAMtools as a proof of concept rather than a usable pipeline component.
My project builds on that foundation by expanding the benchmark suite, optimizing indexing, evaluating compression algorithms, and bringing more SAMtools features to RAMtools.
Technical Implementation
The work breaks into five tasks:
- Benchmark on heavy bioinformatics datasets. Refactor the benchmark suite, replace hard-coded paths with a `benchmark_config.h` and CLI-driven dataset selection, run against well-known reference samples (HG001–HG007), and capture more metrics such as memory usage in addition to timing metrics.
- Cross-format comparison. Extend the `system()`-call approach already used in `chromosome_split_benchmark.cxx` so all benchmark scripts measure SAM, BAM, and CRAM against RAMtools/RNTuple on the same datasets.
- Genomic compression algorithms. Evaluate modern quality-score compression schemes (Crumble, QVZ, CALQ, P-block), extend the `EQualCompressionBits` enum in `RAMNTupleRecord.h`, and add the most effective candidates as new quality policies.
- Indexing and search optimizations. `GetRowsInRange()` currently does an O(N) linear scan; I’ll replace it with an O(log N) binary search over a sorted `fIndex` (eliminating the redundant `fIndexMap`/`RebuildMap` pair), expose `kPositionInterval` and `kMappedInterval` as configurable parameters, and implement a no-index columnar query fallback in `RAMNTupleView.cxx` similar to the legacy TTree `ramview_no_index.cxx`.
- Add common SAMtools features to RAMtools. Add `ramntuplestats`, `ramntupleidxstat`, and `ramntupleflagstat`; complete `ramntupleview` with N-record, region-filtering, and selective-column output; and add `ramntuplesplit`, `ramntuplemerge`, and `ramntuplesort`.
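To make the indexing task more concrete, here is a minimal sketch of a binary-search region lookup over a sorted, sparse index. It is not the actual RAMtools code: `IndexEntry`, `EntryLess`, and `FindRowRange` are hypothetical names, and the layout (one sampled entry per position interval, sorted by reference id and position, with an entry at the first record of each reference) is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical sparse index entry: one sampled record per position interval.
struct IndexEntry {
    std::int32_t refId;  // reference sequence (chromosome) id
    std::int32_t pos;    // leftmost position of the sampled record
    std::uint64_t row;   // row number of that record in the dataset
};

// Ordering used both to sort the index and to search it.
inline bool EntryLess(const IndexEntry& a, const IndexEntry& b) {
    return std::tie(a.refId, a.pos) < std::tie(b.refId, b.pos);
}

// Returns a half-open row range [first, second) that contains every record
// on refId whose start position lies in [start, end]. The index must be
// sorted with EntryLess. An empty range (first == second) means no match.
inline std::pair<std::uint64_t, std::uint64_t>
FindRowRange(const std::vector<IndexEntry>& index, std::int32_t refId,
             std::int32_t start, std::int32_t end, std::uint64_t totalRows) {
    assert(start <= end);

    // First entry strictly after (refId, start); O(log N).
    auto first = std::upper_bound(index.begin(), index.end(),
                                  IndexEntry{refId, start, 0}, EntryLess);
    // Step back to the entry at or before start, if it is on the same reference.
    if (first != index.begin() && std::prev(first)->refId == refId) --first;
    if (first == index.end() || first->refId != refId) return {0, 0};

    // First entry strictly after (refId, end) bounds the scan from above.
    auto last = std::upper_bound(first, index.end(),
                                 IndexEntry{refId, end, 0}, EntryLess);
    const std::uint64_t lastRow = (last == index.end()) ? totalRows : last->row;
    return {first->row, lastRow};
}
```

The returned range is a candidate window: rows inside it still need a per-record overlap check, and reads that start before `start` but extend into the region would need extra handling (for example, widening the window by a maximum read span), which this sketch leaves out.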
Goals
By the end of the coding period I aim to have:
- A reproducible benchmark suite that runs against multiple genomic datasets with custom commands and outputs.
- Quantitative cross-format comparisons against SAM, BAM, and CRAM.
- A measurable storage-efficiency improvement on `QUAL` data using modern compression algorithms.
- Faster region queries that can scale better to production-sized datasets.
- RAMtools at feature parity with some of the commonly used functionalities in SAMtools, such as `Stat` and `View`.
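As a rough illustration of what a flagstat-style statistic involves, the sketch below counts a few categories from a column of SAM FLAG values. The bit values come from the SAM specification; `FlagCounts` and `CountFlags` are hypothetical names, the input is assumed to be the FLAG column already read into memory, and the categories are simplified rather than a reproduction of the exact `samtools flagstat` output.

```cpp
#include <cstdint>
#include <vector>

// SAM FLAG bits, as defined in the SAM specification.
constexpr std::uint16_t kPaired        = 0x1;
constexpr std::uint16_t kProperPair    = 0x2;
constexpr std::uint16_t kUnmapped      = 0x4;
constexpr std::uint16_t kSecondary     = 0x100;
constexpr std::uint16_t kDuplicate     = 0x400;
constexpr std::uint16_t kSupplementary = 0x800;

// Hypothetical result type for a simplified flag summary.
struct FlagCounts {
    std::uint64_t total = 0;
    std::uint64_t mapped = 0;
    std::uint64_t paired = 0;
    std::uint64_t properPair = 0;
    std::uint64_t secondary = 0;
    std::uint64_t supplementary = 0;
    std::uint64_t duplicates = 0;
};

// Single pass over the FLAG column; no other record field is touched.
inline FlagCounts CountFlags(const std::vector<std::uint16_t>& flags) {
    FlagCounts c;
    for (std::uint16_t f : flags) {
        ++c.total;
        if (!(f & kUnmapped))   ++c.mapped;
        if (f & kPaired)        ++c.paired;
        if (f & kProperPair)    ++c.properPair;
        if (f & kSecondary)     ++c.secondary;
        if (f & kSupplementary) ++c.supplementary;
        if (f & kDuplicate)     ++c.duplicates;
    }
    return c;
}
```

Because only the FLAG values are needed, a columnar layout can in principle answer this kind of query by reading a single column instead of decoding whole records.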
The combined effect should move RAMtools from a working proof of concept toward a more usable component of a real genomics pipeline.