Small Scale project, 2020 Hamming distance
Scenario
Researcher
Andreas Ott, Principal Investigator and Privatdozent, Institute for Mathematics
Initial problem
- Bottleneck in an analysis pipeline of SARS-CoV-2 genome data
- Calculation of hamming distances from a large set of gene expressions
- This pre-processing step takes many hours to complete even using many cores
- Prevented them from analysing larger datasets
Outcome
What we did
- Replace Python implementation with a highly optimized, vectorized and parallelized C++ version
- Package as a python library hammingdist to fit into the existing analysis pipeline
- Add vectorized SSE2/AVX/AVX512 implementations with automatic runtime dispatch based on detected CPU
- Provide pre-compiled binary wheels on PyPI for convenient installation
- Set up Continuous Integration and Continuous Delivery to provide automated testing and deployment of the code
Results
- Code now runs in minutes on a single core with original dataset
- Code now scales using HPC resources to process hundreds of thousands of gene expressions
- Publication: arXiv:2207.03394 [math.AT] MuRiT: Efficient Computation of Pathwise Persistence Barcodes in Multi-Filtered Flag Complexes via Vietoris-Rips Transformations
- Publication: arXiv:2106.07292 [q-bio.PE] Topological data analysis identifies emerging adaptive mutations in SARS-CoV-2
- Pipeline: CoVtRec Topological Surveillance of Recurrent Mutations in SARS-CoV-2