Small Scale project, 2020 Jove Matrix Performance
Scenario
Researcher
Benjamin Thomitzni, PhD Student in the Theoretical and Computational Chemistry Group
Initial Problem
- Matrix multiplication code has poor single-core performance
- Performance does not scale beyond 4 threads
Outcome
What we did
- Exploit symmetry to reduce required operations
- Change data layout and re-order loops to avoid cache misses
- Use a specialized linear algebra library to generate optimized code
- Add thread-safe and performant parallelism using OpenMP
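As a rough illustration of how three of these steps can fit together (symmetry, cache-friendly data layout, and thread-safe OpenMP parallelism), here is a minimal sketch. It is not the project's code: the function name `gram_matrix`, the choice of the product C = A * A^T as the symmetric case, and the flat row-major storage are all assumptions made for the example. The library-based code generation step is not shown.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example, not the project's code: computes C = A * A^T for a
// row-major n x m matrix A stored in a flat vector. Because C is symmetric,
// only the upper triangle is computed and each value is mirrored.
std::vector<double> gram_matrix(const std::vector<double>& a,
                                std::size_t n, std::size_t m) {
    std::vector<double> c(n * n, 0.0);
    const double* in = a.data();
    double* out = c.data();
    const std::ptrdiff_t rows = static_cast<std::ptrdiff_t>(n);

    // Each (i, j) pair with j >= i is handled by exactly one iteration of the
    // outer loop, so no two threads write the same element of C.
    // schedule(dynamic) balances the triangular workload: rows with a small
    // i have more columns to compute than rows with a large i.
    #pragma omp parallel for schedule(dynamic)
    for (std::ptrdiff_t si = 0; si < rows; ++si) {
        const std::size_t i = static_cast<std::size_t>(si);
        const double* row_i = in + i * m;
        for (std::size_t j = i; j < n; ++j) {
            const double* row_j = in + j * m;
            // Both operands are read along contiguous rows, which keeps the
            // inner loop cache friendly and easy for the compiler to vectorize.
            double sum = 0.0;
            for (std::size_t k = 0; k < m; ++k) {
                sum += row_i[k] * row_j[k];
            }
            out[i * n + j] = sum;  // upper triangle
            out[j * n + i] = sum;  // mirrored lower triangle
        }
    }
    return c;
}
```

Compiled without OpenMP support, the pragma is ignored and the function runs serially with the same result. Skipping the lower triangle roughly halves the number of multiply-add operations compared with a full n x n product.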
Results
- Optimized code runs more than 4x faster on a single core with the small test dataset
- With a larger dataset, the single-core speedup increases to 8x
- Near-perfect parallel scaling on a 12-core machine with the small test dataset
- Near-perfect parallel scaling on a 56-core machine with the larger dataset