Small Scale Project, 2020: Jove Matrix Performance

Scenario

Researcher

Benjamin Thomitzni, PhD Student in the Theoretical and Computational Chemistry Group

Initial Problem

  • Matrix multiplication code has poor single-core performance
  • Performance also does not scale beyond 4 threads

Outcome

What we did

  • Exploit symmetry to reduce required operations
  • Change data layout and re-order loops to avoid cache misses
  • Use a specialized linear algebra library to generate optimized code
  • Add thread-safe and performant parallelism using OpenMP
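
As an illustration of how these steps can fit together, here is a minimal sketch in C. It is not the project's actual code. It assumes, purely for the example, that the product has the form C = A * A^T, so that the result is symmetric and only the upper triangle needs to be computed. The function name gram_upper_mirrored, the row-major layout and the choice of a dynamic schedule are all assumptions made for this sketch. The library-based code generation step is not shown, since the summary does not name the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the project): computes C = A * A^T for a
   row-major n x m matrix A. C is n x n, row-major. */
void gram_upper_mirrored(const double *A, double *C, size_t n, size_t m)
{
    assert(A != NULL && C != NULL);

    /* Each (i, j) pair with j >= i is handled by exactly one iteration of
       the outer loop, so no two threads write the same element of C.
       Row i computes n - i entries, so the work per row is uneven; a
       dynamic schedule is one way to balance it (an assumption here).
       Without OpenMP enabled, the pragma is ignored and the loop runs
       serially. */
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < (long)n; ++i) {
        const double *row_i = A + (size_t)i * m;

        /* Symmetry: C[i][j] == C[j][i], so start at j = i. */
        for (size_t j = (size_t)i; j < n; ++j) {
            const double *row_j = A + j * m;
            double sum = 0.0;

            /* Both rows are contiguous in memory, so the inner loop
               walks them with unit stride. */
            for (size_t k = 0; k < m; ++k)
                sum += row_i[k] * row_j[k];

            C[(size_t)i * n + j] = sum;  /* upper triangle */
            C[j * n + (size_t)i] = sum;  /* mirrored lower triangle */
        }
    }
}
```

With GCC, a file containing this function can be compiled with, for example, gcc -O2 -fopenmp. Skipping the lower triangle roughly halves the number of dot products compared with computing every entry of C.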

Results

  • Optimized code runs more than 4x faster on a single core with a small test dataset
  • With a larger dataset, the single-core speedup increases to 8x
  • Near-perfect parallel scaling on a 12-core machine with the small test dataset
  • Near-perfect parallel scaling on a 56-core machine with the larger dataset