Today’s high-end multicore systems are characterized by a deep memory hierarchy, i.e., several levels of local and shared caches, with limited size and bandwidth per core. The ever-increasing gap between processor and memory speed will further exacerbate this problem and has led the scientific community to revisit numerical software implementations so that they better suit the underlying memory subsystem, both for performance (data reuse) and for energy efficiency (data locality). The authors propose a novel multithreaded wavefront diamond blocking (MWD) implementation in the context of stencil computations, which are the core operation for seismic imaging in the oil industry. The diamond formulation of the stencil introduces temporal blocking for high data reuse in the upper cache levels. The wavefront optimization technique ensures data locality by allowing multiple threads to share common adjacent stencil points. MWD thus addresses the aforementioned challenges by alleviating the cache size limitation and relieving pressure on the memory bandwidth. Performance comparisons against the optimized standard 25-point stencil seismic imaging scheme, which uses spatial and temporal blocking, demonstrate the effectiveness of MWD.
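The temporal-blocking idea behind the wavefront component can be illustrated with a minimal sketch. It is not the authors' MWD implementation: it assumes a 1D 3-point averaging stencil with fixed boundary values instead of the 3D 25-point seismic stencil, it is single-threaded, and it omits diamond tiling entirely. The function names `naive_sweeps` and `wavefront_sweeps` are hypothetical. The sketch only shows that the updates can be reordered along diagonals of the space-time plane, so that a point is advanced through several time steps while its neighbours were touched recently, without changing the result.

```python
def naive_sweeps(u0, steps):
    """Reference ordering: one full pass over the grid per time step."""
    a, b = list(u0), list(u0)
    for _ in range(steps):
        for i in range(1, len(a) - 1):
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
        a, b = b, a  # b now holds the older time level
    return a


def wavefront_sweeps(u0, steps):
    """Same updates, traversed along diagonal wavefronts w = i + t.

    buf[t % 2] holds time level t. Visiting t in increasing order on each
    wavefront guarantees that the three inputs at level t - 1 are already
    computed and not yet overwritten by level t + 1.
    """
    n = len(u0)
    buf = [list(u0), list(u0)]  # boundary cells stay fixed in both buffers
    for w in range(2, n - 1 + steps):       # wavefront index in space-time
        for t in range(1, steps + 1):       # time level along the wavefront
            i = w - t                       # grid point on this wavefront
            if 1 <= i <= n - 2:
                src, dst = buf[(t - 1) % 2], buf[t % 2]
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return buf[steps % 2]


if __name__ == "__main__":
    u0 = [0.0] * 8 + [1.0] + [0.0] * 8
    same = wavefront_sweeps(u0, 6) == naive_sweeps(u0, 6)
    print("orderings agree:", same)
```

Both orderings perform the same arithmetic on each point, so they produce the same values. In a real implementation the benefit comes from the reordered traversal keeping the working set in cache across time steps, which this toy example does not measure.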




