Today, the widest vector units on a mass-produced processor are in the Intel Xeon Phi coprocessor, with its 512-bit vector registers. Each register holds 16 single-precision values, so these units offer a theoretical peak speedup of 16x over scalar code for single-precision floating-point operations. In practice, limiting factors such as memory access latency, I/O demand, serial code sections, and global synchronization typically make the real performance improvement much lower.
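The arithmetic behind the 16x figure, and the effect of serial code sections on it, can be illustrated with a short C++ sketch. The helper names `simd_lanes` and `bounded_speedup` are hypothetical and are not taken from the work described here. The second function applies Amdahl's law as a simple model and ignores memory latency, I/O, and synchronization.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Number of elements of type T that fit in one vector register
// of the given width in bits (hypothetical helper for illustration).
template <typename T>
constexpr std::size_t simd_lanes(std::size_t register_bits) {
    return register_bits / (sizeof(T) * CHAR_BIT);
}

// Amdahl-style upper bound on overall speedup when only a fraction of
// the scalar run time is vectorized across the given number of lanes.
// This is a simplified model, not a measurement.
inline double bounded_speedup(double vector_fraction, std::size_t lanes) {
    assert(vector_fraction >= 0.0 && vector_fraction <= 1.0);
    assert(lanes > 0);
    const double serial_part = 1.0 - vector_fraction;
    const double vector_part = vector_fraction / static_cast<double>(lanes);
    return 1.0 / (serial_part + vector_part);
}
```

Under this model, `simd_lanes<float>(512)` gives 16 lanes, and a code that spends 80% of its scalar run time in vectorizable loops is bounded at about 4x rather than 16x.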

In this work, we present a solution that takes advantage of vector units across various processor SIMD architectures with a single, portable source code. It adds a vector type and hardware intrinsics support to the C/C++ language through a header file that is compatible with gcc and with commercially available compilers in general. Different hardware and compiler feature sets are hidden behind a common, portable programming syntax. In addition, the implementation provides a scalar backend as a fallback for unknown architectures.
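The general idea of a header-only vector type with a scalar fallback might look like the following minimal C++ sketch. It is not the authors' header: the names `vfloat`, `splat`, and `saxpy` are hypothetical, and only an SSE backend, a NEON backend, and a scalar backend are shown, selected by standard compiler macros. Wider backends such as AVX or AVX-512 would presumably follow the same pattern.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
// SSE backend: four single-precision lanes per register.
struct vfloat {
    static constexpr std::size_t width = 4;
    __m128 v;
    static vfloat load(const float* p) { return {_mm_loadu_ps(p)}; }
    static vfloat splat(float x) { return {_mm_set1_ps(x)}; }
    void store(float* p) const { _mm_storeu_ps(p, v); }
    friend vfloat operator+(vfloat a, vfloat b) { return {_mm_add_ps(a.v, b.v)}; }
    friend vfloat operator*(vfloat a, vfloat b) { return {_mm_mul_ps(a.v, b.v)}; }
};
#elif defined(__ARM_NEON)
#include <arm_neon.h>
// NEON backend: four single-precision lanes per register.
struct vfloat {
    static constexpr std::size_t width = 4;
    float32x4_t v;
    static vfloat load(const float* p) { return {vld1q_f32(p)}; }
    static vfloat splat(float x) { return {vdupq_n_f32(x)}; }
    void store(float* p) const { vst1q_f32(p, v); }
    friend vfloat operator+(vfloat a, vfloat b) { return {vaddq_f32(a.v, b.v)}; }
    friend vfloat operator*(vfloat a, vfloat b) { return {vmulq_f32(a.v, b.v)}; }
};
#else
// Scalar backend: one lane, used when no known SIMD extension is detected.
struct vfloat {
    static constexpr std::size_t width = 1;
    float v;
    static vfloat load(const float* p) { return {*p}; }
    static vfloat splat(float x) { return {x}; }
    void store(float* p) const { *p = v; }
    friend vfloat operator+(vfloat a, vfloat b) { return {a.v + b.v}; }
    friend vfloat operator*(vfloat a, vfloat b) { return {a.v * b.v}; }
};
#endif

// Kernel written once against the portable type: y = a * x + y.
inline void saxpy(float a, const float* x, float* y, std::size_t n) {
    const vfloat va = vfloat::splat(a);
    std::size_t i = 0;
    // Main loop: process vfloat::width elements per iteration.
    for (; i + vfloat::width <= n; i += vfloat::width) {
        (va * vfloat::load(x + i) + vfloat::load(y + i)).store(y + i);
    }
    // Remainder loop: handle leftover elements one at a time.
    for (; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The kernel source stays the same on every target; only the `vfloat` definition chosen by the preprocessor changes. Unaligned loads and stores are used here to keep the sketch simple.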

This implementation has been successfully demonstrated on multiple SIMD architectures, including Intel SSE/AVX/AVX-512/IMCI, ARM NEON, and IBM Power VSX. In each case, a single common header file was enough for the compiler to generate highly optimized code with the proper SIMD instructions for the underlying architecture.




