
Abstract

Summary

Today, the widest vector units on a mass-produced processor are found in the Intel Xeon Phi coprocessor, with its 512-bit vector registers. Each register holds 16 single-precision values, so these vector units offer a theoretical single-precision peak performance gain of 16x over scalar code for single-flop operations. In practice, limiting factors such as memory access latency, I/O demand, serial code sections, and global synchronization typically keep the real performance improvement much lower.
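The gap between the 16x peak and the realized gain can be illustrated with a small calculation. The sketch below is not from the paper: it derives the lane count from the register width and applies Amdahl's law, which models only the effect of serial (non-vectorized) code sections and ignores memory latency, I/O, and synchronization. The function names and the 90% vectorized fraction used in the note after the code are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements ("lanes") that fit in one vector register.
constexpr std::size_t simd_lanes(std::size_t register_bits,
                                 std::size_t element_bits) {
    return register_bits / element_bits;
}

// Amdahl's law: overall speedup when only `vector_fraction` of the scalar
// run time benefits from a `lanes`-wide vector unit.
// `vector_fraction` is a hypothetical workload property, not measured data.
inline double amdahl_speedup(double vector_fraction, double lanes) {
    assert(vector_fraction >= 0.0 && vector_fraction <= 1.0);
    assert(lanes >= 1.0);
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / lanes);
}

// A 512-bit register holds 16 single-precision (32-bit) values.
static_assert(simd_lanes(512, 32) == 16, "expected 16 float lanes");
```

Under these assumptions, a code whose run time is 90% vectorizable would gain roughly 6.4x from 16 lanes (`amdahl_speedup(0.9, 16.0)`), well short of the 16x peak, before any memory or synchronization effects are counted.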

In this work, we present a solution that takes advantage of vector units across various processor SIMD architectures from a single, portable source code. It does so by adding a vector type and hardware intrinsics support to the C/C++ language through a header file that is compatible with gcc and with commercially available compilers in general. Differing hardware and compiler feature sets are hidden under a common, portable programming syntax. In addition, the implementation provides a scalar backend as a fallback for unknown architectures.
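The general idea of such a header can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual header: the namespace `pv`, the type `vfloat`, the constant `width`, and the functions `load`, `store`, `splat`, `add`, `mul`, and `axpy` are all hypothetical names. Only AVX, SSE, and NEON backends plus the scalar fallback are shown; AVX-512, IMCI, and VSX are omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>

// Pick the intrinsics header from compiler-defined feature macros.
#if defined(__AVX__)
#include <immintrin.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

namespace pv {  // hypothetical namespace for this sketch

#if defined(__AVX__)
struct vfloat { __m256 v; };
constexpr std::size_t width = 8;
inline vfloat load(const float* p)     { return vfloat{_mm256_loadu_ps(p)}; }
inline void   store(float* p, vfloat a){ _mm256_storeu_ps(p, a.v); }
inline vfloat splat(float s)           { return vfloat{_mm256_set1_ps(s)}; }
inline vfloat add(vfloat a, vfloat b)  { return vfloat{_mm256_add_ps(a.v, b.v)}; }
inline vfloat mul(vfloat a, vfloat b)  { return vfloat{_mm256_mul_ps(a.v, b.v)}; }
#elif defined(__SSE__)
struct vfloat { __m128 v; };
constexpr std::size_t width = 4;
inline vfloat load(const float* p)     { return vfloat{_mm_loadu_ps(p)}; }
inline void   store(float* p, vfloat a){ _mm_storeu_ps(p, a.v); }
inline vfloat splat(float s)           { return vfloat{_mm_set1_ps(s)}; }
inline vfloat add(vfloat a, vfloat b)  { return vfloat{_mm_add_ps(a.v, b.v)}; }
inline vfloat mul(vfloat a, vfloat b)  { return vfloat{_mm_mul_ps(a.v, b.v)}; }
#elif defined(__ARM_NEON)
struct vfloat { float32x4_t v; };
constexpr std::size_t width = 4;
inline vfloat load(const float* p)     { return vfloat{vld1q_f32(p)}; }
inline void   store(float* p, vfloat a){ vst1q_f32(p, a.v); }
inline vfloat splat(float s)           { return vfloat{vdupq_n_f32(s)}; }
inline vfloat add(vfloat a, vfloat b)  { return vfloat{vaddq_f32(a.v, b.v)}; }
inline vfloat mul(vfloat a, vfloat b)  { return vfloat{vmulq_f32(a.v, b.v)}; }
#else
// Scalar backend: same interface, one element per "vector".
struct vfloat { float v; };
constexpr std::size_t width = 1;
inline vfloat load(const float* p)     { return vfloat{*p}; }
inline void   store(float* p, vfloat a){ *p = a.v; }
inline vfloat splat(float s)           { return vfloat{s}; }
inline vfloat add(vfloat a, vfloat b)  { return vfloat{a.v + b.v}; }
inline vfloat mul(vfloat a, vfloat b)  { return vfloat{a.v * b.v}; }
#endif

// Example kernel written once against the portable interface:
// y[i] = a * x[i] + y[i] for i in [0, n).
inline void axpy(float a, const float* x, float* y, std::size_t n) {
    assert(x != nullptr && y != nullptr);
    const vfloat va = splat(a);
    std::size_t i = 0;
    for (; i + width <= n; i += width) {
        store(y + i, add(mul(va, load(x + i)), load(y + i)));
    }
    for (; i < n; ++i) {  // remainder elements that do not fill a vector
        y[i] = a * x[i] + y[i];
    }
}

}  // namespace pv
```

The kernel source never names an instruction set. Compiling the same file with, for example, `-mavx` on x86 or for an ARM target with NEON selects a different backend at compile time, and a compiler that defines none of the tested macros falls back to the scalar path.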

This implementation has been successfully demonstrated on multiple SIMD architectures, including Intel SSE/AVX/AVX-512/IMCI, ARM NEON, and IBM Power VSX. In each case, a single common header file is enough for the compiler to generate highly optimized code with the proper SIMD instructions for the underlying architecture.

/content/papers/10.3997/2214-4609.201414038
2015-11-16
2020-04-01
