Complex computer systems exhibit so many different characteristics that the best parameter choice becomes impossible to define. The range of parameters affecting performance is too large to be explored by simple trial and error once manual tuning techniques, the influence of domain decomposition, compiler capabilities and hardware effects are all taken into account. Auto-tuning therefore appears as an elegant solution: it can optimize source code before compilation, at compile time through different compiler flags, or at run time by tuning the input parameters. Starting from the basic implementation of a 3D finite-difference kernel, we first describe the methodology for estimating the best performance an algorithm can deliver. To get close to this theoretically achievable performance, we present several tuning steps, from the basic version up to a full intrinsics implementation, that improve parallelism, vectorization and data locality. Then, to find the best set of parameters, we introduce an auto-tuning methodology based on a genetic algorithm search. We are able to optimize for cache-blocking sizes, domain decomposition shapes, prefetching flags or even power consumption, among others. From the unoptimized to the most optimized version, we achieved a performance improvement of more than 6x on the E5-2697v2 and almost 30x on Xeon Phi.
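
To make the idea of a genetic algorithm search over tuning parameters concrete, the following is a minimal sketch in Python, not the implementation used in this work. The search space (`block_x`, `block_y`, `block_z`), the `toy_cost` function and all numeric settings are hypothetical placeholders; a real tuner would replace `toy_cost` with a measured kernel run time for each candidate configuration.

```python
import random

# Hypothetical search space: candidate cache-blocking sizes per dimension.
SEARCH_SPACE = {
    "block_x": [16, 32, 64, 128, 256],
    "block_y": [1, 2, 4, 8, 16, 32],
    "block_z": [1, 2, 4, 8, 16, 32],
}


def toy_cost(params):
    """Stand-in for a measured kernel run time (lower is better).

    This made-up formula only gives the search something to minimize;
    it does not model any real hardware.
    """
    return (abs(params["block_x"] - 64)
            + abs(params["block_y"] - 8)
            + abs(params["block_z"] - 4))


def random_individual(rng):
    # One candidate configuration: a value for every tunable parameter.
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    # Uniform crossover: each parameter is taken from one parent at random.
    return {name: rng.choice((a[name], b[name])) for name in SEARCH_SPACE}


def mutate(individual, rng, rate=0.2):
    # With probability `rate`, resample a parameter from its allowed values.
    return {
        name: (rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value)
        for name, value in individual.items()
    }


def genetic_search(cost, pop_size=20, generations=30, elite=4, seed=0):
    rng = random.Random(seed)  # seeded so that runs are reproducible
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[:elite]  # elitism: the fittest survive unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return min(population, key=cost)


best = genetic_search(toy_cost)
print(best, toy_cost(best))
```

Because the fittest candidates are carried over unchanged, the best cost found never gets worse from one generation to the next. The same loop extends to other parameters named above, such as domain decomposition shapes or prefetching flags, by adding entries to the search space.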

