EAGE Workshop on High Performance Computing for Upstream
- Conference date: 07 Sep 2014 - 10 Sep 2014
- Location: Chania, Crete, Greece
- ISBN: 978-94-6282-025-8
- Published: 07 September 2014
Keynote Presentation: Design of Seismic Modeling Engines and Optimization Algorithms for FWI Applications
By J.M. Virieux
Different challenges of full waveform inversion (FWI) will be encountered as we move away from the acoustic approximation towards improved elastic modeling. Quantitative high-resolution imaging can be achieved through efficient estimation of the gradient of the cost function, but also through estimation of the Hessian influence, which is crucial for multiparameter reconstruction. We analyze the different steps required to compute these ingredients: their complexities and the organization of the algorithms, drawing attention to the issues that arise when performing the medium reconstruction.
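As a pointer to the quantities involved, here is a generic least-squares formulation of the FWI misfit and its adjoint-state gradient, in standard notation (not necessarily the exact form used in the talk):

```latex
% Generic least-squares FWI misfit over sources s and receivers r:
\[
  J(\mathbf{m}) \;=\; \frac{1}{2}\sum_{s,r}\bigl\|\, d^{\mathrm{obs}}_{s,r}
      - d^{\mathrm{cal}}_{s,r}(\mathbf{m}) \,\bigr\|^2 .
\]
% Adjoint-state gradient for the acoustic wave equation
% m u_tt - Laplacian(u) = f with m = 1/c^2: a zero-lag correlation of
% the forward field u_s and the adjoint field lambda_s; the Newton-type
% update then involves the Hessian H whose influence the talk highlights.
\[
  \nabla_{\mathbf{m}} J \;=\; \sum_{s}\int_0^T
      \frac{\partial^2 u_s}{\partial t^2}\,\lambda_s \,\mathrm{d}t ,
\qquad
  \Delta\mathbf{m} \;=\; -\,\mathbf{H}^{-1}\,\nabla_{\mathbf{m}} J .
\]
```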
Speeding-up FWI by One Order of Magnitude
Authors V. Etienne, T. Tonellot, P. Thierry, V. Berthoumieux and C. Andreolli
We present several strategies to speed up full waveform inversion via specific optimizations of the time-domain finite-difference modeling. Efficient vectorization of the computations is achieved on the Intel Xeon computing core by modifying the absorbing boundaries. We also propose to increase the computational speed by using high orders in space and by solving the second-order wave equation instead of the first-order formulation. Combined, these strategies reduce the computation time for modeling with the SEG SEAM II model by a factor of 27. Finally, we show that the optimized algorithm has quasi-perfect scalability on a dual-socket computing node.
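To illustrate why the second-order formulation is attractive, here is a minimal NumPy sketch of a leap-frog update for the second-order acoustic wave equation with a 4th-order spatial stencil; a toy kernel, not the authors' optimized code:

```python
import numpy as np

def laplacian_4th(p, dx):
    """4th-order-in-space Laplacian on the interior of a 2D grid."""
    c = (-1.0 / 12.0, 4.0 / 3.0, -5.0 / 2.0)
    lap = np.zeros_like(p)
    lap[2:-2, 2:-2] = (
        c[0] * (p[:-4, 2:-2] + p[4:, 2:-2] + p[2:-2, :-4] + p[2:-2, 4:])
        + c[1] * (p[1:-3, 2:-2] + p[3:-1, 2:-2] + p[2:-2, 1:-3] + p[2:-2, 3:-1])
        + 2 * c[2] * p[2:-2, 2:-2]
    ) / dx**2
    return lap

def step(p_prev, p, vel, dt, dx):
    """One leap-frog step of p_tt = vel^2 * laplacian(p): a single state
    array is updated per step, instead of the coupled pressure/velocity
    fields of the first-order formulation."""
    return 2.0 * p - p_prev + (vel * dt) ** 2 * laplacian_4th(p, dx)

# toy example: a point source in a constant-velocity medium
n, dx, dt, vel = 200, 10.0, 1e-3, 3000.0
p_prev, p = np.zeros((n, n)), np.zeros((n, n))
p[n // 2, n // 2] = 1.0                      # initial impulse
for _ in range(100):
    p_prev, p = p, step(p_prev, p, vel, dt, dx)
```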
Seismic Data Prestack Kirchhoff Time Migration with Multi-GPUs
In this paper, we present a scheme for prestack Kirchhoff time migration (PKTM) with multiple GPUs. We first introduce three main optimization points of the GPU code for PKTM, then test the code on real field data. After analyzing the efficiency curve, we propose a multi-GPU flowchart for PKTM: the seismic data are first split across GPU nodes according to offset range, and the results are then collected and sorted into CRP gathers. The proposed method reaches the maximum efficiency of the GPU PKTM code.
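The data distribution can be pictured with the following sketch, in which `migrate_on_gpu` is a hypothetical stand-in for the real GPU kernel:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def migrate_on_gpu(gpu_id, traces):
    """Hypothetical stand-in for the PKTM kernel running on one GPU:
    migrates one offset bin and returns its partial image."""
    # ... select device gpu_id and launch the real migration kernel ...
    return np.zeros((100, 100))                  # placeholder image

def pktm_multi_gpu(traces, offsets, n_gpus):
    # split the prestack data into n_gpus offset ranges
    edges = np.linspace(offsets.min(), offsets.max(), n_gpus + 1)
    bins = np.clip(np.digitize(offsets, edges) - 1, 0, n_gpus - 1)
    jobs = [(g, [t for t, b in zip(traces, bins) if b == g])
            for g in range(n_gpus)]
    # one host thread per GPU; each partial image is a common-offset
    # section, and together they are sorted into the CRP gathers
    with ThreadPoolExecutor(max_workers=n_gpus) as pool:
        return list(pool.map(lambda j: migrate_on_gpu(*j), jobs))

rng = np.random.default_rng(0)
traces = [rng.standard_normal(500) for _ in range(1000)]  # toy traces
offsets = rng.uniform(100.0, 4000.0, 1000)                # src-rcv offsets
common_offset_images = pktm_multi_gpu(traces, offsets, n_gpus=4)
```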
Hybridizable Discontinuous Galerkin Methods for Solving Helmholtz Equations
Authors M. Bonnasse-Gahot, H. Calandra, J. Diaz and S. Lanteri
As drilling is expensive, the petroleum industry is interested in methods able to produce images of the internal structures of the Earth before drilling. Seismic imaging can be carried out in the time domain or in the frequency domain. The imaging condition is easier to obtain in the frequency domain, but solving the Helmholtz equations in 3D is nearly intractable due to the huge computational cost, even with the help of high-performance computing. We therefore have to develop less expensive methods. We consider the hybridizable discontinuous Galerkin (HDG) method for solving the Helmholtz equations: as a discontinuous Galerkin method, it handles the topography of the subsurface more conveniently than finite-difference methods, and it allows unstructured meshes and a flexible choice of interpolation orders. Moreover, as a hybrid method it reduces the number of globally coupled unknowns, leading to a reduction of the computational cost.
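For reference, the frequency-domain problem and the hybridization idea, in generic notation:

```latex
% Helmholtz equation for the pressure u at angular frequency omega,
% with wavenumber k = omega / c(x):
\[
  \Delta u + k^2 u = f \quad \text{in } \Omega .
\]
% In an HDG discretization, the volume unknowns u_K on each element K
% are eliminated element by element in favour of a single hybrid trace
% unknown lambda living on the mesh skeleton (the element faces), so
% the globally coupled linear system involves only lambda:
\[
  \mathbb{K}\,\boldsymbol{\lambda} = \mathbf{b},
  \qquad
  u_K = \mathcal{L}_K(\boldsymbol{\lambda}, f)
  \ \text{recovered locally on each element } K .
\]
```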
Full-Bandwidth FWI: A Paradigm Change in the Imaging and Interpretation of Seismic Data
Computational and workflow limitations have restricted the scope of FWI applications, with the method only using very low frequencies and transmitted arrivals. We have now overcome these limitations and can invert the full bandwidth of the data and include the reflected wavefield. The new full-bandwidth, high-resolution FWI products have broad utility beyond serving as migration velocity models. In particular: (1) subsurface scales not visible in traditional seismic can now be seen in FWI models; (2) the models are used for improved reservoir characterization and 4D seismic interpretation; (3) the method offers an effective approach for handling novel, non-conventional acquisition geometries and reducing cycle time. In the presentation we will show examples demonstrating the methods and benefits of full-bandwidth FWI.
Design and Performance of an Intel Xeon Phi based Cluster for Reverse Time Migration
Authors V. Arslan, J.Y. Blanc, M. Tchiboukdjian, P. Thierry and G. Thomas-Collignon
In this paper, we design an Intel Xeon Phi based cluster specifically tuned for RTM and compare its performance to our current optimized architecture consisting of Nvidia GPU accelerated nodes. The Xeon Phi nodes are designed to offer a good balance between co-processor computing capabilities, host computing capabilities, local scratch bandwidth, and network bandwidth. Performance is evaluated at the system level, including not only the wave-propagation kernels but also wavefield checkpointing to the local scratch and the pre- and post-processing steps. Moreover, the kernels are tuned for the Xeon Phi and compared to our GPU-optimized kernels, taking the tuning effort into account. Overall, the obtained performance is comparable to our GPU-based architecture while offering better portability, since most of the code is identical to the CPU implementation.
Massively Parallel Algebraic Multiscale Linear Solver
Authors A. Manea, J. Sewall and H.A. Tchelepi
We analyze the parallel performance of the algebraic multiscale solver (AMS) for the heterogeneous pressure system that arises from incompressible flow in porous media, and propose modifications to the algorithm to improve its computational efficiency on massively parallel architectures. AMS is a two-level linear solver based on non-overlapping domain decomposition with a localization assumption, where the local solutions in each domain are used to construct the coarse operator. The overall scalability of AMS is strongly tied to the choice of parameters and algorithms involved, and these choices also impact the convergence properties of the solver. We focus on the basis-function kernel, which dominates the setup phase, and on the local smoother, which dominates the solution phase, carefully balancing computational scalability against convergence rate to ensure high overall performance while maintaining robustness. We present test results for highly heterogeneous problems derived from the SPE10 benchmark, ranging in size from millions to tens of millions of cells. The parallel AMS code is run on two different architectures: a multi-core architecture and a massively parallel Knights Corner architecture. We also compare the performance and robustness of AMS with the widely used SAMG solver running on the multi-core architecture.
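To make the two-level structure concrete, here is a generic two-level iteration on a 1D Poisson system. This is illustrative only: AMS constructs its coarse space from local multiscale basis functions (the expensive setup kernel the paper targets), which this toy aggregation-based restriction does not reproduce.

```python
import numpy as np

def poisson_matrix(n):
    """1D Poisson matrix: a stand-in for the heterogeneous pressure system."""
    return (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
            - np.diag(np.ones(n - 1), -1))

def two_level_solve(A, b, n_coarse=8, omega=0.6, iters=50):
    n = A.shape[0]
    # aggregation-based prolongation: each coarse dof covers one block
    # of fine dofs (AMS would build multiscale basis functions instead)
    P = np.zeros((n, n_coarse))
    for j in range(n_coarse):
        P[j * (n // n_coarse):(j + 1) * (n // n_coarse), j] = 1.0
    Ac = P.T @ A @ P                           # Galerkin coarse operator
    Dinv = 1.0 / np.diag(A)
    x = np.zeros(n)
    for _ in range(iters):
        x += omega * Dinv * (b - A @ x)        # pre-smoothing (solution phase)
        r = b - A @ x
        x += P @ np.linalg.solve(Ac, P.T @ r)  # coarse-grid correction
        x += omega * Dinv * (b - A @ x)        # post-smoothing
    return x

A, b = poisson_matrix(256), np.ones(256)
x = two_level_solve(A, b)
print(np.linalg.norm(b - A @ x))               # residual norm after 50 cycles
```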
Keynote Presentation: Moore's Law is Dying, What Now?
By J. Odegard
Moore’s Law has been and continues to be a critical driver for technology development, industry efficiency, and social transformation. In 1965 Gordon E. Moore observed that the number of transistors that can be placed on an integrated circuit would double approximately every two years. This trend has continued unabated until today, and silicon roadmaps suggest we can expect this to continue through this decade. However, short of a technology disruption, the road beyond the next decade suggests that Moore’s Law scaling will slow down or come to a full stop. In this talk we discuss how we got to where we are today and point to the importance of investing in people, software, and tools moving forward.
Utilizing Key Performance Indices to Deliver Extensive HPC Management and Administration
Authors S.E. Alsaif, M.A. Baddourah and A.A. Turki
The paper introduces key performance indices (KPIs) that cover four major service areas of importance to HPC: system utilization, network health and performance, system availability, and job-scheduling efficiency. The development and implementation of the proposed indices are discussed in detail. The proposed methodologies have been found to play an important role in clarifying the overall state of our systems, allowing us to proactively and optimally administer and manage our HPC resources for improved simulation processes.
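As a flavour of such indices, a few simplified, hypothetical KPI definitions (the paper's own formulas are not given in the abstract):

```python
# Hypothetical, simplified KPI definitions for an HPC service report;
# the actual indices in the paper may be weighted or normalized differently.
def system_utilization(used_core_hours, available_core_hours):
    return 100.0 * used_core_hours / available_core_hours

def system_availability(uptime_hours, period_hours):
    return 100.0 * uptime_hours / period_hours

def scheduling_efficiency(run_hours, wait_hours):
    # share of a job's lifetime spent computing rather than queued
    return 100.0 * run_hours / (run_hours + wait_hours)

print(system_utilization(used_core_hours=8.1e6, available_core_hours=9.0e6))  # 90.0
print(system_availability(uptime_hours=718.0, period_hours=720.0))            # ~99.7
print(scheduling_efficiency(run_hours=40.0, wait_hours=8.0))                  # ~83.3
```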
Fast Simulation of Through-casing Resistivity Measurements Using Semi-analytical Asymptotic Models. Part 1: Accuracy Study
Authors A. Erdozain, V. Péron and D. Pardo
Borehole through-casing resistivity measurements are commonly used to obtain a better characterization of the Earth's subsurface. Wells are also commonly surrounded by a metal casing to protect the well and avoid possible collapses. The presence of this metal casing greatly complicates the numerical simulation of the problem, owing to the high conductivity of the casing compared with that of the rock formations. Here we present an application of theoretical asymptotic methods to deal with complex borehole scenarios such as cased wells. The main idea consists in replacing the part of the domain occupied by the casing with an impedance transmission boundary condition. The small thickness of the casing makes it ideal for this kind of mathematical technique. Eliminating the casing from the computational domain considerably decreases the computational cost, while the effect of the casing is retained through the impedance transmission conditions. The results show that applying an order-three impedance boundary condition on a simplified domain generates only a negligible approximation error while considerably reducing the computational cost.
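Schematically, the asymptotic idea reads as follows (generic form; the paper's exact conditions are not reproduced here):

```latex
% Asymptotic idea in generic form: expand the solution in powers of the
% casing thickness eps,
\[
  u_\varepsilon \;\approx\; u^{(0)} + \varepsilon\, u^{(1)}
      + \varepsilon^2 u^{(2)} + \cdots ,
\]
% and derive impedance transmission conditions on the interface Gamma
% that replaces the casing, such that the modelling error of the
% order-k condition is O(eps^{k+1}); the paper studies the accuracy of
% the order-three condition.
```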
FRTM - A Productive Framework for Reverse Time Migration
Authors D. Gruenewald, N. Ettrich, M. Rahn and F.J. Pfreundt
We have identified the challenges that upcoming hardware developments will impose on RTM implementations. The increasing heterogeneity and complexity of target machines must be transparently mapped into the software layer, an efficient fault-tolerance mechanism must be provided, and I/O latencies must be efficiently hidden. We introduce a framework for RTM that solves these problems. The framework is data-dependency driven on two granularity levels. On the coarse level, concurrent computation of shots is powered by GPI-Space, a parallel development and execution framework, which provides a fault-tolerant execution layer, efficient topology mapping, and on-the-fly resource management. On the fine level, the computation of one shot is handled by domain decomposition in a task-based model. The tight coupling between neighbouring domains is efficiently relaxed by the one-sided asynchronous communication API GPI-2.0. Weak synchronization primitives allow a fine-grained, application-specific breakup of data-synchronization points with optimal overlap of communication and computation. Our framework inherently separates parallelization from computation: domain experts concentrate on implementing domain knowledge, while computer scientists can work on parallelization and optimization at the same time.
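The fine-level overlap pattern can be sketched with plain non-blocking MPI calls. This is a minimal mpi4py illustration of the general idea only; GPI-2.0 itself uses a one-sided API rather than the two-sided calls shown here:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

halo = 4                                       # stencil half-width
u = np.random.rand(1000 + 2 * halo)            # local slab with ghost cells
recv_l, recv_r = np.empty(halo), np.empty(halo)

# post all transfers first ...
reqs = [comm.Isend(u[halo:2 * halo].copy(), dest=left),
        comm.Isend(u[-2 * halo:-halo].copy(), dest=right),
        comm.Irecv(recv_l, source=left),
        comm.Irecv(recv_r, source=right)]

# ... then update the interior, which needs no ghost data, while the
# messages are in flight: communication hidden behind computation
u[2 * halo:-2 * halo] += 0.1 * np.diff(u, 2)[2 * halo - 1:-(2 * halo - 1)]

MPI.Request.Waitall(reqs)                      # the only synchronization point
u[:halo], u[-halo:] = recv_l, recv_r           # install the received halos
```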
Compute and Data Intensive Platforms Designed for Industry and Productivity
By E. L. Goh
For complex seismic processing and reservoir simulation, we discuss a scalable coherent shared-memory design that provides near-uniform memory access, as opposed to non-uniform memory access (NUMA).
Keynote Presentation: Adapting Upstream Applications to Extreme Scale
By D. Keyes
Algorithmic adaptations are required if anticipated exascale hardware is to be used near its potential for upstream applications, since the existing code base has been engineered to minimize floating-point operations. Programmers must now minimize synchronizations, memory usage, and memory transfers, while extra flops on locally cached data are almost “free”. High concurrency requires greater freedom to redistribute data, while power-efficient design of the individual cores will likely require greater fault tolerance from algorithms to relieve the hardware. Stencil-intensive hyperbolic solvers present different opportunities for improving data locality in different regimes of dimension, number of components, discretization order, stencil structure, coefficient characteristics, and hardware characteristics. Today’s elliptic solvers exploit frequent global synchronizations, ultimately reflecting the global Green’s function of the Laplacian, yet execute few flops to cover these latencies. After decades of algorithm refinement during a period of programming-model stability, new programming models and algorithms must now be developed simultaneously. In this presentation, we briefly recap the architectural constraints and roadmap, highlight ongoing work at KAUST, and outline future directions.
Rapid High-Fidelity Reservoir Simulation with Fine-Grained Parallelism on Multiple GPUs
Authors J. Shumway, K. Esler, K. Mukundakrishnan, V. Natoli, Y. Zhang and J. Gilman
Industry pressures, including large full-field simulation, modeling of enhanced recovery from mature assets, and the rapid development of unconventional assets, are creating a huge demand for large, fast, high-fidelity reservoir simulation. Similar needs in seismic imaging have been addressed by running on GPUs, but applying GPUs to something as complex as reservoir simulation presents challenges. We report on a practical multi-GPU approach to reservoir simulation that provides fast turnaround times on large models of up to tens of millions of cells on a single workstation. All major steps of the fully implicit, CPR-AMG preconditioned black-oil simulation, including property evaluation and Jacobian construction, are evaluated directly on the GPUs. In this process, the abundant fine-grained parallelism of the GPU cores is fully exposed and utilized. Because the simulations are memory-bandwidth bound, our data structures and algorithms are optimized to use the limited GPU memory efficiently and to increase data locality for coalesced access. Weak-scaling tests of black-oil simulations using tiled SPE10 models validate our approach on multiple GPUs. We conclude that GPUs can deliver the performance industry demands, and this approach will see further benefits with new generations of many-core hardware.
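One common way to obtain coalesced access, shown here only as a generic NumPy illustration with made-up field names, is a structure-of-arrays layout that keeps each property contiguous in memory:

```python
import numpy as np

n_cells = 1_000_000

# Array-of-structures: the properties of one cell are adjacent, so a
# kernel sweeping one property strides through memory (poor coalescing
# when adjacent GPU threads handle adjacent cells).
aos = np.zeros(n_cells, dtype=[("pressure", "f8"), ("sw", "f8"), ("rs", "f8")])

# Structure-of-arrays: each property is one contiguous array, so cells
# i and i+1 touch adjacent addresses and loads coalesce.
soa = {name: np.zeros(n_cells) for name in ("pressure", "sw", "rs")}

# e.g. a property-evaluation sweep only streams the arrays it needs:
soa["pressure"] += 1.0e5          # contiguous read-modify-write
aos["pressure"] += 1.0e5          # strided access across the records
```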
Using a GPU Cluster for Multiple Realization Workflows
Authors T. Miller, G. Bowen and B. Zineddin
There are two long-standing trends in reservoir simulation: first, to increase the resolution of models, and second, to run multiple-realization workflows to better characterize the inherent uncertainty associated with the subsurface. These need to be balanced against the computing resources available. Moore’s law has continued to revolutionize the computing environment; recently, it has become necessary to expose parallelism at all levels in order to exploit computer power, and the GPU has therefore become an important tool for HPC applications. This paper demonstrates that models (~1M cells, fully implicit, black oil) can be run effectively on a single GPU, taking advantage of its parallel nature and, more importantly, its high memory bandwidth relative to a CPU. The second level of parallelism utilizes a cluster of GPUs to handle multiple-realization (MR) workflows, where the MR capability is implemented within the simulator itself and is optimized to take full advantage of the hardware. The “sweet spot” of being able to efficiently run response-surface-assisted history-matching runs and Monte Carlo-type analyses as a single batch process holds out the prospect of changing how simulation is used on a day-to-day basis, making it practical to move away from a single deterministic model.
Accelerating Curvature Attributes on GPUs
Curvature attributes have been widely used for visualizing folds, flexures, faults, and collapse features, among other interesting features in seismic data. This work presents a parallel approach to computing volumetric curvature attributes for seismic data on the GPU and on the CPU (sequential and parallel). We compare the results for different derivative-operator sizes across the sequential and parallel approaches. The results show that for large derivative operators, the GPU approach achieves greater speed-ups than the other approaches. Moreover, using the proposed GPU approach, it is possible to visualize volume sections at interactive rates.
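For orientation, the differential-geometry core of a curvature attribute is compact. Here is a NumPy sketch of mean curvature for a gridded surface z(x, y); the paper computes volumetric attributes over the full cube, which is where the GPU pays off:

```python
import numpy as np

def mean_curvature(z, dx=1.0):
    """Mean curvature of a gridded surface z(x, y) from its first and
    second derivatives (wider derivative operators, as in the paper,
    would replace np.gradient with larger convolution stencils)."""
    zx, zy = np.gradient(z, dx)
    zxx, zxy = np.gradient(zx, dx)
    _, zyy = np.gradient(zy, dx)
    num = (1 + zy**2) * zxx - 2 * zx * zy * zxy + (1 + zx**2) * zyy
    return num / (2 * (1 + zx**2 + zy**2) ** 1.5)

# toy horizon: a dome whose crest shows up as strong curvature
x = np.linspace(-1, 1, 201)
X, Y = np.meshgrid(x, x)
horizon = np.exp(-5 * (X**2 + Y**2))
H = mean_curvature(horizon, dx=x[1] - x[0])
```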
Genetic Algorithm Based Auto-Tuning of Seismic Applications on Multi and Manycore Computers
Authors C. Andreolli, P. Thierry, L. Borges, C. Yount and G. Skinner
Complex computer systems exhibit so many different characteristics that the best parameter choice becomes impossible to define by hand. The range of parameters impacting performance is too large to be explored by simple trial and error, considering manual tuning techniques, the influence of the domain decomposition, compiler capabilities, and the hardware itself. Auto-tuning therefore appears as an elegant solution to optimize source codes before compilation, by using different compiler flags, or at run time, by tuning the input parameters. Starting from a basic implementation of a 3D finite-difference kernel, we first describe the methodology for estimating the best performance an algorithm can deliver. To get close to this theoretical achievable performance, we present several tuning steps, from the basic version up to a full intrinsics implementation, that improve parallelism, vectorization, and data locality. Then, to find the best set of parameters, we introduce an auto-tuning methodology based on a genetic-algorithm search. We are able to optimize for cache-blocking sizes, domain-decomposition shapes, prefetching flags, and even power consumption, among others. From the unoptimized to the most optimized version, we achieved more than a 6x performance improvement on the E5-2697v2 and almost a 30x improvement on the Xeon Phi.
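A genetic-algorithm search of this kind can be quite small. In the sketch below, `benchmark` is a hypothetical stand-in for a timed run of the real kernel with the candidate block sizes (a synthetic score is used so the example runs standalone):

```python
import random

BLOCK_CHOICES = [8, 16, 32, 64, 128, 256]      # per-dimension cache blocks

def benchmark(cand):
    """Hypothetical stand-in: run the FD kernel with these block sizes
    and return a performance score (synthetic here, peaking at 32/64/128)."""
    bx, by, bz = cand
    return -(abs(bx - 32) + abs(by - 64) + abs(bz - 128))

def evolve(pop_size=16, generations=20, mutation=0.2):
    pop = [[random.choice(BLOCK_CHOICES) for _ in range(3)]
           for _ in range(pop_size)]
    for _ in range(generations):
        elite = sorted(pop, key=benchmark, reverse=True)[:pop_size // 4]
        pop = list(elite)                       # selection
        while len(pop) < pop_size:
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, 3)
            child = a[:cut] + b[cut:]           # one-point crossover
            if random.random() < mutation:      # mutation
                child[random.randrange(3)] = random.choice(BLOCK_CHOICES)
            pop.append(child)
    return max(pop, key=benchmark)

print(evolve())   # tends to find [32, 64, 128] for the synthetic score
```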
Reverse Time Migration with Heterogeneous Multicore and Manycore Clusters
Authors P. Souza, T. Teixeira, L. Borges, A. Neto, C. Andreolli and P. Thierry
In this work we propose a parallel implementation of RTM, based on cooperative work between CPUs and coprocessors, that has proved competitive with other accelerated solutions available. The implementation can run with any number of coprocessors (from none up to the maximum allowed by the vendor specifications) and is very scalable in a cluster environment. Because it is based on a standard programming model, it will also be portable without modification to any future configuration of Xeon and Xeon Phi, or of any X-CPU + Y-CPU combination that supports MPI + OpenMP + C. Here we describe our unified programming model for the optimized code. We also discuss load balancing for the heterogeneous cluster configuration, and validate the performance and scalability of the current implementation. In the current configuration, with 4 Xeon Phi cards with 16 GB of GDDR5 each (64 GB total), we can migrate full shot gathers on a single node. This node configuration also frees memory in the 2-socket host for RTM formulations that might require saving snapshots for cross-correlation, as well as any other auxiliary arrays kept between iterations of the algorithm.
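The load-balancing step can be reduced to splitting the domain in proportion to measured device throughput, as in this sketch (the throughput numbers are made up):

```python
def split_domain(n_planes, throughputs):
    """Assign contiguous slabs of the RTM domain to each device in
    proportion to its measured throughput; any remainder goes to the
    device with the largest share."""
    total = sum(throughputs)
    counts = [int(n_planes * t / total) for t in throughputs]
    counts[counts.index(max(counts))] += n_planes - sum(counts)
    bounds, start = [], 0
    for c in counts:
        bounds.append((start, start + c))
        start += c
    return bounds

# e.g. one dual-socket host plus 4 coprocessor cards, with the host
# measured at 220 and each card at 280 Mpoints/s (illustrative values)
print(split_domain(1000, [220, 280, 280, 280, 280]))
```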
Task-based Programming Model for Elastodynamics
Authors J. Diaz, L. Boillot, G. Bosilca, E. Agullo, H. Calandra and H. Barucq
The reverse time migration (RTM) technique is based on many successive solutions of the full wave equation. Using the discontinuous Galerkin method (DGM) for space discretization, together with a leap-frog time scheme, leads to a numerical problem with inherent parallelism, an interesting property to exploit on current and future heterogeneous many-core architectures. Supercomputers are now mainstream, yet efficiently exploiting hardware heterogeneity remains challenging, with little long-term visibility regarding performance portability and the durability of implementations. This highlights the limits of current programming paradigms. In this context, programming paradigms based on task-graph approaches allow parallelism to be scheduled automatically by dynamic runtime systems. In this paper, we present our ongoing effort to introduce a task-based programming model into an industrial elastodynamics simulation code.
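For reference, the leap-frog scheme mentioned staggers the velocity and stress updates in time (generic notation), and the block-diagonal DG mass matrices are what make the element-level updates independent tasks:

```latex
% Leap-frog time stepping for the first-order elastodynamic system,
% with velocity v and stress sigma staggered by half a time step:
\[
  \mathbf{v}^{\,n+\frac{1}{2}} \;=\; \mathbf{v}^{\,n-\frac{1}{2}}
      + \Delta t\, M_v^{-1} K_\sigma\, \boldsymbol{\sigma}^{\,n},
\qquad
  \boldsymbol{\sigma}^{\,n+1} \;=\; \boldsymbol{\sigma}^{\,n}
      + \Delta t\, M_\sigma^{-1} K_v\, \mathbf{v}^{\,n+\frac{1}{2}},
\]
% where the DG mass matrices M are block-diagonal and inverted element
% by element -- the per-element independence the task graph exploits.
```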
A Parallel Evolution Strategy for Acoustic Full-Waveform Inversion
Authors Y. Diouane, H. Calandra, S. Gratton and X. Vasseur
In this work, we propose an alternative way to find an initial velocity model for acoustic FWI without any prior physical knowledge. Motivated by the recent growth of high-performance computing (HPC), we tackle the strong non-linearity of the minimization problem using global optimization methods that are easy to parallelize, in particular evolution strategies. The first contribution adapts evolution strategies to the FWI setting, where evaluating the cost function is the most expensive part. The second contribution is the parameterization of the problem: representing the model as faithfully as possible while limiting the number of parameters, since each additional parameter is an additional dimension to explore. The last contribution is a highly parallel evolution strategy adapted to the FWI setting. Initial results on the Salt Dome velocity model, using a low frequency range, show that substantial progress can be made toward automating FWI.
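Because each cost-function evaluation entails full wave-equation solves, the natural parallelism is across the population. Here is a minimal (mu, lambda)-evolution-strategy sketch with parallel evaluation, where `misfit` is a synthetic stand-in for the FWI cost function:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def misfit(model):
    """Hypothetical stand-in for the FWI cost function: in practice each
    call runs a full set of forward wave-equation simulations."""
    return float(np.sum((model - 1.5) ** 2))    # synthetic target model

def es_minimize(dim=50, mu=8, lam=32, sigma=0.3, generations=30):
    rng = np.random.default_rng(0)
    mean = rng.uniform(1.0, 4.0, dim)           # initial velocities, km/s
    with ProcessPoolExecutor() as pool:
        for _ in range(generations):
            # sample lambda candidate models around the current mean
            cands = [mean + sigma * rng.standard_normal(dim) for _ in range(lam)]
            # the expensive part, evaluated in parallel across workers
            costs = list(pool.map(misfit, cands))
            # (mu, lambda) selection: recombine the mu best candidates
            best = np.argsort(costs)[:mu]
            mean = np.mean([cands[i] for i in best], axis=0)
            sigma *= 0.95                       # simple step-size decay
    return mean

if __name__ == "__main__":
    model = es_minimize()
```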