Abstract: |
The major achievements enabled by the QMC Endstation grant include:
* Performance improvements on clusters of x86 multi-core systems, especially Cray XT systems
* New and improved methods for wavefunction optimization
* New forms of trial wavefunctions
* Implementation of the full application on NVIDIA GPUs using CUDA

Scaling studies of QMCPACK on large-scale systems show excellent parallel efficiency up to 216K cores on Jaguarpf (Cray XT5). The GPU implementation shows speedups of 10-15x over the CPU implementation on an older generation of x86 processors.

We have implemented a hybrid OpenMP/MPI scheme in QMC to take advantage of the multi-core shared-memory processors of petascale systems. Our hybrid scheme has several advantages over the standard MPI-only scheme:
* Memory optimized: large read-only data, which store the one-body orbitals and other shared properties that represent the trial wavefunction and the many-body Hamiltonian, can be shared among threads. This reduces the memory footprint of a large-scale problem.
* Cache optimized: the data associated with an active Walker stay in cache during the compute-intensive drift-diffusion process, and the operations on a Walker are optimized for cache reuse. Thread-local objects are used to ensure data affinity to a thread.
* Load balanced: Walkers in an ensemble are evenly distributed among threads and MPI tasks. The two-level parallelism reduces the population imbalance among MPI tasks and reduces the number of point-to-point communications of large messages (serialized objects) for the Walker exchange.
* Communication optimized: the communication overhead, especially for the collective operations necessary to determine the trial energy E_T and to measure the properties of an ensemble, is significantly lowered by using fewer MPI tasks.

The multiple forms of parallelism afforded by QMC algorithms make them ideal candidates for acceleration in the many-core paradigm.
We presented the results of our effort to port the QMCPACK simulation code to the NVIDIA CUDA GPU platform. We restructured the CPU algorithms to express additional parallelism, minimize GPU-CPU communication, and efficiently utilize the GPU memory hierarchy. |
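
The two-level Walker distribution described under "Load balanced" can be illustrated with a minimal C++ sketch. It is not taken from QMCPACK: the function names `split_evenly` and `distribute_walkers` are hypothetical, and the sketch only computes a static layout, whereas a real simulation would rebalance as the Walker population fluctuates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a QMCPACK API): split `total` items into `parts`
// shares whose sizes differ by at most one.
std::vector<std::size_t> split_evenly(std::size_t total, std::size_t parts) {
    assert(parts > 0);
    std::vector<std::size_t> counts(parts, total / parts);
    // Hand the remainder out one item at a time to the first shares.
    for (std::size_t i = 0; i < total % parts; ++i) ++counts[i];
    return counts;
}

// Two-level layout (assumed for illustration): layout[r][t] is the number of
// Walkers owned by thread t of MPI task r. Walkers are first divided among
// MPI tasks, then each task's share is divided among its threads.
std::vector<std::vector<std::size_t>> distribute_walkers(
    std::size_t walkers, std::size_t mpi_tasks, std::size_t threads_per_task) {
    std::vector<std::vector<std::size_t>> layout;
    layout.reserve(mpi_tasks);
    for (std::size_t per_task : split_evenly(walkers, mpi_tasks)) {
        layout.push_back(split_evenly(per_task, threads_per_task));
    }
    return layout;
}
```

For example, `distribute_walkers(100, 3, 4)` assigns 34, 33, and 33 Walkers to the three tasks, and the first task splits its share among its four threads as 9, 9, 8, 8. Because each task's threads absorb part of the imbalance locally, fewer Walkers need to cross task boundaries than in an MPI-only layout with twelve tasks.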