Free download of the paper: GPU implementation of the linear scaling three-dimensional fragment method for electronic structure calculations

Persian title
GPU implementation of the linear scaling three-dimensional fragment method for large-scale electronic structure calculations
English title
GPU implementation of the linear scaling three dimensional fragment method for large scale electronic structure calculations
Pages in Persian paper
0
Pages in English paper
8
Year of publication
2016
Publisher
Elsevier
Format of English paper
PDF
Product code
E994
Fields related to this paper
Computer engineering and physics
Specializations related to this paper
Applied physics
Journal
Computer Physics Communications
University
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Keywords
electronic structure calculations, LS3DF, GPU

Abstract


LS3DF, the linear scaling three-dimensional fragment method, is an efficient linear scaling ab initio total energy electronic structure calculation code based on a divide-and-conquer strategy. In this paper, we present our GPU implementation of the LS3DF code. Our test results show that the GPU code can calculate systems with about ten thousand atoms fully self-consistently in a time on the order of 10 min using thousands of computing nodes. This makes electronic structure calculations of 10,000-atom nanosystems routine work. The GPU code is 4.5–6 times faster than CPU calculations using the same number of nodes on the Titan machine at the Oak Ridge Leadership Computing Facility (OLCF). This speedup is achieved by (a) a careful redesign of the computationally heavy kernels and (b) a redesign of the communication pattern for heterogeneous supercomputers.
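
The abstract does not spell out the divide-and-conquer scheme. As a minimal sketch, assuming the fragment pattern usually described for LS3DF (at every corner of a periodic grid, eight fragments of size 1 or 2 cells per dimension, with a sign that flips once for each dimension of size 1), the following Python check shows why the signed sum of fragments covers every grid cell exactly once. The function name `fragment_coverage` and the grid dimensions are illustrative only and are not taken from the paper.

```python
from itertools import product


def fragment_coverage(m1, m2, m3):
    """Return the signed number of times each cell of a periodic grid is covered."""
    dims = (m1, m2, m3)
    cells = list(product(range(m1), range(m2), range(m3)))
    coverage = {cell: 0 for cell in cells}
    for corner in cells:
        # Eight fragment shapes per corner: 1 or 2 cells along each dimension.
        for size in product((1, 2), repeat=3):
            # Assumed sign rule: +1 for 2x2x2, flipped once per size-1 dimension.
            sign = (-1) ** size.count(1)
            for offset in product(*(range(s) for s in size)):
                cell = tuple((c + o) % m for c, o, m in zip(corner, offset, dims))
                coverage[cell] += sign
    return coverage


coverage = fragment_coverage(3, 3, 4)
# Overlaps between fragments cancel, so every cell is counted exactly once.
assert all(count == 1 for count in coverage.values())
```

Because the cost of each fragment is independent of the total system size and the number of fragments grows with the number of grid corners, a scheme of this kind scales linearly with system size.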

6. Conclusions and future work


In this paper, we presented our GPU implementation of LS3DF on heterogeneous supercomputers. The code can bring a system with thousands of atoms to SCF convergence within 5–25 min when enough GPU nodes are used. It is about 4.5–6 times faster than the corresponding CPU code. We have presented the detailed steps taken to speed up the code. These include (1) a hybrid parallelization that combines G-space and band-index parallelization to speed up the FFT; (2) moving all the computationally heavy parts onto the GPU to reduce CPU–GPU memory copy operations; (3) a data compression algorithm to reduce the MPI_Alltoall communication; and (4) using direct point-to-point MPI for global communication when patching up the charge density. Nanosystem electronic structure calculations can now be reduced from hours to minutes. For example, one SCF step of the 8640-atom CaTiO3 system (with 41,472 electrons) takes only about one minute.

Currently, the GPU AB-CG and Occupy step takes about 80% of the total computational time. One direction for future work is to speed up this part further. The bottleneck lies with the small fragments, as mentioned in Section 5. Two approaches could be used: the first is to move other CPU parts, e.g., the calculation of the nonlocal projector, onto the GPU; the second is to use CUDA streams to further exploit parallelism across the small fragments.

In this paper, we have used one MPI process per GPU. However, on the Titan supercomputer, each node is equipped with 16 CPU cores and 1 GPU card. To fully utilize the CPU part, a good programming model would be MPI/OpenMP/CUDA. Nevertheless, implementing such a programming model would be a big challenge, as we have already moved the most computationally intensive tasks onto the GPU.

Fig. 8. The SCF convergence of the CPU and GPU LS3DF codes for the 3877-atom Si quantum dot system. Note that the GPU and CPU codes converge identically. The vertical axis is on a base-10 logarithmic scale.
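
The electron count quoted in the conclusions for the 8640-atom CaTiO3 system can be checked with simple arithmetic. The sketch below assumes valence-only electron counts of 2 for Ca, 4 for Ti and 6 for O; these values are an assumption about the pseudopotentials and are not stated in the text, but they reproduce the quoted total. The names `VALENCE`, `FORMULA` and `count_electrons` are illustrative.

```python
# Assumed valence electrons per atom (not stated in the text).
VALENCE = {"Ca": 2, "Ti": 4, "O": 6}
# Atoms per CaTiO3 formula unit.
FORMULA = {"Ca": 1, "Ti": 1, "O": 3}


def count_electrons(n_atoms):
    """Return (formula units, valence electrons) for a CaTiO3 system of n_atoms."""
    atoms_per_unit = sum(FORMULA.values())
    if n_atoms % atoms_per_unit != 0:
        raise ValueError("atom count is not a whole number of formula units")
    units = n_atoms // atoms_per_unit
    electrons_per_unit = sum(VALENCE[el] * n for el, n in FORMULA.items())
    return units, units * electrons_per_unit


units, electrons = count_electrons(8640)
print(units, electrons)  # 1728 41472
```

Under these assumptions the system contains 1728 formula units (a 12 x 12 x 12 arrangement would give this count) with 24 valence electrons each, which matches the 41,472 electrons quoted.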

