Abstract
We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU at sufficiently large particle numbers, reaching a speedup of up to 2.5.
6. Conclusions
We have implemented the Lanczos algorithm to compute the ground state energy of a many-particle quantum lattice model on three platforms: a multi-core Intel Xeon CPU, an Intel Xeon Phi coprocessor and an NVIDIA GPU. The CPU and the Xeon Phi were parallelized with OpenMP. With only one spin species in the model, the MKL library was used to compute the sparse matrix–vector product in the Lanczos algorithm; with two spin species, a custom OpenMP function was used. The GPU was programmed with CUDA. In the single spin species case we used the cuSPARSE library, and with two spin species we used a custom CUDA kernel.

We benchmarked the programs with single and double precision arithmetic in two different lattice geometries: a 1D ring with nearest-neighbor hopping and a checkerboard lattice with hoppings up to the third nearest-neighbor lattice sites. In all cases, the CPU is the fastest of the three platforms when the particle number is very low. With larger particle numbers, the GPU is the fastest, with speedup factors of up to 7.6 compared to the CPU. While the Xeon Phi is never the fastest of the three test platforms, it does outperform the CPU when the particle number is sufficiently high, reaching a speedup of up to 2.5. This is important, since an existing CPU code can be run on the Xeon Phi with practically no coding effort, resulting in an instant performance gain.

All in all, our results indicate that with the current hardware, graphics processors with custom low-level kernels offer the best performance in exactly diagonalizing many-particle quantum lattice models at large system sizes. The Xeon Phi was shown to be a good choice for gaining a significant speedup over an existing multi-core code with very little programming effort.
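
To make the benchmarked unit of work concrete, the following is a minimal Python sketch of a single Lanczos step and of the loop built around it. It is not the OpenMP or CUDA code discussed above. It assumes a deliberately simplified toy problem, one particle hopping on a 1D ring with amplitude t, so that the Hamiltonian is small and the matrix–vector product fits on one line. The function names (`ring_hop_matvec`, `lanczos_step`, `lowest_tridiag_eigenvalue`, `lanczos_ground_energy`) are hypothetical, and the Sturm-sequence bisection used to diagonalize the tridiagonal matrix is an illustrative choice, not necessarily the method used in this work.

```python
import math

def ring_hop_matvec(x, t=1.0):
    """Toy stand-in for the sparse matrix-vector product: y = H x for a single
    particle on a ring, H = -t * sum_i (|i><i+1| + h.c.). Assumed model only."""
    n = len(x)
    return [-t * (x[(i - 1) % n] + x[(i + 1) % n]) for i in range(n)]

def lanczos_step(matvec, v_prev, v, beta):
    """One Lanczos step: w = H v - beta * v_prev - alpha * v, then normalize.
    Returns (alpha, beta_next, v_next); v_next is None on breakdown."""
    w = matvec(v)  # the dominant cost of the step
    alpha = sum(wi * vi for wi, vi in zip(w, v))
    w = [wi - alpha * vi - beta * pi for wi, vi, pi in zip(w, v, v_prev)]
    beta_next = math.sqrt(sum(wi * wi for wi in w))
    if beta_next < 1e-12:  # Krylov space exhausted
        return alpha, 0.0, None
    return alpha, beta_next, [wi / beta_next for wi in w]

def lowest_tridiag_eigenvalue(alphas, betas, tol=1e-12):
    """Smallest eigenvalue of the symmetric tridiagonal matrix with diagonal
    `alphas` and off-diagonal `betas`, found by Sturm-sequence bisection."""
    m = len(alphas)
    off = [0.0] + [abs(b) for b in betas[:m - 1]] + [0.0]
    # Gershgorin bounds enclose the whole spectrum.
    lo = min(alphas[i] - off[i] - off[i + 1] for i in range(m))
    hi = max(alphas[i] + off[i] + off[i + 1] for i in range(m))

    def eigenvalues_below(x):
        count, d = 0, 1.0
        for i in range(m):
            d = alphas[i] - x - (betas[i - 1] * betas[i - 1] / d if i > 0 else 0.0)
            if d == 0.0:
                d = -1e-300  # avoid division by zero in the next term
            if d < 0.0:
                count += 1
        return count

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eigenvalues_below(mid) >= 1:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def lanczos_ground_energy(matvec, v0, max_steps=30):
    """Run Lanczos steps from v0 and return the lowest Ritz value."""
    norm = math.sqrt(sum(x * x for x in v0))
    v = [x / norm for x in v0]
    v_prev, beta = [0.0] * len(v), 0.0
    alphas, betas = [], []
    for _ in range(max_steps):
        alpha, beta_next, v_next = lanczos_step(matvec, v_prev, v, beta)
        alphas.append(alpha)
        if v_next is None:
            break
        betas.append(beta_next)
        v_prev, v, beta = v, v_next, beta_next
    return lowest_tridiag_eigenvalue(alphas, betas[:len(alphas) - 1])

# Usage: 12-site ring, t = 1; the exact single-particle ground energy is -2t.
energy = lanczos_ground_energy(ring_hop_matvec, [1.0 / (i + 1) for i in range(12)])
```

In the many-particle case the vector length grows combinatorially with the particle number, and the call to `matvec` inside `lanczos_step` dominates the run time. That call corresponds to the sparse matrix–vector product that is handled by MKL, cuSPARSE, or a custom kernel on the three platforms.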