6. Conclusion
We propose a new partitioning algorithm for the 3-dim FFT grid used in the plane wave first-principles calculations. Compared with the greedy algorithm biased toward the load balancing of the plane wave computations, our approach primarily suppresses the growth in communication overhead with respect to an increasing number of processors by realizing local all-to-all communications during data transposes. Then we adjust the data distribution to improve the load balancing with the communication pattern preserved. It has been shown by numerical results that a much lower communication overhead on a relatively large number of processors is achieved at a moderate loss of load balancing. Using the new algorithm, we could scale the plane wave codes up to more nodes than the greedy algorithm. If better performance were wanted, we would combine our approach with other techniques such as the hybrid OpenMP/MPI implementation or simultaneously performing a large number of FFTs.