7. Conclusions
In this work, we have presented a methodology for amalgamating the action of various key finite element operators across a range of elements. The resulting amalgamation schemes improve performance by making more efficient use of data locality and by reducing data transfer across the memory bus, which in turn allows them to exploit optimised BLAS routines and the CPU cache structure. We also presented an auto-tuning method that automatically selects the most efficient scheme at runtime. We have shown how these schemes can be leveraged to improve runtimes, both by examining the schemes individually and by applying them to a large-scale simulation of the compressible Euler equations. The results clearly demonstrate the importance and benefits of streaming data from memory efficiently.

As alluded to in the introduction, we stress that the results shown here are generally not specific to the spectral/hp element method, owing to the fundamental nature of the operators being used. Other high-order schemes, such as the popular nodal discontinuous Galerkin (DG) method [26], rely on the evaluation of the same types of operators, which in turn have similar matrix formulations.

We note, however, that the SumFac scheme may not be applicable, depending on the choice of basis functions used in the local expansion of each element. As we describe in Section 2, sum-factorisation relies on the ability to write local expansion modes as the tensor product of one-dimensional functions. In the nodal DG scheme, hybrid elements such as prisms and tetrahedra typically use Lagrange interpolants together with a set of suitable solution points, such as Fekete or electrostatic point distributions. This choice of basis functions is inherently not of tensor-product form, and so the SumFac scheme we consider here cannot be used for these element types. The IterPerExp and StdMat schemes, by contrast, remain equally applicable in this setting.
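The tensor-product requirement for sum-factorisation can be illustrated with a minimal sketch. It assumes a single quadrilateral element whose two-dimensional basis matrix is the Kronecker product of two one-dimensional basis matrices. The function names (`backward_full`, `backward_sumfac`) are hypothetical, random matrices stand in for real basis evaluations, and plain Python lists stand in for the optimised BLAS calls used in practice. The sketch is not the implementation studied in this work. It only shows that the full-matrix and factored evaluations agree when the basis has tensor-product structure.

```python
import random


def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def kron(A, B):
    """Kronecker product of A and B as a dense matrix."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def backward_full(B1, B2, coeffs):
    """Full-matrix evaluation: build kron(B1, B2) and apply it to the coefficients.

    For n1 x P and n2 x Q basis matrices this costs O(n1 * n2 * P * Q).
    """
    B = kron(B1, B2)
    return [sum(b * c for b, c in zip(row, coeffs)) for row in B]


def backward_sumfac(B1, B2, coeffs):
    """Sum-factorised evaluation: U = B1 * U_hat * B2^T, never forming kron(B1, B2).

    The cost drops to O(n1 * P * Q + n1 * n2 * Q). This is only valid because
    each 2D mode is assumed to be a product of two 1D functions.
    """
    Q = len(B2[0])
    P = len(B1[0])
    # Reshape the flat coefficient vector (row-major) into a P x Q matrix.
    U_hat = [coeffs[p * Q:(p + 1) * Q] for p in range(P)]
    U = matmul(matmul(B1, U_hat), transpose(B2))
    # Flatten back to a vector, row-major over quadrature points.
    return [value for row in U for value in row]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical sizes: 5 and 6 quadrature points, 4 and 3 modes per direction.
    B1 = [[rng.random() for _ in range(4)] for _ in range(5)]
    B2 = [[rng.random() for _ in range(3)] for _ in range(6)]
    coeffs = [rng.random() for _ in range(4 * 3)]
    full = backward_full(B1, B2, coeffs)
    fac = backward_sumfac(B1, B2, coeffs)
    print("max difference:", max(abs(a - b) for a, b in zip(full, fac)))
```

For a basis that cannot be written in this product form, such as Lagrange interpolants on Fekete points in a tetrahedron, no pair `B1`, `B2` exists, and only the full-matrix route in `backward_full` (or an iteration over elements) remains available.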