Abstract
The Thick Control Flow (TCF) model simplifies parallel programming by bundling computations with the same control flow into single flows of variable thickness, and has the prospect of reducing redundant use of software and hardware resources. While architectures that support the TCF model have been proposed, current proposals cannot support concurrent memory accesses, which can both simplify programming and speed up many parallel algorithms by a logarithmic factor. In this paper, we extend current TCF architectures to efficiently support concurrent read as well as write memory accesses. The solution is based on bounded-size step caches, and exploits the two-part, hybrid frontend-backend structure of current TCF processors as well as the synchronization properties of the TCF model itself. According to our simulation-based evaluation, a concurrent memory access TCF processor with B backends can execute algorithms with substantial concurrent memory accesses up to B times faster than a baseline TCF processor without such support. The hardware overhead of the solution is estimated to be modest. We include parallel program code to illustrate the gains from supporting concurrent memory accesses.
1. Introduction
Standard programming models of current shared memory multicore processors expose sets of computational threads that execute individual code asynchronously [1, 2]. While this makes the execution of independent threads straightforward, data dependencies between threads make it necessary to synchronize them explicitly with constructs such as barriers, locks, and atomic operations to guarantee that critical operations are carried out in the right order. Since the cost of applying these constructs, in terms of execution time and usage of hardware resources, is high in most current CPU and GPU architectures, many sophisticated and efficient paradigms for fine-grained parallel computing [3, 4, 5] are deemed impractical. These include concurrent memory access, which lets a number of processors read and write a memory location concurrently, and the execution of fine-grained parallel algorithms in general.
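To make the logarithmic factor concrete, consider computing the logical OR of n flags. With concurrent writes, every processor whose flag is set can store the same value to one shared result word in a single step; without them, the values must be combined pairwise over roughly log2 n barrier-separated rounds. The following C++ sketch is ours, not taken from the paper: it is a CRCW-style illustration on ordinary threads rather than TCF code, with function names of our choosing.

    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // One-step OR with concurrent writes (common-CRCW style): every
    // thread whose flag is set stores the same value 1 to one shared
    // word, so no ordering among the writers is needed.
    void or_concurrent_write(const std::vector<int>& flag,
                             std::atomic<int>& result) {
        std::vector<std::thread> workers;
        for (std::size_t i = 0; i < flag.size(); ++i)
            workers.emplace_back([&flag, &result, i] {
                if (flag[i])
                    result.store(1, std::memory_order_relaxed);
            });
        for (auto& t : workers) t.join();
    }

    // Without concurrent writes: an EREW-style reduction tree. Each
    // round combines disjoint pairs (emulated sequentially here; a
    // parallel version ends every round with a barrier), so the OR
    // takes ceil(log2 n) rounds instead of one step.
    int or_reduction_tree(std::vector<int> a) {
        for (std::size_t stride = 1; stride < a.size(); stride *= 2)
            for (std::size_t i = 0; i + stride < a.size(); i += 2 * stride)
                a[i] |= a[i + stride];
        return a.empty() ? 0 : a[0];
    }

On a machine where concurrent access is cheap, the first variant finishes in one step regardless of n, while the second needs ceil(log2 n) synchronized rounds; this is the logarithmic gap referred to above.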
5. Conclusions
We have described an architectural solution for realizing concurrent memory access in TCF processors. The solution is based on bounded-size step caches and the two-part, hybrid frontend-backend structure of current TCF processors. Step caches capture and hold the references made during the ongoing step of an execution. Since operations within a step are independent by the definition of TCF execution, there are no coherence problems. If there is only a single-location concurrent operation per TCF, the active frontend unit can perform such TCF-level accesses cost-efficiently. According to the evaluation, the concurrent memory access-aware B-backend-unit TPA executes certain algorithms up to B times faster than a similarly configured baseline TPA. Employing the frontend in the case of single-location concurrent memory access can speed up execution further. The cost of the proposed technique in silicon area and power consumption is estimated to be very small. Our programming examples demonstrate that concurrent memory access can be integrated as a natural pattern of parallel programming in both high-level and assembly languages.
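As an illustration of the mechanism, the following C++ sketch models a step cache in software; the class and its interface are our assumptions for exposition, not the paper's hardware design. It captures the key property: entries live only for one step, so because a TCF step is a synchronization boundary and intra-step operations are independent, clearing the cache at each step boundary removes the need for any coherence protocol, and repeated references to the same location within a step are filtered before they reach memory.

    #include <cstdint>
    #include <optional>
    #include <unordered_map>

    // Minimal software model of a step cache (names and structure are
    // ours). Within one execution step, the first reference to an
    // address goes to memory; later references to the same address are
    // satisfied locally. Entries never survive a step boundary, so
    // they can never become stale.
    class StepCache {
    public:
        // Returns the cached value if the address was already
        // referenced during this step; otherwise nullopt, and the
        // caller fetches from memory and calls insert().
        std::optional<std::uint64_t> lookup(std::uint64_t addr) const {
            auto it = entries_.find(addr);
            if (it == entries_.end()) return std::nullopt;
            return it->second;
        }

        void insert(std::uint64_t addr, std::uint64_t value) {
            entries_[addr] = value;
        }

        // Called at each step boundary: all intra-step data is
        // discarded, which is what makes coherence machinery
        // unnecessary in the TCF setting.
        void step_boundary() { entries_.clear(); }

    private:
        std::unordered_map<std::uint64_t, std::uint64_t> entries_;
    };

In hardware, the table would be a bounded-size associative memory rather than an unbounded map, so a reference that misses for capacity reasons simply falls through to the memory system.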
In the future, we aim to further study and develop TCF-aware computing hardware and methodology. This includes an FPGA proof-of-concept implementation and studying the possibility of realizing multiprefix operations on TCF architectures.