Abstract
An operating system’s design is often influenced by the architecture of the target hardware. While uniprocessor and multiprocessor architectures are well understood, the same cannot be said of multithreaded chip multiprocessors (CMT), a new generation of processors designed to improve the performance of memory-intensive applications. The first systems equipped with CMT processors are just becoming available, so it is critical that we now understand how to obtain the best performance from such systems.
The goal of our work is to understand the fundamentals of CMT performance and identify the implications for operating system design. We have analyzed how the performance of a CMT processor is affected by contention for the processor pipeline, the L1 data cache, and the L2 cache, and have investigated operating system approaches to managing these performance-critical resources. Having found that contention for the L2 cache can have the greatest negative impact on processor performance, we have quantified the potential performance improvement that can be achieved from L2-aware OS scheduling. We evaluated a scheduling policy based on the balance-set principle and found that it has the potential to reduce L2 miss ratios by 19-37% and improve processor throughput by 27-45%. Achieving a similar improvement in hardware would require doubling the size of the L2 cache.
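To make the balance-set idea concrete, the sketch below shows one simple way a scheduler could form groups of threads whose combined L2 working sets fit in the cache and then time-slice among those groups. This is an illustrative, hypothetical fragment rather than the policy implementation evaluated in this paper; it assumes that per-thread L2 footprint estimates (for example, from an analytical cache model or hardware performance counters) are already available, and the cache size and thread limit are placeholder constants.

/*
 * Illustrative balance-set group formation (hypothetical sketch, not the
 * implementation evaluated in this paper).  Each runnable thread carries
 * an estimated L2 working-set size; threads are packed first-fit into
 * groups whose combined footprint fits in the L2 cache.  The scheduler
 * would then time-slice among the groups, co-scheduling one group at a time.
 */
#include <stddef.h>

#define L2_SIZE_BYTES (1u << 20)   /* assumed 1 MB shared L2 cache */
#define MAX_THREADS   64           /* assumed upper bound on runnable threads */

struct thread_info {
    int    tid;            /* thread identifier */
    size_t l2_footprint;   /* estimated L2 working-set size, in bytes */
};

struct group {
    int    members[MAX_THREADS];
    int    nmembers;
    size_t footprint;      /* combined footprint of the group's members */
};

/* Pack threads first-fit into balance sets; returns the number of groups.
 * 'groups' must have room for at least 'nthreads' entries. */
static int form_balance_sets(const struct thread_info *threads, int nthreads,
                             struct group *groups)
{
    int ngroups = 0;

    for (int i = 0; i < nthreads; i++) {
        int g;
        /* Find the first group this thread's footprint still fits into. */
        for (g = 0; g < ngroups; g++)
            if (groups[g].footprint + threads[i].l2_footprint <= L2_SIZE_BYTES)
                break;
        if (g == ngroups) {          /* no existing group fits: open a new one */
            groups[g].nmembers  = 0;
            groups[g].footprint = 0;
            ngroups++;
        }
        groups[g].members[groups[g].nmembers++] = threads[i].tid;
        groups[g].footprint += threads[i].l2_footprint;
    }
    return ngroups;
}

A thread whose footprint alone exceeds the cache simply gets a group of its own; more elaborate variants could also account for pipeline and L1 contention when choosing group members.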
1. INTRODUCTION
An operating system provides a layer of abstraction between the hardware and the software. Its job is to expose the power of the hardware to applications while hiding its complexities. It is no surprise, then, that the architecture of the hardware influences the design of the operating system. Operating system design for conventional uniprocessor and multiprocessor architectures has been studied extensively. This paper begins to investigate operating system design for a new family of processors: multithreaded chip multiprocessors (CMT).
6. RELATED WORK
Previous work has proposed scheduling algorithms for single-core SMT processors that have been shown to improve system response time by 17% [1, 2, 12]. These algorithms sample the space of possible schedules and keep the ones that perform best. This method can be implemented with virtually no overhead, but it requires hardware support. Our scheduling algorithm differs in that it uses modeling to predict the best schedule. Modeling may be preferable to sampling when the sample space becomes very large, as on a system with dozens of hardware thread contexts (i.e., a CMT processor) running hundreds of threads. It would be interesting to compare the effectiveness and cost of the method proposed in the SMT study with ours on a large CMT configuration. We also hope to apply ideas from the follow-up study on incorporating priorities into the SMT-aware scheduler [2] to improve the fairness of our scheduler.
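As a rough illustration of why exhaustive sampling becomes impractical at CMT scale (counting only which threads run together, not their placement on particular hardware contexts), the number of distinct co-schedules of $k$ threads drawn from $n$ runnable threads is

\[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!}, \]

which already reaches 76,904,685 for $n = 40$ and $k = 8$ and grows combinatorially with both parameters. A model that predicts the quality of a candidate co-schedule lets the scheduler avoid running samples drawn from this space.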