Abstract
As data centres continue to grow in size and complexity in order to respond to the increasing demand for computing resources, failures become the norm instead of an exception. To provide dependability at scale, traditional fault-tolerance techniques focus on reactive and redundant schemes. While the former relies on checkpointing/restart of a job (which can incur significant overhead in a large-scale system), the latter replicates tasks, thus consuming extra resources to achieve higher reliability and availability of computing environments. Proactive fault-tolerance in large systems represents a new trend to avoid, cope with, and recover from failures. However, different fault-tolerance schemes provide different levels of computing environment dependability at diverse costs to both providers and consumers.
In this paper, two state-of-the-art fault-tolerance techniques are compared in terms of availability of computing environments to cloud consumers and energy costs to cloud providers. The results show that proactive fault-tolerance techniques outperform traditional redundancies in terms of costs to cloud users while providing available computing environments and services to consumers. However, the computing environment dependability provided by proactive fault-tolerance highly depends on failure prediction accuracy.
1. Introduction
With trends in the service-oriented economy [1] being supported by distributed computing paradigms such as cloud computing, there is an increased concern about the quality and availability of offered services. However, providing quality of service (QoS) guarantees to users is a very difficult and complex task [2] because the demands of consumers’ services vary significantly with time. Moreover, this problem is exacerbated when one considers computing node availability. By definition, a system is available if the fraction of its down-time is very small, either because failures are rare or because it can restart quickly after a failure. Therefore, availability is a function of reliability, which, according to the Institute of Electrical and Electronics Engineers (IEEE) [3], is the ability of a system or component to perform its required functions under stated conditions for a specified period of time. Reliability can be measured by the mean time between failures (MTBF), which in turn is estimated by the component manufacturer.
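As a concrete illustration (not part of the original formulation, and assuming the usual mean time to repair, MTTR, as the recovery measure), steady-state availability is commonly related to these quantities as

\[
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}.
\]

For example, a node with an MTBF of 1000 h and an MTTR of 1 h would have an availability of roughly 0.999; shortening recovery time (e.g., by restarting quickly after a failure) therefore improves availability even when the underlying reliability is unchanged.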
5. Conclusion
This paper has evaluated the effectiveness of two state-of-the-art fault-tolerance mechanisms in providing dependable cloud services to consumers. Failures represent a threat to the availability and dependability of a system and its deployed services. Unfortunately, with an ever-growing number of warehouse-sized data centres built to address the increasing demand for computing resources and services, component failures become the norm instead of an exception. As such, the success of petascale/exascale computing will depend on the ability to provide reliability and availability at scale. To address this problem, different fault-tolerance mechanisms exist to provide dependable computing environments. However, dependability of services at all costs is not a solution, as it may increase energy consumption and operational costs for providers.