Abstract
Cloud computing has transformed the delivery model of information technology from a product to a service. It makes software, platforms, and infrastructure resources available as scalable, on-demand services over the internet. However, the scale at which cloud services operate makes them inherently vulnerable to failures, which hampers their performance. Cloud computing services can be used to their full potential only if cloud service providers effectively handle the performance-related issues of reliability, availability, and throughput. Fault tolerance is therefore a critical requirement for achieving high performance in cloud computing. This paper presents a comprehensive overview of fault tolerance-related issues in cloud computing, emphasizing the significant concepts, architectural details, and state-of-the-art techniques and methods. The objective is to provide insights into existing fault tolerance approaches as well as the challenges that remain to be overcome. The survey enumerates a few promising techniques that may be used for efficient solutions and identifies important research directions in this area.
1. Introduction
Cloud computing refers to accessing, configuring, and manipulating resources (such as software and hardware) at a remote location [1]. R. Buyya et al. [2] defined cloud computing in terms of distributed computing: “A Cloud is a type of parallel and distributed system containing a set of interconnected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources based on service-level agreements established through negotiation between the service provider and consumers”.
According to the U.S. National Institute of Standards and Technology (NIST) definition: “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (for example servers, networks, storage, services, and applications) that can be quickly provisioned and released with least management effort or service provider interaction” [3].
3. Conclusion
Cloud computing provides features such as scalability, elasticity, and high availability. The cloud computing model has changed the IT industry, bringing benefits to individuals, researchers, organizations, and even countries. Despite these advantages, cloud systems remain susceptible to failures, which are inevitable given the scale of operation. Fault tolerance policies are commonly implemented to handle faults effectively in the cloud environment. Fault tolerance techniques help in preventing as well as tolerating faults in the system, which may arise from either hardware or software failure. The main motivation for employing fault tolerance techniques in cloud computing is to achieve failure recovery, high reliability, and enhanced availability.