5. Conclusions and future work
Progressive analytics can offer many advantages when adopted for big data management. The technique is particularly efficient in streaming scenarios, where data are continually updated and, thus, no a priori insight into their form is available. In this paper, we focus on data parallelism and assume an underlying progressive analytics service. We propose a mechanism for handling the responses retrieved by processors querying clusters of data. Each processor adopts a progressive analytics scheme and is responsible for returning early (partial) results together with a confidence value to our mechanism. We adopt the principles of Optimal Stopping Theory (OST) and model the behaviour of a Query Controller (QC) responsible for managing multiple queries. Building on top of the processors, we provide an intelligent decision-making mechanism. Our aim is to relieve users/applications of the responsibility of monitoring the continuous results retrieved by processors and deciding when it is the right time to stop the process in order to save time and resources; an illustrative sketch of such a threshold-based stopping rule is given at the end of this section. Two models are described: the first assumes a finite horizon scheme, while the second considers an infinite horizon setting. A large number of experiments reveals the efficiency of the proposed models. We focus on the throughput of the QC when working in a continuous query scenario and on the quality of the final outcome. Our results reveal a trade-off between throughput and the quality of the final outcome.

Future extensions of our work include the definition of an intelligent scheme for creating query plans and the resulting assignments of queries to specific processors. Every query will be assigned to specific processors, most likely a subset of the processors available to the QC. To this end, we will provide specific models for the characteristics of queries and processors. Through this approach, the efficiency of the proposed system will be maximized, as each query will be handled only by the processors on which its performance is expected to be highest. A learning technique will also be adopted to build an intelligent scheme for assigning queries to processors. For this, modelling the underlying data and adopting an algorithm that partitions them into appropriate pieces in the most efficient way seem imperative.
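To make the role of the stopping decision concrete, the following Python sketch shows a minimal finite-horizon, threshold-based stopping rule of the kind the QC could apply to the (partial result, confidence) pairs reported by a processor. It is illustrative only, not the paper's actual models: it assumes i.i.d. confidence values drawn uniformly from [0, 1], and the names finite_horizon_thresholds and run_query are hypothetical.

import random

def finite_horizon_thresholds(horizon):
    """Backward-induction thresholds for a finite-horizon stopping problem,
    assuming (for illustration only) i.i.d. confidences ~ Uniform(0, 1).

    thresholds[t] is the expected reward of continuing after observing the
    t-th partial result, so the QC stops as soon as the current confidence
    meets or exceeds it; the threshold shrinks as the horizon runs out."""
    values = [0.0]  # value of continuing with no observations left
    for _ in range(1, horizon):
        v = values[-1]
        # E[max(X, v)] for X ~ Uniform(0, 1) equals (1 + v^2) / 2
        values.append((1.0 + v * v) / 2.0)
    return list(reversed(values))  # index 0 corresponds to the first result

def run_query(horizon, partial_results):
    """Accept the first partial result whose confidence beats the threshold
    for its position; otherwise settle for the last one in the horizon."""
    thresholds = finite_horizon_thresholds(horizon)
    for t, (result, confidence) in enumerate(partial_results[:horizon]):
        if confidence >= thresholds[t]:
            return result, confidence, t  # early stop saves time and resources
    result, confidence = partial_results[horizon - 1]
    return result, confidence, horizon - 1

# Example: a stream of (partial_result, confidence) pairs from one processor.
stream = [(f"estimate_{i}", random.random()) for i in range(10)]
print(run_query(horizon=10, partial_results=stream))

The decreasing threshold captures the intuition behind the finite horizon model: early in the horizon the QC can afford to wait for a high-confidence partial result, while close to the deadline it accepts whatever is available.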