Free download of the English paper "lexiDB: A Scalable Corpus Database Management System" - IEEE 2017

Title
lexiDB: A Scalable Corpus Database Management System
Pages in Persian paper
0
Pages in English paper
5
Year of publication
2017
Publisher
IEEE
Format of English paper
PDF
Product code
E7296
Fields related to this paper
Computer Engineering
Specializations related to this paper
Software
Venue
International Conference on Big Data
Abstract


lexiDB is a scalable corpus database management system designed to fulfill corpus linguistics retrieval queries on multi-billion-word, multiply-annotated corpora. It is based on a distributed architecture that allows the system to scale out to support ever larger text collections. This paper presents an overview of the architecture behind lexiDB as well as a demonstration of its functionality. We present lexiDB’s performance metrics on the AWS (Amazon Web Services) infrastructure with two part-of-speech and semantically tagged billion-word corpora: Historical Hansard and EEBO (Early English Books Online).

V. CONCLUSION AND FURTHER WORK


In this paper, we have presented lexiDB, a new scalable corpus database management system designed specifically to support the indexing of text corpora and retrieval using the main methods employed in corpus linguistics. While other software achieves conceptually similar scalability, e.g. SketchEngine via virtualisation [4] and KorAP with Lucene/Solr integration, we believe that lexiDB is the first corpus database management system with in-built scalability via a distributed architecture. A key point to note about lexiDB is its fast data ingest time for extremely large-scale annotated corpora. Normally this has to be traded off against fast retrieval time, but we believe that corpus linguists need both capabilities to deal with corpus updates, for example in social network analysis, where new data needs to be added regularly. Fast indexing also reduces the time overhead between experiments: if we improve the accuracy of automatic annotation and retag our corpus, we do not need to wait 24 hours for re-indexing to complete before we can start obtaining updated results. lexiDB therefore addresses issues of ‘velocity’ in corpus databases and, in turn, enables support for greater ‘volume’ and ‘variety’ in the corpora indexed in the system. We have demonstrated the capabilities of lexiDB through evaluation on two multiply-annotated corpora, each on the scale of one billion words, and shown extremely fast retrieval times for the most frequent words. In addition, because of its distributed design, lexiDB can scale to even larger corpora by adding more nodes, and we will report on these results in further papers.
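The general idea of scaling out corpus retrieval by adding nodes can be illustrated with a toy sketch. The code below is not lexiDB's implementation or API: the `Shard` and `Cluster` classes, the modulo routing of documents, and the whitespace tokeniser are all assumptions made for illustration. It shows a concordance (keyword-in-context) query, one of the standard retrieval methods in corpus linguistics, fanned out over several in-memory shards and merged.

```python
from collections import defaultdict


class Shard:
    """One hypothetical node: a slice of the corpus plus a positional index."""

    def __init__(self):
        self.docs = {}                   # doc_id -> list of tokens
        self.index = defaultdict(list)   # token -> [(doc_id, position), ...]

    def ingest(self, doc_id, text):
        # Naive whitespace tokenisation (an assumption; real corpora carry annotations).
        tokens = text.lower().split()
        self.docs[doc_id] = tokens
        for pos, tok in enumerate(tokens):
            self.index[tok].append((doc_id, pos))

    def concordance(self, word, window=2):
        # Return (doc_id, left context, keyword, right context) for every hit.
        hits = []
        for doc_id, pos in self.index.get(word.lower(), []):
            toks = self.docs[doc_id]
            left = " ".join(toks[max(0, pos - window):pos])
            right = " ".join(toks[pos + 1:pos + 1 + window])
            hits.append((doc_id, left, toks[pos], right))
        return hits


class Cluster:
    """Hypothetical coordinator: routes documents and merges per-shard results."""

    def __init__(self, n_shards):
        self.shards = [Shard() for _ in range(n_shards)]

    def ingest(self, doc_id, text):
        # Integer doc_id routed by modulo; each shard indexes only its own documents.
        self.shards[doc_id % len(self.shards)].ingest(doc_id, text)

    def concordance(self, word, window=2):
        # Scatter the query to every shard, then gather and sort the hits.
        hits = []
        for shard in self.shards:
            hits.extend(shard.concordance(word, window))
        return sorted(hits)
```

In this sketch each shard indexes only the documents routed to it, so ingest work is divided across shards, and a query returns the same hits regardless of how many shards the cluster has.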

