Abstract
lexiDB is a scalable corpus database management system designed to answer corpus linguistics retrieval queries on multi-billion-word, multiply-annotated corpora. It is based on a distributed architecture that allows the system to scale out to support ever-larger text collections. This paper presents an overview of the architecture behind lexiDB as well as a demonstration of its functionality. We report lexiDB's performance on AWS (Amazon Web Services) infrastructure with two part-of-speech and semantically tagged billion-word corpora: Historical Hansard and EEBO (Early English Books Online).
V. CONCLUSION AND FURTHER WORK
In this paper, we have presented lexiDB, a new scalable corpus database management system designed specifically to support the indexing of text corpora and retrieval using the main methods employed in corpus linguistics. While other software achieves conceptually similar scalability, e.g. SketchEngine via virtualisation [4] and KorAP with Lucene/Solr integration, we believe that lexiDB is the first corpus database management system with in-built scalability via a distributed architecture. A key point to note about lexiDB is its fast data ingest time for extremely large-scale annotated corpora. Normally, fast ingest has to be traded off against fast retrieval, but we believe that corpus linguists need both capabilities to deal with corpus updates, for example in social network analysis, where new data needs to be added regularly. Fast indexing also reduces the time overhead between experiments: if we improve the accuracy of automatic annotation and retag our corpus, we do not need to wait 24 hours for re-indexing to complete before we can start obtaining updated results. lexiDB therefore addresses issues of ‘velocity’ in corpus databases and, in turn, enables support for greater ‘volume’ and ‘variety’ in the corpora indexed in the system. We have demonstrated the capabilities of lexiDB through evaluation on two multiply-annotated corpora on the scale of one billion words each and shown extremely fast retrieval times for the most frequent words. In addition, because of its distributed design, lexiDB can scale to even larger corpora by adding more nodes, and we will report on these results in further papers.
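To make the scale-out idea concrete, the following is a minimal sketch of a scatter-gather frequency query over a sharded, annotated corpus. It is an illustration only and does not reproduce lexiDB's actual API or storage format: the names `ShardNode`, `build_cluster` and `distributed_frequency` are hypothetical, the round-robin split is an assumed partitioning scheme, and in-process threads stand in for networked nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class ShardNode:
    """Hypothetical stand-in for one node holding a slice of the corpus."""

    def __init__(self, tokens):
        # tokens: list of (word, pos_tag) pairs; the index is built once at ingest.
        self.index = Counter(tokens)

    def frequency(self, word, pos=None):
        # Exact lookup when a tag is given, otherwise sum over all tags of the word.
        if pos is not None:
            return self.index[(word, pos)]
        return sum(n for (w, _), n in self.index.items() if w == word)


def build_cluster(tokens, node_count):
    # Assumed partitioning: round-robin split of the token stream across nodes.
    return [ShardNode(tokens[i::node_count]) for i in range(node_count)]


def distributed_frequency(nodes, word, pos=None):
    # Scatter the query to every node in parallel, then gather and sum the partial counts.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = pool.map(lambda node: node.frequency(word, pos), nodes)
        return sum(partials)


if __name__ == "__main__":
    corpus = [("the", "DT"), ("house", "NN"), ("the", "DT"), ("house", "VB")]
    cluster = build_cluster(corpus, node_count=2)
    print(distributed_frequency(cluster, "house"))
    print(distributed_frequency(cluster, "house", pos="NN"))
```

Because each node answers from its own slice and the partial counts are simply summed, the result is independent of the number of nodes, which is the property that lets such a design grow by adding nodes rather than enlarging a single index.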