V. CONCLUSION AND FURTHER WORK
In this paper, we have presented lexiDB, a new scalable corpus database management system designed specifically to support the indexing of text corpora and retrieval from them using the main methods employed in corpus linguistics. While other software achieves conceptually similar scalability, e.g. SketchEngine via virtualisation [4] and KorAP via Lucene/Solr integration, we believe that lexiDB is the first corpus database management system with in-built scalability via a distributed architecture. A key feature of lexiDB is its fast data ingest time for extremely large-scale annotated corpora. Normally, fast ingest has to be traded off against fast retrieval, but we believe that corpus linguists need both capabilities in order to deal with corpus updates, for example in social network analysis, where new data needs to be added regularly. Fast indexing also reduces the time overhead between experiments: if we improve the accuracy of automatic annotation and retag our corpus, we do not need to wait 24 hours for re-indexing to complete before we can start obtaining updated results. lexiDB therefore addresses issues of ‘velocity’ in corpus databases and, in turn, enables support for greater ‘volume’ and ‘variety’ in the corpora indexed in the system. We have demonstrated the capabilities of lexiDB through evaluation on two multiply-annotated corpora on the scale of one billion words, and have shown extremely fast retrieval times for the most frequent words. In addition, because of its distributed design, lexiDB can scale to even larger corpora by adding more nodes, and we will report on these results in further papers.