Conclusion and future work
We have presented SemLinker, an ontology-based data integration system for PDL and other similar data lake implementations. SemLinker allows casual users with limited technical background to integrate, process, and analyze heterogeneous raw data with minimal effort, through a unified conceptual representation of the data schemas with respect to a widely used global ontology. To the best of our knowledge, SemLinker is the first domain-agnostic integration system that offers self-adapting capabilities, based on solid theoretical foundations, to automatically integrate big data with frequently evolving schemas. SemLinker has been evaluated on large datasets in multiple domains. The results validate its integration effectiveness and functional efficiency and indicate that SemLinker’s performance is robust and promising, although there is still room for improvement in multiple aspects of the system.
Although SemLinker is a generic integration solution, it targets only structured and semi-structured data, and it is by no means a holistic integration solution when unstructured data such as free-text documents and multimedia files are also considered. For such data we proposed, in an earlier paper [48], SemCluster, an automatic keyphrase extraction tool that specializes in extracting keyphrases from free-text documents and annotating each keyphrase with ontology-based metadata. One of our planned immediate undertakings is to combine SemLinker and SemCluster into a broader integration solution, as a step towards an effective and efficient metadata management framework for the personal data lake.