5. Conclusions
Careful observation of modern online and offline web pages and files reveals an urgent need for a generic, naïve strategy that can handle documents that are structured, semi-structured, unstructured, hybrid, or heterogeneous, and that have multi-tasking and multilingual features. To meet this need, we developed a method that uses pixel-map manipulation to extract content from Indian regional web documents. The method was also tested with words from other Indian and foreign native languages to form a more elaborate base set. To assess the similarity between the trained and tested datasets, a number of new datasets containing new words were identified and tested using the present algorithm. Further analysis of new strategies and algorithms is in progress. A detailed state-of-the-art analysis can be carried out with neural networks [15] and cluster analysis. A comparison of statistical, neural, and pattern-matching algorithms would allow a better assessment of this generic approach.