Abstract– Many efforts have been made towards semantic multilingual text retrieval from the web (1). Building on those efforts, this paper presents an approach that processes bilingual (Sindhi-English interconvertible) text semantically before it is searched for on the internet.
There are two major parts to the research: the first is to create an ontological corpus, and the second is to incorporate the ontology programmatically for web search.

Keywords: Pre-web search, Bi-lingual, Sindhi-English, Ontological, Corpus

1. Introduction

The world is culturally diverse, but linguistic imperialism erodes that diversity day by day. At the same time, technological advancements, especially in internet technologies, have made it easy to access online information in various languages at once, e.g.
Google Translate (2). Multilingual information access is now easy for those who want not only the information available in a particular dominant language but also whatever small amount of information is available in a particular desired language.

The primary reason for choosing the two languages Sindhi and English is their richness in the computational world and the patience of their speakers towards the improvement of their language. Sindhi is one of the oldest languages of the Indo-Pak region and has a large number of characters (52 characters). Its initial script was similar to Sanskrit, but nowadays the most widely used is the Perso-Arabic script, which has very rich literary content (3). In this research the Perso-Arabic script is preferred.
Sindhi is the official language of Sindh Province in Pakistan, and there are currently more than 34.4 million Sindhi-speaking people in Pakistan (4). Sindhi is also widely spoken in various regions of India, by approximately 2.8 million people (5). English, in turn, is not only the most dominant language of the world (6) but also the computationally richest.

When a query is entered in Google’s search box in English, non-English results are not expected. In the case of English, it is affordable to ignore search results that might be available on the internet in other languages, owing to the vastness of the language and its correspondingly large bank of results. In the case of Sindhi, there is a great benefit in retrieving the results that match the Sindhi text.
In parallel, the proposed system also retrieves the English results whose meaning matches the Sindhi words (7).

Sindhi-English refers to the two-way translation between the Sindhi and English languages, which will be accomplished using an Ontological Corpus. In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed), and an ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them (8). An Ontological Corpus is simply an ontology-based corpus, which differs from an ordinary text corpus by using predicate logic.

Instead of manually entering the intended search query in each desired language, it would be a great benefit to have a system that performs this task smoothly in the background. In this regard, this research introduces the novel idea of a Pre-Web Search user query processing system that interconverts the query between the given languages and then passes it to the search engine, so that the engine returns all available search results in both languages.
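
The pre-web search idea described above can be illustrated with a minimal Python sketch. It is not the system developed in this research: the two-entry lexicon stands in for the ontological corpus, the function names (interconvert, pre_web_search, search_url) are hypothetical, the sample word pairs are illustrative only, and both the OR query syntax and the example.com endpoint are assumptions about a generic search engine.

```python
from urllib.parse import quote_plus

# Hypothetical two-way lexicon. A real system would draw these pairs
# from the ontological corpus instead of a hard-coded dictionary.
SINDHI_TO_ENGLISH = {
    "پاڻي": "water",
    "ڪتاب": "book",
}
ENGLISH_TO_SINDHI = {en: sd for sd, en in SINDHI_TO_ENGLISH.items()}


def interconvert(query: str) -> str:
    """Swap each known term into the other language; keep unknown terms."""
    converted = []
    for term in query.split():
        if term in SINDHI_TO_ENGLISH:
            converted.append(SINDHI_TO_ENGLISH[term])
        elif term.lower() in ENGLISH_TO_SINDHI:
            converted.append(ENGLISH_TO_SINDHI[term.lower()])
        else:
            converted.append(term)
    return " ".join(converted)


def pre_web_search(query: str) -> str:
    """Combine the original and interconverted query into one OR query."""
    other = interconvert(query)
    if other == query:
        # Nothing could be converted, so pass the query through unchanged.
        return query
    return f"({query}) OR ({other})"


def search_url(query: str, base: str = "https://www.example.com/search?q=") -> str:
    """Build a URL for a placeholder search endpoint (an assumption)."""
    return base + quote_plus(pre_web_search(query))
```

Calling pre_web_search on an English query that contains a known term yields a combined query in both languages, while a query with no known terms is passed through untouched. A word-by-word lookup like this ignores word order and ambiguity, which is where relations stored in an ontology would be expected to help.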
Following the above premise, this research provides a case study based on a pilot project developed using two sample languages, i.e. Sindhi and English.

2. Background

The purpose of Semantic Web technologies (SWT) is to make web content understandable to machines, so that they can classify and recognize the information available on the internet as humans do (9).
However, the flexible nature of XML-based technologies (RDF, DLL, SPARQL etc.) has made them useful for information structuring and processing in many other areas of information technology.

In recent years the development of ontologies, explicit formal specifications of the terms in the domain and relations among them (Gruber 1993), has been moving from the realm of Artificial-Intelligence laboratories to the desktops of domain experts. Ontologies have become common on the World-Wide Web. The ontologies on the Web range from large taxonomies categorizing Web sites (such as on Yahoo!) to categorizations of products for sale and their features (such as on Amazon.com).
The WWW Consortium (W3C) is developing the Resource Description Framework (Brickley and Guha 1999), a language for encoding knowledge on Web pages to make it understandable to electronic agents searching for information. The Defense Advanced Research Projects Agency (DARPA), in conjunction with the W3C, is developing DARPA Agent Markup Language (DAML) by extending RDF with more expressive constructs aimed at facilitating agent interaction on the Web (Hendler and McGuinness 2000). Many disciplines now develop standardized ontologies that domain experts can use to share and annotate information in their fields.

An ontology defines a common vocabulary for researchers who need to share information in a domain.
It includes machine-interpretable definitions of basic concepts in the domain and relations among them. Why would someone want to develop an ontology? Some of the reasons are (Ontology Development 101):

- To share common understanding of the structure of information among people or software agents
- To enable reuse of domain knowledge
- To make domain assumptions explicit
- To separate domain knowledge from the operational knowledge
- To analyze domain knowledge

Parallel corpora are a valuable resource for machine translation, multilingual text retrieval, language education and other applications (BITS). A parallel ontology can play the role of a parallel corpus when it is limited to particular relations (Building a bilingual bio-ontology platform for knowledge discovery).

To the best of our knowledge, this research is the first attempt to develop an ontological corpus for the Sindhi language. In fact, Sindhi is one of the fortunate languages of the region, in that it is advancing in different computational areas.
Work that has already been done can be categorized as follows:

a- Unicode based - Text processing
b- NLP - Corpus based
c- Ontological - in other similar languages, e.g. Urdu

The majority of the work done in Sindhi computing concerns word processing and Unicode-based Sindhi typing systems (Majid Bhurgri's work); the next most popular areas are dictionaries (Unicode based bilingual Sindhi English dictionary) and spell checking (Phonetic based Sindhi spell checker). A few early efforts have been made in the field of natural language processing (Towards Sindhi corpus construction; word tokenization model; Word segmentation model for Sindhi).

The Urdu language, which is syntactically and phonetically one of the closest languages to Sindhi, has valuable work in the field of Semantic Web technologies. The work most similar to this research is a semantic annotation model for web documents based on ontologies. That model differs fundamentally from the one presented here: it focuses on the semantics of information already available on the web, whereas this research focuses on the semantics of information before it is passed to the web.