Information Retrieval Models

Abstract. Information retrieval is an emerging field of computer science concerned with storing documents and retrieving them at the user's request. Its most essential task is retrieving the documents relevant to a given query. Efficient and effective retrieval models have been proposed for this task. This survey paper sheds light on some of these information retrieval models, which were built for different datasets and purposes, and presents a comparison among them.

Keywords: Information retrieval, retrieval models.
1 Introduction

A huge amount of information is available in electronic form, and its size is continuously increasing. Handling this information without an information retrieval system would be impossible. As the size of the data grew, researchers started paying attention to how relevant information can be obtained or extracted from it. Initially, much of information retrieval technology was based on experimentation and trial and error. Managing the increasing amount of textual information available in electronic form efficiently and effectively is critical. Different retrieval models, based on different terminologies, were formed to manage and extract information.

Information is mostly stored in the form of documents. The main purpose of a retrieval system is to find the information needed. An information retrieval system is a software program that stores and manages information on documents, often textual documents but possibly multimedia. The system assists users in finding the information they require. A perfect retrieval system would retrieve only the relevant documents, but in practice this is not possible because relevance depends on the subjective opinion of the user.
1.1 Basic model

Almost every retrieval model includes the following basic steps:

i. Document content representation
ii. Query representation
iii. Query and collection comparison
iv. Representation of results

Figure 1: Information retrieval process (Hiemstra, November 2009)

Many models represent documents in indexed form, as this is an efficient approach. Different algorithms have been developed especially for indexing, because the better the data is stored, the more accurately and efficiently it is retrieved.

Query formulation is the next important step. The user searches the data using keywords or phrases. In order to search for these phrases in the indexed collection, the query must be represented in the same form as the index. Indexing can be done in different ways, according to the content representation of both the documents in the collection and the user query (Cerulo, 2004) (Hiemstra, November 2009).

The results of any retrieval system depend on its comparison algorithm, which therefore determines the accuracy of the system.
The better the comparison, the better the results. The outcome of this comparison is a list of documents, each of which can be relevant or irrelevant. The main objective of a retrieval model is to measure the degree of relevance of a document with respect to the given query (Paik, August 13, 2015). Relevant documents are ranked higher than irrelevant ones and are shown at the top of the list, to minimize the time and effort the user spends searching.

The paper is divided into sections, each explaining a different model and its results, together with its advantages and limitations.

2 Retrieval Models

2.1 Exact match models

This model labels documents as relevant or irrelevant. It is also known as the Boolean model, and it is the earliest and easiest model for retrieving documents. It uses logical functions in the query to retrieve the required data.
George Boole's mathematical logic operators are combined with query terms and their respective document sets to form new sets of documents. There are three basic operators: AND (logical product), OR (logical sum) and NOT (logical difference) (Ricardo Baeza-Yates, 2009). The result of the AND operator is a set of documents smaller than or equal to the document set of any of the individual terms. The OR operator results in a document set that is bigger than or equal to the document set of any single term.

The Boolean model gives users a sense of control over the system. It distinguishes clearly between relevant and irrelevant documents if the query is accurate.
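The three operators map directly onto set operations over the document sets of the query terms. The following is a minimal sketch under stated assumptions: the toy documents, whitespace tokenization and the helper name `build_index` are illustrative choices and do not come from any of the surveyed systems.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical toy collection, used only for illustration.
docs = {
    1: "information retrieval models",
    2: "boolean retrieval uses logic",
    3: "vector space models rank documents",
}
index = build_index(docs)
all_ids = set(docs)

# AND is set intersection, OR is set union, NOT is set difference.
retrieval_and_models = index["retrieval"] & index["models"]
retrieval_or_models = index["retrieval"] | index["models"]
not_retrieval = all_ids - index["retrieval"]
```

The AND result can never be larger than either operand, and the OR result can never be smaller, which matches the size relations stated above. Every returned document is treated as equally relevant; no ordering is implied by the sets.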
The Boolean model does not rank documents, as the degree of relevance is ignored entirely. It either retrieves a document or it does not, which might frustrate the end user.

2.2 Region models

Region models are an extension of the Boolean model that reasons about arbitrary parts of textual data, called segments, extents or regions. A region might be a word, a phrase, a text element such as a title, or a complete document. Regions are identified by a start position and an end position. Region systems are not restricted to retrieving documents.
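A region can be represented as a pair of token positions, and a query can then select regions by containment. The sketch below is only an illustration: the `Region` type, the `containing` operator name and the toy token stream are assumptions made here, since the text above states only that regions have a start and an end position.

```python
from typing import NamedTuple

class Region(NamedTuple):
    start: int  # position of the first token in the region
    end: int    # position of the last token in the region (inclusive)

def contains(outer: Region, inner: Region) -> bool:
    """True when `inner` lies completely inside `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end

def containing(outers, inners):
    """Regions from `outers` that contain at least one region from `inners`."""
    return [o for o in outers if any(contains(o, i) for i in inners)]

# Hypothetical token stream with two title regions (positions 0-1 and 3-4).
tokens = "boolean models <sep> region models extend boolean retrieval".split()
titles = [Region(0, 1), Region(3, 4)]
word_hits = [Region(i, i) for i, tok in enumerate(tokens) if tok == "region"]
titles_with_region = containing(titles, word_hits)
```

The result is a list of title regions rather than whole documents, which illustrates why region systems are not restricted to document retrieval. As with the Boolean model, the result is an unordered match set.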
Region models did not have a big impact on the information retrieval research community, nor on the development of new retrieval systems. The reason is quite obvious: region models do not explain in any way how search results should be ranked. In fact, most region models are not concerned with ranking at all; one might say that they, like the relational model, are actually data models rather than information retrieval models (Mihajlović).

2.3 Ranking Models

Boolean models may miss important data because they do not support any ranking mechanism. There was therefore a need to introduce ranking algorithms into retrieval systems. The results are ranked on the basis of the occurrence of the query terms.
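The simplest form of ranking by term occurrence can be sketched as follows. This is an illustrative baseline only: the function name `rank_by_term_occurrence`, the whitespace tokenization and the toy documents are assumptions, not a method taken from the cited surveys.

```python
from collections import Counter

def rank_by_term_occurrence(query, docs):
    """Order document ids by how often the query terms occur in each document."""
    query_terms = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        scores[doc_id] = sum(counts[term] for term in query_terms)
    # Highest score first; ties are broken by document id to keep the order stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical toy collection.
docs = {
    "a": "retrieval models for text retrieval",
    "b": "text mining",
    "c": "cooking recipes",
}
ranking = rank_by_term_occurrence("text retrieval", docs)
```

Unlike the Boolean model, a document that matches only part of the query still receives a score and a position in the list.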
Some ranking algorithms depend only on the link structure of the documents, while others use document content as well as link structure to assign a rank value to a given document (Gupta, 2013).

2.4 Vector based model

The Vector Space Model (VSM) is a conventional information retrieval model that represents documents and queries as vectors in a multidimensional space. The basic idea is that, once indexing terms are extracted from a document collection, each document or query is represented as a vector of weighted term frequencies. Similarity comparisons among documents and/or between documents and queries are made via the similarity between two vectors (e.g. cosine similarity).

2.4.1 Similarity measures/coefficients

A similarity measure compares the document set with the query, and the documents with the highest similarity are returned to the user. Several methods are used to measure similarity, such as cosine similarity, often combined with tf-idf term weighting.

2.4.2 Cosine similarity

Cosine similarity computes the cosine of the angle between two vectors in n-dimensional space. The cosine similarity of two documents d and d' is given by:

$$\cos(d, d') = \frac{d \cdot d'}{\lVert d \rVert \, \lVert d' \rVert}$$

The performance of the vector-based retrieval model can be improved by utilizing user-supplied information about which documents are relevant to the query in question (Kita, oct 1, 2000).

Vaibhav Kant Singh and Vinay Kumar Singh (Vaibhav Kant Singh, 2015) describe the vector space model for information retrieval.
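The cosine measure over weighted term vectors can be sketched in a few lines. The particular tf-idf variant (raw term frequency times the natural logarithm of N/df), the helper names and the toy collection are assumptions chosen for illustration; the cited papers may weight terms differently.

```python
import math
from collections import Counter

def idf_table(docs):
    """Inverse document frequency for every term in the tokenized collection."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def vectorize(tokens, idf):
    """Sparse tf-idf vector {term: weight}; terms unknown to the collection are dropped."""
    tf = Counter(tokens)
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}

def cosine(u, v):
    """(u . v) / (|u| |v|), defined as 0.0 when either vector is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical tokenized collection and query.
docs = [
    ["boolean", "retrieval"],
    ["vector", "space", "retrieval"],
    ["neural", "network"],
]
idf = idf_table(docs)
query = vectorize(["vector", "retrieval"], idf)
scores = [cosine(query, vectorize(doc, idf)) for doc in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this toy run the second document ranks first because it shares both query terms, the first document follows because it shares only the common term "retrieval", and the third document scores zero.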
The VSM guides the user to the documents that are more similar, and hence more significant, by calculating the angle between the query and the terms or documents. Here documents are represented as term vectors

$d = (t_1, t_2, t_3, \ldots, t_n)$

where each $t_i$, $1 \le i \le n$, is a non-negative value denoting the occurrences of term $i$ in the document. Some measures of the vector space model are binary, taking values in $\{0, 1\}$.

2.5 Probabilistic model

The probabilistic model is based on the probability ranking principle. It uses statistics to estimate the probability that a retrieved document is relevant or non-relevant to the information need. Probabilistic models employ conditional probabilities given the occurrence of terms. The model states that the retrieval system should rank the set of documents according to their probability of being relevant to the query, given all the available evidence.
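One common instance of this principle is the binary independence model, in which documents are binary term sets and terms are assumed independent. The sketch below uses the usual simplification for the case where no relevance information is available, which reduces each term weight to an idf-like value. The formula choice, the 0.5 smoothing constant, the function names and the toy data are assumptions of this illustration and are not stated in the survey.

```python
import math

def bim_term_weight(n_docs, doc_freq):
    """Term weight without relevance information: log((N - n + 0.5) / (n + 0.5))."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bim_rank(query_terms, docs):
    """Rank documents given as {doc_id: set of terms} (binary representation)."""
    n_docs = len(docs)
    df = {t: sum(t in terms for terms in docs.values()) for t in query_terms}
    scores = {
        doc_id: sum(bim_term_weight(n_docs, df[t]) for t in query_terms if t in terms)
        for doc_id, terms in docs.items()
    }
    # Decreasing score; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical toy collection.
docs = {
    "d1": {"probabilistic", "retrieval", "model"},
    "d2": {"retrieval", "system"},
    "d3": {"cooking"},
    "d4": {"travel"},
    "d5": {"music"},
}
ranked = bim_rank(["probabilistic", "retrieval"], docs)
```

Note that this weight becomes negative for terms that occur in more than half of the collection, which is one reason practical systems use refined variants.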
The documents are ranked in decreasing order of probability. Index terms are weighted in binary form.

2.6 Bayesian network Model

A Bayesian network model (BNM) is a directed acyclic graphical model, which means that it contains no directed cycles, and it deals with random variables. A BNM contains a set of random variables and the conditional probability dependencies between them. Such models are also known as belief networks, causal nets, etc. A BNM ranks documents by using multiple pieces of evidence to compute a conditional probability. The graphical representation of the probability distribution is used to analyse complex conditional independence assumptions.
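Combining multiple pieces of evidence into one conditional probability can be illustrated with a very small network. The sketch assumes a single relevance variable R with an edge to each binary evidence variable, so the evidence variables are conditionally independent given R. This structure, the function name and all probabilities are made-up assumptions for illustration; real Bayesian network retrieval models are considerably richer.

```python
def posterior_relevance(prior, likelihoods, evidence):
    """P(R=1 | evidence) for a network with edges R -> E_1, ..., R -> E_k.

    likelihoods[i] is the pair (P(E_i=1 | R=1), P(E_i=1 | R=0));
    evidence[i] is the observed value of E_i, either 0 or 1.
    """
    joint_rel, joint_non = prior, 1.0 - prior
    for (p_rel, p_non), observed in zip(likelihoods, evidence):
        joint_rel *= p_rel if observed else 1.0 - p_rel
        joint_non *= p_non if observed else 1.0 - p_non
    # Bayes' rule: normalize the joint probabilities over both values of R.
    return joint_rel / (joint_rel + joint_non)

# Hypothetical numbers: prior relevance 0.2 and two evidence variables.
likelihoods = [(0.8, 0.1), (0.6, 0.3)]
both_observed = posterior_relevance(0.2, likelihoods, [1, 1])
none_observed = posterior_relevance(0.2, likelihoods, [0, 0])
```

With both pieces of evidence present the posterior rises from the prior of 0.2 to 0.8, and with neither present it falls to about 0.03, so documents can be ordered by this value.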
2.7 Inference Network Model

In the inference network retrieval model, the random variables are associated with four layers of nodes: a query node, a set of document nodes, representation nodes and index word nodes. All the nodes in this model represent binary random variables with values in {0, 1}, and the edges represent the dependencies between them.
Figure 1: Simplified inference network model (Hiemstra, November 2009)

2.8 Language based models

Language-based models are retrieval models based on an idea from speech recognition. Speech recognition depends on two distinct models: the acoustic model and the language model.
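In retrieval, the usual way to apply a language model is the query likelihood approach: each document defines a probability distribution over terms, and documents are scored by the probability of generating the query. The sketch below assumes a unigram model with Jelinek-Mercer smoothing against the whole collection; the smoothing method, the value of `lam` and the toy data are assumptions of this illustration, chosen to show how the data sparsity problem is typically handled.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection_tf, collection_len, lam=0.5):
    """log P(query | doc) under a smoothed unigram document model."""
    doc_tf = Counter(doc)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc)                  # maximum likelihood in the document
        p_coll = collection_tf[term] / collection_len    # background collection model
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:  # the term never occurs in the collection
            return float("-inf")
        score += math.log(p)
    return score

# Hypothetical tokenized collection.
docs = [
    ["language", "model", "retrieval"],
    ["speech", "recognition", "model"],
]
collection_tf = Counter(term for doc in docs for term in doc)
collection_len = sum(len(doc) for doc in docs)
query = ["language", "retrieval"]
scores = [query_log_likelihood(query, d, collection_tf, collection_len) for d in docs]
```

Without smoothing, the second document would receive probability zero because it contains neither query term; mixing in the collection model keeps its score finite while still ranking it below the first document.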
A language model is computed for each document in the collection and is based on its terms. Documents are ranked by the probability that their language model generates the query.

2.8.1 Relevance based language model

This model, presented by Javier Parapar et al. (Javier Parapar, 2013), is also known as the "Relevance Model" (RM). It exploits the concept of relevance, i.e. the relation between query and document, in the language modelling approach to information retrieval. The model focuses on enhancing retrieval effectiveness: it modifies the query words by matching the query to an approximate set of relevant documents, in order to achieve better search results. It is regarded as among the best ranking models for text retrieval.

RM has been popularly applied to recommender systems, which display related items to users and save their search time, in order to enhance their performance.
Here the user replaces both the query and the document, while items replace terms, and the modification is now applied to the item search. The technique of pseudo relevance feedback is used to achieve the enhancement, i.e. better IR techniques are used to make predictions and guesses about the related items.

2.9 Alternative Algebraic model

Under this heading we discuss latent semantic indexing, ontology-based information retrieval and the neural network model.

2.9.1 Latent Semantic Indexing

Latent semantic indexing (LSI) helps retrieve information accurately from large databases. The similarity of documents depends on the contexts of the words that occur in them and of those that do not.
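LSI is commonly computed with a truncated singular value decomposition of the term-document matrix. The sketch below assumes NumPy is available and uses a made-up five-term, four-document count matrix; the choice of k = 2 latent dimensions is arbitrary and only suits this toy example.

```python
import numpy as np

# Rows are terms, columns are documents (hypothetical counts).
# Documents 0 and 1 use different words ("car" vs "automobile") plus "engine".
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 1, 0],  # engine
    [0, 0, 0, 1],  # flower
    [0, 0, 0, 1],  # petal
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions kept
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

original_sim = cosine(A[:, 0], A[:, 1])                   # similarity in term space
latent_sim = cosine(doc_vectors[0], doc_vectors[1])       # similarity in latent space
unrelated_sim = cosine(doc_vectors[0], doc_vectors[3])    # car document vs flower document
```

In term space the two vehicle documents have a cosine of only 0.5, because "car" and "automobile" are different terms. In the two-dimensional latent space they become practically identical, while the flower document stays orthogonal to them. This is the synonymy effect that motivates LSI.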
LSI combines the idea of singular value decomposition (SVD) with the vector space model. It picks up documents that are semantically similar, i.e. on the same topic, even when they are not similar in the original vector space, and represents them in a reduced vector space in which their similarity is highest. To compute LSI using SVD, a matrix A is decomposed into three matrices:

$$A = U \Sigma V^T$$

where $\Sigma$ is a diagonal matrix, $U$ is an orthogonal matrix and $V^T$ is the transpose of an orthogonal matrix.

Jin Wang et al. (Jin Wang, 9May2012) propose a model that uses the bag-of-words model for the analysis of human motions in video frames.

2.9.2 Ontology-based Information Retrieval

One of the most rapidly emerging fields of information retrieval and extraction nowadays is ontology-based information retrieval (OBIE). OBIE is defined as the use of ontologies in order to retrieve information.
An ontology is a specification of the conceptualization of terms or words. Ontologies are generally domain-specific, which means that different domains have different ontologies. Because they are domain-specific, they define relationships between classes and entities. They are also application dependent.
An ontology tree is a hierarchical representation of classes or entities and of the relationships between them; entities are grouped and classified on the basis of their similarities and dissimilarities.

Figure 2: Ontology based information extraction (Ritesh Shah, February 2014)

2.9.3 Neural Network Model

Neural network models consist of interconnected neurons and have a labelled, directed graph structure. The nodes of a neural network graph perform calculations in order to produce an output. A directed graph contains nodes (vertices) and connections between the nodes, while a labelled graph is one in which every connection carries a label identifying its properties.
In a neural network model, the nodes of the graph behave as processing units, the edges behave as synaptic connections, and weights are assigned to the edges. In information retrieval, neural network models contain query-term nodes and document-term nodes. The query-term nodes initiate the retrieval process by sending signals to the document-term nodes, which in turn send signals to the document nodes. Igor Mokriš and Lenka Skovajsová (Igor MOKRIŠ, 2006) describe a neural network retrieval model for the Slovak language.

Conclusion

This survey paper discussed different information retrieval techniques together with their advantages and disadvantages.
Each model has its own criteria for extracting the documents relevant to the user's query and for dealing with a given information need. We conclude that some methods work best for some applications, while others work best for other applications. Information retrieval systems are being used in different organizations, and new models are still being developed to obtain more relevant results.

Comparison of models: related work, methods, advantages and limitations

Exact match model
Related work and methods:
i. David E. Losada
ii. Based on set theory and Boolean algebra
iii. Query represented by a Boolean expression
iv. Terms combined with the operators AND, OR, NOT
v. Proximity
vi. Stemming
Advantages:
i. Easy to implement
ii. Exact match model
iii. Computationally efficient
Limitations:
i. No term weighting used in document and query
ii. Adds too much complexity and detail
iii. Difficult for end users to form a correct Boolean query
iv. No ranking
v. No partial matching

Vector space model
Methods:
i. Weighting scheme used
ii. Cosine similarity
iii. Documents ranked by similarity
Advantages:
i. Term weighting improves retrieval performance
ii. Similarity can be used for different elements
Limitations:
i. Term independence assumption
ii. Users cannot specify relationships between terms

Probabilistic model
Methods:
i. Based on the probability ranking principle
ii. Based on the relevance and non-relevance of data
Advantages:
i. Ranking of documents
ii. Does not consider index inside a document
Limitations:
i. Binary word-in-document weights
ii. Independence of terms
iii. Only partial ranking of documents
iv. Based on prior knowledge

Language based models
Methods:
i. Probability estimation of events in text
ii. Query likelihood model
iii. Speech recognition
iv. Term based, for each document in the entire collection
Advantages:
i. Length normalization of term frequencies
Limitations:
i. Data sparsity

Bayesian network model
Methods:
i. Directed graphical model
ii. Relationships between random variables are captured by directed edges
Advantages:
i. Deals with noisy data
ii. Describes the interaction between query and document space
Limitations:
i. Query specification based on Boolean expressions
ii. Expensive computation
iii. Bad performance for small collections

Inference network model
Methods:
i. Random variables concerned with the query, the set of documents and the index words
Advantages:
i. Provides a framework with the possible ranking strategies used
Limitations:
i. Boolean query formulation

Latent semantic indexing
Methods:
i. Concept-based retrieval of text
ii. Uses SVD
Advantages:
i. Retrieves documents even if they share no keyword with the query
ii. Solves the problem of ambiguities (polysemy and synonymy)
Limitations:
i. Expensive
ii. Works on small collections

Ontology-based information retrieval
Methods:
i. Entities classified in a hierarchical manner
ii. Keyword matching based
Advantages:
i. Capability to reuse and share the ontology with other applications
Limitations:
i. High time consumption
ii. Difficulties in creating the ontology tree
iii. Adding a new concept to an existing ontology requires considerable time and effort

Neural network model
Methods:
i. Neural based
ii. Weights assigned to the edges between neurons
Advantages:
i. Easy to use, but requires some statistical training
ii. Deals with large collections of data
iii. Detects relationships between the query and the retrieved documents
Limitations:
i. Difficult to design
ii. Expensive
iii. Complicated in nature
iv. Does not deal with small documents

References

Cerulo, G. C. (2004). A Taxonomy of Information. Journal of Computing and Information Technology, 175–194.
Daniel Valcarce, J. P. (n.d.). A Study of Smoothing Methods for Relevance-Based Language Modelling of Recommender Systems. Information Retrieval Lab, Computer Science Department, University of A Coruña, Spain. {daniel.valcarce,javierparapar,barr iro}@udc.es.
Gupta, P. K. (2013). Survey Paper on Information Retrieval Algorithms and Personalized Information Retrieval Concept. International Journal of Computer Applications.
Hiemstra, D. (November 2009). Information Retrieval Models. Published in: Goker, A., and Davies, J. Information Retrieval: Searching in the 21st Century. John Wiley and Sons, Ltd.
Hui Yang, M. S. (2014). Dynamic Information Retrieval Modeling. SIGIR'14, July 6–11, 2014, Gold Coast, Queensland, Australia. ACM 978-1-4503-2257-7/14/07. http://dx.doi.org/10.1145/2600428.2602297.
Igor MOKRIŠ, L. S. (2006). Neural Network Model Of System For Information Retrieval From Text Documents In Slovak Language. Acta Electrotechnica et.
Javier Parapar, et al. (2013). Relevance-based language modelling for recommender systems. Information Processing and Management, Elsevier.
Jin Wang, n. P. (9 May 2012). Supervised learning probabilistic Latent Semantic Analysis. Elsevier.
Kita, X. T. (oct 1, 2000). Improvement of vector space information retrieval model based on supervised learning. IRAL '00: Proceedings of the fifth international workshop on Information retrieval with Asian languages, ACM, New York, NY, USA, 69-74.
Koltun, E. K. (4 July 2012). A Probabilistic Model for Component-Based Shape Synthesis. ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH 2012.
Mihajlović, D. H. (n.d.). A database approach to information retrieval: The remarkable. University of Twente, Centre for Telematics and Information Technology.
Paik, J. H. (August 13, 2015). A Probabilistic Model for Information Retrieval Based on Maximum Value Distribution. University of Maryland, College Park, USA, SIGIR'15.
Ricardo Baeza-Yates, B.-N. (2009). Modern Information Retrieval. ACM Press, New York.
Ritesh Shah, S. J. (February 2014). Ontology-based Information Extraction: An Overview. International Journal of Computer Applications (0975 – 8887).
Vaibhav Kant Singh, V. K. (2015). Vector Space Model: An Information Retrieval. International Journal of Advanced Engineering Research and Studies.
Xi-Quan Yang, D. Y.-H. (2014). Scientific literature retrieval model based on weighted term frequency. IEEE Computing Society.