Information Retrieval Models


Abstract. Information retrieval is an emerging field of computer science concerned with storing documents and retrieving them at a user's request. Its most essential task is retrieving the documents that are relevant to a given query. Efficient and effective retrieval models have been proposed for this task. This survey paper sheds light on some of these information retrieval models, which have been built for different datasets and purposes. A comparison among these models is also given.


Keywords: Information
retrieval, retrieval models.

1   Introduction

A large amount of information is available in electronic form, and its size is continuously increasing. Handling this information without an information retrieval system would be impossible. As the size of the data increases, researchers pay more attention to how relevant information can be obtained or extracted from it. Initially, much of information retrieval technology was based on experimentation and trial and error.

Managing the increasing amount of textual information available in electronic form efficiently and effectively is critical. Different retrieval models were formed, based on different terminologies, to manage and extract information. Information is mostly stored in the form of documents, and the main purpose of a retrieval system is to find the information needed. An information retrieval system is a software program that stores and manages information on documents, often textual documents but possibly multimedia. The system assists users in finding the information they require. A perfect retrieval system would retrieve only the relevant documents, but in practice this is not possible because relevance depends on the subjective opinion of the user.

1.1   Basic model

Every retrieval model includes the following basic steps:

Content representation
and collection comparison

Representation of


Figure 1 information retrieval process (Hiemstra, November 2009)

Retrieval models represent documents in indexed form, as this is an efficient approach. Different algorithms have been developed specifically for indexing, because the better the data is stored, the more accurately and efficiently it can be retrieved.

Query formulation is the next important step. The user searches for data using keywords or phrases. To search for these phrases in the indexed collection, the query must be represented in the same form as the documents. Indexing can be done in different ways, depending on the content representation of both the documents in the collection and the user query (Cerulo, 2004) (Hiemstra, November 2009).

The results of any retrieval system depend on its comparison algorithm, which therefore determines the accuracy of the system. The better the comparison, the better the results. The outcome of this comparison is a list of documents that may be relevant or irrelevant. The main objective of a retrieval model is to measure the degree of relevance of a document with respect to the given query (Paik, August 13, 2015).

Relevant documents are ranked higher than irrelevant ones and are shown at the top of the list, to minimize the time and effort the user spends searching the documents.

The paper is divided into sections, each explaining different models and their results, together with their advantages and limitations.

2   Retrieval Models

2.1 Exact match models


The exact match model labels documents as relevant or irrelevant. It is also known as the Boolean model, the earliest and the easiest model for retrieving documents. It uses logical functions in the query to retrieve the required data. George Boole's mathematical logic operators are combined with query terms and their respective document sets to form new sets of documents. There are three basic operators: AND (logical product), OR (logical sum) and NOT (logical difference) (Ricardo Baeza-Yates, 2009). The result of the AND operator is a set of documents smaller than or equal to the document set of any of the single terms. The OR operator results in a document set that is bigger than or equal to the document sets of the single terms.
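The three operators can be illustrated with ordinary set operations. The sketch below is only an illustration under stated assumptions: the three toy documents, the `build_index` and `docs_for` helpers and the whitespace tokenization are all hypothetical and do not come from any of the surveyed systems.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "information retrieval models",
    2: "boolean retrieval model",
    3: "vector space model",
}


def build_index(documents):
    """Build an inverted index mapping each term to the set of ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


index = build_index(docs)
all_ids = set(docs)


def docs_for(term):
    """Return the document set of a single term (empty if the term is unknown)."""
    return index.get(term, set())


# AND is set intersection, OR is set union, NOT is set difference.
and_result = docs_for("retrieval") & docs_for("model")
or_result = docs_for("retrieval") | docs_for("model")
not_result = all_ids - docs_for("retrieval")

print(and_result, or_result, not_result)
```

The AND result can never be larger than either operand's document set, and the OR result can never be smaller, which matches the size relations stated above. Note that every document is either in a result set or not; nothing in this sketch orders the results.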

This model gives users a sense of control over the system. It distinguishes clearly between relevant and irrelevant documents if the query is accurate. The model does not rank documents, because the degree of relevance is ignored entirely. It either retrieves a document or not, which might cause frustration for end users.

2.2 Region models

Region models are an extension of the Boolean model that reason about arbitrary parts of textual data, called segments, extents or regions. A region might be a word, a phrase, a text element such as a title, or a complete document. Regions are identified by a start position and an end position. Region systems are not restricted to retrieving documents.
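A minimal sketch of the idea follows, assuming token positions as the coordinate system and an inclusive end position. The `Region` type, the `word_regions` and `containing` helpers and the sample text are hypothetical names chosen for illustration; the source does not prescribe a concrete set of region operators.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int  # position of the first token of the region
    end: int    # position of the last token of the region (inclusive)


def word_regions(tokens, word):
    """Every occurrence of a word is a one-token region."""
    return [Region(i, i) for i, tok in enumerate(tokens) if tok == word]


def contains(outer, inner):
    """True if `inner` lies entirely within `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def containing(outers, inners):
    """Return the regions in `outers` that contain at least one region in `inners`."""
    return [o for o in outers if any(contains(o, i) for i in inners)]


tokens = "boolean model the vector model ranks documents".split()
title = Region(0, 1)      # assumed markup: the first two tokens form a title
paragraph = Region(2, 6)  # assumed markup: the remaining tokens form a paragraph

with_vector = containing([title, paragraph], word_regions(tokens, "vector"))
with_model = containing([title, paragraph], word_regions(tokens, "model"))
print(with_vector, with_model)
```

The query result is a set of regions rather than a set of documents, so the same operator can return titles, paragraphs or whole documents depending on which regions are passed in.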

Region models did not have a big impact on the information retrieval research community, nor on the development of new retrieval systems. The reason for this is quite obvious: region models do not explain in any way how search results should be ranked. In fact, most region models are not concerned with ranking at all; one might say that, like the relational model, they are actually data models instead of information retrieval models (Mihajlović).


2.3 Ranking Models

Boolean models may skip important data because they do not support a ranking mechanism. There was therefore a need to introduce ranking algorithms into retrieval systems. The results are ranked on the basis of the occurrence of the query terms. Some ranking algorithms depend only on the link structure of the documents, while others use a combination of both, that is, they use the document content as well as the link structure to assign a rank value to a given document (Gupta, 2013).


2.4 Vector based model

The Vector Space Model (VSM) is a conventional information retrieval model that represents documents and queries by vectors in a multidimensional space. The basic idea is that when indexing terms are extracted from a document collection, each document or query is represented as a vector of weighted term frequencies. Similarity comparisons among documents and/or between documents and queries are made via the similarity between two vectors (e.g. cosine similarity).


Given the document set and the query, a similarity measure is used to compare them, and the documents most similar to the query are returned to the user. Several methods are used in measuring similarity, including cosine similarity and tf-idf weighting.
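As an illustration of term weighting, the sketch below builds tf-idf vectors for a toy collection. The exact variant used here, raw term frequency multiplied by log(N / df), is an assumption, since the source does not specify a formula; the function name `tf_idf_vectors` and the sample documents are likewise hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one sparse vector (term -> weight) per document.

    Assumed weighting: tf(t, d) * log(N / df(t)), where N is the number of
    documents and df(t) is the number of documents containing term t.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: count each term at most once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


vectors = tf_idf_vectors([
    "retrieval model retrieval",
    "vector model",
    "boolean model",
])
for vec in vectors:
    print(vec)
```

A term that occurs in every document, such as "model" in this toy collection, receives weight zero, while a term that is frequent in one document and rare elsewhere receives a high weight.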

2.4.2 Cosine

The cosine similarity measures the cosine of the angle between two vectors in n-dimensional space. The cosine similarity between documents d and d' is given by

\[ \cos(d, d') = \frac{d \cdot d'}{|d| \, |d'|} \]
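The formula can be sketched directly in code. The example below assumes sparse vectors stored as dictionaries from term to weight; the function names, the zero-vector convention and the sample weights are hypothetical choices made for illustration.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two sparse vectors (term -> weight)."""
    dot = sum(weight * d2.get(term, 0.0) for term, weight in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumed convention: an empty vector is similar to nothing
    return dot / (norm1 * norm2)


def rank(query, documents):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in documents.items()]
    return sorted(scored, reverse=True)


query = {"vector": 1.0, "model": 1.0}
documents = {
    "d1": {"vector": 2.0, "model": 2.0},   # same direction as the query
    "d2": {"boolean": 1.0, "model": 1.0},  # shares one term with the query
    "d3": {"boolean": 3.0},                # shares no term with the query
}
ranking = rank(query, documents)
print(ranking)
```

Because the measure depends only on the angle, d1 scores highest even though its weights are twice those of the query; vector length does not affect the result.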


The performance of the vector-based retrieval model can be improved by utilizing user-supplied information about which documents are relevant to the query in question (Kita, Oct 1, 2000).

Vaibhav Kant Singh and Vinay Kumar Singh (Vaibhav Kant Singh, 2015) describe the vector space model for information retrieval. The VSM guides the user to the documents that are more similar and more significant by calculating the angle between the query and the terms or the documents. Here documents are represented as term vectors:

\[ d = (t_1, t_2, \ldots, t_t) \]

where $t_i$, $1 \le i \le t$, is a non-negative value that denotes the number of occurrences of term $i$ in the document; in the binary case, $t_i \in \{0, 1\}$.

2.5 Probabilistic model

The probabilistic model is based on the probability ranking principle. Statistics are used to estimate the probability that a retrieved document is relevant or non-relevant to the information need. Probabilistic models employ conditional probabilities given the occurrence of terms. According to the probability ranking principle, the retrieval system ranks the set of documents by their probability of being relevant to the query, given all the available evidence. The documents are presented in decreasing order of probability. Index term weights are represented in binary form.

2.6 Bayesian network Model

A Bayesian network model (BNM) is a directed acyclic graphical model, meaning that it contains no directed cycles, and it deals with random variables. A BNM contains a set of random variables and the conditional probability dependencies between them. Such models are also known as belief networks, causal nets, etc. A BNM ranks documents by using multiple pieces of evidence to compute a conditional probability. The graphical representation of the probability distribution makes it possible to analyse complex conditional independence assumptions.

2.7 Inference Network Model

In the inference network retrieval model, the random variables are arranged in four layers of nodes: a query node, a set of document nodes, representation nodes and index word nodes. Dependencies between the random variables are represented as edges. All the nodes in this model represent random variables with binary values {0, 1}.

Figure 1 simplified inference network model (Hiemstra, November 2009)

2.8 Language based models

Language based models are a type of retrieval model based on an idea from speech recognition.
Speech recognition depends on two main models: the acoustic model and the language model. In retrieval, a language model is computed for each document in the collection and is based on terms. Documents are ranked by the probability of generating the query.

2.8.1 Relevance based language model

The relevance-based language model discussed by (Javier Parapar, 2013) is also known as the "Relevance Model" (RM). It exploits the concept of relevance, i.e. the relation between query and document, in language modelling for information retrieval. The model focuses on enhancing retrieval effectiveness: it modifies the query words by matching the query against an approximate set of relevant documents in order to achieve better search results. It is regarded as among the best ranking models for text retrieval.

RM has been widely applied to recommender systems, which display related items to users and save their search time, in order to enhance their performance. Here the user replaces the query and document, items replace terms, and the modification is applied to the item search. The technique of pseudo relevance feedback is used to achieve the enhancement, i.e. better IR systems are used to make predictions and guesses about the related items.

2.9 Alternative Algebraic model

Under this heading we discuss two further models: latent semantic indexing and the neural network model.

2.9.1 Latent Semantic Indexing

Latent semantic indexing (LSI) helps retrieve information accurately from a large database. The similarity of documents depends on the contexts of the words that are present and the words that are absent. LSI combines the idea of singular value decomposition (SVD) with the vector space model. It takes documents that are semantically similar, i.e. on the same topic, but not similar in the original vector space, and represents them in a reduced vector space in which their similarity is high.
To compute LSI using SVD, a matrix A is decomposed into three matrices:

\[ A = U \Sigma V^{T} \]

where $\Sigma$ is a diagonal matrix, $U$ is an orthogonal matrix and $V^{T}$ is the transpose of an orthogonal matrix $V$. Jin Wang et al. (Jin Wang, 9May2012) propose a model that uses the bag-of-words model for the analysis of human motions in video frames.

2.9.2 Ontology-based Information Retrieval

One of the most rapidly emerging fields of information retrieval and extraction nowadays is ontology-based information retrieval (OBIE). OBIE is defined as the use of ontologies in order to retrieve information. An ontology is a specification of a conceptualization of terms or words. Ontologies are generally domain-specific, so different domains have different ontologies. Because they are domain-specific, they capture the relationships between the classes and entities of the domain, and they are application dependent. An ontology tree is a hierarchical representation of classes or entities and the relationships between them, in which entities are grouped and classified on the basis of their similarities and dissimilarities.

Figure 2 Ontology based information extraction (Ritesh Shah, February 2014)

2.9.3 Neural Network Model

Neural network models consist of interconnected neurons and have a labelled, directed graph structure. The nodes of a neural network graph perform calculations in order to produce an output. A directed graph contains nodes, or vertices, and connections between the nodes, while a labelled graph is one in which every connection carries a label that identifies its properties. In a neural network model the nodes of the graph behave as processing units, the edges behave as synaptic connections, and weights are assigned to the edges. In information retrieval, neural network models contain query-term nodes and document-term nodes.
The query-term nodes initiate the retrieval process by sending signals to the document-term nodes. The document-term nodes then send signals to the document nodes. Igor MOKRIŠ and Lenka SKOVAJSOVÁ (Igor MOKRIŠ, 2006) describe a neural network retrieval model for text documents in the Slovak language.

Conclusion

Different information retrieval techniques are discussed in this survey paper, together with their advantages and disadvantages. Each model has its own criteria for extracting the relevant documents for a user's query and for dealing with a particular information need. We conclude that some methods work best for some applications while other methods work best for others. Information retrieval systems are in use in different organizations, and new models are still being developed to obtain more relevant results.

Comparison of the models (model, related work, methods, advantages, limitations):

Model: Exact match model
Related work: David E. Losada
Methods: set theory and Boolean algebra; representation of the query by a Boolean expression; terms combined with the operators AND, OR, NOT; proximity; stemming
Advantages: easy to implement; exact match model; computationally efficient
Limitations: no term weighting used in document and query; adds too much complexity and detail; difficult for end users to form a correct Boolean query; no ranking; no partial matching

Model: Vector space model
Methods: weighting scheme used; cosine similarity; documents ranked by similarity
Advantages: improves retrieval performance by term weighting; similarity can be used for different elements
Limitations: term independence assumption; users cannot specify relationships between terms

Model: Probabilistic model
Methods: based on the probability ranking principle; based on relevance and non-relevance of data
Advantages: ranking of documents; does not consider the index inside a document
Limitations: binary word-in-document weights; independence of terms; only partial ranking of documents; based on prior knowledge

Model: Language based models
Methods: probability estimation of events in text; query likelihood model; speech recognition; term based for each document in the entire collection
Advantages: length normalization of term frequencies
Limitations: data sparsity

Model: Bayesian network model
Methods: directed graphical model; relationships between random variables are captured by directed edges
Advantages: deals with noisy data; describes the interaction between query and document space; query specification based on Boolean expressions
Limitations: expensive computation; bad performance for small collections

Model: Inference network model
Methods: random variables concerned with the query, the set of documents and index words
Advantages: provides a framework with possible ranking strategies
Limitations: uses Boolean query formulation

Model: Latent semantic indexing
Methods: concept-based retrieval of text; uses SVD
Advantages: retrieves documents even if they share no keyword with the query; solves the problem of ambiguities (polysemy and synonymy)
Limitations: expensive; works on small collections

Model: Ontology-based information retrieval
Methods: entities classified in a hierarchical manner; based on keyword matching
Advantages: ontologies can be reused and shared with other applications
Limitations: high time consumption; difficulties in creating the ontology tree; adding a new concept to an existing ontology requires considerable time and effort

Model: Neural network model
Methods: neural based; weights assigned to the edges between neurons
Advantages: easy to use but requires some statistical training; deals with large collections of data; detects relationships between the query and the retrieved documents
Limitations: difficult to design; expensive; complicated in nature; does not deal with small documents

References

Cerulo, G. C. (2004). A Taxonomy of Information. Journal of Computing and Information Technology, 175–194.

Daniel Valcarce, J. P. (n.d.). A Study of Smoothing Methods for Relevance-Based Language Modelling of Recommender Systems. Information Retrieval Lab, Computer Science Department, University of A Coruña, Spain.

Gupta, P. K. (2013). Survey Paper on Information Retrieval Algorithms and Personalized Information Retrieval Concept. International Journal of Computer Applications.

Hiemstra, D. (November 2009). Information Retrieval Models. Published in: Goker, A., and Davies, J. Information Retrieval: Searching in the 21st Century. John Wiley and Sons, Ltd.

Hui Yang, M. S. (2014). Dynamic Information Retrieval Modeling. SIGIR'14, July 6–11, 2014, Gold Coast, Queensland, Australia. ACM 978-1-4503-2257-7/14/07.

Igor MOKRIŠ, L. S. (2006). Neural Network Model Of System For Information Retrieval From Text Documents In Slovak Language. Acta Electrotechnica et.

Javier Parapar, ?. A. (2013). Relevance-based language modelling for recommender systems. Information Processing and Management, Elsevier.

Jin Wang, n. P. (9May2012). Supervised learning probabilistic Latent Semantic Analysis. Elsevier.

Kita, X. T. (Oct 1, 2000). Improvement of vector space information retrieval model based on supervised learning. IRAL '00 Proceedings of the fifth international workshop on Information retrieval with Asian languages, ACM, New York, NY, USA, 69-74.

Koltun, E. K. (4, July 2012). A Probabilistic Model for Component-Based Shape Synthesis. ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH 2012.

Mihajlović, D. H. (n.d.). A database approach to information retrieval: The remarkable. University of Twente, Centre for Telematics and Information Technology.

Paik, J. H. (August 13, 2015). A Probabilistic Model for Information Retrieval Based on Maximum Value Distribution. University of Maryland, College Park, USA, SIGIR'15.

Ricardo Baeza-Yates, B.-N. (2009). Modern Information Retrieval. ACM Press, New York.

Ritesh Shah, S. J. (February 2014). Ontology-based Information Extraction: An Overview. International Journal of Computer Applications (0975 – 8887).

Vaibhav Kant Singh, V. K. (2015). Vector Space Model: An Information Retrieval. International Journal of Advanced Engineering Research and Studies.

Xi-Quan Yang, D. Y.-H. (2014). Scientific literature retrieval model based on weighted term frequency. IEEE Computing Society.

