A Hybrid Approach For Extracting Relations Between Amharic Named Entities
Named entity relation extraction is one of the main tasks of information extraction. It takes
semi-structured or unstructured text as input and identifies the semantic relations that hold
between the entities mentioned in the text. For example, the sentence “Abebe is the president of
Ethiopia” carries the semantic relation president between the named entities Abebe (PERSON) and
Ethiopia (LOCATION). Extracting relations that describe the semantic interactions between named
entities is an important research topic in information extraction. MUC-7/MET-2 [2] gives a
specific definition of named entities at the level of entity extraction: named entities (NE) are
proper names and quantities of interest. Person, organization, and location names were marked, as
well as dates, times, percentages, and monetary amounts. Recognizing these entities is the first
and most basic task in building semantic analysis and information extraction systems.
Named entity extraction is an information extraction task that aims to identify words in a
sentence, paragraph or document and classify them into predefined categories of named entities.
Named Entity Recognition (NER) identifies named entities such as people, places, dates and
numbers. The first step in named entity recognition is to identify the proper nouns in a text; the
second is to classify these proper nouns into classes such as person name, organization name and
place name.
We can define relation extraction as the process of recognizing the type of relation that connects
two or more named entities. As stated in [4], the concept of relation extraction was first
introduced as part of the Template Element Task, one of the information extraction tasks in the
Sixth Message Understanding Conference (MUC-6) (Defense Advanced Research Projects Agency, 1995).
MUC-7 added a Template Relation Task with three relations. Following MUC, the Automatic Content
Extraction (ACE) meetings (National Institute of Standards and Technology, 2000) have continued to
pursue information extraction. In the ACE program, Relation Detection and Characterization (RDC)
was introduced as a task in 2002.
Named entity relation extraction is a significant research topic in the field of information
extraction and aims at finding the various semantic relations between named entities [6]. It is an
important step toward natural language processing (NLP) applications, because this type of
information makes it possible to discover useful relationships or interactions between entities
[7]. A relation among named entities can either be expressed directly through words in the context
or be implied by the context of a sentence.
The extraction of relations between named entities has received considerable attention because
named entity relations are a foundation of semantic networks, ontologies and the Semantic Web, and
are widely used in information retrieval, machine translation and automatic question answering
systems [6]. In particular, named entity relation extraction can be exploited to return more
precise and correct answers. For instance, for the question “Where was Alemu born?” the expected
answer is “Alemu was born in Jimma.” The relational triple is born-in(Person, Location), where
Person and Location are the named entities. To answer queries like this one, we have to analyze
the relevant documents and collect the necessary information. Indeed, there is a growing need to
extract semantic knowledge from texts automatically. Thus, we have to go beyond the detection of
named entities and try to extract the relations between them.
Several studies on NE recognition have already been performed in many languages, such as English,
French, Arabic and Chinese.
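The born-in triple from the question answering example above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the system proposed here; the `Triple` type, the `FACTS` list and the `answer` function are hypothetical names, and the stored facts are just the two example sentences used in this chapter.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """A binary relation instance: relation(subject, obj)."""
    relation: str
    subject: str
    obj: str


# Hypothetical facts, as a relation extractor might produce them
# from the example sentences used in the text.
FACTS = [
    Triple("born-in", "Alemu", "Jimma"),
    Triple("president-of", "Abebe", "Ethiopia"),
]


def answer(relation: str, subject: str, facts: List[Triple]) -> Optional[str]:
    """Return the object of the first matching triple, or None if no fact matches."""
    for fact in facts:
        if fact.relation == relation and fact.subject == subject:
            return fact.obj
    return None


# "Where was Alemu born?" becomes a lookup of born-in(Alemu, ?).
birthplace = answer("born-in", "Alemu", FACTS)
```

Once relations are stored in this form, a question reduces to a lookup over triples rather than a search through whole documents.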
Relation extraction for Amharic named entities has not received significant attention compared
with English, Arabic and Chinese. Some named entity recognition systems have been developed for
Amharic. Among these, [8] is based on handcrafted rules (a rule-based approach) and gazetteers,
and [9] is based on a supervised machine learning approach.
Several methods have been proposed to extract semantic relations between named entities. These
methods can be classified as rule-based, machine learning and hybrid approaches.
The rule-based approach relies on a set of handwritten rules. The rules are written by language
experts, so this approach requires human expertise. The rule-based method offers a detailed
analysis of the context of each named entity and its relations with the other named entities.
However, considerable effort is required to write down all the rules needed to discover relations
between named entities.
To fully automate relation extraction between named entities and avoid writing extraction rules
manually, it is better to use a machine learning approach, which requires a large amount of
annotated training data. This approach includes supervised, semi-supervised and unsupervised
techniques. Supervised techniques require a fully labeled corpus. The most commonly used
supervised techniques include support vector machines (SVM), conditional random fields (CRF),
decision trees and the maximum entropy model (ME).
The hybrid approach uses both rule-based and machine learning methods. In the hybrid approach,
the two kinds of methods are combined in order to improve the performance of relation extraction
between named entities.
Several studies have used a hybrid approach for English, Arabic and other languages. However, no
hybrid approach has been developed for extracting relations between named entities in Amharic.
Based on the benefits of this method, we propose a system that detects relations between Amharic
named entities using a hybrid approach. We use rules mainly for two purposes: first, to improve
the quality and accuracy of the system output, and second, to avoid ambiguous and invalid
relations between named entities.
Amharic is a language with rich and complex morphological structure and is the working language
of the federal government of Ethiopia. A lot of valuable information is currently being published
in Amharic, and Amharic named entities occur with high frequency in electronic Amharic documents.
Despite having a large number of speakers, Amharic is an under-resourced language. There is no
complete system for extracting relations between Amharic named entities that could contribute to
advanced research in natural language processing (NLP) and to information extraction applications
such as ontology design, question answering and machine translation. There is therefore a growing
need to extract semantic relations between Amharic named entities automatically. All of these
facts motivate us to do this work.
In general, the advantages and application areas of relation extraction between named entities in
many NLP applications and information extraction tasks motivate us to work on extracting relations
between Amharic named entities.
1.2 Statement of the Problem
The amount of Amharic electronic data is now increasing faster than ever before, and Amharic
named entities occur with high frequency in electronic texts.
Named entity recognition can be considered the first step towards semantic analysis of texts and
a crucial subtask of information extraction systems. But named entity recognition is only the
first step towards full language processing. If we want to go beyond the detection of entities, a
natural next step is to establish the semantic relations between these entities. However, the
relations between these entities are not sufficiently represented in the available resources.
The advantages and application areas of relation extraction between named entities in many NLP
applications and information extraction tasks motivate us to work on extracting relations between
Amharic named entities, and there is a growing need to extract such relations automatically. As
stated in [6, 10], named entity relations are a foundation of semantic networks, ontologies and
the Semantic Web, and are widely used in information retrieval, machine translation, automatic
question answering and summarization.
Amharic is a language with rich and complex morphological structure and is the working language
of the federal government of Ethiopia, and a lot of valuable information is currently being
published in Amharic. Amharic is written with a version of the Ge’ez script known as Fidel (ፊደል),
has its own grammar, syntax, character (Fidel) representation and sentence formation, and is
spoken by a large population. According to [11], Amharic is the second most spoken Semitic
language after Arabic, the second largest language in Ethiopia (after Oromiffa, a Cushitic
language) and possibly one of the five largest languages on the African continent.
Despite having a large number of speakers, Amharic is an under-resourced language. There is no
complete system for extracting relations between Amharic named entities that could contribute to
advanced research in natural language processing (NLP) and to information extraction applications
such as ontology design, question answering and machine translation.
Different researchers have proposed different methods for extracting relations between named
entities. A system that extracts relations between named entities effectively in one language may
not work for another language with the same accuracy and efficiency, or may not work at all,
because a relation extraction system has to be trained on the characteristics of the given
language. Developing an efficient system for extracting relations between Amharic named entities
is therefore an important task.
There is only one attempt [8] to extract relations between Amharic named entities. However, it
places its emphasis on designing and developing an automatic information extraction system for
Amharic text using a knowledge-poor approach for the infrastructure domain. To extract relations,
it uses extraction rules that specify different patterns; the text is matched against those
patterns and, if a match is found, the matching element of the text is extracted.
The following gaps are identified in that work:
It is domain dependent. It works only for the infrastructure domain and extracts relations
between named entities within that domain only.
It uses a gazetteer to identify entities that have no fixed pattern, such as organization and
location names, so extraction cannot go beyond the entities listed in the gazetteer.
It extracts only explicit relations between named entities, although there are also implicit
relations that can be extracted from the context.
To the best of our knowledge, no study has used a hybrid approach to extract relations between
Amharic named entities. The limitations mentioned above reduce the quality of relation extraction
between named entities. This work therefore aims to improve the quality of the system output by
developing a hybrid system that combines the advantages of machine learning and rule-based
approaches.
Therefore, we will address the following research questions:
How can we effectively extract the relations between Amharic named entities?
What are the best feature sets for improving the performance of Amharic named entity relation
extraction?
Does the combination of different approaches improve the performance of the system?
How can the system be made domain independent?
Which is the most effective method among the state-of-the-art relation extraction methods?
How does our method perform compared with other methods?
1.3 Objectives
The objectives of the research are stated as a general objective and specific objectives, as
follows.
1.3.1 General objective
The general objective of this research is to design and develop an automatic relation extraction
system for Amharic named entities using a hybrid approach.
1.3.2 Specific objectives
List and analyze the state of the art in relation extraction between named entities
Design a model for automatic Amharic named entity relation extraction
Design a model for Amharic named entity recognition
Identify features and methods that bring better performance for the extraction of relations
between Amharic named entities
1.4 Methodology
Automatic extraction of semantic relations between named entities is a compound task that is
carried out by different components in several steps.
For this research we will use a hybrid method that combines machine learning and rule-based
approaches. We will use the Apriori and decision tree algorithms to extract relations, and we will
apply a rule-based approach as post-processing to the output of the machine learning component,
improving the system output by removing unnecessary relations and extracting unseen relations. To
achieve the main objective, the following step-by-step activities will be performed.
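The rule-based post-processing described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the proposed implementation: the `Relation` type, the `ALLOWED_TYPES` constraint table and the function names are hypothetical, and the machine learning component is represented only by the list of candidate relations it would output.

```python
from typing import List, NamedTuple


class Relation(NamedTuple):
    head: str
    head_type: str
    label: str
    tail: str
    tail_type: str


# Hypothetical type constraints: which entity types each relation may connect.
ALLOWED_TYPES = {
    "born-in": ("PERSON", "LOCATION"),
    "president-of": ("PERSON", "LOCATION"),
}


def remove_invalid(candidates: List[Relation]) -> List[Relation]:
    """Drop candidates whose entity types violate the constraint for their label."""
    kept = []
    for rel in candidates:
        allowed = ALLOWED_TYPES.get(rel.label)
        if allowed is None or allowed == (rel.head_type, rel.tail_type):
            kept.append(rel)
    return kept


def add_unseen(relations: List[Relation], rule_relations: List[Relation]) -> List[Relation]:
    """Append relations found only by handwritten rules, avoiding duplicates."""
    seen = set(relations)
    return relations + [rel for rel in rule_relations if rel not in seen]


def postprocess(ml_candidates: List[Relation], rule_relations: List[Relation]) -> List[Relation]:
    """Rule-based post-processing applied to the machine learning output."""
    return add_unseen(remove_invalid(ml_candidates), rule_relations)
```

Keeping the two rule steps separate makes it possible to measure how much each one contributes to the final output.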
1.4.1 Literature review
To identify gaps in previous work and to better understand relation extraction between named
entities, related literature from books, journal articles and the internet will be reviewed.
1.4.2 Corpus preparation
A relevant Amharic corpus will be collected and prepared from different sources in order to train
the system. It is difficult to find publicly available annotated corpora that contain the
necessary information for Amharic. This makes machine learning methods, especially supervised
approaches, difficult to use, since they need a large amount of annotated training data. Since
Amharic is an under-resourced language and no annotated corpus suitable for our task is available,
we have to prepare our own corpus and annotate it using available tools. We will construct the
corpus from different resources. During corpus preparation we will select sentences that contain
at least two named entities, because our aim is to extract the relations between those entities.
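The sentence selection criterion above can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that each sentence is a list of (token, tag) pairs annotated with BIO tags, where every entity mention starts with a tag beginning with `B-`; the proposal does not fix an annotation scheme, and the function names are hypothetical.

```python
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]


def count_entities(sentence: TaggedSentence) -> int:
    """Count entity mentions: in BIO tagging each mention has exactly one 'B-' tag."""
    return sum(1 for _token, tag in sentence if tag.startswith("B-"))


def select_sentences(corpus: List[TaggedSentence], min_entities: int = 2) -> List[TaggedSentence]:
    """Keep only sentences with at least `min_entities` named entity mentions."""
    return [s for s in corpus if count_entities(s) >= min_entities]
```

Sentences with fewer than two mentions cannot express a relation between named entities, so dropping them shrinks the annotation effort without losing training examples.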
1.4.3 NLP pre processing
Different NLP preprocessing activities are performed in this step, including tokenization, stop
word removal, stemming and normalization, to produce the training data set. Given a text input,
the preprocessing module segments it into individual sentences. Each sentence is then divided into
a sequence of words or tokens.
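The preprocessing steps above can be sketched as follows in Python. The details are assumptions made for illustration only: sentences are split on the Ethiopic full stop (።), tokens on whitespace and Ethiopic punctuation, normalization maps a few homophone characters to one canonical form (first-order forms only), and `STOP_WORDS` is a tiny hypothetical list. Stemming is omitted because it needs a morphological analyzer.

```python
import re

# Assumed sentence and token delimiters for Amharic text.
SENTENCE_END = re.compile(r"[።!?]+")
TOKEN_SPLIT = re.compile(r"[\s፡፣፤]+")

# Assumed normalization of homophone characters (base forms only).
HOMOPHONES = str.maketrans({"ሐ": "ሀ", "ኀ": "ሀ", "ሠ": "ሰ", "ዐ": "አ", "ፀ": "ጸ"})

# Hypothetical stop word list ("and", "or"); a real list would be much longer.
STOP_WORDS = {"እና", "ወይም"}


def split_sentences(text):
    """Segment raw text into sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def tokenize(sentence):
    """Divide a sentence into tokens."""
    return [t for t in TOKEN_SPLIT.split(sentence) if t]


def preprocess(text):
    """Return one list of normalized, stop-word-free tokens per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = [t.translate(HOMOPHONES) for t in tokenize(sentence)]
        result.append([t for t in tokens if t not in STOP_WORDS])
    return result
```

Stop word removal should be applied with care in this task, since function words between two entities may themselves signal a relation.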
1.4.4 Named entity recognition
Before extracting relations between named entities, we first have to identify the named entities
in a given sentence. Once the named entities are determined, we extract the semantic relations
between them.
1.4.5 Automatic rule extraction
Rules are extracted automatically to discover the words that predict relations between named
entities. These rules are generated using machine learning (ML) algorithms. For this research we
will use a hybrid approach to extract the semantic relations between Amharic named entities.
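As an illustration of automatic rule extraction, the following Python sketch mines association rules with a level-wise Apriori search. It is a simplified sketch under stated assumptions: each training sentence is encoded as a set of hypothetical items (`T1=`/`T2=` for the entity types, `w=` for a context word, `REL=` for the annotated relation), which is an encoding chosen here and not one specified by the proposal.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search; returns {itemset: number of supporting transactions}."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: k for c, k in counts.items() if k / n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        joined = {a | b for a, b in combinations(list(level), 2) if len(a | b) == len(a) + 1}
        # Prune step: keep candidates whose k-item subsets are all frequent.
        current = [c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, len(c) - 1))]
    return frequent


def relation_rules(transactions, min_support, min_confidence):
    """Build rules 'antecedent -> REL=label' from the frequent itemsets."""
    frequent = frequent_itemsets(transactions, min_support)
    rules = []
    for itemset, count in frequent.items():
        labels = [item for item in itemset if item.startswith("REL=")]
        if len(labels) != 1 or len(itemset) < 2:
            continue
        antecedent = itemset - {labels[0]}
        confidence = count / frequent[antecedent]
        if confidence >= min_confidence:
            rules.append((antecedent, labels[0], confidence))
    return rules
```

A rule such as {w=born} -> REL=born-in then states that the context word predicts the relation with the reported confidence; the support and confidence thresholds are tuning parameters.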
1.4.6 Development tool
The Java programming language is used to implement the different language-specific algorithms.
1.5 Scope and Limitation
The focus of the study is extracting relations between Amharic named entities using a hybrid
approach. The scope of the study is limited to determining relations between Amharic named
entities that occur within the same sentence.
1.6 Application of Results
As stated in the statement of the problem, extracting relations between named entities is useful
in many areas of natural language processing (NLP) and information extraction for Amharic. The
beneficiaries of this research therefore include researchers who are involved (or want to be
involved) in NLP and information extraction research that needs relations between Amharic named
entities. In addition, automatic extraction of relations between Amharic named entities benefits
users by enabling them to get relevant information quickly for complex queries, because entity
relation extraction is very useful for question answering. It thus saves users time and effort,
since they do not need to spend time reading unnecessary documents.
In general, relation extraction between Amharic named entities can be used in applications such
as question answering, text summarization, semantic networks and ontology learning.
2 Literature review
The sections below review different approaches to relation extraction; the subtasks of
information extraction, which include named entity recognition, coreference resolution and
relation construction; evaluation metrics; and other concepts related to named entity relation
extraction, in order to understand the problem domain and the extent of the work to be done.
2.1 Information extraction
Nowadays there is rapid growth in the textual information available in digital form on the
internet and in other electronic media. A significant part of this information, such as government
documents, legal acts, online news and social media communication, is transmitted in unstructured
form and is therefore difficult to search. This has resulted in a growing need for effective and
efficient techniques for analyzing free-text data and discovering valuable and relevant knowledge
in it in the form of structured information, which leads to the concept of information extraction
technologies. Information extraction refers to the automatic extraction of structured information,
such as entities, relationships between entities and attributes describing entities, from
unstructured sources.
The goal of information extraction (IE) is to extract the pieces of information that are
important to the user’s needs from a large volume of text. IE tasks may vary in detail and
reliability, but two subtasks are very common and closely related: named entity recognition and
relation extraction. Named entity recognition identifies named objects of interest such as
persons, organizations or locations. Relation extraction involves identifying the appropriate
relations among these entities. Examples of specific relations are employee-of and parent-of:
employee-of holds between a particular person and a certain organization, and parent-of holds
between a father and his child [12].
Information extraction has not received as much attention as information retrieval (IR) and is
often confused with it [13], but the two are different concepts. The IR process usually returns a
ranked list of documents, where the rank corresponds to the relevance score that the system
assigned to the document in response to the query. The goal of IE, by contrast, is not to rank or
select documents but to extract from the documents relevant facts about pre-specified types of
events, entities or relationships, in order to build more meaningful, rich representations of
their semantic content.
The Message Understanding Conference (MUC) [15, 16] and the Automatic Content Extraction (ACE)
program [17] have shaped the scope of information extraction. Before these two competitions,
extraction tasks focused mainly on identifying named entities such as person and location names,
and the relations between them, in natural language text [14].
Extraction of structured information from text started gaining much attention when DARPA
initiated and funded the Message Understanding Conferences in the 1990s [18]. Early MUCs defined
information extraction as filling a predefined template that contains a set of predefined slots.
The MUCs provide a forum for assessing and discussing progress in the field of natural language
processing. Each conference is preceded by a formal evaluation of text analysis systems that have
been developed to perform a shared task, as designed by the government in consultation with
evaluation participants from the research community [19].
Automatic Content Extraction (ACE) [17] is an evaluation conducted by NIST to measure the tasks
of Entity Detection and Tracking (EDT) and Relation Detection and Characterization (RDC). The
entity detection task requires that selected types of entities mentioned in the source data be
detected, their sense disambiguated, and that selected attributes of these entities be extracted
and merged into a unified representation for each entity [20]. As stated in [21], the goal of RDC
is to detect and characterize relations of the targeted types between EDT entities. ACE defines
the following NE types: PERSON, ORG, LOCATION, FACILITY, GEO POLITICAL ENTITY (GPE), WEAPON, etc.
The objective of the ACE program is to develop technology to automatically infer from human
language data the entities being mentioned, the relations among these entities that are directly
expressed, and the events in which these entities participate.
Structured databases, labeled unstructured data, linguistic tags and so on are the types of input
resources available for extraction. Structured data is data that can be easily organized. It is
simple, clean and analytical, and is usually stored in databases. Fully structured data follows a
predefined schema; a typical example is a relational database. Unstructured data refers to
information that has neither a predefined data model nor an identifiable structure; data collected
from social media is an example.
Usually IE, like many other NLP tasks, can be regarded as a pipeline process in which some kind
of information is extracted at each stage. The types of information extracted are [23]:
Named entities (NE)
Relations between entities
2.2 Subtasks of information extraction
2.2.1 Named entity recognition
A named entity (NE) is often a word or phrase that represents a specific real-world object. Named
entities play a central role in conveying important domain-specific information in text, and good
named entity recognizers are often required in building practical information extraction systems
[8]. The task of named entity recognition is to identify named entities in free-form text and to
classify them into a set of predefined types such as person, organization and location. Named
Entity Recognition (NER) is one of the major tasks in natural language processing (NLP). It is
essential for recognizing information units such as names, including person, organization and
location names, and numeric expressions, including time, date, money and percent expressions,
within a text. Research on named entity recognition has been promoted by the Message Understanding
Conferences (MUCs, 1987-1998), the Conference on Natural Language Learning (CoNLL, 2002-2003) and
the Automatic Content Extraction program (ACE, 2002-2005) [24]. NER was first presented as a
subtask of the Sixth Message Understanding Conference (MUC-6) [25]. Throughout the MUC series, the
term named entity came to include seven categories: persons, organizations and locations (usually
referred to as ENAMEX); temporal expressions and dates (TIMEX); and percentages and monetary
expressions (NUMEX).
Information extraction makes information easier to locate. This is done by first locating named
entities and then categorizing them under different labels. Named entity recognition is probably
the most fundamental task in information extraction. Extraction of more complex structures such as
relations and events depends on accurate named entity recognition as a preprocessing step. As
stated in [26], apart from being a building block of information extraction, named entity
recognition has many applications, such as question answering, information retrieval, machine
translation, parsing, semantic metadata and fast information gathering.
2.2.2 Coreference resolution
A given entity can be referred to several times in a text, in different ways. Coreference
resolution is important for identifying all the expressions used to refer to that entity. Entity
coreference resolution is the task of determining which entity mentions in a text or dialogue
refer to the same real-world entity. Entity mentions can be named, when an entity is referred to
by name; pronominal, when an entity is referred to with a pronoun; or nominal, when an entity is
expressed by a nominal expression [13]. Work on coreference resolution systems began in earnest in
1996 in response to the MUC-6 competition organized by NRAD with the support of DARPA.
There are several types of coreference, but the most common are pronominal coreference (when a
noun is replaced by a pronoun) and proper-name coreference (when a noun is replaced by another
noun or a noun phrase) [27].
Coreference resolution involves identifying relations between entities in texts. Besides the
entities identified by named entity recognition, this may also include anaphoric references to
those entities. It is concerned with entities and references (such as pronouns) that refer to the
same thing. Coreference resolution makes it possible to associate descriptive information
scattered across texts with the entities to which it refers [8].
2.3 Relation extraction
A relation is an aspect or quality that connects two or more things or parts as being, belonging
or working together, or as being of the same kind. Formally, we can define relation extraction as
the process of recognizing the type of relation that connects two or more named entities. Examples
of such entities include names of persons, organizations and locations, and expressions of time,
quantities, monetary values, percentages, etc. The input is multi-structured data, including
structured data (info boxes), semi-structured data (tables and lists) and unstructured data (free
text), and the output is a set of fact triples extracted from the input data. Relation extraction
(RE) is one of the steps of information extraction. It typically follows named entity recognition
and coreference resolution and aims to gather the relations between NEs. [28] defines relation
extraction as “the task of discovering semantic connections between entities. In text, this
usually amounts to examining pairs of entities in a document and determining (from local language
cues) whether a relation exists between them.” Recently it has received more and more attention in
many areas, such as information extraction, ontology construction and bioinformatics.
The concept of relation extraction was first introduced as part of the Template Element Task, one
of the information extraction tasks in the Sixth Message Understanding Conference (MUC-6) (Defense
Advanced Research Projects Agency, 1995). MUC-7 added a Template Relation Task with three
relations. Following MUC, the Automatic Content Extraction (ACE) meetings (National Institute of
Standards and Technology, 2000) have continued to pursue information extraction.
The relation extraction task identifies various semantic relations, such as location, affiliation
and rival, between entities in text. For example, the sentence “Abebe is the president of
Ethiopia” (“አበበ የኢትዮጵያ ፕሬዚዳንት ነው”) conveys the semantic relation President (“ፕሬዚዳንት”) between the
entities Abebe (“አበበ”) (PERSON) and Ethiopia (“ኢትዮጵያ”) (GPE).
Many applications in information extraction, natural language understanding and information
retrieval require an understanding of the semantic relations between entities. Extracting semantic
relations between entities in natural language text is a crucial step towards natural language
understanding applications. A relation is defined in the form of a tuple t = (e1, e2, …, en),
where the ei are entities in a predefined relation r within document D. Relations can hold between
two entities (binary relations) or more than two entities, but most relation extraction systems
focus on extracting binary relations [30].
There are different relation types; [31] presents the following relation types from ACE 2003:
ROLE: relates a person to an organization or a geopolitical entity
subtypes: member, owner, affiliate, client, citizen
PART: generalized containment
subtypes: subsidiary, physical part-of, set membership
AT: permanent and transient locations
subtypes: located, based-in, residence
SOCIAL: social relations among persons
subtypes: parent, sibling, spouse, grandparent, associate
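The type inventory listed above maps naturally onto a small lookup table. The following Python sketch shows one possible representation; the dictionary contents are copied from the list, while the name `ACE2003_RELATION_TYPES` and the helper `parent_type` are hypothetical.

```python
from typing import Optional

# Relation types and subtypes as listed above for ACE 2003.
ACE2003_RELATION_TYPES = {
    "ROLE": ["member", "owner", "affiliate", "client", "citizen"],
    "PART": ["subsidiary", "physical part-of", "set membership"],
    "AT": ["located", "based-in", "residence"],
    "SOCIAL": ["parent", "sibling", "spouse", "grandparent", "associate"],
}


def parent_type(subtype: str) -> Optional[str]:
    """Return the relation type that a subtype belongs to, or None if unknown."""
    for relation_type, subtypes in ACE2003_RELATION_TYPES.items():
        if subtype in subtypes:
            return relation_type
    return None
```

A classifier can then predict the fine-grained subtype and fall back to the coarser type when only the type is needed for evaluation.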
Before MUC-7, relations between entities were part of the scenario-specific template outputs of
IE evaluations. In order to capture more widely useful relations, MUC-7 introduced the template
relation task. Extraction of relations among entities is a central feature of almost any
information extraction task, although the possibilities in real-world extraction tasks are endless
[27].
Before starting to extract relations, it is a good idea to determine which words refer to the
same “object” in the real world. These objects are called entities. For example, “Barack”, “Obama”
or “the president” may refer to the entity “Barack Obama”. Suppose we extract relations involving
each of these words; it would be helpful to combine them as information about the same person.
Figuring out which words, or mentions, refer to the same entity is a process called entity
linking.
Entity linking (EL) is a central task in information extraction: given a textual passage,
identify entity mentions (substrings corresponding to world entities) and link them to the
corresponding entry in a given knowledge base (KB, e.g. Wikipedia or Freebase) [32].
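The “Barack Obama” example above can be sketched with a simple alias table standing in for the knowledge base. This Python sketch is only an illustration: `KB_ALIASES`, `link_mention` and `group_by_entity` are hypothetical names, and real entity linking uses context to disambiguate mentions rather than exact string lookup.

```python
from collections import defaultdict

# Hypothetical alias table: lowercased mention -> knowledge base entry.
KB_ALIASES = {
    "barack": "Barack Obama",
    "obama": "Barack Obama",
    "barack obama": "Barack Obama",
    "the president": "Barack Obama",
}


def link_mention(mention):
    """Map a mention to its knowledge base entry, or None if it is not covered."""
    return KB_ALIASES.get(mention.strip().lower())


def group_by_entity(mention_facts):
    """Collect facts extracted for different mentions under the linked entity."""
    grouped = defaultdict(list)
    for mention, fact in mention_facts:
        entity = link_mention(mention)
        if entity is not None:
            grouped[entity].append(fact)
    return dict(grouped)
```

Grouping facts this way is what allows information scattered over several mentions to be combined as information about one person.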
2.4 Methods to extract semantic relations between named entities
There are three main approaches to the design of named entity relation extraction:
Rule based approach
Machine learning approach
Hybrid approach
2.4.1 Rule based approach
This approach uses predefined linguistic (syntactic and semantic) rules, written manually, to
extract relationships based on part-of-speech information. It is well suited to a restricted
domain and gives a good quality of analysis. Its major drawback is that it does not perform well
on wide-ranging or new-domain data. This is due to two reasons: rules have to be rewritten for
different tasks or when the application is extended to different domains, and finding rules
manually is very hard and time-consuming [33].
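A handwritten extraction rule of the kind described above can be sketched as a pattern over entity-tagged text. The Python sketch below makes several assumptions for illustration: entities are already marked inline as `[TYPE text]`, the example rules are written for English sentences for readability, and `RULES` and `extract` are hypothetical names.

```python
import re

# Assumed inline entity markup: [TYPE entity text]
ENTITY = re.compile(r"\[([A-Z]+) ([^\]]+)\]")

# Hypothetical handwritten rules: (relation, head type, trigger pattern, tail type).
RULES = [
    ("president-of", "PERSON", re.compile(r"\bis the president of\b"), "LOCATION"),
    ("born-in", "PERSON", re.compile(r"\bwas born in\b"), "LOCATION"),
]


def extract(tagged_sentence):
    """Apply each rule to the text between every pair of adjacent entities."""
    entities = list(ENTITY.finditer(tagged_sentence))
    found = []
    for left, right in zip(entities, entities[1:]):
        between = tagged_sentence[left.end():right.start()]
        for relation, head_type, trigger, tail_type in RULES:
            types_match = (left.group(1), right.group(1)) == (head_type, tail_type)
            if types_match and trigger.search(between):
                found.append((relation, left.group(2), right.group(2)))
    return found
```

The sketch also makes the drawback visible: every new relation, phrasing or domain needs another handwritten entry in `RULES`.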
2.4.2 Machine learning approach
Machine learning (ML) techniques are widely used as a component of relation extraction methods.
Machine learning methods are based on statistical analysis of data to infer general rules. The
task of an ML method is either to learn rules from the structure of the underlying data or to
distinguish instances of data from each other. The outcome of an ML method is therefore either the
learned rules or a model that is used to make predictions about unseen data based on previously
seen data [28]. To fully automate the relation extraction task, some research studies have turned
to ML methods, including unsupervised, semi-supervised and supervised learning techniques.
Supervised learning-based methods have been shown to be effective and to perform much better
than the other two alternatives. However, their performance depends heavily on the availability of
a large amount of manually labeled, high-quality data, and annotating large corpora with relation
instances is expensive and tedious 34. Supervised methods are based on a training set in which
domain-specific examples have been tagged. Such systems automatically learn extractors for relations
by using machine learning techniques. The main problem with these methods is that the
development of a suitably tagged corpus can take a lot of time and effort. On the other hand,
these systems can be easily adapted to a different domain, provided there is training data 37.
This approach treats relation extraction as a classification task. Support Vector Machines
(SVM), Conditional Random Fields (CRF), decision trees and Maximum Entropy (MaxEnt) are
the most widely used supervised machine learning techniques.

Unsupervised methods use a set of generic patterns to automatically instantiate relation-specific
extraction rules and then learn domain-specific extraction rules. The whole process is
repeated iteratively. This is also known as the self-supervised learning method 37. Unsupervised
learning-based methods normally perform very poorly, though they do not depend on the
availability of any manually labeled data. They make use of massive quantities
of unlabeled text and are based almost entirely on clustering techniques and similarities between
features or context words 7.

To solve the problems with the unsupervised approach, some supervised systems also use
bootstrapping to make construction of the training data easier. These methods are sometimes
referred to as weakly supervised information extraction. Bootstrapping uses a small initial set of
seeds or a set of hand-constructed extraction patterns to begin the training process. After the
occurrences of the needed information are found, they are used in turn to recognize new patterns.
A sample of linguistic patterns or some target relation instances can be used to acquire more basic
relations until all the target relations are discovered. To achieve a better balance between human effort
and extraction performance, semi-supervised learning has recently been drawing more and more
attention in semantic relation extraction and in other NLP applications as well 34.
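The bootstrapping loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited system: the toy English corpus, the seed pair, and the helper names (`find_patterns`, `apply_patterns`, `bootstrap`) are all hypothetical, and a "pattern" is simply the literal text found between two entity mentions.

```python
import re

# Hypothetical toy corpus: each sentence is "<entity> <context> <entity>".
CORPUS = [
    "Abebe is the president of Ethiopia",
    "Kebede is the president of Kenya",
    "Kebede leads Kenya",
    "Almaz leads Sudan",
    "Almaz was born in Gondar",
]


def find_patterns(pairs, corpus):
    """Collect the text between the two entities of every known pair."""
    patterns = set()
    for head, tail in pairs:
        regex = re.compile(rf"{re.escape(head)} (.+) {re.escape(tail)}")
        for sentence in corpus:
            match = regex.fullmatch(sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(patterns, corpus):
    """Use the learned patterns to harvest new entity pairs."""
    pairs = set()
    for pattern in patterns:
        regex = re.compile(rf"(\w+) {re.escape(pattern)} (\w+)")
        for sentence in corpus:
            match = regex.fullmatch(sentence)
            if match:
                pairs.add((match.group(1), match.group(2)))
    return pairs


def bootstrap(seeds, corpus, max_rounds=10):
    """Alternate pattern learning and pair harvesting until nothing new appears."""
    pairs, patterns = set(seeds), set()
    for _ in range(max_rounds):
        new_patterns = find_patterns(pairs, corpus)
        new_pairs = apply_patterns(new_patterns, corpus)
        if new_patterns <= patterns and new_pairs <= pairs:
            break  # fixed point reached
        patterns |= new_patterns
        pairs |= new_pairs
    return pairs, patterns


pairs, patterns = bootstrap({("Abebe", "Ethiopia")}, CORPUS)
print(sorted(patterns))
print(sorted(pairs))
```

Starting from a single seed pair, the loop learns the pattern "is the president of", uses it to find a second pair, and from that pair learns the looser pattern "leads". Real systems typically score patterns and discard unreliable ones, because a loose pattern can pull in unrelated pairs in later rounds.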
2.4.3 Hybrid approach
The two categories of approaches described above can be combined to obtain a mixed approach.
Recently, research studies have been oriented toward hybrid approaches because such
an approach achieves better performance than either the rule-based approach
or the ML-based approach alone 7. This approach uses manually handcrafted rules together with
rules extracted from data through machine learning (ML) algorithms 36. Among the systems based
on this approach, we can mention the system developed by 7 to extract relations between Arabic
named entities. In that system, the results are first obtained from an ML-based method, and
linguistic modules are then applied as a post-processing step to improve them.
This system extracts semantic relations that are complicated or expressed through more
than one word, and it annotates them using a defined markup.
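One way to combine the two stages is sketched below, assuming that handcrafted rules are consulted only when the ML stage is not confident. Everything here is hypothetical: `ml_predict` is a stand-in for a trained classifier, and the rule list, labels, and threshold are invented for illustration. The system in 7 is described only at a high level, so this is not a reproduction of it.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    head: str      # first named entity
    tail: str      # second named entity
    context: str   # text found between the two entities


def ml_predict(candidate):
    """Stand-in for a trained classifier; returns (label, confidence)."""
    if "president" in candidate.context:
        return "president_of", 0.9
    return "no_relation", 0.4


# Hypothetical handcrafted rules: (cue phrase, relation label).
RULES = [
    ("was born in", "born_in"),
    ("works for", "employee_of"),
]


def hybrid_predict(candidate, threshold=0.5):
    """Keep confident ML predictions; otherwise fall back to the rules."""
    label, confidence = ml_predict(candidate)
    if confidence >= threshold:
        return label
    for cue, rule_label in RULES:
        if cue in candidate.context:
            return rule_label
    return label


print(hybrid_predict(Candidate("Abebe", "Ethiopia", "is the president of")))
print(hybrid_predict(Candidate("Almaz", "Gondar", "was born in")))
```

The fallback ordering is a design choice. A system could equally let the rules override the classifier, or feed rule matches to the classifier as features.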
2.5 Evaluation metrics
To evaluate the performance of the system, it is necessary to use well-accepted performance measures
such as precision, recall and F-measure. Precision refers to the ability to avoid type I errors (false
positives); recall is the ability to avoid type II errors (false negatives); and the F-score is defined as
the harmonic mean of precision and recall. The measures are defined in 5 as below:
\[
\text{precision} = \frac{\text{No. of phrases properly recognized as representing the feature (TP)}}{\text{No. of phrases recognized as representing the feature (TP + FP)}}
\]

\[
\text{recall} = \frac{\text{No. of phrases properly recognized as representing the feature (TP)}}{\text{No. of phrases representing the feature (TP + FN)}}
\]

\[
F\text{-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{2.3}
\]

where TP means true positive value, FP means false positive value, and FN means false negative value.
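The three measures can be computed directly from the TP, FP and FN counts. The following is a minimal sketch. The function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something stated in 5.

```python
def precision_recall_fscore(tp, fp, fn):
    """Return (precision, recall, F-score) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


# Hypothetical counts: 6 relations extracted correctly, 2 extracted wrongly,
# and 6 relations in the gold standard that were missed.
p, r, f = precision_recall_fscore(tp=6, fp=2, fn=6)
print(f"precision={p:.2f} recall={r:.2f} F-score={f:.2f}")
```

With these counts, precision is 6/8 = 0.75, recall is 6/12 = 0.50, and the F-score is 2(0.75)(0.50)/(1.25) = 0.60.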
2.6 The Amharic language
Amharic is a Semitic language spoken predominantly in Ethiopia. It is the working language of
the country. The language is spoken as a mother tongue by a large segment of the population in
the northern and central regions of Ethiopia and as a second language by many others. Following
the Constitution drafted in 1993, Ethiopia was divided into nine independent regions, each with
its own regional language. Amharic then became the official or working language of several of
the regional states within the federal system, including Amhara, Gambella, Benishangul and the
multi-ethnic Southern Nations, Nationalities and Peoples region. It is the second most spoken
Semitic language in the world after Arabic and the most commonly learned second language
throughout Ethiopia 37. It is the second largest language in Ethiopia (after Afan Oromo, a
Cushitic language) and possibly one of the five largest languages on the African continent 9. As
a result, it has official status and is used nationwide. Despite its large speaker population, the
language has few computational linguistic resources.
2.6.1 The Amharic Writing
As stated in 8, Amharic is written using a writing system called fidel (ፊደል, “letter”
or “character”) adapted from Ge’ez (the liturgical language of the Ethiopian Orthodox Church).
In modern written Amharic, each syllable pattern comes in seven different forms (called
orders), reflecting the seven vowel sounds. These seven orders (the first, basic order and the other
six orders) represent the different sounds of a consonant-vowel combination (a characterization
known as syllabic). The non-basic forms are derived from the basic ones by somewhat regular
modifications. The script is written from left to right, in contrast to some other Semitic languages.
There are 33 basic characters, each of which has seven orders depending on which
vowel is to be pronounced in the syllable.
Therefore, these 33 basic characters with their seven forms give 7 × 33 = 231 syllable patterns
(syllographs), or fidels. In addition to the 231 characters, there are other non-standard characters
with special features, usually representing labialization.
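The 33 × 7 arithmetic can be checked against the Unicode encoding of the script. The sketch below relies on an assumption that comes from the Unicode Ethiopic block and not from the source: the seven orders of each basic character occupy seven consecutive code points starting at the first-order form. The list of 33 base code points is likewise supplied here for illustration.

```python
import unicodedata

# First-order code points of the 33 basic characters (assumed from the
# Unicode Ethiopic block; not taken from the source text).
BASE_CODE_POINTS = [
    0x1200, 0x1208, 0x1210, 0x1218, 0x1220, 0x1228, 0x1230, 0x1238,
    0x1240, 0x1260, 0x1270, 0x1278, 0x1280, 0x1290, 0x1298, 0x12A0,
    0x12A8, 0x12B8, 0x12C8, 0x12D0, 0x12D8, 0x12E0, 0x12E8, 0x12F0,
    0x1300, 0x1308, 0x1320, 0x1328, 0x1330, 0x1338, 0x1340, 0x1348,
    0x1350,
]


def orders(base_code_point):
    """Return the seven orders of one basic character as a list of strings."""
    return [chr(base_code_point + offset) for offset in range(7)]


FIDEL_TABLE = [orders(cp) for cp in BASE_CODE_POINTS]
ALL_SYLLABLES = [ch for row in FIDEL_TABLE for ch in row]

# Sanity check: every generated character is a named Ethiopic syllable.
assert all(
    unicodedata.name(ch).startswith("ETHIOPIC SYLLABLE") for ch in ALL_SYLLABLES
)

print(len(BASE_CODE_POINTS), "basic characters,", len(ALL_SYLLABLES), "syllable patterns")
print(" ".join(orders(0x1208)))  # the seven orders of the LA row
```

The first line printed reports 33 basic characters and 231 syllable patterns, and the second shows one row of the table, U+1208 through U+120E. The labialized forms mentioned above sit at other code points and are not covered by this sketch.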
2.6.2 Amharic word categories
According to recent work (Baye, 2000), as cited in 8, the Amharic language has five word
categories based on the role of words in syntax, that is, on the clear role that words play
in Amharic grammar. These five categories are ስም (noun), ግስ (verb), ቅጽል (adjective), ተውሳከ ግስ
(adverb), and መስተዋድድ (preposition).
Noun: As in English, Amharic nouns are words used to name or identify any of a class
of things, people, places, organizations or ideas, or a particular one of these 3. A word is
categorized as a noun if it can be pluralized by adding the suffix ኦች/ዎች and is used to name
something such as a person, an animal, and so on 1. In Amharic sentences, a noun is used to indicate
the subject of a sentence. A pronoun is a word that is used instead of a noun or noun phrase. Pronouns are
characterized by number, gender and possessiveness. Some pronouns serve as deictic specifiers
such as í ¥5Ë ¥” •z ¥1 ¥1 . . . .. ; quantity specifiers such as •ó•õ%Bu eÙ . . . . and
possession specifiers such as è¥1 è¥” è¥1. . . .. .
Verb: As described by 8, any word which can be placed at the end of a sentence and
which can accept suffixes that indicate masculine, feminine, and
plural forms is classified as a verb. As a result of this property, a word at the end of such a sentence
is expected to be tagged as a verb by an Amharic tagger. A verb expresses the accomplishment of an
action and is used to close the sentence. For example, in the sentence p0 ¨c-ó- #” the word
#” is a verb since it appears at the end of the sentence and closes the meaning of the sentence.
Adjective: Adjectives in a sentence modify nouns to denote the quality of a thing; that is, they
specify to what extent a thing is distinct from something else. Adjectives in Amharic usually
precede the nouns that they modify or describe, qualifying a noun with some form of size, kind
and behavior. For example, in the sentence