ABSTRACT
With the rapid growth of information on the web, the need to retrieve, analyse and understand large amounts of information has increased.
A huge amount of data is generated and consumed by people and machines every second. Access to such a large amount of information often makes it difficult to identify its core idea. Understanding large text documents and finding the crucial information in them is often a laborious and time-consuming task. There is therefore a need for automatic text summarization to capture the salient or core meaning of the original text. Text summarization is the process of distilling the most important information from a source to produce a concise summary.
Text summarization has two broad approaches: extractive and abstractive. Extractive methods aim to select salient phrases, sentences or paragraphs from the text, while abstractive methods focus on generating summaries from scratch without the constraint of reusing phrases from the original text. Abstractive summarization is generally considered better than extractive summarization because abstractive methods can generate novel words in the summary. The majority of work done in abstractive summarization is based on the traditional approach, where the features are manually compiled. Neural network based approaches do not rely on manually compiled features.
Long news articles usually contain a large amount of information. Often, due to lack of time, people cannot read the whole news article. A headline is therefore required in order to get the core idea of the news without reading the whole article. In this work, we create a news headline from a Gujarati news article using a neural network based abstractive summarization method. To achieve this, a pointer-generator model is used.
Chapter 1
Introduction
Text summarization is one of the challenges in natural language processing. It is the process of creating short and concise summaries from a text document, with the aim of providing the most important or salient content of the source document in a condensed form. In this chapter, we describe the motivation of our research, the applications of summarization, the problem statement, and the objective and scope of this work.
Background
Natural language processing (NLP) is a way for computers to analyse, understand, and derive meaning from human language in a smart and useful way. It is concerned with the development of computational models of aspects of human language processing. NLP is a key component of artificial intelligence (AI) and relies on machine learning.
Natural language processing can be used to analyse the parts of a sentence to better understand its grammatical construction. The benefits of natural language processing are numerous. Companies can leverage natural language processing to improve the efficiency of documentation processes, improve the accuracy of documentation, and identify the most pertinent information in large databases.
A summary is a text that is produced from one or more texts, contains a significant portion of the information in the original text, and is no longer than half of the original text.
For a human, making sense of large documents and grasping the crucial information in them is often a laborious and time-consuming task. The need to quickly acquire useful insights from the large corpus of information on the Internet has driven the development of various automated summarization systems. Automatic text summarization is the process of reducing text documents with a computer program in order to create a summary that retains the most important points of the original documents. An automatic summarization system creates short and concise summaries of documents. Text summarization approaches are broadly divided into two groups: extractive and abstractive. Extractive methods assemble summaries by taking sentences or phrases directly from the original document. Abstractive methods focus on generating summaries from scratch without the constraint of reusing phrases from the original text. They can generate novel words or phrases not featured in the source text.
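To make the extractive idea concrete, the following is a minimal sketch of a frequency-based extractive summarizer. It is not the method used in this work; the function names, the scoring rule (average word frequency per sentence) and the simple punctuation-based sentence splitter are all assumptions chosen for illustration. An abstractive system, by contrast, would have to generate new wording rather than select existing sentences.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it on whitespace and basic punctuation."""
    return re.findall(r"[^\s.,!?;:]+", text.lower())


def split_sentences(text):
    """Split on '.', '!' or '?' followed by whitespace (a simplifying assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))  # word frequencies over the whole document

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so that long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    document = "Cats sleep. Cats eat fish and cats play. Dogs bark."
    print(extractive_summary(document, num_sentences=1))
```

Every sentence in the output is copied verbatim from the input, which is why extractive summaries tend to stay grammatical but cannot rephrase or generalize.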
Motivation
In this digital era, the Internet is overloaded with a variety of information. A huge amount of data is generated and consumed by people and machines every second. Online news, legal documents, product reviews, opinions, online question answering forums and various social media posts such as tweets, Facebook posts and blogs are a few examples of data generating sources.
These sources generate billions of documents every day. People do not have enough time to read whole documents or papers. So, it is necessary to extract the significant and relevant information from the whole document.
There are two different approaches for abstractive text summarization: the traditional approach and the neural network based approach. Traditional approaches are rule based and rely mostly on manually compiled features. Neural network approaches do not rely on manually compiled features and are rule independent. The majority of work has been done in extractive text summarization. The extractive method is easier than the abstractive one, because copying large chunks of text from the source document ensures that the generated summary is grammatically correct. To generate high-quality abstractive summaries, paraphrasing, generalization and real-world knowledge are needed. Abstractive text summarization has also been carried out in many languages.
However, to the best of our knowledge, abstractive text summarization has not been applied to the Gujarati language. A long news article contains a lot of information. Due to lack of time, people are unable to read the whole news article. Headlines are useful for reducing the reading and interpretation time needed to get the idea of the entire news article. This motivates our research work in this area.
Problem Definition
A huge amount of information is used across the digital world. Text summarization has become an important and timely tool for helping users quickly understand large volumes of information. It is essential to have automatic text summarization techniques that generate concise summaries.
Two different techniques, extractive and abstractive, are used for text summarization. It is necessary to produce a news headline from a newspaper article. Both extractive and abstractive text summarization techniques have been applied in many different languages.
However, in the Gujarati language, abstractive text summarization has not been applied to news headline generation. So, there is a need for a system which automatically generates the headline from a Gujarati newspaper article using an abstractive method.
Applications
Automatic text summarization has a wide range of applications. In this section, we present some important applications of text summarization, which are as follows.
News Summarization: These summarization systems leverage multi-document summarization techniques in order to summarize the news coming from various sources.
They generate a compact summary which is informative and non-redundant. The methods used can be extractive or abstractive.
Social media Summarization: It deals with the summarization of social media text such as tweets, blogs and community forums. These summarization systems are built with the needs of the user in mind and depend on the genre of social media text. A tweet summarization system will differ from a blog summarization system, as tweets are short and often noisy, while blogs have considerably longer text with a different writing style.
Storylines of event: It deals with identifying and summarizing the events that lead to an event of interest. It helps in providing background information about an event and a structured timeline for an entity. The news collections from various news sources are collected over a period of time. Each of these news collections defines a main event. A graph of events (news collections) is defined using similarity, then the heaviest path ending in a given event is identified and, finally, the events on this path related to salient entities in the target event are summarized to obtain the storyline of the event.
Domain specific summaries: Summarization systems are often used to generate domain specific summaries. These systems are designed in accordance with the needs of the user for a specific domain. For example, legal document summarization deals with generating a summary of legal documents, while medical report summarization aims to generate a summary from a patient's report history such that it includes all important clinical events in chronological order.
Objective and scope
The aim of this work is to create a headline from a newspaper article using abstractive text summarization techniques. The objective is to create a proper headline from a Gujarati newspaper article using a neural network based method of abstractive text summarization and to save the reader’s time and effort in finding the useful information in a detailed news article.
Different methods are used for abstractive text summarization in different languages. The scope of this work is a neural network based abstractive text summarization model for the Gujarati language, for which no such model was found.
Structure of report
This report is organized as follows:
Chapter 1 describes the background, motivation, problem definition, applications, and the objective and scope of the study.
Chapter 2 discusses the different types of summarization based on categories, the different abstractive text summarization methods, related work on abstractive text summarization, and a comparative analysis of different models.
Chapter 3 discusses the proposed work for news headline generation.
Chapter 4 concludes the proposed work and discusses future work.
References lists the research papers used for reference.
Chapter 2
Literature Survey
Various text summarization approaches and techniques are used to generate summaries of documents. In this chapter, we present several existing works on automatic text summarization. Automated text summarization has been an extensively studied problem in the field of Natural Language Understanding. We present, in order, the types of summarization techniques based on different categories, the abstractive text summarization approaches, and related work in abstractive text summarization in detail.
Types of summarization techniques
Different types of summary might be useful in various applications, and summarization systems can be categorized based on these types. The categories are based on the type of generated summary, the type of details, the type of content, the language, and the type of input document.
Summarization categories based on type of generated summary.
Extractive summarization: It is the most common approach for text summarization.
This method extracts important phrases from the source documents and groups them to produce a summary without changing the source text.
Abstractive summarization: This method aims to generate a summary which is closer to what a manually written summary looks like. It builds an internal semantic representation of the text and then uses natural language generation techniques to create a summary from scratch. It may generate novel words and phrases not contained in the source text.
Summarization categories based on types of details.
Indicative summary: An indicative summary is used for a quick view of a lengthy document. It provides only the main idea of the source text, which encourages the user to read the document.
Informative summary: An informative summary serves as a substitute for the original document. It provides the user with concise information from the source document.
Summarization categories based on types of content.
Generic summarization: It is a system which can be used by any type of user, and the summary does not depend on the topic of the document.
Query-based summarization: Query-based summarization works like a question answering system in which the summary is the result of the user’s query. The system picks out only the information which is related to the given query and presents a concise summary to the user. The query can be a phrase, a keyword or a question. Search engines use this kind of summarization to produce snippets for the suggested web pages related to the user’s query.
Summarization categories based on language.
Monolingual summarization: The system takes input text in one language and produces the summary in the same language. For example, a summarization system for the English language.
Multilingual summarization: The system can be used for multiple languages. However, the input text and the generated summary are in the same language.
Cross-lingual summarization: The system can process several languages. However, the output summary is in a different language from the input text. For example, summarization of Gujarati news into English.
Summarization categories based on input document.
Single document summarization: The system takes one document as input and produces a concise summary of that document.
Multi-document summarization: The system takes multiple documents referring to the same theme or topic and produces a single summary of the original multi-document text. Due to the high information redundancy on the Internet, researchers’ interest has shifted towards the problem of multi-document summarization.
Abstractive Text Summarization methods
The aim of abstractive text summarization is to provide a concise summary of a document, which helps the user to quickly understand a large volume of information. There are two approaches for abstractive text summarization: the traditional approach and the neural network based approach.
[Figure: Classification of abstractive text summarization into the traditional approach (structure based approach and semantic based approach) and the neural network based approach.]
Traditional approach for abstractive text summarization:
Traditional approaches are rule based and rely mostly on manually compiled features. They require extensive effort and domain understanding in order to come up with such human compiled features. They can be broadly classified into two categories, namely structure based and semantic based.
Structure based approach
Structure based abstractive summarization generates an abstractive summary by populating the prominent sentences into a predefined structure without losing their meaning. The predefined structures used are templates, tree based structures, ontology based structures, lead and body phrase structures and rule based structures.
In the template based method, the extracted information is populated into a template to generate the final summary. The tree based method extracts similar sentences from the source text with the help of a parser and populates them into a tree structure which follows the predicate-argument structure.
The ontology based method pre-processes the source text, extracts the required keywords from it and maps them to concepts and relations with the help of a predefined ontology, which is eventually converted into a meaningful summary. The lead and body phrase method focuses on revising the lead sentences by either substituting or inserting information-rich similar phrases from the body, which are called triggers. The lead phrase is substituted by the body phrase if the body phrase has high syntactic similarity with the lead phrase, provided it carries richer information than the lead phrase. The rule based method uses rules and categories to represent the document summaries. Rules are fed into this module to obtain the required meaningful candidates, from which the best candidate is selected and passed to the summary generation module to generate the summary using a generation pattern.
Semantic based approach
Semantic based abstractive summarization involves feeding the semantic representation of the document to a natural language generation module in order to obtain the desired summary.
The different methods used in this approach are the multimodal semantic model, the information item based method, and the semantic graph based method.
The multimodal semantic model constructs a semantic model by making use of concepts and finding the relations between these concepts with the help of an ontology. The next stage involves identifying the important concepts by using information density metrics. In the final stage, the summary is generated from these important concepts. The information item based method first identifies informative items by performing syntactic analysis on the text. Sentences are generated from these items by following the subject-verb-object structure using a sentence generator. The generated sentences are then ranked based on the average Document Frequency score.
From this list, the highly ranked sentences are taken to create the summary. Finally, the semantic graph approach consists of three phases: (1) representing the entire document by a Rich Semantic Graph (RSG), (2) applying heuristic rules to reduce the complexity of the semantic graph, and (3) generating the abstractive summary from the reduced graph.
Neural network based abstractive text summarization
Neural network approaches do not rely on manually compiled features and are rule independent. The advent of deep neural networks has driven the success of useful abstractive summarization systems.
Recently, considerable effort has been made to model the abstractive summarization task using deep neural network architectures. Neural abstractive summarization systems use an encoder-decoder architecture. The encoder captures the meaning of the source sequence in a continuous vector, from which the decoder generates the target summary.
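The data flow of such an encoder-decoder can be illustrated with a small toy example. The sketch below is only an assumption-laden illustration: the class name ToyEncoderDecoder, the plain tanh recurrent cell, the one-hot inputs and the greedy decoding loop are all hypothetical choices, and the weights are random and untrained, so the generated tokens are meaningless. It is not the pointer-generator model used in this work; it only shows how a source sequence is compressed into one continuous vector and how a decoder unrolls a target sequence from that vector.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix as a list of rows (untrained, illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(w_in, w_hid, x, h):
    """One tanh recurrent step: h' = tanh(W_in x + W_hid h)."""
    a, b = matvec(w_in, x), matvec(w_hid, h)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


class ToyEncoderDecoder:
    """Hypothetical minimal encoder-decoder with random weights."""

    def __init__(self, vocab, hidden_size=8, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab
        self.index = {tok: i for i, tok in enumerate(vocab)}
        self.hidden_size = hidden_size
        v = len(vocab)
        self.enc_in = make_matrix(hidden_size, v, rng)
        self.enc_hid = make_matrix(hidden_size, hidden_size, rng)
        self.dec_in = make_matrix(hidden_size, v, rng)
        self.dec_hid = make_matrix(hidden_size, hidden_size, rng)
        self.out = make_matrix(v, hidden_size, rng)

    def encode(self, tokens):
        """Read the source tokens and return the final hidden state (context vector)."""
        h = [0.0] * self.hidden_size
        for tok in tokens:
            x = one_hot(self.index[tok], len(self.vocab))
            h = rnn_step(self.enc_in, self.enc_hid, x, h)
        return h

    def decode(self, context, max_len=5):
        """Greedily generate tokens, starting from the context vector."""
        h, prev, output = context, self.index["<s>"], []
        for _ in range(max_len):
            x = one_hot(prev, len(self.vocab))
            h = rnn_step(self.dec_in, self.dec_hid, x, h)
            scores = matvec(self.out, h)  # one score per vocabulary token
            prev = max(range(len(scores)), key=scores.__getitem__)
            if self.vocab[prev] == "</s>":  # end-of-sequence marker
                break
            output.append(self.vocab[prev])
        return output


if __name__ == "__main__":
    vocab = ["<s>", "</s>", "rain", "floods", "city", "roads"]
    model = ToyEncoderDecoder(vocab)
    context = model.encode(["rain", "floods", "city", "roads"])
    print(len(context), model.decode(context))
```

In a real system the weights would be learned from article and headline pairs, and the decoder would usually attend over all encoder states instead of relying on a single fixed vector.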