Automatic Text Summarization and Question-Answer Generation Using Deep Learning Techniques
Abstract
Automatic text summarization and question-answer generation are integral components of natural
language processing (NLP) that facilitate efficient information retrieval and enhance educational
tools. Automatic text summarization techniques aim to capture the essence of a document, article,
or passage and provide a condensed version that highlights the key points and main ideas. The
ever-growing ocean of text can be overwhelming, making text summarization a crucial tool for
navigating information efficiently. The question-answer generation approach involves generating
questions and their corresponding answers from a user-given source text or knowledge base by
directly selecting and rephrasing existing sentences or phrases. This approach is beneficial when
the goal is to generate questions and answers quickly and accurately from existing content, and the
answers are readily available in the source material, such as factual questions. This thesis
investigates the utilization of deep learning techniques to advance the capabilities of these tasks,
focusing on developing, implementing, and evaluating novel models and methodologies.
The research work presented in this thesis provides a framework for question-answer generation
and summarization. A system has also been developed that generates and summarizes question-answers using deep learning techniques. The developed system effectively addresses the challenges posed by the ever-expanding volume of textual data. The question-answer generation model produces diverse question-answer pairs, including subjective and objective-type questions, from a given text. The questions generated by our approach are grammatically and
contextually correct, and the answers generated match the questions in the textual context. A
query-based answer summarization system has been proposed for question-answer summarization.
The query-focused answer summarization model produces a summarized answer relevant to the
given query question. This approach saves a significant amount of user time by tailoring the
summary to answer the user’s query directly rather than condensing the entire document.
The study begins by reviewing the evolution of text summarization and question-answer
generation, highlighting the transition from traditional rule-based approaches to contemporary
deep learning models. A systematic taxonomy for text summarization and question-answer
generation is presented, providing a structured framework for categorizing and comprehending the multifaceted approaches and techniques, along with the nature of their inputs and outputs.
Central to this research are advanced architectures such as sequence-to-sequence models,
transformers, and attention mechanisms, which have revolutionized the field by improving the
coherence and relevance of generated summaries and question-answers. Additionally, the thesis
investigates the current landscape of available tools in the field and discusses the publicly available datasets for conducting research in the corresponding domains. The latest research studies and commonly used evaluation metrics are reviewed, and research gaps are identified. To bridge these gaps, this research presents a framework for question-answer generation and summarization. Automatic question-answer generation and summarization facilitate extracting relevant information and insights from extensive textual content, enhancing accessibility and comprehension for users.
Automatic question-answer generation greatly benefits users by saving time, reinforcing core concepts through repetition, and motivating learners to engage in learning activities. The
question-answer summarization framework helps those who urgently need information by
providing the user with condensed relevant information in real-time while minimizing redundancy,
thus enhancing the user experience. The framework outlines the various phases and sub-phases involved in generating and summarizing question-answers, along with the inputs and outputs of these phases. The work also specifies the approaches, models, and datasets used in the framework phases for training or fine-tuning the computationally intensive architectures. The methodologies employed in
this thesis include the application of pre-trained language models like T5, BART, PEGASUS, and
GPT for optimizing generation quality and the training of models on large-scale datasets such as
Stanford Question Answering Dataset (SQuAD), Question Answering in Context (QuAC), and
Boolean Questions (BoolQ) for question-answer generation and Quora question pairs dataset,
Microsoft Machine Reading Comprehension (MS-MARCO) dataset, and CNN/DailyMail dataset
for summarized answer generation. It has been found that the system outperforms the existing
baseline question-answer generation models on the BLEU-4 and METEOR evaluation metrics, with scores of 18.87 and 25.24, respectively. This question-answer generation system acts as a
one-stop destination for generating subjective and objective-type questions and is capable of
generating fill-in-the-blank, multiple-choice, Boolean, and long/short answers. As an outcome, it automates the persistent supply of question-answers needed by tutors and self-evaluators, enabling users to save effort, resources, and time.
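The reported BLEU-4 score rests on modified n-gram precision between a generated question and a reference. The following is a minimal, single-reference sketch in pure Python for illustration only; it is not the evaluation code used in the thesis, and standard evaluations typically use smoothed, multi-reference implementations (e.g. NLTK's `corpus_bleu`).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, reference):
    """Simplified single-reference BLEU-4: geometric mean of modified
    1- to 4-gram precisions, scaled by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference
        # ("modified" precision: a repeated word cannot inflate the score).
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # A tiny floor keeps the geometric mean defined when an order has no matches.
        log_prec_sum += math.log(max(overlap, 1e-9) / total)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / 4)
```

A perfect match scores 1.0, and the score falls as higher-order n-gram overlap disappears, which is why BLEU-4 rewards fluent, contextually faithful phrasings rather than mere word overlap.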
The question-answer summarization model produces a summarized answer relevant to the given
query question. A query-focused answer summarization architecture utilizing a keyword extraction
mechanism (QFAS-KE) is presented for this model. QFAS-KE is a four-phase framework.
The first phase normalizes the input text by eliminating irrelevant details. The second phase
retrieves semantically similar questions to the asked query, the third phase extracts candidate
answers relevant to the query question, and the fourth phase generates a summary of selected
candidate answers. A BERT-based bi-encoder and cross-encoder Siamese architecture has been utilized with FAISS indexing to measure semantic similarity between queries and questions and between questions and answers. For answer summarization, BART, T5, and PEGASUS have been fine-tuned on summarization datasets with keyword guidance from a keyword extractor such as KeyBERT. QFAS-KE (BART) outperforms the baseline models on ROUGE-1, ROUGE-2, and ROUGE-L with scores of 46.2%, 24.8%, and 42.3%,
respectively. QFAS-KE (PEGASUS) achieves superior results compared to the baseline models
in ROUGE-1 and ROUGE-2. QFAS-KE (T5) surpasses baseline models, demonstrating the best
performance in ROUGE-1 and ROUGE-L. The results indicate significant improvements in both
summarization and question-answer generation tasks, with models producing more concise and
accurate summaries and generating questions that closely align with human-crafted ones.
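The second and third phases of QFAS-KE amount to nearest-neighbour search in an embedding space. The sketch below illustrates only that retrieval step, under loudly stated assumptions: the 3-dimensional vectors, the `retrieve` helper, and the question ids are hypothetical stand-ins for BERT bi-encoder embeddings, and the plain sorted scan stands in for FAISS indexing at scale.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_vec, question_index, top_k=2):
    """Rank stored question embeddings by cosine similarity to the query
    embedding and return the top-k question ids -- the role FAISS plays
    at scale in QFAS-KE's question-retrieval phase."""
    scored = sorted(question_index.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [qid for qid, _ in scored[:top_k]]

# Hypothetical 3-d embeddings standing in for bi-encoder outputs.
index = {
    "q1": [0.9, 0.1, 0.0],
    "q2": [0.1, 0.9, 0.1],
    "q3": [0.8, 0.2, 0.1],
}
```

In the full framework, the cheap bi-encoder pass shown here shortlists candidates, after which the more expensive cross-encoder re-scores each query-candidate pair jointly for precision.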
The future scope of this work lies in exploring additional modalities to extend the proposed system's applicability and effectiveness to multimodal information comprehension, and in customizing the models for specific domains such as healthcare and finance. The
findings and methodologies presented in this thesis provide a foundation for future research and
development, aiming to make these technologies more robust, versatile, and widely applicable.
