Automatic Text Summarization and Question-Answer Generation Using Deep Learning Techniques
Abstract
Automatic text summarization and question-answer generation are integral components of natural
language processing (NLP) that facilitate efficient information retrieval and enhance educational
tools. Automatic text summarization techniques aim to capture the essence of a document, article,
or passage and provide a condensed version that highlights the key points and main ideas. The
ever-growing ocean of text can be overwhelming, making text summarization a crucial tool for
navigating information efficiently. The question-answer generation approach involves generating
questions and their corresponding answers from a user-given source text or knowledge base by
directly selecting and rephrasing existing sentences or phrases. This approach is beneficial when
the goal is to generate questions and answers quickly and accurately from existing content, and the
answers are readily available in the source material, such as factual questions. This thesis
investigates the utilization of deep learning techniques to advance the capabilities of these tasks,
focusing on developing, implementing, and evaluating novel models and methodologies.
The research work presented in this thesis provides a framework for question-answer generation
and summarization. A system has also been developed that generates and summarizes question-answers using deep learning techniques. The developed system effectively addresses the challenges posed by the ever-expanding volume of textual data. The question-answer generation model produces diverse question-answer pairs, including subjective and objective-type questions, from a given text. The questions generated by our approach are grammatically and
contextually correct, and the answers generated match the questions in the textual context. A
query-based answer summarization system has been proposed for question-answer summarization.
The query-focused answer summarization model produces a summarized answer relevant to the
given query question. This approach saves a significant amount of user time by tailoring the
summary to answer the user’s query directly rather than condensing the entire document.
The study begins by reviewing the evolution of text summarization and question-answer
generation, highlighting the transition from traditional rule-based approaches to contemporary
deep learning models. A systematic taxonomy for text summarization and question-answer
generation is presented, providing a structured framework for categorizing and comprehending the multifaceted approaches and techniques, along with the nature of their inputs and outputs.
Central to this research are advanced architectures such as sequence-to-sequence models,
transformers, and attention mechanisms, which have revolutionized the field by improving the
coherence and relevance of generated summaries and question-answers. Additionally, the thesis
investigates the current landscape of available tools in the field and discusses the publicly available datasets for conducting research in the corresponding domains. The latest research studies and commonly used evaluation metrics are reviewed, and research gaps are identified. To bridge these gaps, this research presents a framework for question-answer generation and summarization. Automatic question-answer generation and summarization facilitate extracting relevant information and insights from extensive textual content, enhancing accessibility and comprehension for users.
Automatic question-answer generation greatly benefits users by saving time, reinforcing core concepts through repetition, and motivating learners to engage in learning activities. The
question-answer summarization framework helps those who urgently need information by
providing the user with condensed relevant information in real-time while minimizing redundancy,
thus enhancing the user experience. The framework outlines the various phases and sub-phases involved in generating and summarizing question-answers, along with the inputs and outputs of these phases. The work also specifies the approaches, models, and datasets used in the framework phases for training or fine-tuning the computationally intensive architectures. The methodologies employed in
this thesis include the application of pre-trained language models like T5, BART, PEGASUS, and
GPT for optimizing generation quality and the training of models on large-scale datasets such as
Stanford Question Answering Dataset (SQuAD), Question Answering in Context (QuAC), and
Boolean Questions (BoolQ) for question-answer generation and Quora question pairs dataset,
Microsoft Machine Reading Comprehension (MS-MARCO) dataset, and CNN/DailyMail dataset
for summarized answer generation. It has been found that the system outperforms the existing
baseline question-answer generation models on the BLEU-4 and METEOR evaluation metrics, with scores of 18.87 and 25.24, respectively. This question-answer generation system acts as a
one-stop destination for generating subjective and objective-type questions and is capable of
generating fill-in-the-blank, multiple-choice, Boolean, and long/short answers. As an outcome, it automates the persistent supply of question-answers needed by tutors and self-evaluators, enabling users to save effort, resources, and time.
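The reported BLEU-4 score rests on modified n-gram precision between a generated question and a reference. The following is a minimal, single-reference sketch in pure Python for illustration only; it is not the evaluation code used in the thesis, and standard evaluations typically use smoothed, multi-reference implementations (e.g. NLTK's `corpus_bleu`).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, reference):
    """Simplified single-reference BLEU-4: geometric mean of modified
    1- to 4-gram precisions, scaled by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference
        # ("modified" precision: a repeated word cannot inflate the score).
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # A tiny floor keeps the geometric mean defined when an order has no matches.
        log_prec_sum += math.log(max(overlap, 1e-9) / total)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / 4)
```

A perfect match scores 1.0, and the score falls as higher-order n-gram overlap disappears, which is why BLEU-4 rewards fluent, contextually faithful phrasings rather than mere word overlap.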
The question-answer summarization model produces a summarized answer relevant to the given
query question. A query-focused answer summarization architecture utilizing a keyword extraction
mechanism (QFAS-KE) is presented for this model. QFAS-KE is a four-phase framework.
The first phase normalizes the input text by eliminating irrelevant details. The second phase
retrieves semantically similar questions to the asked query, the third phase extracts candidate
answers relevant to the query question, and the fourth phase generates a summary of selected
candidate answers. A BERT-based bi-encoder and cross-encoder Siamese architecture has been utilized with FAISS indexing to measure semantic similarity between queries and questions and between questions and answers. For answer summarization, BART, T5, and PEGASUS have been fine-tuned on summarization datasets with keyword guidance from a keyword extractor such as KeyBERT. QFAS-KE (BART) outperforms the baseline models on ROUGE-1, ROUGE-2, and ROUGE-L with scores of 46.2%, 24.8%, and 42.3%,
respectively. QFAS-KE (PEGASUS) achieves superior results compared to the baseline models
in ROUGE-1 and ROUGE-2. QFAS-KE (T5) surpasses baseline models, demonstrating the best
performance in ROUGE-1 and ROUGE-L. The results indicate significant improvements in both
summarization and question-answer generation tasks, with models producing more concise and
accurate summaries and generating questions that closely align with human-crafted ones.
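The second and third phases of QFAS-KE amount to nearest-neighbour search in an embedding space. The sketch below illustrates only that retrieval step, under loudly stated assumptions: the 3-dimensional vectors, the `retrieve` helper, and the question ids are hypothetical stand-ins for BERT bi-encoder embeddings, and the plain sorted scan stands in for FAISS indexing at scale.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_vec, question_index, top_k=2):
    """Rank stored question embeddings by cosine similarity to the query
    embedding and return the top-k question ids -- the role FAISS plays
    at scale in QFAS-KE's question-retrieval phase."""
    scored = sorted(question_index.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [qid for qid, _ in scored[:top_k]]

# Hypothetical 3-d embeddings standing in for bi-encoder outputs.
index = {
    "q1": [0.9, 0.1, 0.0],
    "q2": [0.1, 0.9, 0.1],
    "q3": [0.8, 0.2, 0.1],
}
```

In the full framework, the cheap bi-encoder pass shown here shortlists candidates, after which the more expensive cross-encoder re-scores each query-candidate pair jointly for precision.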
The future scope of this work lies in exploring additional modalities to extend the proposed system's applicability and effectiveness to multimodal information comprehension, and in customizing the models for specific domains such as healthcare and finance. The
findings and methodologies presented in this thesis provide a foundation for future research and
development, aiming to make these technologies more robust, versatile, and widely applicable.
