Please use this identifier to cite or link to this item: http://hdl.handle.net/10266/5885
Title: Universal Networking Language Based Question Answering System for Information Retrieval of Punjabi Language
Authors: Agarwal, Vaibhav
Supervisor: Kumar, Parteek
Keywords: UNL;Question Answering System;Punjabi Language;Information Retrieval
Issue Date: 24-Oct-2019
Abstract: In recent years, Question Answering (QA) systems and UNL have been areas of intense research in Natural Language Processing. This thesis is devoted to developing a language-independent Question Answering System (QAS) for the Punjabi language based on UNL. A chapter-wise summary of the thesis follows. The first chapter provides an overview of QA systems, including their need, challenges, and classification, and the significance of UNL for QA systems. It also covers the basics of UNL and its building blocks, and highlights the differences between UNL and traditional approaches and its advantages over them. Unlike traditional approaches and techniques in natural language processing, the scope and use of UNL are not limited to one domain. The chapter covers how UNL can be exploited for other NLP tasks and explains the working principle of UNL for a QA system. Based on an analysis and survey of various QA systems, several gaps were identified and objectives were set to fill them. The gaps identified are: non-availability of a QA system for the Punjabi language, lack of integration with other QA systems, lack of support for multilingualism, lack of integration of different NLP applications, and complexity. To address these gaps, the core objectives of this research have been framed and accomplished. The chapter also highlights the contributions of this thesis. The second chapter covers background theory and the literature review. It focuses on the analysis and study of various QA systems, a comparison of some important QA systems, and a study of various UNL-based activities. The literature review is divided into two parts: research activities in question answering systems, and research activities in UNL. 
The parameters used to compare the QA systems are: corpus used for testing, evaluation metrics, evaluation metric values, domain, whether the working algorithms of all components are explained, whether the system is available online, whether it supports multiple languages, whether it can be extended to support other foreign languages, whether it can integrate other NLP applications, and whether its source code is available. Some of the important QA systems covered in this subsection are MMQA developed by Gupta et al. (2018), EARL developed by Mohnish et al. (2018), CQASMD developed by Feng et al. (2018), QA4IE developed by Lin et al. (2018), Web Shodh developed by Chandu et al. (2017), an automatic question answering system for the Arabic Quran developed by Mohamed (2017), FelisCatusZero proposed by Kotaro et al. (2017), a semantic network theory to build up an intelligent answering system based on a remote service framework designed by Xiaoyi et al. (2017), a question answering system supporting vector machine method for hadith domain developed by Nabeel and Saidah (2017), a Wikipedia Based Essay Question Answering System for University Entrance Examination proposed by Takaaki et al. 
(2017), an automated QA system using a hybrid approach proposed by Kwong and Chih (2017), a design of an intelligent tourism QA system based on the semantic web proposed by Hua and Shi-zheng (2017), an automatic web-based question answering system for e-learning developed by Waheeb and Babu (2017), a public platform for developing language-independent applications developed and tested by Agarwal and Kumar (2017), an information retrieval system using UNL by Goel (2016), a multilingual cross-domain client application prototype for UNLization and NLization for NLP applications developed by Agarwal and Kumar (2016), a modular QA system pipeline called YodaQA developed by Baudiš (2015), the use of the vector space model in a Question Answering System proposed by Hartawan and Suhartono (2015), a Long Short-Term Memory Model for Answer Sentence Selection in Question Answering proposed by Wang and Nyberg (2015), a Question Answering system using learning knowledge graphs through conversational dialog proposed by Ben et al. (2015), a Hybrid QA system (ISOF) over linked data and text data developed by Park et al. (2015), an Answer Selection for community Question Answering (QCRI) developed by Nicosia et al. (2015), CICBUAPnlp, which is a Graph-Based Approach for Answer Selection in Community Question Answering Task proposed by Helena et al. (2015), CASIA@V2, which is an MLN-based Question Answering System over Linked Data developed by Shizhu et al. (2014), Al-Bayan, which is an Arabic QA system for the Holy Quran developed by Heba et al. (2014), Forst, which is a QA system using basic element at NTCIR-11 QA-Lab Task developed by Kotaro et al. (2014), a Knowledge Based QA as Machine Translation developed by Junwei et al. (2014), CMU Multiple-choice Question Answering System developed by Di et al. 
(2014), a natural language QA system in Malayalam using a domain-dependent document collection as repository developed by Pragisha and Reghuraj (2014), a QA system called Watsonsim using the Indri, Lucene, Bing and Google search engines, Apache UIMA, OpenNLP, and Weka developed by Sean et al. (2014), an architecture of a Question-Answering System for a Specific Repository of documents proposed by Manuel and Riofrio (2010), and a ‘LOOK4’ system using Universal Words (UWs) to enhance web search results proposed by Avetisyan and Avetisyan (2010). The subsection on research activities in UNL highlights the major research activities in UNL. Some of the important activities covered are an English to Tamil machine translation system using UNL by Sridhar et al. (2016), Development of dictionary entries of Bangla repetition words to integrate them into UNL by Roy et al. (2016), Formation of word dictionary of Bangla vowel ended roots for first person for UNL by Ali et al. (2015), Multilingual acquiring of e-content definition based on UNL by Sathiyamurthy et al. (2015), creation of a Language-Independent Discourse Parser using UNL by Navaneethakrishnan et al. (2015), a new approach of solving the semantic ambiguity problem of Bangla Root words using UNL proposed by Mridha et al. (2014), Development of Analysis Module for Punjabi language by Agarwal (2013), Development of Generation Module for Punjabi language by Verma and Bhatia (2013); Singh and Bhatia (2013), Development of Punjabi EnConverter and DeConverter by Kumar and Sharma (2012, 2013), Enhancement of web search results through UNL by Avetisyan and Avetisyan (2010), Development of English EnConverter and DeConverter by Jain and Damani (2009), Development of Arabic DeConversion system by Adly and Alansary (2009), Multilingual search engine with the use of UNL proposed by Karande (2007), and Language-Independent Universal Digital Library within UNL framework proposed by Alansary et al. (2006). 
An exhaustive survey of various QA systems indicates that the proposed UNL-based QA system, which is available online, would be a major step in NLP and toward removing the language barrier. The third chapter discusses the architecture and working of the proposed question answering system. The requirements and functionalities of all the architecture components have been documented. This chapter also describes the interface of the developed question answering system along with the technology and programming language used. Towards the end of the chapter, the data structures, i.e., the JSON objects formed during the initial phase of the question answering system, are also highlighted. These data structures are then used by the different modules of the developed question answering system to give the final result. This chapter sets the expectations for, and gives a high-level overview of, everything covered in the subsequent chapters of this thesis. It lays the foundation for the subsequent chapters and the thesis organization. The architecture and working of the proposed QA system are divided into four phases: the UNLization phase (Analysis Module), the Preprocessing and Crawling phase (UNL Crawler), the Optimizing and Ranking phase (Optimizer), and the NLization phase (Generation Module). The analysis module of the proposed question answering system invokes the UNLization module of the source natural language to UNLize the question asked by the user and the corpus. The UNL corpus forms the UNL repository. The UNL crawler crawls the UNL of the question and the UNL corpus to find the answer. The optimizer analyses the answer given by the UNL crawler and assigns it a rank. This answer is converted to UNL and given as input to the generation module of the proposed question answering system, which invokes the NLization module of the target natural language to produce the final answer in that language. 
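The four phases described above can be sketched as a simple chain of calls. This is only an illustrative sketch and not the thesis's implementation: the class name `QAPipeline`, the field names, the `[UNL]` tag, and the toy stand-in functions are all assumptions made for the example, and none of them reflects the real IAN or EUGENE interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class QAPipeline:
    """Hypothetical wiring of the four phases named in the abstract."""
    unlize: Callable[[str], str]                     # Analysis Module: natural language -> UNL
    crawl: Callable[[str, List[str]], str]           # UNL Crawler: question UNL + repository -> candidate
    optimize: Callable[[str, str], Tuple[int, str]]  # Optimizer: -> (rank, answer in UNL)
    nlize: Callable[[str], str]                      # Generation Module: UNL -> natural language

    def answer(self, question: str, corpus: List[str]) -> Tuple[int, str]:
        question_unl = self.unlize(question)                         # UNLization phase
        repository = [self.unlize(sentence) for sentence in corpus]  # UNL repository
        candidate = self.crawl(question_unl, repository)             # crawling phase
        rank, answer_unl = self.optimize(question_unl, candidate)    # ranking phase
        return rank, self.nlize(answer_unl)                          # NLization phase


# Toy stand-ins so the sketch runs; none of these is the real IAN or EUGENE.
TAG = "[UNL]"
pipeline = QAPipeline(
    unlize=lambda text: TAG + text,
    crawl=lambda question_unl, repository: repository[0] if repository else "",
    optimize=lambda question_unl, candidate: (1 if candidate else 5, candidate),
    nlize=lambda unl: unl[len(TAG):] if unl.startswith(TAG) else unl,
)

result = pipeline.answer("toy question", ["toy answer sentence"])
print(result)
```

Because each phase is passed in as a plain function, the UNLization and NLization steps for a different source or target language could in principle be swapped without touching the crawler or optimizer, which is the language-independence property the abstract describes.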
In the proposed UNL-based question answering system, a user can ask a question in any natural language and get the output in any natural language. This is possible because the proposed system converts natural language to UNL and works on the generated UNL. Similarly, the output given by the optimizer is converted to UNL and passed to the generation module (discussed in Chapter 6), which gives the final output in the target natural language. This feature makes the proposed question answering system natural language independent. However, to UNLize the corpus and the question asked by the user, the UNLization module of the source language (in which the question is asked) needs to be developed and invoked by the analysis module. UNLization is done with the online tool IAN (Interactive ANalyzer), whereas NLization is done using EUGENE (dEep-to-sUrface GENErator). Both IAN and EUGENE have been developed by the UNDL Foundation and are available at http://dev.undlfoundation.org/analysis/login.jsp. The fourth chapter focuses on the UNLization process and the results of the UNLization module for the Punjabi language. It also illustrates how the UNLization module of the source natural language is invoked by the analysis module of the proposed system. The chapter begins by explaining the framework of the IAN tool, which is used for UNLization. It then outlines the phases of the UNLization process, each of which is explained with the help of example sentences. UNLization artifacts such as Normalization, TRules, DRules, and the Analysis Grammar are documented in this chapter. The chapter introduces X-Bar theory and its need in UNLization, and illustrates how UNLization is done using it. The working of the developed UNLization module using IAN is illustrated with the help of example sentences. 
Towards the end of this section, a brief introduction about EUGENE is also given so that the configuration and invoking of UNLization and NLization modules should be clear to the reader. This chapter highlights the use of UNLization and NLization module from the developed question answering system perspective and gives details about how analysis and generation modules of the proposed question answering system can be used to invoke UNLization and NLization module. It gives details about the prerequisites and other necessary steps for configuration. After explaining about IAN, analysis, and UNLization modules, the evaluation metrics and results of the UNLization module of Punjabi language have been described in this chapter. The same section also gives details of the corpora that are used for testing the UNLization module. This chapter highlights the types of errors which exist in the UNLization modules due to which F-Measure becomes less than 1. Details of how to calculate those errors have also been explained in this chapter. The achievements and contributions of the developed UNLization module for the Punjabi language have been highlighted in this chapter. The UNLization module had been submitted for UNL Olympiad II, III, and IV conducted by UNDL foundation in July 2013, March 2014, and November 2014 for UC-A1, UGO-A1, and AESOP-A1 respectively. UC-A1, UGO-A1, and AESOP-A1 are the corpora provided by UNDL foundation. The language selected for the proposed question answering system i.e., Punjabi, had been selected in top 5 (Based upon the F-Measures) UNLization grammars for Olympiad III, and Olympiad IV while it was selected in top 10 best grammars for Olympiad II. The current updated UNLization module for the Punjabi language has 1798 dictionary entries, 24 NRules, 50 DRules, and 1259 TRules. In fifth chapter, the detailed information of UNL crawler and optimizer has been documented. 
The chapter first explains the concept of preprocessing with the help of an example sentence. It also documents the crawling, optimizing, and ranking processes, and explains their pseudocode. In the proposed system, the questions put to the developed question answering system are categorized into three types: missing type questions, polar (Yes/No type) questions, and non-missing type questions. This chapter explains these question types with the help of example sentences. After the optimizing phase, in the ranking phase, the optimizer gives rank 1, rank 2, rank 3, rank 4, or rank 5 to the answer found by the UNL crawler. The ranking mechanism used by the optimizer is explained for every case with the help of example sentences. If all the information in the question’s UNL is present in the corpus’s UNL and the answer given by the UNL crawler is not blank, then the optimizer reports “it’s a perfect match. Your answer is:”. If not all the information in the question’s UNL is present in the corpus’s UNL and the answer is not blank, then the optimizer reports “It’s ALMOST a perfect match. We cannot find some information in Database”. If there is a mismatch between the information in the question’s UNL and the information present in the corpus’s UNL and the answer given by the UNL crawler is not blank, then the optimizer reports “It’s a partial match. The most probable answer is”. If all the information in the question’s UNL is present in the corpus’s UNL but the answer given by the UNL crawler is blank, then the optimizer reports “Your question is correct but we are sorry because our database does not contain sufficient information to answer this”. If the answer given by the UNL crawler is blank, then the optimizer reports “Cannot find the answer”. 
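The five ranking cases above can be read as a small decision table. The sketch below is one possible reading and not the thesis's pseudocode: the `Coverage` enum, the function name `rank_answer`, the assignment of rank numbers 1 to 5 in the order the cases are listed, and the treatment of the last case as a fallback for any other blank answer are all assumptions. Only the message strings are taken from the abstract.

```python
from enum import Enum
from typing import Tuple


class Coverage(Enum):
    """Hypothetical summary of how the question's UNL compares with the corpus's UNL."""
    ALL_PRESENT = "all question information found in corpus"
    SOME_MISSING = "not all question information found in corpus"
    MISMATCH = "question information conflicts with corpus"


def rank_answer(coverage: Coverage, answer: str) -> Tuple[int, str]:
    """Return (rank, message); rank numbering follows the listing order (an assumption)."""
    has_answer = bool(answer and answer.strip())
    if has_answer:
        if coverage is Coverage.ALL_PRESENT:
            return 1, "It's a perfect match. Your answer is:"
        if coverage is Coverage.SOME_MISSING:
            return 2, "It's ALMOST a perfect match. We cannot find some information in Database"
        return 3, "It's a partial match. The most probable answer is"
    if coverage is Coverage.ALL_PRESENT:
        return 4, ("Your question is correct but we are sorry because our database "
                   "does not contain sufficient information to answer this")
    # Assumed fallback: any other case with a blank answer.
    return 5, "Cannot find the answer"


for coverage, answer in [(Coverage.ALL_PRESENT, "answer"), (Coverage.MISMATCH, "")]:
    rank, message = rank_answer(coverage, answer)
    print(rank, message)
```

How `Coverage` would actually be computed from two UNL graphs is deliberately left out here, since the abstract does not describe it.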
The optimizer also ensures that if the user did not ask any question, no further processing is done: it stops execution and the search immediately and alerts the user with “Sorry we cannot find any question”. The sixth chapter documents the generation and NLization modules for the Punjabi language in detail. This chapter describes the framework of the EUGENE tool, which is used for NLization. The working of the developed NLization module using EUGENE is illustrated with the help of example sentences. The chapter highlights the use of the NLization module from the perspective of the developed question answering system and details how the generation module of the proposed system can be used to invoke the NLization module of the target natural language. After explaining EUGENE and the generation and NLization modules, the chapter discusses the state of the art of the NLization module for the Punjabi language. The achievements and contributions of the developed NLization module for the Punjabi language are highlighted in this chapter. The language selected for the proposed QA system, i.e., Punjabi, was selected among the top 5 NLization grammars (based on F-Measure) for Olympiad III and Olympiad IV, conducted by the UNDL Foundation in March 2014 and November 2014. The current updated NLization module for the Punjabi language has 704 UWs, 462 TRules, and 10 inflectional paradigms. The seventh chapter documents the experimentation and evaluation of the proposed question answering system in detail. This chapter gives the details of the corpora used for testing the developed question answering system. It introduces the evaluation metrics, viz. Conciseness, Relevance, Correctness, Precision, Recall, and F-Measure, used for evaluating the developed question answering system. The total number of questions and their types are also given in this chapter. 
The Conciseness, Relevance, Correctness, Precision, Recall, and F-Measure of the developed question answering system came out to be 89.5%, 86.4%, 100%, 86.4%, 100%, and 92.7% respectively. The analysis/comparison of the developed question answering system on the basis of F-Measure/accuracy has been done with different question answering systems. The error analysis of Relevance and Conciseness has also been performed and explained in this chapter. The testing methodology and example questions along with their answers have also been listed in this chapter. Chapter eight presents the conclusion and future scope of this research work. Limitations of the developed QA system have also been documented in this chapter. As a part of this PhD thesis, various gaps were identified in the existing question answering systems and objectives were framed to address these identified gaps. A framework for UNL based question answering system has been proposed to meet all the objectives. Based on this framework, the proposed question answering system has been developed and tested for Punjabi language. The Punjabi language resources for UNLization and NLization modules have been created. A public platform for developing UNL based language-independent applications has been developed and tested. The analysis and generation modules of this platform invoke IAN and EUGENE module of the source and target natural language (Punjabi in this case) for UNLization and NLization respectively. IAN/UNLization and EUGENE/NLization modules have been developed for Punjabi natural language. UNDL foundation had conducted a series of Olympiads over the years. The UNLization/ IAN module had been submitted for UNL Olympiad II, III, and IV conducted by UNDL Foundation in July 2013, March 2014, and November 2014 for UC-A1, UGO-A1, and AESOP-A1 respectively. 
The language selected for the proposed question answering system, i.e., Punjabi, was selected among the top 5 UNLization grammars (based on F-Measure) for Olympiad III and Olympiad IV, and among the top 10 grammars for Olympiad II. The current updated UNLization/IAN module for the Punjabi language has 1798 dictionary entries, 24 NRules, 50 DRules, and 1259 TRules. Punjabi was also selected among the top 5 NLization grammars (based on F-Measure) for Olympiad III and Olympiad IV, conducted by the UNDL Foundation in March 2014 and November 2014. The current updated NLization/EUGENE module for the Punjabi language has 704 UWs, 462 TRules, and 10 inflectional paradigms. The Conciseness, Relevance, Correctness, Precision, Recall, and F-Measure of the developed question answering system came out to be 89.5%, 86.4%, 100%, 86.4%, 100%, and 92.7% respectively. The ‘Conciseness’, ‘Relevance’, ‘Precision’, and ‘F-Measure’ values of the proposed QA system can be improved. The IAN and EUGENE modules can be enriched for the Punjabi language so that their F-Measure increases. Since a public platform for developing language-independent applications has been developed and tested, other UNL-based NLP applications such as sentiment analysis, text summarization, and machine translation can be developed and integrated with it. The developed system can be extended to let users upload their own UNL corpus so that questions can be asked by a worldwide audience.
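The reported Precision, Recall, and F-Measure values are consistent with the standard definition of F-Measure as the harmonic mean of Precision and Recall. The short check below assumes that this standard (balanced, F1) definition is the one used in the thesis; the function name `f_measure` is chosen only for this example.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-Measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and Recall as reported in the abstract.
score = f_measure(0.864, 1.0)
print(f"{score:.1%}")  # prints 92.7%
```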
Description: Doctoral Thesis - CSED
URI: http://hdl.handle.net/10266/5885
Appears in Collections:Doctoral Theses@CSED

Files in This Item:
File: Vaibhav_Thesis_Signatures.pdf
Size: 6.1 MB
Format: Adobe PDF

