Please use this identifier to cite or link to this item: http://hdl.handle.net/10266/1729
Title: UNL Based Machine Translation System for Punjabi Language
Authors: Kumar, Parteek
Supervisor: Sharma, R. K.
Keywords: Machine Translation; UNL; EnConverter; DeConverter
Issue Date: 27-Jun-2012
Abstract: Machine Translation (MT) has been an area of immense interest among researchers during the last couple of decades. The field has seen highs and lows over its life span and has drawn on research from many disciplines, including linguistics, computer science, artificial intelligence, statistics, mathematics and philosophy. Researchers have proposed different paradigms for machine translation across natural languages with reasonable success. Universal Networking Language (UNL) based MT is one such effort. The UNL programme was launched in 1996 at the Institute of Advanced Studies (IAS) of the United Nations University (UNU), Tokyo, Japan. The UNL approach revolves around the development of an EnConverter and a DeConverter for a natural language. The EnConverter converts a given natural language sentence to an equivalent UNL expression, and the DeConverter converts a given UNL expression to an equivalent natural language sentence. In the work carried out in this PhD project, these two components have been developed for the Punjabi language. The thesis is divided into seven chapters: Introduction; Review of Literature; UNL Framework and Creation of Punjabi-Universal Word Lexicon; Punjabi-UNL EnConverter; UNL-Punjabi DeConverter; Results and Discussions; and Conclusion and Future Scope of the work. The first chapter introduces Machine Translation and its need in the age of Information Technology. It presents the challenges of MT, the approaches to MT, the objectives of this research, the methodology adopted to achieve these objectives and the features of the Punjabi language. It also discusses the four categories of MT approaches, namely, direct MT, rule-based MT, corpus-based MT and knowledge-based MT. 
The main objective of this study is to design and develop a multi-lingual machine translation system for the Punjabi language. A major outcome of this research is the development of a Punjabi EnConverter, a Punjabi DeConverter and a web interface for online EnConversion and DeConversion. The second chapter reviews the research findings in the area of MT by tracking the historical developments in the field. The literature is organised to cover the state of MT before the invention of computers, the beginning of automated MT (1946-1954), the decade of high expectations and disillusion (1955-1966), the post-ALPAC decade (1967-1976), the revival of MT research (1977-1989), the decade from 1990 to 2000, research since 2000, MT for Indian languages and research activities in UNL. Computerized translation was first performed by the Georgetown Automatic Translation (GAT) system at Georgetown University, USA. Important developments during 1955-1966 included RAND Corporation’s statistical analysis of a large corpus of Russian physics texts; ‘CETA’, a hybrid system; Mechanical Translation and Analysis of Languages (METAL), which used Chomsky's transformational paradigm; and the ‘Logos’ machine translation system. During the post-ALPAC period (1967-1976), the ‘Ariane’ system was developed and the ‘TAUM’ project was undertaken, using syntactic transfer for English-French translation. The revival of MT research took place during 1977-1989 with the ‘Ariane’ system, which was developed on a linguistics-based transfer approach. 
Other important MT projects discussed in this chapter include the ‘ATLAS’ system; the ‘UNITRAN’ system; IBM’s ‘Candide’ system; the Universal Networking Language (UNL) based MT system; the ‘KANTOO’ system; ‘ALEPH’, a pure example-based machine translation system; ‘SisHiTra’, a hybrid machine translation system; Google Translation; and the ‘OpenLogos’ system. The chapter also reviews the literature on MT for Indian languages. The important systems covered are ‘AnglaBharti’; ‘AnuBharati’; ‘Anusaaraka’; the ‘MANTRA’ (MAchiNe assisted TRAnslation tool) system; the English-Bangla-ANUBAD system; a translation system for bi-lingual Hindi-English (Hinglish) text; the ‘Shakti’ system; the ‘MaTra’ system; a Punjabi to Hindi MT system; an English to Urdu translation system; and ‘Sampark’, a hybrid system for translation among Indian languages. The research activities in UNL are presented in three distinct sections: development of EnConversion and DeConversion modules; applications of UNL in other contexts; and use of external lexical and ontological resources to enhance some of the processes of UNL. The chapter also reviews the work on conversion of Brazilian Portuguese into UNL and vice versa, EnConversion and DeConversion tools for the Tamil language, the ‘HERMETO’ system, the French EnConverter and French DeConverter, a UNL DeConverter for the Chinese language, the ‘Manati’ DeConversion model, the UNL-Nepali DeConverter, the UNL-Hindi DeConverter, an Arabic MT system based on UNL and a Bangla EnConversion system. The third chapter discusses the UNL format for information representation, including Universal Words (UWs) and their four types (Basic UWs, Restricted UWs, Extra UWs and Temporary UWs), UNL relations, UNL attributes and the formats for writing a UNL sentence. 
Compound UWs in UNL denote compound concepts that are to be interpreted as a whole, while still allowing their parts to be used at the same time. This chapter also describes the UNL system, which consists of EnConverter, DeConverter, Dictionary Builder, Grammar Rules, UNL Key Concept in Context (KCIC), UW Gate Universal Parser, UNL Verifier, Language Server and Word Dictionary (Language-UW Dictionary). The UNL system uses word dictionaries, in the form of a Language-UW lexicon for each language, for its processing. An entry of the word dictionary contains three parts, namely, a headword, a UW and a set of morphological, syntactic and semantic attributes. This chapter discusses the grammatical attributes of the Punjabi-UW lexicon, important issues in the creation of a Language-UW dictionary and the creation of the Punjabi-UW dictionary. A Punjabi-UW dictionary with 1,15,000 entries has been developed in this work, taking the Hindi-UW dictionary as a reference. The fourth chapter describes the EnConversion system that converts input Punjabi sentences into UNL. It also discusses the framework for designing the EnConverter for the Punjabi language, with a special focus on the generation of UNL attributes and relations from Punjabi source text. The architecture of the Punjabi EnConverter is divided into seven phases: (i) Parser phase (to parse the input sentence with a Punjabi shallow parser), (ii) Linked list creation phase, (iii) Universal Word lookup phase, (iv) Case marker lookup phase, (v) Unknown word handling phase, (vi) User interaction phase (optional) and (vii) UNL generation phase. These phases have been implemented in Java and are illustrated with example sentences. The fifth chapter discusses the UNL to Punjabi DeConverter, which generates a natural language Punjabi sentence from a given input UNL expression. 
The architecture of the Punjabi DeConverter is divided into five phases, namely, (i) UNL parser phase, (ii) Lexeme selection phase, (iii) Morphology generation phase, (iv) Function word insertion phase and (v) Syntax planning phase. The first stage of the DeConverter is the UNL parser, which parses the input UNL expression to build a node-net. During the lexeme selection phase, Punjabi root words and their dictionary attributes are selected from the Punjabi-UW dictionary for the UWs in the input UNL expression. The nodes are then ready for the morphology generation phase, in which morphology is generated according to the target language. The proposed system uses a morphology rule base for the Punjabi language to handle attribute label resolution morphology; relation label resolution morphology; and noun, adjective, pronoun and verb morphology. In the function word insertion phase, function words are attached to the morphed words in the generated sentence on the basis of a nine-column rule base. Finally, the syntax planning phase defines the word order of the generated sentence so that the output reads as a natural language sentence. This chapter presents the pseudocodes and algorithms for building data structures in the UNL parser; processing the noun morphology and adjective morphology rule bases; processing the function word insertion rule base; controlling the syntax planning of nodes of a simple UNL graph; syntax planning of a UNL graph with a scope node; handling untraversed parent nodes and nodes with multiple parent nodes during syntax planning; handling some special cases of syntax planning; and syntax planning of noun, adjective and adverb clause sentences. All these pseudocodes and algorithms have been implemented in Java to develop the proposed UNL-Punjabi DeConverter. The sixth chapter contains the results and discussion of the work done in this project. 
The proposed system, consisting of the Punjabi EnConverter and DeConverter, has been evaluated with one thousand Punjabi sentences. These sentences were selected so that the generation of all possible UNL relations and attributes could be tested. For testing, the Spanish UNL Language Server and the agricultural domain threads developed by IIT Bombay, India are taken as gold standards. The Spanish Language Server contains English sentences with their corresponding UNL expressions generated by the system (Spanish Language Center, 2004), while the agricultural domain threads developed by IIT Bombay contain Hindi sentences with their equivalent UNL expressions. These sentences were translated manually into equivalent Punjabi sentences and then given as input to the proposed Punjabi EnConverter for EnConversion to UNL. The UNL expression generated by the proposed system is compared with the UNL expression given by the gold-standard EnConverter. The two UNL expressions match if the UNL relations, including the associated UWs and UNL attributes, present in the expressions are the same. The proposed system has been found to resolve UNL relations and generate attributes for these sentences with reasonable accuracy. The proposed Punjabi DeConverter is evaluated by giving it the UNL expressions generated by the Punjabi EnConverter and comparing its output with the Punjabi sentence originally given to the EnConverter. The system is evaluated on these sentence pairs with subjective tests, namely adequacy and fluency tests, and with the BLEU score, which measures the quality of the output. 
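The matching criterion stated above (same relations, same UWs, same attributes) can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the `Relation` type, the helper names and the example UWs are hypothetical, and the sketch assumes each UNL expression has already been parsed into its binary relations.

```python
from typing import FrozenSet, Iterable, NamedTuple, Set


class Relation(NamedTuple):
    """One binary UNL relation: label(uw1.attrs1, uw2.attrs2)."""
    label: str
    uw1: str
    attrs1: FrozenSet[str]
    uw2: str
    attrs2: FrozenSet[str]


def rel(label: str, uw1: str, attrs1: Iterable[str],
        uw2: str, attrs2: Iterable[str]) -> Relation:
    # Attributes are stored as frozensets so their written order is ignored.
    return Relation(label, uw1, frozenset(attrs1), uw2, frozenset(attrs2))


def expressions_match(generated: Set[Relation], gold: Set[Relation]) -> bool:
    # Two expressions match when they contain exactly the same relations,
    # each with the same UWs and the same attributes on both arguments.
    return generated == gold


# Hypothetical UWs and attributes, for illustration only.
gold = {
    rel("agt", "eat(icl>do)", ["@entry", "@past"], "boy(icl>person)", ["@def"]),
    rel("obj", "eat(icl>do)", ["@entry", "@past"], "apple(icl>fruit)", []),
}
same = {
    rel("obj", "eat(icl>do)", ["@past", "@entry"], "apple(icl>fruit)", []),
    rel("agt", "eat(icl>do)", ["@past", "@entry"], "boy(icl>person)", ["@def"]),
}
missing_attribute = {
    rel("agt", "eat(icl>do)", ["@entry"], "boy(icl>person)", ["@def"]),
    rel("obj", "eat(icl>do)", ["@entry"], "apple(icl>fruit)", []),
}
```

Using sets makes the comparison independent of the order in which relations and attributes are written, which is an assumption of this sketch rather than a detail confirmed by the abstract.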
A quantitative test, namely, error analysis, has also been performed by calculating the Sentence Error Rate (SER) and Word Error Rate (WER). From this analysis, it has been concluded that 89.0% of the sentences generated by the proposed system are intelligible and 92.0% are faithful to the original sentences. The system achieves a fluency score of 3.61 and an adequacy score of 3.70, both on a 4-point scale, and a BLEU score of 0.72. It has a word error rate of 5.43% and a sentence error rate of 20.8%. These scores can be improved further by improving the rule base and lexicon. Chapter seven presents the conclusion, limitations and future scope of the work, with a view to refining the proposed EnConverter and DeConverter for the Punjabi language.
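For readers unfamiliar with the two error rates, the following is a minimal sketch of how WER and SER are commonly computed. The abstract does not give the exact formulas used in the thesis, so the definitions here are assumptions: WER as word-level edit distance divided by the number of reference words, pooled over the test set, and SER as the fraction of sentences that differ from their reference. The function names and sample sentences are illustrative only.

```python
from typing import List, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rates(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (WER, SER) for (reference, hypothesis) sentence pairs.

    Assumes at least one pair and at least one reference word overall.
    """
    total_edits = 0
    total_words = 0
    wrong_sentences = 0
    for reference, hypothesis in pairs:
        ref, hyp = reference.split(), hypothesis.split()
        edits = word_edit_distance(ref, hyp)
        total_edits += edits
        total_words += len(ref)
        if edits > 0:
            wrong_sentences += 1
    return total_edits / total_words, wrong_sentences / len(pairs)


# Placeholder tokens stand in for Punjabi words.
sample_pairs = [
    ("a b c d", "a b c d"),  # exact match
    ("a b c d", "a x c d"),  # one substituted word
]
wer, ser = error_rates(sample_pairs)
```

In this toy sample one of eight reference words is wrong and one of two sentences differs, so a single word error counts against the whole sentence under SER, which is why SER is typically much higher than WER on the same test set.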
URI: http://hdl.handle.net/10266/1729
Appears in Collections: Doctoral Theses@CSED

Files in This Item:
File: 1729.pdf
Size: 2.51 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.