Please use this identifier to cite or link to this item: http://hdl.handle.net/10266/5526
Title: Quote Examiner: Verifying quoted images using web-based text similarity
Authors: Banerjee, Sneha
Supervisor: Kumar, Parteek
Keywords: OCR; Text recognition; Text similarity
Issue Date: 26-Jul-2019
Abstract: Digital data in visual formats, such as images from the web, mobile phones, digital cameras, and screenshots, has grown rapidly in recent times. Images containing quotes spread virally through online platforms and services such as Facebook and WhatsApp. Misquotations often spread like wildfire through social media, which highlights a lack of responsibility among web users who circulate poorly cited quotes. It is therefore important to authenticate the text contained in images circulated online, which first requires retrieving the text within such images. Optical Character Recognition (OCR) is a method for converting images of text into machine-readable text, and various OCR tools are available for converting visual data into editable text documents. This study presents a performance analysis of three OCR tools, Tesseract-OCR, Google Cloud Vision, and AWS Rekognition, on natural scene images. A post-processing technique was then applied to the extracted text: removing spelling errors from the text identified in the images improved the accuracy of the output by around 2% for natural scene images and by approximately 8% for text obtained from handwritten images. On natural scene images, Google Cloud Vision gives an overall F1-score of 88.32%, AWS Rekognition gives 68.1%, and Tesseract-OCR gives 54.58%. Google Cloud Vision thus outperforms the other two tools considered and has therefore been used for extracting text from quoted images. In this experiment, a web-based text similarity approach is used to examine the authenticity of the content of the quoted images. Google Custom Search Engine retrieves the URLs of pages containing similar text, and the resulting domain names are then verified against authentic quotation sites. Using the verification results, approximately 96.26% accuracy has been achieved in classifying quoted images as verified or misquoted.
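The final verification step described in the abstract, checking the domain names of search results against authentic quotation sites, might look roughly like the sketch below. This is an illustration only, not the thesis's implementation: the allowlist entries, the function names `domain_of` and `classify_quote`, and the `min_hits` threshold are all hypothetical, and the search results are assumed to be already available as a list of URLs.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: the abstract does not name the quotation sites used.
TRUSTED_QUOTE_DOMAINS = {"example-quotes.org", "quotes.example.com"}


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify_quote(result_urls, trusted=TRUSTED_QUOTE_DOMAINS, min_hits=1):
    """Label a quote 'verified' if enough result URLs come from trusted domains.

    min_hits is an assumed threshold, not a value taken from the thesis.
    """
    hits = sum(1 for url in result_urls if domain_of(url) in trusted)
    return "verified" if hits >= min_hits else "misquoted"
```

Here `result_urls` stands in for whatever URLs a search over the OCR-extracted text returns; a quote with no result on a trusted domain is labelled misquoted.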
URI: http://hdl.handle.net/10266/5526
Appears in Collections: Masters Theses@CSED

Files in This Item:
File: 801732050_ME_Sneha.pdf
Size: 4.34 MB
Format: Adobe PDF

