Conversational and Image Recognition Chatbot

This project proposes a chatbot framework that combines natural language processing with image recognition. The framework uses a neural encoder-decoder model with a Late Fusion encoder and two different decoders (generative and discriminative). We use an encoder-decoder CNN architecture for image fusion and the ResNet [15] architecture for object detection and localization. Objects and persons are localized with the Mask R-CNN model, which not only localizes each object but also provides a mask for it. We train on the COCO dataset; images are fused to produce a combined, more informative output for detecting objects whose presence is uncertain. Training the complete encoder-decoder network with self-attention stabilized training considerably, further decreased the loss, and improved performance on the evaluation metric on the validation set. The chatbot can detect objects in an image, describe and recognize the image, and then answer questions about it. The basic workflow is as follows: given an image (I), the current question (Q), and a history of questions and answers (H), the agent should generate an answer to the current question. The purpose of this project is to use natural language processing and computer vision models to efficiently identify image content and answer questions and follow-up questions about any image. This could be applied in organizations, schools, hospitals, and military settings. We believe this area can have a substantial impact on natural language processing and visual recognition in both industry and academia.
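
The workflow described above (image I, question Q, and history H in, answer out) can be illustrated with a minimal sketch. This is not the project's implementation: it assumes that each input has already been encoded into a fixed-length feature vector (for example by a CNN for the image and a recurrent encoder for the text), and that late fusion means concatenating those vectors and passing them through one projection layer. The function names, vector sizes, and random weights are all hypothetical, and the dot-product ranking stands in for a discriminative decoder in general rather than the specific one used here.

```python
import math
import random


def linear(x, weights, bias):
    """Apply y = W x + b, where `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def late_fusion_encode(image_feat, question_feat, history_feat, weights, bias):
    """Late fusion (assumed form): concatenate the separately encoded
    inputs, then project them into one joint context vector."""
    fused = image_feat + question_feat + history_feat  # list concatenation
    return [math.tanh(v) for v in linear(fused, weights, bias)]


def rank_candidates(context, candidate_feats):
    """Discriminative decoding (assumed form): score each candidate answer
    by its dot product with the context; return indices, best first."""
    scores = [sum(c * a for c, a in zip(context, cand))
              for cand in candidate_feats]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data: random vectors stand in for real image/text encoder outputs.
rng = random.Random(0)
image_feat = [rng.uniform(-1, 1) for _ in range(4)]     # I
question_feat = [rng.uniform(-1, 1) for _ in range(3)]  # Q
history_feat = [rng.uniform(-1, 1) for _ in range(3)]   # H

hidden = 5  # hypothetical size of the fused context vector
weights = [[rng.uniform(-0.5, 0.5) for _ in range(10)] for _ in range(hidden)]
bias = [0.0] * hidden

context = late_fusion_encode(image_feat, question_feat, history_feat,
                             weights, bias)

# Three hypothetical candidate-answer embeddings: one aligned with the
# context, one opposed to it, and one all-zero.
candidates = [context[:], [-v for v in context], [0.0] * hidden]
ranking = rank_candidates(context, candidates)
print(ranking)
```

In this toy setup the candidate aligned with the context ranks first and the opposed one last. A generative decoder would instead produce the answer token by token from the same context vector.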