Translating Code-Switched Texts From Bilingual Speakers

Code-switched language is common in interactions between bilingual individuals, yet it has received little attention in neural machine translation (NMT). A recent machine translation paper concluded that it would be interesting to implicitly identify the language of foreign word segments and carry out the translation with an appropriate translation system. The goal of this project is to build a model suited to code-switching translation tasks based on that intuition. Specifically, given bilingual, code-switched input text, we want a model that outputs a translation of the text in one desired language.

We experiment with two approaches. The first is an LID-Translation Pipeline, a two-model approach that (1) uses a language identification (LID) model to determine which words in a bilingual text need to be translated and (2) translates those identified words with a standard translation model (a sketch of this pipeline appears below). This approach includes a translation model that we fine-tuned, and it was motivated by the fact that our original code-switched dataset lacked ground-truth translations. The second is a Direct-Translation Bilingual Model: during the course of the project, a new code-switched dataset was released, so we also trained a model directly on this newly released bilingual data. We tested this bilingual model both with and without the LID-Translation Pipeline.

Our results show two main findings. First, the LID-Translation Pipeline outperforms passing code-switched text directly through a standard translation model. Second, the Direct-Translation Bilingual Model outperforms regular translation models. These results suggest significant potential for machine translation optimized for code-switched data, particularly given the growing availability of bilingual corpora.
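To make the pipeline concrete, here is a minimal sketch of the two-step LID-Translation approach, not the project's actual implementation. It assumes Spanish-English code-switched input with English as the target language, and uses the off-the-shelf `langid` package for word-level language identification and the `Helsinki-NLP/opus-mt-es-en` MarianMT checkpoint as stand-ins for whichever LID and translation models are actually used.

```python
# Sketch of an LID-Translation pipeline: tag each word's language,
# then translate only the contiguous non-target-language spans.
# langid and the Helsinki-NLP/opus-mt-es-en checkpoint are assumptions,
# not the models used in the project.
import langid
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-es-en"  # hypothetical stand-in checkpoint
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)

def translate_segment(segment: str) -> str:
    """Step 2: translate one foreign-language span with the translation model."""
    batch = tokenizer([segment], return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

def translate_code_switched(text: str, target_lang: str = "en") -> str:
    """Step 1: identify each word's language; then translate the spans
    that are not already in the target language and stitch the result."""
    words = text.split()
    # langid.classify returns a (language, score) pair per input string.
    tags = [langid.classify(w)[0] for w in words]

    output, span = [], []
    for word, lang in zip(words, tags):
        if lang == target_lang:
            if span:  # flush any pending foreign-language span first
                output.append(translate_segment(" ".join(span)))
                span = []
            output.append(word)
        else:
            span.append(word)
    if span:
        output.append(translate_segment(" ".join(span)))
    return " ".join(output)

print(translate_code_switched("I want to comprar una casa this year"))
```

Note that word-level LID is noisy for short tokens, which is one reason a dedicated, context-aware LID model is preferable in the actual pipeline; the Direct-Translation Bilingual Model sidesteps this step entirely by learning from code-switched data end to end.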