Seeking Higher Truths and Higher Accuracies with Multilingual GAN-BERT
Buddhist scriptures are often intentionally written to mirror the style of prior scriptures and to quote earlier texts verbatim. Moreover, the Buddhist canon is not uniform; it is split across many languages and schools. We therefore set out to build a model that accepts text from various languages and predicts both the overall branch of Buddhism from which the text originates and the specific school of origin, formulated as two separate multi-class classification problems. In an effort to incorporate and improve upon state-of-the-art approaches to low-resource NLP tasks, we re-implemented and refined the GAN-BERT architecture to investigate methods for enhancing BERT fine-tuning. We also evaluated the performance of standalone BERT, mBERT, and LSTM models. We report that the LSTM model without pretrained embeddings obtains the highest accuracy on the 17-class classification task.
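
As a rough illustration of the GAN-BERT setup referenced above (a minimal sketch, not the authors' implementation), the discriminator operates on BERT's [CLS] representation and predicts K real classes plus one extra "fake" class, while a generator produces fake [CLS]-like vectors from noise. The model name, layer sizes, and noise dimension below are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_CLASSES = 17   # e.g. the 17-class school-of-origin task
HIDDEN = 768       # BERT-base hidden size
NOISE_DIM = 100    # generator noise size (assumption)

class Generator(nn.Module):
    """Maps random noise to fake sentence-level representations."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM, HIDDEN), nn.LeakyReLU(0.2),
            nn.Linear(HIDDEN, HIDDEN),
        )
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Classifies a representation into K real classes + 1 'fake' class."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(HIDDEN, HIDDEN), nn.LeakyReLU(0.2), nn.Dropout(0.1),
            nn.Linear(HIDDEN, NUM_CLASSES + 1),  # extra logit for "fake"
        )
    def forward(self, reps):
        return self.net(reps)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")
G, D = Generator(), Discriminator()

texts = ["evam me sutam ..."]  # placeholder input text
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
cls_reps = bert(**batch).last_hidden_state[:, 0]          # [CLS] vectors for real text
real_logits = D(cls_reps)                                  # K+1 logits for real examples
fake_logits = D(G(torch.randn(len(texts), NOISE_DIM)))     # K+1 logits for generated fakes
```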