Template-free organic retrosynthesis with syntax-directed molecular transformer networks
Retrosynthesis, the process of identifying precursors that can be used to synthesize a product, is one of the fundamental problems in organic chemistry. The advent of generative deep learning models has rapidly improved template-free retrosynthesis planning, in which a retrosynthetic step is modeled as a sequence-to-sequence task between the string representations (SMILES) of the molecules involved in the reaction. However, many existing methods either strip important stereochemical information from reaction datasets or output SMILES strings that are frequently not syntactically valid. We address both issues by developing a syntax-directed molecular transformer (SDMT), trained on template- and rule-free reaction data without removing stereochemical designations. SDMT makes a lightweight modification to the standard transformer architecture, using the syntactic dependency tree of the input SMILES string to restrict self-attention. SDMT is competitive in accuracy with current state-of-the-art text-based and graph-based retrosynthesis models while achieving a lower invalid-SMILES rate. We show that SDMT more consistently outputs syntactically and semantically valid SMILES strings across all top predicted results, and that it offers an effective way to integrate the syntactic structure of SMILES strings directly into transformer models for reaction prediction.
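To make the masking idea concrete, below is a minimal, hypothetical sketch of how SMILES syntax could restrict self-attention. It is not the SDMT implementation: for illustration it derives a crude "syntactic" mask from SMILES branch (parenthesis) nesting depth rather than a full dependency tree, and applies it inside a plain softmax attention row. All function names here (`branch_mask`, `masked_softmax`) are illustrative inventions.

```python
import math

def branch_mask(smiles):
    """Boolean mask: token i may attend to token j only if they sit at the
    same branch nesting depth in the SMILES string.
    (A deliberate simplification of dependency-tree-based masking.)"""
    depth, depths = 0, []
    for ch in smiles:
        if ch == '(':
            depth += 1       # opening paren belongs to the inner branch
        depths.append(depth)
        if ch == ')':
            depth -= 1       # closing paren ends the inner branch
    n = len(smiles)
    return [[depths[i] == depths[j] for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores, zeroing disallowed pairs."""
    out = []
    for row, mrow in zip(scores, mask):
        exps = [math.exp(s) if m else 0.0 for s, m in zip(row, mrow)]
        z = sum(exps) or 1.0
        out.append([e / z for e in exps])
    return out
```

For the isopropanol SMILES `CC(C)O`, the two backbone carbons and the oxygen share depth 0, so they can attend to one another, while tokens inside the `(C)` branch are masked off from them; in SDMT this depth heuristic would be replaced by edges of the actual syntactic dependency tree.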