Machine learning, Natural language processing
My colleague Aditya Yedetore and I developed a model for error correction of English text, along the lines of Grammarly or autocorrect. The model corrects both spelling errors and sentence-level grammatical mistakes. For instance, given “the dog eated the bone”, the model produces “the dog ate the bone” (the incorrectly spelled word “eated” is corrected), and given “the dog ate bone” the model produces “the dog ate the bone” (the grammatically incorrect dropped determiner “the” is inserted).
A relevant subproblem was figuring out how to handle vocabulary never seen in the training data. For instance, though the model may not have seen the word “abjad” in training, when given “an abjad is a wrtiing system” it should produce “an abjad is a writing system”, fixing the incorrectly spelled “wrtiing” while leaving the correct but unseen “abjad” alone. The model must therefore differentiate between correct words it has not seen and incorrectly spelled words.
We modified the model of Zheng Yuan et al. (2016). This work frames grammatical error correction as a machine translation task, where the source language consists of inputs that may contain errors and the target language consists of the corrected versions of those inputs. Zheng Yuan et al. employ the RNNsearch model of Bahdanau et al. (2014), which pairs a bidirectional RNN encoder with an attention-based decoder, to perform this translation. Out-of-vocabulary (OOV) tokens in the input are mapped to the <UNK> token in the output. For instance, when given “an abjad is a wrtiing system”, the translation model will produce “an <UNK> is a <UNK> system”.
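As a toy illustration (not the actual model), OOV masking amounts to replacing any token outside the training vocabulary with an <UNK> placeholder before translation. The vocabulary below is made up for the example.

```python
# Toy vocabulary for illustration only; a real system would use the
# training corpus vocabulary (typically tens of thousands of words).
VOCAB = {"an", "a", "is", "the", "dog", "ate", "bone", "system", "writing"}

def mask_oov(tokens, vocab=VOCAB):
    """Replace out-of-vocabulary tokens with the <UNK> placeholder."""
    return [t if t in vocab else "<UNK>" for t in tokens]

print(mask_oov("an abjad is a wrtiing system".split()))
# ['an', '<UNK>', 'is', 'a', '<UNK>', 'system']
```

Note that both the legitimately unseen “abjad” and the misspelled “wrtiing” are masked the same way, which is exactly why a post-processing step is needed to tell them apart.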
To deal with correcting the spelling of (or copying) those OOV tokens to fill in the <UNK> gaps, Zheng Yuan et al. take a two-step approach:
1. Aligning the unknown words (i.e. <UNK> tokens) in the target sentence to their origins in the source sentence with an unsupervised aligner.
For the aligner, Zheng Yuan et al. use METEOR (Banerjee and Lavie, 2005), which not only identifies words with exact matches, but also words with identical stems, synonyms, and unigram paraphrases.
2. Building a word-level translation model (a spelling correction model) to translate the source words responsible for the target unknown words in a post-processing step. For this step, Zheng Yuan et al. (2016) create word-aligned data and use IBM models.
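The two steps above can be sketched as follows. Both components here are trivial stand-ins, not the METEOR aligner or IBM models from the paper: the alignment is supplied directly as a dict, and the “word-level translation model” is a toy lookup table.

```python
# Toy stand-in for the word-level translation (spelling correction) model.
CORRECTIONS = {"wrtiing": "writing", "eated": "ate"}

def fill_unks(source_tokens, output_tokens, alignment):
    """Fill <UNK> gaps in the model output.

    alignment maps each output position holding <UNK> to the source
    position it originated from (step 1). Each aligned source word is
    corrected if the word-level model knows it, else copied verbatim
    (step 2).
    """
    filled = list(output_tokens)
    for out_pos, src_pos in alignment.items():
        src_word = source_tokens[src_pos]
        filled[out_pos] = CORRECTIONS.get(src_word, src_word)
    return filled

src = "an abjad is a wrtiing system".split()
out = ["an", "<UNK>", "is", "a", "<UNK>", "system"]
print(fill_unks(src, out, {1: 1, 4: 4}))
# ['an', 'abjad', 'is', 'a', 'writing', 'system']
```

The unseen-but-correct “abjad” is copied through unchanged, while the misspelled “wrtiing” is rewritten, which is the behavior the subproblem calls for.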
In our work, we explored the effect of making this model fully neural. We experimented with deriving alignments from the attention mechanism inherent in the RNNsearch model, and we also explored replacing the IBM word-level spelling correction model with another RNNsearch model that maps characters to characters.
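A minimal sketch of deriving alignments from attention: for each output position that holds an <UNK>, take the source position receiving the highest attention weight. The attention matrix below is fabricated for illustration; in the real model it would come from the decoder's attention layer.

```python
import numpy as np

def align_from_attention(attn, output_tokens):
    """attn[i, j] = attention weight of output position i over source j.

    Returns a dict mapping each <UNK> output position to its argmax
    source position.
    """
    return {i: int(np.argmax(attn[i]))
            for i, tok in enumerate(output_tokens) if tok == "<UNK>"}

out = ["an", "<UNK>", "is", "a", "<UNK>", "system"]
# Fabricated near-diagonal attention: each output attends mostly to the
# source token at the same position.
attn = np.eye(6) * 0.9 + 0.1 / 6
print(align_from_attention(attn, out))  # {1: 1, 4: 4}
```

Taking the argmax of each attention row is the simplest way to read a hard alignment out of soft attention; it replaces the external METEOR aligner with a signal the model already computes.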
To train our model on grammatical error correction, we used the publicly available FCE dataset, which contains 1,244 scripts produced by learners taking the First Certificate in English (FCE) examination between 2000 and 2001. The texts have been manually annotated by linguists using a taxonomy of approximately 80 error types. We used those marked-up sentences to create pairs of input and output sentences for our sequence-to-sequence task. See below for an example pair in our dataset.
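Turning annotations into pairs can be sketched as below. This is a simplification: here each edit is a (start, end, replacement) span over the token list, whereas the actual FCE markup is an XML error annotation scheme with typed error spans.

```python
def make_pair(tokens, edits):
    """Build an (erroneous, corrected) sentence pair from span edits.

    edits: list of (start, end, replacement_tokens) over `tokens`,
    where tokens[start:end] is replaced by replacement_tokens.
    Assumes the spans do not overlap.
    """
    corrected, prev = [], 0
    for start, end, repl in sorted(edits):
        corrected += tokens[prev:start] + repl
        prev = end
    corrected += tokens[prev:]
    return " ".join(tokens), " ".join(corrected)

src = "the dog eated the bone".split()
print(make_pair(src, [(2, 3, ["ate"])]))
# ('the dog eated the bone', 'the dog ate the bone')
```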
We ended up with a training set of 28,350 pairs, a dev set of 2,191 pairs, and a test set of 2,695 pairs. This is about 70 times smaller than the dataset used in Zheng Yuan et al. (2016), which consisted of 1,965,727 pairs. While they augmented the FCE dataset with the Clang-8 dataset, we decided not to, given the low quality of the Clang-8 data.