eSCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing

A. English--German and English--Italian corpus

The eSCAPE corpus is a large-scale Synthetic Corpus for Automatic Post-Editing consisting of millions of (source, MT, post-edit) triplets created via machine translation. eSCAPE is designed to support the recent trend in automatic post-editing (APE) towards data-demanding neural approaches. To meet this demand, the current version of the corpus contains millions of triplets for two language pairs: English--German (14.4 million) and English--Italian (6.6 million). For both language pairs, half of the artificial data was obtained via phrase-based translation, while the other half was produced by neural MT models. Starting from freely available parallel corpora, the (source, MT, post-edit) triplets of eSCAPE were created by automatically translating the source side of the parallel sentences and using the corresponding target side as an approximation of a human post-edit. Having the same source sentences translated with both paradigms enables comparisons of APE technology applied to the two types of output, while the size of the corpus (the largest of its kind) supports the training of APE models with an unprecedented amount of data. Preliminary experiments reported by Negri et al. (2018) confirm the usefulness of the eSCAPE corpus: though trained on artificially-created instances, APE models significantly outperform baseline results in both language directions, independently of the MT technology underlying the data generation process.
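The triplet-construction idea can be sketched as follows. This is a minimal illustration, not the actual eSCAPE pipeline: `translate` is a hypothetical stand-in for the phrase-based or neural MT systems used to generate the corpus.

```python
def translate(source_sentence):
    # Placeholder MT system; in eSCAPE this would be a PBSMT or NMT model.
    # Dummy behaviour (lowercasing) for illustration only.
    return source_sentence.lower()

def build_triplets(parallel_corpus):
    """Turn (source, target) pairs into (source, MT, post-edit) triplets,
    using the human target side as an approximation of a post-edit."""
    triplets = []
    for source, target in parallel_corpus:
        mt_output = translate(source)  # automatic translation of the source
        triplets.append((source, mt_output, target))
    return triplets

corpus = [("The cat sat.", "Die Katze sass.")]
print(build_triplets(corpus))
```

The key approximation is the last element: instead of a real human post-edit of the MT output, the independently produced reference translation stands in for it.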

Download the English-{German,Italian} corpus

B. extension: English--Russian corpus (February 2019)

This dataset extends the eSCAPE corpus [Negri et al., 2018] to a new language direction: English--Russian. It contains ~7.7 million triplets (source, MT output, reference). The source and reference sentences were collected from the data released for the 2018 WMT news translation task, together with a relatively small set collected from OPUS, selecting only the corpora belonging to the Information Technology domain. The MT output was generated by translating the source sentences of the parallel corpus with a neural MT system based on the Transformer (base) architecture from the paper "Attention Is All You Need". The MT system was itself built on the parallel corpus using a 4-fold jack-knifing strategy: in each fold, 75% of the parallel data was used to train a model and the remaining 25% was held out to be translated. A randomly selected subset of the training data was used as a validation set to evaluate model checkpoints. Before training the system, the dataset was preprocessed using the following steps:
  1. cleaning: all pairs whose source and reference sentences (i) differ greatly in length or (ii) do not belong to the expected languages were removed;
  2. the data was then tokenized using the Moses tokenizer;
  3. the tokenized data was then segmented into sub-words by learning 40K BPE codes separately for source and target.
After translation, the BPE segmentation was removed, so the final released data is tokenized but not BPE-segmented.
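The 4-fold jack-knifing strategy described above can be sketched as follows: each fold holds out a quarter of the corpus, so a model trained on the other 75% can translate sentences it never saw in training, and the four held-out portions together cover the whole corpus. The function below is an illustrative sketch, not the actual eSCAPE tooling.

```python
def jackknife_folds(pairs, k=4):
    """Yield (train, heldout) splits: each fold holds out 1/k of the data
    for translation and uses the rest for training the MT model."""
    fold_size = len(pairs) // k
    for i in range(k):
        heldout = pairs[i * fold_size : (i + 1) * fold_size]
        train = pairs[: i * fold_size] + pairs[(i + 1) * fold_size :]
        yield train, heldout

# Stand-in for a list of (source, reference) pairs.
pairs = list(range(8))
for train, heldout in jackknife_folds(pairs):
    print(len(train), len(heldout))  # 75% train, 25% held out per fold
```

With k=4 every sentence appears in exactly one held-out portion, which is what lets the entire parallel corpus receive MT output without train/test overlap.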
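The length-based cleaning in step 1 can be illustrated with a simple ratio filter. The threshold below is a hypothetical value for illustration; the source does not specify the exact criterion used.

```python
def length_ratio_ok(src, tgt, max_ratio=3.0):
    """Keep only pairs whose token-length ratio is within a threshold
    (a common heuristic for 'very different in length'). The 3.0 default
    is an assumed value, not the one used for eSCAPE."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False  # drop empty sides
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio

pairs = [("a b c", "x y"), ("a", "x y z w v u t s")]
print([length_ratio_ok(s, t) for s, t in pairs])  # → [True, False]
```

A language-identification check (step 1, condition ii) would typically be applied alongside this filter, but is omitted here for brevity.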

Download the English-Russian corpus

Reference: Matteo Negri, Marco Turchi, Rajen Chatterjee and Nicola Bertoldi (2018). eSCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing. In Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC 2018), Miyazaki, Japan, 7-12 May 2018.

Acknowledgments: this resource was created within the QT21 project (grant 645452), which received funding under the European Union's H2020 programme.