WMT 2014 English–German (WMT14 En–De) is the English⇄German parallel corpus released for the shared translation task of the Ninth Workshop on Statistical Machine Translation (WMT 2014). It combines several parallel sources widely used in MT research, including Europarl, Common Crawl, and News Commentary, among others, and is distributed with standard training, validation, and test splits. For the English→German direction, the training set contains roughly 4.5 million sentence pairs, the size reported in many papers, including “Attention Is All You Need”. The usual validation (dev) set is newstest2013 and the usual test set is newstest2014. The Hugging Face dataset card (wmt/wmt14) provides per-language-pair configs (e.g., de-en), lists the splits and their sizes, and warns about problems in the Common Crawl portion (misaligned and non-English files). Typical preprocessing in the literature is tokenization followed by Byte-Pair Encoding (BPE) with a shared source-target vocabulary of about 37k tokens, as in the Transformer paper. Primary sources: the WMT14 workshop pages (statmt.org/wmt14) and the Hugging Face dataset card (https://huggingface.co/datasets/wmt/wmt14).
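As a rough illustration of how the de-en config might be consumed, the sketch below assumes the third-party Hugging Face `datasets` library and assumes each row stores its sentence pair under a `translation` field keyed by language code. The helper names `iter_pairs` and `load_wmt14_split` are hypothetical, introduced only for this example. The loader is defined but not called, since it needs network access and the training split is large; a tiny in-memory row stands in for real data.

```python
from typing import Dict, Iterable, Iterator, Tuple

Row = Dict[str, Dict[str, str]]


def iter_pairs(rows: Iterable[Row], src: str = "en", tgt: str = "de") -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from rows shaped like
    {"translation": {"de": ..., "en": ...}} (assumed row layout)."""
    for row in rows:
        pair = row["translation"]
        yield pair[src], pair[tgt]


def load_wmt14_split(split: str = "validation"):
    """Hypothetical helper: load one split of the de-en config.

    Requires the `datasets` package and network access, so the import
    is kept local and the function is not called in this sketch.
    """
    from datasets import load_dataset  # third-party dependency

    return load_dataset("wmt/wmt14", "de-en", split=split)


# In-memory stand-in for a dataset row, so the sketch runs offline.
sample_rows = [{"translation": {"de": "Guten Morgen.", "en": "Good morning."}}]
pairs = list(iter_pairs(sample_rows))
print(pairs)
```

With the library installed, `iter_pairs(load_wmt14_split("validation"))` would iterate over the validation split in the English→German direction. Swapping `src` and `tgt` gives the reverse direction.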