The Penn Treebank (PTB) WSJ portion is a widely used annotated corpus of Wall Street Journal newswire text, roughly 1 million words. It was described in Marcus et al., 1993 ("Building a Large Annotated Corpus of English: The Penn Treebank", Computational Linguistics), and its distribution is controlled by the Linguistic Data Consortium (LDC) through the Treebank releases (e.g. Treebank-3 / LDC99T42). The WSJ portion is annotated with part-of-speech (POS) tags and syntactic constituency trees and is commonly used for parsing, POS tagging, and language modeling research. For parsing, the conventional split uses sections 02–21 for training, 22 for development, and 23 for test. Section 23 also falls within the test portion of the common language-modeling split (the preprocessed text from Mikolov et al. trains on sections 0–20, validates on 21–22, and tests on 23–24), so results from the two settings are not evaluated on identical data. Hugging Face hosts a text-only PTB dataset (ptb-text-only/ptb_text_only) that provides the PTB text splits; the dataset notes that the source is the Penn Treebank Project WSJ material and that licensing is via the LDC.
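The parsing split described above (sections 02–21 for training, 22 for development, 23 for test) can be sketched as a small lookup. This is only an illustration: the names `PARSING_SPLIT`, `section_of`, and `split_for_section` are hypothetical, and the code assumes file IDs of the form `wsj_SSNN`, where the first two digits are the section number.

```python
from typing import Optional

# Conventional WSJ parsing split, as stated in the description above.
PARSING_SPLIT = {
    "train": range(2, 22),  # sections 02-21
    "dev": range(22, 23),   # section 22
    "test": range(23, 24),  # section 23
}


def section_of(file_id: str) -> int:
    """Extract the section number from a file ID such as 'wsj_2300'.

    Assumption: the first two digits after the underscore are the section.
    """
    return int(file_id.split("_")[1][:2])


def split_for_section(section: int) -> Optional[str]:
    """Return 'train', 'dev', or 'test', or None for sections outside the split."""
    for name, sections in PARSING_SPLIT.items():
        if section in sections:
            return name
    return None


if __name__ == "__main__":
    for file_id in ("wsj_0201", "wsj_2245", "wsj_2300", "wsj_0001"):
        print(file_id, split_for_section(section_of(file_id)))
```

Sections 00, 01, and 24 return `None` here because the parsing split leaves them unused; a POS-tagging or language-modeling setup would need its own mapping.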
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.