Language Modeling · benchmark dataset · EN

Penn Treebank (Wall Street Journal, Section 23).

The Wall Street Journal (WSJ) portion of the Penn Treebank (PTB) is a widely used annotated corpus of newswire text, roughly 1 million words. It was originally described in Marcus et al., 1993 ("Building a Large Annotated Corpus of English: The Penn Treebank", Computational Linguistics) and is distributed by the LDC in the Treebank releases (e.g. Treebank-3 / LDC99T42). The WSJ portion is annotated with part-of-speech (POS) tags and syntactic constituency trees, and is commonly used for parsing, POS tagging and language modeling research. Section 23 is the standard test set in many parsing and language-modeling evaluations; parsing setups, for example, often train on sections 02–21, develop on section 22 and test on section 23. Hugging Face hosts a text-only PTB dataset (ptb-text-only/ptb_text_only) that provides the PTB text splits; the HF dataset page notes that the source is the Penn Treebank Project / WSJ material and that licensing is via the LDC.
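The section-based parsing split mentioned above can be sketched as a small lookup. This is a minimal illustration, not an official loader: the helper names `section_of` and `split_of` are hypothetical, and the `wsj_SSNN` file naming (two-digit section followed by a two-digit file index) is an assumption about how the corpus files are laid out in your copy.

```python
import re
from typing import Optional

# Conventional parsing split described above (other splits exist).
SPLIT_SECTIONS = {
    "train": range(2, 22),  # sections 02-21
    "dev": range(22, 23),   # section 22
    "test": range(23, 24),  # section 23
}

# Assumed file naming: "wsj_SSNN", where SS is the section
# and NN is the file index within that section.
_WSJ_NAME = re.compile(r"wsj_(\d{2})(\d{2})")


def section_of(filename: str) -> int:
    """Return the WSJ section number encoded in a file name."""
    match = _WSJ_NAME.search(filename)
    if match is None:
        raise ValueError(f"not a WSJ file name: {filename!r}")
    return int(match.group(1))


def split_of(section: int) -> Optional[str]:
    """Map a section number to 'train', 'dev' or 'test'; None if unused."""
    for name, sections in SPLIT_SECTIONS.items():
        if section in sections:
            return name
    return None


if __name__ == "__main__":
    for name in ["wsj_0201.mrg", "wsj_2245.mrg", "wsj_2301.mrg", "wsj_0001.mrg"]:
        print(name, split_of(section_of(name)))
```

Sections outside 02–23 map to `None` here, which keeps them out of all three splits rather than silently assigning them to training.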

§ 01 · Leaderboard

Best published scores.

No results indexed yet — be the first to submit a score.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 A public checkpoint or API endpoint
  • 02 A reproduction script with frozen commit + seed
  • 03 Declared evaluation environment (Python, deps)
  • 04 One row per metric declared by this dataset
  • 05 A contact so we can follow up on discrepancies
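One way to record the frozen commit, seed and evaluation environment from the list above is a small JSON manifest written by the reproduction script. The sketch below is illustrative only: the field names and the `build_manifest` helper are hypothetical and do not reflect a format required by the submission guide.

```python
import json
import platform
import sys
from importlib import metadata
from typing import Dict, List, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def build_manifest(commit: str, seed: int, packages: List[str]) -> Dict[str, object]:
    """Collect the commit, seed and environment details in one dictionary.

    The keys used here are hypothetical placeholders.
    """
    return {
        "commit": commit,
        "seed": seed,
        "python": platform.python_version(),
        "platform": sys.platform,
        "deps": {name: installed_version(name) for name in packages},
    }


if __name__ == "__main__":
    # "<commit-hash>" is a placeholder; pass the commit your script was run at.
    manifest = build_manifest("<commit-hash>", seed=0, packages=["numpy"])
    print(json.dumps(manifest, indent=2))
```

Packages that are not installed are reported as `null` in the JSON output, which makes a missing dependency visible instead of failing the script.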