Codesota · Benchmark · DoTA (en->zh)

DoTA (en->zh).

DoTA (Document image machine Translation dataset of ArXiv articles in markdown format) is a large-scale dataset of document image to translation pairs introduced for document image machine translation (DIMT). It was built from arXiv articles rendered in markdown format and is intended to evaluate translation of long-context, complex-layout document images (e.g., whole pages with tables, figures, and sections) into markdown-formatted target text. The NAACL 2024 paper reports a filtered set of about 126K image–translation pairs; the authors also provide an unfiltered collection of about 139K samples in the public repository and dataset. The en→zh subset used in evaluations pairs English source documents with Chinese target text, and the dataset metadata indicates that other language variants are present. The dataset is distributed under an MIT license on Hugging Face, where it is gated and requires agreeing to access conditions.

§ 01 · Leaderboard

Results by metric.

Only 1 model on this benchmark

COMET

COMET is the reported evaluation metric for DoTA (en->zh). Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for COMET: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Trust | Score | Year | Links |
|------|-------|-------|-------|------|-------|
| 01 | HunyuanOCR (1B) | paper | 83.48 | N/A | Paper, Code |

dataset: DoTA (en->zh); task: 6
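
The muting rule stated above the table (a row is muted when an earlier or same-year result already scored better) can be sketched in a few lines. This is only an illustration of that rule as worded on this page, not Codesota's actual implementation. The `Result` class, the `muted_rows` function, the sample rows, and the choice to never mute rows with no reported year are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Result:
    model: str
    score: float          # COMET score, higher is better
    year: Optional[int]   # None when the year is not reported (shown as N/A)


def muted_rows(results: List[Result]) -> Set[str]:
    """Return the models that were not state of the art when published.

    A row is muted if some other row from the same or an earlier year
    has a strictly higher score.
    """
    muted: Set[str] = set()
    for row in results:
        if row.year is None:
            # Assumption: without a year we cannot judge, so never mute.
            continue
        for other in results:
            if other is row or other.year is None:
                continue
            if other.year <= row.year and other.score > row.score:
                muted.add(row.model)
                break
    return muted


# Hypothetical rows, for illustration only (not real leaderboard entries).
rows = [
    Result("model-a", 80.0, 2023),
    Result("model-b", 78.5, 2024),  # beaten by model-a from an earlier year
    Result("model-c", 82.1, 2024),
    Result("model-d", 81.0, None),  # year not reported
]

print(sorted(muted_rows(rows)))
```

With a single entry, as on this leaderboard today, nothing can be muted because there is no other row to compare against.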