A comprehensive benchmark for evaluating LLMs' writing capabilities across 1,000 real-world queries spanning 6 primary domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, and Advertising & Marketing) and 100 fine-grained subdomains. Each query averages 1,500+ tokens and is paired with 5 instance-specific evaluation criteria. The benchmark is built with a hybrid construction pipeline that combines Model-Augmented Query Generation with Human-in-the-Loop Refinement. Evaluation uses a query-dependent framework with dynamic criteria generation and rubric-based scoring on a 10-point scale, judged either by an LLM evaluator (Claude-Sonnet-4) or by a fine-tuned critic model.
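To make the scoring protocol concrete, here is a minimal sketch of rubric-based judging with an LLM evaluator: each response is rated 1-10 against its five instance-specific criteria and the per-criterion scores are averaged. The prompt wording, the score-parsing regex, and the Claude-Sonnet-4 model id are assumptions for illustration, not the benchmark's official harness.

```python
# Hedged sketch: per-criterion 10-point rubric scoring with an LLM judge,
# using the Anthropic Python SDK. Not the benchmark's official evaluator.
import re
import statistics
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def score_response(query: str, response: str, criteria: list[str]) -> float:
    """Average 1-10 score over the query's instance-specific criteria."""
    scores = []
    for criterion in criteria:
        prompt = (
            f"Query:\n{query}\n\nResponse:\n{response}\n\n"
            f"Evaluation criterion: {criterion}\n"
            "Rate how well the response satisfies this criterion on a "
            "scale of 1-10. Reply with the integer score only."
        )
        msg = client.messages.create(
            model="claude-sonnet-4-20250514",  # assumed model id for Claude-Sonnet-4
            max_tokens=8,
            messages=[{"role": "user", "content": prompt}],
        )
        match = re.search(r"\d+", msg.content[0].text)
        if match:
            # Clamp to the 1-10 rubric range in case the judge drifts.
            scores.append(min(10, max(1, int(match.group()))))
    return statistics.mean(scores) if scores else 0.0
```

A fine-tuned critic model can be dropped in by replacing the `client.messages.create` call with a local inference call; the clamping and averaging logic stays the same.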
No results indexed yet. Be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate that step on the progress chart with your name. A sketch of what such a script might look like follows below.
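For reference, here is a hedged sketch of a reproduction script: load a checkpoint, generate one response per benchmark query, and write the results as JSONL. The checkpoint id, file names, field names, and generation settings are all placeholders; no specific submission format is documented here.

```python
# Hypothetical reproduction script: deterministic generation over the
# benchmark queries using Hugging Face transformers. All paths and field
# names are assumptions, not a prescribed submission format.
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "your-org/your-checkpoint"  # placeholder checkpoint id

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT, device_map="auto")

with open("queries.jsonl") as f_in, open("responses.jsonl", "w") as f_out:
    for line in f_in:
        record = json.loads(line)
        inputs = tokenizer(record["query"], return_tensors="pt").to(model.device)
        # Greedy decoding so the run is reproducible across machines.
        output = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
        response = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        f_out.write(json.dumps({"id": record["id"], "response": response}) + "\n")
```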