Codesota · Guides · Prompting techniques
Original research · 100 samples · 9 frameworks · Published December 2025

The prompting framework tarpit.

We benchmarked RTF, TAG, RACE, and five other popular “prompting frameworks.” None improved accuracy. Some made it worse — by as much as nineteen points.

LinkedIn is flooded with prompting frameworks promising 340% ROI. We ran the numbers. The claims are fabricated. Here is why smart people keep falling for them — and what actually works.

§ 01 · Premise

Style is not substance.

Why the acronyms feel useful, and what they are actually changing.

Prompting frameworks affect style — how the model speaks. They do not affect substance — what it knows or how it reasons. Telling the model to act like Einstein gives you Einstein's vocabulary. It does not give you his reasoning.

What frameworks change: vocabulary and formality, response structure, tone and personality, the confidence level of language. What they do not change: actual accuracy on tasks, reasoning ability, knowledge depth, logical correctness.

Why smart people fall for it
Frameworks help you organise your thoughts — but LLMs do not think like humans. They parse intent from tokens, not templates. The structure is scaffolding for your brain, not theirs. Add confirmation bias (you remember the time RTF “worked”), authority heuristic (50K-follower prompt engineers seem credible), and complexity bias (a seven-step framework feels more real than “describe what you want”).
§ 02 · Benchmark

Nine frameworks, one hundred samples.

Methodology: Llama 3.3-70b via Groq, four task types, pass@1.

Nine prompting approaches on 100 samples across email classification, sentiment analysis, data extraction, and Q&A. Code and data at github.com/codesota/benchmarks/prompting.
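As a rough illustration of what pass@1 scoring over several prompt wrappers might look like, here is a minimal sketch. The `wrap_rtf` template, the `fake_model` stub, and the two samples are assumptions made up for this example. They are not the benchmark's actual code, which lives in the repository above.

```python
from typing import Callable

# Hypothetical samples: (input text, expected label). Not the benchmark data.
SAMPLES = [
    ("I love this product", "positive"),
    ("This is terrible", "negative"),
]

def wrap_baseline(text: str) -> str:
    """Plain instruction, no framework."""
    return f"Classify the sentiment as positive or negative: {text}"

def wrap_rtf(text: str) -> str:
    """Assumed Role-Task-Format layout; real templates may differ."""
    return (
        "Role: You are a sentiment analyst.\n"
        f"Task: Classify the sentiment of: {text}\n"
        "Format: One word, positive or negative."
    )

def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call so the sketch runs offline."""
    return "positive" if "love" in prompt else "negative"

def pass_at_1(wrap: Callable[[str], str], model: Callable[[str], str]) -> float:
    """Fraction of samples answered correctly on the first (and only) attempt."""
    correct = sum(
        model(wrap(text)).strip().lower() == expected
        for text, expected in SAMPLES
    )
    return correct / len(SAMPLES)

for name, wrap in [("baseline", wrap_baseline), ("rtf", wrap_rtf)]:
    print(name, pass_at_1(wrap, fake_model))
```

Swapping `fake_model` for a real API call and `SAMPLES` for a labelled dataset keeps the scoring logic unchanged.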

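Framework | Accuracy | Δ vs. baseline (pts) | Avg tokens | Token waste
----------|----------|----------------------|------------|------------
Baseline  | 97%      | -                    | 93         | -
APE       | 97%      | 0                    | 108        | +16%
RACE      | 97%      | 0                    | 123        | +32%
TRACE     | 97%      | 0                    | 122        | +31%
COAST     | 95%      | -2                   | 121        | +30%
ROSES     | 95%      | -2                   | 118        | +27%
RTF       | 94%      | -3                   | 119        | +28%
STAR      | 80%      | -17                  | 132        | +42%
TAG       | 78%      | -19                  | 132        | +42%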
Key finding
Baseline ties or beats every framework. STAR and TAG hurt performance by 17 and 19 points respectively — their rigid structure confused the model on data-extraction tasks.
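The two derived columns can be recomputed from the raw accuracy and token figures. The sketch below uses only the numbers in the table; the function names are illustrative.

```python
# Accuracy (%) and average tokens per prompt, copied from the table above.
ACCURACY = {
    "Baseline": 97, "APE": 97, "RACE": 97, "TRACE": 97,
    "COAST": 95, "ROSES": 95, "RTF": 94, "STAR": 80, "TAG": 78,
}
AVG_TOKENS = {
    "Baseline": 93, "APE": 108, "RACE": 123, "TRACE": 122,
    "COAST": 121, "ROSES": 118, "RTF": 119, "STAR": 132, "TAG": 132,
}

def accuracy_delta_points(name: str) -> int:
    """Accuracy difference from the baseline, in percentage points."""
    return ACCURACY[name] - ACCURACY["Baseline"]

def token_waste_pct(name: str) -> int:
    """Extra tokens relative to the baseline, rounded to a whole percent."""
    base = AVG_TOKENS["Baseline"]
    return round((AVG_TOKENS[name] - base) / base * 100)

for name in ACCURACY:
    if name != "Baseline":
        print(f"{name}: {accuracy_delta_points(name):+d} pts, "
              f"{token_waste_pct(name):+d}% tokens")
```

Every framework spends at least 16% more tokens than the baseline, and none gains a single point for it.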
§ 03 · Literature

What peer-reviewed papers say.

Four citations. None support the frameworks.

“When ‘A Helpful Assistant’ Is Not Really Helpful” · arXiv 2024
Tested personas in system prompts across multiple LLMs. Personas do not improve performance. “You are an expert…” is essentially a no-op for accuracy.
“Persona is a Double-edged Sword” · ACL 2024
Role prompting hurts reasoning in 13–14% of cases, helps in 15–16%. Net effect is nearly random. Random persona choice works as well as careful selection.
“The Decreasing Value of Chain of Thought” · Wharton 2025
CoT benefits are shrinking with newer models. For reasoning models, CoT provides only 2.9–3.1% improvement.
RTF, TAG, RACE, COAST, STAR, APE, TRACE, ROSES · LinkedIn 2023–25
Zero peer-reviewed papers. Zero reproducible experiments. “340% ROI” has no citations. These are marketing constructs.
§ 04 · Evidence

Techniques that have empirical support.

They change HOW the model reasons, not how it formats output.

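Technique              | Accuracy | Tokens | Latency | Cost
-----------------------|----------|--------|---------|------
Zero-Shot              | 60%      | 1×     | 1×      | 1×
Few-Shot (3 examples)  | 72%      | 2.5×   | 1.2×    | 2.5×
Chain-of-Thought       | 78%      | 3.5×   | 1.8×    | 3.5×
Self-Consistency (n=5) | 84%      | 5×     | 5×      | 5×
Few-Shot + CoT         | 82%      | 4.5×   | 2×      | 4.5×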

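Self-Consistency achieves the highest accuracy, but at five times the cost and latency. Few-Shot + CoT comes within two points of it at 4.5× the cost and 2× the latency, which makes it the more practical high-accuracy option for most applications. On raw accuracy per unit of cost, zero-shot still leads (60 points at 1×).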
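The cost column can be turned into ratios. This sketch uses only the accuracy and cost multipliers from the table; the two ratio definitions are one reasonable way to read it, not the only one.

```python
# (accuracy %, cost multiplier) copied from the table above.
TECHNIQUES = {
    "Zero-Shot": (60, 1.0),
    "Few-Shot (3 examples)": (72, 2.5),
    "Chain-of-Thought": (78, 3.5),
    "Self-Consistency (n=5)": (84, 5.0),
    "Few-Shot + CoT": (82, 4.5),
}

def accuracy_per_cost(name: str) -> float:
    """Accuracy points per unit of cost."""
    accuracy, cost = TECHNIQUES[name]
    return round(accuracy / cost, 2)

def points_per_extra_cost(name: str) -> float:
    """Points gained over zero-shot per extra unit of cost."""
    base_accuracy, base_cost = TECHNIQUES["Zero-Shot"]
    accuracy, cost = TECHNIQUES[name]
    return round((accuracy - base_accuracy) / (cost - base_cost), 2)

for name in TECHNIQUES:
    line = f"{name}: {accuracy_per_cost(name)} pts per cost unit"
    if name != "Zero-Shot":
        line += f", {points_per_extra_cost(name)} pts per extra cost unit"
    print(line)
```

The ratios make the diminishing returns explicit: each step up in cost buys fewer points per unit spent.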

§ 05 · Model variance

Chain-of-Thought gains, by model.

Gains range from +4.4% to +13.5% relative to the zero-shot score, so the payoff depends heavily on the model.

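Model             | Zero-shot | With CoT | Gain (relative)
------------------|-----------|----------|----------------
Gemini Flash 2.0  | 71.2%     | 80.8%    | +13.5%
Claude 3.5 Sonnet | 74.1%     | 82.8%    | +11.7%
GPT-4o            | 76.4%     | 82.7%    | +8.2%
GPT-4o-mini       | 68.9%     | 71.9%    | +4.4%
Claude 3 Haiku    | 62.3%     | 66.1%    | +6.1%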
Warning
CoT can hurt performance on easy questions. The model may overthink and introduce errors where a direct answer would be correct. Profile your task distribution before defaulting to CoT.
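The Gain column in the table above is relative to the zero-shot score, not a difference in percentage points. A short sketch that recomputes both from the table values:

```python
# (zero-shot %, with-CoT %) copied from the table above.
SCORES = {
    "Gemini Flash 2.0": (71.2, 80.8),
    "Claude 3.5 Sonnet": (74.1, 82.8),
    "GPT-4o": (76.4, 82.7),
    "GPT-4o-mini": (68.9, 71.9),
    "Claude 3 Haiku": (62.3, 66.1),
}

def gains(zero_shot: float, with_cot: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = with_cot - zero_shot
    relative = absolute / zero_shot * 100
    return round(absolute, 1), round(relative, 1)

for model, (zero_shot, with_cot) in SCORES.items():
    absolute, relative = gains(zero_shot, with_cot)
    print(f"{model}: +{absolute} pts, +{relative}% relative")
```

When comparing against the points-based numbers elsewhere in this guide, use the absolute figure.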
§ 06 · Implementations

Each technique, in code.

Four patterns, copy-paste, with the edges they work on and the edges where they fail.

Zero-shot

Direct instruction, no examples. Baseline for comparison. Best token efficiency when it works. Strong on simple classification and extraction; weak on multi-step reasoning.

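# Zero-shot: Direct instruction
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Classify this review as positive, negative, or neutral: 'The product arrived late but works great.'"
    }]
)
print(response.choices[0].message.content)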

Few-shot

Provide two to five examples to establish the pattern. About +12% over zero-shot on average. Diminishing returns beyond five. Best for format-sensitive output and domain jargon; costs context window.

# Few-shot: Provide examples to establish pattern
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": """Classify reviews as positive, negative, or neutral.

Review: "Absolutely love it, best purchase ever!"
Classification: positive

Review: "Broken on arrival, total waste of money."
Classification: negative

Review: "It's okay, nothing special but works fine."
Classification: neutral

Review: "The product arrived late but works great."
Classification:"""
    }]
)
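Hard-coding the examples into one string gets awkward once they change often. A minimal sketch of a builder that produces the same layout as the prompt above; `build_few_shot_prompt` is a hypothetical helper, not part of any SDK.

```python
def build_few_shot_prompt(
    instruction: str,
    examples: list[tuple[str, str]],
    query: str,
) -> str:
    """Lay out labelled examples, then the unlabelled query, in one prompt."""
    lines = [instruction, ""]
    for review, label in examples:
        lines += [f'Review: "{review}"', f"Classification: {label}", ""]
    # End on an open "Classification:" so the model completes the pattern.
    lines += [f'Review: "{query}"', "Classification:"]
    return "\n".join(lines)

EXAMPLES = [
    ("Absolutely love it, best purchase ever!", "positive"),
    ("Broken on arrival, total waste of money.", "negative"),
    ("It's okay, nothing special but works fine.", "neutral"),
]

prompt = build_few_shot_prompt(
    "Classify reviews as positive, negative, or neutral.",
    EXAMPLES,
    "The product arrived late but works great.",
)
print(prompt)
```

Keeping examples as data also makes it easy to test whether two, three, or five of them actually change accuracy.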

Chain-of-Thought

Request step-by-step reasoning. Most effective on math and logic. Adds 20–80% latency; can yield 15–20% accuracy gains on complex tasks. Weak on factual recall.

# Chain-of-Thought: Request step-by-step reasoning
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": """Solve this step by step:
A store has 45 apples. They sell 1/3 of them in the morning
and 1/2 of the remaining in the afternoon.
How many apples are left?

Let's think through this step by step:"""
    }]
)
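To grade a chain-of-thought response you need the expected answer and a way to pull the final number out of the reasoning text. The sketch below works the apple problem exactly and uses a simple "last number wins" parser; that parsing rule and the sample response are assumptions for illustration, and real responses may need a stricter format.

```python
import re
from fractions import Fraction
from typing import Optional

def apples_left(start: int) -> Fraction:
    """Sell 1/3 in the morning, then 1/2 of the remainder in the afternoon."""
    after_morning = start - start * Fraction(1, 3)                      # 45 -> 30
    after_afternoon = after_morning - after_morning * Fraction(1, 2)    # 30 -> 15
    return after_afternoon

def final_number(response: str) -> Optional[str]:
    """Return the last number that appears in the response, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None

# Made-up response text, standing in for real model output.
sample = "Morning: 45 / 3 = 15 sold, so 30 remain. Afternoon: 30 / 2 = 15 sold. Answer: 15"
expected = apples_left(45)
print(expected, final_number(sample) == str(expected))
```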

Self-consistency

Run the same prompt multiple times with temperature > 0, then vote on the most common answer. High cost but highest accuracy for critical decisions.

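import asyncio
import re
from collections import Counter

from openai import AsyncOpenAI

# asyncio.gather needs awaitables, so use the async client here.
async_client = AsyncOpenAI()

def extract_answer(text: str) -> str:
    """Illustrative parser: last number in the response, else the last line."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    if numbers:
        return numbers[-1]
    lines = text.strip().splitlines()
    return lines[-1].strip() if lines else ""

async def self_consistency(prompt: str, n: int = 5) -> str:
    """Run prompt n times, return majority answer."""
    responses = await asyncio.gather(*[
        async_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7  # Need variation
        )
        for _ in range(n)
    ])

    # Extract final answers and vote
    answers = [extract_answer(r.choices[0].message.content or "")
               for r in responses]
    most_common = Counter(answers).most_common(1)[0][0]
    return most_common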
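The voting step can be tested without any API calls. This sketch isolates it and also reports how many samples agreed with the winner, which is one possible signal for flagging shaky answers. The sampled answers are made up, and `majority_vote` is an illustrative name.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of samples that gave it."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    # On a tie, Counter.most_common keeps the answer that was seen first.
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

# Made-up extracted answers from five samples of the same prompt.
sampled = ["15", "15", "10", "15", "30"]
answer, agreement = majority_vote(sampled)
print(answer, agreement)  # 15 0.6
```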
§ 07 · Efficiency

Measuring the token-budget tradeoff.

TALE (Dec 2024) showed that budget-aware prompting reduces token usage by 68.9% with <5% accuracy loss.

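# TALE-inspired: Budget-aware prompting
# Placeholders: replace with examples drawn from your own task.
FEW_SHOT_EXAMPLES = "<two input/output examples>"
FEW_SHOT_COT_EXAMPLES = "<examples with worked reasoning>"
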
def budget_aware_prompt(task: str, complexity: str) -> dict:
    """Adjust prompting strategy based on task complexity."""

    if complexity == "simple":
        # Zero-shot, minimal tokens
        return {
            "prompt": task,
            "max_tokens": 50,
            "strategy": "zero-shot"
        }

    elif complexity == "medium":
        # Few-shot with 2 examples
        return {
            "prompt": f"{FEW_SHOT_EXAMPLES}\n\n{task}",
            "max_tokens": 200,
            "strategy": "few-shot"
        }

    else:  # complex
        # Full CoT with examples
        return {
            "prompt": f"{FEW_SHOT_COT_EXAMPLES}\n\n{task}\nLet's solve step by step:",
            "max_tokens": 500,
            "strategy": "few-shot-cot"
        }

Efficiency calculation

def calculate_prompting_efficiency(
    accuracy: float,
    tokens_used: int,
    latency_ms: int,
    cost_per_1k: float = 0.01
) -> dict:
    """Calculate efficiency metrics for a prompting strategy."""

    # Token efficiency: accuracy per 1000 tokens
    token_efficiency = (accuracy / tokens_used) * 1000

    # Cost efficiency: accuracy points per dollar
    cost = (tokens_used / 1000) * cost_per_1k
    cost_efficiency = accuracy / cost if cost > 0 else 0

    # Time-accuracy tradeoff
    accuracy_per_second = accuracy / (latency_ms / 1000)

    return {
        "accuracy": accuracy,
        "tokens_used": tokens_used,
        "latency_ms": latency_ms,
        "cost_usd": round(cost, 4),
        "token_efficiency": round(token_efficiency, 2),
        "cost_efficiency": round(cost_efficiency, 2),
        "accuracy_per_second": round(accuracy_per_second, 2)
    }

# Example comparison
zero_shot = calculate_prompting_efficiency(60, 150, 800)
few_shot = calculate_prompting_efficiency(72, 450, 1100)
cot = calculate_prompting_efficiency(78, 650, 1500)

print(f"Zero-shot efficiency: {zero_shot['token_efficiency']}")  # 400.0
print(f"Few-shot efficiency: {few_shot['token_efficiency']}")    # 160.0
print(f"CoT efficiency: {cot['token_efficiency']}")              # 120.0
Interpretation
Zero-shot has 400 token efficiency; CoT drops to 120. Zero-shot extracts 3.3× more accuracy per token. Use CoT only when the accuracy gain justifies the cost.
§ 08 · Decision

Choose your technique by task and constraint.

Task Type               | Recommended Technique | Expected Gain | Token Cost
------------------------|----------------------|---------------|------------
Simple classification   | Zero-shot            | Baseline      | 1x
Format-sensitive output | Few-shot (2-3 ex)    | +12-15%       | 2-3x
Math/logic problems     | Chain-of-Thought     | +15-20%       | 3-4x
High-stakes decisions   | Self-Consistency     | +20-25%       | 5x
Complex domain tasks    | Few-shot + CoT       | +25-30%       | 4-5x
Latency-critical        | Zero-shot or cached  | Baseline      | 1x
When to use zero-shot
  • Latency under 1 second required
  • High volume (>10K requests/day)
  • Task accuracy already >80%
  • Simple classification or extraction
When to use CoT
  • Math or logic problems
  • Multi-step reasoning required
  • Accuracy is paramount
  • Tasks where errors are costly
When to use few-shot
  • Output format must match exactly
  • Domain-specific terminology
  • Edge cases in training data
  • Style consistency matters
When to use self-consistency
  • Error cost > 100× prompt cost
  • Medical / legal / financial decisions
  • Low volume, high stakes
  • When you can wait for results
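The thresholds in the lists above can be written down as a small rule function. This is a sketch only: the field names, the default values, and the order in which the rules fire are assumptions, and the right precedence depends on your own constraints.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    """Facts about a task. Field names are illustrative, not a standard schema."""
    multi_step_reasoning: bool = False
    exact_format_required: bool = False
    latency_budget_s: float = 5.0
    requests_per_day: int = 1_000
    zero_shot_accuracy: float = 0.0   # measured baseline, 0.0 to 1.0
    error_cost_usd: float = 0.0
    prompt_cost_usd: float = 0.001

def choose_technique(p: TaskProfile) -> str:
    """Apply the thresholds from the lists above. Rule order is an assumption."""
    # Hard constraints: tight latency or high volume rules out costly options.
    if p.latency_budget_s < 1.0 or p.requests_per_day > 10_000:
        return "zero-shot"
    # High stakes: one error costs more than 100x a single prompt.
    if p.error_cost_usd > 100 * p.prompt_cost_usd:
        return "self-consistency"
    # Baseline already above 80%: little headroom to pay for.
    if p.zero_shot_accuracy > 0.80:
        return "zero-shot"
    if p.multi_step_reasoning and p.exact_format_required:
        return "few-shot + cot"
    if p.multi_step_reasoning:
        return "chain-of-thought"
    if p.exact_format_required:
        return "few-shot"
    return "zero-shot"

print(choose_technique(TaskProfile(multi_step_reasoning=True)))
```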
§ 09 · Mistakes

Common traps.

Using CoT for everything
CoT adds latency and cost. On simple factual queries it can decrease accuracy by overthinking. Profile the task distribution first.
Too many few-shot examples
Beyond five examples, returns diminish rapidly. Three diverse examples typically suffice.
Not measuring baseline first
Always establish zero-shot accuracy before adding complexity. If zero-shot hits 90%, no technique can add more than 10 points.
Ignoring model-specific behavior
Prompts that work for GPT-4 may fail for Claude or Gemini. Test on your target model.
§ 10 · Method

Your measurement checklist.

  1. Establish baseline
    Run zero-shot on 100+ samples. Record accuracy, latency, tokens.
  2. Test techniques independently
    Few-shot alone, CoT alone, then combinations. Measure each.
  3. Calculate efficiency
    Token efficiency = (accuracy / tokens) × 1000. Compare ratios.
  4. Apply cost constraints
    At your volume, what is the monthly cost difference? Is accuracy worth it?
  5. Monitor in production
    Track metrics over time. Model updates can change optimal strategy.
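The cost question in step 4 is plain arithmetic. A sketch using the token counts and accuracies from the efficiency example earlier (150 tokens at 60% for zero-shot, 650 tokens at 78% for CoT) and its default price of $0.01 per 1K tokens; the volume of 10,000 requests per day and the 30-day month are assumptions.

```python
def monthly_cost_usd(
    tokens_per_request: int,
    requests_per_day: int,
    cost_per_1k: float = 0.01,
    days: int = 30,
) -> float:
    """Monthly spend for one prompting strategy at a fixed daily volume."""
    cost_per_request = tokens_per_request / 1000 * cost_per_1k
    return round(cost_per_request * requests_per_day * days, 2)

zero_shot = monthly_cost_usd(150, 10_000)
cot = monthly_cost_usd(650, 10_000)
extra = round(cot - zero_shot, 2)
points_gained = 78 - 60

print(zero_shot, cot, extra)              # 450.0 1950.0 1500.0
print(round(extra / points_gained, 2))    # 83.33 dollars per accuracy point
```

Whether about $83 a month per point is worth paying depends on what an error costs you.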
§ 11 · Toolkit

The evidence-based toolkit.

Two tiers: proven prompting, and the techniques that actually change capability.

Tier 1 · Proven prompting
Clear, specific instructions
The single highest-impact factor. Describe exactly what you want, in what format, with what constraints. +20–40% over vague prompts.
Few-shot examples
Two to five examples establishing the pattern. Most effective for format matching and domain vocabulary. +10–15% accuracy, 2–3× token cost.
Chain-of-thought for complex reasoning
Ask the model to show its work. Works for math, logic, multi-step problems. Diminishing returns on newer models. +5–20%, 2–4× tokens.
Structured output (JSON mode)
Use the model's native JSON mode or schema enforcement. 100% format compliance, minimal cost.
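A minimal sketch of the structured-output item, assuming the OpenAI chat completions API and its `response_format={"type": "json_object"}` setting. JSON mode constrains the syntax of the reply, but the keys and values are still worth checking yourself. The `label` and `reason` keys and the `parse_classification` helper are illustrative choices, not part of the API.

```python
import json

# Request settings for JSON mode; pass these to client.chat.completions.create.
REQUEST_KWARGS = {
    "model": "gpt-4o",
    "response_format": {"type": "json_object"},
    "messages": [{
        "role": "user",
        "content": (
            'Classify this review and reply in JSON with keys "label" '
            '(positive, negative, or neutral) and "reason": '
            "'The product arrived late but works great.'"
        ),
    }],
}

ALLOWED_LABELS = {"positive", "negative", "neutral"}

def parse_classification(raw: str) -> dict:
    """Parse a reply and check the fields this task relies on."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("label") not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {data.get('label')!r}")
    return data

# Made-up reply text, standing in for response.choices[0].message.content.
reply = '{"label": "positive", "reason": "Works great despite the delay."}'
print(parse_classification(reply)["label"])  # positive
```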
Tier 2 · Beyond prompting

These techniques add real knowledge or change model weights. Prompting cannot make a model know things it was not trained on.

RAG — Retrieval-Augmented Generation
Inject relevant documents at query time. Model can now answer questions about your private data, recent events, or specialised domains. Use when the model lacks knowledge you have in documents.
Fine-tuning
Train on your specific data. Changes actual weights. Model learns your domain’s patterns, terminology, edge cases. Use for consistent style/format at scale.
Tool use / function calling
Let the model call external APIs, databases, or code. Extends capability to real-time data and actions.
Agentic workflows
Chain multiple LLM calls with planning, reflection, and tool use. Handles complex multi-step tasks.
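To make the RAG item concrete, here is a toy sketch of the retrieve-then-prompt flow. Word overlap stands in for the embedding search a real system would use, and the documents, function names, and prompt wording are all made up for illustration.

```python
import re

# Hypothetical knowledge base the model was never trained on.
DOCS = [
    "Refund requests are accepted within 30 days of purchase.",
    "Shipping takes 3 to 5 business days.",
    "Support answers email on weekdays.",
]

def tokens(text: str) -> set[str]:
    """Lowercased words and numbers, punctuation removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by shared words with the query; keep the top k."""
    query_tokens = tokens(query)
    ranked = sorted(docs, key=lambda d: len(query_tokens & tokens(d)), reverse=True)
    return ranked[:k]

def build_rag_prompt(query: str, docs: list[str], k: int = 1) -> str:
    """Put the retrieved passages in front of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

print(build_rag_prompt("What is the refund window?", DOCS))
```

The prompt itself stays simple. The accuracy comes from the retrieved text, which is the point of this tier.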
§ 12 · Take

What actually moves the needle.

RAG works. Fine-tuning works. But they are still somewhat overhyped as silver bullets. RAG requires careful chunking, retrieval tuning, and context management. Fine-tuning requires quality data, evaluation pipelines, and ongoing maintenance.

What actually changed my productivity: agents that can iterate. Not “agent” as a marketing term. I mean systems that learn from mistakes, ask clarifying questions, research autonomously, and iterate on solutions.

A simple prompt plus an iteration loop beats a perfectly-crafted one-shot prompt every time. The frameworks LinkedIn loves optimise for single-turn interactions. Real work is multi-turn, iterative, and requires adaptation.
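A minimal sketch of such an iteration loop: generate, check, and feed the failure back into the next attempt. The `fake_model` stub and the `must_be_integer` check are placeholders so the example runs offline; in practice the check would be a test suite, a schema validator, or a human review.

```python
from typing import Callable, Optional

def iterate(
    task: str,
    model: Callable[[str], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Retry with feedback until the check passes. Returns (answer, rounds used)."""
    prompt = task
    for round_number in range(1, max_rounds + 1):
        answer = model(prompt)
        problem = check(answer)  # None means the answer passed
        if problem is None:
            return answer, round_number
        # Feed the failed attempt and the reason back to the model.
        prompt = (
            f"{task}\n\nPrevious attempt:\n{answer}\n\n"
            f"Problem found:\n{problem}\n\nPlease fix it."
        )
    raise RuntimeError(f"no passing answer after {max_rounds} rounds")

def fake_model(prompt: str) -> str:
    """Stand-in for an LLM: gets it right only after seeing feedback."""
    return "15" if "Problem found" in prompt else "fifteen"

def must_be_integer(answer: str) -> Optional[str]:
    """Example check: the answer has to be a plain integer."""
    return None if answer.strip().isdigit() else "Answer must be a plain integer."

answer, rounds = iterate("How many apples are left?", fake_model, must_be_integer)
print(answer, rounds)  # 15 2
```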

§ 13 · TL;DR

The six-line summary.

  • Skip: RTF, TAG, RACE, COAST, STAR, APE, TRACE, ROSES. Zero evidence. Wasted tokens.
  • Be skeptical: Role prompting (“You are an expert…”). Research shows unpredictable effects.
  • Use: Clear instructions, few-shot examples, structured output, CoT for complex reasoning.
  • Level up: RAG for knowledge, fine-tuning for consistency, tool use for actions.
  • Real breakthrough: Agents that iterate, learn from errors, and ask questions. Multi-turn beats perfect one-shot.
  • Measure: Run your own benchmarks. Your task is unique.
§ 14 · Sources

Citations and data.

Peer-reviewed research
Our original research
  • Framework benchmark code & data
    100 samples · 9 frameworks · 4 task types
  • Model: Llama 3.3-70b via Groq
    500 tok/s inference speed
  • Tasks: Email, Sentiment, Extraction, Q&A
    Real-world office scenarios
Not research — marketing
RTF, TAG, RACE, COAST, STAR, APE, TRACE, ROSES. Zero citations. Zero reproducible experiments. LinkedIn virality only.