E³ Training Trade-off Space
Efficiency × Energy × Effectiveness
Abstract
This report presents a systematic comparison of three Transformer architectures—Encoder-only (BERT family), Decoder-only (GPT family), and Encoder-Decoder (T5/BART family)—across three critical dimensions: Efficiency (time), Energy (power consumption), and Effectiveness (task performance). Using the E³ Mini-Benchmark framework on NVIDIA Tesla V100 GPUs with precise energy measurements, we reveal fundamental trade-offs that challenge the conventional wisdom of architecture selection.
Our key finding: no architecture simultaneously optimizes all three dimensions. The "winner" depends critically on deployment constraints—latency budgets favor Decoder-only, energy budgets favor Encoder-Decoder, and quality requirements determine the optimal choice. Most surprisingly, we demonstrate that faster does not mean greener: GPT-2 achieves 52% higher throughput than T5 but consumes 2× more energy per token, revealing a crucial efficiency-energy decoupling overlooked in current benchmarks.
1. Introduction
1.1 The E³ Framework
Modern language model evaluation focuses predominantly on accuracy metrics, yet real-world deployment faces hard constraints across three orthogonal dimensions:
⚡ Efficiency (E₁)
Training: Tokens processed per second, wall-clock time to convergence
Inference: TTFT (Time-To-First-Token), TBT (Time-Between-Tokens), throughput
🔋 Energy (E₂)
Training: Total energy to convergence (kWh)
Inference: Energy per sample/token (Joules)
🎯 Effectiveness (E₃)
Task accuracy: SuperGLUE, MMLU
Capabilities: Zero-shot and few-shot learning
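These definitions reduce to two derived quantities per run. A minimal sketch of how E₁ and E₂ fall out of raw measurements (function names are ours, not part of the benchmark's API):

```python
def throughput(tokens: int, seconds: float) -> float:
    """E1 (efficiency): tokens processed or generated per second."""
    return tokens / seconds

def energy_per_token(avg_power_watts: float, seconds: float, tokens: int) -> float:
    """E2 (energy): Joules per token, i.e. average power x wall-clock time / tokens."""
    return avg_power_watts * seconds / tokens

# A run that emits 1000 tokens in 10 s at an average draw of 200 W:
print(throughput(1000, 10.0))             # 100.0 tok/s
print(energy_per_token(200, 10.0, 1000))  # 2.0 J/token
```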
1.2 Motivation
Three trends motivate this comprehensive analysis:
- Architectural consolidation: Decoder-only models (GPT, LLaMA, Qwen) increasingly dominate despite architectural alternatives
- Sustainability concerns: AI's carbon footprint demands energy-conscious design
- Deployment diversity: Edge devices, cloud servers, and mobile platforms impose different constraints
Central Question: When one dimension is constrained (latency, energy, or accuracy), what is the optimal architectural choice?
1.3 Experimental Setup
Hardware Environment:
- GPU: NVIDIA Tesla V100-SXM2-32GB
- CUDA Version: 12.8
- Precision: FP16 (no BF16, no FlashAttention for fair comparison)
Model Selection (similar parameter scales):
| Architecture | Models | Parameters |
|---|---|---|
| Encoder-only | BERT-Base, RoBERTa-Base, DistilBERT | 66M-110M |
| Decoder-only | GPT-2, LLaMA-3.2-1B, Qwen2.5-0.5/1.5B | 124M-1.5B |
| Encoder-Decoder | T5-Base/Large, FLAN-T5-Base/Large, BART-Base | 140M-780M |
Measurement Protocol:
- Real-time GPU power monitoring via NVML (∫ Power(t) dt)
- Token-by-token latency breakdown (TTFT vs TBT)
- Multiple seeds for statistical significance
- LoRA fine-tuning for parameter efficiency
2. SuperGLUE Fine-tuning: Training Phase Analysis
2.1 Task Performance
We fine-tuned all models on four SuperGLUE tasks: BoolQ (question answering), RTE (textual entailment), WiC (word sense), and CB (commitment bank).
Key Findings
Task-level Results:
- BoolQ (Question Answering): T5-Base 77.1% >> BERT-Base 63.9% >> GPT-2 61.1%
- RTE (Textual Entailment): T5 53.1%, BERT 52.7%, GPT-2 52.7% (statistical tie)
- WiC (Word Sense): T5 61.6% > BERT 57.5% > GPT-2 53.1%
- CB (Commitment): All models struggle (~41-45%, near random)
2.2 Training Efficiency
Time to Convergence (BoolQ, 5 epochs):
- GPT-2: 303 seconds ✓ (fastest)
- BERT-Base: 309 seconds
- T5-Base: 536 seconds (77% slower)
T5's slowdown stems from three factors:
1. Dual attention mechanisms
2. Larger parameter count (220M vs 110-124M)
3. More complex forward pass
Efficiency Ranking: GPT-2 ≈ BERT >> T5
2.3 Memory Footprint
Memory Requirements:
- BERT-Base: 5.46 GB (lowest footprint; single encoder stack)
- GPT-2: 7.86 GB (highest, despite a similar parameter count)
- T5-Base: 7.17 GB (dual stacks, partly offset by shared embeddings)
2.4 Training Energy Analysis
Energy Consumption (BoolQ, full training):
| Model | Training Time | Avg Power | Total Energy | Relative |
|---|---|---|---|---|
| GPT-2 | 303s | 792W | 0.067 kWh | 1.0× (baseline) |
| BERT | 309s | 881W | 0.076 kWh | 1.13× |
| T5 | 536s | 583W | 0.087 kWh | 1.30× |
Training-Energy Ranking (best to worst): GPT-2 > BERT > T5
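The Total Energy column follows directly from E = P̄ · t; a quick sketch reproducing the reported figures:

```python
def total_kwh(avg_power_watts: float, seconds: float) -> float:
    # Watts x seconds = Joules; 1 kWh = 3.6e6 J
    return avg_power_watts * seconds / 3_600_000

for name, watts, secs in [("GPT-2", 792, 303), ("BERT", 881, 309), ("T5", 583, 536)]:
    print(f"{name}: {total_kwh(watts, secs):.3f} kWh")
# GPT-2: 0.067 kWh, BERT: 0.076 kWh, T5: 0.087 kWh
```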
2.5 Training E³ Bubble Chart
This visualization reveals the fundamental trilemma:
- GPT-2: Wins on E₁ (speed) and E₂ (energy), sacrifices E₃ (accuracy)
- T5: Wins on E₃ (accuracy), loses on E₁ and E₂
- BERT: Middle ground, but strictly dominated on all three dimensions
3. Few-shot Evaluation: Zero/Few-shot Capabilities
3.1 Learning Curves Across Shot Counts
Zero-shot Performance (MMLU)
Accuracy at 0-shot:
- BERT-Base: 23.1% (near random, ~25% for 4-choice)
- GPT-2: ~26-28%
- T5-Base: ~30-32%
- LLaMA-3.2-1B: 35-40%
- Qwen2.5-1.5B: 40-45% ✓ (best)
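For context, zero-shot MMLU scoring reduces each item to a single next-token prediction over the choice letters. A sketch of one plausible prompt template (ours; the report does not specify its exact template):

```python
def mmlu_prompt(question: str, choices: list[str]) -> str:
    """Render a 4-choice MMLU item; the model's next token after 'Answer:'
    is compared against the letters A-D (hypothetical template)."""
    lettered = [f"{'ABCD'[i]}. {c}" for i, c in enumerate(choices)]
    return "\n".join([question, *lettered, "Answer:"])

print(mmlu_prompt("What is the capital of France?",
                  ["Berlin", "Madrid", "Paris", "Rome"]))
```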
3.2 Few-shot Learning Capability
Performance Gains (0-shot → 10-shot):
- Decoder-only: +8-12% (moderate, already good at zero-shot)
- Encoder-Decoder: +12-18% (largest gains, benefits most from examples)
- Encoder-only: +3-6% (limited, architecture mismatch)
Architecture takeaways:
- Encoder-only: Strong on classification, weak on generation
- Decoder-only: Balanced, strong zero-shot baseline
- Encoder-Decoder: Strongest with examples, excels at structured generation
4. Inference Benchmarking: Latency and Throughput
4.1 Overall Performance
4.2 Context Length Scalability
Scalability Patterns:
- GPT-2: Near-constant TTFT (~10ms) across context lengths (efficient prefill with KV caching)
- T5: Higher baseline TTFT (~23ms) but stable (encoder processes once, caches)
- BERT: Linear growth with sequence length (full bidirectional attention)
4.3 Fine-grained Latency Breakdown: TTFT vs TBT
GPT-2 Latency Profile (@ 512 tokens):
- TTFT: 10.1 ms
- TBT: 10.0 ms
- Throughput: 94.9 tokens/sec
T5-Base Latency Profile (@ 512 tokens):
- TTFT: 23.0 ms (including encoder 9.0ms + first decode 14.0ms)
- TBT: 14.5 ms (45% slower per token)
- Throughput: 62.3 tokens/sec
Efficiency Ranking: GPT-2 >> T5 (GPT-2 is 1.52× faster on throughput)
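The TTFT/TBT split composes into end-to-end generation latency as TTFT + (N−1)·TBT; applying it to the measured profiles:

```python
def generation_latency_ms(ttft_ms: float, tbt_ms: float, n_tokens: int) -> float:
    # First token costs TTFT; each of the remaining n_tokens - 1 costs TBT
    return ttft_ms + (n_tokens - 1) * tbt_ms

# 100-token completions at the 512-token context profiles above
gpt2_ms = generation_latency_ms(10.1, 10.0, 100)  # ≈ 1000 ms
t5_ms = generation_latency_ms(23.0, 14.5, 100)    # ≈ 1459 ms
```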
5. The E³ Trade-off: Core Findings
5.1 Inference E³ Bubble Chart
Pareto Frontier Analysis:
- Lower-left corner: Ideal (low latency, high accuracy, small bubble)
- GPT-2 position: Low latency but larger bubbles (more energy)
- T5 position: Higher latency but smaller bubbles (less energy)
5.2 The Efficiency-Energy Decoupling
Measured Data:
| Model | Throughput | Energy/Token | Efficiency Rank | Energy Rank |
|---|---|---|---|---|
| GPT-2 | 95 tok/s | 2.40 J/tok | 1st (faster) | 2nd (wasteful) |
| T5 | 62 tok/s | 1.14 J/tok | 2nd (slower) | 1st (greener) |
🔥 The Decoupling
GPT-2 achieves 52% higher throughput but consumes 2.1× more energy per token.
Why?
- Power draw difference: GPT-2 draws 211W vs T5's 201W during inference
- Architectural efficiency: T5's encoder-decoder separation enables better energy utilization
- Parameter sharing: T5 shares embeddings between encoder and decoder
- Computation pattern: T5's encoder processes the input once; GPT-2's decoder attends over the full cached context at every generated token
5.3 Multi-objective Pareto Frontiers
Pareto-Optimal Solutions:
| Objective Pair | Pareto Set | Interpretation |
|---|---|---|
| E₁ vs E₃ (Training) | {GPT-2, T5} | GPT-2 for speed, T5 for accuracy |
| E₂ vs E₃ (Training) | {GPT-2, T5} | GPT-2 for energy, T5 for accuracy |
| E₁ vs E₂ (Inference) | {GPT-2, T5} | No dominance; task-dependent |
| E₁ vs E₃ (Inference) | {T5} | T5 dominates when quality matters |
5.4 Lifecycle Energy Analysis
Simulated Deployment (1M inferences):
| Model | Total Lifecycle Energy |
|---|---|
| GPT-2 | 33.4 kWh |
| T5 | 15.9 kWh |
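The lifecycle totals are reproducible from the training and per-token figures above if each inference generates roughly 50 tokens (our assumption; the report does not state the per-request length):

```python
def lifecycle_kwh(train_kwh, n_requests, tokens_per_request, j_per_token):
    # Lifecycle energy = one-off training cost + amortized inference cost
    inference_joules = n_requests * tokens_per_request * j_per_token
    return train_kwh + inference_joules / 3_600_000  # J -> kWh

gpt2 = lifecycle_kwh(0.067, 1_000_000, 50, 2.40)  # ≈ 33.4 kWh
t5 = lifecycle_kwh(0.087, 1_000_000, 50, 1.14)    # ≈ 15.9 kWh

# Break-even: T5's extra 0.02 kWh (72 kJ) of training energy is repaid after
# roughly 72,000 J / 63 J saved per request ≈ 1,150 requests
break_even = (0.087 - 0.067) * 3_600_000 / ((2.40 - 1.14) * 50)
```

Under this assumption, T5's higher training cost is repaid after only about a thousand requests, which is why inference energy dominates at the 1M-inference scale.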
6. Architecture-Specific Deep Dive
6.1 Encoder-only (BERT): The Specialist
✅ Strengths
- Excellent at classification/NLU tasks (BoolQ: 63.9%)
- Bidirectional context understanding
- Fast inference on short sequences
❌ Weaknesses
- Cannot generate text (architectural limitation)
- Poor zero-shot performance (23.1% on MMLU)
- High memory usage (bidirectional attention)
🎯 Optimal Use Cases
- Pure classification (sentiment, NER, QA)
- When generation is not needed
- Real-time classification with short inputs
6.2 Decoder-only (GPT): The Generalist
✅ Strengths
- Fastest training (303s vs 536s)
- Lowest training energy (0.067 kWh)
- Best zero-shot capabilities
- Fastest inference (95 tok/s)
- Universal architecture
❌ Weaknesses
- Highest inference energy (2.40 J/tok)
- Lower task-specific accuracy
🎯 Optimal Use Cases
- Latency-critical applications (chatbots)
- Zero-shot scenarios
- General-purpose deployment
- When training budget is tight
6.3 Encoder-Decoder (T5): The Perfectionist
✅ Strengths
- Best task accuracy (BoolQ: 77.1%)
- Lowest inference energy (1.14 J/tok)
- Strongest few-shot learning
- Efficient long-input processing
❌ Weaknesses
- Slowest training (536s, +77%)
- Higher inference latency (TTFT: 23ms)
🎯 Optimal Use Cases
- Quality-critical tasks
- Energy-constrained deployment
- Document summarization
- Batch processing
7. Decision Framework for Practitioners
7.1 Constraint-based Architecture Selection
Scenario 1: Cloud Chatbot (Latency-constrained)
Optimal Choice: Decoder-only (GPT)
TTFT requirement <100ms → GPT-2's 10ms TTFT ✓
Trade-off: Pay 2× more in electricity for 2.3× faster response
Scenario 2: Mobile Summarization (Energy-constrained)
Optimal Choice: Encoder-Decoder (T5)
Battery budget 1000J/day → T5: 17 summaries/day vs GPT-2: 8/day
Trade-off: 1.5× slower but 2× battery savings
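Scenario 2's per-day counts follow from dividing the daily battery budget by per-summary energy, assuming ~51 generated tokens per summary (our assumption, chosen to be consistent with the per-token energy figures in Section 5.2):

```python
def summaries_per_day(budget_joules, tokens_per_summary, j_per_token):
    # Each summary costs tokens x J/token; count how many fit in the budget
    return int(budget_joules // (tokens_per_summary * j_per_token))

t5 = summaries_per_day(1000, 51, 1.14)    # 17 summaries/day
gpt2 = summaries_per_day(1000, 51, 2.40)  # 8 summaries/day
```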
Scenario 3: Research Benchmark (Quality-constrained)
Optimal Choice: Encoder-Decoder (T5-Large)
Target accuracy >70% → T5's 16pp advantage justifies training cost
7.2 The E³ Trilemma
⚠️ Fundamental Law
You can optimize at most TWO of THREE dimensions
| Optimize | Sacrifice | Architecture | Use Case |
|---|---|---|---|
| E₁ + E₂ | E₃ | Lightweight Decoder | Resource-constrained serving |
| E₁ + E₃ | E₂ | GPT-2/GPT-3 | Cloud chatbots |
| E₂ + E₃ | E₁ | T5 | Edge devices, sustainability |
8. Broader Implications
8.1 Rethinking "Efficiency" in AI
Current benchmarks conflate three distinct concepts:
- Computational Efficiency: FLOPs, memory bandwidth
- Time Efficiency: Latency, throughput
- Energy Efficiency: Joules per task
Call to Action: Report all three metrics separately. "Efficient" is meaningless without specifying the dimension.
8.2 The Carbon Cost of Speed
Conventional Wisdom: "Faster models are more efficient"
Our Finding: FALSE. GPT-2 is 52% faster but 110% more wasteful.
Sustainability Implication: Optimizing for latency (user experience) directly conflicts with carbon reduction goals.
Policy Suggestion:
- Datacenter deployment: Favor T5 (lower OpEx, lower carbon)
- Edge deployment: Favor GPT (battery already limited, speed matters)
- Hybrid: Route to T5 for batch, GPT for interactive
8.3 The Architecture War Through the E³ Lens
Common Narrative: "Decoder-only has won the architecture war" (GPT, LLaMA, Qwen dominate)
Our Reframing: The war isn't over; it's scenario-dependent.
Why Decoder-only Dominates Commercially:
- Versatility: One model, many tasks
- Zero-shot: Impressive demos without task-specific fine-tuning
- Scalability: Proven to 1T+ parameters
- Ecosystem: HuggingFace, OpenAI standardization
Why Alternatives Remain Relevant:
- T5 for sustainability: 52% energy savings at 1M+ scale
- BERT for classification: Still SOTA on GLUE leaderboards
- Specialized domains: Medical (BioBERT), legal (LegalBERT)
Prediction: Architectural diversification as deployment constraints vary: Datacenters → T5 (energy cost rising) | Phones → Encoder-only | Cloud API → Decoder-only
9. Limitations and Future Work
9.1 Experimental Limitations
- Parameter scale: Limited to <2B parameters (V100 GPU constraint) — Trade-offs may shift at 7B, 70B, 1T scales
- Task coverage: Focused on NLU + generation — Missing: Code generation, multimodal, reasoning
- Hardware specificity: V100 results; A100/H100 may differ due to varying tensor cores and memory bandwidth
- Optimization parity: All models use eager attention — Production systems use FlashAttention, kernel fusion
9.2 Future Directions
Short-term:
- Extend to 7B-70B models (LLaMA-2, Qwen-2.5)
- Add quantization (4-bit, 8-bit) and measure energy impact
- Multi-GPU scaling laws
Medium-term:
- Hybrid architectures (Perceiver IO, Universal Transformer)
- MoE integration (Switch Transformer energy profile)
- Carbon-aware training (time-shifting to renewable energy windows)
Long-term:
- Automated architecture search under E³ constraints
- Dynamic architectures (mode-switching based on input)
- Hardware co-design (custom ASICs for T5 energy efficiency)
10. Conclusions
10.1 Key Findings
1. No Universal Winner: Each architecture is Pareto-optimal under different constraints
- GPT-2: Best E₁ (speed) and E₂ (training energy)
- T5: Best E₃ (accuracy) and E₂ (inference energy)
- BERT: Best for pure classification (when generation unneeded)
2. Efficiency ≠ Energy: GPT-2 is 52% faster but 110% more energy-intensive. This decoupling is the report's most important empirical contribution.
3. Scale Reverses Priorities: Training energy becomes negligible at 1M+ inferences; T5 saves 52% lifecycle energy despite a 30% higher training cost
4. The E³ Trilemma: No architecture optimizes all three dimensions simultaneously; practitioners must make explicit trade-off decisions
10.2 Practical Recommendations
For Researchers:
- Report E₁, E₂, E₃ separately (not a single "efficiency" number)
- Measure lifecycle energy, not just training FLOPs
- Consider deployment context when benchmarking
For Engineers:
- Use decision tree: Latency-critical → GPT, Energy-critical → T5, Classification → BERT
- Measure actual power draw (NVML), not estimated FLOPs
- For scaling: Inference energy dominates; optimize E₂ over E₁
For Policymakers:
- Require energy reporting in AI benchmarks (like MLPerf)
- Incentivize sustainable architectures (carbon pricing)
- Fund research in energy-efficient inference
10.3 The Future of Architecture Wars
The war is not over—it's multi-dimensional.
As deployment diversifies:
- Cloud: Decoder-only for general-purpose APIs
- Edge: Encoder-only for classification, lightweight decoders for generation
- Datacenter batch: Encoder-decoder for quality + sustainability
- Research: Hybrid/MoE architectures exploring new frontiers
The optimal strategy: Maintain an architectural portfolio optimized for different points on the E³ Pareto frontier.
11. Data and Code Availability
Framework: E³ Mini-Benchmark (open-source)
Hardware: NVIDIA Tesla V100-SXM2-32GB
Data: All experimental results, CSV aggregations, and visualization code available in repository
Reproducibility: Configuration files, training scripts, and measurement code provided
📊 Experimental Summary
- Report Date: December 1, 2025
- Experimental Period: November 2024 - November 2025
- Total Experiments: 150+ training runs, 500+ inference benchmarks
- Total Energy Measured: ~2.5 kWh (training), ~0.8 kWh (inference)
Appendices
Appendix A: Complete Model List
Encoder-only (5 models):
- bert-base-uncased (110M), bert-large-uncased (340M)
- roberta-base (125M)
- distilbert-base (66M)
- deberta-v3-base (183M)
Decoder-only (7 models):
- gpt2 (124M), gpt2-medium (355M), gpt2-large (774M)
- llama-3.2-1b (1B), llama-3.2-3b (3B)
- qwen2.5-0.5b (494M), qwen2.5-1.5b (1.5B)
Encoder-Decoder (5 models):
- t5-base (220M), t5-large (770M)
- flan-t5-base (250M), flan-t5-large (780M)
- bart-base (140M)
Appendix B: Energy Calculation Methodology
Real-time GPU power monitoring using NVIDIA Management Library (NVML):
```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
samples = []  # (timestamp, watts) pairs, polled at ~10 Hz during the run
while job_running():  # hook supplied by the training/inference harness
    watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
    samples.append((time.time(), watts))
    time.sleep(0.1)
# Integrate Power(t) dt with the trapezoidal rule
energy_joules = sum(0.5 * (p0 + p1) * (t1 - t0)
                    for (t0, p0), (t1, p1) in zip(samples, samples[1:]))
energy_kwh = energy_joules / 3_600_000
```
- Sampling Rate: 100ms (10 Hz)
- Precision: ±5W (NVML specification)
- Integration: Trapezoidal rule
BibTeX
@misc{e3benchmark2025,
  title  = {The End Game of Architecture Wars? An E³ Trade-off Analysis},
  author = {Boyu Liu},
  year   = {2025},
  note   = {University of Illinois at Urbana-Champaign. Comprehensive comparison of Transformer architectures across Efficiency, Energy, and Effectiveness dimensions}
}