GPQA Diamond Benchmark

The gold standard for measuring AI scientific reasoning capability

Definition

GPQA Diamond is the highest-quality subset of GPQA (Graduate-Level Google-Proof Q&A), a benchmark of expert-written multiple-choice questions in biology, chemistry, and physics. The "Diamond" subset keeps the 198 questions that expert validators answered correctly and that most skilled non-experts answered incorrectly, even with web access, so answering them requires deep scientific reasoning rather than lookup.

GPQA Diamond Leaderboard (November 2025)

Rank | Model             | Accuracy | Type
1    | Omic AI Scientist | 93.3%    | Specialized (Bio/Chem)
2    | Gemini 3 Pro      | 91.9%    | General-purpose LLM
3    | GPT-5.1           | 88.1%    | General-purpose LLM
4    | Claude 4.5 Sonnet | 83.4%    | General-purpose LLM
-    | Human PhD Expert  | 69.7%    | Human baseline
Source: Artificial Analysis. Scores updated November 2025.

Why GPQA Diamond Matters for Drug Discovery

Drug discovery requires expert-level reasoning in biology, chemistry, and pharmacology. GPQA Diamond measures closely related reasoning skills in biology and chemistry, so a high score is a useful signal that an AI system can:

Understand Disease Biology

Reason about molecular mechanisms, pathway dysregulation, and disease etiology at PhD level.

Predict Molecular Interactions

Understand protein-ligand binding, enzyme kinetics, and chemical reactivity.

Analyze Experimental Data

Interpret complex biological datasets and draw valid scientific conclusions.

Design Therapeutic Strategies

Propose rational drug design approaches based on mechanistic understanding.

About the Benchmark

Question Characteristics

  • Graduate-level difficulty (PhD qualifying exam level)
  • Designed to be "Google-proof": hard to answer through web search alone
  • Require multi-step reasoning
  • Cover biology, chemistry, and physics
  • Written and vetted by domain experts

Evaluation Methodology

  • Multiple-choice format for objective scoring
  • "Diamond" subset is the most rigorously validated tier
  • Human expert baseline established
  • Fixed set of 198 questions, so scores are comparable across models
  • Designed so that recall or lookup alone is not enough to score well
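
Because the format is multiple choice, scoring reduces to counting exact matches against an answer key. The sketch below illustrates that idea only; the `Question` schema, the `accuracy` helper, and the toy items are hypothetical and are not the official GPQA data format or evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """One multiple-choice item (hypothetical schema, not the official one)."""
    prompt: str
    choices: list[str]   # answer options shown to the model
    correct_index: int   # position of the correct option in `choices`


def accuracy(questions: list[Question], predictions: list[int]) -> float:
    """Return the fraction of questions whose predicted index matches the key."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not questions:
        raise ValueError("cannot score an empty question set")
    correct = sum(
        1 for q, p in zip(questions, predictions) if p == q.correct_index
    )
    return correct / len(questions)


# Toy placeholder items; these are NOT real GPQA questions.
toy_set = [
    Question("Placeholder question 1", ["A", "B", "C", "D"], 2),
    Question("Placeholder question 2", ["A", "B", "C", "D"], 0),
    Question("Placeholder question 3", ["A", "B", "C", "D"], 1),
    Question("Placeholder question 4", ["A", "B", "C", "D"], 3),
]
toy_predictions = [2, 0, 3, 3]  # the third prediction is wrong

print(f"Accuracy: {accuracy(toy_set, toy_predictions):.1%}")  # Accuracy: 75.0%
```

A real evaluation would also need to fix how the model's free-text output is mapped to an option index, which this sketch leaves out.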

Why Omic Leads GPQA Diamond

Omic's AI Scientist achieves 93.3% on GPQA Diamond, the highest score among the systems in the leaderboard above. This performance comes from:

Specialized Training

Focused on biology, chemistry, and drug discovery rather than general knowledge.

Multi-Omics Integration

Deep understanding of genomics, proteomics, and metabolomics relationships.

Systems Biology

Trained on disease mechanisms, not just isolated facts.

Frequently Asked Questions

What is a good GPQA Diamond score?

Human PhD experts score around 69.7%, so a score above 70% exceeds the human expert baseline, and a score above 90% is well beyond it. For reference, random guessing on four-option questions yields about 25%. Top AI systems now exceed 90%.
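
As a rough illustration of these thresholds, the sketch below places a score relative to chance, the human expert baseline, and the 90% mark. The function name and band labels are made up for this example, and the 25% chance level assumes four answer options per question.

```python
HUMAN_EXPERT_BASELINE = 69.7  # percent, human baseline quoted on this page
CHANCE_LEVEL = 25.0           # percent, assumes four answer options
HIGH_SCORE_CUTOFF = 90.0      # percent, upper threshold used in this FAQ


def describe_score(accuracy_pct: float) -> str:
    """Map an accuracy percentage to a descriptive band (labels are illustrative)."""
    if not 0.0 <= accuracy_pct <= 100.0:
        raise ValueError("accuracy must be between 0 and 100 percent")
    if accuracy_pct <= CHANCE_LEVEL:
        return "at or below chance"
    if accuracy_pct <= HUMAN_EXPERT_BASELINE:
        return "above chance, at or below the human expert baseline"
    if accuracy_pct <= HIGH_SCORE_CUTOFF:
        return "above the human expert baseline"
    return "above 90%, well beyond the human expert baseline"


# Scores taken from the leaderboard on this page.
for name, score in [("Omic AI Scientist", 93.3), ("GPT-5.1", 88.1)]:
    print(f"{name}: {describe_score(score)}")
```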

Who has the highest GPQA Diamond score?

As of November 2025, Omic AI Scientist leads with 93.3% accuracy, followed by Gemini 3 Pro (91.9%) and GPT-5.1 (88.1%).

How is GPQA different from other benchmarks?

Unlike general knowledge benchmarks, GPQA specifically tests graduate-level scientific reasoning. Its questions are designed to resist memorization and web search, which makes it relevant for evaluating AI intended for scientific research.