Question 1

What is GPQA Diamond?

Accepted Answer

GPQA Diamond is a subset of GPQA (Graduate-Level Google-Proof Q&A), a benchmark of expert-written multiple-choice questions in biology, chemistry, and physics. The "Diamond" subset contains the 198 highest-quality questions: those that domain experts answered correctly and that most skilled non-experts got wrong, even with web access. It is used to gauge whether an AI system can match or exceed PhD-level accuracy on questions that require deep scientific reasoning rather than simple web search.

Question 2

What is a good GPQA Diamond score?

Accepted Answer

Human PhD experts score around 69.7% on GPQA Diamond. General-purpose AI models such as GPT-5.1 (88.1%) and Gemini 3 Pro (91.9%) score well above that baseline, and the specialized Omic AI Scientist is reported at 93.3%. A score above roughly 70% therefore matches or exceeds the human expert baseline on this benchmark, and a score above 90% exceeds it by more than 20 percentage points.
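To put the gaps between these scores in perspective, the following is a minimal sketch of the sampling noise on a reported accuracy. It assumes the Diamond subset's 198 questions and treats each question as an independent right-or-wrong trial, which is a simplification. The function name `accuracy_standard_error` is illustrative and not part of any benchmark tooling.

```python
import math

# Assumption: GPQA Diamond has 198 questions, each scored right or wrong.
N_QUESTIONS = 198


def accuracy_standard_error(accuracy: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy measured on n questions."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Scores quoted in the answer above, expressed as fractions.
scores = {
    "Human PhD experts": 0.697,
    "GPT-5.1": 0.881,
    "Omic AI Scientist": 0.933,
}

for name, acc in scores.items():
    se = accuracy_standard_error(acc)
    print(f"{name}: {acc:.1%} with a standard error of about {se:.1%}")
```

Under these assumptions, a score near 90% carries a standard error of roughly two percentage points for a single run, so small gaps between systems are best read alongside run-to-run variation.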

Question 3

Why is GPQA Diamond important for drug discovery?

Accepted Answer

GPQA Diamond measures scientific reasoning ability that is crucial for drug discovery tasks such as understanding disease mechanisms, predicting molecular interactions, analyzing experimental data, and designing therapeutic strategies. A high GPQA Diamond score suggests that an AI system can reason at or beyond expert level on the kinds of biological and chemical problems central to pharmaceutical R&D.

Question 4

Who has the highest GPQA Diamond score?

Accepted Answer

As of November 2025, Omic AI Scientist leads the leaderboard with 93.3% accuracy, followed by Gemini 3 Pro (91.9%) and GPT-5.1 (88.1%). Omic AI Scientist is specialized in biology and chemistry, which may give it an edge on the scientific reasoning required for drug discovery applications.

Question 5

How is GPQA Diamond different from other AI benchmarks?

Accepted Answer

Unlike general-knowledge benchmarks, GPQA Diamond specifically tests graduate-level scientific reasoning. Questions are written and validated by domain experts and are designed so that they are hard to answer through memorization or web search alone. This makes the benchmark particularly relevant for evaluating AI systems intended for scientific research and drug discovery.

Rank	Model	Accuracy	Type
1	Omic AI Scientist	93.3%	Specialized (Bio/Chem)
2	Gemini 3 Pro	91.9%	General-purpose LLM
3	GPT-5.1	88.1%	General-purpose LLM
4	Claude Sonnet 4.5	83.4%	General-purpose LLM
—	Human PhD Expert	69.7%	Human baseline
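As a small worked example, the leaderboard rows above can be used to compute each system's margin over the human baseline. This is a sketch only: the rows are transcribed from the table, and the helper name `margins_over_baseline` is hypothetical.

```python
# Rows transcribed from the table above: (name, accuracy in percent, type).
LEADERBOARD = [
    ("Omic AI Scientist", 93.3, "Specialized (Bio/Chem)"),
    ("Gemini 3 Pro", 91.9, "General-purpose LLM"),
    ("GPT-5.1", 88.1, "General-purpose LLM"),
    ("Claude Sonnet 4.5", 83.4, "General-purpose LLM"),
    ("Human PhD Expert", 69.7, "Human baseline"),
]

HUMAN_BASELINE_LABEL = "Human PhD Expert"


def margins_over_baseline(rows):
    """Return each system's accuracy minus the human baseline, in percentage points."""
    baseline = next(acc for name, acc, _ in rows if name == HUMAN_BASELINE_LABEL)
    return {
        name: round(acc - baseline, 1)
        for name, acc, _ in rows
        if name != HUMAN_BASELINE_LABEL
    }


margins = margins_over_baseline(LEADERBOARD)
for name, margin in margins.items():
    print(f"{name}: +{margin:.1f} percentage points over the human baseline")
```

The margins are differences in percentage points, not relative improvements, so they can be compared directly across rows.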

GPQA Diamond Benchmark

GPQA Diamond Leaderboard (November 2025)

Why GPQA Diamond Matters for Drug Discovery

Understand Disease Biology

Predict Molecular Interactions

Analyze Experimental Data

Design Therapeutic Strategies

About the Benchmark

Question Characteristics

Evaluation Methodology

Why Omic Leads GPQA Diamond

Frequently Asked Questions

Related Terms