Human Benchmark Testing

AI benchmarks are broken. Here’s what we need instead.

One-off tests don’t measure AI’s true impact. We’re better off shifting to more human-centered, context-specific methods.

ZDNet

With AI models clobbering every benchmark, it's time for human evaluation

Artificial intelligence has traditionally advanced through automatic accuracy tests in tasks meant to approximate human knowledge. Carefully crafted benchmark tests such as The General Language ...

6don MSN

AI dangerously close to solving test that only the brightest minds on Earth could: ‘Human expertise still matters’

This system could game us. Artificial intelligence is already outperforming humans at various intelligence-based activities ...

Decrypt

Is AGI Here? Not Even Close, New AI Benchmark Suggests

ARC-AGI-3 dropped the same week Jensen Huang declared AGI achieved. Gemini scored 0.37%. GPT-5.4 got 0.26%. Humans hit 100%.

Hosted on MSN

New AI benchmark checks if chatbots protect human well-being

Artificial intelligence systems are increasingly woven into everyday decisions about health, money and work, yet most tests of these models still focus on how smart they are, not whether they keep ...

Gizmodo

OpenAI Claims Its New Model Reached Human Level on a Test for ‘General Intelligence.’ What Does That Mean?

A new artificial intelligence (AI) model has just achieved human-level results on a test designed to measure “general intelligence”. On December 20, OpenAI’s o3 system scored 85% on the ARC-AGI ...

The New York Times

Are You Smarter Than A.I.?

Some experts predict that A.I. will surpass human intelligence within the next few years. Play this puzzle to see how far the machines have to go. By Dylan Freedman and Cade Metz Produced by Juliana ...

Forbes

Beyond The Llama Drama: 4 New Benchmarks For Large Language Models

Forbes contributors publish independent expert analyses and insights. AI researcher working with the UN and others to drive social change. Apr 13, 2025, 07:56pm EDT The April 2025 drama around Llama's ...

Knoxville News Sentinel

AIMomentz Launches Open AI Image Evaluation Platform With Human Preference Benchmark and Provenance Tracking

Text-based AI models have LMArena, which reached a $1.7 billion valuation by letting humans compare GPT, Claude, and Gemini in blind A/B tests. The resulting human preference data became the industry ...

The Conversation

An AI system has reached human level on a test for ‘general intelligence’. Here’s what that means

Michael Timothy Bennett receives funding from the Australian government. Elija Perrier receives funding from the Australian government. A new artificial intelligence (AI) model has just achieved human ...

2don MSN

Human context missing: AI benchmarks are flawed, researcher explains why

Every few weeks, you will see a tweet by some AI CEO about how their latest model has topped a benchmark. Then the headlines ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results