Goblin
News
AI news by
promptgoblins.ai
Filtered by: agent-evaluation
Tuesday
7
Ai2 Benchmarks Reveal AI Science Agents Far Behind Human Scientists
Research
· 1 src · 1d ago
Monday
6
Missions: Multi-Agent Architecture for Long-Horizon Autonomous Work
Research
· 1 src · 2d ago
Last Week
7
Researchers Expose Every Major AI Agent Benchmark as Trivially Exploitable
Research
· 1 src · 4d ago
7
KellyBench: New Benchmark Reveals All Frontier LLMs Lose Money in Long-Horizon Betting Markets
Research
· 2 srcs · 5d ago
9
Anthropic Claude Mythos Preview: UK AISI Independently Confirms Step-Change Cyber Capabilities with Hard Benchmarks
Updated · Top
Security
· 17 srcs · 1d ago
6
Claw-Eval: End-to-End Benchmark for Real-World AI Agents
Research
· 1 src · 6d ago
6
Taxonomy of RL Environments for LLM Agents: A Framework for What Models Actually Practice On
Research
· 1 src · Apr 6
2 Weeks Ago
6
AWS Launches Amazon Bedrock AgentCore Evaluations for AI Agent Reliability Testing
Products
· 1 src · Apr 3
6
Vision2Web: New Benchmark Tests Multimodal Coding Agents on Visual Website Development
Research
· 1 src · Apr 3
6
AWS Strands Evals Adds ActorSimulator for Multi-Turn Agent Testing
Products
· 1 src · Apr 3
3 Weeks Ago
7
Northeastern Study: OpenClaw AI Agents Manipulated Into Self-Sabotage via Social Engineering
Safety
· 1 src · Mar 25
7
METR Tabletop Simulates 200-Hour AI Agents, Finds 3–5x Uplift and New Workflow Bottlenecks
Research
· 1 src · Mar 24
6
EVA: New Open-Source Framework for Jointly Evaluating Voice Agent Accuracy and Experience
Research
· 1 src · Mar 24
Last Month
6
AWS Strands Evals: Open-Source Framework for Production AI Agent Testing
Open Source
· 1 src · Mar 18