Goblin News
AI news by promptgoblins.ai
Filtered by: agent-evaluation
April
7
AI Agent Evaluation Costs Surge to $40K+ Per Run, Becoming a New Compute Bottleneck
Research · 1 src · Apr 29
6
OSS CLI Agent Claims to Top TerminalBench 2.0 on Gemini Flash
Open Source · 1 src · Apr 28
6
AGENTS.md Quality Determines AI Coding Agent Performance in Monorepos
Research · 1 src · Apr 23
6
VAKRA: New Benchmark Tests AI Agent Reasoning Across 8,000+ APIs and 62 Domains
Research · 1 src · Apr 16
6
InsightFinder Raises $15M Series B to Monitor AI Agents in Production
Markets · 1 src · Apr 16
7
Google Cloud Next 2026: Agent Platform, Vertex AI Consolidation, and Workspace Intelligence
Updated · Products · 7 srcs · Apr 23
7
Ai2 Benchmarks Reveal AI Science Agents Far Behind Human Scientists
Research · 1 src · Apr 14
6
Missions: Multi-Agent Architecture for Long-Horizon Autonomous Work
Research · 1 src · Apr 13
7
Researchers Expose Every Major AI Agent Benchmark as Trivially Exploitable
Research · 1 src · Apr 11
7
KellyBench: New Benchmark Reveals All Frontier LLMs Lose Money in Long-Horizon Betting Markets
Research · 2 srcs · Apr 10
9
Anthropic Claude Mythos Preview: UK AISI Independently Confirms Step-Change Cyber Capabilities with Hard Benchmarks
Updated · Top · Security · 19 srcs · Apr 14
6
Claw-Eval: End-to-End Benchmark for Real-World AI Agents
Research · 1 src · Apr 9
6
Taxonomy of RL Environments for LLM Agents: A Framework for What Models Actually Practice On
Research · 1 src · Apr 6
6
AWS Strands Evals Adds ActorSimulator for Multi-Turn Agent Testing
Products · 1 src · Apr 3
6
Vision2Web: New Benchmark Tests Multimodal Coding Agents on Visual Website Development
Research · 1 src · Apr 3
March
7
Northeastern Study: OpenClaw AI Agents Manipulated Into Self-Sabotage via Social Engineering
Safety · 1 src · Mar 25
7
METR Tabletop Simulates 200-Hour AI Agents, Finds 3–5x Uplift and New Workflow Bottlenecks
Research · 1 src · Mar 24
6
EVA: New Open-Source Framework for Jointly Evaluating Voice Agent Accuracy and Experience
Research · 1 src · Mar 24
6
AWS Strands Evals: Open-Source Framework for Production AI Agent Testing
Open Source · 1 src · Mar 18
Yesterday
6
Microsoft ASSERT: Open-Source Framework Turns Plain-Language Rules into AI Test Cases
Open Source · 1 src · 11h ago
6
LangChain RubricMiddleware: Agents That Self-Evaluate and Iterate to Completion
Products · 1 src · 11h ago
Last Week
7
ITBench-AA: Frontier AI Models Score Below 50% on Enterprise IT Agentic Tasks
Research · 1 src · 6d ago
7
DeepSWE: Contamination-Free Benchmark for Long-Horizon Coding Agents
Research · 1 src · 6d ago
6
Lyft Builds Self-Serve AI Agent Platform, Cuts Dev Cycle from Months to Weeks
Enterprise · 1 src · 6d ago
6
Research: LLM Coding Agents Degrade Sharply Under Structural Constraints
Research · 1 src · May 25
6
Multi-Agent System Evaluation: Macro-Eval Workflow Tutorial Released
Products · 1 src · May 25
2 Weeks Ago
6
AI Agents Have Saturated Open-Source Bounty Markets, Developer Experiment Finds
Research · 1 src · May 19
7
Open Agent Leaderboard: Benchmarking Full AI Systems, Not Just Models
Research · 1 src · May 18
6
Amazon Bedrock AgentCore Adds Custom Lambda-Based Evaluators for AI Agents
Products · 1 src · May 18
3 Weeks Ago
7
LangChain Launches Labs Research Initiative for Agent Continual Learning
Research · 1 src · May 14
6
LangChain Launches openevals and agentevals for LLM Evaluation
Products · 1 src · May 11
6
LangSmith Adds Self-Improving LLM-as-a-Judge via Few-Shot Human Corrections
Products · 1 src · May 11
Last Month
7
Harvey Open-Sources Legal Agent Benchmark (LAB) for Real-World Law Firm Tasks
Open Source · 1 src · May 7
7
AWS Launches AgentCore Optimization: Automated Observe-Evaluate-Improve Loop for AI Agents
Products · 2 srcs · May 4
7
Synthetic Computers at Scale: Simulating Long-Horizon Productivity for Agent Training
Research · 1 src · May 4