Agentic AI Benchmark: Shop Til You Drop

→ Project Overview

We have developed a benchmark evaluation agent for testing other AI agents’ ability to predict grocery shopping behavior and use e-commerce APIs.

The green agent evaluates how well white agents (the agents being tested) can predict what a user will purchase on their next grocery shopping trip based on purchase history. White agents use a real e-commerce API to search for products, build a basket, and complete the task.
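To make the prediction task concrete, here is a minimal sketch of a frequency-based baseline: it predicts that the user will buy again the products that appear most often in their past orders. This is only an illustration, not the benchmark's own baseline. The function name `predict_next_basket`, the list-of-lists history format, and the sample data are all assumptions made for this example.

```python
from collections import Counter


def predict_next_basket(order_history, k=3):
    """Return the k products that appear in the most past orders.

    order_history is a list of orders, each a list of product names
    (a hypothetical format chosen for this sketch).
    """
    # Count each product at most once per order.
    counts = Counter(item for order in order_history for item in set(order))
    # Sort by count (descending), then by name, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [product for product, _ in ranked[:k]]


# Made-up purchase history for a single user.
history = [
    ["banana", "milk", "eggs"],
    ["banana", "bread"],
    ["banana", "milk", "yogurt"],
]
prediction = predict_next_basket(history, k=2)
print(prediction)
```

A white agent would typically go further, for example by weighting recent orders more heavily, and would then look up each predicted product through the e-commerce API before adding it to the basket.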

What is a Green Agent?

In the AgentBeats framework:

  • Green agents are evaluation/benchmark agents that test other agents
  • White agents are the agents being tested/evaluated

Key Features

  • Real-world dataset: Built on the Instacart Kaggle dataset with 1,500+ unique users and 30,000+ transactions
  • Production e-commerce API: Hosted at https://green-agent-production.up.railway.app/ with search, cart, and checkout functionality
  • Multi-level evaluation: F1 scoring across products, aisles, and departments with blended metrics
  • AgentBeats/A2A compatible: Implements the A2A protocol for agent-to-agent communication
  • Multiple evaluation modes: Single user, baseline comparison, and multi-user benchmarks
  • Flexible deployment: Works locally or on cloud platforms (Railway, Google Cloud Run, etc.)
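The multi-level F1 scoring mentioned above could look roughly like the following sketch. It computes set-based F1 between the predicted and actual baskets at the product, aisle, and department levels, then blends the three scores with a weighted sum. The `CATALOG` mapping, the `WEIGHTS` values, and the function names are assumptions for illustration. The actual weights and blending formula used by the green agent are not stated here.

```python
def f1_score(predicted, actual):
    """Set-based F1 between two collections of labels."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical catalog: product -> (aisle, department).
CATALOG = {
    "banana": ("fresh fruits", "produce"),
    "apple": ("fresh fruits", "produce"),
    "milk": ("milk", "dairy eggs"),
    "yogurt": ("yogurt", "dairy eggs"),
}

# Assumed blending weights, not taken from the benchmark.
WEIGHTS = {"product": 0.6, "aisle": 0.2, "department": 0.2}


def blended_f1(predicted, actual):
    """Return per-level F1 scores plus their weighted blend."""
    levels = {
        "product": (predicted, actual),
        "aisle": ([CATALOG[p][0] for p in predicted],
                  [CATALOG[p][0] for p in actual]),
        "department": ([CATALOG[p][1] for p in predicted],
                       [CATALOG[p][1] for p in actual]),
    }
    scores = {name: f1_score(pred, act) for name, (pred, act) in levels.items()}
    scores["blended"] = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return scores


# The agent guessed "banana" where the user bought "apple": a miss at the
# product level, but a match at the aisle and department levels.
scores = blended_f1(["banana", "milk"], ["apple", "milk"])
print(scores)
```

Scoring at coarser levels gives partial credit to an agent that picks the wrong product from the right aisle or department, which a product-only F1 would count as a complete miss.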

→ Demo

→ Project Team

Arlen Kumar
Henry Michaelson
Tao Sun

→ GitHub

https://github.com/LupoSun/CS194_Ecom_GreenAgent
