How Automating Agent Trajectory Analysis Transformed Our Development Workflow
In the world of AI research, analyzing the performance of coding agents is both critical and time-consuming. I recently found myself caught in a repetitive cycle of reviewing thousands of agent trajectories, each a JSON file documenting an agent's decision-making steps while solving a task. Using GitHub Copilot, I could surface patterns and reduce the workload, but the process still required manual investigation. Driven by a desire to eliminate this intellectual toil, I created eval-agents, a tool that automates the analysis and enables my entire team to collaborate more effectively.
The Impetus for Automation
My primary responsibility involves evaluating coding agent performance against standardized benchmarks like TerminalBench2 and SWEBench-Pro. This requires digging through massive collections of trajectories—detailed logs that capture the agent's thoughts and actions for each task.

Analyzing Agent Trajectories
Each task in a benchmark set produces its own trajectory file, often hundreds of lines of JSON. Multiply that by dozens of tasks per benchmark, and again by the numerous runs we conduct daily, and you end up with hundreds of thousands of lines of data to analyze. Manually reading through all of it is simply impractical.
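To make the scale concrete, here is a minimal sketch of the kind of script that can collapse that volume programmatically. The directory layout and the "action"/"observation" step fields are illustrative assumptions for this example, not the actual benchmark schema:

```python
import json
from collections import Counter
from pathlib import Path

def scan_trajectories(root: str) -> Counter:
    """Count error-like steps across trajectory files under `root`.

    Assumes each trajectory is a JSON file containing a list of steps,
    each with "action" and "observation" fields -- a hypothetical
    schema used only for illustration.
    """
    patterns = Counter()
    for path in Path(root).glob("**/*.json"):
        steps = json.loads(path.read_text())
        for step in steps:
            observation = step.get("observation", "")
            if "Traceback" in observation or "error" in observation.lower():
                # Bucket failures by the action that triggered them.
                patterns[step.get("action", "unknown")] += 1
    return patterns
```

A script like this turns hundreds of thousands of lines into a short ranked list of failure patterns, which is exactly the kind of starting point a human reviewer (or a downstream agent) can act on.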
The Repetitive Loop
My typical workflow involved using GitHub Copilot to identify patterns in the trajectories, then manually investigating those patterns to extract meaningful insights. While Copilot helped me reduce the lines I needed to read from hundreds of thousands to a few hundred, the loop itself remained repetitive. The engineer in me thought: I can automate this. That realization sparked the creation of eval-agents.
Building Eval-Agents
The core idea was to build a system that could automate the intellectual work of analyzing agent trajectories, making it accessible and shareable across the team.
Design Goals
I approached the project with three guiding principles:
- Make agents easy to share and use – so that anyone on the team could leverage the automation.
- Make it easy to author new agents – empowering peers to create custom analysis tools.
- Make coding agents the primary vehicle for contributions – enabling a collaborative, agent-driven development workflow.
Sharing and Collaboration
These goals align closely with GitHub’s core values of collaboration and open source. My experience as an open-source maintainer for the GitHub CLI taught me the importance of making tools easy to adopt and extend. With eval-agents, I ensured that the agents could be version-controlled, shared via repositories, and run by anyone with minimal setup. Team members can now author their own agents to tackle specific analysis challenges, and the entire team benefits from a growing library of automation.
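As a rough illustration of what "easy to author, easy to share" can look like, here is a sketch of a registry-based agent library. The `Agent` dataclass, the registry, and the `timeout-scanner` example are all hypothetical, invented for this post; they are not the actual eval-agents API:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical shared registry -- illustrative only, not the real
# eval-agents implementation.
AGENTS: dict[str, "Agent"] = {}

@dataclass
class Agent:
    name: str
    description: str
    analyze: Callable[[list[dict]], str]  # trajectory steps -> report

def register(agent: Agent) -> Agent:
    """Add an agent to the shared, version-controlled library."""
    AGENTS[agent.name] = agent
    return agent

def find_timeouts(steps: list[dict]) -> str:
    """Example analysis: report how many steps timed out."""
    timeouts = sum(1 for s in steps if "timed out" in s.get("observation", ""))
    return f"{timeouts} of {len(steps)} steps timed out"

register(Agent(
    name="timeout-scanner",
    description="Flags steps where a command timed out",
    analyze=find_timeouts,
))
```

Under a pattern like this, contributing a new analysis is a small pull request: write one function, register it, and everyone on the team can run it against the next batch of benchmark trajectories.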

Impact and Future
The results have been transformative. Instead of spending hours on manual pattern hunting, my colleagues and I can now run agents that automatically surface insights from benchmark runs. This has not only accelerated our research but also freed up time for more creative problem-solving.
Moreover, the agent-driven development approach has opened up new possibilities. We are no longer limited by individual capacity; the team collectively builds and maintains agents that continuously improve our analysis capabilities. As we expand the agent library, we anticipate even greater efficiency gains and deeper understanding of coding agent behavior.
This journey taught me that automation isn't just about removing drudgery—it's about enabling teams to collaborate at a higher level. By leveraging tools like GitHub Copilot and building upon them with our own agents, we have created a feedback loop where automation fuels innovation.