How One AI Researcher Automated Their Job Using GitHub Copilot Agents

In software engineering, the urge to eliminate repetitive tasks often leads to automation tools. For one AI researcher on the Copilot Applied Science team, that urge sparked a project that went beyond typical automation: it automated intellectual toil. Using GitHub Copilot, they built a system called eval-agents to analyze large volumes of coding agent trajectories, cutting manual effort and helping their teammates do the same. This Q&A covers the journey, the challenges, and the lessons learned.

What challenge did the AI researcher face in analyzing coding agent performance?

The researcher's work involves evaluating coding agents on benchmarks such as TerminalBench2 or SWEBench-Pro. Each task produces a trajectory: a JSON file with hundreds of lines describing the agent's thought processes and actions. With dozens of tasks per benchmark and multiple runs per day, that adds up to hundreds of thousands of lines of trajectory data to analyze. Reading all of it by hand was impossible, so they turned to AI for help. GitHub Copilot surfaced patterns, cutting what the researcher had to read to a few hundred lines. Still, the loop was repetitive (use Copilot to surface insights, then investigate manually) and cried out for automation. That frustration, combined with an engineer's instinct to remove toil, led to the creation of a dedicated tool.
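To make the shape of this data concrete, here is a minimal sketch of condensing a single trajectory into a few fields worth reading first. The schema is an assumption made for illustration (a `steps` list whose entries carry `thought`, `action`, and `exit_code`); the article does not describe the real trajectory format, and the `condense` helper is hypothetical.

```python
import json
from collections import Counter

# Hypothetical trajectory; real benchmark output will use a different schema.
SAMPLE_TRAJECTORY = """
{
  "task_id": "example-task",
  "steps": [
    {"thought": "Inspect the repo", "action": "shell", "exit_code": 0},
    {"thought": "Run the tests", "action": "shell", "exit_code": 1},
    {"thought": "Patch the bug", "action": "edit", "exit_code": 0}
  ]
}
"""

def condense(raw: str) -> dict:
    """Reduce a trajectory to a handful of summary fields."""
    trajectory = json.loads(raw)
    steps = trajectory["steps"]
    return {
        "task_id": trajectory["task_id"],
        "step_count": len(steps),
        # How often each kind of action was taken.
        "actions": dict(Counter(step["action"] for step in steps)),
        # Indices of steps whose command exited with a non-zero code.
        "failed_steps": [i for i, step in enumerate(steps) if step["exit_code"] != 0],
    }

summary = condense(SAMPLE_TRAJECTORY)
print(summary)
```

A summary like this does not replace reading the trajectory, but it points at which steps to read first.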

Source: github.blog

How did GitHub Copilot help in streamlining the analysis process?

Initially, the researcher used GitHub Copilot as a conversational assistant to analyze trajectory data. For each new benchmark run, they would ask Copilot to identify patterns, such as common failure modes or unexpected behaviors, within the JSON files. This turned hundreds of thousands of lines of JSON into a few dozen actionable insights. However, the researcher still had to prompt Copilot manually each time, investigate the highlighted patterns, and then act on the findings. The process was effective but repetitive. The researcher realized that the loop itself could be automated: coding agents powered by Copilot could take over the pattern-spotting and preliminary analysis, freeing the researcher to focus on high-level creative decisions.
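In the article, the pattern-spotting itself was done by Copilot. As a deterministic stand-in, the sketch below shows the kind of aggregation involved: counting how many trajectories exhibit each failure signature. The signatures, the `observation` field, and the function names are all hypothetical, and plain substring matching is a much cruder technique than asking a model.

```python
from collections import Counter

# Hypothetical error signatures; a real list would come from the team's own findings.
FAILURE_SIGNATURES = {
    "timeout": "timed out",
    "missing_file": "No such file or directory",
    "test_failure": "FAILED",
}

def classify(observation: str) -> list:
    """Return the labels of every signature found in one observation."""
    return [label for label, needle in FAILURE_SIGNATURES.items() if needle in observation]

def failure_modes(trajectories: list) -> Counter:
    """Count in how many trajectories each failure signature appears at least once."""
    counts = Counter()
    for trajectory in trajectories:
        seen = set()
        for step in trajectory["steps"]:
            seen.update(classify(step.get("observation", "")))
        counts.update(seen)  # each trajectory contributes at most 1 per label
    return counts

# Three made-up runs, assuming each step records the tool output it observed.
runs = [
    {"steps": [{"observation": "cat: foo.txt: No such file or directory"}]},
    {"steps": [{"observation": "2 tests FAILED"}, {"observation": "command timed out"}]},
    {"steps": [{"observation": "FAILED test_x"}, {"observation": "FAILED test_y"}]},
]
modes = failure_modes(runs)
print(modes.most_common())
```

Counting per trajectory rather than per step keeps one noisy run from dominating the tally.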

What is eval-agents and how was it born?

Eval-agents is a project that automates the intellectual work of analyzing coding agent trajectories. It was born from the researcher's desire to break free from the repetitive cycle of using Copilot to spot patterns and then manually investigating them. The name reflects its purpose: agents that evaluate other agents. The tool allows the researcher to define custom analysis agents that can ingest trajectory data, identify anomalies, summarize trends, and even suggest next steps. By automating the automation, eval-agents eliminates the need to redo the same Copilot-driven analysis for each new benchmark run. This saves hours daily and ensures consistent, reproducible evaluation methods across the team.

What were the three key design goals for eval-agents?

The researcher set three principles to guide development:

  1. Make these agents easy to share and use — so teammates without deep technical expertise could benefit from them.
  2. Make it easy to author new agents — allowing anyone to create custom analysis tools tailored to their specific needs.
  3. Make coding agents the primary vehicle for contributions — ensuring that improvements and new features come through version-controlled code, not static reports.

These goals align with GitHub's collaborative ethos and draw from the researcher's experience as an open-source maintainer. The result is a system where agents are shared via repositories, authored with minimal boilerplate, and maintained like any other software project.
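The article does not show how eval-agents is implemented. As one way to picture "minimal boilerplate" authoring, the sketch below uses a decorator-based registry in which a new analysis is a single function. The names `AGENTS`, `analysis_agent`, and `run_all` are hypothetical, and the agents here are plain functions rather than Copilot-powered agents.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps an agent name to the function that implements it.
AGENTS: Dict[str, Callable[[dict], List[str]]] = {}

def analysis_agent(name: str):
    """Decorator that registers a function as a named analysis agent."""
    def register(func):
        AGENTS[name] = func
        return func
    return register

@analysis_agent("long-runs")
def flag_long_runs(trajectory: dict) -> List[str]:
    # Flag trajectories with more than two steps (threshold chosen for the demo).
    steps = trajectory["steps"]
    return [f"{len(steps)} steps taken"] if len(steps) > 2 else []

@analysis_agent("empty-thoughts")
def flag_empty_thoughts(trajectory: dict) -> List[str]:
    # Flag steps where the agent recorded no reasoning.
    return [
        f"step {i} has no thought"
        for i, step in enumerate(trajectory["steps"])
        if not step.get("thought")
    ]

def run_all(trajectory: dict) -> Dict[str, List[str]]:
    """Run every registered agent against one trajectory."""
    return {name: agent(trajectory) for name, agent in AGENTS.items()}

report = run_all({"steps": [{"thought": "look"}, {"thought": ""}, {"thought": "fix"}]})
print(report)
```

With a layout like this, sharing an agent amounts to committing one decorated function to a repository teammates already use.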

How does eval-agents enable team collaboration?

By making agents easy to share and modify, eval-agents turns individual productivity into a team asset. A teammate can create an agent that spots a specific type of error in trajectory logs, commit it to a shared repository, and instantly everyone on the Copilot Applied Science team can use it. Conversely, if someone finds a flaw or an improvement, they can fork and modify the agent, then submit a pull request. This creates a virtuous cycle: the more people use and contribute, the richer the library of agents becomes. The researcher also integrated the agents with existing workflows, so running an analysis is as simple as invoking a command or clicking a button. This lowers the barrier for team members who may not be comfortable writing code from scratch.

What lessons were learned about effective use of Copilot?

The researcher discovered that Copilot is most powerful when used as a collaborative partner rather than a black-box generator. For eval-agents, they learned to provide Copilot with clear context—such as example trajectories and desired output formats—which dramatically improved the quality of suggestions. They also found that iterating on prompts with Copilot in a conversation style helped refine agents faster. Another key insight was to treat Copilot as a force multiplier for code understanding: having Copilot summarize a trajectory allowed the researcher to skip reading thousands of lines and jump straight to anomalies. This lesson—using Copilot to reduce cognitive load—became the foundation for designing agents that let others do the same.
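Providing clear context can be as simple as assembling the question, an example trajectory, and the desired output format into one prompt. The sketch below shows one way to do that; `build_prompt`, the prompt wording, and the trajectory fields are assumptions for illustration, not the researcher's actual prompts.

```python
import json

def build_prompt(question: str, example_trajectory: dict, output_format: str) -> str:
    """Assemble a prompt that carries an example and the expected answer shape."""
    return "\n\n".join([
        f"Task: {question}",
        "Example trajectory:\n" + json.dumps(example_trajectory, indent=2),
        "Respond in this format:\n" + output_format,
    ])

prompt = build_prompt(
    question="List the steps where the agent repeated a failed command.",
    example_trajectory={"steps": [{"action": "shell", "exit_code": 1}]},
    output_format="- step <index>: <one-line reason>",
)
print(prompt)
```

Keeping the three parts as separate arguments makes it easy to iterate on one of them, for example tightening the output format, without rewriting the rest.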
