This is a brief introduction to agent tuning, covering its motivation, challenges, common practices, and existing datasets.

Agent Tuning/Learning

Motivation

Agents: what are they and how are they used

LLM-based agents, hereafter referred to as agents, are AI systems that use an LLM to understand and reason about user requests, and can complete those requests automatically.

Compared to typical LLM usage, agents have two distinguishing characteristics. First, agents can access external tools such as a calculator, a database, or Python functions. This external tool calling is usually called tool use, and it improves the reliability of agents because those tools can accomplish well-defined tasks, like calculation, more reliably. Second, agents usually involve multi-turn interaction: after receiving a user request, an agent may decompose it into several actionable steps (referred to as planning) and call the underlying LLM several times to finish these steps, as sketched below. This enables agents to automatically complete sophisticated user requests that previously required several rounds of interaction between the user and the LLM.
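To make this loop concrete, below is a minimal sketch of such an agent: it repeatedly calls the underlying LLM, lets it either invoke an external tool or return a final answer, and feeds tool results back into the conversation. The `call_llm` function and the tool registry are hypothetical placeholders, not any specific API.

```python
# Minimal sketch of an agent loop with tool use and multi-turn interaction.
# `call_llm` and TOOLS are hypothetical placeholders, not a real library API.
import json

TOOLS = {
    # a well-defined sub-task delegated to an external tool
    "calculator": lambda expression: str(eval(expression)),
}

def call_llm(messages):
    """Placeholder for the underlying LLM call (e.g., a chat-completion endpoint)."""
    raise NotImplementedError

def run_agent(user_request, max_turns=10):
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_turns):                       # multi-turn interaction
        step = call_llm(messages)                    # the agent plans/decides the next step
        messages.append({"role": "assistant", "content": json.dumps(step)})
        if "tool" in step:                           # the LLM asks to call an external tool
            result = TOOLS[step["tool"]](step["arguments"])
            messages.append({"role": "tool", "content": result})
        else:                                        # no further tool call: request is complete
            return step["answer"]
    return None                                      # gave up after max_turns
```

In practice such a loop also handles planning prompts, error recovery, and stopping criteria, but the structure above captures the tool-use and multi-turn characteristics described here.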

Agents are usually adopted for tasks where the user requests are sophisticated enough that a single LLM call is not sufficient. Typical applications include repository-level program repair (citation), computer control (Anthropic), and Customer Relationship Management (CRM) (https://investor.salesforce.com/press-releases/press-release-details/2024/Salesforces-Agentforce-Is-Here-Trusted-Autonomous-AI-Agents-to-Scale-Your-Workforce/default.aspx).

Challenges with agents and agent finetuning

Despite being a critical leap from 0 to 1, agents still need substantial improvement. Two of the key challenges are performance and cost. First, empirical results on benchmarks like AppWorld (https://arxiv.org/pdf/2407.18901) show that the task-completion rates of current agents vary but are usually unsatisfactory (e.g., around 50%), whereas much higher success rates (e.g., 95%) are required in practice. Second, since current agents usually rely on SOTA commercial LLMs like GPT-4o or Claude-3.5-Sonnet, a single agent run can incur high cost, as the agent may conduct multi-turn interactions, each generating long responses. These performance and cost challenges prevent the industry from quickly scaling agent applications built on off-the-shelf LLMs.

Agent tuning is a promising method to alleviate the performance and cost challenges. At a high level, agent tuning is a training method that enables agents to learn from prior expert demonstrations or via trial-and-error interactions in dynamic environments, and it can therefore improve agent performance on specific tasks. This is similar to LLM fine-tuning, which can substantially improve LLM performance on specific tasks (citation). Via agent tuning, an agent can better understand, retrieve, and use tools, and plan more appropriate steps according to user requests; all of these improvements raise agent performance on the underlying tasks. Furthermore, agent tuning can reduce cost by enabling smaller models: several works (citations) report that appropriate agent tuning allows smaller open-source models (e.g., Llama-7B) to obtain performance on par with large commercial LLMs.

Current agent tuning progress

Problem formulation

In essence, (LLM-based) agents are very similar to reinforcement learning (RL) agents: the user request defines the initial state, and the agent is expected to make sequential decisions to complete the task (the request). As a result, we can formulate agent tuning in RL terms: let $\mathcal{A}$ be the action space, $\mathcal{S}$ be the state space, $f: \mathcal{S} \rightarrow \mathcal{A}$ be the underlying LLM (the policy), $r(s)$ be a reward model, $m: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ be a hidden transition model, $\alpha$ be a discount factor, and $\mathcal{D}$ be the initial state distribution, with the initial state $s_0 \sim \mathcal{D}$ determined by the user input. The agent tuning problem can then be formulated as \(f = \arg\max_f \mathbb{E}_{s_0 \sim \mathcal{D}} \sum_{t=1}^T \alpha^t r(s_t),\) where $s_t = m(s_{t-1}, f(s_{t-1}))$.
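To connect the formula to practice, here is a small sketch of how this objective can be estimated by rolling out the policy in an environment. `env.reset`, `env.step`, and `policy` are assumed interfaces standing in for the user-request sampler $\mathcal{D}$, the hidden transition and reward models $m$ and $r$, and the LLM $f$, respectively.

```python
# Monte Carlo estimate of the objective: roll out the policy f from sampled
# initial states and accumulate discounted rewards. The env/policy interfaces
# are assumptions for illustration, not a specific library.

def discounted_return(env, policy, horizon=20, alpha=0.99, num_rollouts=16):
    total = 0.0
    for _ in range(num_rollouts):
        s = env.reset()                      # s_0 ~ D, determined by a sampled user request
        ret = 0.0
        for t in range(1, horizon + 1):
            a = policy(s)                    # a_t = f(s_{t-1}), the LLM's next action
            s, reward = env.step(a)          # s_t = m(s_{t-1}, a_t), reward = r(s_t)
            ret += (alpha ** t) * reward     # discounted reward, as in the formula
        total += ret
    return total / num_rollouts              # estimate of E_{s_0~D} sum_t alpha^t r(s_t)
```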

Challenges

Based on this formulation, we can identify several challenges with agent tuning. First, agents are involved in a sequential decision-making process, which resembles RL training and differs significantly from the typical LLM instruction finetuning setting (which only utilizes [prompt, response] pairs). Because of this sequential process, following the LLM instruction finetuning mechanism can run into the same challenge as behavior cloning (a classic RL algorithm): a slight deviation from the optimal policy can drive the agent far away from completing the task, since errors accumulate as the agent carries out its sequential actions. Second, while RL algorithms like PPO and policy gradient can alleviate the error-accumulation problem, they usually require a huge number of trial-and-error interactions, which are only feasible in synthetic dynamic environments; since agents are usually deployed in very practical scenarios, designing synthetic environments that reflect real agent applications is especially challenging. Third, compared to collecting pretraining and finetuning data, collecting expert demonstrations for agent tuning is harder, as these demonstrations involve sophisticated multi-step trajectories and require coherent logic between steps. Furthermore, compared to typical RL policies, LLMs are significantly larger, making them more costly to train.

The above challenges make it hard to directly apply mature RL algorithms to finetune agents, even though LLM-based agents and RL agents are highly similar.

Efforts in agent tuning

What are common algorithms to finetune agents?

Given the above challenges, efforts have been made to finetune agents. The most popular mechanism is behavior cloning (citations). Briefly, given a collection of user requests $\{i_k\}_{k=1}^K$, which are specified in natural language, these methods first collect expert demonstrations $\{d_k\}_{k=1}^K$, also in natural language. These demonstrations can be collected from humans or from agent trajectories that successfully complete the user requests. The agent is then finetuned on the (request, demonstration) pairs $\{(i_k, d_k)\}_{k=1}^K$ following the instruction finetuning mechanism. Specifically, let $f$ be the agent and $f(d_k^j \mid i_k, d_k^{0:j-1})$ be $f$'s generative probability of the $j$-th token in demonstration $d_k$ given the input prompt $i_k$ and the previous tokens $d_k^{0:j-1}$: \(f = \arg\max_f \prod_{k=1}^K \prod_{j} f(d_k^j \mid i_k, d_k^{0:j-1}).\) In other words, the agent is optimized to maximize the probability of generating the expert demonstrations given the user requests.
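As an illustration, the following is a minimal behavior-cloning training step on a single (request, demonstration) pair, assuming a Hugging Face causal LM; the model name and hyperparameters are placeholders. Masking the prompt tokens with -100 makes the loss the token-level negative log-likelihood of the demonstration, matching the objective above.

```python
# Minimal behavior-cloning (instruction finetuning) step on one (request, demonstration) pair.
# Model name and learning rate are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # any open-source base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def bc_step(request: str, demonstration: str):
    prompt_ids = tokenizer(request, return_tensors="pt").input_ids
    demo_ids = tokenizer(demonstration, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, demo_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100                  # ignore prompt tokens in the loss
    loss = model(input_ids=input_ids, labels=labels).loss    # NLL of demonstration tokens only
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```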

Inevitably, the above algorithm, which is essentially behavior cloning, suffers from the error-accumulation problem. Nevertheless, several works have shown promising results with it. For instance, in (agenttuning), researchers used this mechanism to improve Llama-based agents, so that even a Llama-7B model outperforms GPT-3.5 on held-in tasks (tasks observed during agent tuning), and Llama-70B obtains performance on par with GPT-3.5 on held-out tasks (tasks never observed during agent tuning).

Having noticed the limitations of the above algorithm, recent works have started to introduce exploration into agent tuning. For instance, (trial-and-error: https://arxiv.org/pdf/2403.02502) utilizes not only successful demonstrations $d_k$ but also failed trajectories $e_k$ collected with the agent built on the original LLM (referred to as the base agent, denoted $f_{\text{base}}$). Given a collection of tuples $(i_k, d_k, e_k)$, this work uses DPO to optimize the base agent: \(f = \arg\max_f \mathbb{E}_{(i_k, d_k, e_k)} \left[\log \sigma\left(\beta \log \frac{f(d_k \mid i_k)}{f(e_k\mid i_k)} - \beta \log \frac{f_{\text{base}}(d_k \mid i_k)}{f_{\text{base}}(e_k\mid i_k)}\right)\right].\) In their results, agent tuning with negative trajectories (exploration) significantly outperforms behavior cloning, and more works have started to incorporate negative trajectories or some form of exploration into their agent tuning process.
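For reference, a minimal sketch of this DPO loss is shown below; the per-trajectory log-probabilities (summed over tokens) under the tuned policy $f$ and the frozen base agent $f_{\text{base}}$ are assumed to be computed elsewhere.

```python
# Sketch of the DPO objective: d_k is the successful (chosen) trajectory, e_k the failed
# (rejected) one. The log-probability inputs are assumed to be summed token log-probs
# under the tuned policy f and the frozen base agent f_base.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, logp_chosen_base, logp_rejected_base, beta=0.1):
    # beta*log[f(d|i)/f(e|i)] - beta*log[f_base(d|i)/f_base(e|i)], as in the formula above
    margin = beta * (logp_chosen - logp_rejected) - beta * (logp_chosen_base - logp_rejected_base)
    return -F.logsigmoid(margin).mean()   # minimizing this maximizes the DPO objective
```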

What datasets have been devised?

To facilitate agent tuning, several datasets and a few dynamic environments have been developed. We analyze existing agent tuning datasets from four perspectives: dynamic vs. static (whether the instruction set is fixed), tasks (specific domain vs. diverse), demonstration source (LLMs or humans), and size. The results are listed below:

| Name | Property | Tasks | Demonstration | Size | Remark |
| --- | --- | --- | --- | --- | --- |
| Swe-gym | Static | repo-level code repair | openhands + gpt4o/sonnet-3.5 | 2.3K | |
| agent-tuning | Static | diverse | gpt-4o + gpt-3.5 | 1.8k | |
| Mind2Web | Static | web interaction | human | 2.3k | |
| fireact | Static | diverse | gpt4 | 500 | different prompting: ReAct, CoT, Reflexion |
| agentbank | Static | diverse | gpt4o | 50k | |
| agentgym | Dynamic | diverse | gpt-4o | 20.5k instructions + 6k demonstrations | a combination of previously collected environments + gpt4o-generated instructions |
| agentGen | Dynamic | diverse | PDDL | 7.2k | utilizes domain texts to generate diverse tasks |
| apigen | Dynamic | diverse | deep-seek, mistral | 10k | similar to AgentGen, samples from a collection of questions and APIs to generate diverse scenarios/tasks |
| agentinstruct | Dynamic | diverse | gpt4o | 25.8M | employs an agent to generate tasks |