Fixing SFT For Multi-Turn Tool Call Data: A Comprehensive Guide

by Alex Johnson

Introduction

In the realm of Natural Language Processing (NLP), Supervised Fine-Tuning (SFT) plays a crucial role in adapting pre-trained language models to specific tasks. When the training data consists of multi-turn conversations involving tool calls, the process becomes more intricate. This article examines the challenges that arise when applying SFT to multi-turn tool call data, focusing on issues reported in THUDM's slime framework, and proposes effective solutions. Understanding these challenges and their resolutions is essential for developers and researchers aiming to build robust and efficient conversational AI systems. We will explore the problems related to data processing, template application, and loss masking, and show how a non-invasive approach can rectify these issues without disrupting the existing framework.

Understanding the Problem: SFT and Multi-Turn Tool Calls

To effectively address the challenges, let's first define the key concepts. SFT involves fine-tuning a pre-trained language model on a labeled dataset to optimize its performance on a specific task. In the context of multi-turn conversations, the model needs to maintain context across multiple interactions. Tool calls further complicate this by requiring the model to invoke external tools or functions to generate appropriate responses. The integration of these elements presents several technical hurdles that need to be carefully navigated.

The core issue arises when processing multi-turn tool call data for SFT. The common practice of converting message histories (a list of dictionaries) into a single string, especially when using --apply-chat-template, breaks functions like gen_multi_turn_loss_mask_qwen in mask_utils.py. These functions are designed to work with a structured list[dict] format, so the string conversion leads to errors. The problem is compounded by the fact that --tool-key requires --apply-chat-template, creating a cascade of failures whenever tools are involved. Additionally, the way apply_chat_template is invoked in mask_utils.py, on each message individually and without the tools parameter, leads to incorrect formatting of tool-related messages. This section elaborates on these specific problems, their causes, and their downstream effects, laying the groundwork for the proposed solutions.

Deep Dive into the Technical Challenges

1. Data Conversion and Its Impact

One of the primary challenges lies in how the message history is processed. When the --apply-chat-template flag is used, the message history, which is initially a list of dictionaries (list[dict]), is converted into a string within the data.py script. This conversion is problematic because functions like gen_multi_turn_loss_mask_qwen and others in mask_utils.py are designed to operate on the original list[dict] structure. These functions rely on the structured format to correctly identify and process different parts of the conversation, such as user inputs, assistant responses, and tool calls. By converting the message history into a string, the structured information is lost, leading to errors and incorrect processing.
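To make the conversion concrete, the sketch below contrasts the two representations. The field names follow the common OpenAI-style message schema, and the flattened output mimics the ChatML markers used by Qwen-family templates; the exact layout in slime's data.py may differ.

```python
# A toy multi-turn history in the structured list[dict] form that
# mask_utils.py-style functions expect to receive.
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": "Let me check that for you."},
]

def render_chatml(msgs):
    """Flatten the history into one ChatML-style string, as an
    --apply-chat-template pass effectively does."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in msgs
    )

flat = render_chatml(messages)

# After flattening, per-message boundaries survive only as in-band marker
# text; downstream code that expects a list[dict] can no longer index
# messages[i] or read m["role"], which is exactly what breaks the
# loss-masking functions.
```

Once `flat` replaces `messages`, any function that iterates over dictionaries receives a single string instead, and per-turn processing fails.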

This issue is critical because the loss masking functions are essential for training the model effectively. Loss masking allows the model to focus on the relevant parts of the input and output sequences, preventing it from being distracted by irrelevant information. When the message history is converted to a string, the masking functions cannot correctly identify the boundaries between different turns in the conversation, which can lead to the model learning incorrect patterns.
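The principle behind the masking functions can be sketched in a few lines. This is a simplified stand-in, not the real gen_multi_turn_loss_mask_qwen: it uses a toy whitespace "tokenizer" rather than template token ids, but the core idea is the same — supervise only assistant tokens (mask = 1) and ignore user and tool tokens (mask = 0), which requires knowing each message's role.

```python
def toy_tokenize(text):
    # Stand-in for a real tokenizer: split on whitespace.
    return text.split()

def gen_loss_mask(messages):
    """Walk the structured history and mark only assistant tokens
    as contributing to the loss."""
    tokens, mask = [], []
    for m in messages:
        toks = toy_tokenize(m["content"])
        tokens.extend(toks)
        mask.extend([1 if m["role"] == "assistant" else 0] * len(toks))
    return tokens, mask

messages = [
    {"role": "user", "content": "hi there"},
    {"role": "assistant", "content": "hello , how can I help ?"},
]
tokens, mask = gen_loss_mask(messages)
```

Note that the loop reads `m["role"]` on every turn: once the history has been flattened into a single string, that lookup is impossible, which is why the string conversion breaks masking.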

2. The Interplay Between --tool-key and --apply-chat-template

The --tool-key flag, which specifies the key in the data that contains tool definitions, further complicates the problem. It depends on --apply-chat-template: whenever --tool-key is used, --apply-chat-template must also be enabled. This dependency triggers the string conversion issue described above, creating a direct link between tool usage and data processing errors. The system is designed to handle tools as part of the structured message history, and the string conversion disrupts this process. When the model cannot correctly identify and invoke tools, its capabilities are significantly diminished.
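For orientation, here is what a single tool-call training sample might look like with a --tool-key of "tools". The field names follow the OpenAI function-calling schema; slime's actual dataset layout may differ, so treat this as an illustrative assumption.

```python
# One training sample: a structured message history plus the tool
# definitions it references, stored under a hypothetical "tools" key.
sample = {
    "messages": [
        {"role": "user", "content": "Weather in Paris?"},
        {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_0", "type": "function", "function": {
                "name": "get_weather",
                "arguments": "{\"city\": \"Paris\"}"}}]},
        {"role": "tool", "tool_call_id": "call_0", "content": "18C, sunny"},
        {"role": "assistant", "content": "It's 18C and sunny in Paris."},
    ],
    "tools": [
        {"type": "function", "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {"type": "object", "properties": {
                "city": {"type": "string"}}, "required": ["city"]}}},
    ],
}
```

Both halves matter during SFT: the `tool_calls` and `tool` turns must be tokenized correctly inside the history, and the `tools` definitions must reach the chat template so the model learns which functions exist.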

3. Incorrect Formatting of Tool-Calling Messages

Another significant issue is the way apply_chat_template is used within mask_utils.py. The function is called on each message individually, without passing the tools parameter. This is problematic because tool-related messages, such as assistant messages with tool_calls and responses with the tool role, require the tools parameter to be formatted correctly. Without the tools parameter, the function cannot properly tokenize and format these messages, leading to incorrect tokenization and potentially misleading information for the model during training. This incorrect formatting can cause the model to misinterpret the tool calls and generate inappropriate responses.
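The effect of omitting the tools parameter can be illustrated with a toy renderer. This is a sketch of the behavior, not Qwen's actual Jinja template: when tool definitions are supplied, chat templates typically inject them into the prompt (here, as a `<tools>` block), so a per-message call made without them silently drops that content.

```python
import json

def render_message(message, tools=None):
    """Toy single-message renderer: prepends tool definitions when
    they are provided, then emits a ChatML-style turn."""
    parts = []
    if tools is not None:
        parts.append("<tools>" + json.dumps(tools) + "</tools>\n")
    body = message.get("content") or ""
    if message.get("tool_calls"):
        body += "".join(
            "<tool_call>" + json.dumps(tc["function"]) + "</tool_call>"
            for tc in message["tool_calls"]
        )
    parts.append(f"<|im_start|>{message['role']}\n{body}<|im_end|>\n")
    return "".join(parts)

msg = {"role": "assistant", "tool_calls": [
    {"function": {"name": "get_weather",
                  "arguments": "{\"city\": \"Paris\"}"}}]}
tools = [{"name": "get_weather"}]

with_tools = render_message(msg, tools=tools)
without_tools = render_message(msg)
```

The two renderings differ: the call made without `tools` produces a shorter token stream with no tool definitions, so masks computed against it no longer align with the full-conversation rendering the model is actually trained on.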

4. Downstream Effects of the Technical Challenges

The issues described above have several downstream effects that can significantly impact the performance of the trained model. Without the proper handling of tool definitions, the model never sees these definitions during training. This means the model is not aware of the available tools and their functionalities, making it impossible for the model to effectively use them during conversations. This lack of awareness greatly limits the model's ability to handle complex tasks and generate informed responses.

When the --apply-chat-template and --tool-key flags are used together, the string conversion issue leads to fatal errors, preventing the training process from completing successfully. This is a critical problem because it blocks the model from learning from the data entirely. Even if the process doesn't crash, applying apply_chat_template without the tools parameter results in incorrect tokenization for tool-calling messages. This means the model receives flawed information, which can lead to suboptimal performance and an inability to correctly interpret and respond to tool calls. Ultimately, these issues undermine the model's capacity to engage in meaningful and effective conversations.

Proposing a Non-Invasive Solution

To address these challenges effectively, a non-invasive solution is proposed. This approach aims to fix the issues without making extensive changes to the existing codebase, ensuring minimal disruption and maintaining the stability of the system. The solution comprises two key changes:

1. Storing Tools in Metadata Within data.py

The first part of the solution focuses on how tool information is handled within the data.py script. When the --tool-key flag is provided, the proposed change involves extracting the tools from the data and storing them in the `sample.metadata[