SIQA Dataset: Investigating Wrong Scores In OpenCompass
Introduction
This article delves into a reported issue of incorrect scoring on the SIQA (Social Interaction Question Answering) dataset when using OpenCompass, a comprehensive evaluation platform for large language models (LLMs). The issue was observed during the evaluation of several quantized Deepseek-R1-distill-Llama-70B models, where the SIQA dataset consistently showed scores around 35 points, while other datasets yielded scores of zero. This discrepancy raises concerns about the accuracy of SIQA dataset evaluation within OpenCompass, particularly for low-precision quantized models. Understanding the intricacies of this issue is crucial for ensuring reliable and valid evaluation results, especially as the use of LLMs expands across various applications.
OpenCompass is a powerful tool for researchers and practitioners, providing a standardized framework for assessing model performance across a range of benchmarks. However, like any complex system, it's susceptible to bugs or unexpected behaviors. When evaluating large language models, it's critical to have confidence in the accuracy of the results. Incorrect scores can lead to misleading conclusions about a model's capabilities, impacting downstream tasks and research directions. This article aims to provide a detailed examination of the SIQA scoring issue, including the environment, reproduction steps, and potential causes, offering insights into how this problem can be addressed and resolved. Our exploration will cover the specific models and configurations used, the observed results, and the diagnostic steps taken to identify the root cause. By meticulously documenting this issue, we hope to contribute to the ongoing improvement of OpenCompass and enhance the reliability of LLM evaluations.
Background and Prerequisites
Before diving into the specifics, let's establish some context. The user who reported the issue diligently followed the necessary prerequisites, confirming that the problem hadn't been previously addressed in existing Issues and Discussions, and that it persisted in the latest version of OpenCompass. This thoroughness is commendable, as it ensures that efforts aren't duplicated and that attention is focused on genuinely novel issues. The user's environment details, including CUDA availability, GCC version, GPU information, and the versions of PyTorch, MMEngine, and other relevant libraries, provide valuable context for understanding the setup in which the issue occurred. Such detailed information is essential for reproducibility and debugging, as it helps to isolate environment-specific factors that may be contributing to the problem.
For those unfamiliar with the SIQA dataset, it's designed to evaluate a model's ability to reason about social interactions and common-sense social knowledge. The dataset presents scenarios and questions that require understanding of social norms, intentions, and consequences. Evaluating models on SIQA provides insights into their capacity for human-like social reasoning, a critical aspect for applications involving human-computer interaction and decision-making in social contexts. The fact that this issue specifically affects SIQA scores, while other datasets appear to be unaffected, suggests a potential interaction between the SIQA dataset's format or content and the evaluation process within OpenCompass. This could involve how the dataset is preprocessed, how the model's responses are interpreted, or how the scores are calculated. Investigating these possibilities is key to pinpointing the source of the error. The user's detailed environment information serves as a foundation for our investigation, allowing us to consider potential compatibility issues or configuration nuances that may be influencing the observed results. By carefully examining these details, we can begin to narrow down the range of potential causes and focus our efforts on the most promising areas for investigation. The next section will delve into the specifics of the problem reproduction, including the code and commands used to trigger the issue, providing a clear understanding of the steps involved in replicating the incorrect SIQA scores.
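To make the dataset's structure concrete, here is a minimal sketch of a SIQA-style item. The field names follow the publicly documented SIQA schema (a context, a question, three answer options, and a 1-indexed label); the example text itself is invented for illustration and is not drawn from the actual dataset.

```python
# Illustrative SIQA-style item. The field names (context, question,
# answerA/B/C, label) follow the public SIQA schema; the example text
# below is invented for demonstration purposes.
siqa_item = {
    "context": "Taylor gave Jordan a ride home after practice ran late.",
    "question": "How would Jordan feel afterwards?",
    "answerA": "grateful for the help",
    "answerB": "angry at Taylor",
    "answerC": "indifferent to the ride",
    "label": "1",  # 1-indexed: "1" means answerA is the gold answer
}

def label_to_answer(item):
    """Map the 1-indexed label string to the corresponding answer text."""
    return item["answer" + "ABC"[int(item["label"]) - 1]]

print(label_to_answer(siqa_item))  # -> grateful for the help
```

Because scoring hinges on correctly mapping a model's free-form response back to one of these three options, any mismatch in this mapping step is a plausible place for systematic scoring errors to creep in.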
Reproducing the Problem: Code, Configuration, and Commands
The user provided a comprehensive set of information to reproduce the issue, including the code snippet, configuration details, and commands used. This level of detail is invaluable for debugging, as it allows others to replicate the problem and verify any proposed solutions. The code snippet imports various datasets from OpenCompass configurations, including ARC-c, ARC-e, GSM8k, OBQA, PIQA, SIQA, Winogrande, RACE, and MATH. This indicates a broad evaluation setup, aiming to assess model performance across a diverse range of tasks. The model configuration specifies the use of OpenAISDK with a custom API base and path, suggesting the evaluation is being conducted against a locally hosted model, likely served via vLLM or a similar OpenAI-compatible inference server. The model name, 'DS-R1-Distill-Llama-70B-ht-sym-fp8', indicates a symmetric FP8 quantization of the DeepSeek-R1-Distill-Llama-70B model, a large language model known for its strong performance. The parameters such as query_per_second, max_out_len, max_seq_len, temperature, batch_size, and retry provide insights into the evaluation settings, balancing throughput and response quality.
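The configuration described above can be sketched roughly as follows. This is a hedged reconstruction, not the reporter's actual file: the endpoint URL, API key, abbr, and parameter values are placeholders, and the dataset import shown is one representative line rather than the full list.

```python
# Hedged sketch of an OpenCompass config for evaluating a locally served,
# OpenAI-compatible model. URL, key, and numeric values are placeholders.
from mmengine.config import read_base
from opencompass.models import OpenAISDK

with read_base():
    # One representative dataset import; the report lists several others
    # (ARC-c, ARC-e, GSM8k, OBQA, PIQA, Winogrande, RACE, MATH).
    from opencompass.configs.datasets.siqa.siqa_gen import siqa_datasets

datasets = [*siqa_datasets]

models = [
    dict(
        type=OpenAISDK,
        abbr='DS-R1-Distill-Llama-70B-ht-sym-fp8',
        path='DS-R1-Distill-Llama-70B-ht-sym-fp8',   # name served by vLLM
        openai_api_base='http://localhost:8000/v1',  # placeholder endpoint
        key='EMPTY',                                 # vLLM ignores the key by default
        query_per_second=1,
        max_out_len=2048,
        max_seq_len=16384,
        temperature=0.6,
        batch_size=8,
        retry=2,
    )
]
```

One design point worth noting: with an API-backed model like this, OpenCompass never sees token logprobs, so multiple-choice datasets must be scored by parsing generated text, a step that is more fragile than likelihood-based scoring and a natural suspect when one dataset's scores look anomalous.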
The commands used to run the evaluation are equally informative. The command CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server ... indicates that the model is being served using vLLM, a high-throughput and memory-efficient inference library for LLMs. The --tensor-parallel-size 4 argument specifies that the model is being distributed across four GPUs, while --gpu-memory-utilization 0.95 suggests an attempt to maximize GPU utilization. The --max-model-len 16384 parameter is crucial, as it sets the maximum sequence length the model can handle, which can impact performance on tasks requiring long-range context. This detailed configuration provides a clear picture of the evaluation environment and setup. By replicating these steps, we can confirm the issue and begin to investigate potential causes. The fact that the user explicitly mentions the absence of any error message is also significant. This suggests that the issue isn't a straightforward crash or exception, but rather a subtle problem in the scoring logic or data processing pipeline. This makes debugging more challenging, as it requires careful examination of the intermediate results and computations. The next section will delve into the observed results and error patterns, further refining our understanding of the problem. We will analyze the specific scores obtained on the SIQA dataset, compare them with other datasets, and examine the model's output to identify any anomalies or inconsistencies. This detailed analysis will help us to formulate hypotheses about the root cause and guide our subsequent investigation.
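For completeness, the shape of the serving command is sketched below. Only the flags explicitly quoted in the report (CUDA_VISIBLE_DEVICES, --tensor-parallel-size, --gpu-memory-utilization, --max-model-len) are known; the model path and served name are placeholders standing in for the elided portion of the original command.

```shell
# Hedged reconstruction of the vLLM serving command; the model path and
# served name are placeholders, as the original command was truncated.
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --model /path/to/DS-R1-Distill-Llama-70B-ht-sym-fp8 \
    --served-model-name DS-R1-Distill-Llama-70B-ht-sym-fp8 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 16384
```

Note that --max-model-len 16384 caps the combined prompt and generation length; for a reasoning-distilled model that emits long chains of thought, truncated outputs are one way a scorer could end up parsing incomplete answers.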
The Anomaly: Wrong Scores and Meaningless Output
The core of the issue lies in the discrepancy between the SIQA dataset scores and those of other datasets. The user reported that the SIQA dataset consistently yielded scores around 35 points, while other datasets showed scores of zero. This stark contrast immediately raises suspicion, suggesting a problem specific to the SIQA evaluation pipeline. It is also worth noting that SIQA is a three-way multiple-choice task, so an accuracy near 35 sits close to the ~33% random-guess baseline, which is consistent with answers effectively being selected at random rather than genuinely evaluated. The fact that this issue primarily affects low-precision quantized models further narrows down the potential causes. Quantization can introduce numerical errors and artifacts, which might interact with the SIQA dataset's characteristics in unexpected ways. To further complicate matters, the user observed