
Uncertainty Quantification (UQ) in Language Models (LMs) is key to improving their safety and reliability. Evaluations often use metrics like AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L). We show that mutual biases, cases in which both UQ methods and correctness functions are biased by the same factors, systematically distort evaluation. First, we formally prove that any mutual bias non-randomly skews AUROC rankings, compromising benchmark integrity. Second, we confirm this happens empirically by testing 7 widely used correctness functions, from lexical-based and embedding-based metrics to LM-as-a-judge approaches, across 4 datasets × 4 models × 8 UQ methods. Our analysis shows that length biases in correctness functions distort UQ assessments by interacting with length biases in UQ methods. We identify LM-as-a-judge methods as the least length-biased, offering a promising path for a fairer UQ evaluation.
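To make the evaluation setup concrete, here is a minimal, hypothetical sketch (not code from the paper): it scores one UQ method (negative sequence log-probability) against one correctness function (ROUGE-L thresholded into a binary label) using AUROC. The example strings, the 0.5 threshold, and the helper names are illustrative assumptions.

```python
# A minimal sketch of AUROC-based UQ evaluation, under the assumptions above.
import numpy as np
from rouge_score import rouge_scorer
from sklearn.metrics import roc_auc_score

def negative_log_prob(token_logprobs):
    # UQ method: higher value = model was less confident in the sequence.
    # The sum grows with sequence length, one possible source of length bias.
    return -float(np.sum(token_logprobs))

def rouge_l_correct(prediction, reference, threshold=0.5):
    # Correctness function: 1 if ROUGE-L F1 against the reference exceeds
    # the (illustrative) threshold, else 0.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return int(scorer.score(reference, prediction)["rougeL"].fmeasure > threshold)

# Hypothetical per-example model outputs: (token log-probs, answer, reference).
examples = [
    ([-0.1, -0.2, -0.1], "paris is the capital of france", "Paris"),
    ([-1.5, -2.0, -1.8, -1.1], "i think it might be lyon", "Paris"),
    ([-0.3, -0.4], "blue whale", "The blue whale"),
]

uncertainty = [negative_log_prob(lp) for lp, _, _ in examples]
correct = [rouge_l_correct(pred, ref) for _, pred, ref in examples]

# AUROC of "low uncertainty predicts correctness": negate the uncertainty so
# that higher scores correspond to the positive (correct) class.
auroc = roc_auc_score(correct, [-u for u in uncertainty])
print(f"AUROC = {auroc:.3f}")
```

Because the summed negative log-probability grows with output length and lexical metrics such as ROUGE-L can also depend on answer length, the two quantities share a confounder; this is the kind of mutual bias the paper shows can non-randomly skew AUROC rankings.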

Related readings and updates.

This paper was accepted at the Safe Generative AI Workshop (SGAIW) 2024 at NeurIPS 2024. Uncertainty quantification (UQ) is crucial for ensuring the safe deployment of large language models, particularly in high-stakes applications where hallucinations can be harmful. However, existing UQ methods often demand substantial computational resources, e.g., multi-sample methods such as Semantic Entropy (Kuhn et al., 2023) usually require 5-10 inference…
This paper was accepted at the IEEE Spoken Language Technology Workshop (SLT) 2024. Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the…