cs.CL
1158 postsarXiv:2501.10483v1 Announce Type: new Abstract: Language Models [LMs] are now playing an increasingly large role in information generation and synthesis; the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination; that is, generating apparently plausible but actually false information, including invented citations and nonexistent research papers. This kind of inaccuracy is dangerous in all the domains that require high levels of factual correctness, such as academia and education. This work presents a pipeline for evaluating the frequency with which language models hallucinate in generating responses in the scientific literature. We propose ArxEval, an evaluation pipeline with two tasks using ArXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation includes fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.
arXiv:2501.10642v1 Announce Type: new Abstract: Large Language Models (LLMs) have been widely adopted across various domains, yet their application in the medical field poses unique challenges, particularly concerning the generation of hallucinations. Hallucinations in open-ended long medical text manifest as misleading critical claims, which are difficult to verify due to two reasons. First, critical claims are often deeply entangled within the text and cannot be extracted based solely on surface-level presentation. Second, verifying these claims is challenging because surface-level token-based retrieval often lacks precise or specific evidence, leaving the claims unverifiable without deeper mechanism-based analysis. In this paper, we introduce a novel method termed Iterative Tree Analysis (ITA) for medical critics. ITA is designed to extract implicit claims from long medical texts and verify each claim through an iterative and adaptive tree-like reasoning process. This process involves a combination of top-down task decomposition and bottom-up evidence consolidation, enabling precise verification of complex medical claims through detailed mechanism-level reasoning. Our extensive experiments demonstrate that ITA significantly outperforms previous methods in detecting factual inaccuracies in complex medical text verification tasks by 10%. Additionally, we will release a comprehensive test set to the public, aiming to foster further advancements in research within this domain.
arXiv:2412.05139v3 Announce Type: replace Abstract: The proliferation of large language models has raised growing concerns about their misuse, particularly in cases where AI-generated text is falsely attributed to human authors. Machine-generated content detectors claim to effectively identify such text under various conditions and from any language model. This paper critically evaluates these claims by assessing several popular detectors (RADAR, Wild, T5Sentinel, Fast-DetectGPT, GPTID, LogRank, Binoculars) on a range of domains, datasets, and models that these detectors have not previously encountered. We employ various prompting strategies to simulate adversarial attacks, demonstrating that even moderate efforts can significantly evade detection. We emphasize the importance of the true positive rate at a specific false positive rate (TPR@FPR) metric and demonstrate that these detectors perform poorly in certain settings, with TPR@.01 as low as 0%. Our findings suggest that both trained and zero-shot detectors struggle to maintain high sensitivity while achieving a reasonable true positive rate.
arXiv:2501.10388v1 Announce Type: new Abstract: The emergence of Large Language Models has fundamentally transformed the capabilities of AI agents, enabling a new class of autonomous agents capable of interacting with their environment through dynamic code generation and execution. These agents possess the theoretical capacity to operate as independent economic actors within digital markets, offering unprecedented potential for value creation through their distinct advantages in operational continuity, perfect replication, and distributed learning capabilities. However, contemporary digital infrastructure, architected primarily for human interaction, presents significant barriers to their participation. This work presents a systematic analysis of the infrastructure requirements necessary for AI agents to function as autonomous participants in digital markets. We examine four key areas - identity and authorization, service discovery, interfaces, and payment systems - to show how existing infrastructure actively impedes agent participation. We argue that addressing these infrastructure challenges represents more than a technical imperative; it constitutes a fundamental step toward enabling new forms of economic organization. Much as traditional markets enable human intelligence to coordinate complex activities beyond individual capability, markets incorporating AI agents could dramatically enhance economic efficiency through continuous operation, perfect information sharing, and rapid adaptation to changing conditions. The infrastructure challenges identified in this work represent key barriers to realizing this potential.
arXiv:2501.10573v1 Announce Type: new Abstract: We investigate the relationship between the geometry of token embeddings and their role in the next token prediction within transformer models. An important aspect of this connection uses the notion of empirical measure, which encodes the distribution of token point clouds across transformer layers and drives the evolution of token representations in the mean-field interacting picture. We use metrics such as intrinsic dimension, neighborhood overlap, and cosine similarity to observationally probe these empirical measures across layers. To validate our approach, we compare these metrics to a dataset where the tokens are shuffled, which disrupts the syntactic and semantic structure. Our findings reveal a correlation between the geometric properties of token embeddings and the cross-entropy loss of next token predictions, implying that prompts with higher loss values have tokens represented in higher-dimensional spaces.
arXiv:2501.10639v1 Announce Type: new Abstract: Ensuring safety alignment has become a critical requirement for large language models (LLMs), particularly given their widespread deployment in real-world applications. However, LLMs remain susceptible to jailbreak attacks, which exploit system vulnerabilities to bypass safety measures and generate harmful outputs. Although numerous defense mechanisms based on adversarial training have been proposed, a persistent challenge lies in the exacerbation of over-refusal behaviors, which compromise the overall utility of the model. To address these challenges, we propose a Latent-space Adversarial Training with Post-aware Calibration (LATPC) framework. During the adversarial training phase, LATPC compares harmful and harmless instructions in the latent space and extracts safety-critical dimensions to construct refusal features attack, precisely simulating agnostic jailbreak attack types requiring adversarial mitigation. At the inference stage, an embedding-level calibration mechanism is employed to alleviate over-refusal behaviors with minimal computational overhead. Experimental results demonstrate that, compared to various defense methods across five types of jailbreak attacks, LATPC framework achieves a superior balance between safety and utility. Moreover, our analysis underscores the effectiveness of extracting safety-critical dimensions from the latent space for constructing robust refusal feature attacks.
arXiv:2501.09512v2 Announce Type: replace Abstract: Code-switching, the alternation of languages within a single discourse, presents a significant challenge for Automatic Speech Recognition. Despite the unique nature of the task, performance is commonly measured with established metrics such as Word-Error-Rate (WER). However, in this paper, we question whether these general metrics accurately assess performance on code-switching. Specifically, using both Connectionist-Temporal-Classification and Encoder-Decoder models, we show fine-tuning on non-code-switched data from both matrix and embedded language improves classical metrics on code-switching test sets, although actual code-switched words worsen (as expected). Therefore, we propose Point-of-Interest Error Rate (PIER), a variant of WER that focuses only on specific words of interest. We instantiate PIER on code-switched utterances and show that this more accurately describes the code-switching performance, showing huge room for improvement in future work. This focused evaluation allows for a more precise assessment of model performance, particularly in challenging aspects such as inter-word and intra-word code-switching.
arXiv:2412.04505v2 Announce Type: replace Abstract: Accurately interpreting words is vital in political science text analysis; some tasks require assuming semantic stability, while others aim to trace semantic shifts. Traditional static embeddings, like Word2Vec effectively capture long-term semantic changes but often lack stability in short-term contexts due to embedding fluctuations caused by unbalanced training data. BERT, which features transformer-based architecture and contextual embeddings, offers greater semantic consistency, making it suitable for analyses in which stability is crucial. This study compares Word2Vec and BERT using 20 years of People's Daily articles to evaluate their performance in semantic representations across different timeframes. The results indicate that BERT outperforms Word2Vec in maintaining semantic stability and still recognizes subtle semantic variations. These findings support BERT's use in text analysis tasks that require stability, where semantic changes are not assumed, offering a more reliable foundation than static alternatives.
arXiv:2501.10361v1 Announce Type: new Abstract: This paper argues that we should perceive LLMs as machines of extrapolation. Extrapolation is a statistical function for predicting the next value in a series. Extrapolation contributes to both GPT successes and controversies surrounding its hallucination. The term hallucination implies a malfunction, yet this paper contends that it in fact indicates the chatbot efficiency in extrapolation, albeit an excess of it. This article bears a historical dimension: it traces extrapolation to the nascent years of cybernetics. In 1941, when Norbert Wiener transitioned from missile science to communication engineering, the pivotal concept he adopted was none other than extrapolation. Soviet mathematician Andrey Kolmogorov, renowned for his compression logic that inspired OpenAI, had developed in 1939 another extrapolation project that Wiener later found rather like his own. This paper uncovers the connections between hot war science, Cold War cybernetics, and the contemporary debates on LLM performances.
arXiv:2501.10377v1 Announce Type: new Abstract: The development and deployment of chatbot technology, while spanning decades and employing different techniques, require innovative frameworks to understand and interrogate their functionality and implications. A mere technocentric account of the evolution of chatbot technology does not fully illuminate how conversational systems are embedded in societal dynamics. This study presents a structured examination of chatbots across three societal dimensions, highlighting their roles as objects of scientific research, commercial instruments, and agents of intimate interaction. Through furnishing a dimensional framework for the evolution of conversational systems, from laboratories to marketplaces to private lives, this article contributes to the wider scholarly inquiry of chatbot technology and its impact in lived human experiences and dynamics.
arXiv:2501.10487v1 Announce Type: new Abstract: This paper proposes a Theme-Explanation Structure-based Table Summarization (Tabular-TX) pipeline designed to efficiently process table data. Tabular-TX preprocesses table data by focusing on highlighted cells and then generates summary sentences structured with a Theme Part in the form of adverbial phrases followed by an Explanation Part in the form of clauses. In this process, customized analysis is performed by considering the structural characteristics and comparability of the table. Additionally, by utilizing In-Context Learning, Tabular-TX optimizes the analytical capabilities of large language models (LLMs) without the need for fine-tuning, effectively handling the structural complexity of table data. Results from applying the proposed Tabular-TX to generate table-based summaries demonstrated superior performance compared to existing fine-tuning-based methods, despite limitations in dataset size. Experimental results confirmed that Tabular-TX can process complex table data more effectively and established it as a new alternative for table-based question answering and summarization tasks, particularly in resource-constrained environments.
arXiv:2501.10542v1 Announce Type: new Abstract: Software bugs pose a significant challenge during development and maintenance, and practitioners spend nearly 50% of their time dealing with bugs. Many existing techniques adopt Information Retrieval (IR) to localize a reported bug using textual and semantic relevance between bug reports and source code. However, they often struggle to bridge a critical gap between bug reports and code that requires in-depth contextual understanding, which goes beyond textual or semantic relevance. In this paper, we present a novel technique for bug localization - BRaIn - that addresses the contextual gaps by assessing the relevance between bug reports and code with Large Language Models (LLM). It then leverages the LLM's feedback (a.k.a., Intelligent Relevance Feedback) to reformulate queries and re-rank source documents, improving bug localization. We evaluate BRaIn using a benchmark dataset, Bench4BL, and three performance metrics and compare it against six baseline techniques from the literature. Our experimental results show that BRaIn outperforms baselines by 87.6%, 89.5%, and 48.8% margins in MAP, MRR, and HIT@K, respectively. Additionally, it can localize approximately 52% of bugs that cannot be localized by the baseline techniques due to the poor quality of corresponding bug reports. By addressing the contextual gaps and introducing Intelligent Relevance Feedback, BRaIn advances not only theory but also improves IR-based bug localization.
arXiv:2501.10582v1 Announce Type: new Abstract: Users of Augmentative and Alternative Communication (AAC) may write letter-by-letter via an interface that uses a character language model. However, most state-of-the-art large pretrained language models predict subword tokens of variable length. We investigate how to practically use such models to make accurate and efficient character predictions. We fine-tune models using a large dataset of sentences we curated in which each sentence is rated according to how useful it might be for spoken or written AAC communication. We find that using an algorithm to produce character predictions from a subword large language model provides more accurate predictions than adding a classification layer or using a byte-level model. We also find that our domain adaptation curriculum is effective at improving model performance on simple, conversational text.
arXiv:2501.10604v1 Announce Type: new Abstract: The increasing availability of traffic videos functioning on a 24/7/365 time scale has the great potential of increasing the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing footage from hundreds, if not thousands, of traffic cameras in a 24/7/365 working protocol remains an extremely challenging task, as current vision-based approaches primarily focus on extracting raw information, such as vehicle trajectories or individual object detection, but require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow to a more interactive, conversational approach. This shift significantly enhances processing throughput by automating complex tasks like video classification and visual grounding, while improving adaptability by enabling seamless adjustments to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths and a novel multimodal prompt to generate structured responses for review and evaluation and enable fine-grained visual grounding. We introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding by leveraging off-the-shelf MLLMs. Source code will be available at \url{https://github.com/ai4ce/SeeUnsafe}.
arXiv:2501.04962v2 Announce Type: replace Abstract: With the growing demand for developing speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. When engaging in conversations with humans, it is essential for these models to comprehend a wide range of world knowledge. In this paper, we introduce VoxEval, a novel speech question-answering benchmark specifically designed to assess SLMs' knowledge understanding through purely speech-based interactions. Unlike existing AudioQA benchmarks, VoxEval maintains speech format for both questions and answers, evaluates model robustness across diverse audio conditions (varying timbres, audio qualities, and speaking styles), and pioneers the assessment of challenging domains like mathematical problem-solving in spoken format. Our comprehensive evaluation of recent SLMs using VoxEval reveals significant performance limitations in current models, highlighting crucial areas for future improvements. VoxEval dataset is available at: https://github.com/dreamtheater123/VoxEval
arXiv:2501.08613v2 Announce Type: replace Abstract: The recent successful paradigm of solving logical reasoning problems with tool-augmented large language models (LLMs) leverages translation of natural language statements into First-Order Logic~(FOL) and external theorem provers. However, the correctness of FOL statements, comprising operators and text predicates, often goes unverified due to the lack of a reliable evaluation metric for comparing generated and ground-truth FOLs. In this paper, we present a comprehensive study of sensitivity of existing metrics and their alignment with human judgement on FOL evaluation. Using ground-truth FOLs, we carefully designed various perturbations on the ground-truth to assess metric sensitivity. We sample FOL translation candidates for natural language statements and measure the ranking alignment between automatic metrics and human annotators. Our empirical findings highlight oversensitivity in the n-gram metric BLEU for text perturbations, the semantic graph metric Smatch++ for structural perturbations, and FOL metric for operator perturbation. We also observe a closer alignment between BertScore and human judgement. Additionally, we show that combining metrics enhances both alignment and sensitivity compared to using individual metrics.
arXiv:2412.12039v2 Announce Type: replace Abstract: Despite their remarkable success, large language models (LLMs) have shown limited ability on applied tasks such as vulnerability detection. We investigate various prompting strategies for vulnerability detection and, as part of this exploration, propose a prompting strategy that integrates natural language descriptions of vulnerabilities with a contrastive chain-of-thought reasoning approach, augmented using contrastive samples from a synthetic dataset. Our study highlights the potential of LLMs to detect vulnerabilities by integrating natural language descriptions, contrastive reasoning, and synthetic examples into a comprehensive prompting framework. Our results show that this approach can enhance LLM understanding of vulnerabilities. On a high-quality vulnerability detection dataset such as SVEN, our prompting strategies can improve accuracies, F1-scores, and pairwise accuracies by 23%, 11%, and 14%, respectively.
arXiv:2412.14528v2 Announce Type: replace Abstract: Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce the Multi-Level Optimal Transport (MultiLevelOT), a novel approach that advances the optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently capture complex distribution structures of logits via the Sinkhorn distance, which approximates the Wasserstein distance for divergence measures. Extensive experiments on tasks such as extractive QA, generative QA, and summarization demonstrate that the MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under various settings. Our approach is robust to different student and teacher models across model families, architectures, and parameter sizes. Codes and models are available at https://github.com/2018cx/Multi-Level-OT.
arXiv:2412.06134v2 Announce Type: replace Abstract: Current social bias benchmarks for Large Language Models (LLMs) primarily rely on pre-defined question formats like multiple-choice, limiting their ability to reflect the complexity and open-ended nature of real-world interactions. To address this gap, we extend an existing BBQ dataset introduced by incorporating fill-in-the-blank and short-answer question types, designed to evaluate biases in an open-ended setting. Our finding reveals that LLMs tend to produce responses that are more biased against certain protected attributes, like age and socio-economic status. On the other hand, these biased outputs produced by LLMs can serve as valuable contexts and chains of thought for debiasing. Our debiasing approach combined zero-shot, few-shot, and chain-of-thought could significantly reduce the level of bias to almost 0. We open-source our evaluation and debiasing code hoping to encourage further measurements and mitigation of bias and stereotype in LLMs.
arXiv:2501.10648v1 Announce Type: new Abstract: In this report, we present DNA 1.0 8B Instruct, a state-of-the-art bilingual language model optimized for Korean and English language tasks. By applying continual pre-training (CPT) with high-quality Korean datasets to Llama 3.1 8B and subsequent supervised fine-tuning (SFT), we create an instruction-following model with enhanced Korean language capabilities. This model is then merged with Llama 3.1 8B Instruct via spherical linear interpolation (SLERP) and undergoes further optimization through direct preference optimization (DPO) and knowledge distillation (KD). DNA 1.0 8B Instruct achieves state-of-the-art results on Korean-specific tasks, including KMMLU (53.26%), KoBEST (83.40%), and BELEBELE (57.99%), while maintaining strong English capabilities on MMLU (66.64%), MMLU-Pro (43.05%) and GSM8K (80.52%). As an open model, DNA 1.0 8B Instruct represents a significant advancement in bilingual language modeling. As an open model, DNA 1.0 8B Instruct is freely available through https://huggingface.co/dnotitia/Llama-DNA-1.0-8B-Instruct . For commercial licensing inquiries or feedback, please contact us at https://www.dnotitia.com/contact/post-form