cs.AI
arXiv:2501.00062v1 Announce Type: new Abstract: Bidirectional transformers excel at sentiment analysis, and Large Language Models (LLMs) are effective zero-shot learners. Might they perform better as a team? This paper explores collaborative approaches between ELECTRA and GPT-4o for three-way sentiment classification. We fine-tuned (FT) four models (ELECTRA Base/Large, GPT-4o/4o-mini) using a mix of reviews from Stanford Sentiment Treebank (SST) and DynaSent. We provided input from ELECTRA to GPT as: predicted label, probabilities, and retrieved examples. Sharing ELECTRA Base FT predictions with GPT-4o-mini significantly improved performance over either model alone (82.74 macro F1 vs. 79.29 ELECTRA Base FT, 79.52 GPT-4o-mini) and yielded the lowest cost/performance ratio ($0.12/F1 point). However, when GPT models were fine-tuned, including predictions decreased performance. GPT-4o FT-M was the top performer (86.99), with GPT-4o-mini FT close behind (86.77) at much less cost ($0.38 vs. $1.59/F1 point). Our results show that augmenting prompts with predictions from fine-tuned encoders is an efficient way to boost performance, and a fine-tuned GPT-4o-mini is nearly as good as GPT-4o FT at 76% less cost. Both are affordable options for projects with limited resources.
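As a rough illustration of the prompt-augmentation idea in the abstract above, the sketch below appends an encoder's predicted label and class probabilities to a classification prompt, then recomputes a relative saving from the quoted per-F1-point costs. The names `build_augmented_prompt` and `relative_saving`, the prompt wording, and the example probabilities are assumptions made for illustration; the abstract does not give the paper's actual prompts.

```python
LABELS = ("negative", "neutral", "positive")


def build_augmented_prompt(review, probs):
    """Build a three-way sentiment prompt that includes an encoder's prediction.

    `probs` maps each label to the encoder's probability (hypothetical values).
    """
    predicted = max(LABELS, key=lambda label: probs[label])
    prob_text = ", ".join(f"{label}: {probs[label]:.2f}" for label in LABELS)
    return (
        "Classify the sentiment of the review as negative, neutral, or positive.\n"
        f"Review: {review}\n"
        f"A fine-tuned encoder predicts: {predicted}\n"
        f"Encoder probabilities: {prob_text}\n"
        "Answer with one label."
    )


def relative_saving(cheaper, pricier):
    """Fraction saved when paying `cheaper` instead of `pricier`."""
    return 1.0 - cheaper / pricier


prompt = build_augmented_prompt(
    "The plot drags, but the acting is superb.",
    {"negative": 0.15, "neutral": 0.25, "positive": 0.60},
)
# Cost per F1 point quoted in the abstract: $0.38 (GPT-4o-mini FT) vs. $1.59 (GPT-4o FT-M).
saving = relative_saving(0.38, 1.59)
```

With the quoted figures, `saving` comes to about 0.76, which is in line with the "76% less cost" statement, although the abstract does not say which cost figures that percentage was computed from.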
arXiv:2501.00072v1 Announce Type: new Abstract: Neural algorithmic reasoning is an emerging area of machine learning that focuses on building neural networks capable of solving complex algorithmic tasks. Recent advancements predominantly follow the standard supervised learning paradigm -- feeding an individual problem instance into the network each time and training it to approximate the execution steps of a classical algorithm. We challenge this mode and propose a novel open-book learning framework. In this framework, whether during training or testing, the network can access and utilize all instances in the training dataset when reasoning for a given instance. Empirical evaluation is conducted on the challenging CLRS Algorithmic Reasoning Benchmark, which consists of 30 diverse algorithmic tasks. Our open-book learning framework exhibits a significant enhancement in neural reasoning capabilities. Further, we notice that there is recent literature suggesting that multi-task training on CLRS can improve the reasoning accuracy of certain tasks, implying intrinsic connections between different algorithmic tasks. We delve into this direction via the open-book framework. When the network reasons for a specific task, we enable it to aggregate information from training instances of other tasks in an attention-based manner. We show that this open-book attention mechanism offers insights into the inherent relationships among various tasks in the benchmark and provides a robust tool for interpretable multi-task training.
arXiv:2501.00055v1 Announce Type: new Abstract: While safety-aligned large language models (LLMs) are increasingly used as the cornerstone for powerful systems such as multi-agent frameworks to solve complex real-world problems, they still suffer from potential adversarial queries, such as jailbreak attacks, which attempt to induce harmful content. Researching attack methods allows us to better understand the limitations of LLMs and make trade-offs between helpfulness and safety. However, existing jailbreak attacks are primarily based on opaque optimization techniques (e.g. token-level gradient descent) and heuristic search methods like LLM refinement, which fall short in terms of transparency, transferability, and computational cost. In light of these limitations, we draw inspiration from the evolution and infection processes of biological viruses and propose LLM-Virus, a jailbreak attack method based on an evolutionary algorithm, termed evolutionary jailbreak. LLM-Virus treats jailbreak attacks as both an evolutionary and transfer learning problem, utilizing LLMs as heuristic evolutionary operators to ensure high attack efficiency, transferability, and low time cost. Our experimental results on multiple safety benchmarks show that LLM-Virus achieves competitive or even superior performance compared to existing attack methods.
arXiv:2501.00061v1 Announce Type: new Abstract: Model merging has attracted significant attention as a powerful paradigm for model reuse, facilitating the integration of task-specific models into a singular, versatile framework endowed with multifarious capabilities. Previous studies, predominantly utilizing methods such as Weight Average (WA), have shown that model merging can effectively leverage pretrained models without the need for laborious retraining. However, the inherent heterogeneity among models poses a substantial constraint on its applicability, particularly when confronted with discrepancies in model architectures. To overcome this challenge, we propose an innovative model merging framework designed for heterogeneous models, encompassing both depth and width heterogeneity. To address depth heterogeneity, we introduce a layer alignment strategy that harmonizes model layers by segmenting deeper models, treating consecutive layers with similar representations as a cohesive segment, thus enabling the seamless merging of models with differing layer depths. For width heterogeneity, we propose a novel elastic neuron zipping algorithm that projects the weights from models of varying widths onto a common dimensional space, eliminating the need for identical widths. Extensive experiments validate the efficacy of these proposed methods, demonstrating that the merging of structurally heterogeneous models can achieve performance levels comparable to those of homogeneous merging, across both vision and NLP tasks. Our code is publicly available at https://github.com/zju-vipa/training_free_heterogeneous_model_merging.
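For context, here is a minimal sketch of the Weight Average (WA) baseline mentioned in the abstract above, not of the proposed heterogeneous merging method. It assumes each model is a plain dict from parameter name to a flat list of floats (a hypothetical stand-in for a real state dict) and shows why plain averaging requires identical shapes.

```python
def weight_average(state_dicts):
    """Average parameters element-wise across models that share one architecture.

    Each state dict maps a parameter name to a flat list of floats.
    """
    if not state_dicts:
        raise ValueError("need at least one model")
    merged = {}
    for name in state_dicts[0]:
        tensors = [sd[name] for sd in state_dicts]
        if len({len(t) for t in tensors}) != 1:
            # This is the width mismatch that plain averaging cannot handle.
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [sum(vals) / len(vals) for vals in zip(*tensors)]
    return merged


# Two toy models with the same (hypothetical) parameter names and shapes.
model_a = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
model_b = {"layer0.weight": [3.0, 4.0], "layer0.bias": [1.0]}
merged = weight_average([model_a, model_b])
```

A model with a different number of layers or a different width fails this scheme immediately, which is the limitation the layer alignment and neuron zipping steps in the abstract are meant to address.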
arXiv:2501.00066v1 Announce Type: new Abstract: We investigate the adversarial robustness of LLMs in transfer learning scenarios. Through comprehensive experiments on multiple datasets (MBIB Hate Speech, MBIB Political Bias, MBIB Gender Bias) and various model architectures (BERT, RoBERTa, GPT-2, Gemma, Phi), we reveal that transfer learning, while improving standard performance metrics, often leads to increased vulnerability to adversarial attacks. Our findings demonstrate that larger models exhibit greater resilience to this phenomenon, suggesting a complex interplay between model size, architecture, and adaptation methods. Our work highlights the crucial need for considering adversarial robustness in transfer learning scenarios and provides insights into maintaining model security without compromising performance. These findings have significant implications for the development and deployment of LLMs in real-world applications where both performance and robustness are paramount.
arXiv:2501.00070v1 Announce Type: new Abstract: Recent work has demonstrated that semantics specified by pretraining data influence how representations of different concepts are organized in a large language model (LLM). However, given the open-ended nature of LLMs, e.g., their ability to in-context learn, we can ask whether models alter these pretraining semantics to adopt alternative, context-specified ones. Specifically, if we provide in-context exemplars wherein a concept plays a different role than what the pretraining data suggests, do models reorganize their representations in accordance with these novel semantics? To answer this question, we take inspiration from the theory of conceptual role semantics and define a toy "graph tracing" task wherein the nodes of the graph are referenced via concepts seen during training (e.g., apple, bird, etc.) and the connectivity of the graph is defined via some predefined structure (e.g., a square grid). Given exemplars that indicate traces of random walks on the graph, we analyze intermediate representations of the model and find that as the amount of context is scaled, there is a sudden re-organization from pretrained semantic representations to in-context representations aligned with the graph structure. Further, we find that when reference concepts have correlations in their semantics (e.g., Monday, Tuesday, etc.), the context-specified graph structure is still present in the representations, but is unable to dominate the pretrained structure. To explain these results, we analogize our task to energy minimization for a predefined graph topology, providing evidence towards an implicit optimization process to infer context-specified semantics. Overall, our findings indicate scaling context-size can flexibly re-organize model representations, possibly unlocking novel capabilities.
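The "graph tracing" setup described above can be sketched roughly as follows: familiar words are placed on a square grid, and random walks over the grid are serialized as in-context exemplars. The word list, grid size, walk length, and function names here are assumptions for illustration, not the paper's configuration.

```python
import random


def build_grid_graph(words, side):
    """Place side * side distinct words on a square grid; return neighbors by word."""
    if len(words) != side * side:
        raise ValueError("need exactly side * side words")
    neighbors = {word: [] for word in words}
    for r in range(side):
        for c in range(side):
            here = words[r * side + c]
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < side and 0 <= cc < side:
                    neighbors[here].append(words[rr * side + cc])
    return neighbors


def random_walk(neighbors, length, rng):
    """Return a list of `length` words, each adjacent on the grid to the previous one."""
    node = rng.choice(sorted(neighbors))
    trace = [node]
    for _ in range(length - 1):
        node = rng.choice(neighbors[node])
        trace.append(node)
    return trace


words = ["apple", "bird", "car", "dog", "egg", "fish", "gate", "hat", "ink"]
graph = build_grid_graph(words, side=3)
trace = random_walk(graph, length=12, rng=random.Random(0))
context = " ".join(trace)  # a serialized walk that could serve as in-context text
```

The only structure in `context` comes from grid adjacency, so any grid-like geometry found in a model's representations of these words would have to be inferred from the context rather than from pretraining.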
arXiv:2501.00042v1 Announce Type: new Abstract: This paper describes a memory-efficient transformer model designed to substantially reduce memory usage and execution time while keeping performance close to that of the original model. Recently, new transformer architectures were presented that focus on parameter efficiency and computational optimization; however, such models usually require considerable hardware resources when deployed in real-world applications on edge devices. Our approach addresses this concern by halving the embedding size and applying targeted techniques such as parameter pruning and quantization to optimize the memory footprint with minimal loss of accuracy. Experimental results include a 52% reduction in memory usage and a 33% decrease in execution time, resulting in better efficiency than state-of-the-art models. We compared our model with existing compelling architectures, such as MobileBERT and DistilBERT, and demonstrated its feasibility in the domain of resource-friendly deep learning architectures, mainly for real-time and resource-constrained applications.
arXiv:2501.00054v1 Announce Type: new Abstract: Security concerns surrounding text-to-image diffusion models have driven researchers to unlearn inappropriate concepts through fine-tuning. Recent fine-tuning methods typically align the prediction distributions of unsafe prompts with those of predefined text anchors. However, these techniques exhibit a considerable performance trade-off between eliminating undesirable concepts and preserving other concepts. In this paper, we systematically analyze the impact of diverse text anchors on unlearning performance. Guided by this analysis, we propose AdvAnchor, a novel approach that generates adversarial anchors to alleviate the trade-off issue. These adversarial anchors are crafted to closely resemble the embeddings of undesirable concepts to maintain overall model performance, while selectively excluding defining attributes of these concepts for effective erasure. Extensive experiments demonstrate that AdvAnchor outperforms state-of-the-art methods. Our code is publicly available at https://anonymous.4open.science/r/AdvAnchor.
arXiv:2501.00057v1 Announce Type: new Abstract: Although deep learning models have had great success in natural language processing and computer vision, we do not observe comparable improvements in the case of tabular data, which is still the most common data type used in biological, industrial and financial applications. In particular, it is challenging to transfer large-scale pre-trained models to downstream tasks defined on small tabular datasets. To address this, we propose VisTabNet -- a cross-modal transfer learning method, which allows for adapting Vision Transformer (ViT) with pre-trained weights to process tabular data. By projecting tabular inputs to patch embeddings acceptable by ViT, we can directly apply a pre-trained Transformer Encoder to tabular inputs. This approach eliminates the conceptual cost of designing a suitable architecture for processing tabular data, while reducing the computational cost of training the model from scratch. Experimental results on multiple small tabular datasets (less than 1k samples) demonstrate VisTabNet's superiority, outperforming both traditional ensemble methods and recent deep learning models. The proposed method goes beyond conventional transfer learning practice and shows that pre-trained image models can be transferred to solve tabular problems, extending the boundaries of transfer learning.
arXiv:2501.00059v1 Announce Type: new Abstract: Mathematical problem-solving is a key field in artificial intelligence (AI) and a critical benchmark for evaluating the capabilities of large language models (LLMs). While extensive research has focused on mathematical problem-solving, most existing work and datasets concentrate on computational tasks, leaving gaps in areas like mathematical analysis, which demands rigorous proofs and formal reasoning. We developed the DEMI-MathAnalysis dataset, comprising proof-based problems from mathematical analysis topics such as Sequences and Limits, Infinite Series, and Convex Functions. We also designed a guiding framework to rigorously enhance LLMs' ability to solve these problems. Through fine-tuning LLMs on this dataset and employing our framework, we observed significant improvements in their capability to generate logical, complete, and elegant proofs. This work addresses critical gaps in mathematical reasoning and contributes to advancing trustworthy AI capable of handling formalized mathematical language. The code is publicly accessible at LLMs for Mathematical Analysis.
arXiv:2501.00063v1 Announce Type: new Abstract: The financial industry is increasingly seeking robust methods to address the challenges posed by data scarcity and low signal-to-noise ratios, which limit the application of deep learning techniques in stock market analysis. This paper presents two innovative generative model-based approaches to synthesize stock data, specifically tailored for different scenarios within the A-share market in China. The first method, a sector-based synthesis approach, enhances the signal-to-noise ratio of stock data by classifying the characteristics of stocks from various sectors in China's A-share market. This method employs an Approximate Non-Local Total Variation algorithm to smooth the generated data, a bandpass filtering method based on Fourier Transform to eliminate noise, and Denoising Diffusion Implicit Models to accelerate sampling speed. The second method, a recursive stock data synthesis approach based on pattern recognition, is designed to synthesize data for stocks with short listing periods and limited comparable companies. It leverages pattern recognition techniques and Markov models to learn and generate variable-length stock sequences, while introducing a sub-time-level data augmentation method to alleviate data scarcity issues. We validate the effectiveness of these methods through extensive experiments on various datasets, including those from the main board, STAR Market, Growth Enterprise Market Board, Beijing Stock Exchange, NASDAQ, NYSE, and AMEX. The results demonstrate that our synthesized data not only improve the performance of predictive models but also enhance the signal-to-noise ratio of individual stock signals in price trading strategies. Furthermore, the introduction of sub-time-level data significantly improves the quality of synthesized data.
arXiv:2501.00065v1 Announce Type: new Abstract: Objective: Predicting children's future levels of externalizing problems helps to identify children at risk and guide targeted prevention. Existing studies have shown that mothers providing support in response to children's dysregulation was associated with children's lower levels of externalizing problems. The current study aims to evaluate and improve the accuracy of predicting children's externalizing problems with mother-child interaction dynamics. Method: This study used mother-child interaction dynamics during a challenging puzzle task to predict children's externalizing problems six months later (N = 101, 46 boys, M_age = 57.41 months, SD = 6.58). Performance of the Residual Dynamic Structural Equation Model (RDSEM) was compared with the Attention-based Sequential Behavior Interaction Modeling (ASBIM) model, developed using deep learning techniques. Results: The RDSEM revealed that children whose mothers provided more autonomy support after increases of child defeat had lower levels of externalizing problems. Five-fold cross-validation showed that the RDSEM had good prediction accuracy. The ASBIM model further improved prediction accuracy, especially after including child inhibitory control as a personalized individual feature. Conclusions: The dynamic process of mother-child interaction provides important information for predicting children's externalizing problems, especially maternal autonomy supportive response to child defeat. The deep learning model is a useful tool to further improve prediction accuracy.
arXiv:2501.00067v1 Announce Type: new Abstract: The article describes an attempt to apply an ensemble of binary classifiers to solve the problem of speech assessment in medicine. A dataset was compiled based on quantitative and expert assessments of syllable pronunciation quality. Quantitative assessments of 7 selected metrics were used as features: dynamic time warp distance, Minkowski distance, correlation coefficient, longest common subsequence (LCSS), edit distance of real sequence (EDR), edit distance with real penalty (ERP), and merge split (MSM). Expert assessment of pronunciation quality was used as a class label: class 1 means high-quality speech, class 0 means distorted. A comparison of training results was carried out for five classification methods: logistic regression (LR), support vector machine (SVM), naive Bayes (NB), decision trees (DT), and K-nearest neighbors (KNN). The results of using the mixture method to build an ensemble of classifiers are also presented. The use of an ensemble for the studied data sets allowed us to slightly increase the classification accuracy compared to the use of individual binary classifiers.
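To make the first feature in the list above concrete, the following is a textbook sketch of dynamic time warping distance between two one-dimensional sequences, using absolute difference as the local cost. The abstract does not specify the local cost or any windowing, so those choices are assumptions here.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] also matches an earlier element of b
                cost[i][j - 1],      # b[j-1] also matches an earlier element of a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # a stretched copy aligns perfectly
shifted = dtw_distance([0, 0], [1, 1])
```

Because warping lets one sample align with several, a reference syllable and a slower rendition of it can still score a small distance, which is the property that makes the metric useful for comparing pronunciations of different durations.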
arXiv:2501.00069v1 Announce Type: new Abstract: Generative language models are increasingly used for contract drafting and enhancement, creating a scenario where competing parties deploy different language models against each other. This introduces not only a game-theory challenge but also significant concerns related to AI safety and security, as the language model employed by the opposing party can be unknown. These competitive interactions can be seen as adversarial testing grounds, where models are effectively red-teamed to expose vulnerabilities such as generating biased, harmful or legally problematic text. Despite the importance of these challenges, the competitive robustness and safety of these models in adversarial settings remain poorly understood. In this small study, we approach this problem by evaluating the performance and vulnerabilities of major open-source language models in head-to-head competitions, simulating real-world contract negotiations. We further explore how these adversarial interactions can reveal potential risks, informing the development of more secure and reliable models. Our findings contribute to the growing body of research on AI safety, offering insights into model selection and optimisation in competitive legal contexts and providing actionable strategies for mitigating risks.
arXiv:2501.00004v1 Announce Type: new Abstract: Information prioritization plays an important role in how humans perceive and understand the world. Homepage layouts serve as a tangible proxy for this prioritization. In this work, we present NewsHomepages, a large dataset of over 3,000 news website homepages (including local, national and topic-specific outlets) captured twice daily over a three-year period. We develop models to perform pairwise comparisons between news items to infer their relative significance. To illustrate that modeling organizational hierarchies has broader implications, we applied our models to rank-order a collection of local city council policies passed over a ten-year period in San Francisco, assessing their "newsworthiness". Our findings lay the groundwork for leveraging implicit organizational cues to deepen our understanding of information prioritization.
arXiv:2501.00032v1 Announce Type: new Abstract: Large language models (LLMs) have transformed the way we think about language understanding and generation, enthralling both researchers and developers. However, deploying LLMs for inference has been a significant challenge due to their unprecedented size and resource requirements. While quantizing model weights to sub-byte precision has emerged as a promising solution to ease memory pressure, the group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. As a result, a higher proportion of compute instructions do not perform multiplies, i.e., real work, rendering them unsuitable for meeting the latency requirements for LLMs deployed on commodity CPUs. In this work, we propose a set of highly optimized kernels to accelerate LLM inference and unleash the full potential of CPUs, particularly Arm CPUs. These kernels amortize the cost of loading the operands and the cost of weight unpacking across multiple output rows. This, along with the introduction of an optimized interleaved group data layout for weights and decompression path optimizations to reduce unnecessary operations and dequantization overhead while maximizing the use of vector and matrix multiply operations, significantly improves the efficiency of MAC operations. Furthermore, we present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions, demonstrating better throughput during token generation while ensuring better quality than the state-of-the-art. Applying these improvements to 4-bit LLMs results in a 3-3.2x improvement in prompt processing and a 2x improvement in autoregressive decoding on Arm CPUs, compared to a LLaMA.cpp-based solution. The optimized kernels are available at https://github.com/ggerganov/llama.cpp.
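As background for the group quantization formats discussed above, here is a minimal sketch of generic symmetric, uniform 4-bit group quantization with one scale per group. It is not the paper's codebook-based method, its interleaved layout, or the llama.cpp storage format; the group size and function names are assumptions for illustration.

```python
def quantize_group(weights, bits=4):
    """Quantize one group to signed integers sharing a single scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed values
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def quantize(weights, group_size=32, bits=4):
    """Split weights into consecutive groups and quantize each independently."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reconstruct approximate weights: each integer times its group's scale."""
    return [v * scale for q, scale in groups for v in q]


weights = [0.12, -0.5, 0.33, 0.07, -0.21, 0.48, -0.02, 0.25]
packed = quantize(weights, group_size=4)
restored = dequantize(packed)
```

The `dequantize` step is the extra work per weight that must happen before any multiply, which is the overhead the abstract says its kernels amortize across multiple output rows.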
arXiv:2501.00048v1 Announce Type: new Abstract: Every year in the United States, 800,000 individuals suffer a stroke - one person every 40 seconds, with a death occurring every four minutes. While individual factors vary, certain predictors are more prevalent in determining stroke risk. As strokes are the second leading cause of death and disability worldwide, predicting stroke likelihood based on lifestyle factors is crucial. Showing individuals their stroke risk could motivate lifestyle changes, and machine learning offers solutions to this prediction challenge. Neural networks excel at predicting outcomes based on training features like lifestyle factors; however, they're not the only option. Logistic regression models can also effectively compute the likelihood of binary outcomes based on independent variables, making them well-suited for stroke prediction. This analysis will compare both neural networks (dense and convolutional) and logistic regression models for stroke prediction, examining their pros, cons, and differences to develop the most effective predictor that minimizes false negatives.
arXiv:2501.00051v1 Announce Type: new Abstract: Digital twin (DT) technology has emerged as a transformative approach to simulate, predict, and optimize the behavior of physical systems, with applications that span manufacturing, healthcare, climate science, and more. However, the development of DT models often faces challenges such as high data requirements, integration complexity, and limited adaptability to dynamic changes in physical systems. This paper presents a new method inspired by dynamic data-driven applications systems (DDDAS), called the dynamic data-driven generative of digital twins framework (DDD-GenDT), which combines the physical system with an LLM, allowing the LLM to act as a DT that interacts with the physical system's operating status and generates the corresponding physical behaviors. We apply DDD-GenDT to the computer numerical control (CNC) machining process, and we use the spindle current measurement data in the NASA milling wear data set as an example to enable LLMs to forecast the physical behavior from historical data and interact with current observations. Experimental results show that in the zero-shot prediction setting, the LLM-based DT can adapt to the change in the system, and the average RMSE of the GPT-4 prediction is 0.479 A, which is 4.79% of the maximum spindle motor current measurement of 10 A, with little training data and instructions required. Furthermore, we analyze the performance of DDD-GenDT in this specific application and its potential to construct digital twins. We also discuss the limitations and challenges that may arise in practical implementations.
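The error figure in the abstract above is a root-mean-square error expressed as a share of the maximum measured current. The short sketch below shows that arithmetic; the sample current values passed to `rmse` are made up for illustration, and only 0.479 A and 10 A come from the abstract.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def percent_of_max(error, max_value):
    """Express an error as a percentage of a reference maximum."""
    return 100.0 * error / max_value


# Made-up spindle currents in amperes, only to exercise the function.
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# Figures quoted in the abstract: RMSE of 0.479 A against a 10 A maximum.
reported = percent_of_max(0.479, 10.0)
```

Here `reported` evaluates to approximately 4.79, matching the percentage stated in the abstract.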
arXiv:2501.00056v1 Announce Type: new Abstract: Air pollution in cities, especially NO₂, is linked to numerous health problems, ranging from mortality to mental health challenges and attention deficits in children. While cities globally have initiated policies to curtail emissions, real-time monitoring remains challenging due to limited environmental sensors and their inconsistent distribution. This gap hinders the creation of adaptive urban policies that respond to the sequence of events and daily activities affecting pollution in cities. Here, we demonstrate how city CCTV cameras can act as pseudo-NO₂ sensors. Using a predictive graph deep model, we utilised traffic flow from London's cameras in addition to environmental and spatial factors, generating NO₂ predictions from over 133 million frames. Our analysis of London's mobility patterns unveiled critical spatiotemporal connections, showing how specific traffic patterns affect NO₂ levels, sometimes with temporal lags of up to 6 hours. For instance, if trucks only drive at night, their effects on NO₂ levels are most likely to be seen in the morning when people commute. These findings cast doubt on the efficacy of some of the urban policies currently being implemented to reduce pollution. By leveraging existing camera infrastructure and our introduced methods, city planners and policymakers could cost-effectively monitor and mitigate the impact of NO₂ and other pollutants.
arXiv:2501.00076v1 Announce Type: new Abstract: The ability to generate and recognize sequential data is fundamental for autonomous systems operating in dynamic environments. Inspired by the key principles of the brain (predictive coding and the Bayesian brain), we propose a novel stochastic Recurrent Neural Network with Parametric Biases (RNNPB). The proposed model incorporates stochasticity into the latent space using the reparameterization trick used in variational autoencoders. This approach enables the model to learn probabilistic representations of multidimensional sequences, capturing uncertainty and enhancing robustness against overfitting. We tested the proposed model on a robotic motion dataset to assess its performance in generating and recognizing temporal patterns. The experimental results showed that the stochastic RNNPB model outperformed its deterministic counterpart in generating and recognizing motion sequences. The results highlighted the proposed model's capability to quantify and adjust uncertainty during both learning and inference. The stochasticity resulted in a continuous latent space representation, facilitating stable motion generation and enhanced generalization when recognizing novel sequences. Our approach provides a biologically inspired framework for modeling temporal patterns and advances the development of robust and adaptable systems in artificial intelligence and robotics.
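The reparameterization trick mentioned above can be sketched in a few lines: a latent sample is written as a deterministic function of a mean, a log-variance, and independent standard normal noise. This is a generic illustration in plain Python, assuming a diagonal Gaussian latent; it is not the RNNPB model itself, and the function name is hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps element-wise, with eps drawn from N(0, 1).

    sigma is recovered from the log-variance as exp(0.5 * log_var).
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


rng = random.Random(0)
z = reparameterize([1.0, -2.0], [0.0, 0.0], rng)  # one sample with unit variance
```

Because the randomness enters only through the noise term, the sample stays a differentiable function of the mean and log-variance, which is what lets a framework with automatic differentiation train those parameters by gradient descent.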