stat.AP

21 posts

arXiv:2501.01276v1 Announce Type: cross Abstract: Marketing mix modeling (MMM) is a widely used method to assess the effectiveness of marketing campaigns and optimize marketing strategies. Bayesian MMM is an advanced approach that allows for the incorporation of prior information, uncertainty quantification, and probabilistic predictions (1). In this paper, we describe the process of building a Bayesian MMM model for the online insurance company Lemonade. We first collected data on Lemonade's marketing activities, such as online advertising, social media, and brand marketing, as well as performance data. We then used a Bayesian framework to estimate the contribution of each marketing channel to total performance, while accounting for various factors such as seasonality, market trends, and macroeconomic indicators. To validate the model, we compared its predictions with the actual performance data from A/B-testing and sliding window holdout data (2). The results showed that the predicted contribution of each marketing channel is aligned with A/B test performance and is actionable. Furthermore, we conducted several scenario analyses using convex optimization to test the sensitivity of the model to different assumptions and to evaluate the impact of changes in the marketing mix on sales. The insights gained from the model allowed Lemonade to adjust their marketing strategy and allocate their budget more effectively. Our case study demonstrates the benefits of using Bayesian MMM for marketing attribution and optimization in a data-driven company like Lemonade. The approach is flexible, interpretable, and can provide valuable insights for decision-making.

Roy Ravid · 1/3/2025
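
As a rough illustration of the modeling core described above, here is a minimal Bayesian regression sketch in PyMC on synthetic weekly data; the channel names, seasonality term, priors, and data are invented for the example, and the paper's actual model (with adstock/saturation effects, trends, and macroeconomic indicators) is more elaborate.

```python
import numpy as np
import pymc as pm

# Toy weekly data: spend per channel and a performance KPI (illustrative only).
rng = np.random.default_rng(0)
n_weeks, channels = 104, ["search", "social", "brand"]
spend = rng.gamma(2.0, 50.0, size=(n_weeks, len(channels)))
week = np.arange(n_weeks)
kpi = 100 + spend @ np.array([0.3, 0.15, 0.05]) \
      + 10 * np.sin(2 * np.pi * week / 52) + rng.normal(0, 5, n_weeks)

with pm.Model() as mmm:
    baseline = pm.Normal("baseline", mu=0, sigma=100)
    # Non-negative channel effects encode the prior belief that spend does not hurt sales.
    beta = pm.HalfNormal("beta", sigma=1.0, shape=len(channels))
    season = pm.Normal("season", mu=0, sigma=10)  # simple yearly seasonality amplitude
    sigma = pm.HalfNormal("sigma", sigma=10)
    mu = baseline + pm.math.dot(spend, beta) + season * np.sin(2 * np.pi * week / 52)
    pm.Normal("kpi", mu=mu, sigma=sigma, observed=kpi)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=1)

# Posterior means of beta estimate each channel's contribution per unit of spend.
print(idata.posterior["beta"].mean(dim=("chain", "draw")).values)
```

Posterior draws of beta, with their uncertainty, are what channel-contribution readouts and a downstream budget optimization would be built on.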

arXiv:2501.00119v1 Announce Type: cross Abstract: A/B tests, also known as randomized controlled experiments (RCTs), are the gold standard for evaluating the impact of new policies, products, or decisions. However, these tests can be costly in terms of time and resources, potentially exposing users, customers, or other test subjects (units) to inferior options. This paper explores practical considerations in applying methodologies inspired by "synthetic control" as an alternative to traditional A/B testing in settings with very large numbers of units, involving up to hundreds of millions of units, which is common in modern applications such as e-commerce and ride-sharing platforms. This method is particularly valuable in settings where the treatment affects only a subset of units, leaving many units unaffected. In these scenarios, synthetic control methods leverage data from unaffected units to estimate counterfactual outcomes for treated units. After the treatment is implemented, these estimates can be compared to actual outcomes to measure the treatment effect. A key challenge in creating accurate counterfactual outcomes is interpolation bias, a well-documented phenomenon that occurs when control units differ significantly from treated units. To address this, we propose a two-phase approach: first using nearest neighbor matching based on unit covariates to select similar control units, then applying supervised learning methods suitable for high-dimensional data to estimate counterfactual outcomes. Testing using six large-scale experiments demonstrates that this approach successfully improves estimate accuracy. However, our analysis reveals that machine learning bias -- which arises from methods that trade off bias for variance reduction -- can impact results and affect conclusions about treatment effects. We document this bias in large-scale experimental settings and propose effective de-biasing techniques to address this challenge.

Shima Nassiri, Mohsen Bayati, Joe Cooprider · 1/3/2025
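
A minimal sketch of the two-phase idea on synthetic data: control units are first selected by nearest-neighbor matching on covariates, then a supervised model fit on those controls predicts counterfactual outcomes for the treated units. All variables and model choices are illustrative, and the de-biasing step the paper proposes is omitted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import NearestNeighbors

# Toy data: unit covariates X, post-treatment outcomes y_post, and a small treated subset.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 10))
treated = rng.random(n) < 0.05
y_post = X @ rng.normal(size=10) + rng.normal(0, 1, n) + 0.5 * treated  # true effect = 0.5

# Phase 1: for each treated unit, keep the most similar control units by covariate distance.
nn = NearestNeighbors(n_neighbors=50).fit(X[~treated])
_, idx = nn.kneighbors(X[treated])
matched = np.unique(idx.ravel())

# Phase 2: learn outcome ~ covariates on the matched controls, then predict counterfactuals.
model = GradientBoostingRegressor().fit(X[~treated][matched], y_post[~treated][matched])
y_counterfactual = model.predict(X[treated])

# Treatment effect estimate: observed minus predicted counterfactual (before any de-biasing).
print("estimated effect:", np.mean(y_post[treated] - y_counterfactual))
```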

arXiv:2501.00382v1 Announce Type: cross Abstract: This paper advances empirical demand analysis by integrating multimodal product representations derived from artificial intelligence (AI). Using a detailed dataset of toy cars on \textit{Amazon.com}, we combine text descriptions, images, and tabular covariates to represent each product using transformer-based embedding models. These embeddings capture nuanced attributes, such as quality, branding, and visual characteristics, that traditional methods often struggle to summarize. Moreover, we fine-tune these embeddings for causal inference tasks. We show that the resulting embeddings substantially improve the predictive accuracy of sales ranks and prices and that they lead to more credible causal estimates of price elasticity. Notably, we uncover strong heterogeneity in price elasticity driven by these product-specific features. Our findings illustrate that AI-driven representations can enrich and modernize empirical demand analysis. The insights generated may also prove valuable for applied causal inference more broadly.

Philipp Bach, Victor Chernozhukov, Sven Klaassen, Martin Spindler, Jan Teichert-Kluge, Suhas Vijaykumar · 1/3/2025
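
A hedged sketch of how such representations can feed a partialling-out (double machine learning style) estimate of price elasticity; here the "embeddings" are random placeholders for the transformer-derived product features, and the data are simulated with a known elasticity of -1.5 so the residual-on-residual regression has a target to recover.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Z: product embeddings plus tabular covariates (placeholder), p: log price, q: log sales.
rng = np.random.default_rng(0)
n, d = 2000, 32
Z = rng.normal(size=(n, d))
p = Z @ rng.normal(0, 0.2, d) + rng.normal(0, 0.5, n)
q = -1.5 * p + Z @ rng.normal(0, 0.3, d) + rng.normal(0, 0.5, n)  # true elasticity -1.5

# Residualize outcome and price on the product features with cross-fitted predictions,
# then regress residual on residual to estimate the elasticity.
q_res = q - cross_val_predict(RandomForestRegressor(n_estimators=100), Z, q, cv=5)
p_res = p - cross_val_predict(RandomForestRegressor(n_estimators=100), Z, p, cv=5)
elasticity = LinearRegression().fit(p_res.reshape(-1, 1), q_res).coef_[0]
print("estimated price elasticity:", round(float(elasticity), 2))
```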

arXiv:2501.01061v1 Announce Type: cross Abstract: The nature of modern data is increasingly real-time, making outlier detection crucial in any data-related field, such as finance for fraud detection and healthcare for monitoring patient vitals. Traditional outlier detection methods, such as the Local Outlier Factor (LOF) algorithm, struggle with real-time data due to the need for extensive recalculations with each new data point, limiting their application in real-time environments. While the Incremental LOF (ILOF) algorithm has been developed to tackle the challenges of online anomaly detection, it remains computationally expensive when processing large streams of data points, and its detection performance may degrade after a certain threshold of points have streamed in. In this paper, we propose a novel approach to enhance the efficiency of LOF algorithms for online anomaly detection, named the Efficient Incremental LOF (EILOF) algorithm. The EILOF algorithm only computes the LOF scores of new points without altering the LOF scores of existing data points. Although the new algorithm does not maintain exact LOF scores for the existing points, datasets often contain noise, and minor deviations in LOF score calculations do not necessarily degrade detection performance. In fact, such deviations can sometimes enhance outlier detection. We systematically tested this approach on both simulated and real-world datasets, demonstrating that EILOF outperforms ILOF as the volume of streaming data increases across various scenarios. The EILOF algorithm not only significantly reduces computational costs, but also systematically improves detection accuracy over ILOF as the number of streamed points increases.

Rui Hu, Luc (Zhilu) Chen, Yiwei Wang · 1/3/2025
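
EILOF itself is not in standard libraries; as a crude stand-in for its one-way update idea, scikit-learn's LOF in novelty mode scores incoming stream points against a fixed reference set without recomputing the reference scores (the actual EILOF update rule differs).

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 2))                   # points already seen in the stream
stream = np.vstack([rng.normal(size=(50, 2)),            # new inliers
                    rng.normal(6.0, 0.5, size=(5, 2))])  # new outliers far from the bulk

# novelty=True lets us score unseen points against the fixed reference set,
# leaving the existing scores untouched.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(reference)
scores = -lof.score_samples(stream)  # higher = more anomalous

for point, score in zip(stream[-5:], scores[-5:]):
    print(point, round(float(score), 2))
```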

arXiv:2501.00555v1 Announce Type: new Abstract: Large language models (LLMs) are empowering decision-making in several applications, including tool or API usage and answering multiple-choice questions (MCQs). However, they often make overconfident, incorrect predictions, which can be risky in high-stakes settings like healthcare and finance. To mitigate these risks, recent works have used conformal prediction (CP), a model-agnostic framework for distribution-free uncertainty quantification. CP transforms a \emph{score function} into prediction sets that contain the true answer with high probability. While CP provides this coverage guarantee for arbitrary scores, the score quality significantly impacts prediction set sizes. Prior works have relied on LLM logits or other heuristic scores, lacking quality guarantees. We address this limitation by introducing CP-OPT, an optimization framework to learn scores that minimize set sizes while maintaining coverage. Furthermore, inspired by the Monty Hall problem, we extend CP's utility beyond uncertainty quantification to improve accuracy. We propose \emph{conformal revision of questions} (CROQ) to revise the problem by narrowing down the available choices to those in the prediction set. The coverage guarantee of CP ensures that the correct choice is in the revised question prompt with high probability, while the smaller number of choices increases the LLM's chances of answering it correctly. Experiments on MMLU, ToolAlpaca, and TruthfulQA datasets with Gemma-2, Llama-3 and Phi-3 models show that CP-OPT significantly reduces set sizes while maintaining coverage, and CROQ improves accuracy over the standard inference, especially when paired with CP-OPT scores. Together, CP-OPT and CROQ offer a robust framework for improving both the safety and accuracy of LLM-driven decision-making.

Harit Vishwakarma, Alan Mishler, Thomas Cook, Niccolò Dalmasso, Natraj Raman, Sumitra Ganesh · 1/3/2025
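
A minimal split-conformal sketch of the prediction-set construction that CROQ builds on, using a heuristic 1 - softmax score of the kind CP-OPT would replace with a learned one; all numbers are synthetic.

```python
import numpy as np

# Toy setup: softmax probabilities over k answer choices for calibration and test questions.
rng = np.random.default_rng(0)
n_cal, n_test, k = 500, 3, 4
logits_cal, logits_test = rng.normal(size=(n_cal, k)), rng.normal(size=(n_test, k))
probs_cal = np.exp(logits_cal) / np.exp(logits_cal).sum(1, keepdims=True)
probs_test = np.exp(logits_test) / np.exp(logits_test).sum(1, keepdims=True)
y_cal = rng.integers(0, k, n_cal)  # true answers on the calibration questions

alpha = 0.1  # target 90% coverage
# Nonconformity score: 1 - probability assigned to the true answer.
cal_scores = 1.0 - probs_cal[np.arange(n_cal), y_cal]
qhat = np.quantile(cal_scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal, method="higher")

# Prediction set: every choice whose score falls below the calibrated threshold.
for p in probs_test:
    pred_set = [j for j in range(k) if 1.0 - p[j] <= qhat]
    print(pred_set)  # CROQ would re-ask the question restricted to these choices
```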

arXiv:2501.01273v1 Announce Type: new Abstract: Large Language Models (LLMs) have recently emerged, attracting considerable attention due to their ability to generate highly natural, human-like text. This study compares the latent community structures of LLM-generated text and human-written text within a hypothesis testing procedure. Specifically, we analyze three text sets: original human-written texts ($\mathcal{O}$), their LLM-paraphrased versions ($\mathcal{G}$), and a twice-paraphrased set ($\mathcal{S}$) derived from $\mathcal{G}$. Our analysis addresses two key questions: (1) Is the difference in latent community structures between $\mathcal{O}$ and $\mathcal{G}$ the same as that between $\mathcal{G}$ and $\mathcal{S}$? (2) Does $\mathcal{G}$ become more similar to $\mathcal{O}$ as the LLM parameter controlling text variability is adjusted? The first question is based on the assumption that if LLM-generated text truly resembles human language, then the gap between the pair ($\mathcal{O}$, $\mathcal{G}$) should be similar to that between the pair ($\mathcal{G}$, $\mathcal{S}$), as both pairs consist of an original text and its paraphrase. The second question examines whether the degree of similarity between LLM-generated and human text varies with changes in the breadth of text generation. To address these questions, we propose a statistical hypothesis testing framework that leverages the fact that each text has corresponding parts across all datasets due to their paraphrasing relationship. This relationship enables the mapping of one dataset's relative position to another, allowing two datasets to be mapped to a third dataset. As a result, both mapped datasets can be quantified with respect to the space characterized by the third dataset, facilitating a direct comparison between them. Our results indicate that GPT-generated text remains distinct from human-authored text.

Mose Park, Yunjin Choi, Jong-June Jeon · 1/3/2025

arXiv:2501.00087v1 Announce Type: cross Abstract: We investigate the parameter recovery of Markov-switching ordinary differential processes from discrete observations, where the differential equations are nonlinear additive models. This framework has been widely applied in biological systems, control systems, and other domains; however, limited research has been conducted on reconstructing the generating processes from observations. Moreover, many physical systems, such as the human brain, cannot be experimented upon directly, so the underlying system must be inferred from observations. To address this gap, this manuscript presents a comprehensive study of the model, encompassing algorithm design, optimization guarantees, and quantification of statistical errors. Specifically, we develop a two-stage algorithm that first recovers the continuous sample path from discrete samples and then estimates the parameters of the processes. We provide novel theoretical insights into the statistical error and linear convergence guarantee when the processes are $\beta$-mixing. Our analysis is based on the truncation of the latent posterior processes and demonstrates that the truncated processes approximate the true processes under mixing conditions. We apply this model to investigate the differences in resting-state brain networks between the ADHD group and normal controls, revealing differences in the transition rate matrices of the two groups.

Katherine Tsai, Mladen Kolar, Sanmi Koyejo · 1/3/2025

arXiv:2412.17852v1 Announce Type: cross Abstract: In this paper, we present a high-performance, compact electrocardiogram (ECG)-based system for automatic classification of arrhythmias, integrating machine learning approaches to achieve robust cardiac diagnostics. Our method combines a compact artificial neural network with feature enhancement techniques, including mathematical transformations, signal analysis and data extraction algorithms, to capture both morphological and time-frequency features from ECG signals. A novel aspect of this work is the addition of 17 newly engineered features, which complement the algorithm's capability to extract significant data and physiological patterns from the ECG signal. This combination enables the classifier to detect multiple arrhythmia types, such as atrial fibrillation, sinus tachycardia, ventricular flutter, and other common arrhythmic disorders. The system achieves an accuracy of 97.36% on the MIT-BIH arrhythmia database, with lower complexity than state-of-the-art models. This compact tool shows potential for clinical deployment, as well as adaptation for portable devices in long-term cardiac health monitoring applications.

Mateo Frausto-Avila, Jose Pablo Manriquez-Amavizca, Alfred U'Ren, Mario A. Quiroz-Juarez · 12/25/2024

arXiv:2110.09823v5 Announce Type: replace Abstract: Temporal point processes, stochastic processes on the continuous domain of time, are commonly used to model asynchronous event sequences characterized by occurrence timestamps. Thanks to their strong expressivity, deep neural networks are emerging as a promising choice for capturing the patterns in asynchronous sequences within the temporal point process framework. In this paper, we first review recent research emphases and difficulties in modeling asynchronous event sequences with deep temporal point processes, which can be grouped into four areas: encoding of the history sequence, formulation of the conditional intensity function, relational discovery of events, and learning approaches for optimization. We introduce most of the recently proposed models by decomposing them into these four parts, and conduct experiments by re-modularizing the first three parts with the same learning strategy for a fair empirical evaluation. In addition, we extend the family of history encoders and conditional intensity functions, and propose a Granger causality discovery framework for exploiting the relations among multiple event types. Because Granger causality can be represented by a Granger causality graph, discrete graph structure learning in the framework of variational inference is employed to reveal latent structures of the graph. Further experiments show that the proposed framework with latent graph discovery can both capture the relations and achieve improved fitting and prediction performance.

Haitao Lin, Cheng Tan, Lirong Wu, Zhangyang Gao, Zicheng Liu, Stan Z. Li · 12/25/2024
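
For orientation, the conditional intensity function that the second of these four parts concerns enters the standard log-likelihood of a single-type temporal point process observed on $[0, T]$:

$\log p(\{t_i\}_{i=1}^{N}) = \sum_{i=1}^{N} \log \lambda^{*}(t_i) - \int_{0}^{T} \lambda^{*}(t) \, dt,$

where $\lambda^{*}(t) = \lambda(t \mid \mathcal{H}_t)$ conditions on the history $\mathcal{H}_t$ produced by the history encoder; multi-type models sum an analogous term over event types.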

arXiv:2412.17988v1 Announce Type: new Abstract: We implement a network-based approach to study expertise in a complex real-world task: operating particle accelerators. Most real-world tasks we learn and perform (e.g., driving cars, operating complex machines, solving mathematical problems) are difficult to learn because they are complex, and the best strategies are difficult to find from many possibilities. However, how we learn such complex tasks remains a partially solved mystery, as we cannot explain how the strategies evolve with practice due to the difficulties of collecting and modeling complex behavioral data. As complex tasks are generally networks of many elementary subtasks, we model task performance as networks or graphs of subtasks and investigate how the networks change with expertise. We develop the networks by processing the text in a large archive of operator logs from 14 years of operations using natural language processing and machine learning. The network changes are examined using a set of measures at four levels of granularity - individual subtasks, interconnections among subtasks, groups of subtasks, and the whole complex task. We find that the operators consistently change with expertise at the subtask, the interconnection, and the whole-task levels, but they show remarkable similarity in how subtasks are grouped. These results indicate that operators at all stages of expertise adopt a common divide-and-conquer approach by breaking the complex task into parts of manageable complexity, but they differ in the frequency and structure of nested subtasks. Operational logs are common data sources from real-world settings where people collaborate with hardware and software environments to execute complex tasks, and the network models investigated in this study can be expanded to accommodate multi-modal data. Therefore, our network-based approach provides a practical way to investigate expertise in the real world.

Roussel Rahman, Jane Shtalenkova, Aashwin Ananda Mishra, Wan-Lin Hu · 12/25/2024
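
A small illustration of the representation, with obviously invented subtask labels: sessions become directed graphs of subtask transitions, and standard graph measures probe the four levels of granularity (networkx is assumed; the NLP pipeline that extracts subtasks from log text is not shown).

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy operator-log sessions, already reduced to subtask labels (purely illustrative).
sessions = [
    ["tune_rf", "check_beam", "adjust_magnet", "check_beam"],
    ["check_beam", "adjust_magnet", "log_fault", "tune_rf"],
]

# The whole task as a directed graph: nodes are subtasks, edges are observed transitions.
G = nx.DiGraph()
for s in sessions:
    for a, b in zip(s, s[1:]):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# Measures at the four granularities: subtasks, interconnections, groups, whole task.
print(nx.degree_centrality(G))
print(sorted(G.edges(data=True)))
print(list(greedy_modularity_communities(G.to_undirected())))
print(nx.density(G))
```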

arXiv:2407.15900v3 Announce Type: replace Abstract: Accurate forecasts of extreme wind speeds are of high importance for many applications. Such forecasts are usually generated by ensembles of numerical weather prediction (NWP) models, which however can be biased and have errors in dispersion, thus necessitating the application of statistical post-processing techniques. In this work we aim to improve statistical post-processing models for probabilistic predictions of extreme wind speeds. We do this by adjusting the training procedure used to fit ensemble model output statistics (EMOS) models - a commonly applied post-processing technique - and propose estimating parameters using the so-called threshold-weighted continuous ranked probability score (twCRPS), a proper scoring rule that places special emphasis on predictions over a threshold. We show that training using the twCRPS leads to improved extreme event performance of post-processing models for a variety of thresholds. We find a distribution body-tail trade-off where improved performance for probabilistic predictions of extreme events comes with worse performance for predictions of the distribution body. However, we introduce strategies to mitigate this trade-off based on weighted training and linear pooling. Finally, we consider some synthetic experiments to explain the training impact of the twCRPS and derive closed-form expressions of the twCRPS for a number of distributions, giving the first such collection in the literature. The results will enable researchers and practitioners alike to improve the performance of probabilistic forecasting models for extremes and other events of interest.

Jakob Benjamin Wessel, Christopher A. T. Ferro, Gavin R. Evans, Frank Kwasniok · 12/24/2024
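
For reference, a standard form of the threshold-weighted CRPS with an indicator weight above a threshold $t$ (the paper may use more general weight functions):

$\mathrm{twCRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(z) - \mathbf{1}\{y \le z\} \right)^2 w(z) \, dz, \qquad w(z) = \mathbf{1}\{z \ge t\},$

which reduces to the ordinary CRPS when $w \equiv 1$ and concentrates the score on forecast behaviour above the threshold, the mechanism behind the body-tail trade-off described above.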

arXiv:2412.16333v1 Announce Type: new Abstract: As several studies have shown, predicting credit risk is still a major concern for the financial services industry and is receiving a lot of scholarly interest. This area of study is crucial because it aids financial organizations in determining the probability that borrowers would default, which has a direct bearing on lending choices and risk management tactics. Despite the progress made in this domain, there is still a substantial knowledge gap concerning consumer actions that take place prior to the filing of credit card applications. The objective of this study is to predict customer responses to mail campaigns and assess the likelihood of default among those who engage. This research employs advanced machine learning techniques, specifically logistic regression and XGBoost, to analyze consumer behavior and predict responses to direct mail campaigns. By integrating different data preprocessing strategies, including imputation and binning, we enhance the robustness and accuracy of our predictive models. The results indicate that XGBoost consistently outperforms logistic regression across various metrics, particularly in scenarios using categorical binning and custom imputation. These findings suggest that XGBoost is particularly effective in handling complex data structures and provides a strong predictive capability in assessing credit risk.

Sahar Yarmohammadtoosky, Dinesh Chowdary Attota · 12/24/2024
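
A hedged sketch of the kind of comparison described, on synthetic data with injected missingness: logistic regression behind an imputation-plus-binning pipeline versus XGBoost, which handles missing values natively. The study's actual variables, feature engineering, and tuning are not reproduced.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from xgboost import XGBClassifier

# Toy campaign-response data with missing values (stand-in for the study's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
X[rng.random(X.shape) < 0.1] = np.nan
y = (np.nan_to_num(X[:, 0]) + 0.5 * np.nan_to_num(X[:, 1]) + rng.normal(0, 1, 5000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Logistic regression needs explicit imputation (here followed by quantile binning).
logit = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("bin", KBinsDiscretizer(n_bins=5, encode="onehot", strategy="quantile")),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_tr, y_tr)

# XGBoost routes missing values through learned default split directions.
xgb = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss").fit(X_tr, y_tr)

print("logistic regression AUC:", roc_auc_score(y_te, logit.predict_proba(X_te)[:, 1]))
print("XGBoost AUC:", roc_auc_score(y_te, xgb.predict_proba(X_te)[:, 1]))
```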

arXiv:2412.16406v1 Announce Type: new Abstract: Disease progression models are widely used to inform the diagnosis and treatment of many progressive diseases. However, a significant limitation of existing models is that they do not account for health disparities that can bias the observed data. To address this, we develop an interpretable Bayesian disease progression model that captures three key health disparities: certain patient populations may (1) start receiving care only when their disease is more severe, (2) experience faster disease progression even while receiving care, or (3) receive follow-up care less frequently conditional on disease severity. We show theoretically and empirically that failing to account for disparities produces biased estimates of severity (underestimating severity for disadvantaged groups, for example). On a dataset of heart failure patients, we show that our model can identify groups that face each type of health disparity, and that accounting for these disparities meaningfully shifts which patients are considered high-risk.

Erica Chiang, Divya Shanmugam, Ashley N. Beecy, Gabriel Sayer, Nir Uriel, Deborah Estrin, Nikhil Garg, Emma Pierson · 12/24/2024

arXiv:2412.17753v1 Announce Type: cross Abstract: This study investigates an asymptotically minimax optimal algorithm in the two-armed fixed-budget best-arm identification (BAI) problem. Given two treatment arms, the objective is to identify the arm with the highest expected outcome through an adaptive experiment. We focus on the Neyman allocation, where treatment arms are allocated following the ratio of their outcome standard deviations. Our primary contribution is to prove the minimax optimality of the Neyman allocation for the simple regret, defined as the difference between the expected outcomes of the true best arm and the estimated best arm. Specifically, we first derive a minimax lower bound for the expected simple regret, which characterizes the worst-case performance achievable under the location-shift distributions, including Gaussian distributions. We then show that the simple regret of the Neyman allocation asymptotically matches this lower bound, including the constant term, not just the rate in terms of the sample size, under the worst-case distribution. Notably, our optimality result holds without imposing locality restrictions on the distribution, such as the local asymptotic normality. Furthermore, we demonstrate that the Neyman allocation reduces to the uniform allocation, i.e., the standard randomized controlled trial, under Bernoulli distributions.

Masahiro Kato · 12/24/2024
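
The Neyman allocation referred to here is the classical rule that splits the sample in proportion to the arms' outcome standard deviations, written in its standard two-armed form (the adaptive version estimates these standard deviations during the experiment):

$\frac{n_1}{n_1 + n_2} = \frac{\sigma_1}{\sigma_1 + \sigma_2},$

so equal standard deviations recover the uniform 50/50 split of a standard randomized controlled trial.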

arXiv:2301.02505v4 Announce Type: replace Abstract: Cyber-systems are under near-constant threat from intrusion attempts. Attacks types vary, but each attempt typically has a specific underlying intent, and the perpetrators are typically groups of individuals with similar objectives. Clustering attacks appearing to share a common intent is very valuable to threat-hunting experts. This article explores Dirichlet distribution topic models for clustering terminal session commands collected from honeypots, which are special network hosts designed to entice malicious attackers. The main practical implications of clustering the sessions are two-fold: finding similar groups of attacks, and identifying outliers. A range of statistical models are considered, adapted to the structures of command-line syntax. In particular, concepts of primary and secondary topics, and then session-level and command-level topics, are introduced into the models to improve interpretability. The proposed methods are further extended in a Bayesian nonparametric fashion to allow unboundedness in the vocabulary size and the number of latent intents. The methods are shown to discover an unusual MIRAI variant which attempts to take over existing cryptocurrency coin-mining infrastructure, not detected by traditional topic-modelling approaches.

Francesco Sanna Passino, Anastasia Mantziou, Daniyar Ghani, Philip Thiede, Ross Bevington, Nicholas A. Heard · 12/24/2024
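
The paper's models are bespoke (primary/secondary and session/command-level topics, with Bayesian nonparametric extensions), but the basic ingredients resemble ordinary topic modeling with sessions as documents of command tokens; a toy baseline with invented sessions:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Each honeypot session is treated as one "document" of shell commands (toy examples).
sessions = [
    "wget http://x/mirai.sh; chmod +x mirai.sh; ./mirai.sh",
    "cat /proc/cpuinfo; uname -a; nproc",
    "wget http://y/miner; ./miner --pool stratum+tcp://pool:3333",
    "uname -a; cat /etc/passwd; ls -la /root",
]

# Keep command-line-ish tokens (paths, flags, URLs) rather than plain words.
vec = CountVectorizer(token_pattern=r"[^\s;]+")
X = vec.fit_transform(sessions)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"intent {k}: {top}")
```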

arXiv:2312.14196v4 Announce Type: replace Abstract: A key strategy in societal adaptation to climate change is using alert systems to prompt preventative action and reduce the adverse health impacts of extreme heat events. This paper implements and evaluates reinforcement learning (RL) as a tool to optimize the effectiveness of such systems. Our contributions are threefold. First, we introduce a new publicly available RL environment enabling the evaluation of the effectiveness of heat alert policies to reduce heat-related hospitalizations. The reward model is trained on a comprehensive dataset of historical weather, Medicare health records, and socioeconomic/geographic features. We use scalable Bayesian techniques tailored to the low-signal effects and spatial heterogeneity present in the data. The transition model uses real historical weather patterns enriched by a data augmentation mechanism based on climate region similarity. Second, we use this environment to evaluate standard RL algorithms in the context of heat alert issuance. Our analysis shows that policy constraints are needed to improve RL's initially poor performance. Third, a post-hoc contrastive analysis provides insight into scenarios where our modified heat alert-RL policies yield significant gains/losses over the current National Weather Service alert policy in the United States.

Ellen M. Considine, Rachel C. Nethery, Gregory A. Wellenius, Francesca Dominici, Mauricio Tec · 12/23/2024

arXiv:2412.16137v1 Announce Type: new Abstract: Robust and fine localization algorithms are crucial for autonomous driving. For the production of such vehicles as a commodity, affordable sensing solutions and reliable localization algorithms must be designed. This work considers scenarios where the sensor data comes from images captured by an inexpensive camera mounted on the vehicle and where the vehicle contains a fine global map. Such localization algorithms typically involve finding the section in the global map that best matches the captured image. In harsh environments, both the global map and the captured image can be noisy. Because of physical constraints on camera placement, the image captured by the camera can be viewed as a noisy perspective transformed version of the road in the global map. Thus, an optimal algorithm should take into account the unequal noise power in various regions of the captured image, and the intrinsic uncertainty in the global map due to environmental variations. This article briefly reviews two matching methods: (i) standard inner product (SIP) and (ii) normalized mutual information (NMI). It then proposes novel and principled modifications to improve the performance of these algorithms significantly in noisy environments. These enhancements are inspired by the physical constraints associated with autonomous vehicles. They are grounded in statistical signal processing and, in some contexts, are provably better. Numerical simulations demonstrate the effectiveness of such modifications.

Vishnu Teja Kunde, Jean-Francois Chamberland, Siddharth Agarwal · 12/23/2024
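
A simplified sketch of the NMI baseline mentioned above: a noisy "camera" patch is matched against sliding windows of a toy global map by normalized mutual information over quantized intensities (the perspective transform and spatially varying noise that motivate the paper's modifications are ignored).

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
global_map = rng.integers(0, 256, size=(64, 512))  # toy strip of a global map
true_offset = 300
patch = global_map[:, true_offset:true_offset + 64] + rng.normal(0, 20, (64, 64))  # noisy view

def nmi_score(a, b, bins=16):
    # Quantize intensities so NMI over discrete labels is well defined.
    qa = np.digitize(a.ravel(), np.linspace(a.min(), a.max(), bins))
    qb = np.digitize(b.ravel(), np.linspace(b.min(), b.max(), bins))
    return normalized_mutual_info_score(qa, qb)

# Score every candidate window and pick the best-matching offset.
scores = [nmi_score(patch, global_map[:, off:off + 64]) for off in range(512 - 64)]
print("best match offset:", int(np.argmax(scores)), "| true offset:", true_offset)
```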

arXiv:2412.15226v1 Announce Type: new Abstract: This study investigates the potential of using ChatGPT as a teachable agent to support students' learning by teaching process, specifically in programming education. While learning by teaching is an effective pedagogical strategy for promoting active learning, traditional teachable agents have limitations, particularly in facilitating natural language dialogue. Our research explored whether ChatGPT, with its ability to engage learners in natural conversations, can support this process. The findings reveal that interacting with ChatGPT improves students' knowledge gains and programming abilities, particularly in writing readable and logically sound code. However, it had limited impact on developing learners' error-correction skills, likely because ChatGPT tends to generate correct code, reducing opportunities for students to practice debugging. Additionally, students' self-regulated learning (SRL) abilities improved, suggesting that teaching ChatGPT fosters learners' higher self-efficacy and better implementation of SRL strategies. This study discusses the role of natural dialogue in fostering socialized learning by teaching, and explores ChatGPT's specific contributions in supporting students' SRL through the learning by teaching process. Overall, the study highlights ChatGPT's potential as a teachable agent, offering insights for future research on ChatGPT-supported education.

Angxuan Chen, Yuang Wei, Huixiao Le, Yan Zhang · 12/23/2024

arXiv:2412.15433v1 Announce Type: new Abstract: We present a quantitative model for tracking dangerous AI capabilities over time. Our goal is to help the policy and research community visualise how dangerous capability testing can give us an early warning about approaching AI risks. We first use the model to provide a novel introduction to dangerous capability testing and how this testing can directly inform policy. Decision makers in AI labs and government often set policy that is sensitive to the estimated danger of AI systems, and may wish to set policies that condition on the crossing of a set threshold for danger. The model helps us to reason about these policy choices. We then run simulations to illustrate how we might fail to test for dangerous capabilities. To summarise, failures in dangerous capability testing may manifest in two ways: higher bias in our estimates of AI danger, or larger lags in threshold monitoring. We highlight two drivers of these failure modes: uncertainty around dynamics in AI capabilities and competition between frontier AI labs. Effective AI policy demands that we address these failure modes and their drivers. Even if the optimal targeting of resources is challenging, we show how delays in testing can harm AI policy. We offer preliminary recommendations for building an effective testing ecosystem for dangerous capabilities and advise on a research agenda.

Paolo Bova, Alessandro Di Stefano, The Anh Han · 12/23/2024

arXiv:2412.14222v1 Announce Type: cross Abstract: In recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM-based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM-based frameworks, detailing essential features such as planning, reasoning, reflection, multi-agent collaboration, user interface, knowledge integration, and system design, which enable agents to address data-centric problems with minimal human intervention. Furthermore, we analyze several case studies to demonstrate the practical applications of various data agents in real-world scenarios. Finally, we identify key challenges and propose future research directions to advance the development of data agents into intelligent statistical analysis software.

Maojun Sun, Ruijian Han, Binyan Jiang, Houduo Qi, Defeng Sun, Yancheng Yuan, Jian Huang · 12/23/2024