stat.ME

61 posts

arXiv:2501.10625v1 Announce Type: new Abstract: The Markov property serves as a foundational assumption in most existing work on vehicle driving behavior, positing that future states depend solely on the current state, not the series of preceding states. This study validates the Markov properties of vehicle trajectories for both Autonomous Vehicles (AVs) and Human-driven Vehicles (HVs). A statistical method for testing whether time series data exhibit the Markov property is applied to the trajectory data, and the t-test and F-test are additionally introduced to characterize the differences in Markov properties between AVs and HVs. Based on two public trajectory datasets, we investigate the presence and order of the Markov property of different types of vehicles through rigorous statistical tests. Our findings reveal that AV trajectories generally exhibit stronger Markov properties than HV trajectories, with a higher percentage conforming to the Markov property and lower Markov orders. In contrast, HV trajectories display greater variability and heterogeneity in decision-making processes, reflecting the complex perception and information processing involved in human driving. These results have significant implications for the development of driving behavior models, AV controllers, and traffic simulation systems. Our study also demonstrates the feasibility of using statistical methods to test for the presence of Markov properties in driving trajectory data.

Zheng Li, Haoming Meng, Chengyuan Ma, Ke Ma, Xiaopeng Li (1/22/2025)
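
As a concrete illustration of the kind of test the abstract refers to, here is a minimal sketch of the classical likelihood-ratio test of a first-order versus second-order Markov chain on a discretized state sequence. The paper's exact test statistic and the discretization of trajectories into states are not given in the abstract, so the states and counts below are assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2

def lr_test_markov_order1(seq, n_states):
    """Likelihood-ratio test of H0: first-order vs. H1: second-order Markov chain."""
    # Count second-order transitions n[i, j, k]: (x_{t-2}=i, x_{t-1}=j, x_t=k)
    n = np.zeros((n_states,) * 3)
    for a, b, c in zip(seq, seq[1:], seq[2:]):
        n[a, b, c] += 1
    n_jk = n.sum(axis=0)          # first-order transition counts n[j, k]
    n_ij = n.sum(axis=2)          # pair counts n[i, j]
    n_j = n_jk.sum(axis=1)        # state counts n[j]
    g = 0.0
    for i in range(n_states):
        for j in range(n_states):
            for k in range(n_states):
                if n[i, j, k] > 0:
                    p2 = n[i, j, k] / n_ij[i, j]   # estimated P(k | i, j)
                    p1 = n_jk[j, k] / n_j[j]       # estimated P(k | j)
                    g += 2 * n[i, j, k] * np.log(p2 / p1)
    df = n_states * (n_states - 1) ** 2
    return g, chi2.sf(g, df)

# Example: a genuinely first-order chain should not reject H0
rng = np.random.default_rng(0)
P = np.array([[0.8, 0.2], [0.3, 0.7]])
x = [0]
for _ in range(5000):
    x.append(rng.choice(2, p=P[x[-1]]))
g, p = lr_test_markov_order1(x, 2)
print(f"G = {g:.2f}, p-value = {p:.3f}")
```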

arXiv:2501.10440v1 Announce Type: cross Abstract: This study investigates the performance of median-of-means sampling compared to traditional mean-of-means sampling for computing the Keister function integral using Randomized Quasi-Monte Carlo (RQMC) methods. The research tests both lattice points and digital nets as point distributions across dimensions 2, 3, 5, and 8, with sample sizes ranging from 2^8 to 2^19 points. Results demonstrate that median-of-means sampling consistently outperforms mean-of-means for sample sizes larger than 10^3 points, while mean-of-means shows better accuracy with smaller sample sizes, particularly for digital nets. The study also confirms previous theoretical predictions about median-of-means' superior performance with larger sample sizes and reflects the known challenges of maintaining accuracy in higher-dimensional integration. These findings support recent research suggesting median-of-means as a promising alternative to traditional sampling methods in numerical integration, though limitations in sample size and dimensionality warrant further investigation with different test functions and larger parameter spaces.

Bocheng Zhang (1/22/2025)
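
A minimal sketch of median-of-means versus mean-of-means under RQMC, using SciPy's scrambled Sobol' digital nets and the standard Gaussian change of variables for the Keister integrand. The dimensions, sample sizes, and replication counts below are illustrative assumptions, not the study's exact configuration.

```python
import numpy as np
from scipy.stats import norm, qmc

def keister(u):
    """Keister integrand mapped to the unit cube via a Gaussian change of variables."""
    x = norm.ppf(u) / np.sqrt(2.0)
    d = u.shape[1]
    return np.pi ** (d / 2) * np.cos(np.linalg.norm(x, axis=1))

def rqmc_estimates(d, m, n_rand, seed=0):
    """One RQMC sample mean per independent scrambling of a Sobol' digital net."""
    means = []
    for r in range(n_rand):
        sampler = qmc.Sobol(d=d, scramble=True, seed=seed + r)
        u = sampler.random_base2(m=m)          # 2^m points
        means.append(keister(u).mean())
    return np.array(means)

est = rqmc_estimates(d=3, m=10, n_rand=15)
print("mean-of-means:  ", est.mean())
print("median-of-means:", np.median(est))     # robust to occasional bad randomizations
```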

arXiv:2411.19223v4 Announce Type: replace Abstract: We propose a rigorous decomposition of predictive error, highlighting that not all 'irreducible' error is genuinely immutable. Many domains stand to benefit from iterative enhancements in measurement, construct validity, and modeling. Our approach demonstrates how apparently 'unpredictable' outcomes can become more tractable with improved data (across both target and features) and refined algorithms. By distinguishing aleatoric from epistemic error, we delineate how accuracy may asymptotically improve (though inherent stochasticity may remain) and offer a robust framework for advancing computational research.

Jiani Yan, Charles Rahal (1/22/2025)

arXiv:2501.11384v1 Announce Type: new Abstract: We introduce a method based on Conformal Prediction (CP) to quantify the uncertainty of full ranking algorithms. We focus on a specific scenario where $n + m$ items are to be ranked by some "black box" algorithm. It is assumed that the relative (ground truth) ranking of $n$ of them is known. The objective is then to quantify the error made by the algorithm on the ranks of the $m$ new items among the total $(n + m)$. In such a setting, the true ranks of the $n$ original items among the total $(n + m)$ depend on the (unknown) true ranks of the $m$ new ones. Consequently, we have no direct access to a calibration set with which to apply a classical CP method. To address this challenge, we propose to construct distribution-free bounds on the unknown conformity scores using recent results on the distribution of conformal p-values. Using these upper bounds on the scores, we provide valid prediction sets for the rank of any item. We also control the false coverage proportion, a crucial quantity when dealing with multiple prediction sets. Finally, we empirically demonstrate the efficiency of our CP method on both synthetic and real data.

Jean-Baptiste Fermanian (UM, Inria, IMAG), Pierre Humbert (SU, LPSM), Gilles Blanchard (LMO, DATASHAPE) (1/22/2025)
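
For context, the standard building block the paper extends is the split-conformal p-value, which requires a calibration set; the paper's contribution is precisely to bound the unknown conformity scores when no such set is directly available. A sketch of the classical ingredient only:

```python
import numpy as np

def conformal_pvalue(calib_scores, test_score):
    """Split-conformal p-value: (1 + #{calibration scores >= test score}) / (n + 1)."""
    n = len(calib_scores)
    return (1 + np.sum(calib_scores >= test_score)) / (n + 1)

# Under exchangeability this p-value is (super-)uniform, which is what makes
# thresholding it at alpha a valid prediction-set rule.
rng = np.random.default_rng(0)
print(conformal_pvalue(rng.normal(size=100), 1.5))
```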

arXiv:2501.10446v1 Announce Type: new Abstract: A complex multi-state redundant system with preventive maintenance subject to multiple events is considered. The online unit can undergo several types of failures: internal ones and those provoked by external shocks. Multiple degradation levels are assumed, both internal and external. Degradation levels are observed by random inspections, and if they are major, the unit goes to the repair facility, where preventive maintenance is carried out. This repair facility is composed of a single repairperson governed by a multiple vacation policy, which is set up according to the number of operational units. Two types of tasks can be performed by the repairperson: corrective repair and preventive maintenance. The times embedded in the system are phase-type distributed, and the model is built by using Markovian Arrival Processes with marked arrivals. Multiple performance measures, besides the transient and stationary distributions, are worked out through matrix-analytic methods. This methodology enables us to express the main results and the global development in a matrix-algorithmic form. To optimize the model, costs and rewards are included. A numerical example shows the versatility of the model.

Juan Eloy Ruiz-Castro (1/22/2025)
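
For readers unfamiliar with the phase-type machinery the model is built on, a minimal sketch follows: a phase-type lifetime is the absorption time of a Markov chain with initial phase vector alpha and sub-generator T. The parameters below are illustrative, not the paper's model.

```python
import numpy as np
from scipy.linalg import expm

# Illustrative phase-type parameters: two transient phases, absorption = failure
alpha = np.array([1.0, 0.0])        # initial phase probabilities
T = np.array([[-2.0, 1.5],
              [0.0, -1.0]])         # sub-generator of the transient phases
ones = np.ones(2)

def ph_cdf(t):
    """P(lifetime <= t) = 1 - alpha @ expm(T t) @ 1."""
    return 1.0 - alpha @ expm(T * t) @ ones

mean_lifetime = -alpha @ np.linalg.inv(T) @ ones   # E[tau] = -alpha T^{-1} 1
print(f"P(tau <= 1) = {ph_cdf(1.0):.4f}, E[tau] = {mean_lifetime:.4f}")
```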

arXiv:1603.03788v2 Announce Type: replace-cross Abstract: We provide an introduction to the signature method, focusing on its theoretical properties and machine learning applications. Our presentation is divided into two parts. In the first part, we present the definition and fundamental properties of the signature of a path. The signature is a sequence of numbers associated with a path that captures many of its important analytic and geometric properties. As a sequence of numbers, the signature serves as a compact description (dimension reduction) of a path. In presenting its theoretical properties, we assume only familiarity with classical real analysis and integration, and supplement theory with straightforward examples. We also mention several advanced topics, including the role of the signature in rough path theory. In the second part, we present practical applications of the signature to the area of machine learning. The signature method is a non-parametric way of transforming data into a set of features that can be used in machine learning tasks. In this method, data are converted into multi-dimensional paths by means of embedding algorithms, and the signature of these paths is then computed. We describe this pipeline in detail, making a link with the properties of the signature presented in the first part. We furthermore review some of the developments of the signature method in machine learning and, as an illustrative example, present a detailed application of the method to handwritten digit classification.

Ilya Chevyrev, Andrey Kormilitzin (1/22/2025)
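
Since the signature is defined by iterated integrals, a short sketch of the first two signature levels of a piecewise-linear path may help; the check at the end uses a standard algebraic (shuffle) identity of signatures, $S^{(2)} + (S^{(2)})^\top = S^{(1)} (S^{(1)})^\top$. This is a hand-rolled illustration, not the tutorial's code.

```python
import numpy as np

def signature_level2(path):
    """First two signature levels of a piecewise-linear path (rows = points)."""
    incs = np.diff(path, axis=0)                 # segment increments
    s1 = incs.sum(axis=0)                        # level 1: total increment
    d = path.shape[1]
    s2 = np.zeros((d, d))
    running = np.zeros(d)                        # X_{t-1} - X_0 so far
    for dx in incs:
        # cross term with the history plus the iterated integral on one segment
        s2 += np.outer(running, dx) + 0.5 * np.outer(dx, dx)
        running += dx
    return s1, s2

# Example: for any path, s2 + s2.T equals outer(s1, s1) (shuffle identity)
path = np.cumsum(np.random.default_rng(1).normal(size=(100, 2)), axis=0)
s1, s2 = signature_level2(path)
print(np.allclose(s2 + s2.T, np.outer(s1, s1)))  # True
```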

arXiv:2501.12354v1 Announce Type: new Abstract: Inferring the true demand for a product or a service from aggregate data is often challenging due to the limited available supply, thus resulting in observations that are censored and correspond to the realized demand, thereby not accounting for the unsatisfied demand. Censored regression models are able to account for the effect of censoring due to the limited supply, but they do not consider the effect of substitutions, which may cause the demand for similar alternative products or services to increase. This paper proposes Diffusion-aware Censored Demand Models, which combine a Tobit likelihood with a graph diffusion process in order to model the latent process of transfer of unsatisfied demand between similar products or services. We instantiate this new class of models under the framework of Gaussian Processes (GPs) and, based on both simulated and real-world data for modeling sales, bike-sharing demand, and EV charging demand, demonstrate its ability to better recover the true demand and produce more accurate out-of-sample predictions.

Filipe Rodrigues (1/22/2025)
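
The Tobit ingredient of the proposed models is standard and easy to sketch; the graph-diffusion GP component is the paper's contribution and is not reproduced here. Variable names and the upper-censoring-at-capacity setup below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def tobit_loglik(y_obs, mu, sigma, capacity):
    """Tobit log-likelihood for demand censored from above at the supply capacity.

    Observations at capacity contribute P(latent demand >= capacity);
    uncensored observations contribute the usual Gaussian density.
    """
    censored = y_obs >= capacity
    ll = norm.logpdf(y_obs[~censored], mu[~censored], sigma).sum()
    ll += norm.logsf(capacity[censored], mu[censored], sigma).sum()
    return ll

y = np.array([2.0, 5.0, 5.0])        # the last two observations hit capacity
mu = np.array([1.5, 4.0, 6.0])       # latent-demand means (e.g., a GP posterior mean)
cap = np.array([5.0, 5.0, 5.0])
print(tobit_loglik(y, mu, 1.0, cap))
```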

arXiv:2212.05866v4 Announce Type: replace-cross Abstract: As they play an increasingly important role in determining access to credit, credit scoring models are under growing scrutiny from banking supervisors and internal model validators. These authorities need to monitor the model performance and identify its key drivers. To facilitate this, we introduce the XPER methodology to decompose a performance metric (e.g., AUC, $R^2$) into specific contributions associated with the various features of a forecasting model. XPER is theoretically grounded in Shapley values and is both model-agnostic and performance metric-agnostic. Furthermore, it can be implemented either at the model level or at the individual level. Using a novel dataset of car loans, we decompose the AUC of a machine-learning model trained to forecast the default probability of loan applicants. We show that a small number of features can explain a surprisingly large part of the model performance. Notably, the features that contribute the most to the predictive performance of the model may not be the ones that contribute the most to individual forecasts (SHAP). Finally, we show how XPER can be used to deal with heterogeneity issues and improve performance.

Hué Sullivan, Hurlin Christophe, Pérignon Christophe, Saurin Sébastien (1/22/2025)
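
A Shapley decomposition of a performance metric can be sketched by exact enumeration when the number of features is small. How XPER evaluates the metric on a feature coalition is not specified in the abstract, so the mean-imputation "neutralization" below is an assumption for illustration, not the paper's estimator.

```python
import numpy as np
from itertools import combinations
from math import comb

def shapley_metric(metric, X, y, p):
    """Exact Shapley decomposition of a performance metric over p features.

    v(S) = metric computed with features outside S neutralized (mean-imputed);
    a sketch in the spirit of XPER, feasible only for small p.
    """
    def v(S):
        Xs = np.tile(X.mean(axis=0), (len(X), 1))   # all features neutralized
        Xs[:, list(S)] = X[:, list(S)]              # restore coalition S
        return metric(Xs, y)
    phi = np.zeros(p)
    for j in range(p):
        rest = [k for k in range(p) if k != j]
        for size in range(p):
            for S in combinations(rest, size):
                w = 1.0 / (p * comb(p - 1, size))   # Shapley weight
                phi[j] += w * (v(S + (j,)) - v(S))
    return phi
```

Here `metric` could be, for instance, `lambda Xs, y: roc_auc_score(y, clf.predict_proba(Xs)[:, 1])` for a pre-fitted classifier `clf`, so that the AUC itself is what gets decomposed.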

arXiv:2501.10729v1 Announce Type: cross Abstract: Local Polynomial Regression (LPR) is a widely used nonparametric method for modeling complex relationships due to its flexibility and simplicity. It estimates a regression function by fitting low-degree polynomials to localized subsets of the data, weighted by proximity. However, traditional LPR is sensitive to outliers and high-leverage points, which can significantly affect estimation accuracy. This paper revisits the kernel function used to compute regression weights and proposes a novel framework that incorporates both predictor and response variables in the weighting mechanism. By introducing two positive definite kernels, the proposed method robustly estimates weights, mitigating the influence of outliers through localized density estimation. The method is implemented in Python and is publicly available at https://github.com/yaniv-shulman/rsklpr, demonstrating competitive performance in synthetic benchmark experiments. Compared to standard LPR, the proposed approach consistently improves robustness and accuracy, especially in heteroscedastic and noisy environments, without requiring multiple iterations. This advancement provides a promising extension to traditional LPR, opening new possibilities for robust regression applications.

Yaniv Shulman (1/22/2025)
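
A minimal sketch of the weighting idea, with Gaussian kernels in both predictor and response so that responses lying in regions of low localized density (outliers) receive small weight. The authors' actual estimator is in the rsklpr package linked above; bandwidths and kernel choices here are illustrative assumptions.

```python
import numpy as np

def robust_llr_fit(x, y, x0, hx, hy):
    """Local linear estimate at x0 with weights from kernels in both x and y."""
    kx = np.exp(-0.5 * ((x - x0) / hx) ** 2)                 # proximity in x
    # localized density of each response among its x-neighbours
    ky = np.array([(kx * np.exp(-0.5 * ((y - yi) / hy) ** 2)).sum() for yi in y])
    sw = np.sqrt(kx * ky / ky.max())                         # combined sqrt-weights
    X = np.column_stack([np.ones_like(x), x - x0])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0]                                           # fitted m(x0)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=200)
y[::25] += 3.0                                               # inject outliers
print(robust_llr_fit(x, y, 0.5, hx=0.05, hy=0.3))            # near sin(pi) = 0
```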

arXiv:2309.02422v4 Announce Type: replace-cross Abstract: Integral probability metrics (IPMs) constitute a general class of nonparametric two-sample tests that are based on maximizing the mean difference between samples from one distribution $P$ versus another $Q$, over all choices of data transformations $f$ living in some function space $\mathcal{F}$. Inspired by recent work that connects what are known as functions of $\textit{Radon bounded variation}$ (RBV) and neural networks (Parhi and Nowak, 2021, 2023), we study the IPM defined by taking $\mathcal{F}$ to be the unit ball in the RBV space of a given smoothness degree $k \geq 0$. This test, which we refer to as the $\textit{Radon-Kolmogorov-Smirnov}$ (RKS) test, can be viewed as a generalization of the well-known and classical Kolmogorov-Smirnov (KS) test to multiple dimensions and higher orders of smoothness. It is also intimately connected to neural networks: we prove that the witness in the RKS test -- the function $f$ achieving the maximum mean difference -- is always a ridge spline of degree $k$, i.e., a single neuron in a neural network. We can thus leverage the power of modern neural network optimization toolkits to (approximately) maximize the criterion that underlies the RKS test. We prove that the RKS test has asymptotically full power at distinguishing any distinct pair $P \not= Q$ of distributions, derive its asymptotic null distribution, and carry out experiments to elucidate the strengths and weaknesses of the RKS test versus the more traditional kernel MMD test.

Seunghoon Paik, Michael Celentano, Alden Green, Ryan J. Tibshirani (1/14/2025)
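
Because the witness is a single neuron, the RKS statistic can be approximated by gradient ascent over one ridge function. A sketch for degree $k = 1$ (a ReLU neuron) follows; the step size, initialization, and unit-sphere projection heuristic are assumptions, not the authors' training setup.

```python
import numpy as np

def rks_witness(P, Q, steps=2000, lr=0.1, seed=0):
    """Fit f(x) = relu(w @ x - b) maximizing mean_P f - mean_Q f,
    i.e. a degree-1 ridge-spline witness, with ||w|| = 1 enforced by projection."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=P.shape[1]); w /= np.linalg.norm(w)
    b = 0.0
    for _ in range(steps):
        mp = (P @ w - b) > 0             # ReLU active sets
        mq = (Q @ w - b) > 0
        grad_w = P[mp].sum(axis=0) / len(P) - Q[mq].sum(axis=0) / len(Q)
        grad_b = -mp.mean() + mq.mean()
        w += lr * grad_w; b += lr * grad_b
        w /= np.linalg.norm(w)           # project back to the unit sphere
    stat = np.maximum(P @ w - b, 0).mean() - np.maximum(Q @ w - b, 0).mean()
    return stat, w, b

P = np.random.default_rng(1).normal(0.0, 1, size=(500, 2))
Q = np.random.default_rng(2).normal(0.5, 1, size=(500, 2))
print(rks_witness(P, Q)[0])              # positive when P and Q differ
```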

arXiv:2501.06868v1 Announce Type: cross Abstract: Many problems within personalized medicine and digital health rely on the analysis of continuous-time functional biomarkers and other complex data structures emerging from high-resolution patient monitoring. In this context, this work proposes new optimization-based variable selection methods for multivariate, functional, and even more general outcomes in metric spaces, based on best-subset selection. Our framework applies to several types of regression models, including linear, quantile, or nonparametric additive models, and to a broad range of random responses, such as univariate and multivariate Euclidean data, functions, and even random graphs. Our analysis demonstrates that our proposed methodology outperforms state-of-the-art methods in accuracy and, especially, in speed, achieving several orders of magnitude improvement over competitors across various types of statistical responses, such as mathematical functions. While our framework is general and not designed for a specific regression or scientific problem, the article is self-contained and focuses on biomedical applications. In clinical areas, it serves as a valuable resource for professionals in biostatistics, statistics, and artificial intelligence interested in the variable selection problem in this new technological AI era.

Marcos Matabuena (1/14/2025)
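
For the classical Euclidean special case, best-subset selection can be sketched by exhaustive search; the paper's optimization-based formulation for functional and metric-space responses is far more general and is not reproduced here.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Exhaustive best-subset selection of k features for a linear model (small p only)."""
    n, p = X.shape
    best = (np.inf, None)
    for S in combinations(range(p), k):
        Xs = X[:, S]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(((y - Xs @ beta) ** 2).sum())
        if rss < best[0]:
            best = (rss, S)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 2 * X[:, 1] - X[:, 4] + 0.1 * rng.normal(size=200)
print(best_subset(X, y, k=2))   # should recover features (1, 4)
```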

arXiv:2501.06540v1 Announce Type: new Abstract: We aim to assist image-based myopia screening by resolving two longstanding problems: "how to integrate the information of ocular images of a pair of eyes" and "how to incorporate the inherent dependence among high-myopia status and axial length (AL) for both eyes." The classification-regression task is modeled as a novel 4-dimensional multi-response regression, where discrete responses are allowed, that relates to two dependent 3rd-order tensors (3D ultrawide-field fundus images). We present a Vision Transformer-based bi-channel architecture, named CeViT, where the common features of a pair of eyes are extracted via a shared Transformer encoder, and the interocular asymmetries are modeled through separate multilayer perceptron heads. Statistically, we model the conditional dependence among the mixture of discrete-continuous responses given the image covariates by a so-called copula loss. We establish a new theoretical framework for fine-tuning CeViT based on latent representations, making the black-box fine-tuning procedure interpretable and guaranteeing higher relative efficiency of fine-tuning weight estimation in the asymptotic setting. We apply CeViT to an annotated ultrawide-field fundus image dataset collected by Shanghai Eye & ENT Hospital, demonstrating that CeViT enhances the baseline model in both accuracy of classifying high myopia and prediction of AL for both eyes.

Chong Zhong, Yang Li, Jinfeng Xu, Xiang Fu, Yunhao Liu, Qiuyi Huang, Danjuan Yang, Meiyan Li, Aiyi Liu, Alan H. Welsh, Xingtao Zhou, Bo Fu, Catherine C. Liu (1/14/2025)
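
The bi-channel layout (shared encoder, separate per-eye heads) is easy to sketch. Below, a toy CNN stands in for the ViT encoder and the copula loss is omitted, so layer sizes and shapes are purely illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiChannelNet(nn.Module):
    """Shared encoder with per-eye MLP heads, sketching the CeViT layout."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(           # toy stand-in for the ViT encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        # separate heads model interocular asymmetry: (high-myopia logit, AL)
        self.head_left = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))
        self.head_right = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, left_img, right_img):
        zl = self.encoder(left_img)             # shared weights for both eyes
        zr = self.encoder(right_img)
        return self.head_left(zl), self.head_right(zr)   # 4-dim joint response

model = BiChannelNet()
out_l, out_r = model(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
print(out_l.shape, out_r.shape)                 # torch.Size([2, 2]) twice
```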

arXiv:2501.06826v1 Announce Type: cross Abstract: Models trained on crowdsourced labels may not reflect broader population views when annotator pools are not representative. Since collecting representative labels is challenging, we propose Population-Aligned Instance Replication (PAIR), a method to address this bias through statistical adjustment. Using a simulation study of hate speech and offensive language detection, we create two types of annotators with different labeling tendencies and generate datasets with varying proportions of the types. Models trained on unbalanced annotator pools show poor calibration compared to those trained on representative data. However, PAIR, which duplicates labels from underrepresented annotator groups to match population proportions, significantly reduces bias without requiring new data collection. These results suggest statistical techniques from survey research can help align model training with target populations even when representative annotator pools are unavailable. We conclude with three practical recommendations for improving training data quality.

Stephanie Eckman, Bolei Ma, Christoph Kern, Rob Chew, Barbara Plank, Frauke Kreuter (1/14/2025)
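
The replication step can be sketched in a few lines of pandas: labels from under-represented annotator groups are duplicated until group shares match target population proportions. Column names and resampling details below are assumptions, not the authors' released code.

```python
import pandas as pd

def pair_replicate(df, group_col, target_props, seed=0):
    """Population-Aligned Instance Replication (sketch): up-sample labels from
    under-represented annotator groups to match population proportions."""
    counts = df[group_col].value_counts()
    # size the adjusted pool off the best-represented group (no downsampling)
    total = max(counts[g] / p for g, p in target_props.items())
    parts = []
    for g, p in target_props.items():
        need = int(round(p * total))
        grp = df[df[group_col] == g]
        parts.append(grp.sample(n=need, replace=need > len(grp), random_state=seed))
    return pd.concat(parts, ignore_index=True)

df = pd.DataFrame({"label": [1, 0, 1, 0, 0, 1], "group": ["A"] * 5 + ["B"]})
balanced = pair_replicate(df, "group", {"A": 0.5, "B": 0.5})
print(balanced["group"].value_counts())   # groups now in 50/50 proportion
```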

arXiv:2111.04681v2 Announce Type: replace-cross Abstract: We consider the problem of structured tensor denoising in the presence of unknown permutations. Such data problems arise commonly in recommendation systems, neuroimaging, community detection, and multiway comparison applications. Here, we develop a general family of smooth tensor models up to arbitrary index permutations; the model incorporates the popular tensor block models and Lipschitz hypergraphon models as special cases. We show that a constrained least-squares estimator in the block-wise polynomial family achieves the minimax error bound. A phase transition phenomenon is revealed with respect to the smoothness threshold needed for optimal recovery. In particular, we find that a polynomial of degree up to $(m-2)(m+1)/2$ is sufficient for accurate recovery of order-$m$ tensors, whereas higher degrees exhibit no further benefit. This phenomenon reveals the intrinsic distinction for smooth tensor estimation problems with and without unknown permutations. Furthermore, we provide an efficient polynomial-time Borda count algorithm that provably achieves the optimal rate under monotonicity assumptions. The efficacy of our procedure is demonstrated through both simulations and Chicago crime data analysis.

Chanwoo Lee, Miaoyan Wang (1/14/2025)
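
The Borda count step is simple to sketch for the matrix (order-2) case: rank rows by their aggregate scores to estimate the unknown permutation, after which the sorted array is smooth and can be denoised. The simulation below is illustrative and unrelated to the paper's data.

```python
import numpy as np
from scipy.stats import kendalltau

def borda_permutation(Y):
    """Estimate the latent row ordering of a noisy matrix by Borda count:
    rank rows by their row sums (aggregate scores)."""
    return np.argsort(Y.sum(axis=1))

rng = np.random.default_rng(0)
theta = np.sort(rng.random(50))
signal = np.minimum.outer(theta, theta)       # smooth, monotone latent signal
perm = rng.permutation(50)
Y = signal[perm][:, perm] + 0.05 * rng.normal(size=(50, 50))
pi_hat = borda_permutation(Y)
# agreement with the true ordering of the permuted rows (close to 1)
print(kendalltau(pi_hat, np.argsort(perm))[0])
```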

arXiv:2501.06661v1 Announce Type: new Abstract: We show how random feature maps can be used to forecast dynamical systems with excellent forecasting skill. We consider the tanh activation function and judiciously choose the internal weights in a data-driven manner such that the resulting features explore the nonlinear, non-saturated regions of the activation function. We introduce skip connections and construct a deep variant of random feature maps by combining several units. To mitigate the curse of dimensionality, we introduce localization where we learn local maps, employing conditional independence. Our modified random feature maps provide excellent forecasting skill for both single trajectory forecasts as well as long-time estimates of statistical properties, for a range of chaotic dynamical systems with dimensions up to 512. In contrast to other methods such as reservoir computers which require extensive hyperparameter tuning, we effectively need to tune only a single hyperparameter, and are able to achieve state-of-the-art forecast skill with much smaller networks.

Pinak Mandal, Georg A. Gottwald (1/14/2025)
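
A minimal sketch of forecasting with random feature maps: tanh features with random internal weights, a ridge-regressed readout (the single hyperparameter), and autonomous forecasting by feeding predictions back in. The scalar logistic map and all sizes below are stand-ins; the paper's data-driven weight selection, skip connections, and localization are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: a chaotic scalar series (logistic map), forecast one step ahead
x = np.empty(2000); x[0] = 0.5
for t in range(1999):
    x[t + 1] = 3.9 * x[t] * (1 - x[t])

D, dim = 300, 1                         # number of random features, input dimension
W = rng.uniform(-1, 1, size=(D, dim))   # internal weights (the paper picks these
b = rng.uniform(-1, 1, size=D)          #   data-driven to avoid saturating tanh)
phi = lambda s: np.tanh(W @ s + b)      # random feature map

X = np.stack([phi(np.atleast_1d(v)) for v in x[:1500]])   # train on first 1500 steps
y = x[1:1501]
lam = 1e-6                              # the single ridge hyperparameter
beta = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Autonomous forecast: feed predictions back in; chaos limits the horizon
s, preds = x[1500], []
for _ in range(5):
    s = float(phi(np.atleast_1d(s)) @ beta)
    preds.append(s)
print(np.round(preds, 3))
print(np.round(x[1501:1506], 3))        # ground truth for comparison
```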

arXiv:2501.06268v1 Announce Type: new Abstract: We introduce a new method for clustering based on Cluster Catch Digraphs (CCDs). The new method addresses the limitations of RK-CCDs by employing a new variant of a spatial randomness test that uses the nearest neighbor distance (NND) instead of the Ripley's K function used by RK-CCDs. We conduct a comprehensive Monte Carlo analysis to assess the performance of our method, considering factors such as dimensionality, data set size, number of clusters, cluster volumes, and inter-cluster distance. Our method is particularly effective for high-dimensional data sets, comparable to or outperforming KS-CCDs and RK-CCDs, which rely on a KS-type statistic or Ripley's K function. We also evaluate our method using real and complex data sets, comparing it to well-known clustering methods. Again, our method exhibits competitive performance, producing high-quality clusters with desirable properties. Keywords: Graph-based clustering, Cluster catch digraphs, High-dimensional data, Nearest neighbor distance, Spatial randomness test

Rui Shi, Nedret Billor, Elvan Ceyhan (1/14/2025)
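
For context, a classical NND-based test of complete spatial randomness (the Clark-Evans test in 2D) is sketched below. The paper's new CCD variant uses its own NND statistic and extends to high dimensions, so this is background, not the proposed method; edge corrections are also ignored.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import norm

def clark_evans_z(points, area):
    """Clark-Evans test of complete spatial randomness (CSR) from 2D nearest
    neighbor distances; no edge correction (a simple textbook version)."""
    n = len(points)
    lam = n / area                              # intensity
    d, _ = cKDTree(points).query(points, k=2)   # k=2: first hit is the point itself
    r_obs = d[:, 1].mean()                      # observed mean NND
    r_exp = 0.5 / np.sqrt(lam)                  # expected NND under CSR
    se = 0.26136 / np.sqrt(n * lam)
    z = (r_obs - r_exp) / se
    return z, 2 * norm.sf(abs(z))

pts = np.random.default_rng(0).uniform(0, 1, size=(500, 2))
print(clark_evans_z(pts, area=1.0))             # z near 0 for CSR data
```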

arXiv:2501.06366v1 Announce Type: cross Abstract: When applied in healthcare, reinforcement learning (RL) seeks to dynamically match the right interventions to subjects to maximize population benefit. However, the learned policy may disproportionately allocate efficacious actions to one subpopulation, creating or exacerbating disparities in other socioeconomically-disadvantaged subgroups. These biases tend to occur in multi-stage decision making and can be self-perpetuating, which if unaccounted for could cause serious unintended consequences that limit access to care or treatment benefit. Counterfactual fairness (CF) offers a promising statistical tool grounded in causal inference to formulate and study fairness. In this paper, we propose a general framework for fair sequential decision making. We theoretically characterize the optimal CF policy and prove its stationarity, which greatly simplifies the search for optimal CF policies by leveraging existing RL algorithms. The theory also motivates a sequential data preprocessing algorithm to achieve CF decision making under an additive noise assumption. We prove and then validate our policy learning approach in controlling unfairness and attaining optimal value through simulations. Analysis of a digital health dataset designed to reduce opioid misuse shows that our proposal greatly enhances fair access to counseling.

Jitao Wang, Chengchun Shi, John D. Piette, Joshua R. Loftus, Donglin Zeng, Zhenke Wu (1/14/2025)

arXiv:2501.06873v1 Announce Type: cross Abstract: We analyze over 44,000 NBER and CEPR working papers from 1980 to 2023 using a custom language model to construct knowledge graphs that map economic concepts and their relationships. We distinguish between general claims and those documented via causal inference methods (e.g., DiD, IV, RDD, RCTs). We document a substantial rise in the share of causal claims, from roughly 4% in 1990 to nearly 28% in 2020, reflecting the growing influence of the "credibility revolution." We find that causal narrative complexity (e.g., the depth of causal chains) strongly predicts both publication in top-5 journals and higher citation counts, whereas non-causal complexity tends to be uncorrelated or negatively associated with these outcomes. Novelty is also pivotal for top-5 publication, but only when grounded in credible causal methods: introducing genuinely new causal edges or paths markedly increases both the likelihood of acceptance at leading outlets and long-run citations, while non-causal novelty exhibits weak or even negative effects. Papers engaging with central, widely recognized concepts tend to attract more citations, highlighting a divergence between factors driving publication success and long-term academic impact. Finally, bridging underexplored concept pairs is rewarded primarily when grounded in causal methods, yet such gap filling exhibits no consistent link with future citations. Overall, our findings suggest that methodological rigor and causal innovation are key drivers of academic recognition, but sustained impact may require balancing novel contributions with conceptual integration into established economic discourse.

Prashant Garg, Thiemo Fetzer (1/14/2025)

arXiv:2501.06918v1 Announce Type: cross Abstract: By 2030, the senior population aged 65 and older is expected to increase by over 50%, significantly raising the number of older drivers on the road. Drivers over 70 face higher crash death rates compared to those in their forties and fifties, underscoring the importance of developing more effective safety interventions for this demographic. Although the impact of aging on driving behavior has been studied, there is limited research on how these behaviors translate into real-world driving scenarios. This study addresses this need by leveraging Naturalistic Driving Data (NDD) to analyze driving performance measures - specifically, speed limit adherence on interstates and deceleration at stop intersections, both of which may be influenced by age-related declines. Using NDD, we developed Cumulative Distribution Functions (CDFs) to establish benchmarks for key driving behaviors among senior and young drivers. Our analysis, which included anomaly detection, benchmark comparisons, and accuracy evaluations, revealed significant differences in driving patterns primarily related to speed limit adherence at 75 mph. While our approach shows promising potential for enhancing Advanced Driver Assistance Systems (ADAS) by providing tailored interventions based on age-specific adherence to speed limit driving patterns, we recognize the need for additional data to refine and validate metrics for other driving behaviors. By establishing precise benchmarks for various driving performance metrics, ADAS can effectively identify anomalies, such as abrupt deceleration, which may indicate impaired driving or other safety concerns. This study lays a strong foundation for future research aimed at improving safety interventions through detailed driving behavior analysis.

Aparna Joshi, Kojo Adugyamfi, Jennifer Merickel, Pujitha Gunaratne, Anuj Sharma (1/14/2025)

arXiv:2501.07025v1 Announce Type: cross Abstract: Many Natural Language Processing (NLP) related applications involve topics and sentiments derived from short documents such as consumer reviews and social media posts. Topics and sentiments of short documents are highly sparse because a short document generally covers only a few topics among hundreds of candidates. Imputation of missing data is sometimes hard to justify and often impractical for highly sparse data. We developed a method for calculating a weighted similarity for highly sparse data without imputation. This weighted similarity consists of three components that capture similarity based on the existence of common properties, the lack of common properties, and the pattern of missing values. As a case study, we used a community detection algorithm and this weighted similarity to group different shampoo brands based on sparse topic sentiments derived from short consumer reviews. Compared with traditional imputation and similarity measures, the weighted similarity shows better performance in both general community structures and average community qualities. The performance is consistent and robust across metrics and community complexities.

Yong Zhang, Eric Herrison Gyamfi (1/14/2025)
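
A sketch of a three-component weighted similarity in the spirit described: one term for agreement on co-observed values, one for shared absence, and one for matching missingness patterns. The component definitions and weights below are assumptions for illustration; the paper specifies its own components.

```python
import numpy as np

def weighted_similarity(u, v, w=(0.6, 0.2, 0.2)):
    """Similarity for highly sparse vectors without imputation (illustrative).
    NaN marks a missing (unobserved) topic sentiment."""
    mu, mv = ~np.isnan(u), ~np.isnan(v)
    both = mu & mv
    if both.any():
        # cosine similarity on co-observed entries only
        a, b = u[both], v[both]
        s_obs = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    else:
        s_obs = 0.0
    s_absent = (~mu & ~mv).mean()            # shared lack of a topic
    s_pattern = (mu == mv).mean()            # overall missingness-pattern match
    return w[0] * s_obs + w[1] * s_absent + w[2] * s_pattern

u = np.array([0.9, np.nan, -0.4, np.nan])
v = np.array([0.7, np.nan, np.nan, 0.2])
print(round(weighted_similarity(u, v), 3))
```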