stat.AP

arXiv:2501.11131v1 Announce Type: cross Abstract: Underwater noise pollution from human activities, particularly shipping, has been recognised as a serious threat to marine life. The sound generated by vessels can have various adverse effects on fish and aquatic ecosystems in general. In this setting, the estimation and analysis of the underwater noise produced by vessels is an important challenge for the preservation of the marine environment. In this paper we propose a model for the spatio-temporal characterisation of the underwater noise generated by vessels. The approach is based on the reconstruction of the vessels' trajectories from Automatic Identification System (AIS) data and on their deployment in a spatio-temporal database. Trajectories are enriched with semantic information such as the acoustic characteristics of the vessels' engines or the activity performed by the vessels. We define a model for underwater noise propagation and use the trajectories' information to infer how noise propagates in the area of interest. We develop our approach for the case study of the fishery activities in the Northern Adriatic Sea, an area of the Mediterranean Sea which is well known to be highly exploited. We implement our approach using MobilityDB, an open source geospatial trajectory data management and analysis platform, which offers spatio-temporal operators and indexes that improve the efficiency of our system. We use this platform to conduct various analyses of the underwater noise generated in the Northern Adriatic Sea, aiming at estimating the impact of fishing activities on underwater noise pollution and at demonstrating the flexibility and expressiveness of our approach.

Giulia Rovinelli, Davide Rocchesso, Marta Simeoni, Esteban Zimányi, Alessandra Raffaetà 1/22/2025

arXiv:2501.04796v2 Announce Type: replace Abstract: We focus on the potential fragility of democratic elections given modern information-communication technologies (ICT) in the Web 2.0 era. Our work provides an explanation for the recent cascading attrition of public officials in the United States and offers potential policy interventions from a dynamical systems perspective. We propose that micro-level heterogeneity across individuals within crucial institutions leads to vulnerabilities of election support systems at the macro scale. Our analysis provides comparative statistics to measure the fragility of systems against targeted harassment, disinformation campaigns, and other adversarial manipulations that are now cheaper to scale and deploy. Our analysis also informs policy interventions that seek to retain public officials and increase voter turnout. We show how limited resources (for example, salary incentives to public officials and targeted interventions to increase voter turnout) can be allocated at the population level to improve these outcomes and maximally enhance democratic resilience. Structural and individual heterogeneity cause systemic fragility that adversarial actors can exploit, but they also provide opportunities for effective interventions that offer significant global improvements from limited and localized actions.

M. Amin Rahimian, Michael P. Colaresi 1/22/2025

arXiv:2501.11084v1 Announce Type: new Abstract: This paper combines two significant areas of political science research: measuring individual ideological position and cohesion. Although both approaches help analyze legislative behaviors, no unified model currently integrates these dimensions. To fill this gap, the paper proposes a methodology called B-Call that combines ideological positioning with voting cohesion, treating votes as random variables. The model is empirically validated using roll-call data from the United States, Brazil, and Chile legislatures, which represent diverse legislative dynamics. The analysis aims to capture the complexities of voting and legislative behaviors, resulting in a two-dimensional indicator. This study addresses gaps in current legislative voting models, particularly in contexts with limited party control.

Juan Reutter, Sergio Toro, Lucas Valenzuela, Daniel Alcatruz, Macarena Valenzuela 1/22/2025

arXiv:2501.10423v1 Announce Type: cross Abstract: The energy transition is profoundly reshaping electricity market dynamics. It makes it essential to understand how renewable energy generation actually impacts electricity prices, among all other market drivers. These insights are critical to design policies and market interventions that ensure affordable, reliable, and sustainable energy systems. However, identifying causal effects from observational data is a major challenge, requiring innovative causal inference approaches that go beyond conventional regression analysis only. We build upon the state of the art by developing and applying a local partially linear double machine learning approach. Its application yields the first robust causal evidence on the distinct and non-linear effects of wind and solar power generation on UK wholesale electricity prices, revealing key insights that have eluded previous analyses. We find that, over 2018-2024, wind power generation has a U-shaped effect on prices: at low penetration levels, a 1 GWh increase in energy generation reduces prices by up to 7 GBP/MWh, but this effect gets close to none at mid-penetration levels (20-30%) before intensifying again. Solar power places substantial downward pressure on prices at very low penetration levels (up to 9 GBP/MWh per 1 GWh increase in energy generation), though its impact weakens quite rapidly. We also uncover a critical trend where the price-reducing effects of both wind and solar power have become more pronounced over time (from 2018 to 2024), highlighting their growing influence on electricity markets amid rising penetration. Our study provides both novel analysis approaches and actionable insights to guide policymakers in appraising the way renewables impact electricity markets.

Davide Cacciarelli, Pierre Pinson, Filip Panagiotopoulos, David Dixon, Lizzie Blaxland 1/22/2025

arXiv:2501.11869v1 Announce Type: cross Abstract: Snapshot Compressive Imaging (SCI) maps three-dimensional (3D) data cubes, such as videos or hyperspectral images, into two-dimensional (2D) measurements via optical modulation, enabling efficient data acquisition and reconstruction. Recent advances have shown the potential of mask optimization to enhance SCI performance, but most studies overlook nonlinear distortions caused by saturation in practical systems. Saturation occurs when high-intensity measurements exceed the sensor's dynamic range, leading to information loss that standard reconstruction algorithms cannot fully recover. This paper addresses the challenge of optimizing binary masks in SCI under saturation. We theoretically characterize the performance of compression-based SCI recovery in the presence of saturation and leverage these insights to optimize masks for such conditions. Our analysis reveals trade-offs between mask statistics and reconstruction quality in saturated systems. Experimental results using a Plug-and-Play (PnP) style network validate the theory, demonstrating improved recovery performance and robustness to saturation with our optimized binary masks.

Mengyu Zhao, Shirin Jalali 1/22/2025

arXiv:2501.10401v1 Announce Type: cross Abstract: Fuel moisture content (FMC) is a key predictor for wildfire rate of spread (ROS). Machine learning models of FMC have been used increasingly in recent years, augmenting or replacing traditional physics-based approaches. ROS has a highly nonlinear relationship with FMC, where small differences in dry fuels lead to large differences in ROS. In this study, custom loss functions that place more weight on dry fuels were examined with a variety of machine learning models of FMC. The models were evaluated with a spatiotemporal cross-validation procedure to examine whether the custom loss functions led to more accurate forecasts of ROS. Results show that the custom loss functions improved accuracy for ROS forecasts by a small amount. Further research would be needed to establish whether the improvement in ROS forecasts leads to more accurate real-time wildfire simulations.

Jonathon Hirschi 1/22/2025
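
The idea of a loss that places more weight on dry fuels can be sketched as a weighted mean squared error. This is only an illustration: the abstract does not give the form of the custom losses, so the exponential weight `dryness_weight`, its `decay` parameter, and the sample values are assumptions.

```python
import math

def dryness_weight(fmc, decay=0.1):
    """Hypothetical weight that grows as fuel moisture content (percent) falls."""
    return math.exp(-decay * fmc)

def weighted_mse(y_true, y_pred, decay=0.1):
    """Mean squared error with per-sample weights normalised to sum to one."""
    weights = [dryness_weight(y, decay) for y in y_true]
    total = sum(weights)
    return sum(w * (t - p) ** 2 for w, t, p in zip(weights, y_true, y_pred)) / total

observed = [5.0, 30.0]  # one dry and one wet fuel sample (FMC in percent, made up)
error_on_dry = weighted_mse(observed, [7.0, 30.0])  # 2-point error on the dry sample
error_on_wet = weighted_mse(observed, [5.0, 32.0])  # same error on the wet sample
print(error_on_dry > error_on_wet)
```

With `decay=0.0` the weights are equal and the function reduces to the ordinary mean squared error, which makes the effect of the weighting easy to isolate.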

arXiv:2501.11860v1 Announce Type: new Abstract: Speckle noise is a fundamental challenge in coherent imaging systems, significantly degrading image quality. Over the past decades, numerous despeckling algorithms have been developed for applications such as Synthetic Aperture Radar (SAR) and digital holography. In this paper, we aim to establish a theoretically grounded approach to despeckling. We propose a method applicable to general structured stationary stochastic sources. We demonstrate the effectiveness of the proposed method on piecewise constant sources. Additionally, we theoretically derive a lower bound on the despeckling performance for such sources. The proposed despeckler applied to the 1-Markov structured sources achieves better reconstruction performance with no strong simplification of the ground truth signal model or speckle noise.

Ali Zafari, Shirin Jalali 1/22/2025

arXiv:2501.12198v1 Announce Type: new Abstract: This paper focuses on opinion dynamics under the influence of manipulative agents. Agents of this type are characterized by opinions that follow a trajectory which does not respond to the dynamics of the model, although it does influence the remaining normal agents. Simulations have been implemented to study how one manipulative group modifies the natural dynamics of some bounded-confidence opinion models. The study examines which strategies, based on the number of manipulative agents and their common opinion trajectory, a manipulative group can use to influence normal agents and attract them to its opinions. In certain weighted models, some effects are observed in which normal agents move in the opposite direction to the manipulator group. Moreover, the conditions which ensure the influence of a manipulative group on a group of normal agents over time are also established for the Hegselmann-Krause model.

A. Bautista 1/22/2025
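
A minimal sketch of the setting described above, assuming the standard Hegselmann-Krause rule (each normal agent moves to the average of all opinions within a confidence bound `eps`). The function names, the way the manipulative group is counted as `n_manip` identical agents, and the toy values are assumptions for illustration, not the paper's implementation.

```python
def hk_step(normal, manip_opinion, n_manip, eps):
    """One synchronous Hegselmann-Krause update of the normal agents."""
    updated = []
    for x in normal:
        # Normal neighbours within the confidence bound (the agent itself included).
        close = [y for y in normal if abs(x - y) <= eps]
        total, count = sum(close), len(close)
        # Manipulators are heard like any other agent but are never updated here.
        if n_manip > 0 and abs(x - manip_opinion) <= eps:
            total += n_manip * manip_opinion
            count += n_manip
        updated.append(total / count)
    return updated

def simulate(normal, trajectory, n_manip, eps, steps):
    """Run the model; trajectory(t) is the manipulators' prescribed opinion at step t."""
    for t in range(steps):
        normal = hk_step(normal, trajectory(t), n_manip, eps)
    return normal

start = [0.0, 1.0]  # two normal agents too far apart to influence each other
without_group = simulate(start, lambda t: 0.5, n_manip=0, eps=0.6, steps=5)
with_group = simulate(start, lambda t: 0.5, n_manip=1, eps=0.6, steps=5)
print(without_group)  # [0.0, 1.0]
print(with_group)     # [0.5, 0.5]
```

In this toy case a single manipulator holding the constant opinion 0.5 bridges two agents that would otherwise never interact; a time-varying `trajectory` can be passed in the same way.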

arXiv:2408.00139v2 Announce Type: replace Abstract: The related concepts of partisan belief systems, issue alignment, and partisan sorting are central to our understanding of politics. These phenomena have been studied using measures of alignment between pairs of topics, or how much individuals' attitudes toward a topic reveal about their attitudes toward another topic. We introduce a higher-order measure that extends the assessment of alignment beyond pairs of topics by quantifying the amount of information individuals' opinions on one topic reveal about a set of topics simultaneously. Applying this approach to legislative voting behavior shows that parliamentary systems typically exhibit similar multiway alignment characteristics, but can change in response to shifting intergroup dynamics. In American National Election Studies surveys, our approach reveals a growing significance of party identification together with a consistent rise in multiway alignment over time.

Letizia Iannucci, Ali Faqeeh, Ali Salloum, Ted Hsuan Yun Chen, Mikko Kivelä 1/14/2025

arXiv:2501.06241v1 Announce Type: new Abstract: This study investigates the efficacy of machine learning models for predicting house rental prices in Ghana, addressing the need for accurate and accessible housing market information. Utilising a comprehensive dataset of rental listings, we trained and evaluated various models, including CatBoost, XGBoost, and Random Forest. CatBoost emerged as the best-performing model, achieving an $R^2$ of 0.876, demonstrating its ability to effectively capture complex relationships within the housing market. Feature importance analysis revealed that location-based features, number of bedrooms, bathrooms, and furnishing status are key drivers of rental prices. Our findings provide valuable insights for stakeholders, including real estate professionals, investors, and policymakers, while also highlighting opportunities for future research, such as incorporating temporal data and exploring regional variations.

Philip Adzanoukpe 1/14/2025

arXiv:2412.06825v2 Announce Type: replace Abstract: Reliable and interpretable traffic crash modeling is essential for understanding causality and improving road safety. This study introduces a novel approach to predicting collision types by utilizing a comprehensive dataset fused from multiple sources, including weather data, crash reports, high-resolution traffic information, pavement geometry, and facility characteristics. Central to our approach is the development of a Feature Group Tabular Transformer (FGTT) model, which organizes disparate data into meaningful feature groups, represented as tokens. These group-based tokens serve as rich semantic components, enabling effective identification of collision patterns and interpretation of causal mechanisms. The FGTT model is benchmarked against widely used tree ensemble models, including Random Forest, XGBoost, and CatBoost, demonstrating superior predictive performance. Furthermore, model interpretation reveals key influential factors, providing fresh insights into the underlying causality of distinct crash types.

Oscar Lares, Hao Zhen, Jidong J. Yang 1/14/2025

arXiv:2501.07185v1 Announce Type: new Abstract: Precision agriculture in general, and precision weeding in particular, have greatly benefited from the major advancements in deep learning and computer vision. A large variety of commercial robotic solutions are already available and deployed. However, the adoption by farmers of such solutions is still low for many reasons, an important one being the lack of trust in these systems. This is in great part due to the opaqueness and complexity of deep neural networks and the manufacturers' inability to provide valid guarantees on their performance. Conformal prediction, a well-established methodology in the machine learning community, is an efficient and reliable strategy for providing trustworthy guarantees on the predictions of any black-box model under very minimal constraints. Bridging the gap between the safe machine learning and precision agriculture communities, this article showcases conformal prediction in action on the task of precision weeding through deep learning-based image classification. After a detailed presentation of the conformal prediction methodology and the development of a precision spraying pipeline based on a "conformalized" neural network and well-defined spraying decision rules, the article evaluates this pipeline on two real-world scenarios: one under in-distribution conditions, the other reflecting a near out-of-distribution setting. The results show that we are able to provide formal, i.e. certifiable, guarantees on spraying at least 90% of the weeds.

Paul Melki (IMS), Lionel Bombrun (IMS), Boubacar Diallo (IMS), Jérôme Dias (IMS), Jean-Pierre da Costa (IMS) 1/14/2025
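
To make the conformal prediction step concrete, here is a minimal sketch of split conformal classification, which gives marginal coverage of at least 1 - alpha when calibration and test data are exchangeable. The score (one minus the softmax probability of the true class), the rule "spray if the weed class is in the prediction set", and all numbers are assumptions for illustration; the abstract does not specify the article's scores or decision rules.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal score threshold for miscoverage level alpha."""
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: every class is kept
    return scores[k - 1]

def prediction_set(probs, threshold):
    """All classes whose nonconformity score does not exceed the threshold."""
    return {c for c, p in enumerate(probs) if 1.0 - p <= threshold}

def should_spray(probs, threshold, weed_class=1):
    """Hypothetical decision rule: spray whenever 'weed' cannot be ruled out."""
    return weed_class in prediction_set(probs, threshold)

# Toy calibration data: class 0 = crop, class 1 = weed (made-up probabilities).
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
cal_labels = [0, 1, 0, 1]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
print(should_spray([0.35, 0.65], threshold), should_spray([0.7, 0.3], threshold))
```

A real calibration set would be far larger; with only four points and `alpha=0.1` the corrected rank exceeds the sample size, so the function falls back to keeping every class.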

arXiv:2501.06656v1 Announce Type: new Abstract: For analysis of bibliographic data, we can obtain from bibliographic databases the corresponding collection of bibliographic networks. Recently OpenAlex, a new open-access bibliographic database, became available. We present OpenAlex2Pajek, an R package for converting OpenAlex data into a collection of Pajek's networks. For an illustration, we created a temporal weighted network describing the co-authorship between world countries for years from 1990 to 2023. We present some analyses of this network.

Vladimir Batagelj 1/14/2025

arXiv:2501.07206v1 Announce Type: new Abstract: Systemic lupus erythematosus (SLE) is a complex heterogeneous disease with many manifestational facets. We propose a data-driven approach to discover probabilistic independent sources from multimodal imperfect EHR data. These sources represent exogenous variables in the data generation process causal graph that estimate latent root causes of the presence of SLE in the health record. We objectively evaluated the sources against the original variables from which they were discovered by training supervised models to discriminate SLE from negative health records using a reduced set of labelled instances. We found 19 predictive sources with high clinical validity and whose EHR signatures define independent factors of SLE heterogeneity. Using the sources as the input patient data representation enables models to provide rich explanations that better capture the clinical reasons why a particular record is (not) an SLE case. Providers may be willing to trade patient-level interpretability for discrimination, especially in challenging cases.

Marco Barbero Mota, John M. Still, Jorge L. Gamboa, Eric V. Strobl, Charles M. Stein, Vivian K. Kawai, Thomas A. Lasko 1/14/2025

arXiv:2410.02987v2 Announce Type: cross Abstract: In this work, we study the effectiveness of employing archetypal aperiodic sequencing -- namely Fibonacci, Thue-Morse, and Rudin-Shapiro -- on the Parrondian effect. From a capital gain perspective, our results show that these series do yield a Parrondo's Paradox with the Thue-Morse based strategy outperforming not only the other two aperiodic strategies but benchmark Parrondian games with random and periodical ($AABBAABB\ldots$) switching as well. The least performing of the three aperiodic strategies is the Rudin-Shapiro. To elucidate the underlying causes of these results, we analyze the cross-correlation between the capital generated by the switching protocols and that of the isolated losing games. This analysis reveals that a strong anticorrelation with both isolated games is typically required to achieve a robust manifestation of Parrondo's effect. We also study the influence of the sequencing on the capital using the lacunarity and persistence measures. In general, we observe that the switching protocols tend to become less performing in terms of the capital as one increases the persistence and thus approaches the features of an isolated losing game. For the (log-)lacunarity, a property related to heterogeneity, we notice that for small persistence (less than 0.5) the performance increases with the lacunarity with a maximum around 0.4. In respect of this, our work shows that the optimization of a switching protocol is strongly dependent on a fine-tuning between persistence and heterogeneity.

Marcelo A. Pires, Erveton P. Pinto, Rone N. da Silva, Sílvio M. Duarte Queirós 1/14/2025
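
The switching protocols discussed above can be explored with a short sketch. It assumes the classic Parrondo setup (game A wins with probability 1/2 - eps; game B wins with probability 1/10 - eps when the capital is divisible by 3 and 3/4 - eps otherwise, with eps = 0.005), which may differ from the games used in the paper. Instead of simulating, it tracks the distribution of the capital modulo 3 and returns the exact expected capital.

```python
def thue_morse(n):
    """First n Thue-Morse terms: parity of the number of 1 bits of the index."""
    return [bin(i).count("1") % 2 for i in range(n)]

def win_probabilities(game, eps):
    """Win probability for each value of capital mod 3 (classic Parrondo games)."""
    if game == "A":
        return [0.5 - eps] * 3
    return [0.1 - eps, 0.75 - eps, 0.75 - eps]

def expected_capital(games, eps=0.005):
    """Exact expected capital after playing the given sequence from capital 0."""
    dist = [1.0, 0.0, 0.0]  # distribution of capital mod 3
    capital = 0.0
    for game in games:
        probs = win_probabilities(game, eps)
        # Expected gain of this round given the current distribution.
        capital += sum(d * (2 * p - 1) for d, p in zip(dist, probs))
        new_dist = [0.0, 0.0, 0.0]
        for state, (d, p) in enumerate(zip(dist, probs)):
            new_dist[(state + 1) % 3] += d * p        # win: capital + 1
            new_dist[(state - 1) % 3] += d * (1 - p)  # loss: capital - 1
        dist = new_dist
    return capital

n = 1000
switching = ["AB"[bit] for bit in thue_morse(n)]  # 0 -> game A, 1 -> game B
for label, games in [("A only", ["A"] * n), ("B only", ["B"] * n), ("Thue-Morse", switching)]:
    print(label, round(expected_capital(games), 3))
```

Other protocols, such as the periodic AABB sequence or a Rudin-Shapiro sequence, can be compared by passing a different list of game labels to `expected_capital`.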

arXiv:2501.06540v1 Announce Type: new Abstract: We aim to assist image-based myopia screening by resolving two longstanding problems, "how to integrate the information of ocular images of a pair of eyes" and "how to incorporate the inherent dependence among high-myopia status and axial length for both eyes." The classification-regression task is modeled as a novel 4-dimensional multi-response regression, where discrete responses are allowed, that relates to two dependent 3rd-order tensors (3D ultrawide-field fundus images). We present a Vision Transformer-based bi-channel architecture, named CeViT, where the common features of a pair of eyes are extracted via a shared Transformer encoder, and the interocular asymmetries are modeled through separated multilayer perceptron heads. Statistically, we model the conditional dependence among a mixture of discrete-continuous responses given the image covariates by a so-called copula loss. We establish a new theoretical framework regarding fine-tuning on CeViT based on latent representations, making the black-box fine-tuning procedure interpretable and guaranteeing higher relative efficiency of fine-tuning weight estimation in the asymptotic setting. We apply CeViT to an annotated ultrawide-field fundus image dataset collected by Shanghai Eye & ENT Hospital, demonstrating that CeViT improves on the baseline model both in accuracy of classifying high myopia and in prediction of axial length (AL) for both eyes.

Chong Zhong, Yang Li, Jinfeng Xu, Xiang Fu, Yunhao Liu, Qiuyi Huang, Danjuan Yang, Meiyan Li, Aiyi Liu, Alan H. Welsh, Xingtao Zhou, Bo Fu, Catherine C. Liu 1/14/2025

arXiv:2501.06868v1 Announce Type: cross Abstract: Many problems within personalized medicine and digital health rely on the analysis of continuous-time functional biomarkers and other complex data structures emerging from high-resolution patient monitoring. In this context, this work proposes new optimization-based variable selection methods for multivariate, functional, and even more general outcomes in metric spaces based on best-subset selection. Our framework applies to several types of regression models, including linear, quantile, or nonparametric additive models, and to a broad range of random responses, such as univariate, multivariate Euclidean data, functional, and even random graphs. Our analysis demonstrates that our proposed methodology outperforms state-of-the-art methods in accuracy and, especially, in speed, achieving several orders of magnitude improvement over competitors across various types of statistical responses, such as mathematical functions. While our framework is general and is not designed for a specific regression and scientific problem, the article is self-contained and focuses on biomedical applications. In clinical areas, it serves as a valuable resource for professionals in biostatistics, statistics, and artificial intelligence interested in the variable selection problem in this new technological AI era.

Marcos Matabuena 1/14/2025

arXiv:2501.03747v1 Announce Type: new Abstract: Recently, leveraging pre-trained Large Language Models (LLMs) for time series (TS) tasks has gained increasing attention, which involves activating and enhancing LLMs' capabilities. Many methods aim to activate LLMs' capabilities based on token-level alignment but overlook LLMs' inherent strength in natural language processing -- their deep understanding of linguistic logic and structure rather than superficial embedding processing. We propose Context-Alignment, a new paradigm that aligns TS with a linguistic component in the language environments familiar to LLMs to enable LLMs to contextualize and comprehend TS data, thereby activating their capabilities. Specifically, such context-level alignment comprises structural alignment and logical alignment, which are achieved by Dual-Scale Context-Alignment GNNs (DSCA-GNNs) applied to TS-language multimodal inputs. Structural alignment utilizes dual-scale nodes to describe hierarchical structure in TS-language, enabling LLMs to treat long TS data as a whole linguistic component while preserving intrinsic token features. Logical alignment uses directed edges to guide logical relationships, ensuring coherence in the contextual semantics. Demonstration example prompts are employed to construct Demonstration Examples based Context-Alignment (DECA) following the DSCA-GNNs framework. DECA can be flexibly and repeatedly integrated into various layers of pre-trained LLMs to improve awareness of logic and structure, thereby enhancing performance. Extensive experiments show the effectiveness of DECA and the importance of Context-Alignment across tasks, particularly in few-shot and zero-shot forecasting, confirming that Context-Alignment provides powerful prior knowledge on context.

Yuxiao Hu, Qian Li, Dongxiao Zhang, Jinyue Yan, Yuntian Chen 1/8/2025

arXiv:2402.10456v2 Announce Type: replace-cross Abstract: The generation of synthetic data with distributions that faithfully emulate the underlying data-generating mechanism holds paramount significance. Wasserstein Generative Adversarial Networks (WGANs) have emerged as a prominent tool for this task; however, due to the delicate equilibrium of the minimax formulation and the instability of Wasserstein distance in high dimensions, WGAN often manifests the pathological phenomenon of mode collapse. This results in generated samples that converge to a restricted set of outputs and fail to adequately capture the tail behaviors of the true distribution. Such limitations can lead to serious downstream consequences. To this end, we propose the Penalized Optimal Transport Network (POTNet), a versatile deep generative model based on the marginally-penalized Wasserstein (MPW) distance. Through the MPW distance, POTNet effectively leverages low-dimensional marginal information to guide the overall alignment of joint distributions. Furthermore, our primal-based framework enables direct evaluation of the MPW distance, thus eliminating the need for a critic network. This formulation circumvents training instabilities inherent in adversarial approaches and avoids the need for extensive parameter tuning. We derive a non-asymptotic bound on the generalization error of the MPW loss and establish convergence rates of the generative distribution learned by POTNet. Our theoretical analysis together with extensive empirical evaluations demonstrate the superior performance of POTNet in accurately capturing underlying data structures, including their tail behaviors and minor modalities. Moreover, our model achieves orders of magnitude speedup during the sampling stage compared to state-of-the-art alternatives, which enables computationally efficient large-scale synthetic data generation.

Wenhui Sophia Lu, Chenyang Zhong, Wing Hung Wong 1/8/2025

arXiv:2501.01437v1 Announce Type: cross Abstract: Network reconstruction consists in retrieving the -- hidden -- interaction structure of a system from empirical observations such as time series. Many reconstruction algorithms have been proposed, although less research has been devoted to describing their theoretical limitations. To this end, we adopt an information-theoretical point of view and define the reconstructability -- the fraction of structural information recoverable from data. The reconstructability depends on the true data generating model, which is shown to set the reconstruction limit, i.e., the performance upper bound for all algorithms. We show that the reconstructability is related to various performance measures, such as the probability of error and the Jaccard similarity. In an empirical context where the true data generating model is unknown, we introduce the reconstruction index as an approximation of the reconstructability. We find that performing model selection is crucial for the validity of the reconstruction index as a proxy of the reconstructability, and illustrate how it assesses the reconstruction limit of empirical time series and networks.

Charles Murphy, Simon Lizotte, François Thibault, Vincent Thibeault, Patrick Desrosiers, Antoine Allard 1/6/2025