stat.ME
32 posts

arXiv:2412.20173v2 Announce Type: replace-cross Abstract: This study proposes a debiasing method for smooth nonparametric estimators. While machine learning techniques such as random forests and neural networks have demonstrated strong predictive performance, their theoretical properties remain relatively underexplored. Specifically, many modern algorithms lack assurances of pointwise asymptotic normality and uniform convergence, which are critical for statistical inference and robustness under covariate shift and have been well established for classical methods like Nadaraya-Watson regression. To address this, we introduce a model-free debiasing method that guarantees these properties for smooth estimators derived from any nonparametric regression approach. By adding a correction term that estimates the conditional expected residual of the original estimator, or equivalently, its estimation error, we obtain a debiased estimator with proven pointwise asymptotic normality and uniform convergence. These properties enable statistical inference and enhance robustness to covariate shift, making the method broadly applicable to a wide range of nonparametric regression problems.
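A minimal sketch of the residual-correction idea described in this abstract: fit any base regressor, estimate the conditional expected residual E[Y - f_hat(X) | X = x] with a kernel smoother, and add it back. The sample split, random-forest base learner, and Gaussian-kernel bandwidth are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.standard_normal(n)

# Split: one half trains the base learner, the other estimates its residual surface.
X1, y1 = X[: n // 2], y[: n // 2]
X2, y2 = X[n // 2 :], y[n // 2 :]

base = RandomForestRegressor(n_estimators=200, random_state=0).fit(X1, y1)
resid = y2 - base.predict(X2)            # estimation error of the base learner

def kernel_smooth(x0, Xs, r, h=0.2):
    """Nadaraya-Watson estimate of E[residual | X = x0] with a Gaussian kernel."""
    w = np.exp(-0.5 * np.sum((Xs - x0) ** 2, axis=1) / h**2)
    return np.sum(w * r) / np.sum(w)

def debiased_predict(x0):
    # Base prediction plus estimated conditional expected residual.
    return base.predict(x0.reshape(1, -1))[0] + kernel_smooth(x0, X2, resid)

print(debiased_predict(np.array([0.5])), np.sin(1.0))
```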
arXiv:2501.00119v1 Announce Type: cross Abstract: A/B tests, also known as randomized controlled experiments (RCTs), are the gold standard for evaluating the impact of new policies, products, or decisions. However, these tests can be costly in terms of time and resources, potentially exposing users, customers, or other test subjects (units) to inferior options. This paper explores practical considerations in applying methodologies inspired by "synthetic control" as an alternative to traditional A/B testing in settings with very large numbers of units, involving up to hundreds of millions of units, which is common in modern applications such as e-commerce and ride-sharing platforms. This method is particularly valuable in settings where the treatment affects only a subset of units, leaving many units unaffected. In these scenarios, synthetic control methods leverage data from unaffected units to estimate counterfactual outcomes for treated units. After the treatment is implemented, these estimates can be compared to actual outcomes to measure the treatment effect. A key challenge in creating accurate counterfactual outcomes is interpolation bias, a well-documented phenomenon that occurs when control units differ significantly from treated units. To address this, we propose a two-phase approach: first using nearest neighbor matching based on unit covariates to select similar control units, then applying supervised learning methods suitable for high-dimensional data to estimate counterfactual outcomes. Testing using six large-scale experiments demonstrates that this approach successfully improves estimate accuracy. However, our analysis reveals that machine learning bias -- which arises from methods that trade off bias for variance reduction -- can impact results and affect conclusions about treatment effects. We document this bias in large-scale experimental settings and propose effective de-biasing techniques to address this challenge.
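A hedged sketch of the two-phase approach outlined above: (1) nearest-neighbor matching on unit covariates to select control units similar to the treated units, then (2) a supervised learner trained on those controls to predict counterfactual outcomes for the treated units. The data-generating model, learner, and neighborhood size are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n_treated, n_control, d = 500, 50_000, 10
X_t = rng.normal(size=(n_treated, d))                 # covariates of treated units
X_c = rng.normal(size=(n_control, d))                 # covariates of control units
beta = rng.normal(size=d)
y_c_post = X_c @ beta + rng.normal(size=n_control)    # post-period control outcomes
y_t_post = X_t @ beta + 0.5 + rng.normal(size=n_treated)  # treated outcomes, true effect 0.5

# Phase 1: keep only control units close to some treated unit (reduces interpolation bias).
nn = NearestNeighbors(n_neighbors=20).fit(X_c)
_, idx = nn.kneighbors(X_t)
matched = np.unique(idx.ravel())

# Phase 2: learn outcome ~ covariates on the matched controls, predict counterfactuals.
model = GradientBoostingRegressor().fit(X_c[matched], y_c_post[matched])
y_t_counterfactual = model.predict(X_t)
print("estimated average effect:", np.mean(y_t_post - y_t_counterfactual))
```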
arXiv:2501.00755v1 Announce Type: cross Abstract: Causal inference in observational studies with high-dimensional covariates presents significant challenges. We introduce CausalBGM, an AI-powered Bayesian generative modeling approach that captures the causal relationship among covariates, treatment, and outcome variables. The core innovation of CausalBGM lies in its ability to estimate the individual treatment effect (ITE) by learning individual-specific distributions of a low-dimensional latent feature set (e.g., latent confounders) that drives changes in both treatment and outcome. This approach not only effectively mitigates confounding effects but also provides comprehensive uncertainty quantification, offering reliable and interpretable causal effect estimates at the individual level. CausalBGM adopts a Bayesian model and uses a novel iterative algorithm to update the model parameters and the posterior distribution of latent features until convergence. This framework leverages the power of AI to capture complex dependencies among variables while adhering to Bayesian principles. Extensive experiments demonstrate that CausalBGM consistently outperforms state-of-the-art methods, particularly in scenarios with high-dimensional covariates and large-scale datasets. Its Bayesian foundation ensures statistical rigor, providing robust and well-calibrated posterior intervals. By addressing key limitations of existing methods, CausalBGM emerges as a robust and promising framework for advancing causal inference in modern applications in fields such as genomics, healthcare, and social sciences. CausalBGM is maintained at https://causalbgm.readthedocs.io/.
arXiv:2501.00854v1 Announce Type: cross Abstract: Sequential decision problems are widely studied across many areas of science. A key challenge when learning policies from historical data - a practice commonly referred to as off-policy learning - is how to "identify" the impact of a policy of interest when the observed data are not randomized. Off-policy learning has mainly been studied in two settings: dynamic treatment regimes (DTRs), where the focus is on controlling confounding in medical problems with short decision horizons, and offline reinforcement learning (RL), where the focus is on dimension reduction in closed systems such as games. The gap between these two well-studied settings has limited the wider application of off-policy learning to many real-world problems. Using the theory of causal inference based on acyclic directed mixed graphs (ADMGs), we provide a set of graphical identification criteria for general decision processes that encompass both DTRs and MDPs. We discuss how our results relate to the often implicit causal assumptions made in the DTR and RL literatures and further clarify several common misconceptions. Finally, we present a realistic simulation study of the dynamic pricing problem encountered in container logistics and demonstrate how violations of our graphical criteria can lead to suboptimal policies.
arXiv:2501.01414v1 Announce Type: cross Abstract: In the era of generative AI, deep generative models (DGMs) with latent representations have gained tremendous popularity. Despite their impressive empirical performance, the statistical properties of these models remain underexplored. DGMs are often overparametrized, non-identifiable, and uninterpretable black boxes, raising serious concerns when deploying them in high-stakes applications. Motivated by this, we propose an interpretable deep generative modeling framework for rich data types with discrete latent layers, called Deep Discrete Encoders (DDEs). A DDE is a directed graphical model with multiple binary latent layers. Theoretically, we propose transparent identifiability conditions for DDEs, which imply progressively smaller sizes of the latent layers as they go deeper. Identifiability ensures consistent parameter estimation and inspires an interpretable design of the deep architecture. Computationally, we propose a scalable estimation pipeline of a layerwise nonlinear spectral initialization followed by a penalized stochastic approximation EM algorithm. This procedure can efficiently estimate models with exponentially many latent components. Extensive simulation studies validate our theoretical results and demonstrate the proposed algorithms' excellent performance. We apply DDEs to three diverse real datasets for hierarchical topic modeling, image representation learning, and response time modeling in educational testing, and obtain interpretable findings.
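A hedged illustration of the generative structure described above: a directed model with two binary latent layers whose sizes shrink with depth, with each layer generated conditionally on the one above and the bottom layer generating observed responses through logistic links. The layer sizes, link form, and parameter values are assumptions for illustration, not the DDE specification.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

K2, K1, p, n = 3, 8, 50, 1000          # deepest layer < shallower layer < observed dimension
W21 = rng.normal(size=(K2, K1))        # deep latent -> shallow latent weights
W10 = rng.normal(size=(K1, p))         # shallow latent -> observed weights

Z2 = rng.binomial(1, 0.5, size=(n, K2))     # deepest binary latent layer
Z1 = rng.binomial(1, sigmoid(Z2 @ W21))     # next binary latent layer, generated top-down
Y = rng.binomial(1, sigmoid(Z1 @ W10))      # observed binary responses
print(Y.shape, Z1.mean(), Z2.mean())
```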
arXiv:2304.02022v3 Announce Type: replace Abstract: We study an online joint assortment-inventory optimization problem, in which we assume that the choice behavior of each customer follows the Multinomial Logit (MNL) choice model, and the attraction parameters are unknown a priori. The retailer makes periodic assortment and inventory decisions to dynamically learn the attraction parameters from customer choice observations while maximizing the expected total profit over time. In this paper, we propose a novel algorithm that can effectively balance exploration and exploitation in online decision-making for assortment and inventory. Our algorithm builds on a new estimator for the MNL attraction parameters, an innovative approach to incentivize exploration by adaptively tuning certain known and unknown parameters, and an optimization oracle for static single-cycle assortment-inventory planning problems with given parameters. We establish a regret upper bound for our algorithm and a lower bound for the online joint assortment-inventory optimization problem, suggesting that our algorithm achieves a nearly optimal regret rate, provided that the static optimization oracle is exact. We then incorporate more practical approximate static optimization oracles into our algorithm and bound from above the impact of static optimization errors on its regret. We perform numerical studies to demonstrate the effectiveness of the proposed algorithm. Finally, we extend our study by incorporating inventory carryover and the learning of the customer arrival distribution.
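A small sketch of the MNL choice model that the algorithm above learns online: given attraction parameters v_i for the products in the offered assortment S, and an outside (no-purchase) option normalized to 1, each customer chooses product i with probability v_i / (1 + sum over S of v_j). The parameter values below are illustrative.

```python
import numpy as np

def mnl_choice_probs(v_assortment):
    """Purchase probabilities under MNL; the last entry is the no-purchase option."""
    v = np.asarray(v_assortment, dtype=float)
    denom = 1.0 + v.sum()
    return np.append(v / denom, 1.0 / denom)

v = [0.8, 1.2, 0.5]              # unknown attraction parameters the retailer must learn
probs = mnl_choice_probs(v)
print(probs, probs.sum())        # probabilities sum to 1
```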
arXiv:2406.09169v2 Announce Type: replace Abstract: Real-world networks are sparse. As we show in this article, even when a large number of interactions is observed, most node pairs remain disconnected. We demonstrate that classical multi-edge network models, such as the $G(N,p)$, configuration models, and stochastic block models, fail to accurately capture this phenomenon. To mitigate this issue, zero-inflation must be integrated into these traditional models. Through zero-inflation, we incorporate a mechanism that accounts for the excess number of zeroes (disconnected pairs) observed in empirical data. By performing an analysis on all the datasets from the Sociopatterns repository, we illustrate how zero-inflated models more accurately reflect the sparsity and heavy-tailed edge count distributions observed in empirical data. Our findings underscore that failing to account for these ubiquitous properties in real-world networks inadvertently leads to biased models that do not accurately represent complex systems and their dynamics.
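A hedged sketch of the zero-inflation mechanism discussed above: each node pair is a structural zero (disconnected) with probability pi, and otherwise draws its multi-edge count from a Poisson with rate lam. This is a generic zero-inflated Poisson likelihood used only to illustrate the mechanism, not the paper's specific zero-inflated network models.

```python
import numpy as np
from scipy.stats import poisson

def zip_loglik(counts, pi, lam):
    """Log-likelihood of edge counts under a zero-inflated Poisson."""
    counts = np.asarray(counts)
    zero = counts == 0
    ll_zero = np.log(pi + (1 - pi) * np.exp(-lam))          # structural or sampling zero
    ll_pos = np.log(1 - pi) + poisson.logpmf(counts[~zero], lam)
    return zero.sum() * ll_zero + ll_pos.sum()

counts = np.array([0, 0, 0, 0, 3, 0, 1, 0, 0, 7])           # multi-edge counts per node pair
print(zip_loglik(counts, pi=0.6, lam=2.0))
```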
arXiv:2501.00087v1 Announce Type: cross Abstract: We investigate the parameter recovery of Markov-switching ordinary differential processes from discrete observations, where the differential equations are nonlinear additive models. This framework has been widely applied in biological systems, control systems, and other domains; however, limited research has been conducted on reconstructing the generating processes from observations. Yet many physical systems, such as the human brain, cannot be experimented upon directly, so the underlying system must be inferred from observations. To address this gap, this manuscript presents a comprehensive study of the model, encompassing algorithm design, optimization guarantees, and quantification of statistical errors. Specifically, we develop a two-stage algorithm that first recovers the continuous sample path from discrete samples and then estimates the parameters of the processes. We provide novel theoretical insights into the statistical error and a linear convergence guarantee when the processes are $\beta$-mixing. Our analysis is based on the truncation of the latent posterior processes and demonstrates that the truncated processes approximate the true processes under mixing conditions. We apply this model to investigate the differences in resting-state brain networks between the ADHD group and normal controls, revealing differences in the transition rate matrices of the two groups.
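A rough two-stage sketch in the spirit of the description above (not the paper's algorithm): stage 1 recovers a smooth sample path from discrete noisy observations with a spline; stage 2 regresses the implied derivatives on known basis functions of the state to estimate the parameters. The additive model dx/dt = a*x + b*sin(x), its basis, and the noise level are assumptions for illustration; no regime switching is included.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(3)
a_true, b_true = -0.5, 1.2
t = np.linspace(0, 10, 400)

# Simulate dx/dt = a*x + b*sin(x) with Euler steps, then observe with noise.
x = np.empty_like(t)
x[0] = 2.0
for k in range(1, len(t)):
    x[k] = x[k - 1] + (t[k] - t[k - 1]) * (a_true * x[k - 1] + b_true * np.sin(x[k - 1]))
obs = x + 0.05 * rng.standard_normal(len(t))

# Stage 1: recover the continuous path and its derivative from discrete samples.
spl = UnivariateSpline(t, obs, s=len(t) * 0.05**2)
x_hat, dx_hat = spl(t), spl.derivative()(t)

# Stage 2: least squares of the derivative on the additive basis [x, sin(x)].
B = np.column_stack([x_hat, np.sin(x_hat)])
a_hat, b_hat = np.linalg.lstsq(B, dx_hat, rcond=None)[0]
print(a_hat, b_hat)   # should be close to (-0.5, 1.2)
```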
arXiv:2412.18145v1 Announce Type: cross Abstract: The social characteristics of players in a social network are closely associated with their network positions and relational importance. Identifying those influential players in a network is of great importance as it helps to understand how ties are formed, how information is propagated, and, in turn, can guide the dissemination of new information. Motivated by a Sina Weibo social network analysis of the 2021 Henan Floods, where response variables for each Sina Weibo user are available, we propose a new notion of supervised centrality that emphasizes the task-specific nature of a player's centrality. To estimate the supervised centrality and identify important players, we develop a novel sparse network influence regression by introducing individual heterogeneity for each user. To overcome the computational difficulties in fitting the model for large social networks, we further develop a forward-addition algorithm and show that it can consistently identify a superset of the influential Sina Weibo users. We apply our method to analyze three responses in the Henan Floods data: the number of comments, reposts, and likes, and obtain meaningful results. A further simulation study corroborates the developed method.
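A hedged sketch of a forward-addition (greedy forward selection) step like the one described above: at each round, add the candidate user whose column most reduces the residual sum of squares of the response regression. The synthetic adjacency matrix, response, and coefficient values are illustrative, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(4)
n_users, n_candidates, k_true = 300, 40, 3
A = rng.binomial(1, 0.1, size=(n_users, n_candidates))    # follower/adjacency columns
influential = rng.choice(n_candidates, k_true, replace=False)
y = A[:, influential] @ np.array([2.0, 1.5, 1.0]) + rng.standard_normal(n_users)

def forward_addition(A, y, k):
    """Greedily add the column giving the largest drop in residual sum of squares."""
    selected = []
    for _ in range(k):
        best, best_rss = None, np.inf
        for j in range(A.shape[1]):
            if j in selected:
                continue
            cols = A[:, selected + [j]]
            beta, *_ = np.linalg.lstsq(cols, y, rcond=None)
            rss = np.sum((y - cols @ beta) ** 2)
            if rss < best_rss:
                best, best_rss = j, rss
        selected.append(best)
    return selected

# Typically recovers a superset of the truly influential columns.
print(sorted(forward_addition(A, y, 5)), sorted(influential))
```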
arXiv:2412.18180v1 Announce Type: cross Abstract: For a data-generating process for random variables that can be described with a linear structural equation model, we consider a situation in which (i) a set of covariates satisfying the back-door criterion cannot be observed or (ii) such a set can be observed, but standard statistical estimation methods cannot be applied to estimate causal effects because of multicollinearity/high-dimensional data problems. We propose a novel two-stage penalized regression approach, the penalized covariate-mediator selection operator (PCM Selector), to estimate the causal effects in such scenarios. Unlike existing penalized regression analyses, when a set of intermediate variables is available, PCM Selector provides a consistent or less biased estimator of the causal effect. In addition, PCM Selector provides a variable selection procedure for intermediate variables to obtain better estimation accuracy of the causal effects than does the back-door criterion.
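A hedged two-stage sketch in the spirit of using intermediate variables when a back-door set is unavailable (a front-door-style calculation with a penalized second stage). This is not the PCM Selector itself; the stages, penalties, and linear structural equation model below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(5)
n, p_m = 2000, 30
u = rng.standard_normal(n)                                # unobserved confounder
x = u + rng.standard_normal(n)                            # treatment
alpha = np.zeros(p_m); alpha[:3] = [1.0, 0.5, -0.8]
M = np.outer(x, alpha) + rng.standard_normal((n, p_m))    # many candidate intermediate variables
beta = np.zeros(p_m); beta[:3] = [0.7, 0.4, 0.2]
y = M @ beta + u + rng.standard_normal(n)                 # outcome, confounded by u

# Stage 1: per-mediator regressions of M_j on x (a penalized variant would be used
# when the intermediate variables are high-dimensional or collinear).
alpha_hat = np.array([LinearRegression().fit(x.reshape(-1, 1), M[:, j]).coef_[0]
                      for j in range(p_m)])

# Stage 2: penalized regression of y on the mediators, adjusting for x;
# the lasso selects the relevant mediators (with some shrinkage bias).
stage2 = LassoCV(cv=5).fit(np.column_stack([M, x]), y)
beta_hat = stage2.coef_[:p_m]

# Front-door-style combination: effect of x on y transmitted through the mediators.
print("estimated causal effect:", float(alpha_hat @ beta_hat),
      "true effect:", float(alpha @ beta))
```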
arXiv:2412.18568v1 Announce Type: cross Abstract: The problem of evaluating the effectiveness of a treatment or policy commonly appears in causal inference applications under network interference. In this paper, we suggest a new method of high-dimensional network causal inference (HNCI) that provides both a valid confidence interval for the average direct treatment effect on the treated (ADET) and a valid confidence set for the neighborhood size of the interference effect. We exploit the model setting in Belloni et al. (2022) and allow a certain type of heterogeneity in node interference neighborhood sizes. We propose a linear regression formulation of potential outcomes, where the regression coefficients correspond to the underlying true interference function values of nodes and exhibit a latent homogeneous structure. Such a formulation allows us to leverage the existing literature on linear regression and homogeneity pursuit to conduct valid statistical inference with theoretical guarantees. The resulting confidence intervals for the ADET are formally justified through asymptotic normality with estimable variances. We further provide the confidence set for the neighborhood size with theoretical guarantees, exploiting the repro samples approach. The practical utility of the newly suggested methods is demonstrated through simulation and real data examples.
arXiv:2110.09823v5 Announce Type: replace Abstract: Temporal point processes, stochastic processes on the continuous time domain, are commonly used to model asynchronous event sequences characterized by occurrence timestamps. Thanks to the strong expressivity of deep neural networks, deep models are emerging as a promising choice for capturing the patterns in asynchronous sequences in the context of temporal point processes. In this paper, we first review recent research emphases and difficulties in modeling asynchronous event sequences with deep temporal point processes, which can be grouped into four areas: encoding of the history sequence, formulation of the conditional intensity function, relational discovery among events, and learning approaches for optimization. We introduce most of the recently proposed models by decomposing them into these four parts, and conduct experiments by remodularizing the first three parts with the same learning strategy for a fair empirical evaluation. In addition, we extend the family of history encoders and conditional intensity functions, and propose a Granger causality discovery framework for exploiting the relations among multiple event types. Because Granger causality can be represented by a Granger causality graph, discrete graph structure learning in the framework of variational inference is employed to reveal latent structures of the graph. Further experiments show that the proposed framework with latent graph discovery can both capture these relations and achieve improved fitting and predictive performance.
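A small sketch of the "conditional intensity function" ingredient discussed above, using a classical exponential Hawkes intensity rather than a neural parameterization: lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i)). Deep temporal point processes replace this closed form with a neural encoder of the history; the parameter values here are illustrative.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Conditional intensity at time t given the strictly earlier event times."""
    history = np.asarray(history)
    past = history[history < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

def log_likelihood(events, T, mu=0.2, alpha=0.8, beta=1.0):
    """Point-process log-likelihood: sum of log-intensities minus the compensator."""
    events = np.asarray(events)
    log_term = sum(np.log(hawkes_intensity(t, events, mu, alpha, beta)) for t in events)
    # Compensator: integral of lambda over [0, T], available in closed form here.
    compensator = mu * T + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (T - events)))
    return log_term - compensator

events = [0.5, 1.1, 1.3, 2.7, 4.0]
print(hawkes_intensity(3.0, events), log_likelihood(events, T=5.0))
```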
arXiv:2311.17303v3 Announce Type: replace Abstract: In this paper, we develop a generic methodology to encode hierarchical causality structure among observed variables into a neural network in order to improve its predictive performance. The proposed methodology, called causality-informed neural network (CINN), leverages three coherent steps to systematically map the structural causal knowledge into the layer-to-layer design of the neural network while strictly preserving the orientation of every causal relationship. In the first step, CINN discovers causal relationships from observational data via directed acyclic graph (DAG) learning, where causal discovery is recast as a continuous optimization problem to avoid its combinatorial nature. In the second step, the discovered hierarchical causality structure among observed variables is systematically encoded into the neural network through a dedicated architecture and customized loss function. By categorizing variables in the causal DAG as root, intermediate, and leaf nodes, the hierarchical causal DAG is translated into CINN with a one-to-one correspondence between nodes in the causal DAG and units in the CINN, while maintaining the relative order among these nodes. Regarding the loss function, both intermediate and leaf nodes in the DAG are treated as target outputs during CINN training so as to drive co-learning of causal relationships among different types of nodes. As multiple loss components emerge in CINN, we leverage the projection of conflicting gradients to mitigate gradient interference among the multiple learning tasks. Computational experiments across a broad spectrum of UCI data sets demonstrate substantial advantages of CINN in predictive performance over other state-of-the-art methods. In addition, an ablation study underscores the value of integrating structural and quantitative causal knowledge in incrementally enhancing the neural network's predictive performance.
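A hedged sketch of the "projection of conflicting gradients" idea mentioned above (in the spirit of PCGrad-style methods): when two task gradients have a negative inner product, project one onto the normal plane of the other before combining them. Shown on plain NumPy vectors rather than CINN's training loop; the gradient values are illustrative.

```python
import numpy as np

def project_conflicting(g1, g2):
    """Remove from g1 the component that conflicts with g2, if any."""
    dot = np.dot(g1, g2)
    if dot < 0:
        g1 = g1 - dot / np.dot(g2, g2) * g2
    return g1

g_intermediate = np.array([1.0, -2.0])   # gradient from an intermediate-node loss
g_leaf = np.array([0.5, 1.0])            # gradient from a leaf-node loss

combined = (project_conflicting(g_intermediate, g_leaf)
            + project_conflicting(g_leaf, g_intermediate))
print(combined)   # conflict-free combined update direction
```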
arXiv:2412.17970v1 Announce Type: new Abstract: Causal reasoning capabilities are essential for large language models (LLMs) in a wide range of applications, such as education and healthcare. But there is still a lack of benchmarks for a better understanding of such capabilities. Current LLM benchmarks are mainly based on conversational tasks, academic math tests, and coding tests. Such benchmarks evaluate LLMs in well-regularized settings, but they are limited in assessing the skills and abilities needed to solve real-world problems. In this work, we provide a benchmark, named CARL-GT, which evaluates the CAusal Reasoning capabilities of large Language models using Graphs and Tabular data. The benchmark has a diverse range of tasks for evaluating LLMs from causal graph reasoning, knowledge discovery, and decision-making aspects. In addition, effective zero-shot learning prompts are developed for the tasks. In our experiments, we leverage the benchmark to evaluate open-source LLMs and provide a detailed comparison of LLMs for causal reasoning abilities. We find that LLMs are still weak in causal reasoning, especially with tabular data used to discover new insights. Furthermore, we investigate and discuss the relationships among different benchmark tasks by analyzing the performance of LLMs. The experimental results show that LLMs have different strengths across tasks and that their performance on tasks in different categories, i.e., causal graph reasoning, knowledge discovery, and decision-making, shows stronger correlation than their performance on tasks within the same category.
arXiv:2410.21858v4 Announce Type: replace-cross Abstract: We develop a nonparametric, kernel-based joint estimator for conditional mean and covariance matrices in large and unbalanced panels. The estimator is supported by rigorous consistency results and finite-sample guarantees, ensuring its reliability for empirical applications in Finance. We apply it to an extensive panel of monthly US stock excess returns from 1962 to 2021, using macroeconomic and firm-specific covariates as conditioning variables. The estimator effectively captures time-varying cross-sectional dependencies, demonstrating robust statistical and economic performance. We find that idiosyncratic risk explains, on average, more than 75% of the cross-sectional variance.
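A compact sketch of a kernel-based joint estimator of the conditional mean and covariance matrix, in the spirit of the abstract above: Nadaraya-Watson weights over the conditioning variables, followed by a weighted mean and weighted second moments of the returns. The bandwidth, Gaussian kernel, and synthetic panel are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

def conditional_mean_cov(z0, Z, R, h=0.5):
    """Z: (n, d) conditioning covariates; R: (n, k) returns; z0: query point."""
    w = np.exp(-0.5 * np.sum((Z - z0) ** 2, axis=1) / h**2)
    w = w / w.sum()
    mu = w @ R                                        # conditional mean, shape (k,)
    centered = R - mu
    cov = (centered * w[:, None]).T @ centered        # conditional covariance, shape (k, k)
    return mu, cov

rng = np.random.default_rng(6)
n, d, k = 5000, 3, 4
Z = rng.standard_normal((n, d))                       # macro / firm-specific conditioning variables
R = Z @ rng.standard_normal((d, k)) + rng.standard_normal((n, k))   # excess returns
mu, cov = conditional_mean_cov(np.zeros(d), Z, R)
print(mu.shape, cov.shape)
```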
arXiv:2412.16577v1 Announce Type: new Abstract: Discovering a unique causal structure is difficult due to both inherent identifiability issues and the consequences of finite data. As such, quantifying uncertainty over causal structures, for example through a Bayesian posterior, is often necessary for downstream tasks. Finding an accurate approximation to this posterior is challenging, due to the large number of possible causal graphs, as well as the difficulty of the subproblem of finding posteriors over the functional relationships of the causal edges. Recent works have used meta-learning to view the problem of estimating the maximum a posteriori causal graph as supervised learning. Yet, these methods are limited when estimating the full posterior as they fail to encode key properties of the posterior, such as correlation between edges and permutation equivariance with respect to nodes. Further, these methods also cannot reliably sample from the posterior over causal structures. To address these limitations, we propose a Bayesian meta-learning model that allows for sampling causal structures from the posterior and encodes these key properties. We compare our meta-Bayesian causal discovery against existing Bayesian causal discovery methods, demonstrating the advantages of directly learning a posterior over the causal structure.
arXiv:2412.16691v1 Announce Type: new Abstract: This research presents a three-step causal inference framework that integrates correlation analysis, machine learning-based causality discovery, and LLM-driven interpretations to identify socioeconomic factors influencing carbon emissions and contributing to climate change. The approach begins with identifying correlations, progresses to causal analysis, and enhances decision making through LLM-generated inquiries about the context of climate change. The proposed framework offers adaptable solutions that support data-driven policy-making and strategic decision-making in climate-related contexts, uncovering causal relationships within the climate change domain.
arXiv:2412.16452v1 Announce Type: cross Abstract: Statistical protocols are often used for decision-making involving multiple parties, each with their own incentives, private information, and ability to influence the distributional properties of the data. We study a game-theoretic version of hypothesis testing in which a statistician, also known as a principal, interacts with strategic agents that can generate data. The statistician seeks to design a testing protocol with controlled error, while the data-generating agents, guided by their utility and prior information, choose whether or not to opt in based on expected utility maximization. This strategic behavior affects the data observed by the statistician and, consequently, the associated testing error. We analyze this problem for general concave and monotonic utility functions and prove an upper bound on the Bayes false discovery rate (FDR). Underlying this bound is a form of prior elicitation: we show how an agent's choice to opt in implies a certain upper bound on their prior null probability. Our FDR bound is unimprovable in a strong sense, achieving equality at a single point for an individual agent and at any countable number of points for a population of agents. We also demonstrate that our testing protocols exhibit a desirable maximin property when the principal's utility is considered. To illustrate the qualitative predictions of our theory, we examine the effects of risk aversion, reward stochasticity, and signal-to-noise ratio, as well as the implications for the Food and Drug Administration's testing protocols.
arXiv:2404.05545v2 Announce Type: replace Abstract: Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners consider using large language models (LLMs) to automate decisions, studying their causal reasoning capabilities becomes crucial. A recent line of work evaluates LLMs' ability to retrieve commonsense causal facts, but these evaluations do not sufficiently assess how LLMs reason about interventions. Motivated by the role that interventions play in causal inference, in this paper we conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, enabling a study of intervention-based reasoning. These benchmarks allow us to isolate LLMs' ability to accurately predict the changes resulting from an intervention, separating it from their ability to memorize facts or find other shortcuts. We evaluate six LLMs on the benchmarks, finding that GPT models show promising accuracy at predicting the intervention effects.
arXiv:2405.19317v2 Announce Type: replace Abstract: This study investigates an asymptotically locally minimax optimal algorithm for fixed-budget best-arm identification (BAI). We propose the Generalized Neyman Allocation (GNA) algorithm and demonstrate that its worst-case upper bound on the probability of misidentifying the best arm matches the worst-case lower bound under the small-gap regime, where the gap between the expected outcomes of the best and suboptimal arms is small. Our lower and upper bounds are tight, matching exactly, including constant terms, within the small-gap regime. The GNA algorithm generalizes the Neyman allocation for two-armed bandits (Neyman, 1934; Kaufmann et al., 2016) and refines existing BAI algorithms, such as those proposed by Glynn & Juneja (2004). By proposing an asymptotically minimax optimal algorithm, we address the longstanding open issue in BAI (Kaufmann, 2020) and treatment choice (Kasy & Sautmann, 2021) by restricting the class of distributions to the small-gap regime.
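A minimal sketch of the classical two-armed Neyman allocation that the GNA algorithm generalizes: sample each arm in proportion to its outcome standard deviation, which minimizes the variance of the estimated gap between the arms. The standard deviations and budget below are illustrative.

```python
import numpy as np

def neyman_allocation(sigma1, sigma2, budget):
    """Return the number of draws for arms 1 and 2 under Neyman allocation."""
    n1 = int(round(budget * sigma1 / (sigma1 + sigma2)))
    return n1, budget - n1

sigma1, sigma2, T = 2.0, 1.0, 300
n1, n2 = neyman_allocation(sigma1, sigma2, T)
print(n1, n2)   # the noisier arm gets more samples: (200, 100)
```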