
arXiv:2501.06247v1 Announce Type: new Abstract: Optimal Transport (OT) has established itself as a robust framework for quantifying differences between distributions, with applications that span fields such as machine learning, data science, and computer vision. This paper offers a detailed examination of the OT problem, beginning with its theoretical foundations, including the classical formulations of Monge and Kantorovich and their extensions to modern computational techniques. It explores cutting-edge algorithms, including Sinkhorn iterations, primal-dual strategies, and reduction-based approaches, emphasizing their efficiency and scalability in addressing high-dimensional problems. The paper also highlights emerging trends, such as integrating OT into machine learning frameworks, the development of novel problem variants, and ongoing theoretical advancements. Applications of OT are presented across a range of domains, with particular attention to its innovative application in time series data analysis via Optimal Transport Warping (OTW), a robust alternative to methods like Dynamic Time Warping. Despite the significant progress made, challenges related to scalability, robustness, and ethical considerations remain, necessitating further research. The paper underscores OT's potential to bridge theoretical depth and practical utility, fostering impactful advancements across diverse disciplines.

Sina Moradi · 1/14/2025

arXiv:2501.07503v1 Announce Type: new Abstract: In this paper, we give theoretically and practically efficient implementations of Big Atomics, i.e., $k$-word linearizable registers that support the load, store, and compare-and-swap (CAS) operations. While modern hardware supports $k = 1$ and sometimes $k = 2$ (e.g., double-width compare-and-swap in x86), our implementations support arbitrary $k$. Big Atomics are useful in many applications, including atomic manipulation of tuples, version lists, and implementing load-linked/store-conditional (LL/SC). We design fast, lock-free implementations of big atomics based on a novel fast-path-slow-path approach we develop. We then use them to develop an efficient concurrent hash table, as evidence of their utility. We experimentally validate the approach by comparing a variety of implementations of big atomics under a variety of workloads (thread counts, load/store ratios, contention, oversubscription, and number of atomics). The experiments compare two of our lock-free variants with C++ std::atomic, a lock-based version, a version using sequence locks, and an indirect version. The results show that our approach is close to the fastest under all conditions and far outperforms others under oversubscription. We also compare our big atomics based concurrent hash table to a variety of other state-of-the-art hash tables that support arbitrary length keys and values, including implementations from Intel's TBB, Facebook's Folly, libcuckoo, and a recent release from Boost. The results show that our approach of using big atomics in the design of hash tables is a promising direction.

Daniel Anderson, Guy E. Blelloch, Siddhartha Jayanti · 1/14/2025

arXiv:2412.03120v4 Announce Type: replace Abstract: The Sinkhorn algorithm is the de facto standard approximation algorithm for optimal transport, and it has been applied in a variety of areas, including image processing and natural language processing. In theory, the proof of its convergence follows from the convergence of the Sinkhorn-Knopp algorithm for the matrix scaling problem, and Altschuler et al. show that its worst-case time complexity is near-linear. Very recently, sequentially composed optimal transports were proposed by Watanabe and Isobe as a hierarchical extension of optimal transports. In this paper, we present an efficient approximation algorithm for their entropic regularization, namely a Sinkhorn algorithm for sequentially composed optimal transports. Furthermore, we present a theoretical analysis of this Sinkhorn algorithm, namely (i) its exponential convergence to the optimal solution with respect to the Hilbert pseudometric, and (ii) a worst-case complexity analysis for the case of one sequential composition.

Kazuki Watanabe, Noboru Isobe · 1/14/2025
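As background for the abstract above, here is a minimal sketch of the classical Sinkhorn iteration for ordinary entropic optimal transport between two histograms. It is not the sequentially composed variant the paper proposes; the function name `sinkhorn`, the regularization value, and the toy cost matrix are illustrative assumptions.

```python
import math

def sinkhorn(a, b, cost, reg=0.5, iters=500):
    """Classical Sinkhorn scaling (illustrative sketch, not the paper's algorithm).

    a, b  : source and target histograms (positive, each summing to 1)
    cost  : cost[i][j] is the cost of moving mass from bin i to bin j
    reg   : entropic regularization strength (assumed value)
    Returns the approximate transport plan as a list of rows.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K[i][j] = exp(-cost[i][j] / reg)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Rescale rows so that the row sums of the plan match a.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so that the column sums of the plan match b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

plan = sinkhorn([0.5, 0.5], [0.25, 0.75], [[0.0, 1.0], [1.0, 0.0]])
```

The alternating row and column rescaling is exactly the matrix scaling problem the abstract refers to; the returned plan has marginals close to `a` and `b` once the iteration has converged.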

arXiv:2405.00082v3 Announce Type: replace-cross Abstract: We study the problem of Hamiltonian structure learning from real-time evolution: given the ability to apply $e^{-\mathrm{i} Ht}$ for an unknown local Hamiltonian $H = \sum_{a = 1}^m \lambda_a E_a$ on $n$ qubits, the goal is to recover $H$. This problem is already well-understood under the assumption that the interaction terms, $E_a$, are given, and only the interaction strengths, $\lambda_a$, are unknown. But how efficiently can we learn a local Hamiltonian without prior knowledge of its interaction structure? We present a new, general approach to Hamiltonian learning that not only solves the challenging structure learning variant, but also resolves other open questions in the area, all while achieving the gold standard of Heisenberg-limited scaling. In particular, our algorithm recovers the Hamiltonian to $\varepsilon$ error with total evolution time $O(\log (n)/\varepsilon)$, and has the following appealing properties: (1) it does not need to know the Hamiltonian terms; (2) it works beyond the short-range setting, extending to any Hamiltonian $H$ where the sum of terms interacting with a qubit has bounded norm; (3) it evolves according to $H$ in constant time $t$ increments, thus achieving constant time resolution. As an application, we can also learn Hamiltonians exhibiting power-law decay up to accuracy $\varepsilon$ with total evolution time beating the standard limit of $1/\varepsilon^2$.

Ainesh Bakshi, Allen Liu, Ankur Moitra, Ewin Tang · 1/14/2025

arXiv:2501.06588v1 Announce Type: new Abstract: We consider coresets for $k$-clustering problems, where the goal is to assign points to centers minimizing powers of distances. A popular example is the $k$-median objective $\sum_{p\in P}\min_{c\in C}\mathrm{dist}(p,c)$. Given a point set $P$, a coreset $\Omega$ is a small weighted subset that approximates the cost of $P$ for all candidate solutions $C$ up to a $(1\pm\varepsilon )$ multiplicative factor. In this paper, we give a sharp VC-dimension based analysis for coreset construction. As a consequence, we obtain improved $k$-median coreset bounds for the following metrics: (i) Coresets of size $\tilde{O}\left(k\varepsilon^{-2}\right)$ for shortest path metrics in planar graphs, improving over the bounds $\tilde{O}\left(k\varepsilon^{-6}\right)$ by [Cohen-Addad, Saulpic, Schwiegelshohn, STOC'21] and $\tilde{O}\left(k^2\varepsilon^{-4}\right)$ by [Braverman, Jiang, Krauthgamer, Wu, SODA'21]. (ii) Coresets of size $\tilde{O}\left(kd\ell\varepsilon^{-2}\log m\right)$ for clustering $d$-dimensional polygonal curves of length at most $m$ with curves of length at most $\ell$ with respect to Fréchet metrics, improving over the bounds $\tilde{O}\left(k^3d\ell\varepsilon^{-3}\log m\right)$ by [Braverman, Cohen-Addad, Jiang, Krauthgamer, Schwiegelshohn, Toftrup, and Wu, FOCS'22] and $\tilde{O}\left(k^2d\ell\varepsilon^{-2}\log m \log |P|\right)$ by [Conradi, Kolbe, Psarros, Rohde, SoCG'24].

Vincent Cohen-Addad, Andrew Draganov, Matteo Russo, David Saulpic, Chris Schwiegelshohn · 1/14/2025
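The coreset definition in the abstract above can be made concrete with a small sketch. It evaluates the weighted $k$-median cost and checks the $(1\pm\varepsilon)$ guarantee on a finite list of candidate solutions. Euclidean distance, the helper names, and the toy data are assumptions for illustration; the paper works with planar shortest-path and Fréchet metrics and with all candidate solutions.

```python
import math

def kmedian_cost(points, centers, weights=None):
    """Weighted k-median cost: sum over points of the distance to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def is_coreset_on(points, core_points, core_weights, candidates, eps):
    """Check the (1 +/- eps) guarantee on a finite list of candidate center sets."""
    for centers in candidates:
        full = kmedian_cost(points, centers)
        approx = kmedian_cost(core_points, centers, core_weights)
        if not (1 - eps) * full <= approx <= (1 + eps) * full:
            return False
    return True

# Toy instance: duplicated points collapse into weighted representatives.
P = [(0.0, 0.0), (0.0, 0.0), (4.0, 0.0), (4.0, 0.0)]
core, core_w = [(0.0, 0.0), (4.0, 0.0)], [2.0, 2.0]
candidates = [[(1.0, 0.0)], [(0.0, 0.0), (4.0, 0.0)], [(2.0, 3.0)]]
ok = is_coreset_on(P, core, core_w, candidates, eps=0.1)
```

In this toy case the weighted subset reproduces the cost exactly; the interesting regime is when the coreset is much smaller than the input and the costs agree only approximately.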

arXiv:2501.06246v1 Announce Type: new Abstract: Tokenization is the process of encoding strings into tokens from a fixed vocabulary of size $k$ and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple $(1 - 1/e)$-approximation algorithm GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE, while achieving a comparable objective score as GreedWMC (which could have achieved a higher score due to relaxation).

Jia Peng Lim, Davin Choo, Hady W. Lauw · 1/14/2025
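The abstract above mentions the simple $(1 - 1/e)$-approximation for weighted maximum coverage. A minimal sketch of the textbook greedy rule follows; it is the classical algorithm rather than the authors' GreedTok or their exact GreedWMC implementation, and the set names and weights are made up for illustration.

```python
def greedy_weighted_max_coverage(sets, weights, k):
    """Textbook greedy for weighted maximum coverage (illustrative sketch).

    sets    : dict mapping a set name to the elements it covers
    weights : dict mapping each element to a non-negative weight
    k       : number of sets we may pick
    Returns (chosen set names in pick order, total covered weight).
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0.0
        for name, elems in sets.items():
            if name in chosen:
                continue
            # Marginal gain: weight of elements this set would newly cover.
            gain = sum(weights[e] for e in elems - covered)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # nothing left adds any weight
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, sum(weights[e] for e in covered)

sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}}
weights = {1: 1, 2: 1, 3: 1, 4: 5, 5: 1}
chosen, total = greedy_weighted_max_coverage(sets, weights, k=2)
```

On this toy input greedy picks `B` first and then `A` for a total weight of 8, while the optimum (`A` and `C`) reaches 9, which is consistent with the $(1 - 1/e)$ guarantee.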

arXiv:2404.15554v3 Announce Type: replace Abstract: In the online disjoint set covers problem, the edges of a hypergraph are revealed online, and the goal is to partition them into a maximum number of disjoint set covers. That is, n nodes of a hypergraph are given at the beginning, and then a sequence of hyperedges (subsets of [n]) is presented to an algorithm. For each hyperedge, an online algorithm must assign a color (an integer). Once an input terminates, the gain of the algorithm is the number of colors that correspond to valid set covers (i.e., the union of hyperedges that have that color contains all n nodes). We present a deterministic online algorithm that is O(log^2 n)-competitive, exponentially improving on the previous bound of O(n) and matching the performance of the best randomized algorithm by Emek et al. [ESA 2019]. For color selection, our algorithm uses a novel potential function, which can be seen as an online counterpart of the derandomization method of conditional probabilities and pessimistic estimators. There are only a few cases where derandomization has been successfully used in the field of online algorithms. In contrast to previous approaches, our result extends to the following new challenges: (i) the potential function derandomizes not only the Chernoff bound, but also the coupon collector's problem, (ii) the value of OPT of the maximization problem is not bounded a priori, and (iii) we do not produce a fractional solution first, but work directly on the input.

Marcin Bienkowski, Jarosław Byrka, Łukasz Jeż · 1/14/2025
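The gain defined in the abstract above (the number of colors whose hyperedges together cover all n nodes) can be computed with a short sketch. Numbering the nodes 0 to n-1 and the function name are assumptions for illustration; the online color-selection rule of the paper is not reproduced here.

```python
def count_valid_set_covers(n, hyperedges, colors):
    """Count colors whose hyperedges jointly cover all nodes 0..n-1."""
    covered_by_color = {}
    for edge, color in zip(hyperedges, colors):
        covered_by_color.setdefault(color, set()).update(edge)
    universe = set(range(n))
    # A color is a valid set cover if the union of its hyperedges contains every node.
    return sum(1 for nodes in covered_by_color.values() if nodes >= universe)

hyperedges = [{0, 1}, {2}, {0, 1, 2}, {1}]
colors = [0, 0, 1, 2]  # color assigned to each hyperedge, in arrival order
gain = count_valid_set_covers(3, hyperedges, colors)
```

Here colors 0 and 1 each cover all three nodes, while color 2 covers only node 1, so the gain is 2.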

arXiv:2308.06254v3 Announce Type: replace Abstract: Prize-Collecting TSP is a variant of the traveling salesperson problem where one may drop vertices from the tour at the cost of vertex-dependent penalties. The quality of a solution is then measured by adding the length of the tour and the sum of all penalties of vertices that are not visited. We present a polynomial-time approximation algorithm with an approximation guarantee slightly below $1.6$, where the guarantee is with respect to the natural linear programming relaxation of the problem. This improves upon the previous best-known approximation ratio of $1.774$. Our approach is based on a known decomposition for solutions of this linear relaxation into rooted trees. Our algorithm takes a tree from this decomposition and then performs a pruning step before doing parity correction on the remainder. Using a simple analysis, we bound the approximation guarantee of the proposed algorithm by $(1+\sqrt{5})/2 \approx 1.618$, the golden ratio. With some additional technical care we further improve it to $1.599$. Furthermore, we show that for the path version of Prize-Collecting TSP (known as Prize-Collecting Stroll) our approach yields an approximation guarantee of 1.6662, improving upon the previous best-known guarantee of 1.926.

Jannis Blauth, Nathan Klein, Martin Nägele · 1/14/2025
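The solution quality described in the abstract above (tour length plus the penalties of unvisited vertices) can be sketched as follows. Euclidean coordinates, the vertex names, and the penalty values are assumptions for illustration; the approximation algorithm itself is not shown.

```python
import math

def pctsp_cost(tour, coords, penalty):
    """Prize-Collecting TSP objective: closed-tour length plus penalties of skipped vertices."""
    length = sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )
    skipped = set(coords) - set(tour)
    return length + sum(penalty[v] for v in skipped)

coords = {"a": (0, 0), "b": (3, 0), "c": (3, 4), "d": (100, 100)}
penalty = {"a": 50.0, "b": 50.0, "c": 50.0, "d": 10.0}

skip_d = pctsp_cost(["a", "b", "c"], coords, penalty)        # 3 + 4 + 5 + penalty of d
visit_all = pctsp_cost(["a", "b", "c", "d"], coords, penalty)

golden_ratio = (1 + math.sqrt(5)) / 2  # the bound from the simple analysis
```

Dropping the far-away vertex `d` and paying its penalty is much cheaper here than detouring to visit it, which is the trade-off the problem models.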

arXiv:2411.10719v3 Announce Type: replace Abstract: The Seat Arrangement Problem is a problem of finding a desirable seat arrangement for given preferences of agents and a seat graph that represents a configuration of seats. In this paper, we consider decision problems of determining if an envy-free arrangement exists and an exchange-stable arrangement exists, when a seat graph is an $\ell \times m$ grid graph. When $\ell=1$, the seat graph is a path of length $m$ and both problems have been known to be NP-complete. In this paper, we extend it and show that both problems are NP-complete for any integer $\ell \geq 2$.

Sota Kawase, Shuichi Miyazaki · 1/14/2025

arXiv:2407.04976v3 Announce Type: replace Abstract: We develop a novel algorithm to construct a congestion-approximator with polylogarithmic quality on a capacitated, undirected graph in nearly-linear time. Our approach is the first *bottom-up* hierarchical construction, in contrast to previous *top-down* approaches including that of Räcke, Shah, and Täubig (SODA 2014), the only other construction achieving polylogarithmic quality that is implementable in nearly-linear time (Peng, SODA 2016). Similar to Räcke, Shah, and Täubig, our construction at each hierarchical level requires calls to an approximate max-flow/min-cut subroutine. However, the main advantage to our bottom-up approach is that these max-flow calls can be implemented directly *without recursion*. More precisely, the previously computed levels of the hierarchy can be converted into a *pseudo-congestion-approximator*, which then translates to a max-flow algorithm that is sufficient for the particular max-flow calls used in the construction of the next hierarchical level. As a result, we obtain the first non-recursive algorithms for congestion-approximator and approximate max-flow that run in nearly-linear time, a conceptual improvement to the aforementioned algorithms that recursively alternate between the two problems.

Jason Li, Satish Rao, Di Wang · 1/14/2025

arXiv:2501.06417v1 Announce Type: new Abstract: Quantizing the weights of a neural network has two steps: (1) Finding a good low bit-complexity representation for weights (which we call the quantization grid) and (2) Rounding the original weights to values in the quantization grid. In this paper, we study the problem of rounding optimally given any quantization grid. The simplest and most commonly used way to round is Round-to-Nearest (RTN). By rounding in a data-dependent way instead, one can improve the quality of the quantized model significantly. We study the rounding problem from the lens of \emph{discrepancy theory}, which studies how well we can round a continuous solution to a discrete solution without affecting solution quality too much. We prove that given $m=\mathrm{poly}(1/\epsilon)$ samples from the data distribution, we can round all but $O(m)$ model weights such that the expected approximation error of the quantized model on the true data distribution is $\le \epsilon$ as long as the space of gradients of the original model is approximately low rank (which we empirically validate). Our proof, which is algorithmic, inspired a simple and practical rounding algorithm called \emph{DiscQuant}. In our experiments, we demonstrate that DiscQuant significantly improves over the prior state-of-the-art rounding method called GPTQ and the baseline RTN over a range of benchmarks on Phi3mini-3.8B and Llama3.1-8B. For example, rounding Phi3mini-3.8B to a fixed quantization grid with 3.25 bits per parameter using DiscQuant gets 64\% accuracy on the GSM8k dataset, whereas GPTQ achieves 54\% and RTN achieves 31\% (the original model achieves 84\%). We make our code available at https://github.com/jerry-chee/DiscQuant.

Jerry Chee, Arturs Backurs, Rainie Heck, Li Zhang, Janardhan Kulkarni, Thomas Rothvoss, Sivakanth Gopi · 1/14/2025
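To illustrate the rounding step discussed in the abstract above, here is a minimal sketch of Round-to-Nearest (RTN) on a fixed grid, together with a toy case where a data-aware rounding changes a linear output less than RTN does. The grid, weights, sample, and hand-picked alternative are hypothetical; this is not DiscQuant or GPTQ.

```python
def round_to_nearest(weights, grid):
    """RTN: map each weight to the closest grid value (ties go to the earlier grid entry)."""
    return [min(grid, key=lambda g: abs(g - w)) for w in weights]

def output_error(weights, rounded, x):
    """Absolute change in the linear output <w, x> caused by rounding."""
    return abs(sum((w - q) * xi for w, q, xi in zip(weights, rounded, x)))

grid = [0.0, 1.0]     # hypothetical 1-bit quantization grid
w = [0.4, 0.4]        # original weights
x = [1.0, 1.0]        # one hypothetical data sample

rtn = round_to_nearest(w, grid)   # both weights round down to 0.0
alt = [1.0, 0.0]                  # hand-picked data-aware rounding
err_rtn = output_error(w, rtn, x)
err_alt = output_error(w, alt, x)
```

RTN rounds each weight independently, so both errors point the same way and add up; rounding one weight up and one down lets the errors partially cancel on this sample, which is the intuition behind data-dependent rounding.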

arXiv:2501.06949v1 Announce Type: new Abstract: This thesis investigates three biologically inspired operations: prefix-suffix duplication, bounded prefix-suffix duplication, and prefix-suffix-square completion. Duplication, a common genetic mutation, involves repeating DNA sequences and is modeled here as formal operations on words. The prefix-suffix duplication generates non-context-free languages, even from simple initial words. To better reflect biological processes, we propose a bounded variant that limits duplication length, resolving unsolved problems and aligning with biochemical realities. We also introduce the prefix-suffix-square completion operation, which generates squares at sequence ends. This operation enables the generation of infinite words such as Fibonacci, Period-doubling, and Thue-Morse, which contain squares but avoid higher exponent repetitions, highlighting unique structural properties. In contrast, prefix-suffix duplication cannot generate certain infinite words, such as Thue-Morse, but can produce cube-free words. Additionally, we address the detection of gapped repeats and palindromes, structures important in DNA and RNA analysis. These involve repeating or reversed factors flanking a central gap. Previous studies imposed constraints on gap length or arm-gap relationships; we extend this by solving the problem in three novel settings. This work advances theoretical insights into biologically inspired operations and their computational applications in genetic modeling.

Marius Dumitran · 1/14/2025

arXiv:2501.06872v1 Announce Type: new Abstract: This paper investigates the shared-memory Graph Transposition (GT) problem, a fundamental graph algorithm that is widely used in graph analytics and scientific computing. Previous GT algorithms have significant memory requirements that are proportional to the number of vertices and threads which obstructs their use on large graphs. Moreover, atomic memory operations have become comparably fast on recent CPU architectures, which creates new opportunities for improving the performance of concurrent atomic accesses in GT. We design PoTra, a GT algorithm which leverages graph structure and processor and memory architecture to optimize locality and performance. PoTra limits the size of additional data structures close to CPU cache sizes and utilizes the skewed degree distribution of graph datasets to optimize locality and performance. We present the performance model of PoTra to explain the connection between cache and memory response times and graph locality. Our evaluation of PoTra on three CPU architectures and 20 real-world and synthetic graph datasets with up to 128 billion edges demonstrates that PoTra achieves up to 8.7 times speedup compared to previous works and if there is a performance loss it remains limited to 15.7%, on average.

Mohsen Koohi Esfahani, Hans Vandierendonck · 1/14/2025
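For readers unfamiliar with the Graph Transposition problem in the abstract above, the following is a minimal sequential sketch that transposes a graph stored in CSR form (an offsets array plus a neighbors array). The CSR layout and the function name are assumptions for illustration; this is the textbook counting-based approach, not PoTra.

```python
def transpose_csr(offsets, neighbors):
    """Transpose a directed graph given in CSR form (sequential sketch)."""
    n = len(offsets) - 1
    # Pass 1: count in-degrees, then prefix-sum them into the transposed offsets.
    t_offsets = [0] * (n + 1)
    for v in neighbors:
        t_offsets[v + 1] += 1
    for i in range(n):
        t_offsets[i + 1] += t_offsets[i]
    # Pass 2: scatter each edge u -> v into the slot range of v.
    cursor = t_offsets[:-1]  # next free slot per vertex (slicing makes a copy)
    t_neighbors = [0] * len(neighbors)
    for u in range(n):
        for idx in range(offsets[u], offsets[u + 1]):
            v = neighbors[idx]
            t_neighbors[cursor[v]] = u
            cursor[v] += 1
    return t_offsets, t_neighbors

# Edges: 0 -> 1, 0 -> 2, 1 -> 2
t_offsets, t_neighbors = transpose_csr([0, 2, 3, 3], [1, 2, 2])
```

The per-vertex `cursor` array is extra memory proportional to the number of vertices, and in a multithreaded version the `cursor[v] += 1` step is where an atomic increment (or per-thread copies of the array) would be needed, which relates to the memory and atomics trade-off the abstract describes.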

arXiv:2501.06647v1 Announce Type: new Abstract: Tucker decomposition has been widely used in a variety of applications to obtain latent factors of tensor data. In these applications, a common need is to compute Tucker decomposition for a given time range. Furthermore, real-world tensor time series are typically evolving in the time dimension. Such needs call for a data structure that can efficiently and accurately support range queries of Tucker decomposition and stream updates. Unfortunately, existing methods do not support either range queries or stream updates. This challenging problem has remained open for years prior to our work. To solve this challenging problem, we propose TUCKET, a data structure that can efficiently and accurately handle both range queries and stream updates. Our key idea is to design a new data structure that we call a stream segment tree by generalizing the segment tree, a data structure that was originally invented for computational geometry. For a range query of length $L$, our TUCKET can find $O(\log L)$ nodes (called the hit set) from the tree and efficiently stitch their preprocessed decompositions to answer the range query. We also propose an algorithm to optimally prune the hit set via an approximation of subtensor decomposition. For the $T$-th stream update, our TUCKET modifies only amortized $O(1)$ nodes and only $O(\log T)$ nodes in the worst case. Extensive evaluation demonstrates that our TUCKET consistently achieves the highest efficiency and accuracy across four large-scale datasets. Our TUCKET achieves at least 3 times lower latency and at least 1.4 times smaller reconstruction error than Zoom-Tucker on all datasets.

Ruizhong Qiu, Jun-Gi Jang, Xiao Lin, Lihui Liu, Hanghang Tong · 1/14/2025
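The idea of answering a range query from a logarithmic number of preprocessed tree nodes, as described in the abstract above, can be sketched with the canonical decomposition of an ordinary segment tree. This is the classical structure, not the paper's stream segment tree, and it does not perform any Tucker decomposition; half-open intervals and the function name are assumptions.

```python
def hit_set(node_lo, node_hi, q_lo, q_hi):
    """Canonical segment-tree nodes, as half-open intervals, that tile [q_lo, q_hi)."""
    if q_hi <= node_lo or node_hi <= q_lo:
        return []                      # node is disjoint from the query
    if q_lo <= node_lo and node_hi <= q_hi:
        return [(node_lo, node_hi)]    # node lies fully inside the query
    mid = (node_lo + node_hi) // 2
    return (hit_set(node_lo, mid, q_lo, q_hi)
            + hit_set(mid, node_hi, q_lo, q_hi))

# Time steps 0..15 stored in a tree over [0, 16); query the range [3, 11).
nodes = hit_set(0, 16, 3, 11)
```

Each returned interval corresponds to a node whose summary would have been precomputed, so a query only needs to combine these few summaries instead of touching every time step in the range.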

arXiv:2501.06452v1 Announce Type: new Abstract: In the 3-Hitting Set problem, the input is a hypergraph $G$ such that the size of every hyperedge of $G$ is at most 3, and an integer $k$, and the goal is to decide whether there is a set $S$ of at most $k$ vertices such that every hyperedge of $G$ contains at least one vertex from $S$. In this paper we give an $O^*(2.0409^k)$-time algorithm for 3-Hitting Set.

Dekel Tsur · 1/14/2025
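To make the problem statement above concrete, here is a minimal sketch of the textbook branching algorithm for 3-Hitting Set: pick any unhit hyperedge and try each of its at most three vertices. It runs in roughly $O^*(3^k)$ time and is only a baseline, not the $O^*(2.0409^k)$ algorithm of the paper; the function name and the example instance are illustrative.

```python
def hitting_set(hyperedges, k):
    """Return a hitting set of size at most k, or None if none exists (3^k branching)."""
    if not hyperedges:
        return set()       # every hyperedge is already hit
    if k == 0:
        return None        # hyperedges remain but the budget is used up
    edge = hyperedges[0]
    for v in edge:
        # Put v into the solution and drop every hyperedge it hits.
        remaining = [e for e in hyperedges if v not in e]
        sub = hitting_set(remaining, k - 1)
        if sub is not None:
            return sub | {v}
    return None

edges = [{1, 2, 3}, {3, 4, 5}, {5, 6, 1}]
solution = hitting_set(edges, 2)
```

Some vertex of the chosen hyperedge must belong to any valid solution, so trying all of its vertices is exhaustive; the recursion depth is at most `k` and the branching factor is at most 3.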

arXiv:2501.03688v1 Announce Type: new Abstract: The Closest Vector Problem (CVP) is a computational problem in lattices that is central to modern cryptography. The study of its fine-grained complexity has gained momentum in the last few years, partly due to the upcoming deployment of lattice-based cryptosystems in practice. A main motivating question has been if there is a $(2-\varepsilon)^n$ time algorithm on lattices of rank $n$, or whether it can be ruled out by SETH. Previous work came tantalizingly close to a negative answer by showing a $2^{(1-o(1))n}$ lower bound under SETH if the underlying distance metric is changed from the standard $\ell_2$ norm to other $\ell_p$ norms. Moreover, barriers toward proving such results for $\ell_2$ (and any even $p$) were established. In this paper we show \emph{positive results} for a natural special case of the problem that has hitherto seemed just as hard, namely $(0,1)$-$\mathsf{CVP}$ where the lattice vectors are restricted to be sums of subsets of basis vectors (meaning that all coefficients are $0$ or $1$). All previous hardness results applied to this problem, and none of the previous algorithmic techniques could benefit from it. We prove the following results, which follow from new reductions from $(0,1)$-$\mathsf{CVP}$ to weighted Max-SAT and minimum-weight $k$-Clique. 1. An $O(1.7299^n)$ time algorithm for exact $(0,1)$-$\mathsf{CVP}_2$ in Euclidean norm, breaking the natural $2^n$ barrier, as long as the absolute value of all coordinates in the input vectors is $2^{o(n)}$. 2. A computational equivalence between $(0,1)$-$\mathsf{CVP}_p$ and Max-$p$-SAT for all even $p$. 3. The minimum-weight-$k$-Clique conjecture from fine-grained complexity and its numerous consequences (which include the APSP conjecture) can now be supported by the hardness of a lattice problem, namely $(0,1)$-$\mathsf{CVP}_2$.

Amir Abboud, Rajendra Kumar · 1/8/2025
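The $(0,1)$-CVP special case in the abstract above restricts lattice vectors to sums of subsets of the basis vectors. A brute-force sketch over all $2^n$ subsets shows the natural barrier that the paper's $O(1.7299^n)$ algorithm breaks. The function name and the example basis are assumptions for illustration, and the squared Euclidean norm is used to stay in integer arithmetic.

```python
from itertools import product

def zero_one_cvp(basis, target):
    """Brute-force (0,1)-CVP in the Euclidean norm: try all 2^n subsets of the basis."""
    best = None
    for coeffs in product((0, 1), repeat=len(basis)):
        # Lattice vector obtained by summing the selected basis vectors.
        vec = [sum(c * b[i] for c, b in zip(coeffs, basis))
               for i in range(len(target))]
        dist2 = sum((x - t) ** 2 for x, t in zip(vec, target))
        if best is None or dist2 < best[0]:
            best = (dist2, coeffs)
    return best  # (squared distance, 0/1 coefficient vector)

basis = [(2, 0), (1, 3), (0, 1)]
dist2, coeffs = zero_one_cvp(basis, (3, 4))
```

In this toy instance the target is itself a subset sum of the basis, so the closest vector is at distance 0 with all coefficients equal to 1.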

arXiv:2501.03649v1 Announce Type: new Abstract: The last five years of research on distributed graph algorithms have seen huge leaps of progress, both regarding algorithmic improvements and impossibility results: new strong lower bounds have emerged for many central problems and exponential improvements over the state of the art have been achieved for the runtimes of many algorithms. Nevertheless, there are still large gaps between the best known upper and lower bounds for many important problems. The current lower bound techniques for deterministic algorithms are often tailored to obtaining a logarithmic bound and essentially cannot be used to prove lower bounds beyond $\Omega(\log n)$. In contrast, the best deterministic upper bounds are often polylogarithmic, raising the fundamental question of how to resolve the gap between logarithmic lower and polylogarithmic upper bounds and finally obtain tight bounds. We develop a novel algorithm design technique aimed at closing this gap. In essence, each node finds a carefully chosen local solution in $O(\log n)$ rounds and we guarantee that this solution is consistent with the other nodes' solutions without coordination. The local solutions are based on a distributed version of Hall's theorem that may be of independent interest and motivates the title of this work. We showcase our framework by improving on the state of the art for the following fundamental problems: edge coloring, bipartite saturating matchings and hypergraph sinkless orientation. In particular, we obtain an asymptotically optimal $O(\log n)$-round algorithm for $3\Delta/2$-edge coloring in bounded degree graphs. The previously best bound for the problem was $O(\log^4 n)$ rounds, obtained by plugging in the state-of-the-art maximal independent set algorithm from arXiv:2303.16043 into the $3\Delta/2$-edge coloring algorithm from arXiv:1711.05469 .

Sebastian Brandt, Yannic Maus, Ananth Narayanan, Florian Schager, Jara Uitto · 1/8/2025

arXiv:2501.03663v1 Announce Type: new Abstract: Hybrid $k$-Clustering is a model of clustering that generalizes two of the most widely studied clustering objectives: $k$-Center and $k$-Median. In this model, given a set of $n$ points $P$, the goal is to find $k$ centers such that the sum of the $r$-distances of each point to its nearest center is minimized. The $r$-distance between two points $p$ and $q$ is defined as $\max\{d(p, q)-r, 0\}$ -- this represents the distance of $p$ to the boundary of the $r$-radius ball around $q$ if $p$ is outside the ball, and $0$ otherwise. This problem was recently introduced by Fomin et al. [APPROX 2024], who designed a $(1+\varepsilon, 1+\varepsilon)$-bicriteria approximation that runs in time $2^{(kd/\varepsilon)^{O(1)}} \cdot n^{O(1)}$ for inputs in $\mathbb{R}^d$; such a bicriteria solution uses balls of radius $(1+\varepsilon)r$ instead of $r$, and has a cost at most $1+\varepsilon$ times the cost of an optimal solution using balls of radius $r$. In this paper we significantly improve upon this result by designing an approximation algorithm with the same bicriteria guarantee, but with running time that is FPT only in $k$ and $\varepsilon$ -- crucially, removing the exponential dependence on the dimension $d$. This resolves an open question posed in their paper. Our results extend further in several directions. First, our approximation scheme works in a broader class of metric spaces, including doubling spaces, minor-free, and bounded treewidth metrics. Secondly, our techniques yield similar bicriteria FPT-approximation schemes for other variants of Hybrid $k$-Clustering, e.g., when the objective features the sum of $z$-th power of the $r$-distances. Finally, we also design a coreset for Hybrid $k$-Clustering in doubling spaces, answering another open question from the work of Fomin et al.

Ameet Gadekar, Tanmay Inamdar · 1/8/2025
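The $r$-distance and the Hybrid $k$-Clustering objective defined in the abstract above can be written down directly. The sketch below assumes Euclidean points and uses illustrative function names; it only evaluates the objective for given centers and does not implement the approximation scheme.

```python
import math

def r_distance(p, q, r):
    """max{d(p, q) - r, 0}: distance from p to the ball of radius r around q."""
    return max(math.dist(p, q) - r, 0.0)

def hybrid_cost(points, centers, r):
    """Sum over points of the r-distance to the nearest center."""
    return sum(min(r_distance(p, c, r) for c in centers) for p in points)

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
centers = [(0.0, 0.0)]

cost_r0 = hybrid_cost(points, centers, r=0.0)  # r = 0 gives the k-median cost
cost_r2 = hybrid_cost(points, centers, r=2.0)  # only the far point pays
cost_r5 = hybrid_cost(points, centers, r=5.0)  # zero: radius-5 balls cover everything
```

The two extremes show how the model interpolates: with `r = 0` the objective is the $k$-median cost, and the cost drops to zero exactly when balls of radius `r` around the centers cover all points, which is the $k$-center feasibility question.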

arXiv:2501.03363v1 Announce Type: new Abstract: We consider the optimisation problem of adding $k$ links to a given network, such that the resulting effective graph resistance is as small as possible. The problem was recently proven to be NP-hard, such that optimal solutions obtained with brute-force methods require exponentially many computation steps and thus are infeasible for any graph of realistic size. Therefore, it is common in such cases to use a simple greedy algorithm to obtain an approximation of the optimal solution. It is known that if the considered problem is submodular, the quality of the greedy solution can be guaranteed. However, the optimisation problem we are facing is known not to be submodular. For such cases one can use the notion of generalized submodularity, which is captured by the submodularity ratio $\gamma$. A performance bound, which is a function of $\gamma$, also exists in case of generalized submodularity. In this paper we give an example of a family of graphs where the submodularity ratio approaches zero, implying that the solution quality of the greedy algorithm cannot be guaranteed. Furthermore, we show that the greedy algorithm does not always yield the optimal solution and demonstrate that even for a small graph with 10 nodes, the ratio between the optimal and the greedy solution can be as small as 0.878.

Massimo A. Achterberg, Robert E. Kooij · 1/8/2025

arXiv:2501.03488v1 Announce Type: new Abstract: The Chernoff bound is one of the most widely used tools in theoretical computer science. It's rare to find a randomized algorithm that doesn't employ a Chernoff bound in its analysis. The standard proofs of Chernoff bounds are beautiful but in some ways not very intuitive. In this paper, I'll show you a different proof that has four features: (1) the proof offers a strong intuition for why Chernoff bounds look the way that they do; (2) the proof is user-friendly and (almost) algebra-free; (3) the proof comes with matching lower bounds, up to constant factors in the exponent; and (4) the proof extends to establish generalizations of Chernoff bounds in other settings. The ultimate goal is that, once you know this proof (and with a bit of practice), you should be able to confidently reason about Chernoff-style bounds in your head, extending them to other settings, and convincing yourself that the bounds you're obtaining are tight (up to constant factors in the exponent).

William Kuszmaul · 1/8/2025
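As a numerical companion to the abstract above, the sketch below compares the exact upper tail of a binomial distribution with one standard form of the multiplicative Chernoff bound, $\Pr[X \ge (1+\delta)\mu] \le \exp(-\delta^2\mu/(2+\delta))$ for $\delta > 0$. The specific form of the bound and the parameter values are choices made here for illustration; the paper's proof technique is not reproduced.

```python
import math

def binomial_upper_tail(n, p, threshold):
    """Exact P[X >= threshold] for X ~ Binomial(n, p), with an integer threshold."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(threshold, n + 1))

def chernoff_upper_bound(mu, delta):
    """One standard multiplicative Chernoff bound on P[X >= (1 + delta) * mu], delta > 0."""
    return math.exp(-delta**2 * mu / (2 + delta))

n, p = 100, 0.5
mu = n * p          # expected number of heads: 50
delta = 0.2         # deviation of 20 percent, so the threshold is 60 heads
exact = binomial_upper_tail(n, p, 60)
bound = chernoff_upper_bound(mu, delta)
```

For these small parameters the bound is valid but loose; Chernoff-style bounds are tight only up to constant factors in the exponent, which is the sense of tightness the abstract refers to.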