cs.DC

arXiv:2501.07207v1 Announce Type: new Abstract: Digital infrastructures are seeing convergence and connectivity at unprecedented scale. This is true for both current critical national infrastructures and emerging future systems that are highly cyber-physical in nature with complex intersections between humans and technologies, e.g., smart cities, intelligent transportation, high-value manufacturing and Industry 4.0. Diverse legacy and non-legacy software systems underpinned by heterogeneous hardware compose on-the-fly to deliver services to millions of users with varying requirements and unpredictable actions. This complexity is compounded by intricate supply chains in which many digital assets and services are outsourced to third parties. The reality is that, at any particular point in time, there will be untrusted, partially trusted, or compromised elements across the infrastructure. Given this reality, and the societal scale of digital infrastructures, delivering secure and resilient operations is a major challenge. We argue that this requires us to move beyond the paradigm of security-by-design and embrace the challenge of securing-a-compromised-system.

Awais Rashid, Sana Belguith, Matthew Bradbury, Sadie Creese, Ivan Flechais, Neeraj Suri. 1/14/2025

arXiv:2404.08973v3 Announce Type: replace Abstract: Fairness in federated learning has emerged as a critical concern, aiming to develop a model that is unbiased across groups defined by sensitive features (e.g., male or female). However, there is a trade-off between model performance and fairness, i.e., improving model fairness will decrease model performance. Existing approaches have characterized such a trade-off by introducing hyperparameters to quantify clients' preferences for model fairness and model performance. Nevertheless, these approaches are limited to scenarios where each client has only a single pre-defined preference, and fail to work in practical systems where each client generally has multiple preferences. To this end, we propose a Preference-aware scheme in Fair Federated Learning (called PraFFL) to generate preference-specific models in real time. PraFFL can adaptively adjust the model based on each client's preferences to meet their needs. We theoretically prove that PraFFL can offer the optimal model tailored to an arbitrary preference of each client, and show its linear convergence. Experimental results show that our proposed PraFFL outperforms six fair federated learning algorithms in terms of the model's capability of adapting to clients' different preferences. Our implementation is available at https://github.com/rG223/PraFFL.

Rongguang Ye, Wei-Bin Kou, Ming Tang. 1/14/2025
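
As a rough illustration of the fairness/performance trade-off the abstract describes (not PraFFL's actual mechanism, which produces preference-specific models in real time), the sketch below shows how a client preference vector can weight a performance loss against a fairness penalty; the function name and the (w_perf, w_fair) encoding are hypothetical.

```python
import torch

def preference_scalarized_loss(task_loss: torch.Tensor,
                               fairness_gap: torch.Tensor,
                               preference: tuple[float, float]) -> torch.Tensor:
    """Weight performance against fairness under one client's preference.

    preference = (w_perf, w_fair) is a hypothetical non-negative vector expressing
    how strongly this client weighs accuracy versus group fairness.
    """
    w_perf, w_fair = preference
    return w_perf * task_loss + w_fair * fairness_gap

# A client that weighs fairness twice as heavily as accuracy.
loss = preference_scalarized_loss(torch.tensor(0.7), torch.tensor(0.2), (1.0, 2.0))
print(loss)  # tensor(1.1000)
```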

arXiv:2501.07130v1 Announce Type: new Abstract: Edge computing has become critical for enabling latency-sensitive applications, especially when paired with cloud resources to form cloud-assisted edge clusters. However, efficient resource management remains challenging due to edge nodes' limited capacity and unreliable connectivity. This paper introduces KubeDSM, a Kubernetes-based dynamic scheduling and migration framework tailored for cloud-assisted edge environments. KubeDSM addresses the challenges of resource fragmentation, dynamic scheduling, and live migration while ensuring Quality of Service (QoS) for latency-sensitive applications. Unlike Kubernetes' default scheduler, KubeDSM adopts batch scheduling to minimize resource fragmentation and incorporates a live migration mechanism to optimize edge resource utilization. Specifically, KubeDSM facilitates three key operations: intra-edge migration to reduce fragmentation, edge-to-cloud migration during resource shortages, and cloud-to-edge migration when resources become available, thereby increasing the number of pods allocated to the edge. Our results demonstrate that KubeDSM consistently achieves a higher average edge ratio and a lower standard deviation in edge ratios, highlighting its ability to provide more effective and stable scheduling across different deployments. We also explore the impact of migration strategies and QoS configurations on the edge ratios achieved by KubeDSM. The findings reveal that enabling migrations significantly enhances the edge ratio by reducing fragmentation. Additionally, KubeDSM's adaptability in respecting QoS requirements while maximizing overall edge ratios is confirmed through different QoS scenarios.

Amirhossein Pashaeehir, Sina Shariati, Shayan Shafaghi, Manni Moghimi, Mahmoud Momtazpour. 1/14/2025
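
For intuition about why batch placement and cloud spillover matter for fragmentation, here is a toy best-fit placement sketch; the Node class and the placement policy are illustrative assumptions, not KubeDSM's scheduler or the Kubernetes API.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    capacity: int
    used: int = 0

    def free(self) -> int:
        return self.capacity - self.used

def place(pod_size: int, edges: list[Node], cloud: Node) -> Node:
    """Best-fit onto an edge node to limit fragmentation; spill to the cloud otherwise."""
    fitting = [n for n in edges if n.free() >= pod_size]
    target = min(fitting, key=lambda n: n.free() - pod_size) if fitting else cloud
    target.used += pod_size
    return target

edges = [Node("edge-a", 8), Node("edge-b", 4)]
cloud = Node("cloud", 10**6)
print([place(s, edges, cloud).name for s in (3, 3, 3, 3)])
# ['edge-b', 'edge-a', 'edge-a', 'cloud'] -> the last pod spills edge-to-cloud
```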

arXiv:2501.07056v1 Announce Type: new Abstract: Sparse GEneral Matrix-matrix Multiplication (SpGEMM) is a critical operation in many applications. Current multithreaded implementations are based on Gustavson's algorithm and often perform poorly on large matrices due to limited cache reuse by the accumulators. We present MAGNUS (Matrix Algebra for Gigantic NUmerical Systems), a novel algorithm to maximize data locality in SpGEMM. To generate locality, MAGNUS reorders the intermediate product into discrete cache-friendly chunks using a two-level hierarchical approach. The accumulator is applied to each chunk, where the chunk size is chosen such that the accumulator is cache-efficient. MAGNUS is input- and system-aware: based on the matrix characteristics and target system specifications, the optimal number of chunks is computed by minimizing the storage cost of the necessary data structures. MAGNUS allows for a hybrid accumulation strategy in which each chunk uses a different accumulator based on an input threshold. We consider two accumulators: an AVX-512 vectorized bitonic sorting algorithm and classical dense accumulation. An OpenMP implementation of MAGNUS is compared with several baselines for a variety of matrices on three Intel x86 architectures. For matrices from the SuiteSparse collection, MAGNUS is faster than all the baselines in most cases and is orders of magnitude faster than Intel MKL for several matrices. For massive random matrices that model social network graphs, MAGNUS scales to the largest matrix sizes, while the baselines fail to do so. Furthermore, MAGNUS is close to the optimal bound for these matrices, regardless of the matrix size, structure, and density.

Jordi Wolfson-Pou, Jan Laukemann, Fabrizio Petrini. 1/14/2025
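
For context, a minimal Gustavson-style SpGEMM with a dense accumulator is sketched below; this is the baseline pattern whose accumulator loses cache locality on wide rows, which is what MAGNUS's chunked reordering (not shown) addresses. The dict-of-rows representation is an illustrative stand-in for CSR.

```python
# Gustavson's row-by-row SpGEMM with a dense accumulator, for intuition only.

def spgemm(A: dict[int, dict[int, float]],
           B: dict[int, dict[int, float]],
           n_cols: int) -> dict[int, dict[int, float]]:
    C = {}
    for i, row_a in A.items():
        acc = [0.0] * n_cols                 # dense accumulator for row i of C
        for k, a_ik in row_a.items():        # expand row i: a_ik * B[k, :]
            for j, b_kj in B.get(k, {}).items():
                acc[j] += a_ik * b_kj
        C[i] = {j: v for j, v in enumerate(acc) if v != 0.0}
    return C

A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
print(spgemm(A, B, n_cols=2))   # {0: {1: 16.0}, 1: {0: 15.0}}
```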

arXiv:2501.07435v1 Announce Type: new Abstract: We present Union, a trust-minimized bridge protocol that enables secure transfer of BTC between Bitcoin and a secondary blockchain. The growing ecosystem of blockchain systems built around Bitcoin has created a pressing need for secure and efficient bridges to transfer BTC between networks while preserving Bitcoin's security guarantees. Union employs a multi-party variant of BitVMX, an optimistic proving system on Bitcoin, to create a bridge that operates securely under the assumption that at least one participant remains honest. This 1-of-n honest approach is strikingly different from the conventional honest-majority assumption adopted by practically all federated systems. The protocol introduces several innovations: a packet-based architecture that allows security bonds to be reused for multiple bridge operations, improving capital efficiency; a system of enablers to manage functionaries' participation and to enforce penalties; a flexible light client framework adaptable to various blockchain architectures; and an efficient stopwatch mechanism to optimize time-lock management. Union is a practical and scalable solution for Bitcoin interoperability that maintains strong security guarantees and minimizes trust assumptions.

Ramon Amela (Rootstock Labs), Shreemoy Mishra (Rootstock Labs), Sergio Demian Lerner (Rootstock Labs, Fairgate Labs), Javier Álvarez Cid-Fuentes (Rootstock Labs). 1/14/2025

arXiv:2303.09287v5 Announce Type: replace Abstract: We introduce semitopology, a generalisation of point-set topology that removes the restriction that intersections of open sets must themselves be open. The intuition is that points represent participants in a decentralised system, and open sets represent collections of participants that collectively have the authority to collaborate to update their local state; we call this an actionable coalition. Examples of actionable coalitions include: majority stakes in proof-of-stake blockchains; communicating peers in peer-to-peer networks; and even pedestrians working together to not bump into one another in the street. Where actionable coalitions exist, they have in common that: collaborations are local (updating the states of the participants in the coalition, but not immediately those of the whole system); collaborations are voluntary (up to and including breaking rules); participants may be heterogeneous in their computing power or in their goals (not all pedestrians want to go to the same place); participants can choose with whom to collaborate; and they are not assumed subject to permission or synchronisation by a central authority. We develop a topology-flavoured mathematics that goes some way to explaining how and why these complex decentralised systems can exhibit order, and gives us new ways to understand existing practical implementations.

Murdoch Gabbay. 1/14/2025
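
A compact restatement of the definition sketched in the abstract, in one common presentation (the paper's exact axioms may differ slightly):

```latex
% One common presentation of the definition described in the abstract.
A \emph{semitopology} on a set $\mathsf{P}$ of participants is a set
$\mathsf{Open} \subseteq \mathcal{P}(\mathsf{P})$ such that
$\varnothing \in \mathsf{Open}$, $\mathsf{P} \in \mathsf{Open}$, and
$\bigcup \mathcal{X} \in \mathsf{Open}$ for every $\mathcal{X} \subseteq \mathsf{Open}$.
Unlike a topology, $\mathsf{Open}$ need not be closed under finite intersections;
each $O \in \mathsf{Open}$ is read as an actionable coalition.
```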

arXiv:2501.06856v1 Announce Type: new Abstract: Convolutional neural networks (CNNs) are widely applied in real-time applications on resource-constrained devices. To accelerate CNN inference, prior works proposed to distribute the inference workload across multiple devices. However, they did not address stragglers and device failures in distributed inference, which is challenging due to the devices' time-varying and possibly unknown computation/communication capacities. To address this, we propose a distributed coded inference system, called CoCoI. It splits the convolutional layers of CNN, considering the data dependency of high-dimensional inputs and outputs, and then adapts coding schemes to generate task redundancy. With CoCoI, the inference results can be determined once a subset of devices complete their subtasks, improving robustness against stragglers and failures. To theoretically analyze the tradeoff between redundancy and subtask workload, we formulate an optimal splitting problem to minimize the expected inference latency. Despite its non-convexity, we determine an approximate strategy with minor errors, and prove that CoCoI outperforms uncoded benchmarks. For performance evaluation, we build a testbed with Raspberry Pi 4Bs. The experimental results show that the approximate strategy closely matches the optimal solution. When compared with uncoded benchmarks, CoCoI reduces inference latency by up to 34.2% in the presence of stragglers and device failures.

Xing Liu, Chao Huang, Ming Tang. 1/14/2025
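
The coding idea can be illustrated with a bias-free convolution, which is linear: encode two inputs and their sum, and any two of the three outputs recover both results, so one straggler can be tolerated. This is only the intuition; CoCoI splits the high-dimensional spatial inputs and handles their data dependencies, which is not shown here.

```python
import torch
import torch.nn as nn

# (2,3) coded inference for a linear (bias-free) conv layer: three "workers" compute
# conv(x1), conv(x2), and conv(x1 + x2); any two outputs recover both results.
torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
x1, x2 = torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)

y1, y2, y_par = conv(x1), conv(x2), conv(x1 + x2)

# Suppose the worker holding y2 straggles or fails: reconstruct it from the parity.
y2_rec = y_par - y1
print(torch.allclose(y2_rec, y2, atol=1e-5))   # True (up to floating-point error)
```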

arXiv:2501.06780v1 Announce Type: new Abstract: Recently, crossbar array based in-memory accelerators have been gaining interest due to their high throughput and energy efficiency. While software and compiler support for the in-memory accelerators has also been introduced, they are currently limited to the case where all weights are assumed to be on-chip. This limitation becomes apparent with the significantly increasing network sizes compared to the in-memory footprint. Weight replacement schemes are essential to address this issue. We propose COMPASS, a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators. COMPASS is specially targeted for networks that exceed the capacity of PIM crossbar arrays, necessitating access to external memories. We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip. Our scheme takes into account the data dependence between layers, core utilization, and the number of write instructions to minimize latency and memory accesses and to improve energy efficiency. Simulation results demonstrate that COMPASS can accommodate many more networks with a minimal memory footprint, while improving throughput by 1.78X and providing 1.28X savings in energy-delay product (EDP) over baseline partitioning methods.

Jihoon Park, Jeongin Choe, Dohyun Kim, Jae-Joon Kim. 1/14/2025

arXiv:2501.06872v1 Announce Type: new Abstract: This paper investigates the shared-memory Graph Transposition (GT) problem, a fundamental graph algorithm that is widely used in graph analytics and scientific computing. Previous GT algorithms have significant memory requirements that are proportional to the number of vertices and threads, which obstructs their use on large graphs. Moreover, atomic memory operations have become comparably fast on recent CPU architectures, which creates new opportunities for improving the performance of concurrent atomic accesses in GT. We design PoTra, a GT algorithm which leverages graph structure and processor and memory architecture to optimize locality and performance. PoTra limits the size of additional data structures close to CPU cache sizes and utilizes the skewed degree distribution of graph datasets to optimize locality and performance. We present the performance model of PoTra to explain the connection between cache and memory response times and graph locality. Our evaluation of PoTra on three CPU architectures and 20 real-world and synthetic graph datasets with up to 128 billion edges demonstrates that PoTra achieves speedups of up to 8.7 times over previous work, while any performance loss is limited to 15.7% on average.

Mohsen Koohi Esfahani, Hans Vandierendonck. 1/14/2025

arXiv:2501.07036v1 Announce Type: new Abstract: Given a positive integer $k$, $k$-set agreement is the distributed task in which each process $i \in [n]$ in a group of $n$ processing nodes starts with an input value $x_i$ in the set $\{0,\dots,k\}$, and must output a value $y_i$ such that (1) for every $i \in [n]$, $y_i$ is the input value of some process, and (2) $|\{y_i : i \in [n]\}| \leq k$. That is, at most $k$ different values in total may be output by the processes. The case $k=1$ corresponds to (binary) consensus, arguably the most studied problem in distributed computing. While lower bounds for consensus have been obtained for most of the standard distributed computing models, the design of lower bounds for $k$-set agreement with $k>1$ is notoriously more difficult, and remains open for many models. The main techniques for designing lower bounds for $k$-set agreement with $k>1$ use tools from algebraic topology. These tools are difficult to manipulate and require great care to avoid mistakes. This difficulty increases when the communications are mediated by a network of arbitrary structure. Recently, the KNOWALL model was specifically designed as a first attempt to understand the LOCAL model through the lens of algebraic topology, and Castañeda et al. (2021) designed lower bounds for $k$-set agreement in the KNOWALL model, with applications to dynamic networks. In this work, we re-prove the same lower bound for $k$-set agreement in the KNOWALL model. This new proof stands out in its simplicity, which makes it accessible to a broader audience and increases confidence in the result.

Pierre Fraigniaud, Minh Hang Nguyen, Ami Paz. 1/14/2025
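
To make the two conditions concrete, here is a direct check of validity and $k$-agreement on candidate outputs (verification only, not a solving protocol):

```python
def is_valid_k_set_agreement(inputs: list[int], outputs: list[int], k: int) -> bool:
    validity = all(y in inputs for y in outputs)   # every output is some process's input
    agreement = len(set(outputs)) <= k             # at most k distinct outputs overall
    return validity and agreement

# n = 4 processes, k = 2: two distinct output values are allowed, three are not.
print(is_valid_k_set_agreement([0, 1, 2, 2], [0, 0, 2, 2], k=2))  # True
print(is_valid_k_set_agreement([0, 1, 2, 2], [0, 1, 2, 2], k=2))  # False
```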

arXiv:2501.06531v1 Announce Type: new Abstract: Recent advances have improved the throughput and latency of blockchains by processing transactions accessing different parts of the state concurrently. However, these systems are unable to concurrently process (a) transactions accessing the same state, even if they are (almost) commutative, e.g., payments much smaller than an account's balance, and (b) multi-party transactions, e.g., asset swaps. Moreover, they are slow to recover from contention, requiring once-in-a-day synchronization. We present Stingray, a novel blockchain architecture that addresses these limitations. The key conceptual contributions are a replicated bounded counter that processes (almost) commutative transactions concurrently, and a FastUnlock protocol that uses a fallback consensus protocol for fast contention recovery. We prove Stingray's security in an asynchronous network with Byzantine faults and demonstrate on a global testbed that Stingray achieves 10,000 times the throughput of prior systems for commutative workloads.

Srivatsan Sridhar, Alberto Sonnino, Lefteris Kokoris-Kogias. 1/14/2025
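
An escrow-style toy version of the replicated bounded counter idea: the spendable balance is pre-split across replicas so that payments much smaller than the balance can be approved locally and concurrently. The class and allocation scheme are illustrative assumptions; Stingray's actual counter and its FastUnlock fallback are not shown.

```python
class BoundedCounterShard:
    def __init__(self, allocation: int):
        self.allocation = allocation        # this replica's share of the balance

    def try_spend(self, amount: int) -> bool:
        if amount <= self.allocation:       # small payment: no coordination needed
            self.allocation -= amount
            return True
        return False                        # would need rebalancing / the fallback path

shards = [BoundedCounterShard(250) for _ in range(4)]   # balance 1000 split 4 ways
print(shards[0].try_spend(10), shards[1].try_spend(10)) # True True (concurrent approvals)
print(shards[2].try_spend(400))                         # False -> contention, fall back
```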

arXiv:2501.06650v1 Announce Type: new Abstract: Split Learning (SL) is a distributed deep learning approach enabling multiple clients and a server to collaboratively train and infer on a shared deep neural network (DNN) without requiring clients to share their private local data. The DNN is partitioned in SL, with most layers residing on the server and a few initial layers and inputs on the client side. This configuration allows resource-constrained clients to participate in training and inference. However, the distributed architecture exposes SL to backdoor attacks, where malicious clients can manipulate local datasets to alter the DNN's behavior. Existing defenses from other distributed frameworks like Federated Learning are not applicable, and there is a lack of effective backdoor defenses specifically designed for SL. We present SafeSplit, the first defense against client-side backdoor attacks in SL. SafeSplit enables the server to detect and filter out malicious client behavior by employing circular backward analysis after a client's training is completed, iteratively reverting to a trained checkpoint where the model under examination is found to be benign. It uses a two-fold analysis to identify client-induced changes and detect poisoned models. First, a static analysis in the frequency domain measures the differences in the server-side layer parameters. Second, a dynamic analysis introduces a novel rotational distance metric that assesses the orientation shifts of the server's layer parameters during training. Our comprehensive evaluation across various data distributions, client counts, and attack scenarios demonstrates the high efficacy of this dual analysis in mitigating backdoor attacks while preserving model utility.

Phillip Rieger, Alessandro Pegoraro, Kavita Kumari, Tigist Abera, Jonathan Knauer, Ahmad-Reza Sadeghi. 1/14/2025
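
One plausible reading of the rotational distance idea is the angle between the server-side layer parameters before and after a client's round; the paper's exact metric and thresholds may differ, so treat this purely as an illustration.

```python
import torch

def rotational_distance(theta_before: torch.Tensor, theta_after: torch.Tensor) -> float:
    """Angle (radians) between two flattened parameter vectors; a large shift is suspicious."""
    a, b = theta_before.flatten(), theta_after.flatten()
    cos = torch.dot(a, b) / (a.norm() * b.norm() + 1e-12)
    return torch.arccos(cos.clamp(-1.0, 1.0)).item()

before = torch.randn(1000)
benign = before + 0.01 * torch.randn(1000)       # small, benign-looking update
suspicious = torch.randn(1000)                   # unrelated direction
print(rotational_distance(before, benign))       # close to 0
print(rotational_distance(before, suspicious))   # close to pi/2 in high dimensions
```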

arXiv:2501.07503v1 Announce Type: new Abstract: In this paper, we give theoretically and practically efficient implementations of Big Atomics, i.e., $k$-word linearizable registers that support the load, store, and compare-and-swap (CAS) operations. While modern hardware supports $k = 1$ and sometimes $k = 2$ (e.g., double-width compare-and-swap in x86), our implementations support arbitrary $k$. Big Atomics are useful in many applications, including atomic manipulation of tuples, version lists, and implementing load-linked/store-conditional (LL/SC). We design fast, lock-free implementations of big atomics based on a novel fast-path-slow-path approach we develop. We then use them to develop an efficient concurrent hash table, as evidence of their utility. We experimentally validate the approach by comparing a variety of implementations of big atomics under a variety of workloads (thread counts, load/store ratios, contention, oversubscription, and number of atomics). The experiments compare two of our lock-free variants with C++ std::atomic, a lock-based version, a version using sequence locks, and an indirect version. The results show that our approach is close to the fastest under all conditions and far outperforms others under oversubscription. We also compare our big atomics based concurrent hash table to a variety of other state-of-the-art hash tables that support arbitrary length keys and values, including implementations from Intel's TBB, Facebook's Folly, libcuckoo, and a recent release from Boost. The results show that our approach of using big atomics in the design of hash tables is a promising direction.

Daniel Anderson, Guy E. Blelloch, Siddhartha Jayanti. 1/14/2025

arXiv:2501.07526v1 Announce Type: new Abstract: Distributed-memory implementations of numerical optimization algorithms, such as stochastic gradient descent (SGD), require interprocessor communication at every iteration of the algorithm. On modern distributed-memory clusters where communication is more expensive than computation, the scalability and performance of these algorithms are limited by communication cost. This work generalizes prior work on 1D $s$-step SGD and 1D Federated SGD with Averaging (FedAvg) to yield a 2D parallel SGD method (HybridSGD) which attains a continuous performance trade-off between the two baseline algorithms. We present theoretical analysis showing the convergence, computation, communication, and memory trade-offs between $s$-step SGD, FedAvg, 2D parallel SGD, and other parallel SGD variants. We implement all algorithms in C++ and MPI and evaluate their performance on a Cray EX supercomputing system. Our empirical results show that HybridSGD achieves better convergence than FedAvg at similar processor scales while attaining speedups of $5.3\times$ over $s$-step SGD and up to $121\times$ over FedAvg when used to solve binary classification tasks with a convex logistic regression model on datasets obtained from the LIBSVM repository.

Aditya Devarakonda, Ramakrishnan Kannan. 1/14/2025
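
A toy simulation of the 2D idea under stated assumptions: gradients are combined synchronously inside each group (the $s$-step-SGD-like axis) and models are averaged across groups every tau steps (the FedAvg-like axis). The grid shape, schedule, and logistic-regression objective are illustrative choices, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(512, 10)), rng.integers(0, 2, size=512).astype(float)
groups, per_group, tau, lr = 4, 2, 5, 0.1          # 4 x 2 process grid, average every 5 steps
models = [np.zeros(10) for _ in range(groups)]
shards = np.array_split(np.arange(512), groups * per_group)

def grad(w, idx):                                   # logistic-regression gradient on a shard
    p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
    return X[idx].T @ (p - y[idx]) / len(idx)

for step in range(50):
    for g in range(groups):                         # within a group: average members' gradients
        member_shards = shards[g * per_group:(g + 1) * per_group]
        g_avg = np.mean([grad(models[g], s) for s in member_shards], axis=0)
        models[g] -= lr * g_avg
    if (step + 1) % tau == 0:                       # across groups: FedAvg-style model averaging
        avg = np.mean(models, axis=0)
        models = [avg.copy() for _ in range(groups)]
```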

arXiv:2501.06244v1 Announce Type: new Abstract: With the growing demand for Earth observation, it is important to provide reliable real-time remote sensing inference services to meet low-latency requirements. The Space Computing Power Network (Space-CPN) offers a promising solution by providing onboard computing and extensive coverage capabilities for real-time inference. This paper presents a deployment framework for remote sensing artificial intelligence applications, designed for Low Earth Orbit satellite constellations to achieve real-time inference performance. The framework employs a microservice architecture, decomposing monolithic inference tasks into reusable, independent modules to address high latency and resource heterogeneity. This distributed approach enables optimized microservice deployment, minimizing resource utilization while meeting quality of service and functional requirements. We introduce Robust Optimization to the deployment problem to address data uncertainty. Additionally, we model the Robust Optimization problem as a Partially Observable Markov Decision Process and propose a robust reinforcement learning algorithm to handle the semi-infinite Quality of Service constraints. Our approach yields sub-optimal solutions that minimize accuracy loss while maintaining acceptable computational costs. Simulation results demonstrate the effectiveness of our framework.

Zhiyong Yu, Yuning Jiang, Xin Liu, Yuanming Shi, Chunxiao Jiang, Linling Kuang. 1/14/2025

arXiv:2501.06589v1 Announce Type: new Abstract: Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to efficiently scale. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, using model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping that effectively hides the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve a 30% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting Transformer model as the Ladder Transformer. We train a 1B and 3B Ladder Transformer from scratch and observe comparable performance to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by only retraining for 3B tokens.

Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao. 1/14/2025
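
A sketch of the scheduling intuition under stated assumptions: each block's tensor-parallel reduction is launched asynchronously and folded into the residual stream only after the next block has computed, so communication can overlap with computation. AsyncAllReduce is a mock stand-in for an asynchronous collective, and the exact Ladder Residual wiring may differ from this sketch.

```python
import torch
import torch.nn as nn

class AsyncAllReduce:
    """Mock async collective; a real run would use torch.distributed's async ops."""
    def __init__(self, t: torch.Tensor):
        self.t = t
    def wait(self) -> torch.Tensor:
        return self.t

class Block(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.ff = nn.Linear(d, d)
    def forward(self, x):
        return self.ff(x)

d = 64
blocks = [Block(d) for _ in range(4)]
x = torch.randn(2, d)

pending = None                           # communication in flight for the previous block
for blk in blocks:
    out = blk(x)                         # compute on a residual that does not yet include
                                         # the previous block's (still-reducing) output
    if pending is not None:
        x = x + pending.wait()           # previous block's reduce had a whole block to finish
    pending = AsyncAllReduce(out)        # launch this block's reduce; defer adding it
x = x + pending.wait()                   # flush the last block's output
```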

arXiv:2501.06706v1 Announce Type: new Abstract: AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOPSLAB, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOPSLAB can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOPSLAB, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.

Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, Saravan Rajmohan. 1/14/2025

arXiv:2501.06709v1 Announce Type: new Abstract: Serving large language models (LLMs) for massive users is challenged by the significant memory footprint of the transient state, known as the key-value (KV) cache, which scales with sequence length and number of requests. Instead of renting or buying more expensive GPUs, the load imbalance of the KV cache across GPUs, coupled with recent advances in inter-GPU communication, provides an opportunity to serve more requests via request migration. However, high migration overhead and unpredictable request patterns make it challenging. Therefore, this paper proposes MELL, a memory-efficient LLM serving system via multi-GPU KV cache management. It reduces the number of GPUs needed in the system by considering the dynamic KV cache load and the costly request migration. Specifically, we first develop an adaptive request migration mechanism to balance the computational and communication overheads and adapt to diverse resource conditions. Then, we design an online algorithm tailored to a multi-LLM request and multi-GPU scheduling problem with migration enabled. It aims to minimise the number of required GPUs while limiting the number of migrations. Finally, we implement a prototype of MELL and demonstrate that it reduces the number of GPUs by up to 31% and increases GPU utilization by up to 43% compared to existing LLM serving systems.

Liu Qianli, Hong Zicong, Chen Fahao, Li Peng, Guo Song. 1/14/2025
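
A toy version of the migration decision the abstract describes: move a request's KV cache from the most loaded GPU to the least loaded one only when the gap outweighs an abstract migration cost and the move strictly reduces imbalance. The cost model and thresholds are hypothetical; MELL's online algorithm is considerably more involved.

```python
def plan_migrations(kv_load: dict[str, list[int]], migration_cost: int) -> list[tuple]:
    """Greedy KV-cache rebalancing between the most and least loaded GPUs."""
    moves = []
    while True:
        totals = {g: sum(reqs) for g, reqs in kv_load.items()}
        src = max(totals, key=totals.get)
        dst = min(totals, key=totals.get)
        gap = totals[src] - totals[dst]
        # Only requests small enough to strictly reduce the imbalance are candidates,
        # and only if the gap is worth more than the (abstract) migration cost.
        candidates = [r for r in kv_load[src] if 2 * r < gap]
        if gap <= 2 * migration_cost or not candidates:
            return moves
        req = max(candidates)                     # move the largest helpful KV cache
        kv_load[src].remove(req)
        kv_load[dst].append(req)
        moves.append((req, src, dst))

load = {"gpu0": [40, 35, 30], "gpu1": [10], "gpu2": [12, 8]}
print(plan_migrations(load, migration_cost=5))    # [(40, 'gpu0', 'gpu1')]
```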

arXiv:2501.06242v1 Announce Type: new Abstract: 5G technology enhances industries with high-speed, reliable, low-latency communication, revolutionizing mobile broadband and supporting massive IoT connectivity. With the increasing complexity of applications on User Equipment (UE), offloading resource-intensive tasks to robust servers is essential for improving latency and speed. The 3GPP's Multi-access Edge Computing (MEC) framework addresses this challenge by processing tasks closer to the user, highlighting the need for an intelligent controller to optimize task offloading and resource allocation. This paper introduces a novel methodology to efficiently allocate both communication and computational resources among individual UEs. Our approach integrates two critical 5G service imperatives: Ultra-Reliable Low Latency Communication (URLLC) and Massive Machine Type Communication (mMTC), embedding them into the decision-making framework. Central to this approach is the utilization of Proximal Policy Optimization, providing a robust and efficient solution to the challenges posed by the evolving landscape of 5G technology. The proposed model is evaluated in a simulated 5G MEC environment. According to the reported simulation results, the model reduces processing time by 4% for URLLC users under strict latency constraints and decreases power consumption by 26% for mMTC users, compared to existing baseline models. These improvements showcase the model's adaptability and superior performance in meeting diverse QoS requirements in 5G networks.

Alireza Ebrahimi, Fatemeh Afghah. 1/14/2025

arXiv:2406.17918v4 Announce Type: replace Abstract: In our recent research, we have developed a framework called GraphSnapShot, which has proven to be a useful tool for graph learning acceleration. GraphSnapShot is a framework for fast cache, storage, retrieval and computation for graph learning. It can quickly store and update the local topology of a graph structure, allowing us to track patterns in the structure of graph networks, much like taking snapshots of the graphs. In experiments, GraphSnapShot shows its efficiency: it can achieve up to 30% training acceleration and 73% memory reduction for lossless graph ML training compared to current baselines such as DGL. This technique is particularly useful for large dynamic graph learning tasks such as social media analysis and recommendation systems that process complex relationships between entities. The code for GraphSnapShot is publicly available at https://github.com/NoakLiu/GraphSnapShot.

Dong Liu, Roger Waleffe, Meng Jiang, Shivaram Venkataraman. 1/14/2025