cs.DC

arXiv:2501.07207v1 Announce Type: new Abstract: Digital infrastructures are seeing convergence and connectivity at unprecedented scale. This is true for both current critical national infrastructures and emerging future systems that are highly cyber-physical in nature with complex intersections between humans and technologies, e.g., smart cities, intelligent transportation, high-value manufacturing and Industry 4.0. Diverse legacy and non-legacy software systems underpinned by heterogeneous hardware compose on-the-fly to deliver services to millions of users with varying requirements and unpredictable actions. This complexity is compounded by intricate supply chains in which many digital assets and services are outsourced to third parties. The reality is that, at any particular point in time, there will be untrusted, partially trusted, or compromised elements across the infrastructure. Given this reality, and the societal scale of digital infrastructures, delivering secure and resilient operations is a major challenge. We argue that this requires us to move beyond the paradigm of security-by-design and embrace the challenge of securing-a-compromised-system.

Awais Rashid, Sana Belguith, Matthew Bradbury, Sadie Creese, Ivan Flechais, Neeraj Suri. 1/14/2025

arXiv:2404.08973v3 Announce Type: replace Abstract: Fairness in federated learning has emerged as a critical concern, aiming to develop a model that is unbiased across groups defined by sensitive features (e.g., male or female). However, there is a trade-off between model performance and fairness, i.e., improving model fairness will decrease model performance. Existing approaches have characterized such a trade-off by introducing hyperparameters to quantify clients' preferences for model fairness and model performance. Nevertheless, these approaches are limited to scenarios where each client has only a single pre-defined preference, and fail to work in practical systems where each client generally has multiple preferences. To this end, we propose a Preference-aware scheme in Fair Federated Learning (called PraFFL) to generate preference-specific models in real time. PraFFL can adaptively adjust the model based on each client's preferences to meet their needs. We theoretically prove that PraFFL can offer the optimal model tailored to an arbitrary preference of each client, and show its linear convergence. Experimental results show that our proposed PraFFL outperforms six fair federated learning algorithms in terms of the model's capability of adapting to clients' different preferences. Our implementation is available at https://github.com/rG223/PraFFL.

Rongguang Ye, Wei-Bin Kou, Ming Tang. 1/14/2025
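
As a rough illustration of the fairness/performance trade-off the abstract describes (not PraFFL's actual mechanism, which produces preference-specific models in real time), the sketch below shows how a client preference vector can weight a performance loss against a fairness penalty; the function name and the (w_perf, w_fair) encoding are hypothetical.

```python
import torch

def preference_scalarized_loss(task_loss: torch.Tensor,
                               fairness_gap: torch.Tensor,
                               preference: tuple[float, float]) -> torch.Tensor:
    """Weight performance against fairness under one client's preference.

    preference = (w_perf, w_fair) is a hypothetical non-negative vector expressing
    how strongly this client weighs accuracy versus group fairness.
    """
    w_perf, w_fair = preference
    return w_perf * task_loss + w_fair * fairness_gap

# A client that weighs fairness twice as heavily as accuracy.
loss = preference_scalarized_loss(torch.tensor(0.7), torch.tensor(0.2), (1.0, 2.0))
print(loss)  # tensor(1.1000)
```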

arXiv:2501.07130v1 Announce Type: new Abstract: Edge computing has become critical for enabling latency-sensitive applications, especially when paired with cloud resources to form cloud-assisted edge clusters. However, efficient resource management remains challenging due to edge nodes' limited capacity and unreliable connectivity. This paper introduces KubeDSM, a Kubernetes-based dynamic scheduling and migration framework tailored for cloud-assisted edge environments. KubeDSM addresses the challenges of resource fragmentation, dynamic scheduling, and live migration while ensuring Quality of Service (QoS) for latency-sensitive applications. Unlike Kubernetes' default scheduler, KubeDSM adopts batch scheduling to minimize resource fragmentation and incorporates a live migration mechanism to optimize edge resource utilization. Specifically, KubeDSM facilitates three key operations: intra-edge migration to reduce fragmentation, edge-to-cloud migration during resource shortages, and cloud-to-edge migration when resources become available, thereby increasing the number of pods allocated to the edge. Our results demonstrate that KubeDSM consistently achieves a higher average edge ratio and a lower standard deviation in edge ratios, highlighting its ability to provide more effective and stable scheduling across different deployments. We also explore the impact of migration strategies and QoS configurations on the edge ratios achieved by KubeDSM. The findings reveal that enabling migrations significantly enhances the edge ratio by reducing fragmentation. Additionally, KubeDSM's adaptability in respecting QoS requirements while maximizing overall edge ratios is confirmed through different QoS scenarios.

Amirhossein Pashaeehir, Sina Shariati, Shayan Shafaghi, Manni Moghimi, Mahmoud Momtazpour. 1/14/2025
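
For intuition about why batch placement and cloud spillover matter for fragmentation, here is a toy best-fit placement sketch; the Node class and the placement policy are illustrative assumptions, not KubeDSM's scheduler or the Kubernetes API.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    capacity: int
    used: int = 0

    def free(self) -> int:
        return self.capacity - self.used

def place(pod_size: int, edges: list[Node], cloud: Node) -> Node:
    """Best-fit onto an edge node to limit fragmentation; spill to the cloud otherwise."""
    fitting = [n for n in edges if n.free() >= pod_size]
    target = min(fitting, key=lambda n: n.free() - pod_size) if fitting else cloud
    target.used += pod_size
    return target

edges = [Node("edge-a", 8), Node("edge-b", 4)]
cloud = Node("cloud", 10**6)
print([place(s, edges, cloud).name for s in (3, 3, 3, 3)])
# ['edge-b', 'edge-a', 'edge-a', 'cloud'] -> the last pod spills edge-to-cloud
```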

arXiv:2501.07056v1 Announce Type: new Abstract: Sparse GEneral Matrix-matrix Multiplication (SpGEMM) is a critical operation in many applications. Current multithreaded implementations are based on Gustavson's algorithm and often perform poorly on large matrices due to limited cache reuse by the accumulators. We present MAGNUS (Matrix Algebra for Gigantic NUmerical Systems), a novel algorithm to maximize data locality in SpGEMM. To generate locality, MAGNUS reorders the intermediate product into discrete cache-friendly chunks using a two-level hierarchical approach. The accumulator is applied to each chunk, where the chunk size is chosen such that the accumulator is cache-efficient. MAGNUS is input- and system-aware: based on the matrix characteristics and target system specifications, the optimal number of chunks is computed by minimizing the storage cost of the necessary data structures. MAGNUS allows for a hybrid accumulation strategy in which each chunk uses a different accumulator based on an input threshold. We consider two accumulators: an AVX-512 vectorized bitonic sorting algorithm and classical dense accumulation. An OpenMP implementation of MAGNUS is compared with several baselines for a variety of matrices on three Intel x86 architectures. For matrices from the SuiteSparse collection, MAGNUS is faster than all the baselines in most cases and is orders of magnitude faster than Intel MKL for several matrices. For massive random matrices that model social network graphs, MAGNUS scales to the largest matrix sizes, while the baselines fail to do so. Furthermore, MAGNUS is close to the optimal bound for these matrices, regardless of the matrix size, structure, and density.

Jordi Wolfson-Pou, Jan Laukemann, Fabrizio Petrini. 1/14/2025
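
For context, a minimal Gustavson-style SpGEMM with a dense accumulator is sketched below; this is the baseline pattern whose accumulator loses cache locality on wide rows, which is what MAGNUS's chunked reordering (not shown) addresses. The dict-of-rows representation is an illustrative stand-in for CSR.

```python
# Gustavson's row-by-row SpGEMM with a dense accumulator, for intuition only.

def spgemm(A: dict[int, dict[int, float]],
           B: dict[int, dict[int, float]],
           n_cols: int) -> dict[int, dict[int, float]]:
    C = {}
    for i, row_a in A.items():
        acc = [0.0] * n_cols                 # dense accumulator for row i of C
        for k, a_ik in row_a.items():        # expand row i: a_ik * B[k, :]
            for j, b_kj in B.get(k, {}).items():
                acc[j] += a_ik * b_kj
        C[i] = {j: v for j, v in enumerate(acc) if v != 0.0}
    return C

A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
print(spgemm(A, B, n_cols=2))   # {0: {1: 16.0}, 1: {0: 15.0}}
```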

arXiv:2501.07435v1 Announce Type: new Abstract: We present Union, a trust-minimized bridge protocol that enables secure transfer of BTC between Bitcoin and a secondary blockchain. The growing ecosystem of blockchain systems built around Bitcoin has created a pressing need for secure and efficient bridges to transfer BTC between networks while preserving Bitcoin's security guarantees. Union employs a multi-party variant of BitVMX, an optimistic proving system on Bitcoin, to create a bridge that operates securely under the assumption that at least one participant remains honest. This 1-of-n honest approach is strikingly different from the conventional honest-majority assumption adopted by practically all federated systems. The protocol introduces several innovations: a packet-based architecture that allows security bonds to be reused for multiple bridge operations, improving capital efficiency; a system of enablers to manage functionaries' participation and to enforce penalties; a flexible light client framework adaptable to various blockchain architectures; and an efficient stopwatch mechanism to optimize time-lock management. Union is a practical and scalable solution for Bitcoin interoperability that maintains strong security guarantees and minimizes trust assumptions.

Ramon Amela (Rootstock Labs), Shreemoy Mishra (Rootstock Labs), Sergio Demian Lerner (Rootstock Labs, Fairgate Labs), Javier Álvarez Cid-Fuentes (Rootstock Labs). 1/14/2025

arXiv:2303.09287v5 Announce Type: replace Abstract: We introduce semitopology, a generalisation of point-set topology that removes the restriction that intersections of open sets must themselves be open. The intuition is that points represent participants in a decentralised system, and open sets represent collections of participants that collectively have the authority to collaborate to update their local state; we call this an actionable coalition. Examples of actionable coalitions include: majority stakes in proof-of-stake blockchains; communicating peers in peer-to-peer networks; and even pedestrians working together to not bump into one another in the street. Where actionable coalitions exist, they have in common that: collaborations are local (updating the states of the participants in the coalition, but not immediately those of the whole system); collaborations are voluntary (up to and including breaking rules); participants may be heterogeneous in their computing power or in their goals (not all pedestrians want to go to the same place); participants can choose with whom to collaborate; and they are not assumed subject to permission or synchronisation by a central authority. We develop a topology-flavoured mathematics that goes some way to explaining how and why these complex decentralised systems can exhibit order, and gives us new ways to understand existing practical implementations.

Murdoch Gabbay. 1/14/2025
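
A compact restatement of the definition sketched in the abstract, in one common presentation (the paper's exact axioms may differ slightly):

```latex
% One common presentation of the definition described in the abstract.
A \emph{semitopology} on a set $\mathsf{P}$ of participants is a set
$\mathsf{Open} \subseteq \mathcal{P}(\mathsf{P})$ such that
$\varnothing \in \mathsf{Open}$, $\mathsf{P} \in \mathsf{Open}$, and
$\bigcup \mathcal{X} \in \mathsf{Open}$ for every $\mathcal{X} \subseteq \mathsf{Open}$.
Unlike a topology, $\mathsf{Open}$ need not be closed under finite intersections;
each $O \in \mathsf{Open}$ is read as an actionable coalition.
```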

arXiv:2501.06856v1 Announce Type: new Abstract: Convolutional neural networks (CNNs) are widely applied in real-time applications on resource-constrained devices. To accelerate CNN inference, prior works proposed to distribute the inference workload across multiple devices. However, they did not address stragglers and device failures in distributed inference, which is challenging due to the devices' time-varying and possibly unknown computation/communication capacities. To address this, we propose a distributed coded inference system, called CoCoI. It splits the convolutional layers of CNN, considering the data dependency of high-dimensional inputs and outputs, and then adapts coding schemes to generate task redundancy. With CoCoI, the inference results can be determined once a subset of devices complete their subtasks, improving robustness against stragglers and failures. To theoretically analyze the tradeoff between redundancy and subtask workload, we formulate an optimal splitting problem to minimize the expected inference latency. Despite its non-convexity, we determine an approximate strategy with minor errors, and prove that CoCoI outperforms uncoded benchmarks. For performance evaluation, we build a testbed with Raspberry Pi 4Bs. The experimental results show that the approximate strategy closely matches the optimal solution. When compared with uncoded benchmarks, CoCoI reduces inference latency by up to 34.2% in the presence of stragglers and device failures.

Xing Liu, Chao Huang, Ming Tang. 1/14/2025
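
The coding idea can be illustrated with a bias-free convolution, which is linear: encode two inputs and their sum, and any two of the three outputs recover both results, so one straggler can be tolerated. This is only the intuition; CoCoI splits the high-dimensional spatial inputs and handles their data dependencies, which is not shown here.

```python
import torch
import torch.nn as nn

# (2,3) coded inference for a linear (bias-free) conv layer: three "workers" compute
# conv(x1), conv(x2), and conv(x1 + x2); any two outputs recover both results.
torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
x1, x2 = torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)

y1, y2, y_par = conv(x1), conv(x2), conv(x1 + x2)

# Suppose the worker holding y2 straggles or fails: reconstruct it from the parity.
y2_rec = y_par - y1
print(torch.allclose(y2_rec, y2, atol=1e-5))   # True (up to floating-point error)
```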

arXiv:2501.06780v1 Announce Type: new Abstract: Recently, crossbar array based in-memory accelerators have been gaining interest due to their high throughput and energy efficiency. While software and compiler support for the in-memory accelerators has also been introduced, they are currently limited to the case where all weights are assumed to be on-chip. This limitation becomes apparent with the significantly increasing network sizes compared to the in-memory footprint. Weight replacement schemes are essential to address this issue. We propose COMPASS, a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators. COMPASS is specially targeted for networks that exceed the capacity of PIM crossbar arrays, necessitating access to external memories. We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip. Our scheme takes into account the data dependence between layers, core utilization, and the number of write instructions to minimize latency and memory accesses and to improve energy efficiency. Simulation results demonstrate that COMPASS can accommodate many more networks with a minimal memory footprint, while improving throughput by 1.78X and providing 1.28X savings in energy-delay product (EDP) over baseline partitioning methods.

Jihoon Park, Jeongin Choe, Dohyun Kim, Jae-Joon Kim. 1/14/2025

arXiv:2501.06872v1 Announce Type: new Abstract: This paper investigates the shared-memory Graph Transposition (GT) problem, a fundamental graph algorithm that is widely used in graph analytics and scientific computing. Previous GT algorithms have significant memory requirements that are proportional to the number of vertices and threads, which obstructs their use on large graphs. Moreover, atomic memory operations have become comparably fast on recent CPU architectures, which creates new opportunities for improving the performance of concurrent atomic accesses in GT. We design PoTra, a GT algorithm which leverages graph structure and processor and memory architecture to optimize locality and performance. PoTra limits the size of additional data structures close to CPU cache sizes and utilizes the skewed degree distribution of graph datasets to optimize locality and performance. We present the performance model of PoTra to explain the connection between cache and memory response times and graph locality. Our evaluation of PoTra on three CPU architectures and 20 real-world and synthetic graph datasets with up to 128 billion edges demonstrates that PoTra achieves speedups of up to 8.7 times over previous work, while any performance loss is limited to 15.7% on average.

Mohsen Koohi Esfahani, Hans Vandierendonck. 1/14/2025

arXiv:2501.07036v1 Announce Type: new Abstract: Given a positive integer $k$, $k$-set agreement is the distributed task in which each process $i \in [n]$ in a group of $n$ processing nodes starts with an input value $x_i$ in the set $\{0,\dots,k\}$, and must output a value $y_i$ such that (1) for every $i \in [n]$, $y_i$ is the input value of some process, and (2) $|\{y_i : i \in [n]\}| \leq k$. That is, at most $k$ different values in total may be output by the processes. The case $k=1$ corresponds to (binary) consensus, arguably the most studied problem in distributed computing. While lower bounds for consensus have been obtained for most of the standard distributed computing models, the design of lower bounds for $k$-set agreement with $k>1$ is notoriously more difficult, and remains open for many models. The main techniques for designing lower bounds for $k$-set agreement with $k>1$ use tools from algebraic topology. These tools are difficult to manipulate and require great care to avoid mistakes. This difficulty increases when the communications are mediated by a network of arbitrary structure. Recently, the KNOWALL model was specifically designed as a first attempt to understand the LOCAL model through the lens of algebraic topology, and Castañeda et al. (2021) designed lower bounds for $k$-set agreement in the KNOWALL model, with applications to dynamic networks. In this work, we re-prove the same lower bound for $k$-set agreement in the KNOWALL model. This new proof stands out in its simplicity, which makes it accessible to a broader audience and increases confidence in the result.

Pierre Fraigniaud, Minh Hang Nguyen, Ami Paz. 1/14/2025
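
To make the two conditions concrete, here is a direct check of validity and $k$-agreement on candidate outputs (verification only, not a solving protocol):

```python
def is_valid_k_set_agreement(inputs: list[int], outputs: list[int], k: int) -> bool:
    validity = all(y in inputs for y in outputs)   # every output is some process's input
    agreement = len(set(outputs)) <= k             # at most k distinct outputs overall
    return validity and agreement

# n = 4 processes, k = 2: two distinct output values are allowed, three are not.
print(is_valid_k_set_agreement([0, 1, 2, 2], [0, 0, 2, 2], k=2))  # True
print(is_valid_k_set_agreement([0, 1, 2, 2], [0, 1, 2, 2], k=2))  # False
```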

arXiv:2501.06531v1 Announce Type: new Abstract: Recent advances have improved the throughput and latency of blockchains by processing transactions accessing different parts of the state concurrently. However, these systems are unable to concurrently process (a) transactions accessing the same state, even if they are (almost) commutative, e.g., payments much smaller than an account's balance, and (b) multi-party transactions, e.g., asset swaps. Moreover, they are slow to recover from contention, requiring once-in-a-day synchronization. We present Stingray, a novel blockchain architecture that addresses these limitations. The key conceptual contributions are a replicated bounded counter that processes (almost) commutative transactions concurrently, and a FastUnlock protocol that uses a fallback consensus protocol for fast contention recovery. We prove Stingray's security in an asynchronous network with Byzantine faults and demonstrate on a global testbed that Stingray achieves 10,000 times the throughput of prior systems for commutative workloads.

Srivatsan Sridhar, Alberto Sonnino, Lefteris Kokoris-Kogias. 1/14/2025
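
An escrow-style toy version of the replicated bounded counter idea: the spendable balance is pre-split across replicas so that payments much smaller than the balance can be approved locally and concurrently. The class and allocation scheme are illustrative assumptions; Stingray's actual counter and its FastUnlock fallback are not shown.

```python
class BoundedCounterShard:
    def __init__(self, allocation: int):
        self.allocation = allocation        # this replica's share of the balance

    def try_spend(self, amount: int) -> bool:
        if amount <= self.allocation:       # small payment: no coordination needed
            self.allocation -= amount
            return True
        return False                        # would need rebalancing / the fallback path

shards = [BoundedCounterShard(250) for _ in range(4)]   # balance 1000 split 4 ways
print(shards[0].try_spend(10), shards[1].try_spend(10)) # True True (concurrent approvals)
print(shards[2].try_spend(400))                         # False -> contention, fall back
```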

arXiv:2501.06650v1 Announce Type: new Abstract: Split Learning (SL) is a distributed deep learning approach enabling multiple clients and a server to collaboratively train and infer on a shared deep neural network (DNN) without requiring clients to share their private local data. The DNN is partitioned in SL, with most layers residing on the server and a few initial layers and inputs on the client side. This configuration allows resource-constrained clients to participate in training and inference. However, the distributed architecture exposes SL to backdoor attacks, where malicious clients can manipulate local datasets to alter the DNN's behavior. Existing defenses from other distributed frameworks like Federated Learning are not applicable, and there is a lack of effective backdoor defenses specifically designed for SL. We present SafeSplit, the first defense against client-side backdoor attacks in SL. SafeSplit enables the server to detect and filter out malicious client behavior by employing circular backward analysis after a client's training is completed, iteratively reverting to a trained checkpoint where the model under examination is found to be benign. It uses a two-fold analysis to identify client-induced changes and detect poisoned models. First, a static analysis in the frequency domain measures the differences in the server-side layer parameters. Second, a dynamic analysis introduces a novel rotational distance metric that assesses the orientation shifts of the server's layer parameters during training. Our comprehensive evaluation across various data distributions, client counts, and attack scenarios demonstrates the high efficacy of this dual analysis in mitigating backdoor attacks while preserving model utility.

Phillip Rieger, Alessandro Pegoraro, Kavita Kumari, Tigist Abera, Jonathan Knauer, Ahmad-Reza Sadeghi. 1/14/2025
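
One plausible reading of the rotational distance idea is the angle between the server-side layer parameters before and after a client's round; the paper's exact metric and thresholds may differ, so treat this purely as an illustration.

```python
import torch

def rotational_distance(theta_before: torch.Tensor, theta_after: torch.Tensor) -> float:
    """Angle (radians) between two flattened parameter vectors; a large shift is suspicious."""
    a, b = theta_before.flatten(), theta_after.flatten()
    cos = torch.dot(a, b) / (a.norm() * b.norm() + 1e-12)
    return torch.arccos(cos.clamp(-1.0, 1.0)).item()

before = torch.randn(1000)
benign = before + 0.01 * torch.randn(1000)       # small, benign-looking update
suspicious = torch.randn(1000)                   # unrelated direction
print(rotational_distance(before, benign))       # close to 0
print(rotational_distance(before, suspicious))   # close to pi/2 in high dimensions
```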

arXiv:2501.07503v1 Announce Type: new Abstract: In this paper, we give theoretically and practically efficient implementations of Big Atomics, i.e., $k$-word linearizable registers that support the load, store, and compare-and-swap (CAS) operations. While modern hardware supports $k = 1$ and sometimes $k = 2$ (e.g., double-width compare-and-swap in x86), our implementations support arbitrary $k$. Big Atomics are useful in many applications, including atomic manipulation of tuples, version lists, and implementing load-linked/store-conditional (LL/SC). We design fast, lock-free implementations of big atomics based on a novel fast-path-slow-path approach we develop. We then use them to develop an efficient concurrent hash table, as evidence of their utility. We experimentally validate the approach by comparing a variety of implementations of big atomics under a variety of workloads (thread counts, load/store ratios, contention, oversubscription, and number of atomics). The experiments compare two of our lock-free variants with C++ std::atomic, a lock-based version, a version using sequence locks, and an indirect version. The results show that our approach is close to the fastest under all conditions and far outperforms others under oversubscription. We also compare our big atomics based concurrent hash table to a variety of other state-of-the-art hash tables that support arbitrary length keys and values, including implementations from Intel's TBB, Facebook's Folly, libcuckoo, and a recent release from Boost. The results show that our approach of using big atomics in the design of hash tables is a promising direction.

Daniel Anderson, Guy E. Blelloch, Siddhartha Jayanti. 1/14/2025

arXiv:2501.07526v1 Announce Type: new Abstract: Distributed-memory implementations of numerical optimization algorithms, such as stochastic gradient descent (SGD), require interprocessor communication at every iteration of the algorithm. On modern distributed-memory clusters where communication is more expensive than computation, the scalability and performance of these algorithms are limited by communication cost. This work generalizes prior work on 1D $s$-step SGD and 1D Federated SGD with Averaging (FedAvg) to yield a 2D parallel SGD method (HybridSGD) which attains a continuous performance trade-off between the two baseline algorithms. We present theoretical analysis showing the convergence, computation, communication, and memory trade-offs between $s$-step SGD, FedAvg, 2D parallel SGD, and other parallel SGD variants. We implement all algorithms in C++ and MPI and evaluate their performance on a Cray EX supercomputing system. Our empirical results show that HybridSGD achieves better convergence than FedAvg at similar processor scales while attaining speedups of $5.3\times$ over $s$-step SGD and up to $121\times$ over FedAvg when used to solve binary classification tasks with a convex logistic regression model on datasets obtained from the LIBSVM repository.

Aditya Devarakonda, Ramakrishnan Kannan. 1/14/2025
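
A toy simulation of the 2D idea under stated assumptions: gradients are combined synchronously inside each group (the $s$-step-SGD-like axis) and models are averaged across groups every tau steps (the FedAvg-like axis). The grid shape, schedule, and logistic-regression objective are illustrative choices, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(512, 10)), rng.integers(0, 2, size=512).astype(float)
groups, per_group, tau, lr = 4, 2, 5, 0.1          # 4 x 2 process grid, average every 5 steps
models = [np.zeros(10) for _ in range(groups)]
shards = np.array_split(np.arange(512), groups * per_group)

def grad(w, idx):                                   # logistic-regression gradient on a shard
    p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
    return X[idx].T @ (p - y[idx]) / len(idx)

for step in range(50):
    for g in range(groups):                         # within a group: average members' gradients
        member_shards = shards[g * per_group:(g + 1) * per_group]
        g_avg = np.mean([grad(models[g], s) for s in member_shards], axis=0)
        models[g] -= lr * g_avg
    if (step + 1) % tau == 0:                       # across groups: FedAvg-style model averaging
        avg = np.mean(models, axis=0)
        models = [avg.copy() for _ in range(groups)]
```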

arXiv:2501.06244v1 Announce Type: new Abstract: With the growing demand for Earth observation, it is important to provide reliable real-time remote sensing inference services to meet low-latency requirements. The Space Computing Power Network (Space-CPN) offers a promising solution by providing onboard computing and extensive coverage capabilities for real-time inference. This paper presents a deployment framework for remote sensing artificial intelligence applications, designed for Low Earth Orbit satellite constellations to achieve real-time inference performance. The framework employs a microservice architecture, decomposing monolithic inference tasks into reusable, independent modules to address high latency and resource heterogeneity. This distributed approach enables optimized microservice deployment, minimizing resource utilization while meeting quality of service and functional requirements. We introduce Robust Optimization to the deployment problem to address data uncertainty. Additionally, we model the Robust Optimization problem as a Partially Observable Markov Decision Process and propose a robust reinforcement learning algorithm to handle the semi-infinite Quality of Service constraints. Our approach yields sub-optimal solutions that minimize accuracy loss while maintaining acceptable computational costs. Simulation results demonstrate the effectiveness of our framework.

Zhiyong Yu, Yuning Jiang, Xin Liu, Yuanming Shi, Chunxiao Jiang, Linling Kuang. 1/14/2025

arXiv:2501.06589v1 Announce Type: new Abstract: Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to efficiently scale. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, using model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping that effectively hides the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve a 30% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting Transformer model as the Ladder Transformer. We train a 1B and 3B Ladder Transformer from scratch and observe comparable performance to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by only retraining for 3B tokens.

Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao. 1/14/2025
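
A sketch of the scheduling intuition under stated assumptions: each block's tensor-parallel reduction is launched asynchronously and folded into the residual stream only after the next block has computed, so communication can overlap with computation. AsyncAllReduce is a mock stand-in for an asynchronous collective, and the exact Ladder Residual wiring may differ from this sketch.

```python
import torch
import torch.nn as nn

class AsyncAllReduce:
    """Mock async collective; a real run would use torch.distributed's async ops."""
    def __init__(self, t: torch.Tensor):
        self.t = t
    def wait(self) -> torch.Tensor:
        return self.t

class Block(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.ff = nn.Linear(d, d)
    def forward(self, x):
        return self.ff(x)

d = 64
blocks = [Block(d) for _ in range(4)]
x = torch.randn(2, d)

pending = None                           # communication in flight for the previous block
for blk in blocks:
    out = blk(x)                         # compute on a residual that does not yet include
                                         # the previous block's (still-reducing) output
    if pending is not None:
        x = x + pending.wait()           # previous block's reduce had a whole block to finish
    pending = AsyncAllReduce(out)        # launch this block's reduce; defer adding it
x = x + pending.wait()                   # flush the last block's output
```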

arXiv:2501.06706v1 Announce Type: new Abstract: AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOPSLAB, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOPSLAB can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOPSLAB, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.

Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, Saravan Rajmohan. 1/14/2025

arXiv:2501.06709v1 Announce Type: new Abstract: Serving large language models (LLMs) for massive users is challenged by the significant memory footprint of the transient state, known as the key-value (KV) cache, which scales with sequence length and number of requests. Instead of renting or buying more expensive GPUs, the load imbalance of the KV cache across GPUs, coupled with recent advances in inter-GPU communication, provides an opportunity to serve more requests via request migration. However, high migration overhead and unpredictable request patterns make it challenging. Therefore, this paper proposes MELL, a memory-efficient LLM serving system via multi-GPU KV cache management. It reduces the number of GPUs needed in the system by considering the dynamic KV cache load and the costly request migration. Specifically, we first develop an adaptive request migration mechanism to balance the computational and communication overheads and adapt to diverse resource conditions. Then, we design an online algorithm tailored to a multi-LLM request and multi-GPU scheduling problem with migration enabled. It aims to minimise the number of required GPUs while limiting the number of migrations. Finally, we implement a prototype of MELL and demonstrate that it reduces the number of GPUs by up to 31% and increases GPU utilization by up to 43% compared to existing LLM serving systems.

Liu Qianli, Hong Zicong, Chen Fahao, Li Peng, Guo Song. 1/14/2025
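
A toy version of the migration decision the abstract describes: move a request's KV cache from the most loaded GPU to the least loaded one only when the gap outweighs an abstract migration cost and the move strictly reduces imbalance. The cost model and thresholds are hypothetical; MELL's online algorithm is considerably more involved.

```python
def plan_migrations(kv_load: dict[str, list[int]], migration_cost: int) -> list[tuple]:
    """Greedy KV-cache rebalancing between the most and least loaded GPUs."""
    moves = []
    while True:
        totals = {g: sum(reqs) for g, reqs in kv_load.items()}
        src = max(totals, key=totals.get)
        dst = min(totals, key=totals.get)
        gap = totals[src] - totals[dst]
        # Only requests small enough to strictly reduce the imbalance are candidates,
        # and only if the gap is worth more than the (abstract) migration cost.
        candidates = [r for r in kv_load[src] if 2 * r < gap]
        if gap <= 2 * migration_cost or not candidates:
            return moves
        req = max(candidates)                     # move the largest helpful KV cache
        kv_load[src].remove(req)
        kv_load[dst].append(req)
        moves.append((req, src, dst))

load = {"gpu0": [40, 35, 30], "gpu1": [10], "gpu2": [12, 8]}
print(plan_migrations(load, migration_cost=5))    # [(40, 'gpu0', 'gpu1')]
```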

arXiv:2501.06242v1 Announce Type: new Abstract: 5G technology enhances industries with high-speed, reliable, low-latency communication, revolutionizing mobile broadband and supporting massive IoT connectivity. With the increasing complexity of applications on User Equipment (UE), offloading resource-intensive tasks to robust servers is essential for improving latency and speed. The 3GPP's Multi-access Edge Computing (MEC) framework addresses this challenge by processing tasks closer to the user, highlighting the need for an intelligent controller to optimize task offloading and resource allocation. This paper introduces a novel methodology to efficiently allocate both communication and computational resources among individual UEs. Our approach integrates two critical 5G service imperatives: Ultra-Reliable Low Latency Communication (URLLC) and Massive Machine Type Communication (mMTC), embedding them into the decision-making framework. Central to this approach is the utilization of Proximal Policy Optimization, providing a robust and efficient solution to the challenges posed by the evolving landscape of 5G technology. The proposed model is evaluated in a simulated 5G MEC environment. According to the reported simulation results, the model reduces processing time by 4% for URLLC users under strict latency constraints and decreases power consumption by 26% for mMTC users, compared to existing baseline models. These improvements showcase the model's adaptability and superior performance in meeting diverse QoS requirements in 5G networks.

Alireza Ebrahimi, Fatemeh Afghah. 1/14/2025

arXiv:2406.17918v4 Announce Type: replace Abstract: In our recent research, we have developed a framework called GraphSnapShot, which has proven to be a useful tool for graph learning acceleration. GraphSnapShot is a framework for fast cache, storage, retrieval and computation for graph learning. It can quickly store and update the local topology of a graph structure, allowing us to track patterns in the structure of graph networks, much like taking snapshots of the graphs. In experiments, GraphSnapShot shows its efficiency: it can achieve up to 30% training acceleration and 73% memory reduction for lossless graph ML training compared to current baselines such as DGL. This technique is particularly useful for large dynamic graph learning tasks such as social media analysis and recommendation systems that process complex relationships between entities. The code for GraphSnapShot is publicly available at https://github.com/NoakLiu/GraphSnapShot.

Dong Liu, Roger Waleffe, Meng Jiang, Shivaram Venkataraman. 1/14/2025