cs.PF


arXiv:2501.00279v1 Announce Type: new Abstract: BLAS is a fundamental building block of advanced linear algebra libraries and many modern scientific computing applications. GPUs are known for their strong arithmetic computing capabilities and are highly suited for BLAS operations. However, porting code to GPUs often requires significant effort, especially for large, complex codes or legacy codes, even for BLAS-heavy applications. While various tools exist to automatically offload BLAS to GPUs, they are often impractical due to the high costs associated with mandatory data transfers. The advent of unified memory architectures in recent GPU designs, such as the NVIDIA Grace-Hopper, allows cache-coherent memory access across all types of memory for both CPU and GPU, potentially eliminating the bottlenecks faced in conventional architectures. This breakthrough paves the way for innovative application developments and porting strategies. Building on our preliminary work demonstrating the potential of automatic *gemm offload, this paper extends the framework to all level-3 BLAS operations and introduces SCILIB-Accel, a novel tool for automatic BLAS offload. SCILIB-Accel leverages the memory coherency in Grace-Hopper and introduces a Device First-Use data movement policy inspired by the OpenMP First-Touch approach in multi-socket CPU programming, minimizing CPU-GPU data transfers for typical scientific computing codes. Additionally, utilizing dynamic binary instrumentation, the tool intercepts BLAS symbols directly from a CPU binary, requiring no code modifications or recompilation. SCILIB-Accel has been evaluated using multiple quantum physics codes on up to a few hundred GPU nodes, yielding promising speedups. Notably, for the LSMS method in the MuST suite, a 3x speedup was achieved on Grace-Hopper compared to Grace-Grace.

Junjie Li (1/3/2025)
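
A minimal sketch of the Device First-Use idea described above: the first compute unit to use a buffer determines where it lives, so an interposed BLAS call can leave operands GPU-resident after their first use and pay the transfer cost only once. Everything below (the FirstUsePolicy class and the dgemm stand-in) is hypothetical and only illustrates the bookkeeping; SCILIB-Accel itself intercepts BLAS symbols via dynamic binary instrumentation, not a Python wrapper.

```python
# Hypothetical sketch of a Device First-Use placement policy, not SCILIB-Accel's code.
# The first device to use a buffer "owns" it; later BLAS calls reuse that placement,
# so repeatedly used operands stay GPU-resident and CPU-GPU traffic is paid once.
import numpy as np

class FirstUsePolicy:
    def __init__(self):
        self.placement = {}  # buffer id -> "cpu" or "gpu"

    def place(self, buf):
        # First use by the GPU pins the buffer there; subsequent calls are no-ops.
        return self.placement.setdefault(id(buf), "gpu")

policy = FirstUsePolicy()

def dgemm(a, b):
    """Stand-in for an intercepted BLAS dgemm call."""
    loc_a, loc_b = policy.place(a), policy.place(b)
    # A real interposer would launch the GPU gemm here once operands are resident.
    print(f"dgemm: A on {loc_a}, B on {loc_b}")
    return a @ b

a, b = np.random.rand(4, 4), np.random.rand(4, 4)
dgemm(a, b)   # first use: buffers get pinned to the GPU
dgemm(a, b)   # reuse: no further transfers needed
```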

arXiv:2501.01058v1 Announce Type: cross Abstract: The MaxCut problem is a fundamental problem in Combinatorial Optimization, with significant implications across diverse domains such as logistics, network design, and statistical physics. The proposed algorithm balances theoretical rigor with practical scalability. It introduces a Quantum Genetic Algorithm (QGA) using a Grover-based evolutionary framework and divide-and-conquer principles. By partitioning graphs into manageable subgraphs, optimizing each independently, and applying graph contraction to merge the solutions, the method exploits the inherent binary symmetry of MaxCut to ensure computational efficiency and robust approximation performance. Theoretical analysis establishes a foundation for the efficiency of the algorithm, while empirical evaluations provide quantitative evidence of its effectiveness. On complete graphs, the proposed method consistently achieves the true optimal MaxCut values, outperforming the Semidefinite Programming (SDP) approach, which provides up to 99.7% of the optimal solution for larger graphs. On Erdős-Rényi random graphs, the QGA demonstrates competitive performance, achieving median solutions within 92-96% of the SDP results. These results showcase the potential of the QGA framework to deliver competitive solutions, even under heuristic constraints, while demonstrating its promise for scalability as quantum hardware evolves.

Paulo A. Viana, Fernando M. de Paula Neto (1/3/2025)
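
The binary symmetry the abstract mentions is easy to see in a brute-force baseline: a cut and its complement have the same value, so vertex 0 can be fixed to one side and the search space halves. A small illustrative sketch (exhaustive search, not the QGA itself):

```python
# Exhaustive MaxCut on a small graph, exploiting the binary symmetry of cuts:
# assignment s and its complement give the same cut value, so pin vertex 0.
from itertools import product

def max_cut(n, edges):
    best_value, best_cut = -1, None
    for rest in product([0, 1], repeat=n - 1):
        s = (0,) + rest                     # vertex 0 pinned: halves the search space
        value = sum(1 for u, v in edges if s[u] != s[v])
        if value > best_value:
            best_value, best_cut = value, s
    return best_value, best_cut

# 4-cycle: the optimal cut separates opposite corners and cuts all 4 edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(max_cut(4, edges))  # (4, (0, 1, 0, 1))
```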

arXiv:2410.21349v3 Announce Type: replace Abstract: Recently, large language models (LLMs) have achieved significant progress in automated code generation. Despite their strong instruction-following capabilities, these models frequently struggle to align with user intent in coding scenarios. In particular, they are hampered by datasets that lack diversity and fail to address specialized tasks or edge cases. Furthermore, challenges in supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) lead to failures in generating precise, human-intent-aligned code. To tackle these challenges and improve the code generation performance for automated programming systems, we propose Feedback-driven Adaptive Long/short-term memory reinforced Coding Optimization (i.e., FALCON). FALCON is structured into two hierarchical levels. At the global level, long-term memory improves code quality by retaining and applying learned knowledge. At the local level, short-term memory allows for the incorporation of immediate feedback from compilers and AI systems. Additionally, we introduce meta-reinforcement learning with feedback rewards to solve the global-local bi-level optimization problem and enhance the model's adaptability across diverse code generation tasks. Extensive experiments demonstrate that our technique achieves state-of-the-art performance, leading other reinforcement learning methods by more than 4.5 percentage points on the MBPP benchmark and 6.1 percentage points on the HumanEval benchmark. The open-sourced code is publicly available at https://github.com/titurte/FALCON.

Zeyuan Li, Yangfan He, Lewei He, Jianhui Wang, Tianyu Shi, Bin Lei, Yuchen Li, Qiuwu Chen (1/3/2025)
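
A hedged sketch of the two-level feedback loop described above, with stub functions in place of the LLM, compiler, and reward model; the names (generate, compile_and_test, long_term) are ours for illustration, not FALCON's API.

```python
# Illustrative two-level feedback loop in the spirit of FALCON; all functions
# below are stubs, not the released implementation.

def generate(task, hints, errors):
    """Stub LLM call: produce code from the task, long-term hints, recent errors."""
    return f"# code for {task!r} using hints={hints} avoiding errors={errors}"

def compile_and_test(code):
    """Stub compiler/AI feedback: return (passed, error message)."""
    return True, ""

long_term = {}        # global level: knowledge retained across attempts and tasks

def solve(task, max_rounds=3):
    short_term = []   # local level: immediate compiler/AI feedback for this task
    hints = long_term.get(task, [])
    for _ in range(max_rounds):
        code = generate(task, hints, short_term)
        passed, error = compile_and_test(code)
        if passed:
            long_term.setdefault(task, []).append("pattern that worked")
            return code
        short_term.append(error)   # fold feedback into the next attempt
    return None

print(solve("reverse a linked list"))
```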

arXiv:2501.00203v1 Announce Type: new Abstract: Driven by artificial intelligence, data science, and high-resolution simulations, I/O workloads and hardware on high-performance computing (HPC) systems have become increasingly complex. This complexity can lead to large I/O overheads and overall performance degradation. These inefficiencies are often mitigated using tools and techniques for characterizing, analyzing, and optimizing the I/O behavior of HPC applications. That said, the sheer number of tools and techniques available makes it challenging to identify the best approach. In response, this paper surveys 131 papers from the ACM Digital Library, IEEE Xplore, and other reputable journals to provide a comprehensive analysis, synthesized in the form of a taxonomy, of the current landscape of parallel I/O characterization, analysis, and optimization of large-scale HPC systems. We anticipate that this taxonomy will serve as a valuable resource for enhancing I/O performance of HPC applications.

Hammad Ather, Jean Luca Bez, Chen Wang, Hank Childs, Allen D. Malony, Suren Byna (1/3/2025)

arXiv:2501.01057v1 Announce Type: new Abstract: The growing necessity for enhanced processing capabilities in edge devices with limited resources has led us to develop effective methods for improving high-performance computing (HPC) applications. In this paper, we introduce LASP (Lightweight Autotuning of Scientific Application Parameters), a novel strategy designed to address the parameter search space challenge in edge devices. Our strategy employs a multi-armed bandit (MAB) technique focused on online exploration and exploitation. Notably, LASP takes a dynamic approach, adapting seamlessly to changing environments. We tested LASP with four HPC applications: LULESH, Kripke, CLOMP, and Hypre. Its lightweight nature makes it particularly well-suited for resource-constrained edge devices. By employing the MAB framework to efficiently navigate the search space, we achieved significant performance improvements while adhering to the stringent computational limits of edge devices. Our experimental results demonstrate the effectiveness of LASP in optimizing parameter search on edge devices.

Abrar Hossain, Abdel-Hameed A. Badawy, Mohammad A. Islam, Tapasya Patki, Kishwar Ahmed (1/3/2025)
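
A minimal sketch of the online exploration/exploitation idea using a UCB1 bandit over a small set of parameter configurations. The candidate configurations and the simulated runtime below are made up for illustration; this is not LASP's actual interface.

```python
# UCB1 multi-armed bandit over parameter configurations, in the spirit of LASP.
import math, random

configs = [{"threads": t} for t in (1, 2, 4, 8)]   # arms = parameter settings

def run_app(cfg):
    """Stand-in for executing the HPC app; returns a reward (negative runtime)."""
    return -(1.0 / cfg["threads"] + random.gauss(0, 0.02))

counts = [0] * len(configs)
totals = [0.0] * len(configs)
for step in range(1, 101):
    if 0 in counts:                       # play every arm once first
        arm = counts.index(0)
    else:                                 # then balance exploration and exploitation
        ucb = [totals[i] / counts[i] + math.sqrt(2 * math.log(step) / counts[i])
               for i in range(len(configs))]
        arm = ucb.index(max(ucb))
    totals[arm] += run_app(configs[arm])
    counts[arm] += 1

best = max(range(len(configs)), key=lambda i: totals[i] / counts[i])
print("best config:", configs[best])
```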

arXiv:2408.07843v2 Announce Type: replace Abstract: There is a continuing interest in using standard language constructs for accelerated computing in order to avoid (sometimes vendor-specific) external APIs. For Fortran codes, the do concurrent (DC) loop has been successfully demonstrated on the NVIDIA platform. However, support for DC on other platforms has taken longer to implement. Recently, Intel has added DC GPU offload support to its compiler, as has HPE for AMD GPUs. In this paper, we explore the current portability of using DC across GPU vendors using the in-production solar surface flux evolution code, HipFT. We discuss implementation and compilation details, including when/where using directive APIs for data movement is needed/desired compared to using a unified memory system. The performance achieved on both data center and consumer platforms is shown.

Ronald M. Caplan, Miko M. Stulajter, Jon A. Linker, Jeff Larkin, Henry A. Gabb, Shiquan Su, Ivan Rodriguez, Zachary Tschirhart, Nicholas Malaya (12/25/2024)

arXiv:2411.10958v3 Announce Type: replace Abstract: Although quantization for linear layers has been widely used, its application to accelerate the attention process remains limited. To further enhance the efficiency of attention computation compared to SageAttention while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose to quantize the matrices $(Q, K)$ to INT4 in a hardware-friendly thread-level granularity and quantize the matrices $(\widetilde P, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the accuracy of INT4 $QK$. Third, we propose to use an FP32 Matmul buffer for $PV$ to enhance the accuracy of FP8 $\widetilde PV$. The operations per second (OPS) of SageAttention2 surpass FlashAttention2 and xformers by about 3x and 5x on RTX 4090, respectively. Comprehensive experiments confirm that our approach incurs negligible end-to-end metrics loss across diverse models, including those for large language processing, image generation, and video generation. The codes are available at https://github.com/thu-ml/SageAttention.

Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen (12/25/2024)
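
A simplified numpy sketch of the smoothing idea for INT4 $Q$: subtract the channel-wise mean of $Q$ before quantizing, then add the exactly computed mean term back after the low-precision matmul, using the identity $QK^T = (Q - m)K^T + mK^T$. The shapes and the symmetric 4-bit scheme here are our simplifications, not the paper's thread-level kernel.

```python
# Simplified illustration of smoothed INT4 QK^T. The outlier-reduced (Q - m)
# quantizes to 4 bits with far less error; m K^T is kept in full precision.
import numpy as np

def quant_int4(x):
    scale = np.abs(x).max() / 7.0            # symmetric INT4 range: [-7, 7]
    return np.clip(np.round(x / scale), -7, 7), scale

rng = np.random.default_rng(0)
Q = rng.normal(size=(64, 32)) + 5.0          # large shared offset = outliers
K = rng.normal(size=(64, 32))

m = Q.mean(axis=0, keepdims=True)            # smoothing vector
Qq, sq = quant_int4(Q - m)
Kq, sk = quant_int4(K)
approx = (Qq @ Kq.T) * sq * sk + m @ K.T     # low-precision part + exact mean term

naive_q, naive_s = quant_int4(Q)             # same quantizer, no smoothing
naive = (naive_q @ Kq.T) * naive_s * sk

exact = Q @ K.T
print("error with smoothing:", np.abs(approx - exact).mean())
print("error without      :", np.abs(naive - exact).mean())
```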

arXiv:2407.05805v2 Announce Type: replace-cross Abstract: In this work, we present a tutorial on how to account for the computational time complexity overhead of signal processing in the spectral efficiency (SE) analysis of wireless waveforms. Our methodology is particularly relevant in scenarios where achieving higher SE entails a penalty in complexity, a common trade-off present in 6G candidate waveforms. We consider that SE derives from the data rate, which is impacted by time-dependent overheads. Thus, neglecting the computational complexity overhead in the SE analysis grants an unfair advantage to more computationally complex waveforms, as they require larger computational resources to meet a signal processing runtime below the symbol period. We demonstrate our points with two case studies. In the first, we refer to IEEE 802.11a-compliant baseband processors from the literature to show that their runtime significantly impacts the SE perceived by upper layers. In the second case study, we show that waveforms considered less efficient in terms of SE can outperform their more computationally expensive counterparts if provided with equivalent high-performance computational resources. Based on these cases, we believe our tutorial can address the comparative SE analysis of waveforms that operate under different computational resource constraints.

Saulo Queiroz, João P. Vilela, Benjamin Koon Kei Ng, Chan-Tong Lam, Edmundo Monteiro (12/24/2024)
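
One hedged way to write down the accounting the tutorial argues for (our notation, not necessarily the paper's): let a frame carry $N_b$ information bits and occupy $T_{\mathrm{tx}}$ seconds on the air; if the transceiver also needs $T_{\mathrm{proc}}$ seconds of signal processing per frame that cannot be hidden, the effective spectral efficiency over bandwidth $B$ becomes

```latex
\mathrm{SE}_{\mathrm{eff}} = \frac{N_b}{B\,(T_{\mathrm{tx}} + T_{\mathrm{proc}})}
\qquad\text{vs.}\qquad
\mathrm{SE}_{\mathrm{nominal}} = \frac{N_b}{B\,T_{\mathrm{tx}}}
```

Under this accounting, a computationally heavier waveform only retains its nominal SE advantage if enough computational resources drive $T_{\mathrm{proc}}$ below the symbol period, which is the trade-off both case studies examine.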

arXiv:2412.13203v2 Announce Type: replace Abstract: AI infrastructures, predominantly GPUs, have delivered remarkable performance gains for deep learning. Conversely, scientific computing, exemplified by quantum chemistry systems, suffers from dynamic diversity, where computational patterns are more diverse and vary dynamically, posing a significant challenge to extracting acceleration from GPUs. In this paper, we propose Matryoshka, a novel elastically-parallel technique for the efficient execution of quantum chemistry systems with dynamic diversity on GPUs. Matryoshka capitalizes on Elastic Parallelism Transformation, a property prevalent in scientific systems yet underexplored for dynamic diversity, to elastically realign parallel patterns with the GPU architecture. Structured around three transformation primitives (Permutation, Deconstruction, and Combination), Matryoshka encompasses three core components. The Block Constructor serves as the central orchestrator, which reformulates data structures accommodating dynamic inputs and constructs fine-grained GPU-efficient compute blocks. Within each compute block, the Graph Compiler operates offline, generating high-performance code with a clear computational path through an automated compilation process. The Workload Allocator dynamically schedules workloads with varying operational intensities to threads online. It achieves highly efficient parallelism for compute-intensive operations and facilitates fusion with neighboring memory-intensive operations automatically. Extensive evaluation shows that Matryoshka effectively addresses dynamic diversity, yielding acceleration improvements of up to 13.86x (average 9.41x) over prevailing state-of-the-art approaches on 13 quantum chemistry systems.

Tuowei Wang, Kun Li, Donglin Bai, Fusong Ju, Leo Xia, Ting Cao, Ju Ren, Yaoxue Zhang, Mao Yang (12/24/2024)
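
A toy sketch of the Deconstruction/Combination idea behind the Block Constructor: split oversized tasks into fixed-size tiles and pack undersized ones together so every compute block matches a GPU-friendly granularity. The function name and block size are ours, purely for illustration.

```python
# Toy block construction: deconstruct large workloads into fixed-size tiles and
# combine small ones, so every block has roughly uniform, GPU-friendly size.
BLOCK = 64

def build_blocks(task_sizes, block=BLOCK):
    blocks, open_block, fill = [], [], 0
    for tid, size in enumerate(task_sizes):
        offset = 0
        while size - offset >= block:              # Deconstruction: carve full tiles
            blocks.append([(tid, offset, block)])
            offset += block
        rem = size - offset
        if rem:
            if fill + rem > block:                 # Combination: flush packed block
                blocks.append(open_block)
                open_block, fill = [], 0
            open_block.append((tid, offset, rem))
            fill += rem
    if open_block:
        blocks.append(open_block)
    return blocks                                  # each entry: [(task, offset, len), ...]

# Dynamic, diverse task sizes of the kind quantum chemistry workloads produce.
print(build_blocks([150, 10, 20, 70, 5]))
```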

arXiv:2412.17510v1 Announce Type: new Abstract: Sparse matrices have recently played a significant role in scientific computing, including artificial intelligence-related fields. According to historical studies on sparse matrix-vector multiplication (SpMV), Krylov subspace methods are particularly sensitive to the effects of round-off errors when using floating-point arithmetic. By employing multiple-precision linear computation, convergence can be stabilized by reducing these round-off errors. In this paper, we present the performance of our accelerated SpMV using SIMD instructions, demonstrating its effectiveness through various examples, including Krylov subspace methods.

Tomonori Kouya (12/24/2024)
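
The round-off stabilization the abstract refers to typically rests on error-free transformations. Below is a short sketch of a compensated (double-double style) dot product, the building block a multiple-precision SpMV applies to each matrix row; this is generic textbook numerics (Knuth's TwoSum, Dekker's split, the Ogita-Rump-Oishi Dot2), not the paper's SIMD code.

```python
# Compensated dot product built from error-free transformations: each float
# operation also yields its exact rounding error, accumulated separately.

def two_sum(a, b):                    # Knuth: a + b = s + e exactly
    s = a + b
    t = s - a
    return s, (a - (s - t)) + (b - t)

def two_prod(a, b):                   # Dekker split: a * b = p + e exactly
    p = a * b
    factor = 134217729.0              # 2**27 + 1 splits a double in half
    ah = a * factor; ah = ah - (ah - a); al = a - ah
    bh = b * factor; bh = bh - (bh - b); bl = b - bh
    return p, ((ah * bh - p) + ah * bl + al * bh) + al * bl

def dot2(x, y):                       # compensated dot product (Dot2)
    s, comp = 0.0, 0.0
    for a, b in zip(x, y):
        p, pe = two_prod(a, b)
        s, se = two_sum(s, p)
        comp += pe + se               # running sum of all rounding errors
    return s + comp

x = [1e16, 1.0, -1e16]
y = [1.0, 1.0, 1.0]
# The naive sum loses the 1.0 entirely; the compensated version recovers it.
print(sum(a * b for a, b in zip(x, y)), dot2(x, y))   # 0.0 vs 1.0
```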

arXiv:2412.17793v1 Announce Type: cross Abstract: Singular Spectrum Analysis (SSA) occupies a prominent place in the real signal analysis toolkit alongside Fourier and Wavelet analysis. In addition to the two aforementioned analyses, SSA allows the separation of patterns directly from the data space into the data space, with data that need not be strictly stationary, continuous, or even normally sampled. In most cases, SSA relies on a combination of Hankel or Toeplitz matrices and Singular Value Decomposition (SVD). Like Fourier and Wavelet analysis, SSA has its limitations. The method's main limitations can be summarized in three points. The first is the diagonalization of the Hankel/Toeplitz matrix, which can become a major problem from a memory and/or computational point of view if the time series to be analyzed is very long or heavily sampled. The second point concerns the size of the analysis window, typically denoted as 'L', which will affect the detection of patterns in the time series as well as the dimensions of the Hankel/Toeplitz matrix. Finally, the third point concerns pattern reconstruction: how to easily identify in the eigenvector/eigenvalue space which patterns should be grouped. We propose to address each of these issues by describing a hopefully effective approach that we have been developing for over 10 years and that has yielded good results in our research work.

Fernando Lopes, Dominique Gibert, Vincent Courtillot, Jean-Louis Le Mouël, Jean-Baptiste Boulé (12/24/2024)
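
For reference, the standard SSA pipeline the abstract builds on, as a compact numpy sketch (embedding into a Hankel trajectory matrix, SVD, grouping, diagonal averaging); the window length and grouping below are arbitrary choices for illustration, not the authors' method.

```python
# Basic SSA: embed -> decompose (SVD) -> group -> reconstruct by diagonal averaging.
import numpy as np

def ssa(x, L, components):
    N = len(x)
    K = N - L + 1
    X = np.column_stack([x[i:i + L] for i in range(K)])   # L x K Hankel trajectory matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xg = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in components)  # grouped component
    rec = np.zeros(N)          # diagonal (Hankel) averaging back to a series
    counts = np.zeros(N)
    for j in range(K):
        rec[j:j + L] += Xg[:, j]
        counts[j:j + L] += 1
    return rec / counts

t = np.arange(400)
x = np.sin(2 * np.pi * t / 50) + 0.3 * np.random.default_rng(1).normal(size=400)
cycle = ssa(x, L=100, components=[0, 1])   # the leading eigenpair captures the sine
```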

arXiv:2412.16187v1 Announce Type: new Abstract: Transformer-based large language models (LLMs) use the key-value (KV) cache to significantly accelerate inference by storing the key and value embeddings of past tokens. However, this cache consumes significant GPU memory. In this work, we introduce LSH-E, an algorithm that uses locality-sensitive hashing (LSH) to compress the KV cache. LSH-E quickly locates tokens in the cache that are cosine dissimilar to the current query token. This is achieved by computing the Hamming distance between binarized Gaussian projections of the current token query and cached token keys, with a projection length much smaller than the embedding dimension. We maintain a lightweight binary structure in GPU memory to facilitate these calculations. Unlike existing compression strategies that compute attention to determine token retention, LSH-E makes these decisions pre-attention, thereby reducing computational costs. Additionally, LSH-E is dynamic: at every decoding step, the key and value of the current token replace the embeddings of a token expected to produce the lowest attention score. We demonstrate that LSH-E can compress the KV cache by 30%-70% while maintaining high performance across reasoning, multiple-choice, long-context retrieval, and summarization tasks.

Minghui Liu, Tahseen Rabbani, Tony O'Halloran, Ananth Sankaralingam, Mary-Anne Hartley, Brian Gravelle, Furong Huang, Cornelia Fermüller, Yiannis Aloimonos (12/24/2024)
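
A small numpy sketch of the LSH scoring step described above: binarize Gaussian projections of the query and the cached keys, then evict the cached token with the largest Hamming distance (the one least likely to earn attention). The cache layout and eviction rule here are simplified relative to the paper.

```python
# LSH-based eviction sketch: Hamming distance between sign-bit projections
# approximates cosine dissimilarity, so the farthest cached key can be evicted
# before attention is ever computed.
import numpy as np

rng = np.random.default_rng(0)
d, n_bits, cache_size = 128, 16, 512           # n_bits << d, as in the abstract
P = rng.normal(size=(d, n_bits))               # shared Gaussian projection

def sketch(v):
    return (v @ P) > 0                          # binary signature of length n_bits

keys = rng.normal(size=(cache_size, d))         # cached token keys
key_sigs = sketch(keys)                         # lightweight binary structure

def evict_index(query):
    hamming = (sketch(query) != key_sigs).sum(axis=1)
    return int(hamming.argmax())                # most cosine-dissimilar token

q = rng.normal(size=d)
print("evict cached token:", evict_index(q))
```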

arXiv:2412.16638v1 Announce Type: new Abstract: This work focuses on the numerical study of a recently published class of Runge-Kutta methods designed for mixed-precision arithmetic. We employ the methods in solving partial differential equations on modern hardware. In particular, we investigate what speedups are achievable by the use of mixed precision and how they depend on the methods' algorithmic compatibility with the computational hardware. We use state-of-the-art software, utilizing the Ginkgo library, which is designed to incorporate mixed-precision arithmetic, and perform numerical tests of 3D problems on both GPU and CPU architectures. We show that significant speedups can be achieved, but that performance depends on solver parameters and the performance of the software kernels.

Ivo Dravins, Marcel Koch, Victoria Griehl, Katharina Kormann (12/24/2024)
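
A toy illustration of the mixed-precision pattern such schemes exploit: evaluate the expensive stage computations in float32 while carrying the solution update in float64. This is a plain RK4 step on a scalar ODE, not one of the specially designed methods from the paper or a Ginkgo kernel.

```python
# Mixed-precision RK4 sketch: stage evaluations in float32 (cheap), solution
# accumulation in float64 (accurate).
import numpy as np

def f(t, y):
    return np.float32(-y)                     # dy/dt = -y, evaluated in low precision

def rk4_mixed(y0, t0, t1, n):
    h = (t1 - t0) / n
    y = np.float64(y0)                        # high-precision solution carrier
    for i in range(n):
        t = t0 + i * h
        y32, h32 = np.float32(y), np.float32(h)
        k1 = f(t, y32)
        k2 = f(t + h / 2, y32 + h32 * k1 / 2)
        k3 = f(t + h / 2, y32 + h32 * k2 / 2)
        k4 = f(t + h, y32 + h32 * k3)
        y += np.float64(h) * (np.float64(k1) + 2 * np.float64(k2)
                              + 2 * np.float64(k3) + np.float64(k4)) / 6
    return y

print(rk4_mixed(1.0, 0.0, 1.0, 100), np.exp(-1.0))   # close to e^-1
```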

arXiv:2412.16716v1 Announce Type: new Abstract: Modern network architectures have shaped market segments, governments, and communities with intelligent and pervasive applications. Ongoing digital transformation through technologies such as softwarization, network slicing, and AI drives this evolution, along with research into Beyond 5G (B5G) and 6G architectures. Network slices require seamless management, observability, and intelligent-native resource allocation, considering user satisfaction, cost efficiency, security, and energy. Slicing orchestration architectures have been extensively studied to accommodate these requirements, particularly in resource allocation for network slices. This study explored the observability of resource allocation regarding network slice performance in two nationwide testbeds. We examined their allocation effects on slicing connectivity latency using a partial factorial experimental method with Central Processing Unit (CPU) and memory combinations. The results reveal different resource impacts across the testbeds, indicating a non-uniform influence on the CPU and memory within the same network slice.

Rodrigo Moreira, Larissa F. Rodrigues Moreira, Tereza C. Carvalho, Flávio de Oliveira Silva (12/24/2024)

arXiv:2412.16888v1 Announce Type: new Abstract: Modern software systems are often highly configurable to tailor varied requirements from diverse stakeholders. Understanding the mapping between configurations and the desired performance attributes plays a fundamental role in advancing the controllability and tuning of the underlying system, yet has long been a dark hole of knowledge due to its black-box nature. While there have been previous efforts in performance analysis for these systems, they analyze the configurations as isolated data points without considering their inherent spatial relationships. This renders them incapable of interrogating many important aspects of the configuration space like local optima. In this work, we advocate a novel perspective to rethink performance analysis -- modeling the configuration space as a structured "landscape". To support this proposition, we designed an open-source, graph data mining empowered fitness landscape analysis (FLA) framework. By applying this framework to 86M benchmarked configurations from 32 running workloads of 3 real-world systems, we arrived at 6 main findings, which together constitute a holistic picture of the landscape topography, with thorough discussions about their implications on both configuration tuning and performance modeling.

Mingyu Huang, Peili Mao, Ke Li (12/24/2024)
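
The "structured landscape" view has a concrete core: configurations become nodes, one-parameter changes become edges, and features like local optima fall out of graph queries. A toy sketch of counting local optima this way (our construction and toy performance function, not the framework's API):

```python
# Fitness landscape sketch: configurations are nodes, neighbors differ in one
# parameter value, and a local optimum is at least as good as all its neighbors.
from itertools import product

params = {"cache_mb": [64, 128, 256], "workers": [1, 2, 4], "batch": [8, 16]}
names = list(params)

def perf(cfg):                      # stand-in for a benchmarked performance value
    return -abs(cfg["cache_mb"] - 128) - 10 * abs(cfg["workers"] - 2) - cfg["batch"]

space = [dict(zip(names, vals)) for vals in product(*params.values())]

def neighbors(cfg):
    for name in names:
        for v in params[name]:
            if v != cfg[name]:
                yield {**cfg, name: v}

local_optima = [cfg for cfg in space
                if all(perf(cfg) >= perf(nb) for nb in neighbors(cfg))]
print(len(local_optima), "local optima out of", len(space))
```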

arXiv:2412.10856v2 Announce Type: replace Abstract: To deploy LLMs on resource-constrained platforms such as mobile robotics and wearables, non-transformer LLMs have achieved major breakthroughs. Recently, a novel RNN-based LLM family, Receptance Weighted Key Value (RWKV) models, have shown promising results in text generation on resource-constrained devices thanks to their computational efficiency. However, these models remain too large to be deployed on embedded devices due to their high parameter count. In this paper, we propose an efficient suite of compression techniques, tailored to the RWKV architecture. These techniques include low-rank approximation, sparsity predictors, and clustering head, designed to align with the model size. Our methods compress the RWKV models by 3.8x to 4.95x with only a 2.95 percentage point loss in accuracy.

Wonkyo Choe, Yangfeng Ji, Felix Xiaozhu Lin (12/23/2024)
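
Of the three techniques, low-rank approximation is the simplest to show. Below is a generic truncated-SVD factorization of a weight matrix (standard numerics, not the paper's RWKV-specific scheme): replacing an m x n matrix with an m x r and an r x n factor cuts parameters from m*n to r*(m + n).

```python
# Low-rank weight compression: replace W with A @ B, where A is m x r and
# B is r x n, cutting parameters from m*n to r*(m + n) for small r.
import numpy as np

def low_rank(W, r):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]            # absorb singular values into the left factor
    B = Vt[:r]
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 48)) @ rng.normal(size=(48, 512))   # a rank-48 "weight"
A, B = low_rank(W, r=64)
print("params:", W.size, "->", A.size + B.size)               # 262144 -> 65536
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```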

arXiv:2412.16001v1 Announce Type: new Abstract: Important memory-bound kernels, such as linear algebra, convolutions, and stencils, rely on SIMD instructions as well as optimizations targeting improved vectorized data traversal and data re-use to attain satisfactory performance. On contemporary CPU architectures, the hardware prefetcher is of key importance for efficient utilization of the memory hierarchy. In this paper, we demonstrate that transforming a memory access pattern consisting of a single stride to one that concurrently accesses multiple strides can boost the utilization of the hardware prefetcher, and in turn improve the performance of memory-bound kernels significantly. Using a set of micro-benchmarks, we establish that accessing memory in a multi-strided manner enables more cache lines to be concurrently brought into the cache, resulting in improved cache hit ratios and higher effective memory bandwidth without the introduction of costly software prefetch instructions. Subsequently, we show that multi-strided variants of a collection of six memory-bound dense compute kernels outperform state-of-the-art counterparts on three different micro-architectures. More specifically, for kernels including matrix-vector multiplication, convolution and stencil kernels, and kernels from PolyBench, we achieve significant speedups of up to 12.55x over Polly, 2.99x over MKL, 1.98x over OpenBLAS, 1.08x over Halide and 1.87x over OpenCV. The code transformation to take advantage of multi-strided memory access is a natural extension of the loop unroll and loop interchange techniques, allowing this method to be incorporated into compiler pipelines in the future.

Miguel O. Blom, Kristian F. D. Rietveld, Rob V. van Nieuwpoort (12/23/2024)
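
What the transformation does to the access pattern can be shown in a few lines: instead of one pointer walking the array front to back, k pointers walk k interleaved regions concurrently, giving the hardware prefetcher k independent streams to track. A shape-only Python sketch of the pattern (the performance effect itself only materializes in compiled code on real hardware):

```python
# Single-stride vs. multi-strided traversal of the same data. Both compute the
# same reduction; the multi-strided version touches k regions concurrently,
# which on real hardware lets the prefetcher sustain k independent streams.
import numpy as np

def sum_single_stride(a):
    total = 0.0
    for i in range(len(a)):          # one stream: a[0], a[1], a[2], ...
        total += a[i]
    return total

def sum_multi_strided(a, k=4):
    n = len(a) // k                  # assume k divides len(a) for brevity
    totals = [0.0] * k
    for i in range(n):               # k streams: a[0], a[n], a[2n], ... each step
        for j in range(k):
            totals[j] += a[j * n + i]
    return sum(totals)

a = np.arange(1_000_000, dtype=np.float64)
assert sum_single_stride(a) == sum_multi_strided(a)   # exact: integer-valued sums
```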