cs.AR


arXiv:2501.01259v1 Announce Type: new Abstract: In the field of digital signal processing, the fast Fourier transform (FFT) is a fundamental algorithm. Its processors are implemented using either the pipelined architecture, well suited to high-throughput applications but weak in hardware utilization, or the memory-based architecture, designed for area-constrained scenarios but unable to meet stringent throughput requirements. In this paper, we propose an adaptive hybrid FFT processor that combines the advantages of both architectures, with the following features. First, a set of radix-$2^k$ multi-path delay commutator (MDC) units is developed to support high-performance large-size processing. Second, a conflict-free memory access scheme is formulated to ensure a continuous data flow without data contention. Third, we demonstrate the existence of a series of bit-dimension permutations for reordering input data, satisfying the generalized constraints of variable length, high radix, and any level of parallelism for wide adaptivity. Furthermore, the proposed FFT processor has been implemented on a field-programmable gate array (FPGA). As a result, the proposed work outperforms conventional memory-based FFT processors by requiring fewer computation cycles, and it achieves higher hardware utilization than pipelined FFT architectures, making it suitable for highly demanding applications.

Fangyu Zhao, Chunhua Xiao, Zhiguo Wang, Xiaohua Du, Bo Dong (1/3/2025)
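
The bit-dimension permutations in the abstract above generalize the classic digit-reversal reordering used by fixed-radix FFTs. As a rough point of reference only (not the paper's generalized scheme, which also covers variable length, high radix, and arbitrary parallelism), here is a minimal digit-reversal sketch in Python:

```python
import numpy as np

def digit_reverse_indices(n, radix):
    """Digit-reversal permutation for a length-n, radix-`radix` FFT.

    n must be a power of the radix. Reordering the input with this
    permutation lets an in-place decimation-in-time radix-r FFT
    produce naturally ordered outputs.
    """
    digits, m = 0, n
    while m > 1:
        m //= radix
        digits += 1
    rev = np.empty(n, dtype=int)
    for i in range(n):
        x, r = i, 0
        for _ in range(digits):
            r = r * radix + (x % radix)  # append the least-significant digit
            x //= radix
        rev[i] = r
    return rev

# Example: reorder a 16-point input for a radix-4 FFT
perm = digit_reverse_indices(16, 4)
x = np.arange(16, dtype=complex)
x_reordered = x[perm]
```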

arXiv:2501.00011v1 Announce Type: cross Abstract: Atmospheric models demand substantial computational power, and solving the chemical processes is one of their most computationally intensive components. This work shows how to improve the computational performance of the Multiscale Online Nonhydrostatic AtmospheRe CHemistry model (MONARCH), a chemical weather prediction system developed by the Barcelona Supercomputing Center. The model implements the new flexible external package Chemistry Across Multiple Phases (CAMP) for solving gas- and aerosol-phase chemical processes, which allows multiple chemical processes to be solved simultaneously as a single system. We introduce a novel strategy to simultaneously solve multiple instances of a chemical mechanism, represented in the model as grid cells, obtaining a speedup of up to 9x when using thousands of cells. In addition, we present a GPU strategy for the most time-consuming function of CAMP. The GPU version achieves up to a 1.2x speedup compared to the CPU, and optimizing its memory accesses increases the speedup to 1.7x.

Christian Guzman Ruiz, Matthew Dawson, Mario C. Acosta, Oriol Jorba, Eduardo Cesar Galobardes, Carlos Pérez García-Pando, Kim Serradell (1/3/2025)
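
The idea of solving many grid cells as one system can be pictured as batching identical per-cell ODEs into a single array computation. A toy sketch (a one-reaction mechanism advanced with explicit Euler, not CAMP's actual solver):

```python
import numpy as np

# Toy sketch: integrate the same simple mechanism (A -> B with rate k)
# for many grid cells at once by stacking the per-cell states into one
# array and advancing them together in a single vectorized step.
def step_batched(conc, k, dt):
    """conc has shape (n_cells, 2): columns are [A, B]."""
    rate = k * conc[:, 0]                 # per-cell reaction rate
    d = np.stack([-rate, rate], axis=1)   # dA/dt and dB/dt for every cell
    return conc + dt * d                  # one explicit-Euler step for all cells

n_cells = 10000
conc = np.tile([1.0, 0.0], (n_cells, 1))
for _ in range(100):
    conc = step_batched(conc, k=0.3, dt=0.01)
```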

arXiv:2501.00331v1 Announce Type: cross Abstract: Demonstrating small error rates by integrating quantum error correction (QEC) into an architecture of quantum computing is the next milestone towards scalable fault-tolerant quantum computing (FTQC). Encoding logical qubits with superconducting qubits and surface codes is considered a promising candidate for FTQC architectures. In this paper, we propose an FTQC architecture, which we call Q3DE, that enhances the tolerance to multi-bit burst errors (MBBEs) caused by cosmic rays with moderate changes and overhead. There are three core components in Q3DE: in-situ anomaly DEtection, dynamic code DEformation, and optimized error DEcoding. In this architecture, MBBEs are detected solely from the syndrome values used for error correction. The effect of MBBEs is immediately mitigated by dynamically increasing the encoding level of logical qubits and re-estimating the probable recovery operation with a rollback of the decoding process. We investigate the performance and overhead of the Q3DE architecture with quantum-error simulators and demonstrate that Q3DE effectively reduces the period of MBBEs by 1000 times and halves the size of their region. Q3DE therefore significantly relaxes the requirements on qubit density and qubit chip size for realizing FTQC. Our scheme is versatile for mitigating MBBEs, i.e., temporal variations of error properties, on a wide range of physical devices and FTQC architectures, since it relies only on the standard features of topological stabilizer codes.

Yasunari Suzuki, Takanori Sugiyama, Tomochika Arai, Wang Liao, Koji Inoue, Teruo Tanimoto (1/3/2025)
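
Detecting MBBEs "only from syndrome values" can be illustrated with a toy threshold check: a cosmic-ray burst fires far more syndrome bits in one round than the baseline physical error rate would. The sketch below only illustrates that signal, with made-up thresholds; it is not the paper's detection scheme:

```python
def detect_bursts(syndrome_rounds, baseline=0.02, factor=10.0):
    """Flag rounds whose fraction of triggered syndrome bits far exceeds
    the baseline expected from normal physical errors (hypothetical
    thresholds, purely for illustration)."""
    return [sum(bits) / len(bits) > factor * baseline for bits in syndrome_rounds]

# A quiet round (2% of syndrome bits fire) followed by a burst-like round (40%)
rounds = [[1] * 2 + [0] * 98, [1] * 40 + [0] * 60]
print(detect_bursts(rounds))  # [False, True]
```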

arXiv:2501.00482v1 Announce Type: cross Abstract: Some applications require electronic systems to operate at extremely high temperatures. Extending the operating temperature range of automotive-grade CMOS processes -- through the use of dedicated design techniques -- can provide an important cost-effective advantage. We present a second-order discrete-time delta-sigma analog-to-digital converter operating at temperatures of up to 250 $^\circ$C, well beyond the 175 $^\circ$C qualification temperature of the automotive-grade CMOS process used for its fabrication (XFAB XT018). The converter incorporates design techniques that mitigate the adverse effects of high temperature, such as increased leakage currents and electromigration: dummy-transistor configurations for leakage compensation, clock-boosting methods to limit pass-gate cross-talk, and a circuit architecture optimized to ensure stability and accuracy at high temperature. Comprehensive measurements demonstrate that the converter achieves a signal-to-noise ratio exceeding 93 dB at 250 $^\circ$C, with an effective number of bits of 12 and a power consumption of only 44 mW. The die area of the converter is 0.065 mm$^2$, and the area overhead of the high-temperature mitigation circuits is only 13.7%. The Schreier figure of merit is 140 dB at the maximum temperature of 250 $^\circ$C, proving the potential of the circuit for reliable operation in challenging applications such as gas and oil extraction and aeronautics.

Christian Sbrana, Alessandro Catania, Tommaso Toschi, Sebastiano Strangio, Giuseppe Iannaccone (1/3/2025)
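
For readers unfamiliar with the architecture, a second-order discrete-time delta-sigma modulator can be captured in a few lines of behavioral Python. This is a generic textbook topology with assumed 0.5 integrator gains, not the fabricated circuit:

```python
import numpy as np

def dsm2(u):
    """Generic second-order discrete-time delta-sigma modulator model:
    two integrators with 0.5 gains and a 1-bit quantizer in feedback."""
    i1 = i2 = 0.0
    v = np.empty_like(u)
    for n, x in enumerate(u):
        y = 1.0 if i2 >= 0.0 else -1.0   # 1-bit quantizer decision
        i1 += 0.5 * (x - y)              # first integrator
        i2 += 0.5 * (i1 - y)             # second integrator
        v[n] = y
    return v

t = np.arange(1 << 16)
u = 0.5 * np.sin(2 * np.pi * t / 1024.0)   # slow input tone, half full scale
bitstream = dsm2(u)
# Low-pass filtering and decimating the bitstream recovers the tone, with
# quantization noise shaped out of the signal band.
```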

arXiv:2402.19376v2 Announce Type: replace Abstract: General Matrix Multiply (GEMM) units, consisting of multiply-accumulate (MAC) arrays, perform the bulk of the computation in deep learning (DL). Recent work proposed a novel MAC design, Bit-Pragmatic (PRA), capable of dynamically exploiting bit sparsity. This work presents OzMAC (Omit-zero-MAC), a modified re-implementation of PRA that extends beyond earlier works by performing rigorous post-synthesis evaluation against a binary MAC design across multiple bitwidths and clock frequencies in the TSMC N5 process node, to assess its commercial implementation potential. We demonstrate the existence of high bit sparsity in eight pretrained INT8 DL workloads and show that 8-bit OzMAC significantly improves all three metrics of area, power, and energy, by 21%, 70%, and 28%, respectively. Similar improvements are achieved when scaling data precisions (4, 8, 16 bits) and clock frequencies (0.5 GHz, 1 GHz, 1.5 GHz). When the frequency of the 8-bit OzMAC is scaled to normalize throughput, it still achieves a 30% improvement in both power and energy.

Harideep Nair, Prabhu Vellaisamy, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen (1/3/2025)
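
The omit-zero idea behind OzMAC (inherited from Bit-Pragmatic) is easy to state in software: serially process only the set bits of one operand, so the number of shift-and-add cycles tracks bit sparsity rather than bitwidth. A functional sketch of that behavior (not the hardware pipeline):

```python
def ozmac(acc, weight, activation, bits=8):
    """Accumulate activation*weight by processing only the set bits of
    `weight`: one shift-and-add per nonzero bit, zero bits are skipped."""
    for b in range(bits):
        if (weight >> b) & 1:          # skip zero bits entirely
            acc += activation << b     # one shift-and-add per set bit
    return acc

# 8-bit example: weight 0b00010010 (= 18) has only two set bits,
# so the multiply needs just two add cycles.
print(ozmac(0, 0b00010010, 7))  # 7 * 18 = 126
```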

arXiv:2501.00032v1 Announce Type: new Abstract: Large language models (LLMs) have transformed the way we think about language understanding and generation, enthralling both researchers and developers. However, deploying LLMs for inference has been a significant challenge due to their unprecedented size and resource requirements. While quantizing model weights to sub-byte precision has emerged as a promising solution to ease memory pressure, the group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. As a result, a higher proportion of compute instructions do not perform multiplies, i.e., real work, rendering them unsuitable for meeting the latency requirements of LLMs deployed on commodity CPUs. In this work, we propose a set of highly optimized kernels to accelerate LLM inference and unleash the full potential of CPUs, particularly Arm CPUs. These kernels amortize the cost of loading the operands and of unpacking the weights across multiple output rows. Together with an optimized interleaved group data layout for weights and decompression-path optimizations that reduce unnecessary operations and dequantization overhead while maximizing the use of vector and matrix multiply operations, this significantly improves the efficiency of MAC operations. Furthermore, we present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs that better matches the non-uniform patterns in their weight distributions, demonstrating better throughput during token generation while ensuring better quality than the state-of-the-art. Applying these improvements to 4-bit LLMs results in a 3-3.2x improvement in prompt processing and a 2x improvement in autoregressive decoding on Arm CPUs, compared to a LLaMA.cpp-based solution. The optimized kernels are available at https://github.com/ggerganov/llama.cpp.

Dibakar Gope, David Mansell, Danny Loh, Ian Bratt (1/3/2025)
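
As a small illustration of groupwise non-uniform codebook quantization (the general idea only; the interleaved layouts and Arm kernels from the abstract are not modeled here), each group of weights is mapped to indices into a small per-group codebook. The codebook below is simply the group's own quantiles, an assumption made for illustration:

```python
import numpy as np

def quantize_group(w, codebook):
    """Map each weight in a group to the index of its nearest codebook entry."""
    idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8)

def dequantize_group(idx, codebook):
    """Recover approximate weights by codebook lookup."""
    return codebook[idx]

rng = np.random.default_rng(0)
group = rng.normal(scale=0.02, size=32).astype(np.float32)   # one group of 32 weights
# A 16-entry (4-bit) non-uniform codebook fitted to the weight distribution;
# here simply the group's quantiles, purely for illustration.
codebook = np.quantile(group, np.linspace(0, 1, 16)).astype(np.float32)
idx = quantize_group(group, codebook)
w_hat = dequantize_group(idx, codebook)
print(np.abs(group - w_hat).max())   # worst-case reconstruction error
```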

arXiv:2501.00210v1 Announce Type: new Abstract: With the rise of AI, NVIDIA GPUs have become the de facto standard for AI system design. This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs for AI model serving. First, we create a suite of microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing that Gaudi-2 achieves competitive performance not only in primitive AI compute, memory, and communication operations but also in executing several important AI workloads end-to-end. We then assess the Gaudi NPU's programmability by discussing several software-level optimization strategies for implementing critical FBGEMM operators and vLLM, and by evaluating their efficiency against GPU-optimized counterparts. Results indicate that Gaudi-2 achieves energy efficiency comparable to A100, though there are notable areas for improvement in terms of software maturity. Overall, we conclude that, with effective integration into high-level AI frameworks, Gaudi NPUs could challenge NVIDIA GPUs' dominance in the AI server market, though further improvements are necessary to fully compete with NVIDIA's robust software ecosystem.

Yunjae Lee, Juntaek Lim, Jehyeon Bang, Eunyeong Cho, Huijong Jeong, Taesu Kim, Hyungjun Kim, Joonhyung Lee, Jinseop Im, Ranggi Hwang, Se Jung Kwon, Dongsoo Lee, Minsoo Rhu (1/3/2025)

arXiv:2501.00642v1 Announce Type: new Abstract: Agents based on Large Language Models (LLMs) are transforming the programming language landscape by facilitating learning for beginners, enabling code generation, and optimizing documentation workflows. Hardware Description Languages (HDLs), with their smaller user community, stand to benefit significantly from LLMs as tools for learning new HDLs. This paper investigates the challenges and solutions of enabling LLMs for HDLs, particularly for HDLs on which LLMs have not been previously trained. This work introduces HDLAgent, an AI agent optimized for LLMs with limited knowledge of various HDLs. It significantly enhances off-the-shelf LLMs.

Mark Zakharov, Farzaneh Rabiei Kashanaki, Jose Renau (1/3/2025)

arXiv:2501.00742v1 Announce Type: new Abstract: Partial differential equations (PDEs) are an important mathematical tool in science and engineering. This paper experimentally demonstrates an optical neural PDE solver by leveraging back-propagation-free, on-photonic-chip training of physics-informed neural networks.

Yequan Zhao, Xian Xiao, Antoine Descos, Yuan Yuan, Xinling Yu, Geza Kurczveil, Marco Fiorentino, Zheng Zhang, Raymond G. Beausoleil (1/3/2025)

arXiv:2501.00883v1 Announce Type: new Abstract: The dispersed node locations and complex topologies of edge networks, combined with intricate dynamic microservice dependencies, render traditional centralized microservice architectures (MSAs) unsuitable. In this paper, we propose a decentralized microservice architecture (DMSA), which delegates scheduling functions from the control plane to edge nodes. DMSA redesigns and implements three core modules of microservice discovery, monitoring, and scheduling for edge networks to achieve precise awareness of instance deployments, low monitoring overhead and measurement error, and accurate dynamic scheduling, respectively. In particular, DMSA customizes a microservice scheduling scheme that leverages multi-port listening and zero-copy forwarding to guarantee high data forwarding efficiency. Moreover, a dynamic weighted multi-level load balancing algorithm is proposed to adjust scheduling dynamically with consideration of reliability, priority, and response delay. Finally, we have implemented a physical verification platform for DMSA. Extensive empirical results demonstrate that, compared to state-of-the-art and traditional scheduling schemes, DMSA effectively counteracts link failures and network fluctuations, improving service response delay and execution success rate by approximately $60\% \sim 75\%$ and $10\% \sim 15\%$, respectively.

Yuang Chen, Chengdi Lu, Yongsheng Huang, Chang Wu, Fengqian Guo, Hancheng Lu, Chang Wen Chen (1/3/2025)
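
The dynamic weighted load balancing can be pictured as scoring candidate instances along the three factors the abstract names. A toy scoring sketch with made-up weights and fields (not DMSA's actual algorithm):

```python
# Toy sketch: score candidate microservice instances by combining
# reliability, priority, and measured response delay (all assumed to be
# pre-normalized to [0, 1]), then route the next request to the best one.
def score(inst, w_rel=0.4, w_pri=0.3, w_delay=0.3):
    # Lower delay is better, so it enters the score inverted.
    return (w_rel * inst["reliability"]
            + w_pri * inst["priority"]
            + w_delay * (1.0 - inst["delay"]))

instances = [
    {"name": "edge-a", "reliability": 0.99, "priority": 0.5, "delay": 0.20},
    {"name": "edge-b", "reliability": 0.90, "priority": 0.8, "delay": 0.05},
]
best = max(instances, key=score)
print(best["name"])   # the instance that receives the next request
```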

arXiv:2501.00921v1 Announce Type: new Abstract: In current chip design processes, using multiple tools to obtain a gate-level netlist often results in the loss of source-code correlation. SynAlign addresses this challenge by automating the alignment process, simplifying iterative design, reducing overhead, and maintaining correlation across various tools, thereby enhancing the efficiency and effectiveness of chip design workflows. Improving characteristics such as frequency through iterative design is essential for enhancing accelerators and chip designs. While synthesis tools produce netlists with critical-path information, designers often lack the tools to trace these netlist cells back to their original source code. Mapping netlist components to source code provides early feedback on timing and power for frontend designers. SynAlign automatically aligns post-optimized netlists with the original source code without altering compilers or synthesis processes. Its alignment strategy relies on the design structure remaining consistent throughout the chip design cycle, even as the compiler flow changes. This consistency allows engineers to maintain the correlation between modified designs and the original source code across various tools. Remarkably, SynAlign can tolerate up to 61\% of design net changes without impacting alignment accuracy.

Sakshi Garg, Jose Renau (1/3/2025)

arXiv:2501.01147v1 Announce Type: new Abstract: This project focuses on the design and implementation of an AHB to APB Bridge for efficient communication in System-on-Chip (SoC) architectures. The Advanced High-performance Bus (AHB) is used for high-speed operations, typically connecting processors and memory, while the Advanced Peripheral Bus (APB) is optimized for low-power, low-speed peripheral devices. The AHB to APB Bridge serves as an interface that converts complex, high-speed AHB transactions into simpler, single-cycle APB transactions, enabling seamless data transfer between fast components and slower peripherals. The bridge manages clock domain synchronization, transaction conversion, and flow control, ensuring compatibility between AHB's burst transfers and APB's non-pipelined protocol. Implemented in Verilog and simulated on FPGA using Xilinx Vivado, this bridge design provides a robust solution for integrating high-performance and low-power components within a single SoC. This project also evaluates the bridge's functionality and performance through testbenches covering various operational scenarios, validating its efficiency in handling diverse system requirements.

Gopi Chand Ananthu, Riadul Islam (1/3/2025)
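
The essence of the transaction conversion is that every AHB beat must be serialized into APB's setup and access phases. A toy Python behavioral sketch of that expansion (the project itself is in Verilog; the values here are illustrative, only the PSEL/PENABLE/PADDR/PWDATA signal names come from the APB protocol):

```python
# Toy behavioral model: expand a pipelined AHB write burst into APB's
# non-pipelined two-phase handshake (SETUP with PENABLE low, then ACCESS
# with PENABLE high), one pair of cycles per AHB beat.
def ahb_burst_to_apb(burst):
    apb_cycles = []
    for addr, wdata in burst:
        apb_cycles.append({"phase": "SETUP",  "PSEL": 1, "PENABLE": 0, "PADDR": addr})
        apb_cycles.append({"phase": "ACCESS", "PSEL": 1, "PENABLE": 1, "PADDR": addr,
                           "PWDATA": wdata})
    return apb_cycles

for cycle in ahb_burst_to_apb([(0x40000000, 0xA5), (0x40000004, 0x5A)]):
    print(cycle)
```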

arXiv:2412.17955v1 Announce Type: new Abstract: General Matrix Multiplication (GEMM) is a ubiquitous compute kernel in deep learning (DL). To support energy-efficient edge-native processing, new GEMM hardware units have been proposed that operate on unary encoded bitstreams using much simpler hardware. Most unary approaches thus far focus on rate-based unary encoding of values and perform stochastic approximate computation. This work presents tubGEMM, a novel matrix-multiply unit design that employs hybrid temporal-unary and binary (tub) encoding and performs exact (not approximate) GEMM. It intrinsically exploits dynamic value sparsity to improve energy efficiency. Compared to uGEMM, the current best unary design, tubGEMM significantly reduces area, power, and energy by 89\%, 87\%, and 50\%, respectively. A tubGEMM design performing a 128x128 matrix multiply on 8-bit integers, in the commercial TSMC N5 (5 nm) process node, consumes just 0.22 mm^2 of die area, 417.72 mW of power, and 8.86 uJ of energy, assuming no sparsity. Typical sparsity in DL workloads (MobileNetv2, ResNet-50) reduces energy by more than 3x, and lowering precision to 4 and 2 bits further reduces it by 24x and 104x, respectively.

Prabhu Vellaisamy, Harideep Nair, Joseph Finn, Manav Trivedi, Albert Chen, Anna Li, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen (12/25/2024)
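
The hybrid temporal-unary/binary multiply at the heart of tubGEMM can be emulated in a couple of lines: one operand is unrolled as pulses in time and the binary operand is accumulated on every pulse, which is exact and naturally exploits value sparsity. A functional sketch of the encoding only, not the hardware design:

```python
def tub_multiply(a, b):
    """Hybrid temporal-unary x binary multiply: operand `a` is unrolled as
    `a` pulses in time, and the binary operand `b` is accumulated once per
    pulse. The result is exact, and the number of active cycles scales
    with the value of `a` (value sparsity)."""
    acc = 0
    for _ in range(a):   # temporal-unary operand: a pulses in time
        acc += b         # binary operand accumulated on every pulse
    return acc

print(tub_multiply(5, 13))  # 65; a zero operand consumes zero active cycles
```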

arXiv:2412.17966v1 Announce Type: new Abstract: General matrix multiplication (GEMM) is a ubiquitous computing kernel/algorithm for data processing in diverse applications, including artificial intelligence (AI) and deep learning (DL). The recent shift towards edge computing has inspired GEMM architectures based on unary computing, which are predominantly stochastic and rate-coded systems. This paper proposes a novel GEMM architecture based on temporal coding, called tuGEMM, that performs exact computation. We introduce two variants of tuGEMM, serial and parallel, with distinct area/power-latency trade-offs. Post-synthesis power-performance-area (PPA) results in 45 nm CMOS are reported for 2-bit, 4-bit, and 8-bit computations. The designs illustrate significant advantages in area-power efficiency over state-of-the-art stochastic unary systems, especially at low precisions, e.g., incurring just 0.03 mm^2 and 9 mW for 4 bits, and 0.01 mm^2 and 4 mW for 2 bits. This makes tuGEMM ideal for power-constrained mobile and edge devices performing always-on real-time sensory processing.

Harideep Nair, Prabhu Vellaisamy, Albert Chen, Joseph Finn, Anna Li, Manav Trivedi, John Paul Shen (12/25/2024)

arXiv:2412.17977v1 Announce Type: new Abstract: Temporal Neural Networks (TNNs), a special class of spiking neural networks, draw inspiration from the neocortex in utilizing spike timings for information processing. Recent works proposed a microarchitecture framework and custom macro suite for designing highly energy-efficient application-specific TNNs. These works rely on manual hardware design, a labor-intensive and time-consuming process; further, there is no open-source functional simulation framework for TNNs. This paper introduces TNNGen, a pioneering effort towards the automated design of TNNs from PyTorch software models to post-layout netlists. TNNGen comprises a novel PyTorch functional simulator (for TNN modeling and application exploration) coupled with a Python-based hardware generator (for PyTorch-to-RTL and RTL-to-Layout conversions). Seven representative TNN designs for time-series signal clustering across diverse sensory modalities are simulated, and their post-layout hardware complexity and design runtimes are assessed to demonstrate the effectiveness of TNNGen. We also highlight TNNGen's ability to accurately forecast silicon metrics without running the hardware flow.

Prabhu Vellaisamy, Harideep Nair, Vamsikrishna Ratnakaram, Dhruv Gupta, John Paul Shen (12/25/2024)

arXiv:2412.18534v1 Announce Type: new Abstract: Graph convolutional networks (GCNs) are popular for building machine-learning applications for graph-structured data. This widespread adoption has led to the development of specialized GCN hardware accelerators. In this work, we address a key architectural challenge for GCN accelerators: how to detect errors in GCN computations arising from random hardware faults at the least computational cost. Each GCN layer performs a graph convolution, mathematically equivalent to multiplying three matrices, computed through two separate matrix multiplications. Existing Algorithm-Based Fault Tolerance (ABFT) techniques can check the results of individual matrix multiplications; however, for a GCN layer, this check would have to be performed twice. To avoid this overhead, this work introduces GCN-ABFT, which directly calculates a checksum for the entire three-matrix product within a single GCN layer, providing a cost-effective approach for error detection in GCN accelerators. Experimental results demonstrate that GCN-ABFT reduces the number of operations needed for checksum computation by over 21% on average for representative GCN applications. These savings are achieved without sacrificing fault-detection accuracy, as evidenced by the presented fault-injection analysis.

Christodoulos Peltekis, Giorgos Dimitrakopoulos (12/25/2024)
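
A single checksum over a three-matrix product can be formed by folding an all-ones vector through the chain once and comparing it to the output's column sums; this is the flavor of check the abstract describes (a sketch of the standard ABFT identity, not necessarily the paper's exact formulation):

```python
import numpy as np

# Check Y = A @ X @ W with one checksum: (1^T A) X W must equal the
# column sums of Y, since 1^T (A X W) = 1^T Y.
rng = np.random.default_rng(1)
A = rng.random((64, 64))   # adjacency/propagation matrix of the layer
X = rng.random((64, 16))   # node features
W = rng.random((16, 8))    # layer weights

Y = A @ X @ W                              # accelerator output to be checked
ones = np.ones(A.shape[0])
checksum = ((ones @ A) @ X) @ W            # three cheap matrix-vector products
print(np.allclose(Y.sum(axis=0), checksum))   # True: no error detected

# A fault flips the check: corrupt one output element and re-test.
Y[3, 4] += 1.0
print(np.allclose(Y.sum(axis=0), checksum))   # False: error detected
```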

arXiv:2412.18579v1 Announce Type: new Abstract: Lookup tables (LUTs) are frequently used to efficiently store arrays of precomputed values for complex mathematical computations. When used in the context of neural networks, these functions exhibit a lack of recognizable patterns, which presents an unusual challenge for conventional logic synthesis techniques. Several approaches are known to break down a single large lookup table into multiple smaller ones that can be recombined. Traditional methods, such as plain tabulation, piecewise linear approximation, and multipartite table methods, often yield inefficient hardware solutions when applied to LUT-based neural networks. This paper introduces ReducedLUT, a novel method to reduce the footprint of the LUTs by injecting don't cares into the compression process. This additional freedom introduces more self-similarities, which can be exploited using known decomposition techniques. We then demonstrate a particular application to machine learning: by replacing unobserved patterns within the training data of neural network models with don't cares, we enable greater compression with minimal model accuracy degradation. In practice, we achieve up to a $1.63\times$ reduction in physical LUT utilization, with a test accuracy drop of no more than $0.01$ accuracy points.

Oliver Cassidy, Marta Andronic, Samuel Coward, George A. Constantinides (12/25/2024)
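
The benefit of injecting don't cares can be shown on a tiny example: masking the entries of unobserved input patterns makes otherwise different sub-tables compatible, so they can share one physical table. A toy sketch illustrating the principle only, not ReducedLUT's decomposition algorithm:

```python
DC = None  # "don't care" marker

def compatible(t1, t2):
    """Two sub-tables can share storage if they agree wherever both are defined."""
    return all(a == b or a is DC or b is DC for a, b in zip(t1, t2))

full_lut = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0]   # a 4-input LUT
observed = {0b0000, 0b0010, 0b1000, 0b1010}                   # patterns seen in training
masked = [full_lut[i] if i in observed else DC for i in range(16)]

low, high = full_lut[:8], full_lut[8:]     # split on the MSB into two 3-input tables
print(compatible(low, high))               # False: the exact sub-tables differ
print(compatible(masked[:8], masked[8:]))  # True: with don't cares, one table suffices
```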

arXiv:2408.01254v2 Announce Type: replace Abstract: To keep pace with the ever-growing computational complexity and data intensity of state-of-the-art AI models, new computing paradigms are being proposed. These paradigms aim at achieving high energy efficiency by mitigating the Von Neumann bottleneck, i.e., the energy cost of moving data between the processing cores and the memory. Convolutional Neural Networks (CNNs) are susceptible to this bottleneck, given the massive amounts of data they must manage. Systolic Arrays (SAs) are promising architectures for mitigating the data transmission cost, thanks to the high data utilization of their Processing Elements (PEs). These PEs continuously exchange and process data locally based on specific dataflows (such as weight stationary and row stationary), in turn reducing the number of accesses to main memory. In SAs, convolutions are managed either as matrix multiplications or by exploiting the raster-order scan of sliding windows. However, data redundancy is a primary concern affecting area, power, and energy. In this paper, we propose TrIM, a novel dataflow for SAs based on a Triangular Input Movement and compatible with CNN computing. TrIM maximizes local input utilization, minimizes weight data movement, and solves the data redundancy problem. Furthermore, TrIM does not incur the significant on-chip memory penalty introduced by the row stationary dataflow. Compared to state-of-the-art SA dataflows, the high data utilization offered by TrIM guarantees ~10x fewer memory accesses. Moreover, since the PEs continuously overlap multiplications and accumulations, TrIM achieves high throughput (up to 81.8% higher than row stationary) while requiring a limited number of registers (up to 15.6x fewer than row stationary).

Cristian Sestito, Shady Agwa, Themis Prodromakis (12/24/2024)

arXiv:2412.16432v1 Announce Type: new Abstract: We propose DFModel, a modeling framework for mapping dataflow computation graphs onto large-scale systems. Mapping a workload to a system requires optimizing dataflow mappings at various levels, including the inter-chip (between chips) level and the intra-chip (within a chip) level. DFModel is, to the best of our knowledge, the first framework to perform this optimization at multiple levels of the memory hierarchy and the interconnection network hierarchy. We use DFModel to explore a wide range of workloads on a variety of systems. Evaluated workloads include two state-of-the-art machine learning applications (Large Language Models and Deep Learning Recommendation Models) and two high-performance computing applications (High Performance LINPACK and Fast Fourier Transform). System parameters investigated span the combination of dataflow and traditional accelerator architectures, memory technologies (DDR, HBM), interconnect technologies (PCIe, NVLink), and interconnection network topologies (torus, DGX, dragonfly). For a variety of workloads on a wide range of systems, DFModel provides mappings that predict an average of 1.25X better performance compared to the mappings measured on real systems. DFModel shows that for large language model training, dataflow architectures achieve 1.52X higher performance, 1.59X better cost efficiency, and 1.6X better power efficiency compared to non-dataflow architectures. On an industrial system with dataflow architectures, the DFModel-optimized dataflow mapping achieves a speedup of 6.13X compared to non-dataflow mappings from previous performance models such as Calculon, and 1.52X compared to a vendor-provided dataflow mapping.

Sho Ko, Nathan Zhang, Olivia Hsu, Ardavan Pedram, Kunle Olukotun (12/24/2024)

arXiv:2412.16757v1 Announce Type: new Abstract: In this work, we present a control variate approximation technique that enables the exploitation of highly approximate multipliers in Deep Neural Network (DNN) accelerators. Our approach does not require retraining and significantly decreases the error induced by approximate multiplications, improving the overall inference accuracy. As a result, our approach makes it possible to satisfy tight accuracy-loss constraints while boosting power savings. Our experimental evaluation, across six different DNNs and several approximate multipliers, demonstrates the versatility of our approach and shows that, compared to the accurate design, our control variate approximation achieves the same performance, 45% power reduction, and less than 1% average accuracy loss. Compared to the corresponding approximate designs without our technique, our approach improves accuracy by 1.9x on average.

Georgios Zervakis, Fabio Frustaci, Ourania Spantidi, Iraklis Anagnostopoulos, Hussam Amrouch, Jörg Henkel (12/24/2024)
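
One simple way to picture an error-correcting term for approximate multipliers is a statistical correction added after accumulation; this is only in the spirit of the idea above, not the paper's control variate formulation. A sketch using a truncation-based approximate multiplier whose expected error is known:

```python
import numpy as np

def approx_mul(a, b, drop=4):
    """Approximate multiplier: truncate the low `drop` bits of one operand."""
    return ((a >> drop) * b) << drop

def approx_dot(w, x, drop=4):
    return sum(approx_mul(int(a), int(b), drop) for a, b in zip(w, x))

rng = np.random.default_rng(0)
w = rng.integers(0, 256, size=64)   # 8-bit weights
x = rng.integers(0, 256, size=64)   # 8-bit activations

exact = int(np.dot(w, x))
approx = approx_dot(w, x)
# Truncation removes on average (2**drop - 1) / 2 from each weight, so the
# expected accumulated error is that value times the sum of the activations;
# adding it back recovers most of the lost accuracy without retraining.
drop = 4
correction = round(((2**drop - 1) / 2) * x.sum())
print(exact, approx, approx + correction)
```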