cs.AR

arXiv:2407.08128v2 Announce Type: replace Abstract: This paper introduces the notion of referring forms as a new metric for analyzing sequential circuits from a functional perspective. Sequential circuits are modeled as causal stream functions, the outputs of which depend solely on the past and current inputs. Referring forms are defined based on the type expressions of functions and represent how a circuit refers to past inputs. The key contribution of this study is identifying a universal property in multiple clock domain circuits using referring forms. This theoretical framework is expected to enhance the comprehension and analysis of sequential circuits.
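
As a point of reference for the causal-stream-function view (not for the paper's referring forms, which are derived from type expressions and are not reproduced here), a minimal Python sketch of a sequential circuit, an accumulator register, modeled as a function whose n-th output depends only on inputs 0..n; the function name and types are illustrative assumptions.

```python
from typing import Iterable, Iterator

def accumulator(inputs: Iterable[int]) -> Iterator[int]:
    """A causal stream function: the output at step n depends only on inputs 0..n."""
    state = 0                       # the circuit's internal register
    for x in inputs:                # one iteration per clock cycle
        state = state + x           # next-state logic
        yield state                 # output for this cycle

print(list(accumulator([1, 0, 2, 5])))   # [1, 1, 3, 8]
```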

Shunji Nishimura · 1/22/2025

arXiv:2501.10430v1 Announce Type: new Abstract: Aquaculture involves cultivating marine and freshwater organisms, with real-time monitoring of aquatic parameters being crucial in fish farming. This thesis proposes an IoT-based framework using sensors and Arduino for efficient monitoring and control of water quality. Different sensors, including pH, temperature, and turbidity, are placed in the cultivating pond water, and each is connected to a common microcontroller board built on an Arduino Uno. The sensors read the data from the water and store it as a CSV file in the ThingSpeak IoT cloud through the Arduino microcontroller. In the experimental part, we collected data from 5 ponds of various sizes and environments. After obtaining the real-time data, we compared it with standard reference values to decide which ponds are satisfactory for cultivating fish and which are not. We then labeled the data with 11 fish categories, including Katla, sing, prawn, rui, koi, pangas, tilapia, silver carp, karpio, magur, and shrimp. In addition, the data were analyzed using 10 machine learning (ML) algorithms: J48, Random Forest, K-NN, K*, LMT, REPTree, JRIP, PART, Decision Table, and LogitBoost. Experimental evaluation showed that, among the 5 ponds, only three were suitable for fish farming; only these 3 ponds satisfied the standard reference values of pH (6.5-8.5), temperature (16-24 °C), turbidity (below 10 NTU), conductivity (970-1825 µS/cm), and depth (1-4 m). Among the state-of-the-art machine learning algorithms, Random Forest achieved the highest performance metrics, with an accuracy of 94.42%, a kappa statistic of 93.5%, and an average TP rate of 94.4%. In addition, we calculated the BOD, COD, and DO for one scenario. This study includes details of the proposed IoT system's prototype hardware.
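
For orientation, a minimal Python sketch of the decision rule implied by the quoted reference ranges; the parameter names and the sample reading are hypothetical, and the actual system reads these values from Arduino-attached sensors and logs them to ThingSpeak.

```python
# Reference ranges quoted in the abstract: pH, temperature (°C), turbidity (NTU),
# conductivity (µS/cm), and depth (m).
REFERENCE = {
    "ph": (6.5, 8.5),
    "temperature_c": (16.0, 24.0),
    "turbidity_ntu": (0.0, 10.0),
    "conductivity_us_cm": (970.0, 1825.0),
    "depth_m": (1.0, 4.0),
}

def pond_is_satisfactory(reading: dict) -> bool:
    """A pond passes only if every measured parameter lies inside its reference range."""
    return all(lo <= reading[name] <= hi for name, (lo, hi) in REFERENCE.items())

# Hypothetical real-time reading from one pond.
sample = {"ph": 7.2, "temperature_c": 21.0, "turbidity_ntu": 6.0,
          "conductivity_us_cm": 1200.0, "depth_m": 2.5}
print(pond_is_satisfactory(sample))   # True
```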

Md. Monirul Islam · 1/22/2025

arXiv:2501.11867v1 Announce Type: new Abstract: This paper presents digital hardware for computing polynomial multiplication using the Number Theoretic Transform (NTT), specifically designed for implementation on Field Programmable Gate Arrays (FPGAs). Multiplying two large polynomials is central to many modern encryption schemes, including those based on Ring Learning with Errors (RLWE). The proposed design uses First In, First Out (FIFO) buffers to make the datapath fully pipelined and capable of multiplying two degree-n polynomials in n/2 clock cycles. The design achieves a two-fold reduction in polynomial multiplication processing time compared to the state of the art, enabling twice as much encryption in the same amount of time, while utilizing fewer resources than the fastest reported work.
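
The abstract's pipelined FIFO datapath is hardware-specific, but the underlying arithmetic is standard: transform both polynomials with an NTT, multiply pointwise modulo q, and invert. Below is a minimal, unoptimized Python sketch with illustrative small parameters (n = 8, q = 17, omega = 9, a primitive 8th root of unity mod 17); it computes a cyclic product, whereas RLWE schemes typically use a negacyclic variant.

```python
def ntt(a, omega, q):
    """Naive O(n^2) number-theoretic transform over Z_q with root of unity omega."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, q) for j in range(n)) % q for i in range(n)]

def poly_mul_ntt(a, b, q=17, omega=9):
    """Cyclic (mod x^n - 1) polynomial product via pointwise multiplication in the NTT domain."""
    n = len(a)                                   # omega must be a primitive n-th root of unity mod q
    A, B = ntt(a, omega, q), ntt(b, omega, q)
    C = [(u * v) % q for u, v in zip(A, B)]      # pointwise product
    inv_n, inv_omega = pow(n, -1, q), pow(omega, -1, q)
    return [(inv_n * c) % q for c in ntt(C, inv_omega, q)]   # inverse NTT

# Zero-padded inputs so the cyclic product equals the plain polynomial product (mod q).
print(poly_mul_ntt([1, 2, 3, 4, 0, 0, 0, 0], [5, 6, 7, 0, 0, 0, 0, 0]))
# [5, 16, 0, 1, 11, 11, 0, 0]  (coefficients reduced mod 17)
```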

Moslem Heidarpur, Mitra Mirhassani, Norman Chang · 1/22/2025

arXiv:2405.14209v2 Announce Type: replace Abstract: Compute Express Link (CXL) is emerging as a promising memory interface technology. Because CXL devices are not yet commonly available, the performance of CXL memory is largely unknown. What are the use cases for CXL memory? What are the impacts of CXL memory on application performance? How should CXL memory be used in combination with existing memory components? In this work, we study the performance of three genuine CXL memory-expansion cards from different vendors. We characterize the basic performance of CXL memory, study how HPC applications and large language models can benefit from it, and study the interplay between memory tiering and page interleaving. We also propose a novel data object-level interleaving policy that matches interleaving decisions to memory access patterns. We reveal the challenges and opportunities of using CXL memory.
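
As a rough illustration only (the abstract does not specify the actual policy), the sketch below contrasts page-granular round-robin interleaving with an object-level policy that keeps latency-sensitive objects in local DRAM and interleaves bandwidth-bound objects across local DRAM and a CXL expander; the node names and access profiles are assumptions.

```python
NODES = ["local_dram", "cxl_expander"]

def place_page(access_profile: str, page_index: int) -> str:
    """Hypothetical object-level policy: interleave only bandwidth-bound objects."""
    if access_profile == "bandwidth_bound":
        return NODES[page_index % len(NODES)]    # round-robin page interleaving
    return "local_dram"                          # keep latency-sensitive objects local

print([place_page("bandwidth_bound", p) for p in range(4)])
print([place_page("latency_sensitive", p) for p in range(4)])
```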

Xi Wang, Jie Liu, Jianbo Wu, Shuangyan Yang, Jie Ren, Bhanu Shankar, Dong Li · 1/22/2025

arXiv:2501.10864v1 Announce Type: new Abstract: This paper details the purpose, difficulties, theory, implementation, and results of developing a Fast Fourier Transform (FFT) using the prime factor algorithm on an embedded system. Many applications analyze the frequency content of signals, which is referred to as spectral analysis. These applications include communication systems, radar systems, control systems, seismology, speech, music, sonar, finance, image processing, and neural networks. For many real-time applications, the speed at which the spectral analysis is performed is crucial. Spectral analysis is performed with a Fourier transform; in embedded systems, where the analysis is done digitally, a discrete Fourier transform (DFT) is employed. The main goal of this project is to develop an FFT for a 36-point DFT on the Nuvoton Nu-LB-NUC140V2, using the prime factor algorithm to compute the DFT efficiently.
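
Since 36 = 4 x 9 with gcd(4, 9) = 1, the prime factor (Good-Thomas) algorithm re-indexes the 36-point DFT as a 4 x 9 two-dimensional DFT with no twiddle-factor multiplications between stages. A small NumPy sketch of this mapping follows (it verifies against a direct FFT); an embedded implementation would replace the row/column FFT calls with fixed 4- and 9-point kernels.

```python
import numpy as np

def pfa_dft(x, N1=4, N2=9):
    """Prime-factor (Good-Thomas) DFT of length N = N1 * N2, with gcd(N1, N2) = 1."""
    N = N1 * N2
    assert len(x) == N and np.gcd(N1, N2) == 1
    u1 = pow(N2, -1, N1)                     # N2^{-1} mod N1, used by the output map
    u2 = pow(N1, -1, N2)                     # N1^{-1} mod N2
    # Good's input map: scatter the 1-D signal into an N1 x N2 grid (no twiddle factors).
    grid = np.empty((N1, N2), dtype=complex)
    for n1 in range(N1):
        for n2 in range(N2):
            grid[n1, n2] = x[(N2 * n1 + N1 * n2) % N]
    grid = np.fft.fft(grid, axis=0)          # N1-point DFTs down the columns
    grid = np.fft.fft(grid, axis=1)          # N2-point DFTs across the rows
    # CRT output map: gather the 2-D result back into the 1-D spectrum.
    X = np.empty(N, dtype=complex)
    for k1 in range(N1):
        for k2 in range(N2):
            X[(N2 * u1 * k1 + N1 * u2 * k2) % N] = grid[k1, k2]
    return X

x = np.random.randn(36) + 1j * np.random.randn(36)
print(np.allclose(pfa_dft(x), np.fft.fft(x)))    # True
```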

Josh Vernon, D. G. Perera · 1/22/2025

arXiv:2408.01902v3 Announce Type: replace Abstract: Characterizing and understanding graph neural networks (GNNs) is essential for identifying performance bottlenecks and facilitating their deployment in parallel and distributed systems. Despite substantial work in this area, a comprehensive survey on characterizing and understanding GNNs from a computer architecture perspective is lacking. This work presents a comprehensive survey, proposing a triple-level classification method to categorize, summarize, and compare existing efforts, particularly focusing on their implications for parallel architectures and distributed systems. We identify promising future directions for GNN characterization that align with the challenges of optimizing hardware and software in parallel and distributed systems. Our survey aims to help scholars systematically understand GNN performance bottlenecks and execution patterns from a computer architecture perspective, thereby contributing to the development of more efficient GNN implementations across diverse parallel architectures and distributed systems.

Meng Wu, Mingyu Yan, Wenming Li, Xiaochun Ye, Dongrui Fan, Yuan Xie · 1/22/2025

arXiv:2501.11286v1 Announce Type: new Abstract: The wide adoption and substantial computational resource requirements of attention-based Transformers have spurred the demand for efficient hardware accelerators. Beyond digital accelerators, there is growing interest in photonics due to its high energy efficiency and ultra-fast processing speeds. However, significant signal conversion overhead limits the performance of photonic-based accelerators. In this work, we propose HyAtten, a photonic-based attention accelerator that minimizes signal conversion overhead. HyAtten incorporates a signal comparator to classify signals into two categories based on whether they can be processed by low-resolution converters. HyAtten integrates low-resolution converters to process all low-resolution signals, thereby boosting the parallelism of photonic computing. For signals requiring high-resolution conversion, HyAtten uses digital circuits instead of signal converters to reduce area and latency overhead. Compared to the state-of-the-art photonic-based Transformer accelerator, HyAtten achieves 9.8X performance/area and 2.2X energy-efficiency/area improvements.
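
The abstract does not spell out the comparator's criterion, so the sketch below is only a schematic stand-in: signals whose magnitude fits the low-resolution converter's range go to the low-resolution (photonic-friendly) path, and the rest are handed to the digital path; the 4-bit limit is an assumed parameter.

```python
LOW_RES_BITS = 4
LIMIT = 2 ** LOW_RES_BITS            # largest magnitude the low-res converter handles (assumed)

def route(signals):
    """Split signals between the low-resolution converter path and the digital path."""
    low_res, digital = [], []
    for s in signals:
        (low_res if abs(s) < LIMIT else digital).append(s)
    return low_res, digital

print(route([3, 120, 9, 15, 64]))    # ([3, 9, 15], [120, 64])
```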

Huize Li, Dan Chen, Tulika Mitra · 1/22/2025

arXiv:2501.11839v1 Announce Type: new Abstract: Automating analog and radio-frequency (RF) circuit design using machine learning (ML) significantly reduces the time and effort required for parameter optimization. This study explores supervised ML-based approaches for designing circuit parameters from performance specifications across various circuit types, including homogeneous and heterogeneous designs. By evaluating diverse ML models, from neural networks like transformers to traditional methods like random forests, we identify the best-performing models for each circuit. Our results show that simpler circuits, such as low-noise amplifiers, achieve exceptional accuracy with mean relative errors as low as 0.3% due to their linear parameter-performance relationships. In contrast, complex circuits, like power amplifiers and voltage-controlled oscillators, present challenges due to their non-linear interactions and larger design spaces. For heterogeneous circuits, our approach achieves an 88% reduction in errors with increased training data, with the receiver achieving a mean relative error as low as 0.23%, showcasing the scalability and accuracy of the proposed methodology. Additionally, we provide insights into model strengths, with transformers excelling in capturing non-linear mappings and k-nearest neighbors performing robustly in moderately linear parameter spaces, especially in heterogeneous circuits with larger datasets. This work establishes a foundation for extending ML-driven design automation, enabling more efficient and scalable circuit design workflows.
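
A minimal sketch of the supervised setup described here: a model is trained to map performance specifications to circuit parameters and scored by mean relative error. The data below is synthetic, the spec/parameter names are placeholders, and scikit-learn is assumed available; the study itself evaluates several model families (transformers, random forests, k-NN) on real circuit datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset: each row maps performance specs (e.g. gain, noise
# figure, matching, power) to circuit parameters (e.g. device widths, bias values).
rng = np.random.default_rng(0)
specs = rng.uniform(0.1, 1.0, size=(2000, 4))
params = specs @ rng.uniform(0.5, 1.5, size=(4, 6)) + 0.01 * rng.normal(size=(2000, 6))

X_tr, X_te, y_tr, y_te = train_test_split(specs, params, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
mre = np.mean(np.abs(pred - y_te) / np.abs(y_te))      # mean relative error metric
print(f"mean relative error: {mre:.2%}")
```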

Asal Mehradfar, Xuzhe Zhao, Yue Niu, Sara Babakniya, Mahdi Alesheikh, Hamidreza Aghasi, Salman Avestimehr · 1/22/2025

arXiv:2501.12032v1 Announce Type: new Abstract: Keeping ML-based recommender models up-to-date as data drifts and evolves is essential to maintain accuracy. As a result, online data preprocessing plays an increasingly important role in serving recommender systems. Existing solutions employ multiple CPU workers to saturate the input bandwidth of a single training node. Such an approach results in high deployment costs and energy consumption. For instance, a recent report from industrial deployments shows that data storage and ingestion pipelines can account for over 60\% of the power consumption in a recommender system. In this paper, we tackle the issue from a hardware perspective by introducing Piper, a flexible and network-attached accelerator that executes data loading and preprocessing pipelines in a streaming fashion. As part of the design, we define MiniPipe, the smallest pipeline unit enabling multi-pipeline implementation by executing various data preprocessing tasks across the single board, giving Piper the ability to be reconfigured at runtime. Our results, using publicly released commercial pipelines, show that Piper, prototyped on a power-efficient FPGA, achieves a 39$\sim$105$\times$ speedup over a server-grade, 128-core CPU and 3$\sim$17$\times$ speedup over GPUs like RTX 3090 and A100 in multiple pipelines. The experimental analysis demonstrates that Piper provides advantages in both latency and energy efficiency for preprocessing tasks in recommender systems, providing an alternative design point for systems that today are in very high demand.
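
Piper itself is an FPGA design, but the MiniPipe idea of chaining small preprocessing stages that each consume and emit a record stream can be sketched in software; the stages below (feature hashing and value normalization) are hypothetical examples, not the released commercial pipelines.

```python
from typing import Callable, Dict, Iterable, Iterator

Stage = Callable[[Iterator[Dict]], Iterator[Dict]]

def hash_features(records: Iterator[Dict]) -> Iterator[Dict]:
    for r in records:
        r["user_bucket"] = hash(r["user_id"]) % 1024     # feature-hashing stage
        yield r

def normalize_price(records: Iterator[Dict]) -> Iterator[Dict]:
    for r in records:
        r["price"] = r["price"] / 100.0                  # fixed-point to float stage
        yield r

def run_pipeline(records: Iterable[Dict], stages: Iterable[Stage]) -> Iterator[Dict]:
    stream = iter(records)
    for stage in stages:
        stream = stage(stream)                           # stages run back to back, streaming
    return stream

sample = [{"user_id": "u1", "price": 995}, {"user_id": "u2", "price": 250}]
print(list(run_pipeline(sample, [hash_features, normalize_price])))
```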

Yu Zhu, Wenqi Jiang, Gustavo Alonso · 1/22/2025

arXiv:2501.12084v1 Announce Type: new Abstract: Modern GPUs, with their specialized hardware like tensor cores, are essential for demanding AI and deep learning applications. This study presents a comprehensive, multi-level microbenchmarking analysis of the NVIDIA Hopper GPU architecture, delving into its performance characteristics and novel features. We benchmark Hopper's memory subsystem latency and throughput, comparing its L2 partitioned cache behavior and global memory access patterns against recent GPU generations, Ampere and Ada Lovelace. Our analysis reveals significant performance differences and architectural improvements in Hopper. A core contribution of this work is a detailed evaluation of Hopper's fourth-generation tensor cores, including their FP8 precision support and the novel asynchronous wgmma instructions, assessing their impact on matrix multiply-accumulate operations. We further investigate the performance implications of other key Hopper innovations: DPX instructions for accelerating dynamic programming algorithms, distributed shared memory (DSM) for inter-SM communication, and the Tensor Memory Accelerator (TMA) for asynchronous data movement. This multi-level approach encompasses instruction-level microbenchmarks, library-level analysis of the Transformer Engine, and application-level benchmarks of tensor core performance within large language models. Our findings provide valuable, in-depth insights for software developers seeking to optimize performance and develop accurate performance models for the Hopper architecture, ultimately contributing to a deeper understanding of its potential for accelerating AI and other computationally intensive workloads.

Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Hongyuan Liu, Qiang Wang, Xiaowen Chu · 1/22/2025

arXiv:2412.20954v2 Announce Type: replace Abstract: Customized processors are attractive solutions for a vast range of domain-specific applications due to their high energy efficiency. However, designing a processor in traditional flows is time-consuming and expensive. To address this, researchers have explored methods including the use of agile development tools like Chisel or SpinalHDL, high-level synthesis (HLS) from programming languages like C or SystemC, and more recently, leveraging large language models (LLMs) to generate hardware description language (HDL) code from natural language descriptions. However, each method has limitations in terms of expressiveness, correctness, and performance, leading to a persistent tension between the level of automation and the effectiveness of the design. Overall, how to automatically design highly efficient and practical processors with minimal human effort remains a challenge. In this paper, we propose AGON, a novel framework designed to leverage LLMs for the efficient design of out-of-order (OoO) customized processors with minimal human effort. Central to AGON is the nano-operator function (nOP function) based Intermediate Representation (IR), which bridges high-level descriptions and hardware implementations while decoupling functionality from performance optimization, thereby providing an automatic design framework that is expressive and efficient, has correctness guarantees, and enables PPA (Power, Performance, and Area) optimization. Experimental results show that, outperforming previous LLM-assisted automatic design flows, AGON facilitates designing a series of customized OoO processors that achieve on average a 2.35$\times$ speedup compared with BOOM, a general-purpose CPU designed by experts, with minimal design effort.

Chongxiao Li, Di Huang, Pengwei Jin, Tianyun Ma, Husheng Han, Shuyao Cheng, Yifan Hao, Yongwei Zhao, Guanglin Xu, Zidong Du, Rui Zhang, Xiaqing Li, Yuanbo Wen, Xing Hu, Qi Guo · 1/22/2025

arXiv:2501.10682v1 Announce Type: new Abstract: The CXL-based solid-state drive (CXL-SSD) provides a promising approach towards scaling the main memory capacity at low cost. However, the CXL-SSD faces performance challenges due to the long flash access latency and unpredictable events such as garbage collection in the SSD device, stalling the host processor and wasting compute cycles. Although the CXL interface enables byte-granular data access to the SSD, accessing flash chips is still at page granularity due to physical limitations. This mismatch of access granularity causes significant unnecessary I/O traffic to flash chips, worsening the suboptimal end-to-end data access performance. In this paper, we present SkyByte, an efficient CXL-based SSD that employs a holistic approach to address the aforementioned challenges by co-designing the host operating system (OS) and the SSD controller. To alleviate the long memory stall when accessing the CXL-SSD, SkyByte revisits the OS context switch mechanism and enables opportunistic context switches upon the detection of long access delays. To accommodate byte-granular data accesses, SkyByte architects the internal DRAM of the SSD controller into a cacheline-level write log and a page-level data cache, and enables data coalescing upon log cleaning to reduce the I/O traffic to flash chips. SkyByte also employs optimization techniques that include adaptive page migration for exploiting the performance benefits of fast host memory by promoting hot pages in the CXL-SSD to the host. We implement SkyByte with a CXL-SSD simulator and evaluate its efficiency with various data-intensive applications. Our experiments show that SkyByte outperforms current CXL-based SSDs by 6.11X and reduces the I/O traffic to flash chips by 23.08X on average. SkyByte also reaches 75% of the performance of an ideal case that assumes unlimited DRAM capacity in the host, offering an attractive, cost-effective solution.
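
A minimal sketch of the coalescing idea, under assumed structures: byte-granular writes land in a cacheline-level log in the controller's DRAM, and log cleaning groups them by flash page so each dirty page is written back once rather than once per store. The class and constant names are illustrative, not SkyByte's actual interfaces.

```python
PAGE_SIZE = 4096                                 # flash page granularity (assumed)

class WriteLog:
    """Toy cacheline-level write log with page-granular coalescing on cleaning."""
    def __init__(self):
        self.entries = []                        # (byte address, data) in arrival order

    def append(self, addr, data):
        self.entries.append((addr, data))

    def clean(self):
        """Group logged writes by flash page: one page write-back per dirty page."""
        pages = {}
        for addr, data in self.entries:
            pages.setdefault(addr // PAGE_SIZE, []).append((addr % PAGE_SIZE, data))
        self.entries.clear()
        return pages                             # page id -> list of (offset, data)

log = WriteLog()
for addr in (64, 128, 8192, 72):
    log.append(addr, b"x")
print({page: len(writes) for page, writes in log.clean().items()})   # {0: 3, 2: 1}
```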

Haoyang Zhang, Yuqi Xue, Yirui Eric Zhou, Shaobo Li, Jian Huang · 1/22/2025

arXiv:2501.10983v1 Announce Type: new Abstract: Previous schemes for designing a secure branch prediction unit (SBPU) based on physical isolation can only offer limited security and significantly affect the BPU's prediction capability, leading to prominent performance degradation. Moreover, encryption-based SBPU schemes relying on periodic key re-randomization risk being compromised by advanced attack algorithms, and their performance overhead is also considerable. To this end, this paper proposes a conflict-invisible SBPU (CIBPU). CIBPU employs a redundant storage design, load-aware indexing and replacement, and an encryption mechanism that requires no periodic key updates, to prevent attackers from perceiving branch conflicts. We provide a thorough security analysis, which shows that CIBPU achieves strong security throughout the BPU's lifecycle. We implement CIBPU in a RISC-V core model in gem5. The experimental results show that CIBPU causes an average performance overhead of only 1.12%-2.20% with acceptable hardware storage overhead, which is the lowest among the state-of-the-art SBPU schemes. CIBPU has also been implemented in the open-source RISC-V core SonicBOOM, which is then deployed on an FPGA board. The FPGA-based evaluation shows an average performance degradation of 2.01%, approximately consistent with the result obtained in gem5.

Zhe Zhou, Fei Tong, Hongyu Wang, Xiaoyu Cheng, Fang Jiang, Zhikun Zhang, Yuxing Mao · 1/22/2025

arXiv:2501.11159v1 Announce Type: new Abstract: This paper presents LiFT, a lightweight, fully quantized 3D object detection algorithm for LiDAR data, optimized for real-time inference on FPGA platforms. Through an in-depth analysis of FPGA-specific limitations, we identify a set of FPGA-induced constraints that shape the algorithm's design. These include a computational complexity limit of 30 GMACs (billion multiply-accumulate operations), INT8 quantization for weights and activations, 2D cell-based processing instead of 3D voxels, and minimal use of skip connections. To meet these constraints while maximizing performance, LiFT combines novel mechanisms with state-of-the-art techniques such as reparameterizable convolutions and fully sparse architecture. Key innovations include the Dual-bound Pillar Feature Net, which boosts performance without increasing complexity, and an efficient scheme for INT8 quantization of input features. With a computational cost of just 20.73 GMACs, LiFT stands out as one of the few algorithms targeting minimal-complexity 3D object detection. Among comparable methods, LiFT ranks first, achieving an mAP of 51.84% and an NDS of 61.01% on the challenging NuScenes validation dataset. The code will be available at https://github.com/vision-agh/lift.
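
The paper's input-feature quantization scheme is only named in the abstract, so for orientation here is plain symmetric per-tensor INT8 quantization of the kind such constraints imply; the scale choice and function names are assumptions, not LiFT's actual scheme.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats to [-128, 127] with one scale."""
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0      # fall back to 1.0 for an all-zero tensor
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

features = np.random.randn(8, 4).astype(np.float32)
q, scale = quantize_int8(features)
print(np.max(np.abs(dequantize_int8(q, scale) - features)))   # worst-case rounding error
```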

Konrad Lis, Tomasz Kryjak, Marek Gorgon · 1/22/2025

arXiv:2501.10658v1 Announce Type: new Abstract: The emergence of neural network capabilities invariably leads to a significant surge in computational demands due to expanding model sizes and increased computational complexity. To reduce model size and lower inference costs, recent research has focused on simplifying models and designing hardware accelerators using low-bit quantization. However, due to numerical representation limits, scalar quantization cannot reduce the bit width below 1 bit, diminishing its benefits. To break through these limitations, we introduce LUT-DLA, a Look-Up Table (LUT) Deep Learning Accelerator Framework that utilizes vector quantization to convert neural network models into LUTs, achieving extreme low-bit quantization. The LUT-DLA framework facilitates efficient and cost-effective hardware accelerator designs and supports the LUTBoost algorithm, which helps to transform various DNN models into LUT-based models via multistage training, drastically cutting both computational and hardware overhead. Additionally, through co-design space exploration, LUT-DLA assesses the impact of various model and hardware parameters to fine-tune hardware configurations for different application scenarios, optimizing performance and efficiency. Our comprehensive experiments show that LUT-DLA achieves improvements in power efficiency and area efficiency with gains of $1.4$~$7.0\times$ and $1.5$~$146.1\times$, respectively, while maintaining only a modest accuracy drop. For CNNs, accuracy decreases by $0.1\%$~$3.1\%$ using the $L_2$ distance similarity, $0.1\%$~$3.4\%$ with the $L_1$ distance similarity, and $0.1\%$~$3.8\%$ when employing the Chebyshev distance similarity. For transformer-based models, the accuracy drop ranges from $1.4\%$ to $3.0\%$.
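
A minimal sketch of the vector-quantization idea behind LUT-based inference: activations snap to the nearest codebook vector (here by L2 distance, one of the similarity measures mentioned), and each codeword-times-weight product is precomputed into a table, so the multiply becomes a lookup. The codebook size and shapes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))        # 16 codewords for 4-dimensional subvectors
W = rng.normal(size=(4, 8))                # weight slice acting on one subvector
lut = codebook @ W                         # 16 x 8 table of precomputed codeword @ W products

def lut_matvec(x_subvector):
    """Quantize the activation to its nearest codeword (L2), then replace the multiply with a lookup."""
    idx = np.argmin(np.linalg.norm(codebook - x_subvector, axis=1))
    return lut[idx]

x = rng.normal(size=4)
print(np.linalg.norm(lut_matvec(x) - x @ W))   # approximation error introduced by the lookup
```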

Guoyu Li (University of Chinese Academy of Sciences, Microsoft Research), Shengyu Ye (Microsoft Research), Chunyun Chen (NTU Singapore), Yang Wang (Microsoft Research), Fan Yang (Microsoft Research), Ting Cao (Microsoft Research), Cheng Liu (University of Chinese Academy of Sciences), Mohamed M. Sabry (NTU Singapore), Mao Yang (Microsoft Research) · 1/22/2025

arXiv:2501.11211v1 Announce Type: new Abstract: Diffusion models achieve superior performance in image generation tasks. However, they incur significant computation overheads due to their iterative structure. To address these overheads, we analyze this iterative structure and observe that adjacent time steps in diffusion models exhibit high value similarity, leading to narrower differences between consecutive time steps. We adapt these characteristics to a quantized diffusion model and reveal that the majority of these differences can be represented with reduced bit-width, and even zero. Based on our observations, we propose the Ditto algorithm, a difference processing algorithm that leverages temporal similarity with quantization to enhance the efficiency of diffusion models. By exploiting the narrower differences and the distributive property of layer operations, it performs full bit-width operations for the initial time step and processes subsequent steps with temporal differences. In addition, Ditto execution flow optimization is designed to mitigate the memory overhead of temporal difference processing, further boosting the efficiency of the Ditto algorithm. We also design the Ditto hardware, a specialized hardware accelerator, fully exploiting the dynamic characteristics of the proposed algorithm. As a result, the Ditto hardware achieves up to 1.5x speedup and 17.74% energy saving compared to other accelerators.
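
The distributive-property observation can be checked directly for a linear layer: since W(x_t) = W(x_{t-1}) + W(x_t - x_{t-1}), later time steps can reuse the previous result and apply the layer only to the narrow temporal difference. A NumPy sketch with arbitrarily chosen shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))                    # linear layer weights
x_prev = rng.normal(size=32)
x_curr = x_prev + 0.01 * rng.normal(size=32)     # adjacent diffusion steps are highly similar

y_prev = W @ x_prev                              # full bit-width work, first step only
delta = x_curr - x_prev                          # narrow temporal difference
y_curr = y_prev + W @ delta                      # difference processing for later steps

print(np.allclose(y_curr, W @ x_curr))           # True: distributive property of the layer
```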

Sungbin Kim, Hyunwuk Lee, Wonho Cho, Mincheol Park, Won Woo Ro · 1/22/2025

arXiv:2501.09242v2 Announce Type: replace Abstract: Customized accelerators have transformed modern computing by enhancing energy efficiency and performance through specialization. Field Programmable Gate Arrays play a pivotal role in this domain due to their flexibility and high-performance potential. High-Level Synthesis and source-to-source compilers simplify hardware design by converting high-level code into hardware descriptions enriched with directives. However, achieving high Quality of Results in FPGA designs remains challenging, requiring complex transformations, strategic directive use, and efficient data management. While existing approaches like Design Space Exploration (DSE) and source-to-source compilers have made strides in improving performance, they often address isolated aspects of the design process. This paper introduces Prometheus, a holistic framework that integrates task fusion, tiling, loop permutation, computation-communication overlap, and concurrent task execution into a unified design space. Leveraging Non-Linear Problem methodologies, Prometheus explores this space to find solutions under resource constraints, enabling bitstream generation.

Stéphane Pouget, Michael Lo, Louis-Noël Pouchet, Jason Cong · 1/22/2025

arXiv:2501.11554v1 Announce Type: new Abstract: Egomotion estimation is crucial for applications such as autonomous navigation and robotics, where accurate and real-time motion tracking is required. However, traditional methods relying on inertial sensors are highly sensitive to external conditions, and suffer from drifts leading to large inaccuracies over long distances. Vision-based methods, particularly those utilising event-based vision sensors, provide an efficient alternative by capturing data only when changes are perceived in the scene. This approach minimises power consumption while delivering high-speed, low-latency feedback. In this work, we propose a fully event-based pipeline for egomotion estimation that processes the event stream directly within the event-based domain. This method eliminates the need for frame-based intermediaries, allowing for low-latency and energy-efficient motion estimation. We construct a shallow spiking neural network using a synaptic gating mechanism to convert precise event timing into bursts of spikes. These spikes encode local optical flow velocities, and the network provides an event-based readout of egomotion. We evaluate the network's performance on a dedicated chip, demonstrating strong potential for low-latency, low-power motion estimation. Additionally, simulations of larger networks show that the system achieves state-of-the-art accuracy in egomotion estimation tasks with event-based cameras, making it a promising solution for real-time, power-constrained robotics applications.

Hugh Greatorex, Michele Mastella, Madison Cotteret, Ole Richter, Elisabetta Chicca · 1/22/2025

arXiv:2411.06376v2 Announce Type: replace Abstract: Peripheral Component Interconnect Express (PCIe) is the de facto interconnect standard for high-speed peripherals and CPUs. Prototyping and optimizing PCIe devices for emerging scenarios is an ongoing challenge. Since Transaction Layer Packets (TLPs) capture device-CPU interactions, it is crucial to analyze and generate realistic TLP traces for effective device design and optimization. Generative AI offers a promising approach for creating intricate, custom TLP traces necessary for PCIe hardware and software development. However, existing models often generate impractical traces due to the absence of PCIe-specific constraints, such as TLP ordering and causality. This paper presents Phantom, the first framework that treats TLP trace generation as a generative AI problem while incorporating PCIe-specific constraints. We validate Phantom's effectiveness by generating TLP traces for an actual PCIe network interface card. Experimental results show that Phantom produces practical, large-scale TLP traces, significantly outperforming existing models, with improvements of up to 1000$\times$ in task-specific metrics and up to 2.19$\times$ in Frechet Inception Distance (FID) compared to backbone-only methods.

Zhibai Huang, Yihan Shen, Yongchen Xie, Zhixiang Wei, Yun Wang, Fangxin Liu, Tao Song, Zhengwei Qi · 1/14/2025

arXiv:2412.06838v2 Announce Type: replace Abstract: Brains perform decision-making via Bayes' theorem. The theorem quantifies events as probabilities and, based on probability rules, renders decisions. Learning from this, Bayes' theorem can be applied to enable efficient user-scene interactions. However, given its probabilistic nature, implementing Bayes' theorem in hardware using conventional deterministic computing can incur excessive computational cost and decision latency. Though challenging, here we present a probabilistic computing approach based on memristors to implement Bayes' theorem. We integrate memristors with Boolean logic and, by exploiting the volatile stochastic switching of the memristors, realise probabilistic logic operations, key for the hardware implementation of Bayes' theorem. To empirically validate the efficacy of the hardware Bayes' theorem in user-scene interactions, we develop lightweight Bayesian inference and fusion hardware operators using the probabilistic logic and apply the operators in road scene parsing for self-driving, including route planning and obstacle detection. The results show our operators can achieve reliable decisions in less than 0.4 ms (or equivalently 2,500 fps), outperforming human decision-making and existing driving assistance systems.
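
For the fusion operator's arithmetic (independent of the memristor hardware), here is a minimal sketch of Bayes-rule fusion of two conditionally independent cues into an obstacle posterior; the prior and likelihood values are made-up numbers, not results from the paper.

```python
def fuse(prior, likelihoods_given_obstacle, likelihoods_given_clear):
    """Bayes-rule fusion of conditionally independent cues into an obstacle posterior."""
    p_obstacle, p_clear = prior, 1.0 - prior
    for l_obs, l_clr in zip(likelihoods_given_obstacle, likelihoods_given_clear):
        p_obstacle *= l_obs                      # multiply in each cue's likelihood
        p_clear *= l_clr
    return p_obstacle / (p_obstacle + p_clear)   # normalize

# Hypothetical numbers: a camera cue and a lidar cue both weakly indicate an obstacle.
print(fuse(prior=0.1,
           likelihoods_given_obstacle=[0.8, 0.7],
           likelihoods_given_clear=[0.3, 0.4]))  # approximately 0.34
```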

Lekai Song, Pengyu Liu, Yang Liu, Jingfang Pei, Wenyu Cui, Songwei Liu, Yingyi Wen, Teng Ma, Kong-Pang Pun, Leonard W. T. Ng, Guohua Hu · 1/14/2025