cs.AR

arXiv:2502.07842v2 Announce Type: replace Abstract: Compute-in-memory (CIM) is an efficient method for implementing deep neural networks (DNNs) but suffers from substantial analog-to-digital converter (ADC) overhead, especially as ADC precision increases. Low-precision ADCs can reduce this overhead but introduce partial-sum quantization errors that degrade accuracy. Additionally, low-bit weight constraints, imposed by cell limitations and the need for multiple cells to represent higher-bit weights, present further challenges. While fine-grained partial-sum quantization has been studied as a way to lower ADC resolution effectively, weight granularity, which limits the accuracy achievable under partial-sum quantization, remains underexplored. This work addresses these challenges by aligning weight and partial-sum quantization granularities at the column-wise level. Our method improves accuracy without increasing dequantization overhead, simplifies training by removing the two-stage process, and ensures robustness to memory cell variations via independent column-wise scale factors. We also propose an open-source CIM-oriented convolution framework that handles fine-grained weights and partial sums efficiently, incorporating a novel tiling method and group convolution. Experimental results on ResNet-20 (CIFAR-10, CIFAR-100) and ResNet-18 (ImageNet) show accuracy improvements of 0.99%, 2.69%, and 1.01%, respectively, over the best-performing related works. Additionally, a variation analysis demonstrates the robustness of our method against memory cell variations. These findings highlight the effectiveness of our quantization scheme in enhancing accuracy and robustness while maintaining hardware efficiency in CIM-based DNN implementations. Our code is available at https://github.com/jiyoonkm/ColumnQuant.
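
As a rough illustration of the column-wise idea, here is a minimal NumPy sketch that quantizes weights with one independent scale factor per column and quantizes partial sums at the same granularity; the function names and bit-widths are illustrative assumptions, not the authors' ColumnQuant code.

```python
import numpy as np

def quantize_columnwise(W, n_bits=2):
    # One independent scale per output column (symmetric quantization).
    qmax = 2 ** (n_bits - 1) - 1
    scales = np.abs(W).max(axis=0) / qmax
    scales[scales == 0] = 1.0
    Wq = np.clip(np.round(W / scales), -qmax - 1, qmax).astype(np.int32)
    return Wq, scales

def quantize_partial_sums(psums, adc_bits=4):
    # Partial-sum quantization at the same column-wise granularity,
    # emulating one low-precision ADC per bitline column.
    qmax = 2 ** (adc_bits - 1) - 1
    ps_scales = np.abs(psums).max(axis=0) / qmax
    ps_scales[ps_scales == 0] = 1.0
    return np.round(psums / ps_scales).clip(-qmax - 1, qmax) * ps_scales

# Toy usage: 64 activations, 16 bitline columns.
X = np.random.randn(8, 64)
W = np.random.randn(64, 16)
Wq, w_scales = quantize_columnwise(W)
out = quantize_partial_sums(X @ Wq) * w_scales   # dequantize with column scales
```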

Jiyoon Kim, Kang Eun Jeon, Yulhwa Kim, Jong Hwan Ko (3/14/2025)

arXiv:2503.10207v1 Announce Type: new Abstract: Kyber, an IND-CCA2-secure lattice-based post-quantum key-encapsulation mechanism, is the winner of the first post-quantum cryptography standardization process of the US National Institute of Standards and Technology. In this work, we provide an efficient implementation of Kyber on the ESP32, a very popular microcontroller for Internet of Things applications. We hand-partition the Kyber algorithm to utilize the ESP32's dual-core architecture, which speeds up its execution by 1.21x (keygen), 1.22x (encaps), and 1.20x (decaps). We also explore further improvement by utilizing the ESP32's SHA and AES coprocessors, achieving a combined speed-up of 1.72x (keygen), 1.84x (encaps), and 1.69x (decaps).
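
The hand-partitioning idea can be pictured generically: run two independent halves of a keygen-style workload concurrently and join the results. The sketch below uses Python threads as a stand-in for the ESP32's two cores; expand_matrix and sample_noise are hypothetical callables chosen for illustration, not a real Kyber API, and the actual split in the paper may differ.

```python
from concurrent.futures import ThreadPoolExecutor

def keygen_two_core(expand_matrix, sample_noise):
    # Run two independent halves of key generation concurrently,
    # mirroring hand-partitioning across the ESP32's two cores.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(expand_matrix)   # "core 0": expand public matrix A
        fut_s = pool.submit(sample_noise)    # "core 1": sample secret/noise
        return fut_a.result(), fut_s.result()

# Toy usage with placeholder work.
A, s = keygen_two_core(lambda: "matrix A", lambda: "secret s")
```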

Fabian Segatz, Muhammad Ihsan Al Hafiz (3/14/2025)

arXiv:2503.07242v2 Announce Type: replace Abstract: Convolutional Neural Networks (CNNs) serve various applications with diverse performance and resource requirements. Model-aware CNN accelerators best address these diverse requirements. These accelerators usually combine multiple dedicated Compute Engines (CEs). The flexibility of Field-Programmable Gate Arrays (FPGAs) enables the design of such multiple-Compute-Engine (multiple-CE) accelerators. However, existing multiple-CE accelerators differ in how they arrange their CEs and distribute the FPGA resources and CNN operators among the CEs. The design space of multiple-CE accelerators comprises numerous such arrangements, which makes systematic identification of the best ones an open challenge. This paper proposes a multiple-CE accelerator analytical Cost Model (MCCM) and an evaluation methodology built around MCCM. The model and methodology streamline the expression of any multiple-CE accelerator and provide a fast evaluation of its performance and efficiency. MCCM is on the order of 100,000x faster than traditional synthesis-based evaluation and has an average accuracy above 90%. The paper presents three use cases of MCCM. The first is an end-to-end evaluation of state-of-the-art multiple-CE accelerators considering various metrics, CNN models, and resource budgets. The second is a fine-grained evaluation that helps identify performance bottlenecks of multiple-CE accelerators. The third demonstrates that MCCM's fast evaluation enables exploring the vast design space of multiple-CE accelerators. These use cases show that no unique CE arrangement achieves the best results across different metrics, CNN models, and resource budgets. They also show that fast evaluation enables design space exploration, resulting in accelerator designs that outperform state-of-the-art ones. MCCM is available at https://github.com/fqararyah/MCCM.
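
For intuition about what an analytical cost model of this kind computes, a toy version might bound a pipelined multiple-CE accelerator's steady-state latency by its slowest Compute Engine. The sketch below is a drastic simplification for illustration, not the actual MCCM formulation.

```python
def ce_latency_cycles(macs, pes, util=1.0):
    # Cycles one Compute Engine needs for its share of the CNN's MAC work.
    return macs / (pes * util)

def accelerator_latency(layer_macs, ce_pes):
    # Pipelined multiple-CE accelerator: steady-state throughput is set by
    # the slowest stage, so per-image latency is bounded by the max CE time.
    return max(ce_latency_cycles(m, p) for m, p in zip(layer_macs, ce_pes))

# Toy usage: three CEs with different PE budgets handling three layer groups.
print(accelerator_latency([1e9, 4e8, 2e8], [512, 256, 128]))
```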

Fareed Qararyah, Mohammad Ali Maleki, Pedro Trancoso (3/14/2025)

arXiv:2503.09676v1 Announce Type: cross Abstract: Research into the development of special-purpose computing architectures designed to solve quadratic unconstrained binary optimization (QUBO) problems has flourished in recent years. It has been demonstrated in the literature that such special-purpose solvers can outperform traditional CMOS architectures by orders of magnitude with respect to timing metrics on synthetic problems. However, they face challenges with constrained problems such as the quadratic assignment problem (QAP), where mapping to binary formulations such as QUBO introduces overhead and limits parallelism. In-memory computing (IMC) devices, such as memristor-based analog Ising machines, offer significant speedups and efficiency gains over traditional CPU-based solvers, particularly for combinatorial optimization problems. In this work, we present a novel local search heuristic designed for IMC hardware to tackle the QAP. Our approach enables massive parallelism, allowing full neighbourhoods to be evaluated simultaneously when making update decisions. We keep binary solutions feasible by selecting only local moves that lead to neighbouring feasible solutions, leveraging feasible-space search heuristics and the underlying structure of a given problem. Our approach is compatible with both digital computers and analog hardware. We demonstrate its effectiveness in CPU implementations by comparing it with state-of-the-art heuristics for solving the QAP.
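
To make the full-neighbourhood idea concrete, the sketch below evaluates every 2-swap of a permutation (each swap maps to a neighbouring feasible QAP assignment, so feasibility is preserved by construction) and returns the best move. On IMC hardware these evaluations would run in parallel; this sequential CPU version is an illustrative baseline, not the paper's heuristic.

```python
import numpy as np

def qap_cost(F, D, perm):
    # QAP objective: sum over i, j of F[i, j] * D[perm[i], perm[j]].
    return (F * D[np.ix_(perm, perm)]).sum()

def best_swap(F, D, perm):
    # Evaluate the full 2-swap neighbourhood and pick the best move.
    base = qap_cost(F, D, perm)
    best_delta, best_move = 0.0, None
    n = len(perm)
    for i in range(n):
        for j in range(i + 1, n):
            cand = perm.copy()
            cand[i], cand[j] = cand[j], cand[i]
            delta = qap_cost(F, D, cand) - base
            if delta < best_delta:
                best_delta, best_move = delta, (i, j)
    return best_move, best_delta

# Toy usage on a random 6-facility instance.
rng = np.random.default_rng(0)
F, D = rng.random((6, 6)), rng.random((6, 6))
print(best_swap(F, D, np.arange(6)))
```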

Haesol Im, Chan-Woo Yang, Moslem Noori, Dmitrii Dobrynin, Elisabetta Valiante, Giacomo Pedretti, Arne Heittmann, Thomas Van Vaerenbergh, Masoud Mohseni, John Paul Strachan, Dmitri Strukov, Ray Beausoleil, Ignacio Rozada (3/14/2025)

arXiv:2404.18407v2 Announce Type: replace Abstract: Existing physical-design watermarking for contemporary integrated circuit (IC) layouts encodes signatures without considering dense connections and design constraints, which can degrade the performance of the watermarked products. This paper presents ICMarks, a quality-preserving and robust watermarking framework for modern IC physical design. ICMarks embeds unique watermark signatures during the physical design's placement stage, thereby authenticating ownership of the IC layout. ICMarks's novelty lies in (i) strategically identifying a region of cells to watermark with minimal impact on layout performance and (ii) a two-level watermarking framework for augmented robustness against potential removal and forging attacks. Extensive evaluations on benchmarks of different design objectives and sizes validate that ICMarks incurs no degradation in wirelength or timing metrics while successfully proving ownership. Furthermore, we demonstrate ICMarks is robust against the two major watermarking attack categories, watermark removal and forging attacks; even if the adversaries have prior knowledge of the watermarking schemes, the signatures cannot be removed without significantly undermining the layout quality.

Ruisi Zhang, Rachel Selina Rajarathnam, David Z. Pan, Farinaz Koushanfar (3/14/2025)

arXiv:2503.10296v1 Announce Type: new Abstract: This paper discusses integration challenges and strategies for designing mobile robots, focusing on the task-driven, optimal selection of hardware and software to balance safety, efficiency, and minimal usage of resources such as cost, energy, computational requirements, and weight. We emphasize the interplay between perception and motion planning in decision-making by introducing the concept of occupancy queries to quantify the perception requirements of sampling-based motion planners. Sensor and algorithm performance are evaluated using False Negative Rates (FNR) and False Positive Rates (FPR) across various factors such as geometric relationships, object properties, sensor resolution, and environmental conditions. By integrating perception requirements with perception performance, an Integer Linear Programming (ILP) approach is proposed for efficient sensor and algorithm selection and placement. This forms the basis for a co-design optimization that includes the robot body, motion planner, perception pipeline, and computing unit. We refer to this framework for solving the co-design problem of mobile robots as CODEI, short for Co-design of Embodied Intelligence. A case study on developing an Autonomous Vehicle (AV) for urban scenarios provides actionable information for designers and shows that complex tasks escalate resource demands, with task performance affecting choices of the autonomy stack. The study demonstrates that resource prioritization influences sensor choice: cameras are preferred for cost-effective and lightweight designs, while lidar sensors are chosen for better energy and computational efficiency.
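
A minimal version of such a selection ILP can be written with an off-the-shelf solver: binary variables pick sensors, the objective minimizes total cost, and constraints require every occupancy query to be covered by at least one selected sensor. The sensor names, costs, and coverage sets below are invented for illustration and do not come from the paper.

```python
import pulp

# Hypothetical candidates: cost per sensor, and which occupancy queries it covers.
sensors = {"cam_front": 200, "lidar_roof": 900, "cam_rear": 200}
covers = {"cam_front": {"q1", "q2"},
          "lidar_roof": {"q1", "q2", "q3"},
          "cam_rear": {"q3"}}
queries = {"q1", "q2", "q3"}

prob = pulp.LpProblem("sensor_selection", pulp.LpMinimize)
x = {s: pulp.LpVariable(s, cat="Binary") for s in sensors}
prob += pulp.lpSum(sensors[s] * x[s] for s in sensors)      # minimize cost
for q in queries:                                           # cover every query
    prob += pulp.lpSum(x[s] for s in sensors if q in covers[s]) >= 1
prob.solve()
print([s for s in sensors if x[s].value() == 1])
```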

Dejan Milojevic, Gioele Zardini, Miriam Elser, Andrea Censi, Emilio Frazzoli (3/14/2025)

arXiv:2503.09975v1 Announce Type: new Abstract: Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resources. Despite the incorporation of FP8 in commercially available neural network accelerators, a comprehensive exposition of its underlying mechanisms, along with rigorous performance and accuracy evaluations, is still lacking. In this work, we contribute in three significant ways. First, we analyze the implementation details and quantization options associated with FP8 for inference on the Intel Gaudi AI accelerator. Second, we empirically quantify the throughput improvements afforded by the use of FP8 at both the operator level and in end-to-end scenarios. Third, we assess the accuracy impact of various FP8 quantization methods. Our experimental results indicate that the Intel Gaudi 2 accelerator consistently achieves high computational unit utilization, frequently exceeding 90% MFU, while incurring an accuracy degradation of less than 1%.
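
For intuition, FP8 E4M3 quantization can be coarsely emulated in software by scaling a tensor into range, rounding to a 4-bit significand (one implicit plus three stored mantissa bits), and clipping to the E4M3 maximum of 448. The sketch below ignores subnormals and special values and is not Gaudi's actual rounding logic.

```python
import numpy as np

def quantize_e4m3(x, scale=1.0):
    # Coarse FP8 E4M3 emulation: scale into range, keep a 4-bit significand,
    # clip to the E4M3 max normal value of 448, then rescale.
    y = np.asarray(x, dtype=np.float64) / scale
    m, e = np.frexp(y)                      # y = m * 2**e with 0.5 <= |m| < 1
    y = np.ldexp(np.round(m * 16) / 16, e)  # round mantissa to 4 bits
    return np.clip(y, -448.0, 448.0) * scale

# Toy usage: quantize a tensor with a per-tensor scale factor.
x = np.random.randn(4, 4) * 10
print(quantize_e4m3(x, scale=np.abs(x).max() / 448.0))
```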

Joonhyung Lee, Shmulik Markovich-Golan, Daniel Ohayon, Yair Hanani, Gunho Park, Byeongwook Kim, Asaf Karnieli, Uri Livne, Haihao Shen, Tai Huang, Se Jung Kwon, Dongsoo Lee (3/14/2025)

arXiv:2503.09650v1 Announce Type: new Abstract: With the advancement of Large Language Models (LLMs), the importance of accelerators that efficiently process LLM computations has been increasing. This paper discusses the necessity of LLM accelerators and provides a comprehensive analysis of the hardware and software characteristics of the main commercial LLM accelerators. Based on this analysis, we propose considerations for the development of next-generation LLM accelerators and suggest future research directions.

Sihyeong Park, Jemin Lee, Byung-Soo Kim, Seokhun Jeon (3/14/2025)

arXiv:2503.05116v1 Announce Type: new Abstract: Graph processing requires irregular, fine-grained random access patterns that are incompatible with contemporary off-chip memory architecture, leading to inefficient data access. This inefficiency makes graph processing an extremely memory-bound application. Because of this, existing graph processing accelerators typically employ a graph tiling-based or processing-in-memory (PIM) approach to relieve the memory bottleneck. In the tiling-based approach, a graph is split into chunks that fit within the on-chip cache to maximize data reuse. In the PIM approach, arithmetic units are placed within memory to perform operations such as reduction or atomic addition. However, both approaches have several limitations, especially when implemented on current memory standards (i.e., DDR). Because the access granularity provided by DDR is much larger than that of the graph vertex property data, much of the bandwidth and cache capacity are wasted. PIM is meant to alleviate such issues, but it is difficult to use in conjunction with the tiling-based approach, which is a significant disadvantage. Furthermore, placing arithmetic units inside a memory chip is expensive, making it impractical to support multiple types of operations. To address these limitations, we present Piccolo, an end-to-end efficient graph processing accelerator with fine-grained in-memory random scatter-gather. Instead of placing expensive arithmetic units in off-chip memory, Piccolo focuses on reducing off-chip traffic with a non-arithmetic function-in-memory for random scatter-gather. To fully benefit from in-memory scatter-gather, Piccolo redesigns the accelerator's cache and MHA so that it enjoys the advantages of both tiling and in-memory operations. Piccolo achieves a maximum speedup of 3.28x and a geometric mean speedup of 1.62x across extensive benchmarks.
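
The non-arithmetic scatter-gather idea can be pictured as word-granularity reads and writes served by memory, with all arithmetic kept on the accelerator side. The toy model below is conceptual, not Piccolo's hardware interface.

```python
import numpy as np

# Conceptual model: memory moves only the requested words (instead of full
# DDR-granularity bursts); any reduction stays on the accelerator side.

def mem_gather(props, ids):
    return props[ids]           # fetch only the requested vertex properties

def mem_scatter(props, ids, values):
    props[ids] = values         # write only the requested words

# Accelerator-side reduction on gathered values (toy vertex update).
props = np.arange(8, dtype=float)
vals = mem_gather(props, np.array([1, 3, 5]))
mem_scatter(props, np.array([0]), np.array([vals.sum()]))
print(props)
```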

Changmin Shin, Jaeyong Song, Hongsun Jang, Dogeun Kim, Jun Sung, Taehee Kwon, Jae Hyung Ju, Frank Liu, Yeonkyu Choi, Jinho Lee (3/10/2025)

arXiv:2502.20415v2 Announce Type: replace Abstract: Neuromorphic computing is a relatively new discipline of computer science in which the computation and memory principles of the biological brain are used to process information with networks of spiking neurons. These networks can be realized in both analog and digital form; for the latter, Field-Programmable Gate Arrays (FPGAs) are a frequent choice due to their inherent flexibility, which allows researchers to easily design hardware neuromorphic architectures (NMAs). Moreover, digital NMAs show good promise for simulating various spiking neural networks because of their inherent accuracy and resilience to noise, as opposed to analog implementations. This paper presents an overview of digital NMAs implemented on FPGAs, with the goal of providing useful references to various architectural design choices for researchers interested in digital neuromorphic systems. We present a taxonomy of NMAs that highlights groups of distinct architectural features, their advantages and disadvantages, and identify trends and predictions for the future of those architectures.

Wiktor J. Szczerek, Artur Podobas (3/10/2025)

arXiv:2307.05815v2 Announce Type: replace Abstract: The globalization of semiconductor supply chains has exposed Network-on-Chip (NoC) interconnects in System-on-Chip (SoC) architectures to critical security risks, including reverse engineering and IP theft. To address these threats, this work builds on two methodologies: ObNoCs [11], which obfuscates NoC topologies using programmable multiplexers, and POTENT [10], which enhances post-synthesis security against SAT-based attacks. These techniques ensure robust protection of NoC interconnects with minimal performance overhead. As the industry shifts to chiplet-based heterogeneous architectures, this research extends ObNoCs and POTENT to secure intra- and inter-chiplet interconnects. New challenges, such as safeguarding inter-chiplet communication and interposer design, are addressed through enhanced obfuscation, authentication, and encryption mechanisms. Experimental results demonstrate the practicality of these approaches for high-security applications, ensuring trust and reliability in monolithic and modular systems.

Dipal Halder (3/10/2025)

arXiv:2503.05197v1 Announce Type: new Abstract: Point clouds are increasingly important in intelligent applications, but frequent off-chip memory traffic in accelerators causes pipeline stalls and leads to high energy consumption. While conventional line buffer techniques can eliminate off-chip traffic, they cannot be directly applied to point clouds due to their inherent computation patterns. To address this, we introduce two techniques: compulsory splitting and deterministic termination, enabling fully-streaming processing. We further propose StreamGrid, a framework that integrates these techniques and automatically optimizes on-chip buffer sizes. Our evaluation shows StreamGrid reduces on-chip memory by 61.3% and energy consumption by 40.5% with marginal accuracy loss compared to the baselines without our techniques. Additionally, we achieve 10.0x speedup and 3.9x energy efficiency over state-of-the-art accelerators.

Yu Feng, Zheng Liu, Weikai Lin, Zihan Liu, Jingwen Leng, Minyi Guo, Zhezhi He, Jieru Zhao, Yuhao Zhu (3/10/2025)

arXiv:2503.04991v1 Announce Type: new Abstract: A Compute Express Link (CXL) switch allows memory extension via the PCIe physical layer to address the increasing demand for larger memory capacities in data centers. However, CXL-attached memory introduces 170ns to 400ns of memory latency. This becomes a significant performance bottleneck for applications that host data in persistent memory, as all updates, after traversing the CXL switch, must reach the persistent domain to ensure crash-consistent updates. We make a case for a persistent CXL switch that persists updates as soon as they reach the switch, significantly reducing the latency of persisting data. To enable this, we present a system-independent persistent buffer (PB) design that ensures data persistency at the CXL switch. Our PB design provides a 12% speedup, on average, over a volatile CXL switch. Our read forwarding optimization improves the speedup to 15%.
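
The persistent-buffer idea can be modeled as a switch that acknowledges a write as durable the moment it lands in a switch-side staging buffer, and serves reads of still-staged lines directly (the read forwarding optimization). The class below is a toy software model under those assumptions, not the paper's hardware design.

```python
class PersistentBufferSwitch:
    """Toy model of a persistent buffer (PB) at a CXL switch: writes are
    acknowledged as durable on arrival at the switch-side buffer instead
    of waiting for downstream persistent memory (assumed flush-backed)."""

    def __init__(self):
        self.buffer = {}   # staging area treated as part of the persistent domain
        self.pmem = {}     # downstream persistent memory

    def write(self, addr, data):
        self.buffer[addr] = data   # durable on arrival at the switch
        return "ack"               # caller proceeds without downstream latency

    def read(self, addr):
        # Read forwarding: serve from the buffer if the line is still staged.
        return self.buffer.get(addr, self.pmem.get(addr))

    def drain(self):
        # Background destage from the buffer to persistent memory.
        self.pmem.update(self.buffer)
        self.buffer.clear()
```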

Khan Shaikhul Hadi, Naveed Ul Mustafa, Mark Heinrich, Yan Solihin (3/10/2025)

arXiv:2503.04846v1 Announce Type: new Abstract: Fault injection attacks represent a class of threats that can compromise embedded systems across multiple layers of abstraction, such as system software, instruction set architecture (ISA), microarchitecture, and physical implementation. Early detection of these vulnerabilities and understanding of their root causes, along with their propagation from the physical layer to the system software, is critical to securing the cyberinfrastructure. This paper presents a comprehensive methodology for conducting controlled fault injection attacks at the pre-silicon level and analyzing the underlying system to root-cause the observed behavior. As the driving application, we use clock glitch attacks on AI/ML applications to induce critical misclassification. Our study aims to characterize and diagnose the impact of faults within the RISC-V instruction set and pipeline stages, while tracing fault propagation from the circuit level to the AI/ML application software. This analysis resulted in the discovery of a novel vulnerability through controlled clock glitch parameters, specifically targeting the RISC-V decode stage.
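
At its simplest, a pre-silicon fault-injection campaign of this kind can be approximated as corrupting an instruction word before it reaches the decode stage and observing how the fault propagates. The sketch below is a toy illustration of that fault model, not the authors' framework.

```python
import random

def glitch_instruction(instr_word, bit=None, width=32):
    # Toy model of a clock-glitch fault at the RISC-V decode stage: one bit
    # of the fetched 32-bit instruction word is corrupted before decoding.
    bit = random.randrange(width) if bit is None else bit
    return instr_word ^ (1 << bit)

# Usage: flip one bit of an ADDI encoding and observe how decode would change.
print(hex(glitch_instruction(0x00150513)))   # addi a0, a0, 1
```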

Arsalan Ali Malik, Harshvadan Mihir, Aydin Aysu (3/10/2025)

arXiv:2503.05290v1 Announce Type: new Abstract: Transformers are central to advances in artificial intelligence (AI), excelling in fields ranging from computer vision to natural language processing. Despite their success, their large parameter count and computational demands challenge efficient acceleration. To address these limitations, this paper proposes MatrixFlow, a novel co-designed system-accelerator architecture based on a loosely coupled systolic array, together with a new software mapping approach for efficient transformer execution. MatrixFlow is co-optimized via a novel dataflow-based matrix multiplication technique that reduces memory overhead. These innovations significantly improve data throughput, which is critical for handling the extensive computations required by transformers. We validate our approach through full-system simulation using gem5 across various BERT and ViT Transformer models featuring different data types, demonstrating significant application-wide speed-ups. Our method achieves up to a 22x improvement compared to a many-core CPU system, and outperforms the closest state-of-the-art loosely-coupled and tightly-coupled accelerators by over 5x and 8x, respectively.
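
The memory-traffic argument behind systolic execution can be illustrated with a tiled matrix multiply in which each output tile stays resident while operand tiles stream through. The sketch below (which assumes dimensions divisible by the tile size) is a software analogy for intuition, not MatrixFlow's actual dataflow.

```python
import numpy as np

def systolic_matmul(A, B, tile=4):
    # Toy tiled matmul mimicking an output-stationary systolic array: each
    # (tile x tile) output block stays resident in the "array" while operand
    # tiles stream through, cutting repeated memory traffic for C.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile))           # output block held in place
            for k in range(0, K, tile):
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

# Toy usage (dimensions chosen divisible by the tile size).
A, B = np.random.randn(8, 8), np.random.randn(8, 8)
assert np.allclose(systolic_matmul(A, B), A @ B)
```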

Qunyou Liu, Marina Zapater, David Atienza (3/10/2025)

arXiv:2407.12282v2 Announce Type: replace Abstract: Macro placement is a vital step in digital circuit design that defines the physical location of large collections of components, known as macros, on a 2D chip. Because key performance metrics of the chip are determined by the placement, optimizing it is crucial. Existing learning-based methods typically fall short because of their reliance on reinforcement learning (RL), which is slow and struggles to generalize, requiring online training on each new circuit. Instead, we train a diffusion model capable of placing new circuits zero-shot, using guided sampling in lieu of RL to optimize placement quality. To enable such models to train at scale, we designed a capable yet efficient architecture for the denoising model, and propose a novel algorithm to generate large synthetic datasets for pre-training. To allow zero-shot transfer to real circuits, we empirically study the design decisions of our dataset generation algorithm, and identify several key factors enabling generalization. When trained on our synthetic data, our models generate high-quality placements on unseen, realistic circuits, achieving competitive performance on placement benchmarks compared to state-of-the-art methods.
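
Guided sampling of this kind can be sketched as a reverse-diffusion step in which the model's denoised estimate is nudged down the gradient of a placement cost such as wirelength, replacing RL-based optimization. The step below is schematic (the denoiser, cost gradient, and noise schedule are stand-ins), not the authors' sampler.

```python
import numpy as np

def guided_step(x_t, denoise, cost_grad, t, noise_scale=0.0, guide_weight=0.1):
    # Schematic reverse-diffusion step with guidance: nudge the denoised
    # placement estimate toward lower cost before re-noising. Real samplers
    # interpolate between x0_hat and x_t according to the noise schedule.
    x0_hat = denoise(x_t, t)
    x0_hat = x0_hat - guide_weight * cost_grad(x0_hat)
    return x0_hat + noise_scale * np.random.randn(*x0_hat.shape)

# Toy usage: a "denoiser" that pulls macro centers inward, and a guide that
# penalizes spread (a crude stand-in for a wirelength cost).
x = np.random.randn(10, 2)                       # 10 macros, (x, y) positions
x = guided_step(x, lambda z, t: 0.9 * z, lambda z: 2 * z, t=0)
```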

Vint Lee, Minh Nguyen, Leena Elzeiny, Chun Deng, Pieter Abbeel, John Wawrzynek (3/10/2025)

arXiv:2501.11554v1 Announce Type: new Abstract: Egomotion estimation is crucial for applications such as autonomous navigation and robotics, where accurate and real-time motion tracking is required. However, traditional methods relying on inertial sensors are highly sensitive to external conditions and suffer from drift, leading to large inaccuracies over long distances. Vision-based methods, particularly those utilising event-based vision sensors, provide an efficient alternative by capturing data only when changes are perceived in the scene. This approach minimises power consumption while delivering high-speed, low-latency feedback. In this work, we propose a fully event-based pipeline for egomotion estimation that processes the event stream directly within the event-based domain. This method eliminates the need for frame-based intermediaries, allowing for low-latency and energy-efficient motion estimation. We construct a shallow spiking neural network using a synaptic gating mechanism to convert precise event timing into bursts of spikes. These spikes encode local optical flow velocities, and the network provides an event-based readout of egomotion. We evaluate the network's performance on a dedicated chip, demonstrating strong potential for low-latency, low-power motion estimation. Additionally, simulations of larger networks show that the system achieves state-of-the-art accuracy in egomotion estimation tasks with event-based cameras, making it a promising solution for real-time, power-constrained robotics applications.

Hugh Greatorex, Michele Mastella, Madison Cotteret, Ole Richter, Elisabetta Chicca (1/22/2025)

arXiv:2501.11286v1 Announce Type: new Abstract: The wide adoption and substantial computational resource requirements of attention-based Transformers have spurred the demand for efficient hardware accelerators. In contrast to digital accelerators, photonics is attracting growing interest due to its high energy efficiency and ultra-fast processing speeds. However, significant signal conversion overhead limits the performance of photonic-based accelerators. In this work, we propose HyAtten, a photonic-based attention accelerator with minimized signal conversion overhead. HyAtten incorporates a signal comparator to classify signals into two categories based on whether they can be processed by low-resolution converters. HyAtten integrates low-resolution converters to process all low-resolution signals, thereby boosting the parallelism of photonic computing. For signals requiring high-resolution conversion, HyAtten uses digital circuits instead of signal converters to reduce area and latency overhead. Compared to the state-of-the-art photonic-based Transformer accelerator, HyAtten achieves 9.8X performance/area and 2.2X energy-efficiency/area improvements.

Huize Li, Dan Chen, Tulika Mitra (1/22/2025)

arXiv:2501.11839v1 Announce Type: new Abstract: Automating analog and radio-frequency (RF) circuit design using machine learning (ML) significantly reduces the time and effort required for parameter optimization. This study explores supervised ML-based approaches for designing circuit parameters from performance specifications across various circuit types, including homogeneous and heterogeneous designs. By evaluating diverse ML models, from neural networks like transformers to traditional methods like random forests, we identify the best-performing models for each circuit. Our results show that simpler circuits, such as low-noise amplifiers, achieve exceptional accuracy with mean relative errors as low as 0.3% due to their linear parameter-performance relationships. In contrast, complex circuits, like power amplifiers and voltage-controlled oscillators, present challenges due to their non-linear interactions and larger design spaces. For heterogeneous circuits, our approach achieves an 88% reduction in errors with increased training data, with the receiver achieving a mean relative error as low as 0.23%, showcasing the scalability and accuracy of the proposed methodology. Additionally, we provide insights into model strengths, with transformers excelling in capturing non-linear mappings and k-nearest neighbors performing robustly in moderately linear parameter spaces, especially in heterogeneous circuits with larger datasets. This work establishes a foundation for extending ML-driven design automation, enabling more efficient and scalable circuit design workflows.
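
A minimal version of this supervised setup, mapping performance specifications to circuit parameters with a random forest, looks like the sketch below; the data is synthetic and the feature/target choices are illustrative stand-ins, not the paper's datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: specs (e.g. gain, noise figure, bandwidth) map to
# circuit parameters (e.g. transistor widths, bias currents). Illustrative only.
rng = np.random.default_rng(0)
specs = rng.random((1000, 3))
params = 1.0 + specs @ rng.random((3, 5))        # 5 parameters per circuit

Xtr, Xte, ytr, yte = train_test_split(specs, params, test_size=0.2,
                                      random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(Xtr, ytr)
rel_err = np.abs(model.predict(Xte) - yte) / np.abs(yte)
print(f"mean relative error: {rel_err.mean():.3%}")
```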

Asal Mehradfar, Xuzhe Zhao, Yue Niu, Sara Babakniya, Mahdi Alesheikh, Hamidreza Aghasi, Salman Avestimehr (1/22/2025)

arXiv:2501.10430v1 Announce Type: new Abstract: Aquaculture involves cultivating marine and freshwater organisms, and real-time monitoring of aquatic parameters is crucial in fish farming. This thesis proposes an IoT-based framework using sensors and Arduino for efficient monitoring and control of water quality. Different sensors, including pH, temperature, and turbidity, are placed in cultivating pond water, and each is connected to a common microcontroller board built on an Arduino Uno. The sensors read data from the water and store it as a CSV file in an IoT cloud named ThingSpeak through the Arduino microcontroller. In the experimental part, we collected data from 5 ponds of various sizes and environments. After getting the real-time data, we compared them with the standard reference values to decide which ponds are satisfactory for cultivating fish and which are not. After that, we labeled the data with 11 fish categories: katla, sing, prawn, rui, koi, pangas, tilapia, silver carp, karpio, magur, and shrimp. The data were then analyzed using 10 machine learning (ML) algorithms: J48, Random Forest, K-NN, K*, LMT, REPTree, JRIP, PART, Decision Table, and LogitBoost. Experimental evaluation showed that among the 5 ponds, only three were suitable for fish farming; only these 3 ponds satisfied the standard reference values of pH (6.5-8.5), temperature (16-24) °C, turbidity (below 10) NTU, conductivity (970-1825) µS/cm, and depth (1-4) m. Among the state-of-the-art machine learning algorithms, Random Forest achieved the best performance, with 94.42% accuracy, a 93.5% kappa statistic, and a 94.4% average TP rate. In addition, we calculated the BOD, COD, and DO for one scenario. This study includes details of the proposed IoT system's prototype hardware.
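
The pond-suitability decision reduces to checking each reading against the standard reference ranges quoted above. The snippet below encodes exactly those ranges; the example reading is invented for illustration.

```python
# Standard reference ranges quoted in the abstract.
RANGES = {
    "ph": (6.5, 8.5),
    "temp_c": (16, 24),
    "turbidity_ntu": (0, 10),
    "conductivity_us_cm": (970, 1825),
    "depth_m": (1, 4),
}

def pond_ok(reading):
    # A pond is suitable when every parameter falls within its reference range.
    return all(lo <= reading[k] <= hi for k, (lo, hi) in RANGES.items())

# Hypothetical reading for one pond.
print(pond_ok({"ph": 7.2, "temp_c": 21, "turbidity_ntu": 6,
               "conductivity_us_cm": 1200, "depth_m": 2.5}))
```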

Md. Monirul Islam (1/22/2025)