cs.MS

8 posts

arXiv:2501.00279v1 Announce Type: new Abstract: BLAS is a fundamental building block of advanced linear algebra libraries and many modern scientific computing applications. GPUs are known for their strong arithmetic computing capabilities and are highly suited for BLAS operations. However, porting code to GPUs often requires significant effort, especially for large, complex codes or legacy codes, even for BLAS-heavy applications. While various tools exist to automatically offload BLAS to GPUs, they are often impractical due to the high costs associated with mandatory data transfers. The advent of unified memory architectures in recent GPU designs, such as the NVIDIA Grace-Hopper, allows cache-coherent memory access across all types of memory for both CPU and GPU, potentially eliminating the bottlenecks faced in conventional architectures. This breakthrough paves the way for innovative application developments and porting strategies. Building on our preliminary work demonstrating the potential of automatic *gemm offload, this paper extends the framework to all level-3 BLAS operations and introduces SCILIB-Accel, a novel tool for automatic BLAS offload. SCILIB-Accel leverages the memory coherency in Grace-Hopper and introduces a Device First-Use data movement policy inspired by the OpenMP First-Touch approach in multi-socket CPU programming, minimizing CPU-GPU data transfers for typical scientific computing codes. Additionally, utilizing dynamic binary instrumentation, the tool intercepts BLAS symbols directly from a CPU binary, requiring no code modifications or recompilation. SCILIB-Accel has been evaluated using multiple quantum physics codes on up to a few hundred GPU nodes, yielding promising speedups. Notably, for the LSMS method in the MuST suite, a 3x speedup was achieved on Grace-Hopper compared to Grace-Grace.

Junjie Li (1/3/2025)
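To make the Device First-Use idea in the entry above concrete, here is a toy accounting model in Python. It is only a sketch under stated assumptions, not SCILIB-Accel's implementation: the names `FirstUseTracker`, `traffic_first_use`, and `traffic_copy_every_call` are hypothetical, buffers are identified by plain strings rather than memory pages, and the copy-every-call baseline ignores copying results back to the host.

```python
from dataclasses import dataclass, field


@dataclass
class FirstUseTracker:
    """Hypothetical model: a buffer moves to the device the first time
    an offloaded BLAS call uses it, and then stays there."""
    resident: set = field(default_factory=set)
    bytes_moved: int = 0

    def touch(self, buffer_id: str, nbytes: int) -> None:
        # Only the first use of a buffer triggers a host-to-device move.
        if buffer_id not in self.resident:
            self.resident.add(buffer_id)
            self.bytes_moved += nbytes


def traffic_first_use(calls):
    """Bytes moved when each buffer migrates once, on first device use."""
    tracker = FirstUseTracker()
    for operands in calls:
        for buffer_id, nbytes in operands:
            tracker.touch(buffer_id, nbytes)
    return tracker.bytes_moved


def traffic_copy_every_call(calls):
    """Bytes moved when every call copies all of its operands to the device."""
    return sum(nbytes for operands in calls for _, nbytes in operands)


MB = 1 << 20
# Ten gemm-like calls that reuse the same three 8 MB matrices (assumed workload).
calls = [[("A", 8 * MB), ("B", 8 * MB), ("C", 8 * MB)] for _ in range(10)]
print(traffic_first_use(calls) // MB, "MB vs", traffic_copy_every_call(calls) // MB, "MB")
```

In this model the benefit comes entirely from operand reuse across calls, which matches the abstract's claim that the policy suits typical scientific codes that call BLAS repeatedly on the same arrays.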

arXiv:2403.02237v2 Announce Type: replace-cross Abstract: We investigate two-variable hypergeometric series, namely the Appell $F_{1}$ and $F_{3}$ series, and obtain a comprehensive list of their analytic continuations, sufficient to cover the whole real $(x,y)$ plane except on their singular loci. We also derive analytic continuations of their 3-variable generalizations, the Lauricella $F_{D}^{(3)}$ series and the Lauricella-Saran $F_{S}^{(3)}$ series, leveraging the analytic continuations of $F_{1}$ and $F_{3}$, which ensures that the whole real $(x,y,z)$ space is covered, except on the singular loci of these functions. While these studies are motivated by the frequent occurrence of these multivariable hypergeometric functions in Feynman integral evaluation, the results can also be used wherever these functions appear in other branches of mathematical physics. To facilitate their practical use, we provide four packages: $\texttt{AppellF1.wl}$, $\texttt{AppellF3.wl}$, $\texttt{LauricellaFD.wl}$, and $\texttt{LauricellaSaranFS.wl}$ in $\textit{Mathematica}$. These packages are applicable for generic as well as non-generic values of parameters, with their use in the evaluation of Feynman integrals in mind. We explicitly present various physical applications of these packages in the context of Feynman integral evaluation and compare the results with those of other packages such as $\texttt{FIESTA}$. Upon applying the appropriate conventions for numerical evaluation, we find that the results obtained from our packages are consistent. Various $\textit{Mathematica}$ notebooks demonstrating different numerical results are also provided along with this paper.

Souvik Bera, Tanay Pathak (1/3/2025)
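For orientation on the Appell entry above: inside the region $|x|<1$, $|y|<1$ the Appell $F_{1}$ function is defined by the double series $\sum_{m,n\ge 0} \frac{(a)_{m+n}(b_1)_m(b_2)_n}{(c)_{m+n}\,m!\,n!}x^m y^n$. The Python sketch below sums a truncated version of that series. It is not the packages' method and does not perform any analytic continuation; the function name `appell_f1` and the truncation parameter `terms` are assumptions made for illustration.

```python
import math


def appell_f1(a, b1, b2, c, x, y, terms=60):
    """Truncated defining series of Appell F1, valid only for |x| < 1, |y| < 1.

    Terms are built from ratios of consecutive terms, which avoids
    evaluating large factorials and Pochhammer symbols directly.
    """
    if abs(x) >= 1 or abs(y) >= 1:
        raise ValueError("the defining series requires |x| < 1 and |y| < 1")
    total = 0.0
    row_start = 1.0  # term with index (m, n = 0)
    for m in range(terms):
        term = row_start
        for n in range(terms):
            total += term
            # ratio term(m, n + 1) / term(m, n)
            term *= (a + m + n) * (b2 + n) / ((c + m + n) * (n + 1)) * y
        # ratio term(m + 1, 0) / term(m, 0)
        row_start *= (a + m) * (b1 + m) / ((c + m) * (m + 1)) * x
    return total


# Check against the reduction F1(a; b1, b2; c; x, x) = 2F1(a, b1 + b2; c; x)
# and the closed form 2F1(1, 1; 2; x) = -ln(1 - x) / x.
x = 0.5
print(appell_f1(1.0, 0.5, 0.5, 2.0, x, x), -math.log(1 - x) / x)
```

Outside the unit square the series diverges, which is exactly where analytic continuations of the kind described in the abstract are needed.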

arXiv:2408.07843v2 Announce Type: replace Abstract: There is a continuing interest in using standard language constructs for accelerated computing in order to avoid (sometimes vendor-specific) external APIs. For Fortran codes, the $\texttt{do concurrent}$ (DC) loop has been successfully demonstrated on the NVIDIA platform. However, support for DC on other platforms has taken longer to implement. Recently, Intel has added DC GPU offload support to its compiler, as has HPE for AMD GPUs. In this paper, we explore the current portability of using DC across GPU vendors using the in-production solar surface flux evolution code, HipFT. We discuss implementation and compilation details, including when/where using directive APIs for data movement is needed/desired compared to using a unified memory system. The performance achieved on both data center and consumer platforms is shown.

Ronald M. Caplan, Miko M. Stulajter, Jon A. Linker, Jeff Larkin, Henry A. Gabb, Shiquan Su, Ivan Rodriguez, Zachary Tschirhart, Nicholas Malaya (12/25/2024)

arXiv:2412.17265v1 Announce Type: new Abstract: Xiaomai is an intelligent tutoring system (ITS) designed to help Chinese college students in learning advanced mathematics and preparing for the graduate school math entrance exam. This study investigates two distinctive features within Xiaomai: the incorporation of free-response questions with automatic feedback and the metacognitive element of reflecting on self-made errors.

Ying Fang, Bo He, Zhi Liu, Sannyuya Liu, Zhonghua Yan, Jianwen Sun (12/24/2024)

arXiv:2412.16161v1 Announce Type: new Abstract: In this short article I introduce the evitaicossa package which provides functionality for antiassociative algebras in the R programming language; it is available on CRAN at https://CRAN.R-project.org/package=evitaicossa.

Robin K. S. Hankin (12/24/2024)

arXiv:2410.10908v2 Announce Type: replace Abstract: Julia has been heralded as a potential successor to Python for scientific machine learning and numerical computing, boasting ergonomic and performance improvements. Since Julia's inception in 2012 and declaration of language goals in 2017, its ecosystem and language-level features have grown tremendously. In this paper, we take a modern look at Julia's features and ecosystem, assess the current state of the language, and discuss its viability and pitfalls as a replacement for Python as the de facto scientific machine learning language. We call for the community to address Julia's language-level issues that are preventing further adoption.

Edward Berman, Jacob Ginesin (12/23/2024)

arXiv:2310.19051v4 Announce Type: replace-cross Abstract: The Hurst exponent is a significant metric for characterizing time sequences with long-term memory, and it arises in many fields. The available methods for estimating the Hurst exponent can be categorized into time-domain and spectrum-domain methods. Although there are various estimation methods for the Hurst exponent, several shortcomings remain: firstly, the estimation methods are mathematics-oriented instead of engineering-oriented; secondly, the accuracy and effectiveness of the estimation algorithms are inadequately assessed; thirdly, the classification framework for the estimation methods is insufficient; and lastly, there is a lack of clear guidance for selecting a proper estimation method in practical data-analysis problems. The contributions of this paper lie in four aspects: 1) an optimal sequence partition method is proposed for designing the estimation algorithms for the Hurst exponent; 2) algorithmic pseudo-code is adopted to describe the estimation algorithms, which improves the understandability and usability of the estimation methods and reduces the difficulty of implementing them in programming languages; 3) a performance assessment is carried out for the typical estimation algorithms using ideal time sequences with a given Hurst exponent and practical time sequences captured in applications; 4) guidance for selecting proper algorithms for estimating the Hurst exponent is presented and discussed. It is expected that this systematic survey of available estimation algorithms will help users understand the principles, and that the assessment of the various estimation methods will help users select, implement, and apply the estimation algorithms of interest in practical situations.

Hong-Yan Zhang, Zhi-Qiang Feng, Si-Yu Feng, Yu Zhou (12/23/2024)
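As a concrete instance of a time-domain estimator of the kind the Hurst exponent entry above surveys, the sketch below implements classical rescaled-range (R/S) analysis in plain Python. It is an illustration only: it uses simple dyadic window sizes rather than the paper's optimal sequence partition method, and the names `rescaled_range`, `hurst_rs`, and `min_window` are assumptions, not taken from the paper.

```python
import math
import random


def rescaled_range(chunk):
    """R/S statistic of one window: range of cumulative deviations / std."""
    n = len(chunk)
    mean = sum(chunk) / n
    cum = lo = hi = 0.0
    for value in chunk:
        cum += value - mean
        lo = min(lo, cum)
        hi = max(hi, cum)
    std = math.sqrt(sum((value - mean) ** 2 for value in chunk) / n)
    return (hi - lo) / std if std > 0 else None


def hurst_rs(series, min_window=16):
    """Estimate H as the slope of log(mean R/S) against log(window size)."""
    n = len(series)
    xs, ys = [], []
    window = min_window
    while window <= n // 2:
        values = []
        for start in range(0, n - window + 1, window):
            rs = rescaled_range(series[start:start + window])
            if rs is not None:
                values.append(rs)
        if values:
            xs.append(math.log(window))
            ys.append(math.log(sum(values) / len(values)))
        window *= 2  # dyadic window sizes (an assumption of this sketch)
    if len(xs) < 2:
        raise ValueError("series too short for a slope estimate")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope_den = sum((x - mean_x) ** 2 for x in xs)
    return slope_num / slope_den


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(4096)]
print("estimated H for white noise:", hurst_rs(noise))
```

White noise has a theoretical Hurst exponent of 0.5, but R/S analysis is known to be biased upward on short windows, so the printed estimate should be read as approximate.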

arXiv:2412.15221v1 Announce Type: new Abstract: The gps2gtfs package addresses a critical need for converting raw Global Positioning System (GPS) trajectory data from public transit vehicles into the widely used GTFS (General Transit Feed Specification) format. This transformation enables various software applications to efficiently utilize real-time transit data for purposes such as tracking, scheduling, and arrival time prediction. Developed in Python, gps2gtfs employs techniques like geo-buffer mapping, parallel processing, and data filtering to manage challenges associated with raw GPS data, including high volume, discontinuities, and localization errors. This open-source package, available on GitHub and PyPI, enhances the development of intelligent transportation solutions and fosters improved public transit systems globally.

Shiveswarran Ratneswaran, Uthayasanker Thayasivam, Sivakumar Thillaiambalam (12/23/2024)
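The geo-buffer mapping mentioned in the gps2gtfs entry above can be pictured as matching each GPS fix to a transit stop when the fix falls inside a circular buffer around that stop. The Python sketch below shows one way to do this with the haversine distance. It does not use the gps2gtfs API; the function names, the 50 m default radius, and the sample coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def nearest_stop_in_buffer(point, stops, radius_m=50.0):
    """Return the id of the closest stop within radius_m of point, else None."""
    best_id, best_dist = None, radius_m
    for stop_id, (lat, lon) in stops.items():
        dist = haversine_m(point[0], point[1], lat, lon)
        if dist <= best_dist:
            best_id, best_dist = stop_id, dist
    return best_id


# Hypothetical stops and GPS fixes, for illustration only.
stops = {"S1": (6.9271, 79.8612), "S2": (6.9371, 79.8612)}
print(nearest_stop_in_buffer((6.9272, 79.8612), stops))  # fix about 11 m from S1
print(nearest_stop_in_buffer((6.9321, 79.8612), stops))  # fix far from both stops
```

A linear scan over all stops is fine for a sketch; for the high data volumes the abstract mentions, a spatial index would be the more usual choice.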