cs.DL

19 posts

arXiv:2501.06656v1 Announce Type: new Abstract: For analysis of bibliographic data, we can obtain from bibliographic databases the corresponding collection of bibliographic networks. Recently OpenAlex, a new open-access bibliographic database, became available. We present OpenAlex2Pajek, an R package for converting OpenAlex data into a collection of Pajek's networks. For an illustration, we created a temporal weighted network describing the co-authorship between world countries for years from 1990 to 2023. We present some analyses of this network.

Vladimir Batagelj 1/14/2025

arXiv:2409.11065v2 Announce Type: replace-cross Abstract: Digital research data management is increasingly integrated across universities and research institutions, addressing the handling of research data throughout its lifecycle according to the FAIR data principles (Findable, Accessible, Interoperable, Reusable). Recent emphasis on the semantic and interlinking aspects of research data, e.g., by using ontologies and knowledge graphs, further enhances findability and reusability. This work presents a framework for creating and maintaining a knowledge graph specifically for low-temperature plasma (LTP) science and technology. The framework leverages a domain-specific ontology called Plasma-O, along with the VIVO software as a platform for semantic information management in LTP research. While some research fields are already prepared to use ontologies and knowledge graphs for information management, their application in LTP research is nascent. This work aims to bridge this gap by providing a framework that not only improves research data management but also fosters community participation in building the domain-specific ontology and knowledge graph based on the published materials. The results may also support other research fields in the practical use of knowledge graphs for semantic information management.

Ihda Chaerony Siffa, Robert Wagner, Laura Vilardell Scholten, Markus M. Becker 1/14/2025

arXiv:2501.07267v1 Announce Type: new Abstract: Scientific team dynamics are critical in determining the nature and impact of research outputs. However, existing methods for classifying author roles based on self-reports and clustering lack comprehensive contextual analysis of contributions. Thus, we present a transformative approach to classifying author roles in scientific teams using advanced large language models (LLMs), which offers a more refined analysis compared to traditional clustering methods. Specifically, we seek to complement and enhance these traditional methods by utilizing open source and proprietary LLMs, such as GPT-4, Llama3 70B, Llama2 70B, and Mistral 7x8B, for role classification. Utilizing few-shot prompting, we categorize author roles and demonstrate that GPT-4 outperforms other models across multiple categories, surpassing traditional approaches such as XGBoost and BERT. Our methodology also includes building a predictive deep learning model using 10 features. By training this model on a dataset derived from the OpenAlex database, which provides detailed metadata on academic publications -- such as author-publication history, author affiliation, research topics, and citation counts -- we achieve an F1 score of 0.76, demonstrating robust classification of author roles.

Wonduk Seo, Yi Bu 1/14/2025

arXiv:2501.06959v1 Announce Type: new Abstract: Music source separation demixes a piece of music into its individual sound sources (vocals, percussion, melodic instruments, etc.), a task with no simple mathematical solution. It requires deep learning methods involving training on large datasets of isolated music stems. The most commonly available datasets are made from commercial Western music, limiting the models' applications to non-Western genres like Carnatic music. Carnatic music is a live tradition, with the available multi-track recordings containing overlapping sounds and bleeds between the sources. This poses a challenge to commercially available source separation models like Spleeter and Hybrid Demucs. In this work, we introduce 'Sanidha', the first open-source novel dataset for Carnatic music, offering studio-quality, multi-track recordings with minimal to no overlap or bleed. Along with the audio files, we provide high-definition videos of the artists' performances. Additionally, we fine-tuned Spleeter, one of the most commonly used source separation models, on our dataset and observed improved SDR performance compared to fine-tuning on a pre-existing Carnatic multi-track dataset. The outputs of the fine-tuned model with 'Sanidha' are evaluated through a listening study.

Venkatakrishnan Vaidyanathapuram Krishnan, Noel Alben, Anish Nair, Nathaniel Condit-Schultz 1/14/2025

arXiv:2501.06956v1 Announce Type: new Abstract: In the rapidly evolving landscape of technological innovation, safeguarding intellectual property rights through patents is crucial for fostering progress and stimulating research and development investments. This report introduces a ground-breaking Patent Novelty Assessment and Claim Generation System, meticulously crafted to dissect the inventive aspects of intellectual property and simplify access to extensive patent claim data. Addressing a crucial gap in academic institutions, our system provides college students and researchers with an intuitive platform to navigate and grasp the intricacies of patent claims, particularly tailored for the nuances of Chinese patents. Unlike conventional analysis systems, our initiative harnesses a proprietary Chinese API to ensure unparalleled precision and relevance. The primary challenge lies in the complexity of accessing and comprehending diverse patent claims, inhibiting effective innovation upon existing ideas. Our solution aims to overcome these barriers by offering a bespoke approach that seamlessly retrieves comprehensive claim information, finely tuned to the specifics of the Chinese patent landscape. By equipping users with efficient access to comprehensive patent claim information, our transformative platform seeks to ignite informed exploration and innovation in the ever-evolving domain of intellectual property. Its envisioned impact transcends individual colleges, nurturing an environment conducive to research and development while deepening the understanding of patented concepts within the academic community.

Kapil Kashyap, Sean Fargose, Gandhar Dhonde, Aditya Mishra 1/14/2025

arXiv:2501.06557v1 Announce Type: new Abstract: Spoken language datasets are vital for advancing linguistic research, Natural Language Processing, and speech technology. However, resources dedicated to Italian, a linguistically rich and diverse Romance language, remain underexplored compared to major languages like English or Mandarin. This survey provides a comprehensive analysis of 66 spoken Italian datasets, highlighting their characteristics, methodologies, and applications. The datasets are categorized by speech type, source and context, and demographic and linguistic features, with a focus on their utility in fields such as Automatic Speech Recognition, emotion detection, and education. Challenges related to dataset scarcity, representativeness, and accessibility are discussed alongside recommendations for enhancing dataset creation and utilization. The full dataset inventory is publicly accessible via GitHub and archived on Zenodo, serving as a valuable resource for researchers and developers. By addressing current gaps and proposing future directions, this work aims to support the advancement of Italian speech technologies and linguistic research.

Marco Giordano, Claudia Rinaldi 1/14/2025

arXiv:2501.06843v1 Announce Type: new Abstract: The Global Research Infrastructure (GRI) is made up of the repositories and organizations that provide persistent identifiers (PIDs) and metadata for many kinds of research objects and connect these objects to funders, research institutions, researchers, and one another using PIDs. The INFORMATE Project has combined three data sources to focus on understanding how the global research infrastructure might help the US National Science Foundation (NSF) and other federal agencies identify and characterize the impact of their support. In this paper we present INFORMATE observations of three data systems. The NSF Award database represents NSF funding while the NSF Public Access Repository (PAR) and CHORUS, as a proxy for the GRI, represent two different views of the results of that funding. We compare the first at the level of awards and the second two at the level of published research articles. Our findings demonstrate that CHORUS datasets include significantly more NSF awards and more related papers than does PAR. Our findings also suggest that time plays a significant role in the inclusion of award metadata across the sources analyzed. Data in those sources travel very different journeys, each presenting different obstacles to metadata completeness and suggesting necessary actions on the parts of authors and publishers to ensure that publication and funding metadata are captured. We discuss these actions, as well as implications our findings have for emergent technologies such as artificial intelligence and natural language processing.

Jamaica Jones, Ted Habermann 1/14/2025

arXiv:2501.06833v1 Announce Type: new Abstract: In English literature, the 19th century witnessed a significant transition in styles, themes, and genres. Consequently, the novels from this period display remarkable diversity. This paper explores these variations by examining the evolution of term usage in 19th century English novels through the lens of information retrieval. By applying a query expansion-based approach to a decade-segmented collection of fiction from the British Library, we examine how related terms vary over time. Our analysis employs multiple standard metrics including Kendall's tau, Jaccard similarity, and Jensen-Shannon divergence to assess overlaps and shifts in expanded query term sets. Our results indicate a significant degree of divergence in the related terms across decades as selected by the query expansion technique, suggesting substantial linguistic and conceptual changes throughout the 19th century novels.

Suchana Datta, Dwaipayan Roy, Derek Greene, Gerardine Meaney 1/14/2025
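
As a rough illustration of two of the overlap metrics named in the abstract above (Jaccard similarity and Jensen-Shannon divergence), the sketch below compares two expanded-term sets. It is not the authors' code: the function names, the term-to-weight dictionaries, and the example terms are all hypothetical, and the base-2 logarithm is an assumption chosen so that the divergence falls between 0 and 1.

```python
import math

def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical by convention.
    return len(a & b) / len(union) if union else 1.0

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two term->weight dicts.

    Weights are normalized to probabilities first; both dicts must have
    a positive total weight.
    """
    vocab = set(p) | set(q)
    p_total, q_total = sum(p.values()), sum(q.values())
    # Mixture distribution m = (p + q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) / p_total + q.get(t, 0.0) / q_total)
         for t in vocab}

    def kl_to_mixture(x, x_total):
        total = 0.0
        for t in vocab:
            px = x.get(t, 0.0) / x_total
            if px > 0:
                total += px * math.log2(px / m[t])
        return total

    return 0.5 * kl_to_mixture(p, p_total) + 0.5 * kl_to_mixture(q, q_total)

# Hypothetical expanded-term weights for the same query in two decades.
terms_1810s = {"carriage": 3.0, "estate": 2.0, "parish": 1.0}
terms_1890s = {"railway": 3.0, "estate": 1.0, "factory": 2.0}

print(jaccard(terms_1810s, terms_1890s))
print(js_divergence(terms_1810s, terms_1890s))
```

A low Jaccard score together with a high divergence would correspond to the kind of shift across decades that the abstract describes.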

arXiv:2501.03771v1 Announce Type: new Abstract: We report evidence of a new set of sneaked references discovered in the scientific literature. Sneaked references are references registered in the metadata of publications without being listed in the reference section or in the full text of the actual publications where they ought to be found. We document here 80,205 references sneaked into the metadata of the International Journal of Innovative Science and Research Technology (IJISRT). These sneaked references are registered with Crossref and all cite -- thus benefit -- this same journal. Using this dataset, we evaluate three different methods to automatically identify sneaked references. These methods compare reference lists registered with Crossref against the full text or the reference lists extracted from PDF files. In addition, we report attempts to scale the search for sneaked references to the scholarly literature.

Lonni Besançon, Guillaume Cabanac, Cyril Labbé, Alexander Magazinov, Jules di Scala, Dominika Tkaczyk, Kathryn Weber-Boer 1/8/2025
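
The comparison described in the abstract above, registered reference lists checked against what appears in the paper itself, can be sketched as a set difference over normalized DOIs. This is a minimal sketch under stated assumptions, not one of the three methods the authors evaluate: it assumes both lists are already available as DOI strings, and the function names and example DOIs are hypothetical.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def sneaked_references(registered, extracted):
    """Return DOIs present in registered metadata but absent from the paper.

    registered: DOIs listed in the publication's registered metadata
                (hypothetical input).
    extracted:  DOIs found in the paper's own reference list or full text
                (hypothetical input).
    """
    in_paper = {normalize_doi(d) for d in extracted}
    in_metadata = {normalize_doi(d) for d in registered}
    return sorted(in_metadata - in_paper)

# Made-up example: one registered reference never appears in the paper.
registered = ["10.1000/Example.1", "https://doi.org/10.1000/example.2"]
extracted = ["doi:10.1000/example.1"]
print(sneaked_references(registered, extracted))
```

In practice many references lack DOIs, so a real pipeline would likely also need fuzzy matching on titles and authors, which this sketch does not attempt.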

arXiv:2501.03632v1 Announce Type: new Abstract: Data are at the heart of electronic resource management in academic libraries. Assessing the usage data of electronic resources has become a prevalent approach to demonstrate the value of digital collections, justify library expenditures, and gain insights into how users interact with library materials. This study analyzes the usage statistics of electronic books (ebooks) generated locally by the OpenURL link resolver in an academic library, and statistics collected by platform vendors based on Release 5 of the Counting Online Usage of Networked Electronic Resources (COUNTER R5). Three content provider platforms (Cambridge Core, EBSCOhost, and ScienceDirect) were analyzed as data sources. The COUNTER and link resolver statistics were examined to determine the degree of association between these two metrics. The Spearman correlation coefficient was moderate (0.561 < rs < 0.678) and statistically significant (p < .01). This suggests that these metrics capture different aspects of the usage of ebooks in different contexts. Other factors, such as the types of access to electronic resources and the units of content delivered, were also examined. The study concludes with a discussion regarding the scope and limitations of link resolver and COUNTER R5 as library metrics for measuring the usage of ebooks.

Mercedes Echeverria, Yacelli Bustamante 1/8/2025

arXiv:2501.00564v1 Announce Type: cross Abstract: Thermoelectric materials provide a sustainable way to convert waste heat into electricity. However, data-driven discovery and optimization of these materials are challenging because of the lack of a reliable database. Here we developed a comprehensive database of 7,123 thermoelectric compounds, containing key information such as chemical composition, structural detail, Seebeck coefficient, electrical and thermal conductivity, power factor, and figure of merit (ZT). We used the GPTArticleExtractor workflow, powered by large language models (LLMs), to extract and curate data automatically from the scientific literature published in Elsevier journals. This process enabled the creation of a structured database that addresses the challenges of manual data collection. The open access database could stimulate data-driven research and advance thermoelectric material analysis and discovery.

Suman Itani, Yibo Zhang, Jiadong Zang 1/3/2025

arXiv:2501.00473v1 Announce Type: new Abstract: Despite enormous efforts devoted to understanding the characteristics and impacts of retracted papers, little is known about the mechanisms underlying the dynamics of their harm and its propagation. Here, we propose a citation-based framework to quantify the harm caused by retracted papers, aiming to uncover why their harm persists and spreads so widely. We uncover an "attention escape" mechanism, wherein retracted papers postpone significant harm, more prominently affect indirectly citing papers, and inflict greater harm on citations in journals with an impact factor less than 10. This mechanism allows retracted papers to inflict harm outside the attention of authors and publishers, thereby evading their intervention. This study deepens understanding of the harm caused by retracted papers, emphasizes the need to activate and enhance the attention of authors and publishers, and offers new insights and a foundation for strategies to mitigate their harm and prevent its spread.

Yunyou Huang, Jiahui Zhao, Dandan Cui, Zhengxin Yang, Bingjie Xia, Qi Liang, Wenjing Liu, Li Ma, Suqin Tang, Tianyong Hao, Zhifei Zhang, Wanling Gao, Jianfeng Zhan 1/3/2025

arXiv:2501.00367v1 Announce Type: new Abstract: This paper investigates the performance of several representative large models in the tasks of literature recommendation and explores potential biases in research exposure. The results indicate that LLMs' overall recommendation accuracy remains limited and that the models tend to recommend literature with greater citation counts, later publication dates, and larger author teams. Yet, in scholar recommendation tasks, there is no evidence that LLMs disproportionately recommend male, white, or developed-country authors, contrasting with patterns of known human biases.

Yifan Tian, Yixin Liu, Yi Bu, Jiqun Liu 1/3/2025

arXiv:2501.00939v1 Announce Type: new Abstract: The Web has drastically simplified our access to knowledge and learning, and fact-checking online resources has become a part of our daily routine. Studying online knowledge consumption is thus critical for understanding human behavior and informing the design of future platforms. In this Chapter, we approach this subject by describing the navigation patterns of the readers of Wikipedia, the world's largest platform for open knowledge. We provide a comprehensive overview of what is known about the three steps that characterize navigation on Wikipedia: (1) how readers reach the platform, (2) how readers navigate the platform, and (3) how readers leave the platform. Finally, we discuss open problems and opportunities for future research in this field.

Tiziano Piccardi, Robert West 1/3/2025

arXiv:2412.18063v1 Announce Type: new Abstract: This paper introduces LMRPA, a novel Large Model-Driven Robotic Process Automation (RPA) model designed to greatly improve the efficiency and speed of Optical Character Recognition (OCR) tasks. Traditional RPA platforms often suffer from performance bottlenecks when handling high-volume repetitive processes like OCR, leading to a less efficient and more time-consuming process. LMRPA allows the integration of Large Language Models (LLMs) to improve the accuracy and readability of extracted text, overcoming the challenges posed by ambiguous characters and complex text structures. Extensive benchmarks were conducted comparing LMRPA to leading RPA platforms, including UiPath and Automation Anywhere, using OCR engines like Tesseract and DocTR. The results show that LMRPA achieves superior performance, cutting processing times by up to 52%. For instance, in Batch 2 of the Tesseract OCR task, LMRPA completed the process in 9.8 seconds, whereas UiPath finished in 18.1 seconds and Automation Anywhere finished in 18.7 seconds. Similar improvements were observed with DocTR, where LMRPA outperformed other automation tools conducting the same process by completing tasks in 12.7 seconds, while competitors took over 20 seconds to do the same. These findings highlight the potential of LMRPA to revolutionize OCR-driven automation processes, offering a more efficient and effective alternative solution to the existing state-of-the-art RPA models.

Osama Hosam Abdellaif, Abdelrahman Nader, Ali Hamdi 12/25/2024

arXiv:2412.18100v1 Announce Type: new Abstract: The rapid growth of scientific techniques and knowledge is reflected in the exponential increase in new patents filed annually. While these patents drive innovation, they also present a significant burden for researchers and engineers, especially newcomers. To avoid the tedious work of navigating a vast and complex landscape to identify trends and breakthroughs, researchers urgently need efficient tools to summarize, evaluate, and contextualize patents, revealing their innovative contributions and underlying scientific principles. To address this need, we present EvoPat, a multi-LLM-based patent agent designed to assist users in analyzing patents through Retrieval-Augmented Generation (RAG) and advanced search strategies. EvoPat leverages multiple Large Language Models (LLMs), each performing specialized roles such as planning, identifying innovations, and conducting comparative evaluations. The system integrates data from local databases, including patents, literature, product catalogs, and company repositories, and online searches to provide up-to-date insights. The system can also automatically collect information not included in the original database. Through extensive testing in the natural language processing (NLP) domain, we demonstrate that EvoPat outperforms GPT-4 in tasks such as patent summarization, comparative analysis, and technical evaluation. EvoPat represents a significant step toward creating AI-powered tools that empower researchers and engineers to efficiently navigate the complexities of the patent landscape.

Suyuan Wang, Xueqian Yin, Menghao Wang, Ruofeng Guo, Kai Nan 12/25/2024

arXiv:2407.13993v3 Announce Type: replace Abstract: This paper introduces LLAssist, an open-source tool designed to streamline literature reviews in academic research. In an era of exponential growth in scientific publications, researchers face mounting challenges in efficiently processing vast volumes of literature. LLAssist addresses this issue by leveraging Large Language Models (LLMs) and Natural Language Processing (NLP) techniques to automate key aspects of the review process. Specifically, it extracts important information from research articles and evaluates their relevance to user-defined research questions. The goal of LLAssist is to significantly reduce the time and effort required for comprehensive literature reviews, allowing researchers to focus more on analyzing and synthesizing information rather than on initial screening tasks. By automating parts of the literature review workflow, LLAssist aims to help researchers manage the growing volume of academic publications more efficiently.

Christoforus Yoga Haryanto 12/23/2024

arXiv:2412.15831v1 Announce Type: new Abstract: Questions within surveys, called survey items, are used in the social sciences to study latent concepts, such as the factors influencing life satisfaction. Instead of using explicit citations, researchers paraphrase the content of the survey items they use in-text. However, this makes it challenging to find survey items of interest when comparing related work. Automatically parsing and linking these implicit mentions to survey items in a knowledge base can provide more fine-grained references. We model this task, called Survey Item Linking (SIL), in two stages: mention detection and entity disambiguation. Due to an imprecise definition of the task, existing datasets used for evaluating the performance for SIL are too small and of low quality. We argue that latent concepts and survey item mentions should be differentiated. To this end, we create a high-quality and richly annotated dataset consisting of 20,454 English and German sentences. By benchmarking deep learning systems for each of the two stages independently and sequentially, we demonstrate that the task is feasible, but observe that errors propagate from the first stage, leading to a lower overall task performance. Moreover, mentions that require the context of multiple sentences are more challenging to identify for models in the first stage. Modeling the entire context of a document and combining the two stages into an end-to-end system could mitigate these problems in future work, and errors could additionally be reduced by collecting more diverse data and by improving the quality of the knowledge base. The data and code are available at https://github.com/e-tornike/SIL.

Tornike Tsereteli, Daniel Ruffinelli, Simone Paolo Ponzetto 12/23/2024

arXiv:2412.15249v1 Announce Type: new Abstract: Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write, especially due to the recent influx of research papers. This paper explores the zero-shot abilities of recent Large Language Models (LLMs) in assisting with the writing of literature reviews based on an abstract. We decompose the task into two components: 1. Retrieving related works given a query abstract, and 2. Writing a literature review based on the retrieved results. We analyze how effective LLMs are for both components. For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper and then retrieves potentially relevant papers by querying an external knowledge base. Additionally, we study a prompting-based re-ranking mechanism with attribution and show that re-ranking doubles the normalized recall compared to naive search methods, while providing insights into the LLM's decision-making process. In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review. To evaluate different LLM-based literature review methods, we create test sets from arXiv papers using a protocol designed for rolling use with newly released LLMs to avoid test set contamination in zero-shot evaluations. We release this evaluation protocol to promote additional research and development in this regard. Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into smaller components of retrieval and planning. Further, we demonstrate that our planning-based approach achieves higher-quality reviews by minimizing hallucinated references in the generated review by 18-26% compared to existing simpler LLM-based generation methods.

Shubham Agarwal, Gaurav Sahu, Abhay Puri, Issam H. Laradji, Krishnamurthy DJ Dvijotham, Jason Stanley, Laurent Charlin, Christopher Pal 12/23/2024