cs.SI
93 postsarXiv:2501.06721v1 Announce Type: new Abstract: Link prediction is a fundamental problem in graph theory with diverse applications, including recommender systems, community detection, and identifying spurious connections. While feature-based methods achieve high accuracy, their reliance on node attributes limits their applicability in featureless graphs. For such graphs, structure-based approaches, including common neighbor-based and degree-dependent methods, are commonly employed. However, the effectiveness of these methods depends on graph density, with common neighbor-based algorithms performing well in dense graphs and degree-dependent methods being more suitable for sparse or tree-like graphs. Despite this, the literature lacks a clear criterion to distinguish between dense and sparse graphs. This paper introduces the average clustering coefficient as a criterion for assessing graph density to assist with the choice of link prediction algorithms. To address the scarcity of datasets for empirical analysis, we propose a novel graph generation method based on the Barabasi-Albert model, which enables controlled variation of graph density while preserving structural heterogeneity. Through comprehensive experiments on synthetic and real-world datasets, we establish an empirical boundary for the average clustering coefficient that facilitates the selection of effective link prediction techniques.
arXiv:2406.17918v4 Announce Type: replace Abstract: In our recent research, we have developed a framework called GraphSnapShot, which has been proven an useful tool for graph learning acceleration. GraphSnapShot is a framework for fast cache, storage, retrieval and computation for graph learning. It can quickly store and update the local topology of graph structure and allows us to track patterns in the structure of graph networks, just like take snapshots of the graphs. In experiments, GraphSnapShot shows efficiency, it can achieve up to 30% training acceleration and 73% memory reduction for lossless graph ML training compared to current baselines such as dgl.This technique is particular useful for large dynamic graph learning tasks such as social media analysis and recommendation systems to process complex relationships between entities. The code for GraphSnapShot is publicly available at https://github.com/NoakLiu/GraphSnapShot.
arXiv:2409.18976v2 Announce Type: replace Abstract: The proliferation of complex online media has accelerated the process of ideology formation, influenced by stakeholders through advertising channels. The media channels, which vary in cost and effectiveness, present a dilemma in prioritizing optimal fund allocation. There are technical challenges in describing the optimal budget allocation between channels over time, which involves defining the finite vector structure of controls on the chart. To enhance marketing productivity, it's crucial to determine how to distribute a budget across all channels to maximize business outcomes like revenue and ROI. Therefore, the strategy for media budget allocation is primarily an exercise focused on cost and achieving goals, by identifying a specific framework for a media program. Numerous researchers optimize the achievement and frequency of media selection models to aid superior planning decisions amid complexity and vast information availability. In this study, we present a planning model using the media mix model for advertising construction campaigns. Additionally, a decision-making strategy centered on FMEA identifies and prioritizes financial risk factors of the media system in companies. Despite some limitations, this research proposes a decision-making approach based on Z-number theory. To address the drawbacks of the RPN score, the suggested decision-making methodology integrates Z-SWARA and Z-WASPAS techniques with the FMEA method.
arXiv:2501.07267v1 Announce Type: new Abstract: Scientific team dynamics are critical in determining the nature and impact of research outputs. However, existing methods for classifying author roles based on self-reports and clustering lack comprehensive contextual analysis of contributions. Thus, we present a transformative approach to classifying author roles in scientific teams using advanced large language models (LLMs), which offers a more refined analysis compared to traditional clustering methods. Specifically, we seek to complement and enhance these traditional methods by utilizing open source and proprietary LLMs, such as GPT-4, Llama3 70B, Llama2 70B, and Mistral 7x8B, for role classification. Utilizing few-shot prompting, we categorize author roles and demonstrate that GPT-4 outperforms other models across multiple categories, surpassing traditional approaches such as XGBoost and BERT. Our methodology also includes building a predictive deep learning model using 10 features. By training this model on a dataset derived from the OpenAlex database, which provides detailed metadata on academic publications -- such as author-publication history, author affiliation, research topics, and citation counts -- we achieve an F1 score of 0.76, demonstrating robust classification of author roles.
arXiv:2309.13305v2 Announce Type: replace Abstract: Online social networks are major platforms for disseminating both real and fake news. Many users, intentionally or unintentionally, spread harmful content, fake news, and rumors in fields such as politics and business. Consequently, numerous studies have been conducted in recent years to assess user credibility. A significant shortcoming of most existing methods is that they categorize users as either real or fake. However, in real-world applications, it is often more desirable to consider several levels of user credibility. Another limitation is that existing approaches only utilize a portion of important features, which reduces their performance. In this paper, due to the lack of an appropriate dataset for multilevel user credibility assessment, we first design a method to collect data suitable for assessing credibility at multiple levels. Then, we develop the MultiCred model, which places users at one of several levels of credibility based on a rich and diverse set of features extracted from users' profiles, tweets, and comments. MultiCred leverages deep language models to analyze textual data and deep neural models to process non-textual features. Our extensive experiments reveal that MultiCred significantly outperforms existing approaches in terms of several accuracy measures.
arXiv:2406.00344v2 Announce Type: replace Abstract: Bipartite graphs are ubiquitous in many domains, e.g., e-commerce platforms, social networks, and academia, by modeling interactions between distinct entity sets. Within these graphs, the butterfly motif, a complete 2*2 biclique, represents the simplest yet significant subgraph structure, crucial for analyzing complex network patterns. Counting the butterflies offers significant benefits across various applications, including community analysis and recommender systems. Additionally, the temporal dimension of bipartite graphs, where edges activate within specific time frames, introduces the concept of historical butterfly counting, i.e., counting butterflies within a given time interval. This temporal analysis sheds light on the dynamics and evolution of network interactions, offering new insights into their mechanisms. Despite its importance, no existing algorithm can efficiently solve the historical butterfly counting task. To address this, we design two novel indices whose memory footprints are dependent on #butterflies and #wedges, respectively. Combining these indices, we propose a graph structure-aware indexing approach that significantly reduces memory usage while preserving exceptional query speed. We theoretically prove that our approach is particularly advantageous on power-law graphs, a common characteristic of real-world bipartite graphs, by surpassing traditional complexity barriers for general graphs. Extensive experiments reveal that our query algorithms outperform existing methods by up to five magnitudes, effectively balancing speed with manageable memory requirements.
arXiv:2305.15622v2 Announce Type: replace Abstract: Given the growing concerns about fairness in machine learning and the impressive performance of Graph Neural Networks (GNNs) on graph data learning, algorithmic fairness in GNNs has attracted significant attention. While many existing studies improve fairness at the group level, only a few works promote individual fairness, which renders similar outcomes for similar individuals. A desirable framework that promotes individual fairness should (1) balance between fairness and performance, (2) accommodate two commonly-used individual similarity measures (externally annotated and computed from input features), (3) generalize across various GNN models, and (4) be computationally efficient. Unfortunately, none of the prior work achieves all the desirables. In this work, we propose a novel method, GFairHint, which promotes individual fairness in GNNs and achieves all aforementioned desirables. GFairHint learns fairness representations through an auxiliary link prediction task, and then concatenates the representations with the learned node embeddings in original GNNs as a "fairness hint". Through extensive experimental investigations on five real-world graph datasets under three prevalent GNN models covering both individual similarity measures above, GFairHint achieves the best fairness results in almost all combinations of datasets with various backbone models, while generating comparable utility results, with much less computational cost compared to the previous state-of-the-art (SoTA) method.
arXiv:2409.11554v2 Announce Type: replace Abstract: Graph machine learning, particularly using graph neural networks, fundamentally relies on node features. Nevertheless, numerous real-world systems, such as social and biological networks, often lack node features due to various reasons, including privacy concerns, incomplete or missing data, and limitations in data collection. In such scenarios, researchers typically resort to methods like structural and positional encoding to construct node features. However, the length of such features is contingent on the maximum value within the property being encoded, for example, the highest node degree, which can be exceedingly large in applications like scale-free networks. Furthermore, these encoding schemes are limited to categorical data and might not be able to encode metrics returning other type of values. In this paper, we introduce a novel, universally applicable encoder, termed \emph{PropEnc}, which constructs expressive node embedding from any given graph metric. \emph{PropEnc} leverages histogram construction combined with reversed index encoding, offering a flexible method for node features initialization. It supports flexible encoding in terms of both dimensionality and type of input, demonstrating its effectiveness across diverse applications. \emph{PropEnc} allows encoding metrics in low-dimensional space which effectively address the sparsity challenge and enhances the efficiency of the models. We show that \emph{PropEnc} can construct node features that either exactly replicate one-hot encoding or closely approximate indices under various settings. Our extensive evaluations in graph classification setting across multiple social networks that lack node features support our hypothesis. The empirical results conclusively demonstrate that \emph{PropEnc} is both an efficient and effective mechanism for constructing node features from diverse set of graph metrics.
arXiv:2501.06843v1 Announce Type: new Abstract: The Global Research infrastructure (GRI) is made up of the repositories and organizations that provide persistent identifiers (PIDs) and metadata for many kinds of research objects and connect these objects to funders, research institutions, researchers, and one another using PIDs. The INFORMATE Project has combined three data sources to focus on understanding how the global research infrastructure might help the US National Science Foundation (NSF) and other federal agencies identify and characterize the impact of their support. In this paper we present INFORMATE observations of three data systems. The NSF Award database represents NSF funding while the NSF Public Access Repository (PAR) and CHORUS, as a proxy for the GRI, represent two different view of results of that funding. We compare the first at the level of awards and the second two at the level of published research articles. Our findings demonstrate that CHORUS datasets include significantly more NSF awards and more related papers than does PAR. Our findings also suggest that time plays a significant role in the inclusion of award metadata across the sources analyzed. Data in those sources travel very different journeys, each presenting different obstacles to metadata completeness and suggesting necessary actions on the parts of authors and publishers to ensure that publication and funding metadata are captured. We discuss these actions, as well as implications our findings have for emergent technologies such as artificial intelligence and natural language processing.
arXiv:2501.07182v1 Announce Type: new Abstract: TikTok has gradually become one of the most pervasive social media platforms in our daily lives. In this research article, I explore how users on TikTok discussed the crisis in Palestine that worsened in 2023. Using network analysis, I situate keywords representing the conflict and categorize them thematically based on a coding schema derived from politically and ideologically differentiable stances. I conclude that that activism and propaganda are contending amongst themselves in the thriving space afforded by TikTok today.
arXiv:2501.07368v1 Announce Type: new Abstract: Social media play a key role in mobilizing collective action, holding the potential for studying the pathways that lead individuals to actively engage in addressing global challenges. However, quantitative research in this area has been limited by the absence of granular and large-scale ground truth about the level of participation in collective action among individual social media users. To address this limitation, we present a novel suite of text classifiers designed to identify expressions of participation in collective action from social media posts, in a topic-agnostic fashion. Grounded in the theoretical framework of social movement mobilization, our classification captures participation and categorizes it into four levels: recognizing collective issues, engaging in calls-to-action, expressing intention of action, and reporting active involvement. We constructed a labeled training dataset of Reddit comments through crowdsourcing, which we used to train BERT classifiers and fine-tune Llama3 models. Our findings show that smaller language models can reliably detect expressions of participation (weighted F1=0.71), and rival larger models in capturing nuanced levels of participation. By applying our methodology to Reddit, we illustrate its effectiveness as a robust tool for characterizing online communities in innovative ways compared to topic modeling, stance detection, and keyword-based methods. Our framework contributes to Computational Social Science research by providing a new source of reliable annotations useful for investigating the social dynamics of collective action.
arXiv:2501.07025v1 Announce Type: cross Abstract: Many Natural Language Processing (NLP) related applications involves topics and sentiments derived from short documents such as consumer reviews and social media posts. Topics and sentiments of short documents are highly sparse because a short document generally covers a few topics among hundreds of candidates. Imputation of missing data is sometimes hard to justify and also often unpractical in highly sparse data. We developed a method for calculating a weighted similarity for highly sparse data without imputation. This weighted similarity is consist of three components to capture similarities based on both existence and lack of common properties and pattern of missing values. As a case study, we used a community detection algorithm and this weighted similarity to group different shampoo brands based on sparse topic sentiments derived from short consumer reviews. Compared with traditional imputation and similarity measures, the weighted similarity shows better performance in both general community structures and average community qualities. The performance is consistent and robust across metrics and community complexities.
arXiv:2501.06873v1 Announce Type: cross Abstract: We analyze over 44,000 NBER and CEPR working papers from 1980 to 2023 using a custom language model to construct knowledge graphs that map economic concepts and their relationships. We distinguish between general claims and those documented via causal inference methods (e.g., DiD, IV, RDD, RCTs). We document a substantial rise in the share of causal claims-from roughly 4% in 1990 to nearly 28% in 2020-reflecting the growing influence of the "credibility revolution." We find that causal narrative complexity (e.g., the depth of causal chains) strongly predicts both publication in top-5 journals and higher citation counts, whereas non-causal complexity tends to be uncorrelated or negatively associated with these outcomes. Novelty is also pivotal for top-5 publication, but only when grounded in credible causal methods: introducing genuinely new causal edges or paths markedly increases both the likelihood of acceptance at leading outlets and long-run citations, while non-causal novelty exhibits weak or even negative effects. Papers engaging with central, widely recognized concepts tend to attract more citations, highlighting a divergence between factors driving publication success and long-term academic impact. Finally, bridging underexplored concept pairs is rewarded primarily when grounded in causal methods, yet such gap filling exhibits no consistent link with future citations. Overall, our findings suggest that methodological rigor and causal innovation are key drivers of academic recognition, but sustained impact may require balancing novel contributions with conceptual integration into established economic discourse.
arXiv:2501.07473v1 Announce Type: new Abstract: Political polarization, a key driver of social fragmentation, has drawn increasing attention for its role in shaping online and offline discourse. Despite significant efforts, accurately measuring polarization within ideological distributions remains a challenge. This study evaluates five widely used polarization measures, testing their strengths and weaknesses with synthetic datasets and a real-world case study on YouTube discussions during the 2020 U.S. Presidential Election. Building on these findings, we present a novel adaptation of Kleinberg's burst detection algorithm to improve mode detection in polarized distributions. By offering both a critical review and an innovative methodological tool, this work advances the analysis of ideological patterns in social media discourse.
arXiv:2501.07327v1 Announce Type: new Abstract: The advantages of temporal networks in capturing complex dynamics, such as diffusion and contagion, has led to breakthroughs in real world systems across numerous fields. In the case of human behavior, face-to-face interaction networks enable us to understand the dynamics of how communities emerge and evolve in time through the interactions, which is crucial in fields like epidemics, sociological studies and urban science. However, state-of-the-art datasets suffer from a number of drawbacks, such as short time-span for data collection and a small number of participants. Moreover, concerns arise for the participants' privacy and the data collection costs. Over the past years, many successful algorithms for static networks generation have been proposed, but they often do not tackle the social structure of interactions or their temporal aspect. In this work, we extend a recent network generation approach to capture the evolution of interactions between different communities. Our method labels nodes based on their community affiliation and constructs surrogate networks that reflect the interactions of the original temporal networks between nodes with different labels. This enables the generation of synthetic networks that replicate realistic behaviors. We validate our approach by comparing structural measures between the original and generated networks across multiple face-to-face interaction datasets.
arXiv:2501.06585v1 Announce Type: new Abstract: Data imputation is crucial for addressing challenges posed by missing values in multivariate time series data across various fields, such as healthcare, traffic, and economics, and has garnered significant attention. Among various methods, diffusion model-based approaches show notable performance improvements. However, existing methods often cause disharmonious boundaries between missing and known regions and overlook long-range dependencies in missing data estimation, leading to suboptimal results. To address these issues, we propose a Diffusion-based time Series Data Imputation (DSDI) framework. We develop a weight-reducing injection strategy that incorporates the predicted values of missing points with reducing weights into the reverse diffusion process to mitigate boundary inconsistencies. Further, we introduce a multi-scale S4-based U-Net, which combines hierarchical information from different levels via multi-resolution integration to capture long-term dependencies. Experimental results demonstrate that our model outperforms existing imputation methods.
arXiv:2306.04452v3 Announce Type: replace Abstract: Online social networks (OSNs) provide a platform for individuals to share information, exchange ideas, and build social connections beyond in-person interactions. For a specific topic or community, opinion leaders are individuals who have a significant influence on others' opinions. Detecting opinion leaders and modeling influence dynamics is crucial as they play a vital role in shaping public opinion and driving conversations. Existing research have extensively explored various graph-based and psychology-based methods for detecting opinion leaders, but there is a lack of cross-disciplinary consensus between definitions and methods. For example, node centrality in graph theory does not necessarily align with the opinion leader concepts in social psychology. This review paper aims to address this multi-disciplinary research area by introducing and connecting the diverse methodologies for identifying influential nodes. The key novelty is to review connections and cross-compare different multi-disciplinary approaches that have origins in: social theory, graph theory, compressed sensing theory, and control theory. Our first contribution is to develop cross-disciplinary discussion on how they tell a different tale of networked influence. Our second contribution is to propose trans-disciplinary research method on embedding socio-physical influence models into graph signal analysis. We showcase inter- and trans-disciplinary methods through a Twitter case study to compare their performance and elucidate the research progression with relation to psychology theory. We hope the comparative analysis can inspire further research in this cross-disciplinary area.
arXiv:2408.00139v2 Announce Type: replace Abstract: The related concepts of partisan belief systems, issue alignment, and partisan sorting are central to our understanding of politics. These phenomena have been studied using measures of alignment between pairs of topics, or how much individuals' attitudes toward a topic reveal about their attitudes toward another topic. We introduce a higher-order measure that extends the assessment of alignment beyond pairs of topics by quantifying the amount of information individuals' opinions on one topic reveal about a set of topics simultaneously. Applying this approach to legislative voting behavior shows that parliamentary systems typically exhibit similar multiway alignment characteristics, but can change in response to shifting intergroup dynamics. In American National Election Studies surveys, our approach reveals a growing significance of party identification together with a consistent rise in multiway alignment over time.
arXiv:2501.03266v1 Announce Type: new Abstract: LLM safety and ethical alignment are widely discussed, but the impact of content moderation on user satisfaction remains underexplored. To address this, we analyze nearly 50,000 Chatbot Arena response-pairs using a novel fine-tuned RoBERTa model, that we trained on hand-labeled data to disentangle refusals due to ethical concerns from other refusals due to technical disabilities or lack of information. Our findings reveal a significant refusal penalty on content moderation, with users choosing ethical-based refusals roughly one-fourth as often as their preferred LLM response compared to standard responses. However, the context and phrasing play critical roles: refusals on highly sensitive prompts, such as illegal content, achieve higher win rates than less sensitive ethical concerns, and longer responses closely aligned with the prompt perform better. These results emphasize the need for nuanced moderation strategies that balance ethical safeguards with user satisfaction. Moreover, we find that the refusal penalty is notably lower in evaluations using the LLM-as-a-Judge method, highlighting discrepancies between user and automated assessments.
arXiv:2408.12875v2 Announce Type: replace Abstract: Graph Neural Networks (GNNs) have become essential tools for graph representation learning in various domains, such as social media and healthcare. However, they often suffer from fairness issues due to inherent biases in node attributes and graph structure, leading to unfair predictions. To address these challenges, we propose a novel GNN framework, DAB-GNN, that Disentangles, Amplifies, and deBiases attribute, structure, and potential biases in the GNN mechanism. DAB-GNN employs a disentanglement and amplification module that isolates and amplifies each type of bias through specialized disentanglers, followed by a debiasing module that minimizes the distance between subgroup distributions. Extensive experiments on five datasets demonstrate that DAB-GNN significantly outperforms ten state-of-the-art competitors in terms of achieving an optimal balance between accuracy and fairness. The codebase of DAB-GNN is available at https://github.com/Bigdasgit/DAB-GNN