Community detection on elite mathematicians’ collaboration network

Yurui Huang; Zimo Wang; Chaolin Tian; Yifang Ma

doi:10.2478/jdis-2024-0026

Journal of Data and Information Science, 2024, Vol. 9, Issue 4: 1-23


Research Papers


Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen 518055, China

† Yifang Ma (Email: mayf@sustech.edu.cn; ORCID: 0000-0003-0326-7993).

Received date: 2023-07-25

Revised date: 2023-11-02

Accepted date: 2023-11-22

Online published: 2024-08-21


Abstract

Purpose: This study focuses on understanding the collaboration relationships among mathematicians, particularly those esteemed as elites, to reveal the structures of their communities and evaluate their impact on the ﬁeld of mathematics.

Design/methodology/approach: Two community detection algorithms, Greedy Modularity Maximization and Infomap, are used to examine collaboration patterns among mathematicians. We compare the centrality of award-winning mathematicians with that of their peers using Betweenness, Closeness, and Harmonic centrality, which capture a node’s connecting role in the network. We also investigate how elite mathematicians are distributed across communities and how the communities relate to different mathematical sub-fields.

Findings: Award-winning mathematicians exert substantial influence in connecting roles within the network. First, elites are unevenly distributed across the network, concentrating in specific communities rather than being evenly dispersed. Second, the detected communities are positively associated with distinct mathematical sub-fields, indicating collaborative tendencies among scientists engaged in related domains. Lastly, the study suggests that reduced research diversity within a community might lead to a higher concentration of elite scientists within that community.

Research limitations: The study focuses narrowly on mathematicians, which may limit the applicability of the findings to broader scientific fields. In addition, the award data were collected manually and may be incomplete, which affects the reliability of conclusions about the collaboration networks.

Practical implications: This study oﬀers valuable insights into how elite mathematicians collaborate and how knowledge is disseminated within mathematical circles. Understanding these collaborative behaviors could aid in fostering better collaboration strategies among mathematicians and institutions, potentially enhancing scientiﬁc progress in mathematics.

Originality/value: The study adds value to understanding collaborative dynamics within the realm of mathematics, oﬀering a unique angle for further exploration and research.

Keywords: Greedy modularity maximization; Infomap; Collaboration network; Community detection; Mathematical awardees

Cite this article

Yurui Huang, Zimo Wang, Chaolin Tian, Yifang Ma. Community detection on elite mathematicians’ collaboration network[J]. Journal of Data and Information Science, 2024, 9(4): 1-23. DOI: 10.2478/jdis-2024-0026

1 Introduction

In today’s world of scientiﬁc exploration and technological progress, the signiﬁcance of collaboration has grown alongside the rising complexity of scientiﬁc challenges (Laudel, 2001; Sonnenwald, 2007). This increased complexity requires collective eﬀorts, making collaboration more essential than ever. Collaborative initiatives enhance research by bringing in a variety of perspectives from diﬀerent disciplines, which are crucial for eﬀectively tackling complex challenges. This idea is backed by influential studies that illustrate how collaborations draw upon diverse expertise and viewpoints, improving the capacity to understand and solve intricate scientiﬁc problems (Huang et al., 2023; Klein, 2005; Qin et al., 1997).

The mechanisms of scientiﬁc collaboration often manifest through co-authorships in articles and patents (Barber & Scherngell, 2013; Huang et al., 2024; Newman, 2001; 2004). In separate studies across various ﬁelds, including biomedical research, physics, and mathematics, Newman identiﬁed that most scientists have a limited number of collaborators, while a few have hundreds or even thousands (Newman, 2004). Understanding collaboration patterns is crucial to comprehending the evolution of scientiﬁc progress. Guimerà et al. (2005) studied factors like team size and involvement of newcomers and found that team composition signiﬁcantly influences both collaboration structures and outcomes in both the arts and sciences. Similarly, Tomassini and Luthi observed the dynamics of collaborative networks in genetic programming, recognizing the changing nature of these ties (Tomassini & Luthi, 2007). It is, therefore, essential to understand the collaboration practices of scientists to better learn how networks develop and function, which in turn, can inform strategies to improve collaboration in scientiﬁc research and innovation.

Building on this understanding of collaborative patterns, the research of Barabási et al. (2002) and Barber and Scherngell (2013) further reinforces the significance of co-authorship networks within the scientific community. These studies reveal that such networks are not uniformly structured but are composed of distinct substructures, underscoring their scale-free nature. This characteristic is shaped by both internal and external co-authorship links, which influence the networks’ scaling and topological evolution (Fortunato, 2010). Additionally, the identification of communities within these networks, as indicated by Van Nguyen et al. (2012), relies heavily on the density of links within groups compared with the density of links between them.

The effective detection of communities within these networks is a pivotal aspect of network analysis, as highlighted by Reichardt and Bornholdt (2006) and Fortunato and Hric (2016). This process, which uncovers the clusters in intricate networks, offers insight into their organizational and functional structures. Methodologies for community detection range from foundational graph-theoretical approaches to advanced machine learning (Ng et al., 2001; Zhang et al., 2009; Zhao et al., 2005). The field has been greatly advanced by the application of deep learning, probabilistic modeling, and multi-resolution strategies for discovering communities within large and varied networks (Liu et al., 2020; Rosvall et al., 2009; Zhang et al., 2009). Zhang et al. (2023) applied the attributed network clustering algorithm (Falih et al., 2018) to better understand collaborations among statisticians, and Mao et al. (2017) leveraged machine learning and network theory to detect topical scientific communities. Such approaches not only support the analysis of collaboration patterns but also encourage strategic research and offer a window into the collaborative evolution of scientific knowledge.

Recognizing the unique role of mathematics as a foundational language across scientiﬁc areas and its importance in fostering research collaborations, our study speciﬁcally investigates the collaboration network of mathematicians (Asif & Islam, 2016; Grossman, 2002). With this in mind, we focus on scientiﬁc collaboration and career development within the mathematical circle and wonder what the underlying community structure in the collaboration network of mathematicians is and what roles elite ones play within their communities compared with their peers. These research questions aim to identify and analyze communities within the collaboration network, shedding light on the collaborative characteristics of elite mathematicians and their peers in advancing mathematical research. To achieve this, the study draws publication data from OpenAlex (Priem et al., 2022), an open dataset, and collects supplementary information on mathematical awards. Focusing on collaborations among mathematicians, including recipients of prestigious awards in the ﬁeld of mathematics such as the Fields Medal and Lobachevsky Prize, the research aims to uncover and delineate clusters of collaborations within this network by applying two community detection algorithms: Greedy Modularity Maximization (GMM) (Clauset et al., 2004) and Infomap (Rosvall et al., 2009). Subsequently, a detailed analysis of these detected communities will be conducted, probing into their inherent characteristics, such as collaborative traits and impact within distinct mathematical sub-ﬁelds.

Investigating the network characteristics of mathematicians, particularly those who are award laureates, is a vital part of modern scientiﬁc research. This study aims to understand how these mathematicians are positioned within collaboration networks and the impact they have on these communities. By examining the roles, influence, and structural formations of these mathematicians, especially those who have received prestigious awards, we aim to shed light on the dynamics of their collaborative interactions. The insights gained from this research will enhance our understanding of collaborative networks in mathematics and provide valuable perspectives on the dynamics and evolution of scientiﬁc collaboration across various disciplines.

Table 1. The basic characteristics of mathematicians’ collaborative networks.

N	L	〈k〉	lnN	〈C〉	density
79,016	342,022	8.657	11.2774	0.1972	0.0001

2 Methodology

2.1 Collaboration network

In this study, we leverage the extensive dataset from OpenAlex, a key resource for our analysis. Our study collected data on scholars who fulﬁlled speciﬁc requirements: they were proliﬁc authors with more than ten publications between 1960 and 2021. We classiﬁed authors primarily active in mathematics and identiﬁed collaborative links among them. Figure 1(a) shows a signiﬁcant rise in collaboration among mathematicians after 2005, reflecting the necessity to tackle complex problems requiring interdisciplinary cooperation. Additionally, we compiled information on renowned mathematicians who have received awards in their ﬁeld from platforms including Wikipedia, Wikidata, and oﬃcial award websites. This supplementary dataset encompasses 341 unique award categories and includes detailed information on the mathematicians’ demographics and achievements. Of these distinguished award recipients, 395 individuals were successfully matched with proﬁles in the OpenAlex database.


Figure 1. Data characteristics of mathematicians.

This data collection methodology enabled us to gain insights into the collaborative networks of 79,016 mathematicians aﬃliated with 8,833 institutions across 172 countries. We were also able to categorize their main areas of research within mathematics, including ﬁelds like pure mathematics, mathematical analysis, combinatorics, and statistics, as shown in Figure 1(b). Following this, we constructed an undirected weighted collaborative network, designated as G, where mathematicians are represented as nodes v₁, v₂, ⋯, v_N. The edges $l_{v_{i}v_{j}}$ and their respective weights $w_{v_{i}v_{j}}$ represent the collaborative connections and the frequency of collaborations, respectively, over the period from 1960 to 2021 (Priem et al., 2022).

In our network G, the average degree 〈k〉 is lower than the logarithm of the number of nodes lnN (8.657 versus 11.2774; see Table 1), indicating that the network’s interconnections are relatively sparse. The network density of about 0.0001 supports this: only a very small fraction of all possible pairs of mathematicians are linked by a collaboration. We then examined the structural attributes of network G using several analyses, with the results visualized in Figure 2. First, we applied kernel density estimation (KDE) to assess the distribution of connected component sizes within G. The result shows a strong polarization in component sizes, indicating stark differences in the community structures present. Next, we ranked nodes by degree to understand their relative importance or centrality in G. Figure 2(b) highlights critical nodes or ‘hubs’ in the network, suggesting the presence of highly connected individuals. Finally, we generated a histogram of the node degrees in G, which shows the range of degrees and how rare specific degrees are. Together, these analyses characterize the structural properties of the collaboration network G.
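As a quick check of the figures in Table 1, the average degree, lnN, and the density can be recomputed from N and L alone. The sketch below assumes the standard definitions for an undirected simple graph, 〈k〉 = 2L/N and density = 2L/(N(N−1)); the paper does not state which formulas it used, so these are assumptions.

```python
import math

# Values reported in Table 1.
N = 79_016   # number of nodes (mathematicians)
L = 342_022  # number of edges (collaboration links)

# Assumed standard definitions for an undirected simple graph.
avg_degree = 2 * L / N            # <k> = 2L / N
log_n = math.log(N)               # ln N
density = 2 * L / (N * (N - 1))   # share of possible node pairs that are linked

print(f"<k> = {avg_degree:.3f}")
print(f"ln N = {log_n:.4f}")
print(f"density = {density:.6f}")
print("sparse by the <k> < ln N criterion:", avg_degree < log_n)
```

Under these definitions the recomputed values agree with Table 1 up to rounding.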


Figure 2. Network characteristics.

2.2 Metrics of centrality

In network analysis and statistical modeling, centrality metrics are crucial for evaluating the significance or influence of nodes within a network. These metrics help decipher the network’s structure and the dynamics of influence and information flow. This paper uses four centrality measures, each offering a distinct view of a node’s role within a weighted network: Betweenness, Closeness, Harmonic centrality, and Eigenvector centrality. We computed all four with the Python interface of igraph (Csardi & Nepusz, 2006). Here, d^w (u, v) denotes the weighted shortest-path distance between nodes u and v.

Betweenness. Betweenness is delineated by the degree to which a node falls on the shortest paths interlinking other nodes. Nodes with high Betweenness are akin to connectors or intermediaries and play a crucial role in the network’s eﬃcient communication. Mathematically, it is deﬁned as the total number of shortest paths that traverse through a particular node and is given by:

(1)$\operatorname{Betweenness}(v)=\sum_{s \neq v \neq t} \sigma_{s t}^{w}(v),$

where $\sigma_{s t}^{w}(v)$ is the number of weighted shortest paths from node s to node t that pass through node v.
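To make Equation (1) concrete, the following is a minimal sketch that counts the shortest weighted paths passing through a node on a toy graph. It is an illustration only, not the igraph routine used in the study. Several choices are assumptions: edge values are treated directly as distances (the paper does not say how collaboration weights are converted to path lengths), each unordered pair {s, t} is counted once, and the helper names `build_graph`, `dijkstra_counts`, and `betweenness_count` are hypothetical. Library implementations commonly use the fraction of shortest paths through v rather than the raw count.

```python
import heapq

def build_graph(edges):
    """Build an undirected adjacency map {u: {v: distance}} from (u, v, distance) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def dijkstra_counts(adj, source):
    """Return (dist, sigma): shortest distances from source and the number of
    distinct shortest paths to each reachable node (positive distances assumed)."""
    dist = {source: 0}
    sigma = {source: 1}
    heap = [(0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj[u].items():
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v], sigma[v] = nd, sigma[u]   # strictly shorter path found
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                sigma[v] += sigma[u]               # another path of equal length
    return dist, sigma

def betweenness_count(adj, v):
    """Eq. (1): number of shortest s-t paths through v, over unordered pairs with s != v != t."""
    info = {s: dijkstra_counts(adj, s) for s in adj}
    dist_v, sigma_v = info[v]
    others = [u for u in adj if u != v]
    total = 0
    for i, s in enumerate(others):
        dist_s, sigma_s = info[s]
        for t in others[i + 1:]:
            if v in dist_s and t in dist_s and t in dist_v:
                # v lies on a shortest s-t path iff d(s, v) + d(v, t) = d(s, t)
                if dist_s[v] + dist_v[t] == dist_s[t]:
                    total += sigma_s[v] * sigma_v[t]
    return total

# Toy graph with integer distances (exact equality tests are then safe).
toy = build_graph([("A", "B", 1), ("B", "C", 1), ("A", "D", 1), ("D", "C", 1), ("C", "E", 1)])
print({node: betweenness_count(toy, node) for node in sorted(toy)})
```

In this toy graph, node C connects the leaf E to everything else and closes the A-B-C-D cycle, so it collects the most shortest paths, while the leaf E collects none.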

Closeness. Closeness evaluates a node’s proximity to all other nodes, advocating for the importance of nodes that can rapidly connect with the rest. Higher closeness indicates a node’s strategic position for swift information spread across the network. The closeness for a node is computed as the inverse sum of its shortest path distances to all other nodes:

(2)$\operatorname{Closeness}(v)=\frac{1}{\sum_{u \neq v} d^{w}(u, v)},$

Harmonic centrality. Harmonic centrality modifies closeness centrality by summing the reciprocals of the weighted shortest-path distances from one node to all others, rather than taking the reciprocal of their sum. This subtle difference shifts the emphasis toward the influence exerted by shorter paths within the network. Harmonic centrality is defined as follows:

(3)$C_{H}(v)=\sum_{u \neq v} \frac{1}{d^{w}(u, v)},$

This measure emphasizes nodes that are not only close to other nodes but also connected to a wide range of other nodes, contributing to global network accessibility and cohesion.
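The difference between Equations (2) and (3) is easiest to see on a small example. The sketch below computes both measures with a plain Dijkstra search; it is illustrative and not the igraph implementation used in the study. Edge values are assumed to be distances, unreachable nodes are simply skipped (an assumption, since the paper does not discuss disconnected pairs), and the function names are hypothetical.

```python
import heapq

def shortest_distances(adj, source):
    """Weighted shortest-path distances d^w(source, u) for all reachable u."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def closeness(adj, v):
    """Eq. (2): reciprocal of the summed distances to the other reachable nodes."""
    total = sum(d for u, d in shortest_distances(adj, v).items() if u != v)
    return 1.0 / total if total > 0 else 0.0

def harmonic(adj, v):
    """Eq. (3): sum of reciprocal distances to the other reachable nodes."""
    return sum(1.0 / d for u, d in shortest_distances(adj, v).items() if u != v)

# Toy path graph A -(1)- B -(2)- C, stored as {node: {neighbor: distance}}.
toy = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 2.0}, "C": {"B": 2.0}}
for node in toy:
    print(node, round(closeness(toy, node), 4), round(harmonic(toy, node), 4))
```

For node B, closeness is 1/(1 + 2) while harmonic centrality is 1/1 + 1/2, so the nearby neighbor A dominates the harmonic score.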

Eigenvector centrality. Eigenvector centrality reflects a node’s prestige by recognizing that connections to influential nodes amplify a node’s score. In essence, nodes with extensive connections to other central nodes are deemed pivotal. The centrality score is iteratively derived from the neighbor’s scores, deﬁned by the following relation:

(4)$A x=\lambda x,$

where A is the weighted adjacency matrix of the network, λ is the largest eigenvalue of A, x is the corresponding principal eigenvector, and the centrality score C_E(v) of node v is its entry in x.
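Equation (4) can be solved approximately by power iteration, which matches the "iteratively derived" description above. The sketch below is one simple way to do this and is not necessarily how igraph computes the score. Two choices are assumptions made for the illustration: the iteration uses A + I instead of A (this leaves the principal eigenvector unchanged but avoids oscillation on bipartite graphs), and scores are rescaled so that the largest equals 1.

```python
def eigenvector_centrality(adj, max_iter=1000, tol=1e-12):
    """Power iteration for the principal eigenvector of a weighted adjacency map
    {u: {v: weight}} with non-negative weights on a connected graph."""
    x = {u: 1.0 for u in adj}
    for _ in range(max_iter):
        # One multiplication by (A + I): own score plus weighted neighbor scores.
        new = {u: x[u] + sum(w * x[v] for v, w in adj[u].items()) for u in adj}
        scale = max(new.values())
        new = {u: val / scale for u, val in new.items()}  # largest score becomes 1
        if max(abs(new[u] - x[u]) for u in adj) < tol:
            return new
        x = new
    return x

# Toy star graph: hub C linked to leaves A, B, and D with unit weights.
star = {
    "C": {"A": 1.0, "B": 1.0, "D": 1.0},
    "A": {"C": 1.0},
    "B": {"C": 1.0},
    "D": {"C": 1.0},
}
scores = eigenvector_centrality(star)
print({node: round(score, 4) for node, score in sorted(scores.items())})
```

For this star graph the exact solution has λ = √3, with the hub scoring 1 and each leaf 1/√3, so the leaves inherit part of the hub’s prestige.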

3 Findings

3.1 The influence of elite mathematicians in communities

The study of elite mathematicians within academic communities is instrumental in unraveling the complexities of knowledge sharing, collaborative patterns, and the structure of scholarly networks. This exploration, particularly through the lens of network analysis, has become a vital aspect of modern research (Asif & Islam, 2016; Gaskó et al., 2016; Izquierdo et al., 2018). Our research identiﬁes these elite mathematicians based on their signiﬁcant career achievements, namely their receipt of notable awards in mathematics. We then segment collaboration networks into distinct communities using two algorithms and focus on calculating the four aforementioned centrality metrics within these communities. This method provides a deeper insight into the dynamics of centrality and the critical role these eminent ﬁgures play in the formation and evolution of academic networks.

The outcomes, outlined in Table 2 and depicted in Figure 3, reveal a distinct trend when comparing award-winning mathematicians with their peers. Centralities were computed on the sub-networks formed by each algorithm’s community division, and the two groups were compared using a t-test. Betweenness and Harmonic centrality differ significantly (p < 0.05) under both algorithms. Closeness differs significantly under GMM, whereas the Infomap difference (0.0053) carries no significance marker in Table 2. The heightened Betweenness among award recipients suggests their pivotal role as links or intermediaries among diverse nodes within the communities: they exert substantial influence over the shortest paths connecting other nodes. Conversely, the lower Betweenness among other mathematicians indicates fewer intermediary roles, signaling a different function or position within the network’s structure. Moreover, awardees exhibit higher Closeness and Harmonic centrality. The increased Closeness implies that awardees reach other nodes more quickly, with shorter average distances than non-recipients. The heightened Harmonic centrality indicates that awardees are close to many other nodes, which suggests more effective communication routes and stronger interconnectedness within their communities.

Table 2. The t-test of the diﬀerence of centrality metrics between awardees and other mathematicians.

	Betweenness	Closeness	Harmonic centrality	Eigenvector centrality
GMM	47,777.2381***	0.0141***	0.0245***	-0.0455***
Infomap	96.5236***	0.0053	0.0323***	0.1148***

***p<0.001, **p<0.01, *p<0.05


Figure 3. Four centrality metrics were computed within communities identiﬁed by both GMM and Infomap, and subsequently compared between awardees and other mathematicians. The metrics assessed were: (a), (e) Betweenness; (b), (f) Closeness; (c), (g) Harmonic Centrality; and (d), (h) Eigenvector Centrality.

The analysis of eigenvector centrality for awardees shows contrasting results between the communities identified by GMM (negative difference) and Infomap (positive difference). Although both algorithms uncover community structures, they operate on different principles and exhibit distinct strengths. GMM is straightforward but may encounter resolution limits, hindering its ability to detect very small or loosely connected communities. In contrast, Infomap is adept at identifying community structures in large, modular networks, typically resulting in more balanced community sizes. A likely explanation for the contrast is that eigenvector centrality tends to be more consistent in larger networks: the complexity and sheer number of nodes reduce the influence of minor connectivity changes on individual nodes’ eigenvector centrality, making the scores more stable. Conversely, in smaller networks, certain nodes might display heightened eigenvector centrality due to simpler connection patterns, potentially leading to concentrated control of information flow among a few nodes. An examination of eigenvector centrality across the entire collaboration network reveals no substantial difference between awardees and their peers (a difference of −0.00127 with a p-value of 0.4771). This implies that award-winning and non-award-winning mathematicians exert a similar level of influence throughout the network, pointing to a comparable effect on the network’s structural dynamics and a comparable capacity to direct the flow of information.

3.2 Imbalanced distribution of talented mathematicians within communities

Researchers frequently collaborate with familiar individuals, such as mentors, past research colleagues, or established experts in their academic domain (Singh et al., 2021; Yu et al., 2021). These collaborations form complex networks characterized by well-established connections, notably evident in scholarly co-authorship. Understanding these connections is crucial in exploring scientists’ intricate collaboration patterns and identifying cohesive clusters or communities within academic networks.

In our study of mathematicians’ collaboration network, we employed two community detection algorithms, GMM and Infomap. The former identified 2,629 communities within network G, while the latter delineated 8,962 communities. Across these communities, the 395 award-winning mathematicians are not evenly distributed. In Figure 4(a), each circle represents a community, and its radius is proportional to the number of awardees in that community. The community sizes identified by GMM showed a partially positive correlation with the presence of awardees; notably, the community hosting the most awardees was not the largest one. By comparison, Infomap exhibited a modest positive link between community size and the number of awardees, with the largest community harboring the most awardees. This finding prompts consideration of an “elite mathematician” accumulation effect, in which awardees tend to collaborate within their elite circle or with potential awardees in the field of mathematics. To explore further how award recipients are distributed across communities, we use the statistical measure of skewness (Groeneveld & Meeden, 1984), defined by the following equation:

(5)$\eta=\frac{1}{n} \frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{3}}{\left(\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}\right)^{3 / 2}}$

where c₁, c₂, ⋯, c_n denote the sizes of the detected communities and a₁, a₂, ⋯, a_n the numbers of awardees within them. Here, $x_{i}=\frac{a_{i}}{c_{i}}-\frac{\sum_{k=1}^{n} a_{k}}{\sum_{k=1}^{n} c_{k}}$ is the deviation between the actual proportion of awardees in the i-th community and the average awardee ratio across the entire collaboration network. The skewness η quantifies how asymmetrically these deviations x₁, x₂, ⋯, x_n are distributed around their mean $\bar{x}$.
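Equation (5) and the definition of x_i translate directly into a few lines of code. The sketch below is a plain-Python illustration on made-up community sizes and awardee counts; the function name `awardee_skewness` and the toy numbers are hypothetical and do not come from the study’s data.

```python
def awardee_skewness(sizes, awardees):
    """Skewness (Eq. 5) of the deviations x_i = a_i / c_i - sum(a) / sum(c)."""
    n = len(sizes)
    overall_ratio = sum(awardees) / sum(sizes)
    x = [a / c - overall_ratio for a, c in zip(awardees, sizes)]
    mean = sum(x) / n
    m2 = sum((xi - mean) ** 2 for xi in x) / n   # second central moment
    m3 = sum((xi - mean) ** 3 for xi in x) / n   # third central moment
    if m2 == 0:
        return 0.0  # all deviations identical: no asymmetry to measure
    return m3 / m2 ** 1.5

# Four hypothetical communities of equal size.
concentrated = awardee_skewness([10, 10, 10, 10], [0, 0, 0, 4])  # all awardees in one community
balanced = awardee_skewness([10, 10, 10, 10], [0, 1, 1, 2])      # symmetric spread
print(round(concentrated, 4), round(balanced, 4))
```

A strongly positive η, as in the concentrated case, means a few communities hold a far larger share of awardees than the network-wide ratio would suggest.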


Figure 4. Distributed characteristics of mathematicians and mathematical awardees within communities.

To examine whether awardees are evenly distributed across communities, we take as the baseline a simple linear relationship in which the expected number of awardees in a community is proportional to its size. To test this idea, we implemented a statistical simulation that randomly allocates awardees among communities using a multinomial distribution F~Multinomial(n_awardee, p₁, p₂, …, p_n), where n_awardee = 395 represents the total number of awardees. The probabilities $p_{1}=\frac{c_{1}}{N}, p_{2}=\frac{c_{2}}{N}, \cdots, p_{n}=\frac{c_{n}}{N}$ correspond to each community’s size as a fraction of the total size of the collaboration network G. This method shows how awardees would be distributed across the network’s communities by chance alone.

Next, we ran 1,000 random simulations using the multinomial distribution F to evaluate the connection between community size and awardee distribution. For each simulation j, we computed the skewness of the synthetically distributed awardees over the community structures detected by GMM and Infomap, denoted η_j¹ and η_j², respectively. For comparison, we determined the actual skewness values, η¹ and η², from the real awardee distribution over the communities detected by GMM and Infomap. Comparing the simulated and real distributions lets us assess whether what is observed is consistent with what is expected under random allocation. As shown in Figure 4 (c) and (d), both η¹ and η² lie well above the range of the simulated η values: the skewness of the awardee distribution in the real network communities is more pronounced than in the randomized models. This indicates a tendency for award-winning mathematicians in actual networks to cluster more densely in certain communities rather than being evenly spread across the network, resonating with the principles of club theory in economics (Potts et al., 2017).
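The randomization procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors’ code: the community sizes and observed awardee counts are invented toy values, the helper names are hypothetical, and the multinomial draw is implemented with `random.choices` weighted by community size.

```python
import random

def skewness(values):
    """Skewness as in Eq. (5); returns 0.0 when all values are identical."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

def deviations(sizes, awardees):
    """x_i: awardee share of community i minus the network-wide awardee share."""
    overall_ratio = sum(awardees) / sum(sizes)
    return [a / c - overall_ratio for a, c in zip(awardees, sizes)]

def simulated_skewness(sizes, n_awardees, n_sims=1000, seed=0):
    """Skewness values when awardees are allocated multinomially with p_i = c_i / N."""
    rng = random.Random(seed)
    communities = range(len(sizes))
    results = []
    for _ in range(n_sims):
        counts = [0] * len(sizes)
        # One multinomial draw: each awardee lands in community i with probability c_i / N.
        for i in rng.choices(communities, weights=sizes, k=n_awardees):
            counts[i] += 1
        results.append(skewness(deviations(sizes, counts)))
    return results

# Hypothetical toy data: 20 communities of 50 members, all 40 awardees in one community.
toy_sizes = [50] * 20
toy_observed = [40] + [0] * 19

observed_eta = skewness(deviations(toy_sizes, toy_observed))
null_etas = simulated_skewness(toy_sizes, n_awardees=sum(toy_observed))
share_at_least = sum(e >= observed_eta for e in null_etas) / len(null_etas)
print(f"observed eta = {observed_eta:.3f}, max simulated eta = {max(null_etas):.3f}")
print(f"share of simulations with eta >= observed: {share_at_least}")
```

With the real data, `toy_sizes` and `toy_observed` would be replaced by the community sizes and awardee counts from GMM or Infomap, and `n_awardees` would be 395.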

3.3 Community and research direction

Furthermore, we explore the interplay between the research orientations of mathematicians and the communities identified within their collaboration network, categorizing the mathematicians into 16 distinct mathematics sub-fields (see Figure 1(b)). To evaluate the agreement between these sub-fields and the communities delineated by the two algorithms, we use normalized mutual information (NMI; details are given in the Appendix). As a robustness check, we designed a randomized experiment in which discipline labels are randomly assigned to all nodes in the network and the NMI between the community labels and the random discipline labels is calculated for each algorithm. Repeating this experiment 1,000 times gives the range of NMI values expected under random labeling. Table 3 shows that the real NMI for both algorithms (0.2222 for GMM and 0.2404 for Infomap) lies far above the corresponding random range, indicating a significant positive association between the communities and the sub-fields. This implies that the identified communities inherently reflect aspects of these sub-fields, suggesting a natural tendency for mathematicians working in similar areas to collaborate more frequently.

Table 3. NMI analysis between true ﬁeld labels and detected community labels.

Method	GMM	Infomap
Real NMI	0.2222	0.2404
Random NMI	(0.03361, 0.03365)	(0.08247, 0.08250)
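The comparison in Table 3 can be illustrated with a small self-contained sketch. The exact NMI variant is defined in the paper’s Appendix, which is not reproduced here, so the normalization below (2·I(X;Y) / (H(X) + H(Y))) is an assumption, as is the use of label shuffling for the random baseline. The labels are invented toy data and the function names are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Normalized mutual information, assuming arithmetic-mean normalization."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    h_a, h_b = entropy(a), entropy(b)
    return 2 * mutual_info / (h_a + h_b) if h_a + h_b > 0 else 1.0

def random_nmi_range(communities, fields, n_sims=1000, seed=0):
    """Min and max NMI after shuffling field labels across nodes n_sims times."""
    rng = random.Random(seed)
    shuffled = list(fields)
    values = []
    for _ in range(n_sims):
        rng.shuffle(shuffled)
        values.append(nmi(communities, shuffled))
    return min(values), max(values)

# Hypothetical toy labels: 12 nodes, 3 communities, 3 sub-fields.
toy_communities = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
toy_fields = ["algebra"] * 4 + ["analysis"] * 3 + ["statistics"] * 5

real_nmi = nmi(toy_communities, toy_fields)
low, high = random_nmi_range(toy_communities, toy_fields)
print(f"real NMI = {real_nmi:.4f}, random range = ({low:.4f}, {high:.4f})")
```

On a network of tens of thousands of nodes the random range becomes very narrow, which is why the real values in Table 3 stand out so clearly against it.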

These insights lead us to question whether the community information, encapsulating aspects of mathematical sub-ﬁelds, could be eﬀectively represented using diﬀerent methodologies. In particular, we explore the relationship between the number of award recipients in a community, the community’s disciplinary makeup, and its size. To analyze the structured nature of ﬁeld labels within these communities, we introduce the Simpson index (Simpson, 1949; Somerfield et al., 2008). Our linear regression analysis, examining the relationship between the number of awardees, community sizes, and the Simpson index (indicative of disciplinary composition), yields compelling ﬁndings. The results displayed in Table 4 indicate that larger communities tend to have more award recipients, consistent with earlier observations in this study. Conversely, there is an inverse correlation between the Simpson index and the number of awardees. This suggests that communities with higher diversity in their ﬁeld labels or greater disorderliness are less likely to include award recipients. This ﬁnding implies that mathematicians who specialize in speciﬁc subdomains and collaborate intensively within those domains are more likely to achieve notable success in their ﬁeld.

Table 4. Linear regression analysis on the number of awardees.

	#awardees
Algorithm	GMM	Infomap
Community size	0.0056***	0.0092***
Simpson index	-0.1075*	-0.0339***

***p<0.001, **p<0.01, *p<0.05
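For readers unfamiliar with the Simpson index used as a regressor in Table 4, the sketch below computes it for the field labels of a single community. The paper does not state which form of the index it uses; because the text reads higher values as higher diversity, the sketch assumes the Gini-Simpson form 1 − Σ p_k², where p_k is the share of members in sub-field k. The function name and the toy labels are hypothetical.

```python
from collections import Counter

def simpson_diversity(field_labels):
    """Assumed Gini-Simpson form: probability that two members drawn at random
    (with replacement) from the community belong to different sub-fields."""
    n = len(field_labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(field_labels).values())

# Hypothetical communities described by their members' sub-field labels.
specialized = ["combinatorics"] * 8
mixed = ["combinatorics", "statistics", "algebra", "geometry"] * 2

print(simpson_diversity(specialized), simpson_diversity(mixed))
```

Under this form, a community working in a single sub-field scores 0, and the score approaches 1 as members spread evenly over many sub-fields.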

4 Discussion

In our study, which centers on mathematicians and their collaborative networks, we strive to uncover the underlying community structures and collaboration patterns prevalent among scholars in the field of mathematics. A key aspect of our investigation is the distinct role played by award-winning mathematicians within these networks. Employing two community detection algorithms, GMM and Infomap, our research delineates the collaboration clusters that involve these distinguished mathematicians. Our study thereby illuminates not just the collaborative behaviors of elite mathematicians but also the broader structure of knowledge dissemination and its impact on the evolution of the field. By analyzing the roles and patterns of award-winning mathematicians, we gain a deeper understanding of the forces shaping mathematical research and its progression.

This study delves into the roles and impacts of elite mathematicians within academic networks, with a particular focus on their influence, which is evaluated using centrality metrics. The results show that mathematicians who are award recipients exhibit higher Betweenness, Closeness, and Harmonic centrality than their peers who have not received such awards. This suggests they have a more influential or connective role within networks.

Additionally, awardees are found to concentrate their collaborations within certain communities rather than engaging equally across the entire collaboration network. Randomized simulations support this finding, indicating a tendency for awardees to cluster within particular communities. Awardees are central not only in terms of their connections but also in their ability to bridge different clusters within the network, facilitating the flow of ideas and collaborations across the boundaries between areas of mathematics.

Furthermore, our examination of the connections between mathematical sub-ﬁelds and the communities detected by the two algorithms shows a signiﬁcant positive correlation. This suggests that collaboration within the mathematician community loosely reflects these sub-ﬁelds, with individuals in similar areas tending to work together. This underscores the role of community structures in shaping research trajectories and the evolution of mathematical knowledge, highlighting how mathematicians in similar sub-ﬁelds collaborate to enhance coherence and specialization within their respective communities.

Lastly, the study examines the relationship between the number of awardees, the disciplinary diversity within communities, and the sizes of these communities. The analysis indicates that mathematicians are more likely to achieve success in their field when they specialize in specific subdomains and participate extensively in collaborations within those subdomains. Moreover, our findings highlight the need to study scientific collaboration across disciplines. By extending our analysis beyond mathematics to include interactions across various scientific disciplines, future research can provide a more comprehensive understanding of collaborative dynamics and community structures in science. This broader perspective is crucial for addressing complex global challenges that require interdisciplinary solutions.

The primary limitation of this study is its focused examination of mathematicians and their awardees. This speciﬁc concentration may restrict the applicability of our ﬁndings to the broader scientiﬁc community. By directing our attention to a select group of scholars, we may not fully capture the collaborative patterns and community structures that exist across a wider range of scientiﬁc disciplines.

Consequently, future research would benefit from adopting an interdisciplinary approach. Given the increasing focus on interdisciplinary collaboration in contemporary research, it is imperative to investigate how collaborations span different scientific domains. This would entail broadening our scope beyond mathematics to encompass interactions between mathematicians and researchers from various other fields. Additionally, this study encounters a data limitation due to the potential incompleteness of the dataset on mathematics awardees, a consequence of the manual online data collection process. The reliance on manually collected online data may result in omissions and inaccuracies due to inconsistencies in award reporting, the frequency of information updates, and variability in data accessibility. Although we have considered edge weights in GMM and Infomap, applying other community detection methods, such as the k-clique method, is also important in future work to obtain more robust results and to enable horizontal comparisons. Incorporating the k-clique method could provide additional insights into the tightly knit sub-communities within the collaboration network, further enhancing the depth of our analysis.

These limitations could compromise the depth and reliability of our ﬁndings, particularly those concerning the roles and impacts of awardees in the mathematical community’s collaborative networks. Future studies would greatly beneﬁt from more systematic and automated data collection methods, which would likely reduce the occurrence of incomplete data and lead to more thorough and precise analyses. Moreover, the development of new community detection algorithms speciﬁcally designed for the distinctive characteristics of scientiﬁc networks could substantially improve the precision of community identiﬁcation. This tailored approach would enable a more accurate understanding of the collaborative dynamics across various scientiﬁc ﬁelds.

Appendix

1 Data collection

In this study, we leverage the extensive dataset from OpenAlex, a key resource for our analysis. OpenAlex provides a broad and openly accessible collection of scholarly works, encompassing authors, publication venues, institutions, and key concepts. This dataset, which is regularly updated, includes more than 243 million publication records and 90 million author proﬁles, along with detailed information on publishers, funders, and institutions. Unique IDs assigned to each scientiﬁc entity in OpenAlex help alleviate issues of name disambiguation. Research utilizing this dataset has been instrumental in understanding career patterns and the evolving dynamics within scholarly networks. OpenAlex has been widely used in the science of science research, enabling studies on collaboration patterns, individual career trajectories, and the overall framework of scholarly interactions (Harris et al., 2023; Williams et al., 2023; Xu et al., 2024).

Using OpenAlex, our study collected data on scholars who fulﬁlled speciﬁc requirements: they were proliﬁc authors with more than ten publications between 1960 and 2021. We initially screened the literature from this period, scrutinizing authorship ties and their respective primary disciplines, as deﬁned by level 0 in OpenAlex’s classiﬁcation system. This process allowed us to pinpoint those authors whose main scholarly output was in the ﬁeld of mathematics. Having classiﬁed these individuals as mathematicians, we further identiﬁed collaborative links among them, focusing on publications where their co-authors were also recognized as mathematicians.
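The selection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the `(year, author_ids)` record layout, the function name `build_coauthor_network`, and the choice of edge weight (number of co-authored papers) are hypothetical and are not taken from OpenAlex or from the paper.

```python
from collections import Counter
from itertools import combinations


def build_coauthor_network(papers, math_authors, pub_threshold=10,
                           start=1960, end=2021):
    """Return (prolific mathematicians, weighted co-authorship edges).

    papers: iterable of (year, [author_id, ...]) tuples (hypothetical layout).
    math_authors: set of author ids whose primary discipline is mathematics.
    """
    in_window = [(year, authors) for year, authors in papers
                 if start <= year <= end]

    # Count publications per mathematician inside the time window.
    pub_count = Counter(
        author
        for _, authors in in_window
        for author in set(authors)
        if author in math_authors
    )
    # "More than ten publications" means strictly above the threshold.
    prolific = {a for a, n in pub_count.items() if n > pub_threshold}

    # Assumed edge weight: number of papers two prolific mathematicians share.
    weights = Counter()
    for _, authors in in_window:
        kept = sorted(set(authors) & prolific)
        for u, v in combinations(kept, 2):
            weights[(u, v)] += 1
    return prolific, dict(weights)
```

Co-authors who are not classified as mathematicians are dropped before edges are formed, which mirrors the restriction to publications whose co-authors were also recognized as mathematicians.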

Upon examining the graph presented in Figure 1(a), which illustrates the collaborative frequency among mathematicians across time and the annual count of active mathematicians, a notable trend is observed. There has been a signiﬁcant rise in collaboration among mathematicians, particularly after 2005. Although there has been growth in the number of mathematicians during this period, this increase has not been proportional to the rapid escalation in collaborative activities. The intensiﬁed collaborative eﬀorts reflect the necessity to tackle complex problems in today’s technologically advanced landscape, where interdisciplinary cooperation and larger research teams are often required. These patterns have resulted in substantial collaborative network data, underscoring the importance of these joint academic pursuits.

In addition to using OpenAlex for publication records, we have carefully compiled information on renowned mathematicians who have received awards in their field. This supplementary dataset was sourced from a combination of platforms, including Wikipedia, Wikidata, and official award-related websites. It includes detailed information such as the mathematicians' names, genders, birth years, specific award titles, and the dates these awards were received. Importantly, this dataset encompasses 341 unique award categories. Among the top ten are the John Simon Guggenheim Memorial Foundation Fellowship, Fellow of the Royal Society, Order of the Red Banner of Labor, Fields Medal, Lenin Prize, National Medal of Science, Order of Lenin, Leroy P. Steele Prize, Lobachevsky Prize, and State Stalin Prize. The mathematicians in this dataset, born between 1741 and 2000, show a gender distribution of 521 males and 27 females. Of these distinguished award recipients, 395 individuals were successfully matched with profiles in the OpenAlex database.

2 Community detection

In this study, we apply two widely used community detection algorithms, GMM and Infomap, to explore the sub-community structure in the collaboration network of mathematicians. Both GMM and Infomap are applied to weighted undirected networks, with their implementations in the igraph Python interface taking edge weights into consideration (Csardi & Nepusz, 2006). GMM is based on the maximization of modularity using a greedy algorithm, while Infomap is grounded in the principles of information theory. These two classic methods were chosen due to their robust theoretical foundations and their ability to effectively analyze the community structure within the collaboration network.

2.1 Greedy Modularity Maximization

Let $A_{N \times N}$ be the adjacency matrix of the collaboration network G, whose element $A_{uv}$ satisfies

(6) $A_{u v}=\left\{\begin{array}{cc} w_{u v}, & \text { if mathematicians } u \text { and } v \text { had co-authored, } \\ 0, & \text { otherwise. } \end{array}\right.$

In the context of communities, mathematicians may form clusters based on different community detection algorithms. The key measure for assessing the performance of such an algorithm is modularity, which can be mathematically described as follows:

(7)$Q=\frac{1}{2 m} \sum_{u v}\left[\boldsymbol{A}_{u v}-\frac{k_{u} k_{v}}{2 m}\right] \delta\left(c_{u}, c_{v}\right).$

Here, $m=\frac{1}{2} \sum_{u v} A_{u v}$ is the total edge weight of the graph G (the number of edges when every weight equals 1), and $c_u$ denotes the community to which mathematician u belongs. The term $k_u=\sum_{v} A_{uv}$ is the weighted degree of mathematician u, that is, the sum of the weights of the edges attached to u. The function $\delta(x, y)$ is an indicator function with a value of 1 when x = y and 0 otherwise.
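As a minimal sketch of Eq. (7), the function below computes Q for a weighted undirected graph from an edge dictionary and a community assignment. It uses the equivalent per-community form (internal weight divided by m, minus the squared share of total degree). The function name `modularity` and the input layout are assumptions made for illustration, and each undirected edge is assumed to appear once, with no self-loops.

```python
def modularity(edges, community):
    """Modularity Q of a partition.

    edges: {(u, v): weight} with each undirected edge listed once.
    community: {node: community label}.
    """
    strength = {}   # k_u: weighted degree of every node
    total = 0.0     # m: total edge weight
    for (u, v), w in edges.items():
        strength[u] = strength.get(u, 0.0) + w
        strength[v] = strength.get(v, 0.0) + w
        total += w

    internal = {}   # edge weight that stays inside each community
    for (u, v), w in edges.items():
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0.0) + w

    degree_sum = {}  # summed weighted degree of each community
    for node, k in strength.items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0.0) + k

    # Q = sum over communities of [ L_c / m - (d_c / 2m)^2 ].
    return sum(
        internal.get(c, 0.0) / total - (d / (2.0 * total)) ** 2
        for c, d in degree_sum.items()
    )
```

For two triangles joined by a single edge, splitting the graph into the two triangles gives Q = 5/14, while putting every node into one community gives Q = 0.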

The proportion of edge endpoints connected to community i is $a_{i}=\frac{1}{2 m} \sum_{v} k_{v}\, \delta\left(c_{v}, i\right)$, and the proportion of edges linking vertices in community i to those in community j is $e_{i j}=\frac{1}{2 m} \sum_{u v} A_{u v}\, \delta\left(c_{u}, i\right) \delta\left(c_{v}, j\right)$. Consequently, the variation in Q when two communities i and j merge is given by:

(8) $\Delta Q_{i j}=e_{i j}+e_{j i}-2 a_{i} a_{j}.$
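The merge gain in Eq. (8) follows in two steps once modularity is written at the community level. The derivation below is a sketch that assumes the form $Q=\sum_i (e_{ii}-a_i^2)$ used by Clauset et al. (2004), where $e_{ij}$ and $a_i$ are both normalized by $2m$. The label $k$ for the merged community is introduced here only for illustration.

```latex
\begin{align*}
Q &= \sum_{i}\left(e_{ii} - a_{i}^{2}\right), \\
e_{kk} &= e_{ii} + e_{jj} + e_{ij} + e_{ji}, \qquad a_{k} = a_{i} + a_{j}, \\
\Delta Q_{ij} &= \left(e_{kk} - a_{k}^{2}\right) - \left(e_{ii} - a_{i}^{2}\right) - \left(e_{jj} - a_{j}^{2}\right) \\
              &= e_{ij} + e_{ji} - 2 a_{i} a_{j}.
\end{align*}
```

Because the network is undirected, $e_{ij}=e_{ji}$, so the gain can also be written as $2\left(e_{ij}-a_i a_j\right)$.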

Considering an iterative algorithm approach, let us assume a sparse matrix that stores $\Delta Q_{ij}$ for each pair of communities i and j that are connected by at least one edge. A max-heap H holds the largest element of each row of this matrix, along with the labels i and j of the respective community pair. We then apply GMM (see Algorithm 1), which aims to optimize modularity (Asif & Islam, 2016; Harris et al., 2023). The objective is to find a division of nodes that maximizes the number of edges inside communities relative to the number expected under a random arrangement.
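The greedy merging loop can be illustrated with a deliberately simple sketch. It is not the paper's Algorithm 1 and not the igraph implementation: instead of a sparse matrix with a max-heap, it rescans all connected community pairs on every round, which is slower but easier to follow. The function name `greedy_modularity` and the edge-dictionary input are assumptions, and each undirected edge is assumed to appear once, with no self-loops.

```python
def greedy_modularity(edges):
    """Agglomerative modularity maximization; returns {node: community label}."""
    two_m = 2.0 * sum(edges.values())

    e = {}  # e[i][j]: share of edge weight joining communities i and j (i != j)
    a = {}  # a[i]: share of edge endpoints attached to community i
    for (u, v), w in edges.items():
        for x, y in ((u, v), (v, u)):
            e.setdefault(x, {})
            e[x][y] = e[x].get(y, 0.0) + w / two_m
            a[x] = a.get(x, 0.0) + w / two_m

    # Every node starts in its own community, labelled by the node itself.
    members = {node: [node] for node in a}

    while True:
        # Find the connected pair with the largest positive gain (Eq. 8).
        best, pair = 0.0, None
        for i in e:
            for j, e_ij in e[i].items():
                gain = 2.0 * (e_ij - a[i] * a[j])
                if gain > best + 1e-12:
                    best, pair = gain, (i, j)
        if pair is None:
            break  # no merge increases Q any further

        i, j = pair
        # Merge community j into community i.
        row_j = e.pop(j)
        for k, e_jk in row_j.items():
            del e[k][j]
            if k != i:
                e[i][k] = e[i].get(k, 0.0) + e_jk
                e[k][i] = e[k].get(i, 0.0) + e_jk
        a[i] += a.pop(j)
        members[i].extend(members.pop(j))

    return {node: label for label, nodes in members.items() for node in nodes}
```

On two triangles joined by one edge, the loop merges nodes within each triangle and then stops, because joining the two triangles would lower Q.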


Figure A1. Communities and sub-communities detected sequentially by GMM. The left sub-figure shows the result of the first detection on the collaboration network. The middle one shows the community with the most awardees in the first detection. The right one shows the network structure of one community in the second detection. Large circles represent awardees, and the color indicates the sub-field of each mathematician.

2.2 Infomap based on Map equation

Rosvall et al. (2009) introduced an innovative method that relies on random walks, a concept where a walker transitions between nodes stochastically to analyze information flow in a network. The key idea of the Infomap algorithm is to ﬁnd the most compact description of the network’s structure. This is achieved by encoding paths of random walks to reflect the network’s community structure. The algorithm iteratively tests diﬀerent network partitions to reduce the required information to describe the random walker’s trajectory. Through this process, it detects tight-knit communities with rich information exchange.

Infomap (Algorithm 2) clusters together nodes that exchange information quickly and efficiently, forming well-defined modules. The connections between these modules represent the channels along which information moves. The Map Equation, central to this method, is presented as follows:

(9)$\begin{aligned} L(\boldsymbol{M}) & =q_{\curvearrowright} \boldsymbol{H}(\mathcal{Q})+\sum_{i=1}^{m} p_{\circlearrowright}^{i} \boldsymbol{H}\left(\mathcal{P}^{i}\right), \\ & =\left(\sum_{i=1}^{m} q_{i \curvearrowright}\right) \log \left(\sum_{i=1}^{m} q_{i \curvearrowright}\right)-2 \sum_{i=1}^{m} q_{i \curvearrowright} \log \left(q_{i \curvearrowright}\right)-\sum_{\alpha=1}^{N} p_{\alpha} \log \left(p_{\alpha}\right)+\sum_{i=1}^{m}\left(q_{i \curvearrowright}+\sum_{\alpha \in i} p_{\alpha}\right) \log \left(q_{i \curvearrowright}+\sum_{\alpha \in i} p_{\alpha}\right) \end{aligned}$

where m here denotes the number of modules, $q_{\curvearrowright}=\sum_{i=1}^{m} q_{i \curvearrowright}$ is the probability that the random walk switches modules at any given step, $H(\mathcal{Q})$ is the entropy of the module-switching events, $H(\mathcal{P}^{i})$ is the entropy of movements within module i, and $p_{\alpha}$ is the frequency at which node α is visited in the steady state of the random walk. The term $q_{i \curvearrowright}$ denotes the probability that the random walker leaves module i per step, and $p_{\circlearrowright}^{i}=q_{i \curvearrowright}+\sum_{\alpha \in i} p_{\alpha}$, so the sum of $p_{\circlearrowright}^{i}$ across all modules equals $1+q_{\curvearrowright}$.
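To make the expanded form of Eq. (9) concrete, the sketch below evaluates L(M) in bits for a given partition of a weighted undirected graph. It rests on two assumptions that hold for undirected networks without teleportation: the stationary visit rate of a node is its weighted degree divided by twice the total edge weight, and the exit probability of a module is the weight of its boundary edges divided by the same quantity. The names `map_equation` and `plogp` are illustrative. The function only scores a partition; it does not perform the Infomap search itself.

```python
import math


def plogp(x):
    """x * log2(x), with the usual convention 0 * log(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0


def map_equation(edges, module):
    """Two-level description length L(M) for an undirected weighted graph.

    edges: {(u, v): weight} with each undirected edge listed once.
    module: {node: module label}.
    """
    two_w = 2.0 * sum(edges.values())
    p = {}       # p_alpha: stationary visit rate of each node
    q_exit = {}  # q_i: per-step probability of leaving module i
    for (u, v), w in edges.items():
        p[u] = p.get(u, 0.0) + w / two_w
        p[v] = p.get(v, 0.0) + w / two_w
        if module[u] != module[v]:
            q_exit[module[u]] = q_exit.get(module[u], 0.0) + w / two_w
            q_exit[module[v]] = q_exit.get(module[v], 0.0) + w / two_w

    p_module = {}  # total visit rate of the nodes in each module
    for node, rate in p.items():
        p_module[module[node]] = p_module.get(module[node], 0.0) + rate

    q_total = sum(q_exit.values())
    return (
        plogp(q_total)
        - 2.0 * sum(plogp(q) for q in q_exit.values())
        - sum(plogp(rate) for rate in p.values())
        + sum(plogp(q_exit.get(i, 0.0) + p_i) for i, p_i in p_module.items())
    )
```

With all nodes in one module, the expression reduces to the entropy of the node visit rates. A partition that captures real structure, such as two triangles joined by one edge split into the two triangles, yields a shorter description.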

3 Comparative evaluation

In the sphere of community detection, it is vital to assess the algorithms based on their ability to mirror the structural cohesion seen in real-world groups. This endeavor requires metrics that can measure the congruity between identiﬁed communities by algorithms and the actual observed groupings. A widely recognized metric for this purpose, introduced by Danon et al. (2005), is the Normalized Mutual Information (NMI). This metric evaluates the shared information between the true classiﬁcation of community memberships and that uncovered by community detection algorithms, taking account of the size of each grouping.

Consequently, NMI oﬀers a standardized accuracy gauge in reflecting the authentic community structure. It is a robust tool for gauging and contrasting the eﬃcacy of various algorithms in discerning signiﬁcant community formations in networks.

At the heart of this metric is the confusion matrix P, where "real" communities constitute the rows and "detected" communities form the columns. An element $P_{ij}$ of this matrix is the number of nodes from actual community i that appear in detected community j. Deriving from principles of information theory, NMI is computed as follows:

(10) $I(A, B)=\frac{-2 \sum_{i=1}^{C_{A}} \sum_{j=1}^{C_{B}} P_{i j} \log \left(\frac{P_{i j} P}{P_{i \cdot} P_{\cdot j}}\right)}{\sum_{i=1}^{C_{A}} P_{i \cdot} \log \left(\frac{P_{i \cdot}}{P}\right)+\sum_{j=1}^{C_{B}} P_{\cdot j} \log \left(\frac{P_{\cdot j}}{P}\right)}.$

The variable $C_A$ is the number of real communities, $C_B$ is the number of detected communities, $P_{i \cdot}$ is the sum over row i of the matrix P, $P_{\cdot j}$ is the sum over column j, and the scalar P in the logarithms is the total number of nodes. An NMI value of 1 implies perfect alignment between the detected and actual communities. On the other hand, an NMI of 0 indicates no correspondence, such as when an algorithm fails to differentiate any distinct communities and groups the entire network as a single community.
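A direct transcription of Eq. (10) into Python may help check the two boundary cases mentioned above. This is a sketch, with the function name `nmi` and the label-list inputs chosen for illustration. Returning 1.0 when both partitions consist of a single group is a convention assumed here, since the formula itself is undefined (0/0) in that case.

```python
import math
from collections import Counter


def nmi(true_labels, detected_labels):
    """Normalized mutual information between two labelings of the same nodes."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, detected_labels))  # confusion matrix P_ij
    row = Counter(true_labels)                          # row sums P_i.
    col = Counter(detected_labels)                      # column sums P_.j

    # Only non-empty cells contribute, so log(0) never occurs.
    numerator = -2.0 * sum(
        p_ij * math.log(p_ij * n / (row[i] * col[j]))
        for (i, j), p_ij in joint.items()
    )
    denominator = (
        sum(p * math.log(p / n) for p in row.values())
        + sum(p * math.log(p / n) for p in col.values())
    )
    if denominator == 0.0:
        # Both labelings are one single group; treated as identical (assumed).
        return 1.0
    return numerator / denominator
```

Relabeling the communities does not change the score, because only the co-occurrence counts enter the formula.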

Acknowledgement

We thank OpenAlex for the scientiﬁc corpus dataset. The computation in this study was supported by the Center for Computational Science and Engineering of the Southern University of Science and Technology.

Funding information

This work was supported by grants from the National Natural Science Foundation of China No. NSFC62006109 and NSFC12031005, the 13th Five-year plan for Education Science Funding of Guangdong Province No. 2021GXJK349, No. 2020GXJK457, and the Stable Support Plan Program of Shenzhen Natural Science Fund No. 20220814165010001.

Data availability statement

The data are mainly collected from the open data resource OpenAlex (https://openalex.org/), and the supplementary awardees' information is available from the corresponding author upon reasonable request.

Author contributions

Yurui Huang (12031320@mail.sustech.edu.cn): Conceptualization (Lead), Data curation (Equal), Formal analysis (Lead), Software (Lead), Visualization (Lead), Writing - original draft (Lead).

Zimo Wang (sdheitu@yeah.net): Data curation (Equal), Formal analysis (Supporting), Software (Supporting), Writing - review & editing (Supporting).

Chaolin Tian (12131250@mail.sustech.edu.cn): Investigation (Supporting), Validation (Supporting), Writing - review & editing (Supporting).

Yifang Ma (mayf@sustech.edu.cn): Funding acquisition (Lead), Methodology (Equal), Project administration (Lead), Validation (Lead), Writing - review & editing (Equal).

References


Asif, & Islam, M. A. (2016, April). Finding most collaborating mathematicians a co-author network analysis of mathematics domain. In 2016 International Conference on Computing, Electronic and Electrical Engineering (ICE Cube) (pp. 289-293). IEEE.

Barabási, A.-L., Jeong, Néda, Ravasz, Schubert, & Vicsek (2002). Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and its Applications, 311(3-4), 590-614.

Barber, M. J., & Scherngell (2013). Is the European R&D network homogeneous? Distinguishing relevant network communities using graph theoretic and spatial interaction modelling approaches. Regional Studies, 47(8), 1283-1298.

Clauset, Newman, M. E., & Moore (2004). Finding community structure in very large networks. Physical Review E, 70(6), 066111.

Csardi, & Nepusz (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695, 1-9.

Danon, Diaz-Guilera, Duch, & Arenas (2005). Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09), P09008.

Falih, Grozavu, Kanawati, & Bennani (2018). ANCA: Attributed network clustering algorithm. In Complex Networks & Their Applications VI: Proceedings of Complex Networks 2017 (The Sixth International Conference on Complex Networks and Their Applications) (pp. 241-252). Springer International Publishing.

Fortunato (2010). Community detection in graphs. Physics Reports, 486(3-5), 75-174.

Fortunato, & Hric (2016). Community detection in networks: A user guide. Physics Reports, 659(11), 1-44.

Gaskó, Lung, R. I., & Suciu, M. A. (2016). A new network model for the study of scientific collaborations: Romanian computer science and mathematics co-authorship networks. Scientometrics, 108, 613-632.

Groeneveld, R. A., & Meeden (1984). Measuring skewness and kurtosis. Journal of the Royal Statistical Society Series D: The Statistician, 33(4), 391-399.

Grossman, J. W. (2002). Patterns of collaboration in mathematical research. SIAM News, 35(9), 8-9.

Guimerà, Uzzi, Spiro, & Amaral, L. A. N. (2005). Team assembly mechanisms determine collaboration network structure and team performance. Science, 308(5722), 697-702.

Harris, M. J., Murtfeldt, Wang, Mordecai, E. A., & West, J. D. (2023). The role and influence of perceived experts in an anti-vaccine misinformation community. medRxiv.

Huang, Cheng, Tian, Jiang, Ma, & Ma (2024). Talent hat, cross-border mobility, and career development in China. Quantitative Science Studies, 1-24.

Huang, Tian, & Ma (2023). Practical operation and theoretical basis of difference-in-difference regression in science of science: The comparative trial on the scientific performance of Nobel laureates versus their coauthors. Journal of Data and Information Science, 8(1), 29-46.

Izquierdo, Vessuri, & Gonzalez (2018). Scientific collaboration networks of mathematicians from the former Soviet Union in the global south. Journal of Education and Human Development, 7(4), 83-93.

Klein, J. T. (2005). Interdisciplinary teamwork: The dynamics of collaboration and integration. In S. J. Derry, C. D. Schunn, & M. A. Gernsbacher (Eds.), Interdisciplinary Collaboration: An Emerging Cognitive Science (1st ed., pp. 23-50). NY: Psychology Press.

Laudel (2001). Collaboration, creativity and rewards: Why and how scientists collaborate. International Journal of Technology Management, 22(7-8), 762-781.

Liu, Xue, Wu, Zhou, Hu, Paris, Nepal, Yang, & Yu, P. S. (2020). Deep learning for community detection: Progress, challenges and opportunities. arXiv preprint arXiv:2005.08225.

Mao, Cao, Lu, & Li (2017). Topic scientific community in science: A combined perspective of scientific collaboration and topics. Scientometrics, 112, 851-875.

Newman, M. E. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2), 404-409.

Newman, M. E. (2004). Coauthorship networks and patterns of scientific collaboration. Proceedings of the National Academy of Sciences, 101(suppl_1), 5200-5205.

Ng, Jordan, & Weiss (2001). On spectral clustering: Analysis and an algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14 (NIPS 2001).

Potts, Hartley, Montgomery, Neylon, & Rennie (2017). A journal is a club: A new economic model for scholarly publishing. Prometheus, 35(1), 75-92.

Priem, Piwowar, & Orr (2022). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833.

Qin, Lancaster, F. W., & Allen (1997). Types and levels of collaboration in interdisciplinary research in the sciences. Journal of the American Society for Information Science, 48(10), 893-916.

Reichardt, & Bornholdt (2006). Statistical mechanics of community detection. Physical Review E, 74(1), 016110.

Rosvall, Axelsson, & Bergstrom, C. T. (2009). The map equation. The European Physical Journal Special Topics, 178(1), 13-23.

Simpson, E. H. (1949). Measurement of diversity. Nature, 163(4148), 688-688.

Singh, Becattini, Cascini, & Škec (2021). How familiarity impacts influence in collaborative teams? Proceedings of the Design Society, 1, 1735-1744.

Somerfield, P. J., Clarke, K. R., & Warwick, R. M. (2008). Simpson index. In S. E. Jørgensen & B. D. Fath (Eds.), Encyclopedia of Ecology (pp. 3252-3255). Academic Press.

Sonnenwald, D. H. (2007). Scientific collaboration. Annual Review of Information Science and Technology, 41(1), 643-681.

Tomassini, & Luthi (2007). Empirical analysis of the evolution of a scientific collaboration network. Physica A: Statistical Mechanics and its Applications, 385(2), 750-764.

Van Nguyen, Kirley, & García-Flores (2012). Community evolution in a scientific collaboration network. 2012 IEEE Congress on Evolutionary Computation.

Williams, Michalska, Cohen, Szomszor, & Grant (2023). Exploring the application of machine learning to expert evaluation of research impact. PLoS ONE, 18(8), e0288469.

Xu, Liu, Bu, Sun, Zhang, Acuna, D. E., Gray, Meyer, & Ding (2024). The impact of heterogeneous shared leadership in scientific teams. Information Processing & Management, 61(1), 103542.

Yu, Xia, Zhang, Wei, Keogh, & Chen (2021). Familiarity-based collaborative team recognition in academic social networks. IEEE Transactions on Computational Social Systems, 9(5), 1432-1445.

Zhang, X.-S., Wang, R.-S., Wang, Qiu, Wang, & Chen (2009). Modularity optimization in community detection of complex networks. Europhysics Letters, 87(3), 38002.

Zhang, Pan, Wang, & Su (2023). Community detection in attributed collaboration network for statisticians. Stat, 12(1), e507.

Zhao, Karypis, & Fayyad (2005). Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10, 141-168.

