Research Papers

Extended Lorenz majorization and frequencies of distances in an undirected network

  • Leo Egghe ,
Expand
  • Hasselt University, Diepenbeek 3590, Belgium
†Leo Egghe (Email: ).

Received date: 2023-11-25

  Accepted date: 2023-12-28

  Online published: 2024-01-30

Abstract

Purpose: To contribute to the study of networks and graphs.

Design/methodology/approach: We apply standard mathematical thinking.

Findings: We show that the distance distribution in an undirected network Lorenz majorizes the one of a chain. As a consequence, the average and median distances in any such network are smaller than or equal to those of a chain.

Research limitations: We restricted our investigations to undirected, unweighted networks.

Practical implications: We are convinced that these results are useful in the study of small worlds and the so-called six degrees of separation property.

Originality/value: To the best of our knowledge our research contains new network results, especially those related to frequencies of distances.

Cite this article

Leo Egghe . Extended Lorenz majorization and frequencies of distances in an undirected network[J]. Journal of Data and Information Science, 2024 , 9(1) : 1 -10 . DOI: 10.2478/jdis-2024-0007

1 Introduction

Let G=(V,E) be an undirected network, where V= (vk)k=1,…,N denotes the set of nodes or vertices and E denotes the set of links or edges. As collaboration (of scientists, universities, countries), bibliographic coupling (of articles, books), and co-citation (of articles, books) are all examples of undirected networks, it goes without saying that the study of these networks is of great importance for bibliometrics (Rousseau et al., 2018).
We assume that #V = N > 1. A chain in a network is a sequence of different nodes one by one connected by edges. The distance d between two nodes is equal to the number of links situated on a shortest chain (often called a shortest path) between these two nodes. Consequently, the distance between two nodes connected by an edge is equal to one. Each network studied in this article is assumed to be connected, i.e. there is a chain between any two nodes. Hence, for each node, there exists another node at a distance one. The total number of distances between any pair of nodes in this network is equal to N(N-1)/2, where, for v1,v2 ∈ V, d(v1,v2)=d(v2,v1) is considered only once.
Notation. We denote by αj, j= 1,…, N-1, the number of times distance j occurs in network G. The array A=(α1, α2,…,αN-1) is called the α-array or distance array of the network G.
Some immediate properties
1) $\sum_{j=1}^{N-1} \alpha_{j}=\frac{N(N-1)}{2}$
2) α1=#EN-1
3) αN-1=0 or 1
4) αj=0 ⇒ ∀ kj: αk=0
We give a short proof of
5) $\alpha_{2} \leq \frac{(N-1)(N-2)}{2}$
Indeed, if $\alpha_{2}>\frac{(N-1)(N-2)}{2}$ then $\frac{N(N-1)}{2}-\sum_{j=1}^{N-1} \alpha_{1} \geq \alpha_{1}+\alpha_{2}>(N-1)+\frac{(N-1)(N-2)}{2}$$\frac{N(N-1)}{2}$, which is a contradiction.
6) In a chain of length j there are 2 chains of length j-1, 3 chains of length j-2, and k chains of length j-k+1 (0 < k< j).
7) The distance frequency array of a complete N-node network, KN, is $\left(\frac{N(N-1)}{2}, 0, \ldots, 0\right)$
We recall the definition of the majorization order (Hardy et al., 1934). Let X = (xj) and Y= (yj), j=1,…, N-1 be two (N-1)-sequences of non-negative numbers, ordered decreasingly then X majorizes Y, denoted as X ⋟ Y, if
$\forall i=1, \ldots, N-2: \sum_{j=1}^{i} x_{j} \geq \sum_{j=1}^{i} y_{j}$
And
$\sum_{j=1}^{N-1} x_{j}=\sum_{j=1}^{N-1} y_{j}$
We recall that if X ⋟ Y then the (standard) Lorenz curve of X (Lorenz, 1905) is situated above the Lorenz curve of Y. We now extend the notions of majorization and Lorenz curve by removing the requirement to be arranged in decreasing order.
Definition. Extended majorization order
Let X = (xj) and Y=(yj), j=1,…, N-1 be two (N-1)-sequences of non-negative numbers, then X majorizes Y (in the extended sense), denoted as X ⋟ Y (we keep the same notation), if
$\forall i=1, \ldots, N-2: \sum_{j=1}^{f} x_{j} \geq \sum_{j=1}^{f} y_{j}$
And
$\sum_{j=1}^{N-1} x_{j}=\sum_{j=1}^{N-1} y_{j}$
Definition. Extended Lorenz curve
Let X = (xj) be an (N-1)-sequence of non-negative numbers and let $s_{i}=\sum_{j=1}^{i} x$ be the jth partial sum. Hence sN-1 = TOT denotes the total sum of the numbers in X; s0 is set equal to 0. Now plot the points $\left(\frac{i}{N-1}, \frac{s_{i}}{T O T}\right)_{i=0, \ldots, N-1}$ and connect them by line segments to obtain a curve joining the origin (0,0) with the point (1,1). We refer to this curve as the extended Lorenz curve. Contrary to the standard Lorenz curve this curve is not necessarily concave (but of course still increasing). An example is shown in Figure 3. If X is decreasing then the extended Lorenz curve coincides with the classical Lorenz curve. If X ⋟ Y then the extended Lorenz curve of X is situated above the extended Lorenz curve of Y.
Figure 3. Extended Lorenz curves of K7, G0 and C6 (the chain of length 6).

2 The main result

Given the number of nodes, N, we next show a majorization result between the frequency sequence of a complete network KN, that of a general network G=(V, E), denoted as A, and the frequency sequence C of a chain.
Theorem
Given a network G=(V, E) with N nodes, then,
$\left(\frac{N(N-1)}{2}, 0, \ldots, 0\right) \succ A=\left(\alpha_{1}, \alpha_{2}, \ldots, \alpha_{N-1}\right) \succ C=(N-1, N-2, \ldots, 1)$
By (8) and (9) the second inequality means that
$\forall i=1, \ldots, N-2: \sum_{j=1}^{i} \alpha_{j} \geq \sum_{j=1}^{i}(N-j)$
And
$\sum_{j=1}^{N-1} \alpha_{j}=\sum_{j=1}^{N-1}(N-j)$
We moreover prove that
$\alpha_{i} \leq \sum_{j=i}^{N-1} \alpha_{j} \leq \frac{(N-i+1)(N-i)}{2}=: \beta_{i}$
Proof. By (1), we already know that
$\alpha_{i} \leq \sum_{j=i}^{N-1} \alpha_{j} \leq \frac{(N-i+1)(N-i)}{2}=: \beta_{i}$
which proves (11). Assume now that αi=0, for i > 1, then we already know that ∀ ki:αk=0 and thus
$\sum_{j=1}^{i} \alpha_{j}=\sum_{j=1}^{N-1} \alpha_{j}=\frac{N(N-1)}{2} \geq \sum_{j=1}^{i}(N-j)$
Assume now that for some i > 1, αi≠0, then we will prove that also in this case $\sum_{j=1}^{i} \alpha_{j} \geq \sum_{j=1}^{i}(N-j)$. This will be done in several steps. First, we show that $\sum_{j=1}^{i} \alpha_{j} \geq \frac{i(i+1)}{2}$. Indeed: there exists in V at least one chain of length i, connecting nodes to which we refer as u1, u2, …, ui+1. Then, by property 6, we know that ∀j=1,…, i: αjij+1, and hence $\sum_{j=1}^{i} \alpha_{j} \geq\left(\sum_{j=1}^{i}(i-j+1)\right)=\frac{i(i+1)}{2}$. In the next step, we show that this inequality can be refined to $\sum_{j=1}^{i} \alpha_{j} \geq\left(\frac{i(i+1)}{2}\right)+i$. Indeed, as the network under study is connected, there exists a node in the network, denoted as ui+2, connected to at least one node of the chain u1, u2, …, ui+1. This point ui+2 has a distance d, 0 < d ≤ i to at least i points in the chain. Now, adding the i distances involving the point ui+2 we obtain $\sum_{j=1}^{i} \alpha_{j} \geq\left(\frac{i(i+1)}{2}\right)+i$. We further note that ∀ji, each point in the set S ={u1, u2, …, ui+2} has at least j points in the set S at a distance 0 < d ≤ j.
Now we continue in this way. Assuming that we have a set T of i+n+1 connected nodes {u1, …, ui+1, ui+2, …, ui+n+1} from which we already derived that $\sum_{j=1}^{i} \alpha_{j} \geq\left(\frac{i(i+1)}{2}\right)+(n i)$ and for which we know that ∀ji each point in the set T, has at least j points in the set T at a distance 0 < d ≤ j. We again apply connectedness to get a new node ui+n+2 at a distance d, 0 < d ≤ i to all points in T, leading to $\sum_{j=1}^{i} \alpha_{j} \geq\left(\frac{i(i+1)}{2}\right)+(n+1) i$. Again we observe that ∀ji, each point in T* = {u1, …, ui+1, ui+2, …, ui+n+2} has a distance d, 0 < d ≤ j, with at least j points in the set T*. This procedure ends with n = N-i-2 for which $\sum_{j=1}^{i} \alpha_{j} \geq\left(\frac{i(i+1)}{2}\right)+(N-i-1) i$ =$N i-\sum_{j=1}^{i} j=\sum_{j=1}^{i}(N-j)$ which proves the inequality in the case αi≠0, and hence (10).
Now we prove (12). Using (10) and (11) we have:
$\forall i=1, \ldots, N-1: \alpha_{i} \leq \sum_{j=i}^{N-1} \alpha_{j}=\sum_{j=1}^{N-1} \alpha_{j}-\sum_{j=1}^{i-1} \alpha_{j} \leq \frac{N(N-1)}{2}-\sum_{j=1}^{i-1}(N-j)=\beta_{i}$
where we still have to prove the final equality in (12). For this, we first observe that:
$\beta_{i}-\beta_{i+1}=\frac{(N-i+1)(N-i)}{2}-\frac{(N-i)(N-i-1)}{2}=(N-i)$
Now,
$\begin{array}{c} \frac{N(N-1)}{2}-\sum_{j=1}^{i-1}(N-j)=\frac{N(N-1)}{2}-\sum_{j=1}^{i-1}\left(\beta_{j}-\beta_{j+1}\right) \\ =\frac{N(N-1)}{2}-\left(\beta_{1}-\beta_{i}\right)=\beta_{i} \end{array}$

3 Remarks and consequences

1) It follows immediately from the previous theorem that for a given number N and α-array A=(α1, α2, …, αN-1) the median of A is smaller than or equal to the median of the chain of length N-1 (N nodes).
2) It is always possible to find a network with N (N > 3) nodes such that αi< βi, and this for each number i=1,…, N-1. Indeed, consider for N > 3, a network for which ∀i, i=1,…,N–3, αi ≠0; αN2=1 and αN1=0. In this case αN1=0<βN1=1; αN2=1< βN2=3 and, using (12), ∀i, i=1,…, N–3, $\alpha_{i}<\sum_{j=i}^{N-2} \alpha_{j}=\sum_{j=i}^{N-1} \alpha_{j} \leq \beta_{i}$. Such a network may look like shown in Figure 1.
Figure 1. An example of a network for which $\forall i=1, \ldots, N-1: \alpha_{i}<\beta_{i}$.
3) The inequality $\alpha_{i} \leq \beta_{i}$ cannot be made more precise as for a chain of length i, αi=βi=1, for each i=1,…, N-1.
4) If N > 2, then it is impossible that ∀i=1,…,N–1,αi=βiIndeed: $\beta_{1}=\frac{N(N-1)}{2}$ and if α1 = $\frac{N(N-1)}{2}$ then automatically αi=0, i=2,.., N–1, while this is not the case for βi.
5) If A is an array of length N-1, consisting of non-negative natural numbers such that
$\left(\frac{N(N-1)}{2}, 0, \ldots, 0\right) \succ A=\left(\alpha_{1}, \alpha_{2}, \ldots, \alpha_{N-1}\right) \succ C=(N-1, N-2, \ldots, 1)$
then the components of A do not have to be frequencies of distances in a network. Indeed, let N = 4 and let A = (4,1,1), then (4,1,1)⋟C=(3,2,1). Yet, there does not exist a network with (4,1,1) as distance frequencies: the third component is equal to one indicating that the network must be a chain but for a chain with 4 nodes, α1= 3 and not 4.
Even if the last component of A is zero a counterexample is possible. Indeed, with N=5, we have (4,3,3,0) ⋟C=(4,3,2,1). Such a network must have at least one chain of length three (connecting four nodes). The fifth node must be connected to the second or the third node in the chain. Hence A must necessarily be (4,4,2,0) and cannot be (4,3,3,0). These examples lead to the open question of finding the conditions under which such an array A is the frequency array of the distances in a (connected) network.
6) From the above and the main theorem we see that max {Md; Md is the median distance in an N-node network} is strictly smaller than max {$\bar{d}: \bar{d}$ is the average distance of an N-node network}. Although Md ≤ $\bar{d}$ is not always true: a star with a center and N-1 rays (N>4) is an example (Md = 2 and $\bar{d}=\frac{2(N-1)}{N}<2$), we have that if the α – sequence of a network is decreasing then clearly Md ≤ $\bar{d}$. The reverse of this result does not hold in general. This is illustrated by G0 (N=7) in Figure 2 below. Its α–sequence is not decreasing, namely (6,7,6,2,0,0) but yet Md = 2 < $\bar{d}$ = 46/21 ≈ 2.19.
Figure 2. An example of a network, G0, with seven nodes.

4 A result about the average distance in a network

If N is fixed and the array A=(α1, α2,…,αN1) denotes the frequencies of the distances in a network, then $\frac{2}{N(N-1)} \sum_{i=1}^{N-1} i \alpha_{i}$ denotes the average distance between nodes in this network, say $\bar{d}$.
Theorem. If $A^{(1)}=\left(\alpha_{1}^{(1)}, \alpha_{2}^{(1)}, \ldots, \alpha_{N-1}^{(1)}\right) \succcurlyeq A^{(2)}=\left(\alpha_{1}^{(2)}, \alpha_{2}^{(2)}, \ldots, \alpha_{N-1}^{(2)}\right)$ then $\overline{d_{1}} \leq \overline{d_{2}}$.
Proof. As $A^{(1)}=\left(\alpha_{1}^{(1)}, \alpha_{2}^{(1)}, \ldots, \alpha_{N-1}^{(1)}\right) \succ A^{(2)}=\left(\alpha_{1}^{(2)}, \alpha_{2}^{(2)}, \ldots, \alpha_{N-1}^{(2)}\right)$, we know that
$\forall i=1, \ldots, N-2: \sum_{j=1}^{i} \alpha_{j}^{(1)} \geq \sum_{j=1}^{i} \alpha_{j}^{(2)} \text { and } \sum_{j=1}^{N-1} \alpha_{j}^{(1)}=\sum_{j=1}^{N-1} \alpha_{j}^{(2)}$. Consequently: ∀i=2,…,N–2:$ \sum_{j=i}^{N-1} \alpha_{j}^{(1)} \leq \sum_{j=i}^{N-1} \alpha_{j}^{(2)}$.
$\begin{array}{l} \text{Now,} \overline{d_{1}}=\frac{2}{N(N-1)} \sum_{j=1}^{N-1} j \alpha_{j}^{(1)} \\ =\frac{2}{N(N-1)}\left[\sum_{j=1}^{N-1} \alpha_{j}^{(1)}+\sum_{j=2}^{N-1} \alpha_{j}^{(1)}+\ldots+\sum_{j=N-1}^{N-1} \alpha_{j}^{(1)}\right] \\ \leq \frac{2}{N(N-1)}\left[\sum_{j=1}^{N-1} \alpha_{j}^{(2)}+\sum_{j=2}^{N-1} \alpha_{j}^{(2)}+\ldots+\sum_{j=N-1}^{N-1} \alpha_{j}^{(2)}\right] \\ =\frac{2}{N(N-1)} \sum_{j=1}^{N-1} j \alpha_{j}^{(2)}=\overline{d_{2}} \end{array}$
Corollary. It follows from the previous theorem that the average distance between nodes in an N-node network is at most equal to the average distance in an N-node chain, namely $\frac{(N+1)}{3}$ (see the appendix for the simple calculation of this value).
Remark. If G(A) denotes the Gini index of the array A of distance frequencies, we have
$G(A)=\frac{1}{N-1}(N-2 \bar{d})$
Hence, the Gini coefficient respects the extended majorization order. From (13) one can express $\bar{d}$ as a function of G(A):
$\bar{d}=\frac{N}{2}-G(A) \frac{N-1}{2}$
The previous theorem shows that the operation of taking the average distance in an N-node network respects the opposite of the Lorenz majorization order, while the Gini coefficient respects this order.

5 The median distance and its relation with the average distance in a chain

Assume that we have an N-node chain, hence containing N-1 links. Then its set of distances contains $\frac{(N-1) N}{2}$ numbers and the median, Md, is either a natural number m or m-0.5. Then we have
$1+2+\ldots+(N-m) \geq \frac{N(N-1)}{4}>1+2+\ldots+(N-m-1)$
As for each natural number j, we have $\sum_{k=1}^{N-j} k=\frac{(N-j)(N-j+1)}{2}$, we can prove that m = [x] with
$\frac{(N-x)(N-x+1)}{2}=\frac{N(N-1)}{4}$
from which it follows that $x=\frac{(2 N+1)-\sqrt{2 N^{2}-2 N+1}}{2}$ and hence Md is either [x]-05 or [x]. For N large this leads to
$M d \approx N\left(1-\frac{\sqrt{2}}{2}\right) \approx 0.293 N$
Consequently, $\lim _{N \rightarrow \infty} \frac{M d}{\bar{d}}=3\left(1-\frac{\sqrt{2}}{2}\right) \approx 0.879<1$.
Moreover, we see that Md < $\bar{d} \Leftrightarrow N+\frac{1}{2}-\frac{\sqrt{2}}{2} \sqrt{N^{2}-N+\frac{1}{2}}<\frac{N+1}{3} \Leftrightarrow$ N2–13N+4>0⇔N>12.7. Hence, in practice: N ≥ 13. Checking this manually for N = 2, …,14 we find that also then Md < $\bar{d}$ except for N = 2, 5, 8, and 11 in which cases Md = $\bar{d}$.

6 Returning to the example G0 shown in Fig.2

The array of G0 is (6,7,6,2,0,0). Figure 3 shows its extended Lorenz curve, situated between the extended Lorenz curve of K7 (the complete network on 7 nodes) and the extended Lorenz curve of the chain of length 6. The average distances are respectively equal to 1, 2.19, and 2.67; the medians are 1, 2, and 2; while the corresponding Gini coefficients are: 0.833, 0.437, and 0.278.

7 Conclusion

In this article, we introduced the study of the distance distribution of a network. We showed that the distance distribution in an undirected network majorizes the one of a chain and is always smaller (in the sense of majorization) than the distribution of the corresponding complete N-network. The Gini coefficient respects the majorization order for such distributions, while the average distance behaves oppositely. As a consequence, the average and median distances in any such network are smaller than those of a chain.
We intend to use these results in the study of small worlds and the so-called six degrees of separation property (work in preparation).

Acknowledgement

The author thanks his colleague Ronald Rousseau for useful discussions.

Appendix

Proof that the average distance in an N-node chain is $\frac{(N+1)}{3}$
The average distance in an N-node chain is equal to
$\begin{array}{l} \frac{2}{N(N-1)} \sum_{i=1}^{N-1} i(N-i)=\frac{2}{N(N-1)}\left(\frac{N^{2}(N-1)}{2}-\frac{(N-1) N(2 N-1)}{6}\right)= \\ \frac{2 N(N-1)}{2 N(N-1)}\left(N-\frac{2 N-1}{3}\right)=\frac{N+1}{3} \end{array}.$
[1]
Hardy, G.H., Littlewood, J.E., & Pólya, G. (1934). Inequalities. London: Cambridge University Press.

[2]
Lorenz, M.O. (1905). Methods of measuring concentration of wealth. Journal of the American Statistical Association, 9, 209-219.

[3]
Rousseau, R., Egghe, L., & Guns, R. (2018). Becoming Metric-Wise. A bibliometric guide for researchers. Kidlington: Chandos (Elsevier).

Outlines

/

京ICP备05002861号-43

Copyright © 2023 All rights reserved Journal of Data and Information Science

E-mail: jdis@mail.las.ac.cn Add:No.33, Beisihuan Xilu, Haidian District, Beijing 100190, China

Support by Beijing Magtech Co.ltd E-mail: support@magtech.com.cn