Do Copied Citations Create Renowned Papers?

The following is an article from The Annals of Improbable Research.

(Image credit: Flickr user Magnus Halsnes)

by M.V. Simkin and V.P. Roychowdhury
Department of Electrical Engineering, University of California, Los Angeles

Recently we discovered [see cond-mat/0212043] that the majority of citations in scientific papers are simply copied from the lists of references that appear in other papers. Here we show that a model in which a scientist picks three random papers, cites them, and also copies a quarter of their references accounts quantitatively for the empirically observed citation distribution. Simple mathematical probability, not genius, can explain why some papers are cited a lot more than others.

Greatness? Or Just Simple Probability?
During the “Manhattan Project” (in which scientists created the first nuclear bomb), Enrico Fermi, the physicist, asked General Groves, the head of the project: “What is the definition of a ‘great’ general?”¹ Groves replied that any general who had won five battles in a row might safely be called great. Fermi then asked how many generals are great. Groves said about three out of every hundred. Fermi conjectured that, considering that opposing forces for most battles are roughly equal in strength, the chance of winning one battle is 1/2, and the chance of winning five battles in a row is 1/2⁵=1/32.

“So you are right, General,” said Enrico Fermi. “About three out of every hundred. Mathematical probability, not genius.”

General Groves pinning a medal on Enrico Fermi. (image source: Atomic Heritage Foundation)

The existence of military genius was also questioned on basic philosophical grounds by Tolstoy.²

Greatness in Science: Your Papers Are Cited a Lot
A commonly accepted measure of “greatness” for scientists is the number of times other people cite their papers.³ For example, SPIRES, the High-Energy Physics literature database, divides papers into six categories according to the number of citations they receive. The top category, “Renowned Papers,” lists those with 500 or more citations.

Let us have a look at the citations to roughly 24 thousand papers published in Physical Review D in 1975-1994.⁴ As of 1997 there were about 350 thousand such citations: fifteen per published paper on average. However, forty-four papers were cited five hundred times or more. Could this happen if all papers are created equal? If they indeed are, then the chance that any given citation goes to a particular paper is one in 24,000.

What is the chance of being cited, purely at random, 500 times out of 350,000? The calculation for this is slightly more complex. (See Appendix A for details.) The answer is one in -- or, in other words, it is zero. One is tempted to conclude that those forty-four papers, which achieved the impossible, must be great.
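For readers who want to check the order of magnitude themselves, here is a minimal sketch. It assumes each of the 350,000 citations lands independently and uniformly on one of the 24,000 papers, so the count for a given paper is binomial; the function name is ours, chosen for illustration. The probability is far too small for ordinary floating point, so the sketch works with logarithms.

```python
import math

def log10_binomial_pmf(k, n, p):
    """log10 of P(X = k) for X ~ Binomial(n, p), via log-gamma to avoid underflow."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p)
        + (n - k) * math.log1p(-p)
    )
    return log_pmf / math.log(10)

n_citations = 350_000   # total citations, from the text
n_papers = 24_000       # total papers, from the text
k = 500                 # threshold for a "renowned" paper

# Assumption: all papers are equal, so each citation hits a given paper with chance 1/n_papers.
log10_p = log10_binomial_pmf(k, n_citations, 1 / n_papers)
print(f"log10 P(exactly {k} citations) = {log10_p:.1f}")
```

The printed exponent is negative and hundreds of orders of magnitude below zero. The chance of 500 or more citations is dominated by this first term, so for practical purposes it is the same kind of zero.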

Copying Those Who Copy Those Who Copy...
Recently we discovered⁵ that it is very common for scientists, when writing their own list of citations, to copy many of those citations from the lists of references in other papers. In fact, we now know that this copying is a major component of the citation dynamics in scientific publication.

In this way, a paper that already is cited in one place automatically becomes likely to be cited again in another place. And after it is cited in a second place, it is even more likely to be cited in the future in still more places. In other words, “unto every one that hath shall be given, and he shall have abundance.”⁶,⁷

This phenomenon is so well known that it has several names: “the Matthew effect,” ⁶ “cumulative advantage,” ⁸ and “preferential attachment.” ⁹

Let’s Look at the Numbers
The effect of citation copying on the probability distribution of citations can be quantitatively understood within the framework of the model of random-citing scientists (RCS), which we will now describe.

When a scientist is writing a manuscript, he picks m random articles (see Note A), cites them, and also copies some of their references, each with probability P.
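The rule above is simple enough to simulate directly. The following is a rough sketch, not the authors' code: the function name `simulate_rcs`, the choice to start from m uncited seed papers with empty reference lists, and the decision to count a reference only once when it is reached by two routes are all assumptions made for illustration.

```python
import random

def simulate_rcs(n_papers, m=3, p=0.25, seed=0):
    """Simulate random-citing scientists; return the citation count of each paper."""
    rng = random.Random(seed)
    # Assumption: the process starts from m seed papers that cite nothing.
    refs = [[] for _ in range(m)]
    citations = [0] * m
    for new in range(m, n_papers):
        cited = set()
        # Pick m distinct existing papers at random and cite them.
        for src in rng.sample(range(new), m):
            cited.add(src)
            # Copy each of that paper's references with probability p.
            for ref in refs[src]:
                if rng.random() < p:
                    cited.add(ref)
        # Assumption: a paper reached twice is still cited only once.
        for c in cited:
            citations[c] += 1
        refs.append(sorted(cited))
        citations.append(0)
    return citations

citations = simulate_rcs(2000)
mean = sum(citations) / len(citations)
print(f"mean citations per paper: {mean:.1f}, maximum: {max(citations)}")
```

Even in a small run like this, a handful of early papers collect many times the average number of citations, although every paper was treated identically by the rule.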

This model resembles a couple of other models⁸,⁹,¹³,¹⁴ (see Appendix B for the key differences, and Note B), and can be easily solved using methods developed to deal with multiplicative stochastic processes.⁸,¹⁴

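The evolution of the citation distribution (here N_K denotes the number of papers that were cited K times, and N is the total number of papers) is described by the following rate equations:

$$\frac{dN_0}{dN} = 1 - m\,\frac{N_0}{N}, \qquad \frac{dN_K}{dN} = m\,\frac{\bigl(1+P(K-1)\bigr)N_{K-1} - (1+PK)\,N_K}{N}, \qquad (1)$$

which have the following stationary solution:

$$N_0 = \frac{N}{m+1}, \qquad N_K = \frac{1+P(K-1)}{1+1/m+PK}\,N_{K-1}. \qquad (2)$$

For large K it follows from (2) that:

$$N_K \sim \frac{1}{K^{\gamma}}, \qquad \gamma = 1 + \frac{1}{mP}. \qquad (3)$$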
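The large-K step can be spelled out as follows. This is a sketch under the assumption that the stationary recurrence takes the form implied by the model as described, in which a paper with K citations gains a new one at rate m(1+PK)/N. The ratio of successive terms is then

$$\frac{N_K}{N_{K-1}} = \frac{1+P(K-1)}{1+1/m+PK} = 1 - \frac{P + 1/m}{1 + 1/m + PK} \approx 1 - \frac{1 + 1/(mP)}{K} \qquad (K \gg 1).$$

A power law $N_K \sim K^{-\gamma}$ has $N_K/N_{K-1} = (1 - 1/K)^{\gamma} \approx 1 - \gamma/K$, so matching the two expressions gives $\gamma = 1 + 1/(mP)$. For m=3 and P=1/4 this is $\gamma = 7/3 \approx 2.33$.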

The citation distribution therefore follows a power law, as has been observed empirically.¹⁰,¹¹,¹²

A good agreement between the RCS model and actual citation data⁴ is achieved with input parameters m=3 and P=1/4 (see Figure 1).

Figure 1. Outcome of the model of random-citing scientists (with m=3 and P=1/4) compared to actual citation data. Mathematical probability rather than genius can explain why some papers are cited a lot more than others.

Now what is the probability for an arbitrary paper to become “renowned”, i.e. to receive more than five hundred citations? Iteration of Eq. 2 (with m=3 and P=1/4) shows that this probability is one in 600. This means that about 40 out of 24,000 papers should be renowned, close to the forty-four actually observed. Ergo, greatness is a matter of mathematical probability, not genius.
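The iteration is short enough to sketch in a few lines. The code below assumes the stationary solution has the form N_0 = N/(m+1) and N_K = N_{K-1}(1+P(K-1))/(1+1/m+PK), which is what the model's rule implies; the function name is hypothetical. It accumulates the fraction of papers with fewer than 500 citations and takes the complement.

```python
def renowned_probability(threshold=500, m=3, p=0.25):
    """Fraction of papers with at least `threshold` citations under the assumed recurrence."""
    n_k = 1.0 / (m + 1)   # fraction of papers with zero citations
    below = n_k           # running fraction of papers with fewer than `threshold` citations
    for k in range(1, threshold):
        # Step the recurrence from K-1 to K citations.
        n_k *= (1 + p * (k - 1)) / (1 + 1 / m + p * k)
        below += n_k
    # The fractions sum to one over all K (this requires m*p < 1), so the tail is the complement.
    return 1.0 - below

prob = renowned_probability()
print(f"P(at least 500 citations) = {prob:.5f}, about one in {1 / prob:.0f}")
print(f"expected renowned papers among 24,000: {24_000 * prob:.0f}")
```

Under these assumptions the result should land near the one-in-600 figure quoted above, and the threshold, m, and p can be varied to see how sensitive that figure is.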

Greatness and Blasphemy
On one occasion,¹⁵ Napoleon (see Note C) said to Laplace, “They tell me you have written this large book on the system of the universe, and have never even mentioned its Creator.” The reply was “I have no need for this hypothesis.” Now, Laplace was not against God. He simply did not need to postulate his existence in order to explain existing astronomical data. Similarly, the present work is not blasphemy. Of course, in some spiritual sense, great scientists do exist. It is just that even if they did not exist, citation data would look the same.

Notes
A. The analysis presented here also applies to a more general case when m is not a constant, but a random variable. In that case m in all of the equations that follow should be interpreted as the mean value of this variable.

B. These models, though introduced prior to the RCS, are more complicated and difficult to understand for a non-expert reader. This is why discussion of them has been moved to Appendix B.

C. Incidentally, Napoleon was the military commander whose genius was questioned in Reference [2].

D. Additional support for the plausibility of this conclusion comes from the findings of Ref. [5] that a few citation slips repeat dozens of times, while most appear just once. Certainly no misprint is more seminal than any other.

References
1. See e.g. Out of the Crisis, W.E. Deming, MIT, Cambridge, 1986.

2. War and Peace, L. Tolstoy.

3. Citation Indexing, E. Garfield, John Wiley, New York, 1979.

4. SPIRES data, compiled by H. Galic, and made available by S. Redner.

5. “Read Before You Cite!”, M.V. Simkin and V.P. Roychowdhury, Complex Systems, vol. 14, 2003, p. 269.

6. “The Matthew Effect in Science”, R. K. Merton, Science, vol. 159, 1968, p. 56.

7. Gospel According to Matthew, 25:29.

8. “A General Theory of Bibliometric and Other Cumulative Advantage Processes”, D. de S. Price, Journal of the American Society for Information Science, vol. 27, 1976, p. 292.

9. “Emergence of Scaling in Random Networks”, A.-L. Barabasi and R. Albert, Science, vol. 286, 1999, p. 509.

10. “Networks of Scientific Papers”, D. de S. Price, Science, vol. 149, 1965, p. 510.

11. “Citations and Zipf-Mandelbrot Law”, Z.K. Silagadze, Complex Systems, vol. 11, 1997, p. 487.

12. “How Popular is Your Paper? An Empirical Study of Citation Distribution”, S. Redner, European Physical Journal B, vol. 4, 1998, p. 131.

13. “Knowing a Network by Walking on It: Emergence of Scaling”, A. Vazquez, Europhysics Letters, vol. 54, 2001, p. 430.

14. “Organization of Growing Random Networks”, P.L. Krapivsky and S. Redner, Physical Review E, vol. 63, no. 066123, 2001.

15. A Budget of Paradoxes, A. de Morgan, The Open Court Publishing Co., Chicago, 1915. See vol. 2, p.1.

(Image credit: Flickr user Nic McPhee)

Appendix A
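If one assumes that all papers are created equal, then the probability of winning M out of n possible citations when the total number of cited papers is N is given by the binomial distribution:

$$\mathrm{Prob}(M) = \binom{n}{M}\left(\frac{1}{N}\right)^{M}\left(1-\frac{1}{N}\right)^{n-M}.$$

For the case considered in the main text, M=500, n=350,000, and N=24,000.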

Appendix B
In the model introduced by Vazquez [13] a scientist does a recursive bibliography search. Specifically, when he is writing a manuscript, he picks a paper, cites it, follows its references, and cites a fraction P of them. Afterwards he repeats this procedure with each of the papers that he cited. And so on.

In two limiting cases (P=1 and P=0) the Vazquez model is exactly solvable [13]. In these cases it is also identical to the RCS model (m=1 case), which in contrast can be solved for any P.

Though theoretically interesting, the Vazquez model cannot be a realistic description of the citation process. In fact, the results of Ref. [5] indicate that there is essentially just one “recursion”, that is, references are copied from the paper at hand, but hardly followed. To be precise, the results of Ref. [5] could support a generalized Vazquez model, in which the references of the paper at hand are copied with probability P, and afterwards the copied references are followed with probability R (the “reading” probability introduced in Ref. [5]). However, given the low value of this probability (R=0.2 according to Ref. [5]), it is clear that the effect of secondary recursions on the citation distribution is negligible.

For P<<1 the effects of second and higher order recursions are negligible even in the original Vazquez model, and it becomes essentially identical to the RCS model. As we find a power law distribution for all non-zero P (see Eq. (3)), this casts doubt on the claim made in [13] that there is a phase transition from a power law to an exponential distribution around a critical value of P.

An interesting observation is that in the Vazquez model, when P=1, the in-component [14] essentially becomes the in-degree. This is why Eq. 6 of [13] is identical to Eq. 59 of [14].

Refs. [8] and [9] were also able to explain the power law observed [10], [11], [12] in the distribution of citations, by postulating that the probability of a paper being cited is somehow proportional to the number of citations it has already received (no mechanism for this was proposed).

_____________________

This article is republished with permission from the January-February 2005 issue of the Annals of Improbable Research. You can purchase back issues of the magazine or subscribe to receive future issues, in printed or in ebook form. Or get a subscription for someone as a gift! Visit their website for more research that makes people LAUGH and then THINK.
