Zipf's law (/zɪf/; German pronunciation: [tsɪpf]) is an empirical law stating that when a list of measured values is sorted in decreasing order, the value of the n-th entry is often approximately inversely proportional to n.
The best-known instance of Zipf's law applies to the frequency table of words in a text or corpus of natural language:
It is usually found that the most common word occurs approximately twice as often as the next most common one, three times as often as the third most common, and so on. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852).[2]
Zipf's law is often used in the following form, called the Zipf-Mandelbrot law:
$$\text{frequency} \propto \frac{1}{(\text{rank} + b)^{a}}$$
where $a$ and $b$ are fitted parameters, with $a \approx 1$ and $b \approx 2.7$.[1]
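As a rough illustration, the Brown Corpus counts quoted above can be compared with what an inverse-rank rule would predict. The following is a minimal Python sketch, not a fitting procedure: the function names, and the parameter names `a` (exponent) and `b` (rank offset), are chosen here for illustration, and the predictions are anchored to the observed count of the top-ranked word. Setting `a = 1` and `b = 0` gives the plain inverse-rank form.

```python
# Counts for the three most frequent Brown Corpus words, as quoted in the text.
brown_counts = {"the": 69_971, "of": 36_411, "and": 28_852}


def zipf_mandelbrot_weight(rank, a=1.0, b=0.0):
    """Unnormalized weight 1 / (rank + b)**a; a=1, b=0 is plain Zipf."""
    return 1.0 / (rank + b) ** a


def predicted_counts(top_count, n_ranks, a=1.0, b=0.0):
    """Predict counts for ranks 1..n_ranks, scaled so rank 1 equals top_count."""
    scale = top_count / zipf_mandelbrot_weight(1, a, b)
    return [scale * zipf_mandelbrot_weight(r, a, b) for r in range(1, n_ranks + 1)]


observed = list(brown_counts.values())
plain_zipf = predicted_counts(observed[0], len(observed))

for word, obs, pred in zip(brown_counts, observed, plain_zipf):
    print(f"{word:>4}: observed {obs:>6}, plain Zipf predicts {pred:>9.1f}")
```

Passing other values of `a` and `b` to `predicted_counts` shows how the offset flattens the predicted curve at the lowest ranks. With only three data points this is a comparison, not evidence for any particular parameter values.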
This law is named after the American linguist George Kingsley Zipf,[3][4][5] and is still an important concept in quantitative linguistics. It has been found to apply to many other types of data studied in the physical and social sciences.
In mathematical statistics, the concept has been formalized as the Zipfian distribution: a family of related discrete probability distributions whose rank-frequency distribution is an inverse power-law relation. These distributions are related to Benford's law and the Pareto distribution.
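To make the inverse power-law relation concrete, one common member of this family assigns rank k among N items a probability proportional to 1/k^s, normalized by the sum of those weights. The sketch below assumes that finite form; the function name `zipf_pmf` and the parameter names `n_items` and `s` are illustrative, not taken from the text.

```python
def zipf_pmf(n_items, s=1.0):
    """Probabilities for ranks 1..n_items, each proportional to 1 / rank**s."""
    weights = [1.0 / k ** s for k in range(1, n_items + 1)]
    total = sum(weights)  # normalizing constant (a generalized harmonic number)
    return [w / total for w in weights]


pmf = zipf_pmf(10)

# With s = 1, rank 1 is twice as probable as rank 2 and three times rank 3.
ratio_1_to_2 = pmf[0] / pmf[1]
ratio_1_to_3 = pmf[0] / pmf[2]
print(ratio_1_to_2, ratio_1_to_3)
```

Because every weight is divided by the same total, the probabilities sum to 1 while the ratios between ranks depend only on `s`.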
Some sets of time-dependent empirical data deviate somewhat from Zipf's law. Such empirical distributions are said to be quasi-Zipfian.