File:Heaps' Law on "War and Peace".svg

This is a file from the Wikimedia Commons
From Wikipedia, the free encyclopedia

Original file (SVG file, nominally 662 × 491 pixels, file size: 156 KB)

Summary

Description
English: Verification of Heaps' law on War and Peace.
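
For reference, the relation being tested can be written as below. The notation $V(n)$ for the number of distinct words among the first $n$ tokens is an assumption made here and is not used in the script; $K$ and $\beta$ correspond to the script's `K` and `beta`. Taking logarithms turns the power law into a straight line, which is why a degree-1 fit in log-log space recovers both parameters.

```latex
V(n) = K\,n^{\beta}
\quad\Longleftrightarrow\quad
\log V(n) = \log K + \beta \log n
```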

```python
import random
import urllib.request

import matplotlib.pyplot as plt
import nltk
import numpy as np

# Download the corpus
url = "http://www.gutenberg.org/files/2600/2600-0.txt"
response = urllib.request.urlopen(url)
long_txt = response.read().decode('utf8')

# Tokenize the text
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(long_txt.lower())
# Skip the first 940 tokens (presumably the Project Gutenberg front matter)
tokens = tokens[940:]

# Prepare arrays to hold the counts of total words and unique words
total_words = np.arange(1, len(tokens) + 1)
unique_words = np.zeros(len(tokens))

# Count unique words while progressing through the text
word_set = set()
for i, token in enumerate(tokens):
    word_set.add(token)
    unique_words[i] = len(word_set)

# Fit Heaps' law: unique_words = K * total_words ^ beta
log_total_words = np.log(total_words)
log_unique_words = np.log(unique_words)
beta, logK = np.polyfit(log_total_words, log_unique_words, 1)
K = np.exp(logK)

# Print the estimated parameters
print('K:', K)
print('beta:', beta)

# Plot total words vs. unique words
plt.figure(figsize=(8, 6))
plt.plot(total_words, unique_words, label='Empirical Data')
plt.plot(total_words, K * total_words ** beta, '--',
         label=f"Heaps' Law Fit: K={K:.2f}, beta={beta:.2f}")

# Repeat the analysis on the same tokens in random order
random.shuffle(tokens)

# Prepare arrays to hold the counts of total words and unique words
total_words = np.arange(1, len(tokens) + 1)
unique_words = np.zeros(len(tokens))

# Count unique words while progressing through the shuffled text
word_set = set()
for i, token in enumerate(tokens):
    word_set.add(token)
    unique_words[i] = len(word_set)

# Fit Heaps' law: unique_words = K * total_words ^ beta
log_total_words = np.log(total_words)
log_unique_words = np.log(unique_words)
beta, logK = np.polyfit(log_total_words, log_unique_words, 1)
K = np.exp(logK)

# Print the estimated parameters
print('K:', K)
print('beta:', beta)

# Plot total words vs. unique words for the shuffled data
plt.plot(total_words, unique_words, label='Shuffled Empirical Data')
plt.plot(total_words, K * total_words ** beta, '--',
         label=f"Heaps' Law Fit for shuffled data: K={K:.2f}, beta={beta:.2f}")

# Label and save the figure
plt.xlabel('Total Words')
plt.ylabel('Unique Words')
plt.legend()
plt.grid(True)
plt.title('Verification of Heaps\' Law on "War and Peace"')
plt.savefig("war and peace.svg", bbox_inches='tight', format='svg')
```
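
The counting and fitting steps of the script above can be checked offline with a small standard-library sketch. The function names `vocabulary_growth` and `fit_heaps` are hypothetical and do not appear in the original script. The closed-form least-squares fit used here is a stand-in for `np.polyfit(..., 1)`, and the toy input is synthetic rather than taken from the novel.

```python
import math


def vocabulary_growth(tokens):
    """Return a list whose i-th entry is the number of distinct tokens in tokens[:i + 1]."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def fit_heaps(growth):
    """Fit log(growth) = log(K) + beta * log(n) by ordinary least squares.

    Needs at least two points. Returns (K, beta).
    """
    xs = [math.log(n) for n in range(1, len(growth) + 1)]
    ys = [math.log(v) for v in growth]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    log_k = mean_y - beta * mean_x
    return math.exp(log_k), beta


# Synthetic sanity check: every token is new, so the vocabulary size equals n
# and the fit should recover K close to 1 and beta close to 1.
toy_tokens = [f"w{i}" for i in range(1000)]
K, beta = fit_heaps(vocabulary_growth(toy_tokens))
print(f"K={K:.3f}, beta={beta:.3f}")
```

Real text repeats words, so its fitted `beta` falls below 1. The all-distinct input is only a boundary case that makes the expected answer obvious.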
Date
Source Own work
Author Cosmia Nebula

Licensing

I, the copyright holder of this work, hereby publish it under the following license:
This file is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.
You are free:
  • to share – to copy, distribute and transmit the work
  • to remix – to adapt the work
Under the following conditions:
  • attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • share alike – If you remix, transform, or build upon the material, you must distribute your contributions under the same or compatible license as the original.

Items portrayed in this file

depicts

18 July 2023

File history


| Version | Date/Time | Dimensions | User | Comment |
|---|---|---|---|---|
| current | 22:54, 18 July 2023 | 662 × 491 (156 KB) | Cosmia Nebula | Uploaded while editing "Heaps' law" on en.wikipedia.org |