Culturomics

I have really fallen in love with Google Books Ngram Viewer, so I thought I’d do a little “culturomics” myself. Here’s an image I made using Google’s data:

Numbers in Print

The brightness of the pixel at position $(x,y)$ is related to how frequently “$x$” appears in books published in the year $y$. Specifically, if $p$ is the number of times “$x$” appears in print during year $y$, divided by the number of times any number less than 2100 appears in print during that year, then $(1 - p)^{1500}$ is the brightness of the pixel at $(x,y)$.
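As a concrete reading of this formula, here is a minimal Python sketch. The function names and the dictionary of per-year counts are hypothetical, and painting a year with no data white is an assumption; only the formula $(1 - p)^{1500}$ comes from the description above.

```python
def brightness(count_x, total_count, exponent=1500):
    """Brightness in [0, 1] of one pixel: (1 - p) ** exponent.

    count_x     -- times the number x appears in print in year y
    total_count -- times any number below 2100 appears in print in year y
    """
    if total_count <= 0:
        return 1.0  # assumption: a year with no data is drawn white
    p = count_x / total_count
    return (1.0 - p) ** exponent


def brightness_row(counts, exponent=1500):
    """Brightness for every number in one year.

    counts is a hypothetical dict {number: occurrences} that holds
    all numbers below 2100 for a single publication year.
    """
    total = sum(counts.values())
    return {x: brightness(c, total, exponent) for x, c in counts.items()}
```

With this exponent, a number that accounts for one in every 1500 printed numbers in a given year gets a brightness of about $1/e \approx 0.37$. Numbers much rarer than that stay nearly white, and numbers much more common go almost black.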

The dark, diagonal edge along the right-hand side appears because in year $y$ there are many published appearances of numbers near $y$.

Dark diagonal edge

World events have left their mark on the numbers appearing in books! For example, 1914 is still being talked about long after 1914, as evidenced by the darker line above 1914.

If we look at numbers just above 1000 and turn up the contrast a bit,

Around one thousand

we see an echo of the dark diagonal, from people writing (or more likely, the OCR software reading) zero instead of nine in the year. There’s a dark column for the Norman conquest in 1066; a number like $2^{10} = 1024$ was not so important until the 20th century.

If we look at numbers just above 1300,

Above 1300

we can see a diagonal line from the 1800s being read as 1300s, and a dark vertical line above 1453 (the “end” of the Middle Ages). In the 18th century,

Above 1700

1776 is quite visible. And finally, a puzzle:

Why 2044

Why was “2044” so significant until the 1920s?

2043,2044,2045 in Google ngrams viewer

I’d love to know the answer to this question. The only thing I can guess that might relate the year 1919 to the year 2044 is solar eclipses.

Clustering texts with an obvious grouping.

It was pointed out to me by Kenny Easwaran that I ought to try clustering texts that already have a natural grouping.

So I ran the clustering program on 15 texts written by three authors, and here is the result:

Clustering Jane Austen, Shakespeare, and Sir Arthur Conan Doyle.

The largest eigenvalue is 25 times bigger than the next largest eigenvalue, and its eigenvector picks out the author pretty well. The top pile consists of Jane Austen’s books (with Emma split into three volumes). The middle pile consists of Sir Arthur Conan Doyle’s books, with the Sherlock Holmes mysteries (Valley of Fear, Sign of Four, and Hound of the Baskervilles) grouped closer than the others. The bottom pile consists of five of Shakespeare’s plays.

Of course, these people are all pretty different. As requested below by Theo, let’s run it one more time, using 12 books from George Eliot, Jane Austen, and the Brontë sisters.

Three from the same period.

Well, that didn’t quite work. The books by the Brontë sisters (Wuthering Heights, Villette, The Professor, Jane Eyre) have been separated from the others, but George Eliot and Jane Austen are getting mixed together. Admittedly, if you just project to the y coordinate, the authors are sitting in disjoint intervals. Nevertheless, this isn’t as nice as I might hope; so let’s run it again, just on the eight books written by the two authors that aren’t being sufficiently separated:

Just two from the same period.

I suppose this is somewhat better, though it’s basically just a stretched-out and inverted version of the previous image. Jane Austen’s books (Sense and Sensibility, Pride and Prejudice, Mansfield Park, Emma) are all up on top, and George Eliot’s books still aren’t piled together.

You might have guessed that I have Project Gutenberg to thank for the text files (including the Shakespeare plays).
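The clustering program behind these pictures is not listed here. For anyone who wants to try something in the same spirit, the following is a minimal sketch built on assumptions of its own: each text becomes a vector of relative word frequencies, and the two leading eigenvectors of the centered Gram matrix (found by power iteration) give the plot coordinates. All function names are hypothetical, and this may well differ from the method that produced the images above.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Relative frequency of each lowercase word in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()} if total else {}


def centered_gram_matrix(texts):
    """Gram matrix of mean-centered word-frequency vectors, one row per text."""
    freqs = [word_frequencies(t) for t in texts]
    vocab = sorted(set().union(*freqs))
    rows = [[f.get(w, 0.0) for w in vocab] for f in freqs]
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered]
            for r1 in centered]


def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(matrix)
    # Non-uniform start: the all-ones vector lies in the null space of a
    # centered Gram matrix, so a uniform start would never move.
    v = [float(i + 1) for i in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
        eigenvalue = norm
    return eigenvalue, v


def embed(texts):
    """2-D coordinates per text, plus the two leading eigenvalues."""
    gram = centered_gram_matrix(texts)
    lam1, v1 = top_eigenpair(gram)
    # Deflate: subtract the first component, then find the next one.
    deflated = [[g - lam1 * a * b for g, b in zip(row, v1)]
                for row, a in zip(gram, v1)]
    lam2, v2 = top_eigenpair(deflated)
    coords = [(math.sqrt(lam1) * a, math.sqrt(lam2) * b)
              for a, b in zip(v1, v2)]
    return coords, (lam1, lam2)
```

Calling `embed` on a list of book texts returns one point per book. The ratio of the two returned eigenvalues is the kind of quantity compared above, and the sign of each eigenvector is arbitrary, so a plot may come out flipped from one run of books to another.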

Clustering Shakespeare.

I ran my clustering program (which I just ran on the New Testament) on Shakespeare’s plays—which were conveniently packaged into a text file by Open Source Shakespeare.

The result was the following graph:

Clustering of Shakespeare’s Plays

I know little about Shakespeare, so I can’t say too much about the above image. I’d love to know what you think: does this arrangement of his plays make any sense?

Given that modern processors are so good at vector and matrix calculations, I’m surprised that this sort of visualization tool doesn’t appear in more places. For instance,

  • Your blogs and email could be organized this way. Imagine lassoing a bunch of similar emails to reply to them all at once!
  • News could be organized into nice piles.
  • Your desktop and personal files could be arranged automatically into relevant piles.

Then again, maybe the idea of piles appeals to me more than most people—just look at how I organize the papers and books on my desk!