So, we're moving!

That's not the exclusive "we" of me and my meat-space family. I mean the inclusive "we," you and me, imagined reader. We're moving off of Blogger and over to a new WordPress-powered site at jonfwilkins.com

This is part of a Borg-like consolidation of my online presence. Over at the new site, you'll find not only the ongoing saga of this blog, but also other fun stuff, like my CV. (WOO!!)

So let us go then, you and I / and update our bookmarks in the sky / of the browser window, by which I / mean the bookmark bar. The Lost in Transcription blog will now live at http://jonfwilkins.com/blog/

If you subscribe to the feedburner feed at http://feeds.feedburner.com/LostInTranscription you should be fine, as the blog feed will still redirect there.

## Tuesday, March 26, 2013

## Friday, March 8, 2013

### What a Remarkable Paperback!!

So, guess what came in the mail yesterday? That's right! It's the paperback edition of *Remarkable*, by Lizzie K. Foley.


*Remarkable*, by Lizzie K. Foley. The hardback came out last April under the Dial imprint of Penguin. The paperback is through Puffin (also part of Penguin), and has a completely new cover. Here's a stack of them:

*Foreground: The nineteen best books ever written. Background: Our new kitchen wall color.*

And here's a close-up of the cover, so that you can really see the awesome cover art by Fernando Juarez, which has a bit of a Dr. Seuss-ey vibe:

Should you buy it? Yes! Why? Let me tell you!

Here are the pull quotes from just a few of the positive reviews *Remarkable* has received:

From the *New York Times*: "A lot of outlandish entertainment."

From *Booklist*: "A remarkable middle-grade gem."

From *Kirkus Reviews*: "A rich, unforgettable story that's quite simply — amazing."

The story centers on the town of Remarkable, where all of the residents are gifted, talented, and extraordinary. Everyone in the town is a world-class musician, or writer, or architect . . .

Except for Jane.

In fact, she is the only student in the entire town who attends the public school, rather than Remarkable's School for the Remarkably Gifted. But everything changes when the Grimlet Twins join her class and pirates arrive in town. Plus, there's a weather machine, a psychic pizza lady, a shy lake monster, and dentistry.

The book is both funny and thoughtful. You can enjoy it as a goofy adventure full of wacky characters and wordplay. It's for ages eight and up, but if you're a grown-up who likes kids' books at all, you'll find that there is a lot here to engage the adult reader.

Speaking of which, you can also read it as a subversive commentary on a culture that pushes children towards excellence rather than kindness and happiness. As Jane's Grandpa John says near the end of the book:

> The world is a wonderfully rich place, especially when you aren't trapped by thinking that you're only as worthwhile as your best attribute. . . . It's the problem with Remarkable, you know. . . . Everyone is so busy being talented, or special, or gifted, or wonderful at something that sometimes they forget to be happy.

Now, I know, you're thinking to yourself that you should take my endorsement with a grain of salt. After all, Lizzie Foley is my wife, and I can't be trusted to provide an honest, unbiased assessment of her book . . .

Or can I?

I'm gonna give you some straight talk on correlation versus causation. You might assume that I like this book because I'm married to the person who wrote it. You could not be more wrong. In fact, if I did not know Lizzie Foley, and I read this book, I would track her down and marry her.

So, yes, you should run out right now and get yourself a copy of this book. You should give it to your ten year old, or you should read it with your eight year old, or you should just curl up with it yourself. Just remember, she's already married. I'm looking at you, Ryan Gosling!

## Monday, March 4, 2013

### How Many English Tweets are Actually Possible?

So, recently (last week, maybe?), Randall Munroe, of xkcd fame, posted an answer to the question "How many unique English tweets are possible?" as part of his excellent "What If" series. He starts off by noting that there are 27 letters (including spaces), and a tweet length of 140 characters. This gives you 27^140 -- or about 10^200 -- possible strings.

Of course, most of these are not sensible English statements, and he goes on to estimate how many of these there are. This analysis is based on Shannon's estimate of the entropy rate for English -- about 1.1 bits per letter. This leads to a revised estimate of 2^(140 x 1.1) English tweets, or about 2 x 10^46. The rest of the post explains just what a hugely big number that is -- it's a very, very big number.
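Both counts are easy to check exactly with Python's arbitrary-precision integers (a quick sketch; 154 is just 140 x 1.1, rounded):

```python
# Raw count: 27 possible characters (26 letters plus space) in each of
# 140 positions.
raw = 27 ** 140
print(len(str(raw)) - 1)       # order of magnitude: 200

# Entropy-based count: 2^(140 x 1.1), i.e. about 2^154 "natural
# English" strings.
sensible = 2 ** 154
print(len(str(sensible)) - 1)  # order of magnitude: 46
```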

The problem is that this number is also wrong.

It's not that the calculations are wrong. It's that the entropy rate is the wrong basis for the calculation.

Let's start with what the entropy rate is. Basically: given a sequence of characters, how easy is it to predict what the next character will be? Or, how much information (in bits) is given by the next character, above and beyond the information you already had?

If the probability of a character being the i-th letter in the alphabet is p_i, the entropy of the next character is given by

-Σ p_i log_2 p_i

If all characters (26 letters plus space) were equally likely, the entropy of the character would be log_2 27, or about 4.75 bits. If some letters are more likely than others (as they are), it will be less. According to Shannon's original paper, the distribution of letter usage in English gives about 4.14 bits per character. (Note: Shannon's analysis excluded spaces.)
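As a sketch, that formula is a one-liner; for a uniform distribution over 27 characters it reproduces the log_2 27 figure (reproducing the 4.14-bit figure would require Shannon's empirical letter frequencies, which I haven't included here):

```python
from math import log2

def entropy(probs):
    """Shannon entropy, -sum(p_i * log2(p_i)), in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform = [1 / 27] * 27            # 26 letters plus space, equally likely
print(round(entropy(uniform), 2))  # 4.75 bits
```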

But, if you condition the probabilities on the preceding character, the entropy goes down. For example, if we know that the preceding character is a **b**, there are many letters that might follow, but the probability that the next character is a **c** or a **z** is less than it otherwise might have been, and the probability that the next character is a vowel goes up. If the preceding letter is a **q**, it is almost certain that the next character will be a **u**, and the entropy of that character will be low, close to zero, in fact.

When we go to three characters, the marginal entropy of the third character will go down further still. For example, **t** can be followed by a lot of letters, including another **t**. But, once you have two **t**s in a row, the next letter almost certainly won't be another **t**.

So, the more characters in the past you condition on, the more constrained the next character is. If I give you the sequence "The quick brown fox jumps over the lazy do_," it is possible that what follows is "cent at the Natural History Museum," but it is much more likely that the next letter is actually "g" (even without invoking the additional constraint that the phrase is a pangram). The idea is that, as you condition on longer and longer sequences, the marginal entropy of the next character asymptotically approaches some value, which has been estimated in various ways by various people at various times. Many of those estimates are in the ballpark of the 1.1 bits per character estimate that gives you 10^46 tweets.
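To make the conditioning idea concrete, here's a toy sketch (using a repeated pangram as a stand-in corpus, so the numbers say nothing about real English) showing that conditioning on the preceding character lowers the marginal entropy:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(counts):
    """Shannon entropy, in bits, of a distribution given as counts."""
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())

text = "the quick brown fox jumps over the lazy dog " * 100

# Entropy of a character, ignoring context.
h_marginal = entropy(Counter(text))

# Entropy of a character given the character before it: average the
# entropy of each "what follows x?" distribution, weighted by how
# often x occurs.
followers = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    followers[a][b] += 1
n_pairs = len(text) - 1
h_conditional = sum(
    sum(f.values()) / n_pairs * entropy(f) for f in followers.values()
)

print(h_marginal > h_conditional)  # True: conditioning reduces entropy
```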

So what's the problem?

The problem is that these entropy-rate measures are based on the relative frequencies of use and co-occurrence in some body of English-language text. The fact that some sequences of words occur more frequently than other, equally grammatical sequences of words, reduces the observed entropy rate. Thus, the entropy rate tells you something about the predictability of tweets drawn from natural English word sequences, but tells you less about the set of possible tweets.

That is, that 10^46 number is actually better understood as the reciprocal of the likelihood that two random tweets are identical, when both are drawn at random from 140-character sequences of natural English language. This will be the same as the number of *possible* tweets only if all possible tweets are equally likely.

Recall that the character following a **q** has very low entropy, since it is very likely to be a **u**. However, a quick check of Wikipedia's "List of English words containing Q not followed by U" page reveals that the next character could also be **space**, **a**, **d**, **e**, **f**, **h**, **i**, **r**, **s**, or **w**. This gives you eleven different characters that could follow **q**. The entropy rate gives you something like the "effective number of characters that can follow **q**," which is very close to one.
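That "effective number" is the perplexity, 2 raised to the entropy. A made-up distribution over the eleven possible q-followers (the probabilities below are invented for illustration, not measured) shows how a support of eleven characters can still have an effective number close to one:

```python
from math import log2

def effective_number(probs):
    """Perplexity: 2 raised to the Shannon entropy."""
    h = -sum(p * log2(p) for p in probs if p > 0)
    return 2 ** h

# Hypothetical: 'u' follows 'q' 99% of the time; the ten rare
# alternatives split the remaining 1% evenly.
q_followers = [0.99] + [0.001] * 10
print(len(q_followers))                         # 11 possible characters
print(round(effective_number(q_followers), 2))  # 1.08 -- close to one
```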

When we want to answer a question like "How many unique English tweets are possible?" we want to be thinking about the analog of the eleven number, not the analog of the very-close-to-one number.

So, what's the answer then?

Well, one way to approach this would be to move up to the level of the word. The OED has something like 170,000 entries, not counting archaic forms. The average English word is 4.5 characters long (5.5 including the trailing space). Let's be conservative, and say that a word takes up seven characters. This gives us up to twenty words to work with. If we assume that any sequence of English words works, we would have 4 x 10^104 possible tweets.
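The unconstrained word-level count is just 170,000 raised to the twentieth power:

```python
vocab = 170_000   # rough OED entry count, per the text
slots = 20        # 140 characters / 7 characters per word
total = vocab ** slots
print(f"{total:.0e}")  # 4e+104
```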

The xkcd calculation, based on an English entropy rate of 1.1 bits per character, predicts only 10^46 distinct tweets. 10^46 is a big number, but 10^104 is a much, much bigger number, bigger than 10^46 squared, in fact.

If we impose some sort of grammatical constraints, we might assume that not every word can follow every other word and still make sense. Now, one can argue that the constraint of "making sense" is a weak one in the specific context of Twitter (see, e.g., Horse ebooks), so this will be quite a conservative correction. Let's say the first word can be any of the 170,000, and each of the following zero to nineteen words is constrained to 20% of the total (34,000). This gives us 2 x 10^91 possible tweets.
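And the grammatically constrained version, using that 20% figure:

```python
first_word = 170_000
later_words = 34_000   # 20% of the vocabulary
total = first_word * later_words ** 19
print(f"{total:.0e}")  # 2e+91
```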

That's less than 10^46 squared, but just barely.

10^91 is 100 billion times the estimated number of atoms in the observable universe.

By comparison, 10^46 is teeny tiny. 10^46 is only one ten-thousandth of the number of atoms in the Earth.

In fact, for random sequences of six-letter words (seven characters including the trailing space) to total only 10^46 tweets, we would have to restrict ourselves to a vocabulary of just 200 words.
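That 200-word figure falls out of solving v^20 = 10^46 for the vocabulary size v:

```python
v = 10 ** (46 / 20)   # vocabulary size for 20 word slots, 10^46 tweets
print(round(v))       # 200
```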

So, while 10^46 is a big number, large even in comparison to the expected waiting time for a Cubs World Series win, it actually pales in comparison to the combinatorial potential of Twitter.

One final example. Consider the opening of Endymion by John Keats: "A thing of beauty is a joy for ever: / Its loveliness increases; it will never / Pass into nothingness;" 18 words, 103 characters. Preserving this sentence structure, imagine swapping out various words, Mad-Libs style, introducing alternative nouns for *thing*, *beauty*, *loveliness*, and *nothingness*, alternative verbs for *is*, *increases*, and *will / pass*, alternative prepositions for *of* and *into*, and alternative adverbs for *for ever* and *never*.

Given 10000 nouns, 100 prepositions, 10000 verbs, and 1000 adverbs, we can construct 10^38 different tweets without even altering the grammatical structure. Tweets like "A jar of butter eats a button quickly: / Its perspicacity eludes; it can easily / swim through Babylon;"

That's without using any adjectives. Add three adjective slots, with a panel of 1000 adjectives, and you get to 10^47 -- just riffing on Endymion.
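The Mad-Libs arithmetic is just the product of the options per slot (the slot counts below are read off the Keats lines: four nouns, two prepositions, three verbs, two adverbs):

```python
nouns, preps, verbs, advs, adjs = 10_000, 100, 10_000, 1_000, 1_000

no_adjectives = nouns**4 * preps**2 * verbs**3 * advs**2
print(len(str(no_adjectives)) - 1)    # 38

with_adjectives = no_adjectives * adjs**3  # three adjective slots
print(len(str(with_adjectives)) - 1)  # 47
```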

So tweet on, my friends.

Tweet on.

C. E. Shannon (1951). "Prediction and Entropy of Printed English." *Bell System Technical Journal*, 30, 50-64.

