language

The Readability of Blogs

You need 11.9 years of formal education to easily understand this site. Well, that's if you believe a readability test called the Gunning-Fog Index. The Gunning-Fog Index is basically an algorithm that analyzes text for sentence length, syllables per word, and word complexity. After crunching the numbers it comes up with a readability score that is supposed to predict how easily people will be able to digest the text. The Wikipedia article for the Gunning-Fog Index mentions that comic book text typically has a score around six, Reader's Digest typically scores around eight, Newsweek scores around ten, and so on. This puts onfocus.com on par with the readability of Time magazine.
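For the curious, here's a rough sketch of the arithmetic behind the score, using the commonly published formula: 0.4 times the sum of the average sentence length and the percentage of "complex" words (three or more syllables). This is not the code behind any particular tool. The function names are made up for illustration, the syllable counter is a crude vowel-group heuristic, and the standard exclusions (proper nouns, compound words, common suffixes) are skipped, so expect the numbers to differ from what a real implementation reports.

```python
import re

def count_syllables(word):
    """Crude heuristic (an assumption, not a dictionary lookup):
    count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def gunning_fog(text):
    """Gunning-Fog Index: 0.4 * (words per sentence + percent complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return 0.0
    # Simplification: every word with 3+ syllables counts as complex.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + percent_complex)

if __name__ == "__main__":
    print(round(gunning_fog("The cat sat on the mat."), 2))
```

Short sentences built from short words score low. Long sentences packed with polysyllabic words push the score up quickly.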

The first time I ran into a demo of readability tests was at this page: Juicy Studio Readability Test. You can plug in a URL, and get back a Gunning-Fog Index score, and some other scores. I thought it was interesting and moved on. But for some reason it's been sticking in the back of my mind.

I'm bringing this up because I've been thinking quite a bit about the ways we measure blogs. And most of our measurement tools are fairly blunt. If you ask blog-measurement site Technorati what it "thinks" about your favorite blogs, you'll get machine answers like the number of inbound and outbound links. You'll get some info about traffic over time and Technorati's computed rank compared to other blogs. You'll see post-frequency and a list of common topics culled from RSS categories and Technorati tagging.

On the other hand, if I were to ask you some questions about your favorite blogs, you could probably tell me exactly why you like them. And it wouldn't have anything to do with inbound links or the other machine-based metrics. I'm guessing most of your answers would involve the writing style, tone, the topics the author covers, the fact that everyone else reads it, or maybe your personal relationship with the author.

You can't quantify something like tone, so you can't put computers to work analyzing tone. (I'd love to have a snark score for blogs.) But readability scores are a step toward a more human-style metric, and the scores can be crunched, analyzed, graphed, and averaged by computers. And I like the idea that the readability scores are lying there dormant within the sentences themselves, waiting to be tapped.

I'm not a linguist so I don't know how accurately these scores reflect readability. But I was interested enough in readability as a metric to do some digging around. A search on CPAN turned up the module Lingua::EN::Fathom, which accepts arbitrary text and returns the Gunning-Fog Index score along with several other scores, including the Flesch Reading Ease score and the Flesch-Kincaid grade level. I thought it might be fun to plug in the top ten or so English-language blogs as reported on Technorati popular to see if there's a "sweet spot" reading level among the most popular blogs. Of course many factors go into a blog's success, but I thought readability could be a reason some blogs hit the top of the tail and others don't. If nothing else, I figured I could find out if blog readers are more of a Reader's Digest sort of audience, or more of a Time magazine sort of audience.

So I cooked up a little Perl script that takes a list of RSS feeds, loops through the posts, strips out HTML, and calculates readability scores. If you want to run it yourself, you can grab the code here:

reading_levels.pl

In addition to the Lingua::EN::Fathom module, you'll need LWP::Simple for fetching feeds, XML::RSS::Parser for parsing them, and Math::Round::Var for rounding the scores. Add a list of feed URLs you want to analyze to the top of this file, and then run it on the command line, like this:

perl reading_levels.pl > reading_levels.txt

Once finished, the file reading_levels.txt will have a report with the individual reading levels for the sites, and an average for the group.

Caveats: this isn't a very robust feed parser, some feeds only have excerpts rather than full posts, and some feeds simply don't work with this script. I used the full feed posts if multiple feeds were available, and I skipped any sites that didn't parse.
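If you'd rather not install Perl modules, the same loop (parse a feed, pull out each post, strip the HTML, score the text, average the results) can be sketched in Python with only the standard library. To be clear, this is not reading_levels.pl. Every name here is hypothetical, the sample feed is made up, it reads the feed from a string instead of fetching a URL, and the scorer is a stand-in (plain words per sentence) that you'd swap for a real readability function.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join with spaces so adjacent paragraphs don't run together,
    # then collapse the extra whitespace.
    return " ".join(" ".join(parser.chunks).split())

def post_texts(rss_xml):
    """Return the plain text of every <item><description> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    return [strip_html(item.findtext("description") or "")
            for item in root.iter("item")]

def words_per_sentence(text):
    """Stand-in scorer (an assumption): swap in any text -> number function."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    return len(words) / len(sentences) if sentences else 0.0

def feed_report(rss_xml, score):
    """Score each non-empty post and return (per-post scores, average)."""
    scores = [score(text) for text in post_texts(rss_xml) if text]
    average = sum(scores) / len(scores) if scores else 0.0
    return scores, average

# Made-up sample feed; descriptions carry escaped HTML, as many feeds do.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><title>One</title><description>&lt;p&gt;Short post. Two sentences.&lt;/p&gt;</description></item>
<item><title>Two</title><description>&lt;p&gt;A &lt;b&gt;single&lt;/b&gt; longer sentence here.&lt;/p&gt;</description></item>
</channel></rss>"""

if __name__ == "__main__":
    scores, average = feed_report(SAMPLE_FEED, words_per_sentence)
    print(scores, average)
```

Passing the scorer in as an argument keeps the feed handling separate from the readability math, so the same loop can report several scores. The caveats above apply here too: excerpt-only feeds will skew the numbers, and feeds that aren't well-formed XML won't parse at all.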

So, what did I find? Well, here's the report for the top several English-language blogs as reported today by Technorati:

reading_levels.txt

(I skipped Post Secret because there's not much text to analyze.) The average Gunning-Fog Index score was off the "wide audience" charts at 14. That means the average person would need over 14 years of formal education to understand these blogs easily. The average Flesch Reading Ease score was 46.9 on a scale of 100, where higher scores mean easier reading. That's on par with state insurance form requirements. (Seriously!) And the average Flesch-Kincaid grade-level score was 11.8, which is high on the scale: roughly the level of a high school senior. The most "ideal" site for a wide audience was Daily Kos, with a low Flesch-Kincaid grade level (9.05) and an above-average Flesch Reading Ease score of 56.48.
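For reference, the two Flesch scores come from the same two ratios, words per sentence and syllables per word, weighted differently. The sketch below uses the commonly published constants. The function names are hypothetical and the syllable counter is a crude vowel-group heuristic, so it won't reproduce Lingua::EN::Fathom's numbers exactly.

```python
import re

def count_syllables(word):
    """Crude heuristic (an assumption): vowel groups minus a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def text_ratios(text):
    """Return (words per sentence, syllables per word) for a piece of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("need at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return len(words) / len(sentences), syllables / len(words)

def flesch_reading_ease(text):
    """Higher is easier; ordinary prose mostly lands between 0 and 100."""
    words_per_sentence, syllables_per_word = text_ratios(text)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

def flesch_kincaid_grade(text):
    """Approximate U.S. school grade level; higher is harder."""
    words_per_sentence, syllables_per_word = text_ratios(text)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

if __name__ == "__main__":
    sample = "The cat sat on the mat."
    print(round(flesch_reading_ease(sample), 2),
          round(flesch_kincaid_grade(sample), 2))
```

The two scores move in opposite directions. Longer sentences and longer words lower the Reading Ease score and raise the grade level, which is why a Reading Ease in the 40s sits alongside a grade level near 12.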

So, what does this mean? I have no idea. My prediction that the most popular blogs would have very good readability scores didn't quite hold up. I can't pinpoint a "sweet spot", but maybe blog readers enjoy more densely layered text. (Think Time instead of Newsweek, but not quite Harvard Law Review.) I might take a look at sentence length and percentage of complex words next and see how those measure up.

I still think measuring readability has promise. Earlier today Anil was talking about TL;DR syndrome, and I think the popular blogs capitalize on this with short, frequent posts. But I also wonder if text density plays a role. So in addition to saying, "too long; didn't read," I think there's the possibility of "too dense; didn't read." (Insert joke here.)
  • John Battelle has a great idea about storing data in info-privacy friendly countries. But I'd go a step further and say that big data stores should also store data in an encrypted format, so only someone with a key can make the data useful.
    filed under: privacy, law
  • put in some text, and see if this script can guess the author's gender based on word usage. (I was looking for a Perl module that does this, but no luck.)
    filed under: language, writing, psychology, gender

WOTD

Word of the day:
procrustean
adj.

Producing or designed to produce strict conformity by ruthless or arbitrary means.
[via]

WOTD

Word of the day:
echolalia
n.

Psychiatry. The immediate and involuntary repetition of words or phrases just spoken by others, often a symptom of autism or some types of schizophrenia.
[via]

common sense isn't

I think the phrase common sense should be phased out. I've been hearing it more and more, and I don't think it means anything. Everyone has their own sense of what counts as common sense on any particular topic. Someone can make an outrageous claim and call it common sense to give it legitimacy. Or someone can say they take a common sense approach to something without giving details about their position. Try searching for "common sense" across hot news topics and you'll find hundreds of results. Maybe we could graph the "common sense" index of various stories to see where the phrase is being abused. When someone uses the phrase, I think of it as a red flag code word meaning: more investigation required.

Or as Stephen Hawking put it when I heard him speak years ago: "Common sense is just another name for the prejudices we've been taught all our lives."