• Flash upload progress thingy for web apps. [via mathowie]
    filed under: design, development, flash, programming
  • a quick, straightforward explanation of data portability and why companies like Google should support it. [via battelle]
    filed under: amazon, google, internet, privacy
  • Flickr applies for a patent on "interestingness" as a way of determining which media objects are getting the most attention from users. [via kottke]
    filed under: flickr, future, law, tagging

The Readability of Blogs

You need 11.9 years of formal education to easily understand this site. Well, that's if you believe a readability test called the Gunning-Fog Index. The Gunning-Fog Index is basically an algorithm that analyzes text for sentence length, syllables per word, and word complexity. After crunching the numbers it comes up with a readability score that is supposed to predict how easily people will be able to digest the text. The Wikipedia article for the Gunning-Fog Index mentions that comic book text typically has a score around six, Reader's Digest typically scores around eight, Newsweek scores around ten, and so on. This puts onfocus.com on par with the readability of Time magazine.
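
If you're curious, the Gunning-Fog formula itself is pretty simple. Here's a rough sketch of the standard formula in Perl; the variable names are mine, and "complex" words are usually defined as words with three or more syllables:

# The standard Gunning-Fog formula (variable names are mine):
# 0.4 * (average sentence length + percentage of complex words)
my $fog = 0.4 * ( ($num_words / $num_sentences)
                + 100 * ($num_complex_words / $num_words) );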

The first time I ran into a demo of readability tests was at this page: Juicy Studio Readability Test. You can plug in a URL and get back a Gunning-Fog Index score, along with some other readability scores. I thought it was interesting and moved on. But for some reason it's been sticking in the back of my mind.

I'm bringing this up because I've been thinking quite a bit about the ways we measure blogs. And most of our measurement tools are fairly blunt. If you ask blog-measurement site Technorati what it "thinks" about your favorite blogs, you'll get machine answers like the number of inbound and outbound links. You'll get some info about traffic over time and Technorati's computed rank compared to other blogs. You'll see post-frequency and a list of common topics culled from RSS categories and Technorati tagging.

On the other hand, if I were to ask you some questions about your favorite blogs, you could probably tell me exactly why you like them. And it wouldn't have anything to do with inbound links or the other machine-based metrics. I'm guessing most of your answers would involve the writing style, tone, the topics the author covers, the fact that everyone else reads it, or maybe your personal relationship with the author.

You can't quantify something like tone, so you can't put computers to work analyzing tone. (I'd love to have a snark score for blogs.) But readability scores are a step toward a more human-style metric, and the scores can be crunched, analyzed, graphed, and averaged by computers. And I like the idea that the readability scores are lying there dormant within the sentences themselves, waiting to be tapped.

I'm not a linguist, so I don't know how accurately these scores reflect readability. But I was interested enough in readability as a metric to do some digging around. A search on CPAN turned up the module Lingua::EN::Fathom, which accepts arbitrary text and returns the Gunning-Fog Index score along with several other scores, including the Flesch Reading Ease score and the Flesch-Kincaid grade level. I thought it might be fun to plug in the top ten or so English-language blogs as reported on Technorati popular to see if there's a "sweet spot" reading level among the most popular blogs. Of course many factors go into a blog's success, but I thought readability could be a reason some blogs hit the top of the tail and others don't. If nothing else, I figured I could find out if blog readers are more of a Reader's Digest sort of audience, or more of a Time magazine sort of audience.
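
The Fathom module itself is simple to use. Here's a minimal sketch of scoring a chunk of text, assuming the analyse_block, fog, flesch, and kincaid methods from the Lingua::EN::Fathom documentation:

use Lingua::EN::Fathom;

# $post_text is a plain-text chunk of writing you want to score.
my $fathom = Lingua::EN::Fathom->new();
$fathom->analyse_block($post_text);

my $fog     = $fathom->fog;       # Gunning-Fog Index
my $flesch  = $fathom->flesch;    # Flesch Reading Ease
my $kincaid = $fathom->kincaid;   # Flesch-Kincaid grade level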

So I cooked up a little Perl script that takes a list of RSS feeds, loops through the posts, strips out HTML, and calculates readability scores. If you want to run it yourself, you can grab the code here:

reading_levels.pl

In addition to the Lingua::EN::Fathom module, you'll need LWP::Simple for fetching feeds, XML::RSS::Parser for parsing them, and Math::Round::Var for rounding the scores. Add a list of feed URLs you want to analyze to the top of this file, and then run it on the command line, like this:

perl reading_levels.pl > reading_levels.txt

Once finished, the file reading_levels.txt will have a report with the individual reading levels for the sites, and an average for the group.

Caveats: this isn't a very robust feed parser, some feeds only have excerpts rather than full posts, and some feeds simply don't work with this script. I used the full feed posts if multiple feeds were available, and I skipped any sites that didn't parse.
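
For reference, the heart of the script is just a loop over feeds and posts. Here's a stripped-down sketch of the approach, not the actual reading_levels.pl; the query and text_content calls assume XML::RSS::Parser's interface and may need adjusting for the version you have installed:

use LWP::Simple;
use XML::RSS::Parser;
use Lingua::EN::Fathom;

my @feeds = ( 'http://www.example.com/index.xml' );   # your feed URLs here

my $parser = XML::RSS::Parser->new;
my $fathom = Lingua::EN::Fathom->new;

foreach my $url (@feeds) {
    my $xml  = get($url)                   or next;   # skip feeds that won't fetch
    my $feed = $parser->parse_string($xml) or next;   # skip feeds that won't parse

    my $text = '';
    foreach my $item ( $feed->query('//item') ) {
        my $desc = $item->query('description') or next;
        my $post = $desc->text_content;
        $post =~ s/<[^>]+>//g;                         # crude HTML strip
        $text .= "$post\n";
    }
    next unless length $text;

    $fathom->analyse_block($text);
    printf "%s: fog %.2f, flesch %.2f, kincaid %.2f\n",
        $url, $fathom->fog, $fathom->flesch, $fathom->kincaid;
}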

So, what did I find? Well, here's the report for the top several English-language blogs as reported today by Technorati:

reading_levels.txt

(I skipped Post Secret because there's not much text to analyze.) The average Gunning-Fog Index score was off the "wide audience" charts at 14. That means the average person would need over 14 years of formal education to understand these blogs easily. The average Flesch Reading Ease score was 46.9, on a scale of 100. That's on par with state insurance form requirements. (Seriously!) And the average Flesch-Kincaid grade-level score was 11.8, meaning the text is appropriate for high school seniors, which is high on the scale. The most "ideal" site for a wide audience was Daily Kos, with a low Flesch-Kincaid grade level (9.05) and an above-average Flesch Reading Ease score of 56.48.

So, what does this mean? I have no idea. My prediction that the most popular blogs would have very good readability scores didn't quite hold up. I can't pinpoint a "sweet spot", but maybe blog readers enjoy more densely layered text. (Think Time instead of Newsweek, but not quite Harvard Law Review.) I might take a look at sentence length and percentage of complex words next and see how those measure up.

I still think measuring readability has promise. Earlier today Anil was talking about TL;DR syndrome, and I think the popular blogs capitalize on this with short, frequent posts. But I also wonder if text density plays a role. So in addition to saying, "too long; didn't read," I think there's the possibility of "too dense; didn't read." (Insert joke here.)
  • Six Apart's social blogging platform Vox was released to the public today. Its combination of public and private spaces based on friend networks makes it sorta like LiveJournal for the rest of us.
    filed under: weblogs, software
  • Using a bunch of Amazon metrics to track the popularity of game systems. [via AWS blog]
    filed under: visualization, amazon, hacks, webservices, games

Graphing for Mortals

A month or so ago I was at the Future of Web Apps conference listening to Cal Henderson talk about lessons he's learned from building Flickr. (You can snag audio of his talk and his PowerPoint slides at the FOWA site.) One of his slides mentioned graphing the hell out of everything so you can get a visual sense of what's happening with your application. He mentioned Cacti as a great app for visualization. I took a look at it, but it looked so complex that I dismissed it as a tool for large-scale apps.

I manage a very small setup with a couple of servers. But I've never been able to get a good "snapshot" of what's happening. I've been relying on server logs (analyzed with analog) and Google Analytics. Analog does nice text reporting, but isn't strong on graphs. And Google Analytics is always a couple days behind. My ISP doesn't offer bandwidth usage reports, and I've had to take their word on usage. So I completely understand Cal's point about getting a handle on what's happening on your servers.

When I got back from SF, I decided to bite the bullet and learn how to graph this stuff. And after a month or so, I finally have some nice graphs giving me a better sense of what's happening on my servers right now. I thought I'd share my personal crash course in graphing stuff, in case anyone else out there manages their own servers and doesn't have a web ops department.

Step 1: Learn RRDtool. Heh, well, at least get to the point where you understand what round-robin databases are and how to create graphs from that data. I knew I'd read about RRDtool somewhere before, and sure enough Hack #62 in Spidering Hacks is called Graphing Data with RRDtool. It showed how to graph the Amazon sales rank of a book over time. I followed the example, and then tweaked it a bit to track books I'm interested in. I came up with this:

Amazon Salesrank of pb Books graph
Lower is better on this graph.

So on a daily basis I can see how the books I've helped put together are doing on Amazon, and then step back and get the view over several weeks. (Not always a good thing.) I also put together graphs of individual books, plotted a couple books together, and generally learned how to control RRDtool graphs a little. Knowing a bit about RRDtool helps a ton once you get to step 3.

(And speaking of books, I really wanted a book about graphing with RRDtool at this point. It seems that whenever I want to tackle a technology that is new to me, I want to run away from my computer and sit down somewhere with a book—coming back to the computer armed with more info. There aren't any books specifically about RRDtool, and I think a great PDF that explains some of the high-level concepts would be an improvement over the current documentation that focuses on walk-throughs.)
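
If you want a taste of the raw plumbing before digging into the docs, here's roughly the shape of a create/update/graph cycle, written against the RRDs Perl bindings that ship with RRDtool. The file names, data source name, and intervals are mine for illustration, not the exact setup from the Spidering Hacks example:

use RRDs;

# Create a round-robin database expecting one sales rank reading per hour,
# keeping 1440 hourly averages (about 60 days of data).
RRDs::create( 'salesrank.rrd', '--step', '3600',
    'DS:rank:GAUGE:7200:1:U',
    'RRA:AVERAGE:0.5:1:1440' );
die RRDs::error if RRDs::error;

# Each time you scrape a new sales rank, feed it in ("N" means "now").
my $salesrank = 1234;   # in the real script this comes from scraping Amazon
RRDs::update( 'salesrank.rrd', "N:$salesrank" );
die RRDs::error if RRDs::error;

# Draw the last month of data as a simple line graph.
RRDs::graph( 'salesrank.png', '--start', '-30d',
    '--title', 'Amazon Sales Rank',
    'DEF:rank=salesrank.rrd:rank:AVERAGE',
    'LINE2:rank#0000FF:sales rank' );
die RRDs::error if RRDs::error;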

Step 2: Learn SNMP. Again, this is impossible to fully tackle in a few days, but you can get the gist of it fairly quickly. SNMP is a protocol for monitoring network equipment. I grabbed Net-SNMP and followed the tutorials for configuring it. I found the snmpwalk tool especially helpful for making sure everything was up and running properly.
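
As a sanity check once the agent is configured, a walk of the system tree looks something like this (swap in your own host and community string for "public"):

snmpwalk -v 2c -c public localhost system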

I should probably pick up Essential SNMP. I looked at it in a bookstore, and quickly scanned the chapter about RRDtool and Cricket. I realized that I should be using a front-end for creating graphs instead of hand-coding monster RRDtool command lines. Which led me to...

Step 3: Install Cacti. Cacti is a PHP/MySQL application for generating graphs with RRDtool and SNMP. You can use some built-in templates for tracking network usage, or create your own data sources with some simple scripts. Once you see how tedious it is to create your own RRDtool graphs, you'll appreciate how quickly you can build graphs with Cacti.

I'm running Cacti on Windows, and it took a while to get everything configured properly. Here are some essential tips that I picked up from the forums if you're in the same boat:
  • Enable the SNMP and sockets extensions in php.ini (see the sample lines after this list).
  • Disable strict mode in MySQL's my.ini (also shown below).
  • Use Cacti-approved builds of RRDtool.
  • Stroll through the Cacti db to get a sense of what's happening.
  • The Cacti log file is useful. Go there first if you're having a problem.
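For the first two items, the config lines look something like this. I'm writing these from memory of a Windows PHP 5 and MySQL 5 setup, so treat them as a starting point rather than the definitive settings:

; php.ini -- enable the Windows extension DLLs Cacti needs
extension=php_snmp.dll
extension=php_sockets.dll

# my.ini -- blank out sql-mode so MySQL isn't running in strict mode
sql-mode=""
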
The real power of Cacti is setting up your own data sources. I followed the detailed walk-through available in the manual—Simplest Method of Going from Script to Graph—and that has been the best way for me to get to know the application. Since then, I've created a few custom data sources that are tracking stuff at ORblogs.
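
The scripts themselves can be trivial: Cacti runs a command and graphs whatever number (or set of field:value pairs) it prints to standard output. Here's a hypothetical single-value example in Perl; the log file and what it counts are made up for illustration:

#!/usr/bin/perl
# Hypothetical Cacti data input script: print one number and exit.
# Cacti stores and graphs whatever shows up on stdout.

my $count = 0;
open my $log, '<', 'C:/logs/today.log' or exit;   # made-up log path
$count++ while <$log>;
close $log;

print "$count\n";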

There are also a bunch of data sources and graph templates that Cacti users share in the forums. Take a look at the post Complete List of Cacti Scripts and Templates to get a sense of what's available. I plugged in a WMI SQL Server monitor, and it just worked without much fiddling around.

I also recommend running the app in debug mode. As you create graphs, take a look at the RRDtool commands that generate the graphs. You'll start to get a better feel for RRDtool simply through osmosis. And if you took the time to learn about RRDtool in Step 1, seeing the raw commands helps you diagnose problems with your graphs.

So what I've ended up with after all of this work is a page filled with pretty graphs like this, giving me a look at what's happening on my servers and my sites in real time:

Cacti screenshot
These graphs have been scrubbed a bit.

I'm just beginning my data visualization journey, but I can already tell this is going to help me make decisions going forward. (Thanks, Cal!) There's something about seeing information in a graph that makes it more concrete than numbers flowing by in a log. This step into the arcane world of network graphing already has me thinking about the real world differently. I'm walking around looking at things thinking, "I could graph that!"
  • a hyperlocal blog/news aggregator headed by Steven Johnson [via kottke], with some thoughts about the project from SBJ: Introducing outside.in.
    filed under: community, weblogs, startup, geo
As you know I'm a huge fan of hyperlocal info, and I'll be watching this closely. They have some interesting ideas (tying geography to posts rather than blogs), and I wonder how that will work practically. And how will this scale without an army of editors? How is a blog chosen for inclusion? How do they plan to deal with spam? How do you weed out posts that have nothing to do with the locality (is it solely keyword/category sifting)? Do they plan to ask weblog authors to include metadata with each post? I have dozens of questions for them about hurdles I've run into running a local community aggregator.