Finding Lost URLs

A week or so ago, a page by Professor Solomon called The Twelve Principles made the link rounds. The prof lays out a 12-step plan for finding any lost object. Most of the principles are mental tricks to get you back to the place you lost a physical object: your keys, your glasses, your cellphone, etc.

Unfortunately, the principles don't translate well to digital objects like URLs. You didn't stick that URL for the Xbox hacking How-To in your junk drawer, and it's not likely to be stuck in the "Eureka Zone" under your keyboard. But I lose URLs all the time. I remember something I saw on the web a couple weeks ago and I can't figure out how to get there again.

I don't have anything close to a 12-principle system for finding lost URLs, but I thought it'd be fun to examine my haphazard ways of re-finding web things. These are probably obvious, but I thought collecting them together would help me start a system for finding those lost pages, blog posts, and other digital artifacts that I'd like to see again.

1. Google - As you already know, Google is great at finding things, and I can usually get back to old URLs by remembering keywords for the document. Even if I don't find exactly what I was after, I can sometimes find good substitute information on the same subject. Unfortunately, a query like "SQL Remove Duplicates" will bring up thousands of documents, and if I'm looking for a specific bit of code I found once for removing duplicate records in a database, the search has to go to the next stage.

2. Browse Browser History - Ctrl-H in the browser will bring up your surfing history and it can be a lifesaver if I know I visited the URL within the last week or two. It's especially helpful if I can remember the approximate time I was visiting the page I want to find, and I sort the history by date. But because browser histories only show the domain and page title, it's not very useful if I simply remember the subject of the page. I don't think of pages in terms of the domains they're hosted on, I think in terms of the page's content. (Searching your browser cache with something like Google Desktop might be better because you can search the full text of your browsing history, but I haven't started using this regularly.)

3. Revisit Web Haunts - Chances are good that I found the link I'm looking for at one of the sites I read regularly. Since I follow hundreds of sites with the news reader Bloglines, this can be a big search. Unfortunately the "Search My Subscriptions" feature at Bloglines isn't working for me, so generally I'll try to narrow down which site would have had the URL and then go back in time for each site individually using the "Display items within the last x" feature. Then Ctrl-F can help me find specific keywords within past posts. Google can also come in handy here. If I know I spotted a link about SQL on O'Reilly Radar, I can use the site: operator like this: site:radar.oreilly.com SQL.

4. Search People - del.icio.us just rolled out a feature called your network that lets you track other del.icio.us members. There's no search yet, but you can browse back in time to see what people you know bookmarked at del.icio.us. I think this'll be handy, and I have gone back into specific people's del.icio.us archives looking for a URL. Having them all in one place is good for browsing, and saves time if I can't remember exactly who posted the link I'm looking for.

del.icio.us leads into my primary strategy for finding lost URLs: make links more findable before they're lost. Here's how I do it.

1. Use Web-based Bookmarks - I use del.icio.us (my bookmarks), but there are a bunch of web bookmark systems out there. When I come across a URL I know I'm going to want to get back to at some point, I'll click the del.icio.us bookmarklet and tag it. Searching my del.icio.us bookmarks is easy, but like your browser history, you're only searching titles, tags, and notes, not the full text of the site you bookmarked. Yahoo's My Web and Google's Personalized Search both do better on the searching front—which leads to...

2. Turn on Search History - Privacy implications aside, I've found Google's Personalized Search handy for finding lost URLs even though I have mixed feelings about it. Once enabled, Google will remember every query you make and every search result you clicked on. You can then search just those sites that you clicked on in the past. Of course, that means everything you've searched for and every site you've clicked on is stored in a digital archive somewhere. I go back and forth, but privacy usually trumps findability for me so I might remove this option from my toolbox soon.

I should echo Professor Solomon's 13th principle: sometimes you can't find what you're after and you have to give up. The Web is ephemeral and pages come and go all the time. Even though it's maddening not to be able to get back to a document I know I've seen, that's life. What strategies am I missing?

Add Camera Images to Flickr

When I'm browsing photos on Flickr, I use the More Properties link quite a bit. That's the link that takes you to the Exif data associated with a photo if it's available. Embedded Exif data is how Flickr knows what type of camera took a particular photo, what the shutter speed and aperture settings were, and a bunch of other technical details about the state of the camera at the time the photo was taken. The More Properties link is to the right of a photo on Flickr, and looks like this when it's there:

More properties link

The first thing I look at on the More Properties page is the camera model. But unless I know a particular camera model number already, it doesn't tell me much. "Ahh yes, the EX-Z750," I tell myself. Of course I have no idea what that model number means. So if I really want to know what type of camera the photographer used, I have to copy the model number, go to Amazon or Google, paste it in, and sort through the results. I knew there had to be a better way.

So I wrote a (relatively) quick Greasemonkey script that does the work of looking up the camera model for me. It even inserts a picture of that particular model on the Flickr "More properties" page. Here's what it looks like in action.

More properties page before:

More properties before

More properties page after:

More properties with camera image

And you can click the camera image to view more info about the camera at Amazon. Bonus for me: if you buy the camera through that link, I'll get a little kickback through Amazon's Associates Program.

Here's how it works. The script grabs the camera model from the Flickr page, contacts the Amazon API looking for that model in the Camera & Photo category, and grabs the image of the first result. Then the script inserts that image and a link to the product page into the Flickr page.
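If you're curious what that Amazon lookup involves, here's a rough sketch of the same request in Perl (the actual script is JavaScript running in Greasemonkey, and the E-Commerce Service parameter names below are from memory, so treat them, along with the placeholder subscription ID, as assumptions to verify against Amazon's docs):

#!/usr/bin/perl
# A rough Perl sketch of the Amazon lookup the Greasemonkey script performs.
# The ECS parameter names and response elements are assumptions -- check them
# against Amazon's E-Commerce Service documentation before relying on this.
use strict;
use warnings;
use LWP::Simple qw(get);
use URI::Escape qw(uri_escape);

my $model  = shift || 'EX-Z750';       # camera model pulled from the Exif data
my $sub_id = 'YOUR-SUBSCRIPTION-ID';   # placeholder for your own Amazon key

my $url = 'http://webservices.amazon.com/onca/xml'
        . '?Service=AWSECommerceService'
        . "&SubscriptionId=$sub_id"
        . '&Operation=ItemSearch'
        . '&SearchIndex=Photo'         # assumed index for Camera & Photo
        . '&ResponseGroup=Images,Small'
        . '&Keywords=' . uri_escape($model);

my $xml = get($url) or die "No response from Amazon\n";

# Naive parse of the first result's image and detail page. A real script
# would use a proper XML parser, as the Greasemonkey version does with E4X.
my ($image) = $xml =~ m{<MediumImage>\s*<URL>([^<]+)</URL>}s;
my ($page)  = $xml =~ m{<DetailPageURL>([^<]+)</DetailPageURL>}s;

if ($image && $page) {
    print "Image: $image\nPage:  $page\n";
} else {
    print "No match -- fall back to a Google search for '$model'\n";
}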

It's not perfect. Sometimes Amazon doesn't carry that particular camera but has accessories that include a description with the model number. So you'll see a flash or remote shutter release instead of a camera. And sometimes the first result from Amazon isn't the correct model number—especially with older cameras. I'll keep tinkering with it to see if I can get more accurate results from Amazon.

If there's no match at all on Amazon, the script makes the model number a link to Google search results for that phrase.

The script just gives me a quick look at the type of camera that took the photo. I've been surprised to see cameras that look like video cameras taking nice still photos. Anyway, it was fun to put together and I learned a bit more about JavaScript.

If you already have Firefox with Greasemonkey installed, you can install this script for yourself here: Flickr Camera Images

Thanks to the author of Monkey Match for a solid Amazon E4X parsing example, and of course Dive Into Greasemonkey. For more fun hacking around with these applications check out Flickr Hacks and Amazon Hacks. (disclaimer: as you probably know I worked on both of these books.)

Add a batch of dates to Google Calendar

I've always used several calendars to plan out my life. Until recently, I used a paper desk calendar to track work-related events like project milestones. I used an insanely hacked-up version of PHP Calendar to track daily appointments and travel plans. And I used a paper calendar hanging in the kitchen to track family events like birthdays and anniversaries. And to be honest, with all of the calendars I still wasn't very organized. The distinctions between types of events and calendars weren't as clear-cut as I'm describing them, and I'd often end up with a work project milestone on my kitchen calendar, or a birthday in PHP Calendar, instead of in their "proper" locations.

What I like about Google Calendar is the ability to lay several calendars on top of each other. So I can keep the family birthdays separate from the project milestones, but I can still show them all on one calendar if I need to. And with a click, I can remove the dates that aren't relevant for what I'm working on at the moment. The calendar list looks like this:

calendar controls

I decided to make Google Calendar my One Calendar To Rule Them All, and the switch has been very easy. The Ajaxy interface makes adding events insanely intuitive—click a day to add an event on that day. And I love the ability to click and drag across several days to add weeklong events like conferences. The other big advantage to going digital is the ability to share calendars with other people. I can't easily send all of the data on my paper calendars to friends and family without getting Xerox and FedEx involved.

The one issue I ran into during the conversion was with family events. I had over 50 birthdays and anniversaries I wanted to add to a calendar, and the thought of clicking Create Event and adding data for each one, or worse—hunting and pecking to find a particular day to click—wasn't appealing. So I thought I'd share my method for dumping a bunch of dates into Google Calendar. You just need a little time to get your dates together, some Perl, and a Google Calendar account.

Import/Export

Google Calendar doesn't have an API (yet), but it does have a hacker's little friend called import/export. Google accepts two types of calendar formats for import: iCalendar and Outlook's Comma Separated Values (CSV) export. So if you already have calendar data in Outlook or iCal you can simply import/export at will. (Yahoo! Calendar also exports to the Outlook CSV format, so switching is fairly painless.) But I didn't know the first thing about either of these formats; I simply had a list of dates I wanted to dump.

Gathering Dates

I had a head start because I already had a list of family birthdays and anniversaries in a text file. I massaged the list a little to get it into a data-friendly format, and ended up with a file full of dates that looked like this:
4/18/1942,Uncle Bob's Birthday
4/28/1944,Aunt Sally's Birthday
7/23/1978,Lindsay and Tobias' Anniversary
8/10/1989,Cousin Maeby's Birthday
...
(obviously not real data.)

If you're building a list of dates from scratch you can use Excel. Just put dates in the first column in mm/dd/yyyy format, and descriptions in the second. When you're done, save the file in CSV format, ignoring all the warnings about compatibility.

I called the file family_dates.csv. Yes, this is a comma-separated value list too, but not the format Google Calendar is expecting. Plus, you don't want to add an event on April 18th, 1942. You want to add a full-day event for April 18th each year going forward. This is where I turned to Perl to massage the data.

The Code

This simple Perl script: calendar_csv.pl transforms the CSV list of dates and titles into the Outlook CSV format that Google likes to see. When you run the script, it converts the year of each event to the current year and adds an event for each of the next several years.

You'll need to customize the script a bit before you run it. Change $datefile to the name of your simple CSV file, in my case family_dates.csv. You can change $importfile to your preferred name for the output file (the default is import.csv). And you can set the number of years into the future that you'd like the dates to appear by adjusting the value of $yearsahead (the default is 5). If your events should only be added in the current year, set this to 1.
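To give you an idea of what's happening inside, here's a simplified sketch of the transformation (this isn't the real calendar_csv.pl, which handles a few more details, and the minimal Outlook-style header below is an assumption about the columns Google will accept):

#!/usr/bin/perl
# Simplified sketch of the calendar_csv.pl transformation -- the column
# layout of the Outlook-style CSV is a minimal assumption, not gospel.
use strict;
use warnings;
use Date::Calc qw(Today);

my $datefile   = 'family_dates.csv';   # simple "mm/dd/yyyy,description" list
my $importfile = 'import.csv';         # Outlook-style CSV for Google Calendar
my $yearsahead = 5;                    # how many years of events to generate

my ($thisyear) = Today();              # Date::Calc's Today() returns (year, month, day)

open my $in,  '<', $datefile   or die "Can't read $datefile: $!";
open my $out, '>', $importfile or die "Can't write $importfile: $!";

print $out "Subject,Start Date,All Day Event\n";

while (my $line = <$in>) {
    chomp $line;
    next unless $line =~ /\S/;
    my ($date, $title) = split /,/, $line, 2;
    my ($month, $day)  = split m{/}, $date;   # the original year is discarded

    for my $offset (0 .. $yearsahead - 1) {
        my $year = $thisyear + $offset;
        print $out qq("$title",$month/$day/$year,True\n);
    }
}

close $in;
close $out;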

Keep in mind that the more data in your calendar, the longer it will take Google to load that calendar when you fire up Google Calendar. I originally set the $yearsahead value to 10, but with over 500 events, the calendar was noticeably slowing the Google Calendar startup.

In addition to Perl, you'll need the Date::Calc module from CPAN.

And if you're not in the US and would prefer dd/mm/yyyy format, simply change this bit: my ($month, $day) = to this: my ($day, $month) =. Instant internationalization!

Once everything is set, run the script from a command prompt, like this:

perl calendar_csv.pl

A new file called import.csv will magically appear with your dates formatted as Outlook CSV events. With the file in hand you can head over to Google Calendar.
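With the minimal Subject/Start Date/All Day Event layout, the first few lines of import.csv look something like this (the real script's output may include more columns):

Subject,Start Date,All Day Event
"Uncle Bob's Birthday",4/18/2006,True
"Uncle Bob's Birthday",4/18/2007,True
"Aunt Sally's Birthday",4/28/2006,True
...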

Importing Data

Over at Google Calendar, click Manage Calendars under your calendar listing on the left side. Choose Create new calendar, and give your calendar a name and any other details. Click Create Calendar, and you'll see the new calendar in your list. Now click Settings in the upper right corner of the page, and choose the Import Calendar tab. Click Browse..., choose import.csv from your local files, set the calendar to your new calendar, and click Import.

That's all there is to it. You'll get a quick report about the number of events Google was able to import. Go back to your main view, and you should see your imported dates on the calendar, in the color of your newly created calendar. With one import, my view of April went from this:

calendar pre import

To this view, with family birthdays in the rust color:

calendar post import

(The details have been removed to protect the innocent.)

And once you have your calendar in Google, you can invite others to view and even help maintain the dates. Where I think this batch importing will be useful is for very large data sets. Imagine a teacher who wants to track the birthdays of students. It wouldn't be too hard to add the dates by hand. But a principal who wants to track the birthdays of everyone in a school will have an easier time putting together a spreadsheet than entering the days by hand. And even for my 50+ dates, writing a Perl script was preferable to entering the dates by hand.

So far I'm enjoying Google Calendar, and I haven't found any major problems beyond the limited importing ability. But now I really don't have an excuse for not sending out birthday cards.

Update (4/20): Google just released their Google Calendar API. I'll bet there are scores of hackers rushing to build bulk-import tools. Using the Calendar API would be a more stable way to import dates quickly. And wow! Hello, lifehackers!

Bloglines Update

Great news: Bloglines addressed the "onfocus/nofocus" problem, and the Greasemonkey script I wrote isn't needed anymore. I got an email from Paul at Bloglines letting me know that, "Our anti-XSS code was being too aggressive and attempting to filter attribute values, in addition to attribute keys." Thanks, Paul! I'm very happy they took time out to address the problem because I think it's a great service and I didn't want to move to another reader. If you installed the Greasemonkey script, you can get rid of it. I deleted it from my server.

Flickr Hacks Code

There's a nice review of Flickr Hacks over at MyMac.com: Hack Your Way Into Flickr, and the reviewer mentioned that the code for all of the hacks wasn't available online. O'Reilly has remedied the situation, and you can grab all of the code from the book in one zip file: Flickr Hacks Code. Carpal tunnels everywhere are rejoicing. (And don't forget about the color figures gallery at Flickr—another way to view parts of the book.)

Bloglines Greasemonkey Script

In January I posted about a peculiar problem between this site and Bloglines: Bloglines filtering. Basically, Bloglines filters out the word "onfocus" from links to avoid cross-site scripting (XSS) attacks. The filter isn't smart enough to realize that "onfocus.com" is perfectly ok, and not a threat. This means that anytime someone links to my site, or I link to images on my site, the Bloglines filter changes the domain from onfocus.com to nofocus.com. When people click on a link to my site within Bloglines, they get a 404 error page at nofocus.com. (System administrators over at nofocus.com must wonder why they get some strange 404 errors showing up in their logs.)

Anyway, I've emailed Bloglines about the problem several times and now I'm getting silence. I don't blame them, this is an obscure issue that only affects one of the millions of sites that flow through their system. But it still bugs me, so I wrote a quick Greasemonkey script to solve the problem. If you use Bloglines and Firefox and Greasemonkey, I encourage you to install this script: fix-bloglines-onfocus.user.js. (Of course, if you're reading this from within Bloglines, you'll need to visit onfocus.com directly to get the script.) The script changes any instance of "nofocus.com" to "onfocus.com". This script is as blunt as Bloglines' XSS filter, but it's my attempt to fix the issue from this end.

Many thanks to Mark Pilgrim for his Greasemonkey Patterns—it's a great resource for building scripts.

Update: Bloglines fixed their XSS filter.

Music Personality Score

Since talking with Gabriel at MusicStrands the other day, I've been thinking more about how we share our musical tastes with others. I was making the point to him that there should be a way to quickly relate the type of music you're interested in without forcing people to wade through months of listening data like the current social music services require. For example, you can see that my top two artists at Last.fm are Bob Marley and Mozart based on frequency of plays, but that doesn't mean that my top two genres are Reggae and Classical. (I wouldn't place those as my top two if someone asked me.) You have to wade through the entire list to see that I also like classic rock, indie rock, electronic music, and lots of other genres.

What I was trying to say to Gabriel, but couldn't quite articulate, is that there should be a Myers-Briggs style scoring system for musical taste. When I see that someone is an ENFP, I have one instant measure of their personality. If you could do the same for music, you'd have a way to instantly relate your musical interests. I'm not sure what the criteria would be—maybe I'm an ISAE (indie structured ambient electronic), or MECR (mainstream eclectic classic rock). And this would go hand in hand with a service like MusicStrands because they can analyze the last 1,000 songs I actually listened to. With the score in hand, I could paste it into the dozen or so social network sites I belong to, giving people a more nuanced look at my preferences than my top 5 bands or something.

The iTunes Signature Maker is one stab at this concept. This application wades through your iTunes collection and creates a short audio signature based on the music it finds. When listening to others' signatures I guess you could listen for electronica vs. distorted guitars, but it doesn't really give you a sense of music preference. This is more of a fun hack than a useful way to share your musical identity. It'd be much more accurate to analyze what you're actually listening to, and then do a bit of categorization based on meta info about those tracks.

I [heart] NY

sk and I just got back from a week in New York—here are some snapshots. We ended up spending quite a bit of time just walking around New York City. Our first trek was through Central Park.

central park

Meg and Jason had a beautiful wedding, the reason for our trip.

meg and jason

On Sunday we took a bus to New Paltz, NY, about an hour and a half north of New York City, to visit sk's aunt and uncle.

on the bus

We went on some great hikes in the area. Here's a picture sk snapped of me, apparently happy to be hiking.

pb hiking

The hiking highlight was a rock scramble up the side of a cliff, with great views from the top.

crag view

I'm hoping to make it back to New York City in the not too distant future. There's so much to do there and I feel like we barely scratched the surface.

no standing

You can see more photos from the trip (mostly from my cell phone) at Flickr, tagged with nyc and new paltz.

MusicStrands

Today I chatted with someone from MusicStrands and found out a bit about the company. They're based here in Corvallis, Oregon and employ somewhere around 30 people locally. It's fun to learn that a little piece of Web 2.0 is being built right here in my backyard. I use their competitor Last.fm (my profile), but I don't feel too bad because I've been sending my listening habits there since Audioscrobbler appeared several years ago. Sharing music seems so natural that I bet iTunes or YME will ship with more social features (like those MusicStrands provides) in the future.

If you want to see what MusicStrands is cooking up, check out MusicStrands Labs. They even have a tool for people like me that gives music recommendations for Last.fm users. (Thanks in part to the Audioscrobbler API, I assume.) Also fun: MusicStrands patents.

eJournal USA mentions onfocus

The US Department of State mentioned this site in their monthly eJournal, in an issue called Media Emerging. It was in an article about online photo journals, and you can see the article here: Online Albums. Click Enter Album to see all of the photoblogs mentioned. They also have an article about blogs: Bloggers Breaking Ground in Communication. It's great to be mentioned as a photoblogger even though I don't necessarily think of myself in that category anymore. But it's a good reminder that I should keep posting photos. They contacted me about the article a week or two ago and it was strange to see an email in my inbox with the subject, request from U.S. Dept of State.

Mechanical Turk

ETech has been over for a week, and one presentation is still nagging at me on a regular basis. Amazon has a Web Service called Mechanical Turk (named after this Mechanical Turk), and Felipe Cabrera from Amazon spent 15 minutes or so talking about MTurk during one of the ETech morning talks.

The talk focused on the idea that artificial intelligence hasn't materialized, and there are still some tasks that are easy for humans but impossible for computers. For example, a human can look at a picture of a chair and answer the question: Is this a picture of a chair or a table? A computer would have a tough time with that.

MTurk farms out these sorts of questions to real live humans as HITs (Human Intelligence Tasks, in MTurk parlance) and wraps the results up in a Web Services API so they can be used in computer programs. Cabrera called this process of tapping humans to make decisions for machines Intelligence Augmentation (IA) as opposed to Artificial Intelligence (AI). The talk was good, and MTurk is definitely a clever hack, but the idea has been bothering me.

I can imagine a world where my computer can organize my time in front of the screen better than I can. In fact, I bet MTurk will eventually gather data about how many HITs someone can perform at peak accuracy in a 10 hour period. Once my HIT-level is known, the computer could divide all of my work into a series of decisions. Instead of lunging about from task to task, getting distracted by blogs, following paths that end up leading nowhere, the computer could have everything planned out for me. (It could even throw in a distraction or two if that actually increased my HIT performance.) If I could be more efficient and get more accomplished by turning decisions about how I work over to my computer, I'd be foolish not to.

I guess this idea of people being managed and controlled by machines is nothing new, and it was the bread and butter of science fiction books I read as a kid. But MTurk puts this dystopia in a new, immediate context. Machines are smarter than ever, and control of human decision-making could be highly organized.

MTurk is only a few months old, and there's nothing inherently wrong with it. But I can't stop projecting the ideas behind the system ahead a few years, and that's what's bothering me. I can't even fully articulate why it's bothering me. I don't have any conclusions, or even concrete hypotheticals of MTurk gone awry—so I'm just using my blog as therapy. Obviously my computer didn't ask me to write this.

slashdot topic feeds

Matt was looking over my shoulder while I was reading feeds at the airport yesterday, and he noticed that I have a feed for Google-related posts at Slashdot. I told him I was scraping it together because Slashdot doesn't offer topic feeds (and I don't want to see everything at Slashdot), and Matt thought I should share the rss-generating love with the world. I agreed, and here we are.

Here's the script I'm using to scrape Slashdot. It's in Perl, and you'll need a couple modules: LWP::Simple and XML::RSS::SimpleGen. Once installed, grab the code: slashfeed.pl.

You'll also need the numeric topic ID for any Slashdot topic you want to track. They're easy to find. Those big icons in any Slashdot post link to a topic page. Click on one of those, and look for a number in the URL. For example, the Slashdot Google Topic Page is here:

http://slashdot.org/search.pl?tid=217

Note the tid=217 in the URL. That's your Slashdot topic ID for posts about Google. You can browse the directory of all available Slashdot topics at the top of the Slashdot Search page.

To generate an RSS feed full of Slashdot Google goodness, run the script from a command prompt, passing in a topic ID like this:

% perl slashfeed.pl 217

The script will spit out a file called slashdot_217.xml that contains the latest Google-related posts, RSS style. Just make sure the script saves this file to a publicly addressable web folder (you might need to tweak the output file path on line 55). The final URL should look something like:

http://example.com/feeds/slashdot_217.xml

Throw your new URL in your feed reader, and run the script on a regular basis with cron or Windows Task Scheduler. That's all there is to building a topic-specific Slashdot feed.
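If you're curious about what's happening under the hood, here's a stripped-down sketch of the approach (this isn't the exact code in slashfeed.pl; the scraping pattern is only illustrative, and you should check the XML::RSS::SimpleGen calls against that module's docs):

#!/usr/bin/perl
# Stripped-down sketch of the slashfeed.pl approach -- the HTML pattern
# below is illustrative and will need to match Slashdot's actual markup.
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::RSS::SimpleGen;

my $tid = shift or die "usage: perl slashfeed.pl <topic id>\n";
my $url = "http://slashdot.org/search.pl?tid=$tid";

my $html = get($url) or die "Couldn't fetch $url\n";

# Start a new feed for this topic
rss_new($url, "Slashdot topic $tid", "Posts from Slashdot topic $tid");

# Pull story links and titles out of the HTML (hypothetical pattern)
my $count = 0;
while ($html =~ m{<a href="(https?://[^"]*article\.pl[^"]*)">([^<]+)</a>}g) {
    my ($link, $title) = ($1, $2);
    rss_item($link, $title);
    last if ++$count >= 15;   # keep the feed to a reasonable size
}

die "No items found -- the scraping pattern probably needs updating\n"
    unless $count;

# Write the feed somewhere your web server can reach
rss_save("slashdot_$tid.xml");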

Scraping is notoriously brittle, so if Slashdot changes their HTML this script will break. If that happens, view source on the Slashdot topic page and rewrite the regular expressions on line 39 or so of the script. That's the only labor-intensive bit in this script.