Five short links

Photo by Broken Simulcra

Email Data Source – These guys had a cunning idea – listen in on commercial mailing lists by subscribing to them. They then analyze all of the data they gather to build a detailed picture of different industries' and companies' email marketing. It surprised me at first, but a lot of the companies I talk to find their email lists are their most effective marketing channel despite their distinct lack of trendiness, so I'm pleased to see someone innovating around them.

Brien Lane, Melbourne – This Australian alley has been covered with charts representing real demographic information from the area. I love seeing visualizations like this out in the real world; it makes me want to visit. Here are some more photos.

Clue is a renewable resource – This reminds me so much of my experiences at Apple. I spent over a year battling their legal department to honor an agreement we'd made when I joined, to allow me to just fix bugs in the same open-source project that had got me hired. A good friend spent a lot longer trying to get them to sign off on an Objective-C mode he'd built for Emacs, and as far as I know he still hasn't succeeded in releasing that simple config into the wild with the company's blessing. And Apple is actually one of the good guys when it comes to open source, so I can only imagine what some other places must be like.

Chartbeat for the ChatRoulette site – I've been using Chartbeat on one of my own sites recently, but actually seeing it running on a site with serious numbers of visitors makes its power a lot clearer.

Official Seattle crime map – While it's nowhere near as slick as others like the San Francisco Crimespotting map, I'm impressed to see a city government produce one of these for themselves. Hopefully more official bodies will see the advantages of making data available in an easy-to-use form like this.

Five short links

Portraits from the Congo at 50 – An astonishing collection of photos showing people living in the DR of Congo, together with short stories talking about their lives. Anyone who’s read In the Footsteps of Mr Kurtz will understand what a hell-on-earth the Congo has been for the last hundred years, but the tenacity of people determined to keep living their lives is amazing.

Conspiratorial Thinking – The best explanation I’ve seen of why otherwise-smart people can go spectacularly wrong when they only have a superficial understanding of a domain. The other side of any argument rarely consists of idiots and crazy people, so when I find myself asking “how could they be so dumb?”, it’s usually a sign I’m missing something important.

Mountain Lion Kittens in the Santa Monica Mountains – Liz was lucky enough to see the back end of a lion disappearing down a trail when we lived in LA. I never saw one myself, but always felt amazed to be living in a place so wild it still had them roaming free.

Data sets for data mining – A good list of high-quality sources of large data sets.

Goin’ down that road feeling bad – At the start of this song Woody Guthrie talks about its creator, how “he wrote this song… or got it started”. The dominant model of the 20th century was the ‘auteur theory’, trying to find a single person to focus on as the sole driving force behind any project, but I feel the way Woody phrased it captures a lot more of the reality of the creative process. Everything worthwhile I’ve been involved in has taken both a crazy person to start things rolling and a lot of people to join in and actually build it. I feel a post about “folk coding” coming on; the open source world has a lot in common with the way traditional music was passed around and improved.

Five short links

Photo by Search Engine People

Informed Consent in Information Technology – An awesome PhD thesis on the problems with those ridiculous license agreements we all click through without reading, and, even better, it comes with some practical suggestions on how to fix them. Apparently Catherine's now looking for more funding to continue her work – am I allowed to dream that Apple or Microsoft might want to bring her on board to fix their EULAs?

TravellerMap – I was never quite cool enough to play the Traveller role-playing game back in the '80s, but its creators built a fascinating background universe. I stumbled across this site by accident; the author has built a beautifully detailed interactive map for exploring the whole galaxy, and I'm in awe of it as a labor of love.

Analysis of the 'Flash Crash' – I've always been hooked on odd events, and May's sudden stock-market drop and recovery is one of the oddest I've come across. I don't have enough financial world chops to understand everything in this paper, but it's a detailed technical post-mortem of what actually happened.

Wikiposit – Another rich collection of public data sets, mostly financial, with the site code released under the GPL.

Swarm Light – This art installation sends shivers down my spine every time I watch it, and it's a technical masterpiece too, using hundreds of CPUs to control the lights. Make sure you go to 1'30'' in the video; that's where it really starts to take off.

Don’t shave that yak – God loves lazy programmers

Photo by Liminal Mike

I just wasted four days of my life on something completely worthless. It started off innocently enough: I wanted to take a cloud of data points and display them as a nice heatmap.

The Story

Hmmm, sounds just like the mesh creation I used to do back when I was in games, so let's dust off that Computational Geometry textbook and write a Delaunay triangulator. Sweet, there's even an example in JavaScript I can adapt to ActionScript. Awesome, it all works on my test data. Oh, it's O(n^2) in complexity, so it doesn't work so well on my larger test of 1,000 points, and it will take the lifetime of the universe to process the 30,000 points I need it to.

No worries, I'm a clever chap; I'll dig through the literature and find a better algorithm. They all seem to be either too complex to implement easily or missing the performance characteristics I need. I don't need a strict Delaunay arrangement, so I should be able to brew up my own divide-and-conquer version that uses the exact algorithm on subsets of the points and then stitches them back together with Delaunay-like strips.

Huh, looks like there's a bug in the convex hull creation code I wrote. And another. And another. Arrrggh! I need a better way of visualizing what's going on, so I'll build a canvas-based web page that lets me view the output of my algorithm. And then I need to…

What? It's Saturday afternoon? Where did my week go! And why am I still banging my head against this code? What was I trying to do again? Oh yes, display a heatmap of these points. So why have I been debugging convex hull merging code for the last two days? There must be a simpler way.

<… two hours pass …>

There we are: I just adapted some point-blob rendering code I was already using, so I don't have to worry about triangulating a massive cloud of points; I just throw blobs at the screen and build up the image. Works great. Now I just have to write a blog post to remind myself once again – *Never Shave a Yak!*
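For the record, the blob trick boils down to something like the sketch below. It's a rough JavaScript/canvas version rather than my actual ActionScript code, and the radius and alpha values are placeholders:

```javascript
// Sketch of the blob approach: skip triangulation entirely and draw a
// soft radial-gradient 'blob' for every point, letting the overlapping
// alpha accumulate into a density image.
function drawHeatmap(canvas, points, radius) {
  var ctx = canvas.getContext('2d');
  for (var i = 0; i < points.length; i += 1) {
    var p = points[i];
    // Each blob fades from faint black at the center to transparent.
    var blob = ctx.createRadialGradient(p.x, p.y, 0, p.x, p.y, radius);
    blob.addColorStop(0, 'rgba(0, 0, 0, 0.1)');
    blob.addColorStop(1, 'rgba(0, 0, 0, 0)');
    ctx.fillStyle = blob;
    ctx.fillRect(p.x - radius, p.y - radius, radius * 2, radius * 2);
  }
  // A full heatmap would now map the accumulated density through a
  // color ramp; this sketch stops at the grayscale version.
}
```

One linear pass over the points, so the 30,000 points that would have taken the O(n^2) triangulator forever draw in a blink.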

The Lesson

Yak shaving is a term I first ran across in the Jargon File, and it stuck in my head because it's so common and so dangerous in programming. It's when you're working on a task because you need it done to finish something else, which you're doing to complete another job, and so on up a long dependency chain towards your real goal. This happens a lot when you're coding, and, just as in my story, each step is very logical, but you ultimately end up wasting massive amounts of time on something that has very little effect on what you really want to do.

I swear that the biggest reason I'm a more effective programmer now than when I was 20 is that I'm better at spotting when I'm shaving a yak, and finding another way. The biggest clue is when I'm working too hard. I went into a serious deep dive for the last few days, staying up late, skipping dog walks and getting way behind on my emails (sorry to anyone waiting for a reply!). If I'm making progress on something core, this sort of crunch time is actually energizing, as long as I don't keep it up too long. In this case I was feeling frustrated, and looking back, that was largely because it was a peripheral task that at some level I knew didn't have to be solved.

There are already lots of other reasons to embrace lazy programming. Fewer lines of code mean fewer bugs. The best route to an easy life is writing solid code that doesn't require constant maintenance, and is documented well enough that people don't bug you with questions. Harness your inner laziness to spot yak shaving too, and find a simpler way when you're spinning your wheels on a peripheral task.

Five short links

Photo by Grant Mitchell

Delegate co-memberships – A network map showing which groups Republican and Democratic convention delegates belong to, and how large ones like the NRA and the Sierra Club are connected to each other.

A short note on random load balancing – Interesting algorithm that has a lot of the advantages of doing completely random assignments of tasks to buckets, but with a lot less variance in the work assigned to each bucket.
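My guess from that summary is that it's the classic 'power of two choices' scheme; if so, the core fits in a few lines. A hypothetical sketch (the function and data layout are my own, not the paper's):

```javascript
// Hypothetical sketch of two-choice load balancing: sample two buckets
// at random and give the task to the less loaded one. With n tasks and
// n buckets, pure random assignment leaves the fullest bucket with
// around log n / log log n tasks, while choosing the better of two
// random buckets brings that down to around log log n.
function assignTask(bucketLoads) {
  var a = Math.floor(Math.random() * bucketLoads.length);
  var b = Math.floor(Math.random() * bucketLoads.length);
  var chosen = bucketLoads[a] <= bucketLoads[b] ? a : b;
  bucketLoads[chosen] += 1;
  return chosen;
}
```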

Politicosphere – A network map of political news sites. I’m impressed by the presentation; it’s actually a pretty useful way of exploring the political blogosphere.

Myths and Fallacies of “Personally Identifiable Information” – Another great post by Arvind, this time dissecting why “PII” is not a very helpful concept, since it encourages the mindset that a simple “anonymization” of obvious identifiers is enough to safeguard people’s identities.

Tell them no, just never use that word – The hardest part of being an engineer is bridging the gap between users’ expectations and the limits imposed by technology. At the start of my career, if I was asked for something impossible I’d say no and explain why. It took me a while to learn to shut up, let people explain what they wanted in more detail, and think of creative ways of satisfying their underlying requirements.

Five short links

Photo by Ken-ichi

Timetric – A large collection of data sets, complete with online tools to chart and analyze them (via Pete Forde).

Jsonduit – These JSON streams could be really powerful for building mashups, since they bypass the same-origin policy that makes combining data so hard (via Pete Forde).
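I haven't dug into how Jsonduit implements its streams, but cross-domain JSON feeds usually rely on the script-tag (JSONP) pattern, since script tags aren't subject to the same-origin policy. A hypothetical sketch (the function name, callback, and URL are mine, not Jsonduit's API):

```javascript
// Hypothetical JSONP sketch: a dynamically-added script tag can load
// from any domain, and the server wraps its JSON in a call to the
// callback function you name in the query string.
function fetchJson(url, callbackName) {
  window[callbackName] = function (data) {
    console.log('Received', data);
    delete window[callbackName]; // Clean up after the one-shot call.
  };
  var script = document.createElement('script');
  script.src = url + '?callback=' + callbackName;
  document.body.appendChild(script);
}

// The response arrives as executable code, something like:
//   handleData({"temperature": 72});
fetchJson('http://example.com/data.json', 'handleData');
```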

USA Election Atlas – The presentation is a bit old-school, but you can find almost anything you'd want to know about past and present American election results.

SimpleDB essentials – Hard-won wisdom on getting the most out of SimpleDB, from Sid Anand, who's using it heavily at Netflix. My own experiments with it stalled because it was so hard to upload large amounts of data reliably; my code is available for anyone who wants to pick up where I left off.

TrendsMap – An intriguing geographical prototype, showing Twitter trending topics on a map (via Régis Gaidot).

Some closure on my collision with Facebook

In response to my last post, Bret Taylor, the CTO of Facebook, announced that they will be altering their robots.txt to whitelist particular crawlers rather than trying to enforce their new terms of service. This makes me very happy, since it's now much less likely that other companies will try to impose restrictions this way, leaving crawlers free to obey robots.txt without fear of litigation.
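For anyone who hasn't looked inside a robots.txt before, a whitelisting file works something like this (an illustration of the pattern, not Facebook's actual file):

```
# Crawlers named here get full access (an empty Disallow allows everything)...
User-agent: Googlebot
Disallow:

# ...and every other crawler is shut out.
User-agent: *
Disallow: /
```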

On a personal level I'm hoping that this helps me put the whole episode behind me too. I truly didn't set out to take on Facebook; I was just trying to build my product and rather naively stumbled into a minefield. The last few months of dealing only with their lawyers were very frustrating, and led me to suspect there was a nefarious plan behind their attempt to impose a terms-of-service agreement beyond robots.txt. I believe Bret when he says it was just a lapse of judgment, and that admission should help me move on. I've always prided myself on staying on the straight-and-narrow, and what rankled most was Facebook's legal team treating me like a shady spammer.

Thanks for everyone's support as I've been working through this, all you blog readers and commenters, and especially Liz for putting up with me pulling my hair out and banging my head on the desk as I tried to deal with it all!

Now to get back to building stuff, since I don't want to just be known as that guy who got sued by Facebook…

Facebook employee responds on robots.txt controversy

I received a comment from Blake Ross on the legal changes Facebook have recently made, and I wanted to highlight his response. I recommend you look at the document itself, along with both of our commentaries on it, and make up your own mind.

——————————-

Hey Pete,

I work for Facebook, but this comment should not be construed as an official company statement.

Your interpretation of this document isn't correct and frankly doesn't make much sense. If our goal were to make it difficult for startups to succeed using Facebook data, we wouldn't have launched an open API that provides access to all of our data; we wouldn't have launched the fbFund to fund startups that are built on top of this API; and we wouldn't host an annual developer conference to help startups use this API. The very future of our platform is predicated on the notion that we can help other companies improve their products by leveraging the social graph.

This crawling document exists because we've had problems where shady companies would try to scrape user information in aggregate and use it for malicious purposes. For instance, these companies would scrape page by page from http://www.facebook.com/family/ and then try to resell these bulk lists.

Blake Ross

——————————-

Hi Blake,

Thanks for taking the time to comment. I might be willing to give Facebook the benefit of the doubt on this if it wasn't for the $14,000 legal bill I just paid.

The fact is that you've made all the information on the profile pages public, complete with micro-format data to help crawlers. There are some simple technical fixes you could add to solve the specific problem you mention, starting with amending robots.txt and changing your ID formats: http://petewarden.typepad.com/searchbrowser/2010/…

You've chosen to leave all that information out in the open so you can benefit from the search traffic, and instead try to change the established rules of the web so you can selectively sue anyone you decide is a threat.

I'm really pretty bummed about this because I've been a long-time fan of Facebook; you can see me raving about your XHProf work here: http://petewarden.typepad.com/searchbrowser/2009/…

The sad fact is, your leadership has decided to change the open rules that have allowed the web to be such an interesting and innovative place for the past decade.

Facebook has always been a closed system where developers are expected to live in a culture of asking permission before doing anything, and to exist at the whim of the company's management. The web I love is an open world where you are free to innovate as long as you stick to the mutually agreed rules. This is a land grab by Facebook: they've moved into the open web for the commercial benefits they'll reap, but want to change the rules so they can retain absolute control.

Flying to Santa Rosa Island with the CIA

Don't worry, Facebook haven't had me spirited away to Gitmo, but I did just take a trip with Channel Islands Aviation. They kindly donated a free flight to seven of us who'd previously volunteered with the NPS on the Channel Islands, and neither Liz nor I had been able to visit Santa Rosa Island before, so we were thrilled to have the chance to explore a whole new wilderness, even just for a day.

We'd looked into camping there before, but there are very few scheduled boat trips to the outer islands, so we'd never been able to find enough time to take the four days off we'd need. I didn't know that the CIA planes had just started to offer regular excursions too. It was quite an adventure: I'd never been in an eight-seater before, and the landing on Santa Rosa's dirt strip seemed daunting, especially in the strong winds the island is known for. Our pilot, Mark Oberman, made it look easy though, with a gentle touch-down.

Once we landed, we had the afternoon to explore the beautiful coast and hills, guided by Carolyn Greene, an interpretive volunteer for the park. It's amazing how much of the work of the NPS is taken on by volunteers; it makes me want to retire too, so I can really get stuck into some of their projects!

Santa Rosa is the second-largest island in the chain, at 85 square miles only a little smaller than Santa Cruz, and until very recently it was used for cattle ranching. There's still a population of deer and elk that were introduced for hunting, all of which have left the native vegetation pretty beleaguered. There are lots of signs of recovery though, and the rare Torrey pines seem to be thriving, with numbers up from fewer than a thousand to over four thousand in just a few years.

I'm always amazed at how the Channel Islands draw people in; so many of those working there fell in love with their beauty and arranged their whole lives around them. Luluis, one of the rangers, actually grew up on Santa Rosa when it was a cattle ranch. Our pilot has been flying to the islands for 35 years, and was married in Santa Cruz Island's chapel, with the old owner Harry Stanton as a witness. Brent, another ranger, used to work the boats that ferried tourists out there before he got his current job. I guess you can add Liz and me to that list, since we've made the trip all the way from Colorado twice in the last six months to visit the islands.

Finally we headed back to the plane and civilization, along the sort of beautiful beach that makes me wonder what California was like before all the development.

If you're interested in visiting Santa Rosa Island, you can make a day trip either by boat with Island Packers or by plane with CIA, or head out for some camping. I don't imagine the camping would be easy though: the winds can be strong, and the whole habitat is still recovering from the cattle, so you don't find much shelter. There's a developed campground with water, or if you're really hardy you can sometimes backpack over to the south beach. I can't imagine anywhere else so remote that's only a few dozen miles from Los Angeles; it would be an amazing experience despite the hardships.