Five short links

Photo by Dave M Barb

quantiFind – A commercial company with a similar philosophy to the DSTK. Take in unstructured data, and use statistical approaches to extract something structured.

WorryDream – I love the infographics approach of this site, especially the ‘latitudes I have lived at’.

Goliath – Intriguing lightweight event-based web server, written in Ruby. Sinatra has been a pleasure to use, but despite its maturity, requiring a layer like Passenger to serve parallel web requests still doesn’t sit right with me. I don’t have a well-reasoned argument behind this, so please go easy, but it has felt like too many moving parts.

Graphite – Log arbitrary events to a central server, get instant graphs of them over time. It’s a simple concept, but one I could see being very powerful. One of my favorite profiling tools when I was a game programmer was altering the screen border color as each code module executed. For ’90s-era single-threaded games synced to the refresh rate, you’d end up with a stacked colored column along the side, showing the proportion of the frame devoted to AI, rendering, etc. The simpler the profiling interface, the more likely people are to actually use it and learn the real characteristics of their systems.
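
To give a feel for how simple the Graphite interface is, here's a minimal sketch in Python that sends one data point using Carbon's plaintext protocol (a metric path, a value and a Unix timestamp on one line). The metric name, the host and the helper function names are my own assumptions for illustration, and you'd need a Carbon listener running for send_metric to connect to anything.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    # Carbon's plaintext protocol is one line per data point:
    # "<metric path> <value> <unix timestamp>" followed by a newline.
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    # host and port are assumptions; 2003 is the usual plaintext listener port.
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical usage: report how many milliseconds of a frame went to AI.
# send_metric("game.frame.ai_ms", 4.2)
```

One call per code module per frame would get you roughly the modern equivalent of the colored border trick.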

Diffbot – A different take on the unstructured-to-structured approach to data. One API lets you watch a web page and get a stream of changes over time, either as a simple RSS feed or a more detailed XML format. Another does a boilerpipe-like extraction of an article’s text or a home page’s link structure.

Facts are untraceable

Photo by Brian Hefele

As we share more and more data about our lives, there's a lot of discussion about what organizations should be allowed to do with this information. The longer I've spent in this world, the more I think that this might be a pointless debate. Controlling what happens with our data requires punishing people who are caught misusing it. The trouble is, how do you tell where they got it from?

If an organization has your name, friends and interests, here are just a few of the places that information could have come from:

– Your Facebook account, via the API or hacking.

– Your email inbox, through a browser extension or hacking, gathering a list of the people you mail and purchase confirmations.

– Your phone company, analyzing the calls you get and the URLs you navigate to on your smart phone.

– Your credit card company. They'd have trouble with the friends, though theoretically spotting split checks and simultaneous purchases should be a strong clue.

– Retailers sharing data with each other about their customers.

There are now so many ways of gathering facts about your life that it's usually impossible to tell where a particular set of data came from. You can inject fake Mountweazel values into databases to catch unskilled abusers, but as soon as there are multiple independent sources for any given fact, an abuser can filter out the fakes by only keeping values that are present in more than one source. If I know your postal address, can you prove that I hacked into the DMV, rather than just getting it from a phone book or one of your friends?
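
Here's a rough sketch of why the Mountweazel trap is so easy to sidestep. It's Python, the records and source names are entirely made up, and the launder function is my own illustration, not anybody's real pipeline. It just keeps records that show up in at least two independent sources, which quietly drops any fake entry planted in only one of them.

```python
from collections import Counter


def launder(sources, min_sources=2):
    # Count how many distinct sources each record appears in.
    counts = Counter()
    for records in sources:
        counts.update(set(records))
    # Keep only records confirmed by at least min_sources sources.
    return {record for record, n in counts.items() if n >= min_sources}


# Made-up data: the DMV copy contains a planted fake entry.
dmv = {
    ("Alice", "12 Oak St"),
    ("Bob", "9 Elm Rd"),
    ("Lillian Mountweazel", "1 Fake Ln"),
}
phone_book = {
    ("Alice", "12 Oak St"),
    ("Bob", "9 Elm Rd"),
    ("Carol", "3 Pine Ave"),
}

cleaned = launder([dmv, phone_book])
```

The planted record never makes it into cleaned, and nothing in the output says which source any surviving address came from.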

In practice this means that creepy marketing data gathered by underhand means can be easily laundered into openly-sold data sets, since nobody can prove it has murky origins. This has always been theoretically possible, but what has changed is that there are now so many copies of our personal data floating around that it's far easier to gather and harder to trace. From a technical point of view I don't see how we can stop it, as long as we continue to instrument our lives more and more.

I'm actually very excited by the new world of data we're moving into, but I'm worried that we're giving people false assurances about how much control they can keep over their information. On the other hand, the offline marketing world has gathered detailed data on all of us for decades without raising much public outrage, so maybe we don't really care?

Should You Talk to Journalists?

Photo by Danger Ranger

I've been helping to arrange some interviews for a reporter, and one of the friends I approached asked "Is there any benefit to the interviewee?". This is actually a very perceptive question. Most people jump at any chance to talk to a journalist, but there are real costs to that decision. Speaking as someone who has both written and been written about for money, I know a journalist's job is to persuade you to talk to them, whether or not that's actually in your interest. After I thought about it, I told him it really came down to what your goals are.

Good things that may happen

– Your work might be covered and publicized.

– He may approach you for quotes about related stories in the future.

– He might introduce you to other people in the area.

Bad potential side effects

– You lose valuable time you could spend actually building things.

– He could garble or misquote your points, leading to negative publicity.

– Other publications may decide not to publish stories if you're seen as giving an exclusive to a rival.

What may happen if you don't talk

– A competitor does provide the needed quotes, and gets the publicity.

– The journalist covers you in a negative way. This is very rare, but it's always there as a threat.

Most people radically overestimate the dangers of being misquoted, but also have unrealistic expectations of the power of good publicity. A lot of it boils down to networking and exposure, and how much that benefits you depends on what you're trying to do. If you're focused on research or making technical progress, it's probably a distraction you should ignore. If the startup/fundraising side is higher on your priority list, being able to point to articles can really help in establishing the ever-desired perception of traction.

It's worth thinking about how you'll deal with interview requests before they come up. I've always loved talking to people about what I do, and my frustration at not being able to discuss my work was a big part of why I left Apple, so I've ended up working on projects where that tendency pays off. Your situation may well be different though. Unless you're clear-headed about your goals, you'll end up wasting your time. It's also worth pondering which publications reach an audience you actually care about. It might be that comparatively obscure industry journals will let you talk to the decision makers in your market a lot more effectively than a mainstream outlet, which should affect which journalists you spend time on.