The American Way of Dating

Dating
Photo by Brandon Warren

With the (mostly) shared language, it's easy for people from the UK to think that America is basically like Britain, apart from the funny accents. I had a little of that attitude when I moved here, but rapidly learned how wrong I was. With Valentine's coming up, I was reminded of one of the best examples of the alienness lurking under the surface: dating. As Kira Cochrane amusingly chronicled in The Guardian, the British standard is "go to a party, down some drinks, make eye contact with a person you fancy, proceed to kissing and often much more, wake up the next morning to find that you have magically become one half of a couple". The goal seems to be to avoid any unambiguous declarations of interest, so that at any point either person can end the process without the other losing face.

This isn't how it usually works in the US, at least in the mainstream. The formality and rituals surrounding courtship feel like something out of a Noh play. The very idea of actually asking a near-stranger for a date, explicitly and with no particular preamble, in the full knowledge that you may be turned down, seems nothing short of revolutionary compared to the system I grew up with.

Kira ended up avoiding the rules when she was over here, but even she acknowledges there's a need they're filling. Maybe it's because American culture is so varied that the system has to be so explicit about intentions, since people growing up with radically different backgrounds will never be able to communicate using the subtle signs that the British rely on. There's also something refreshingly honest about the whole procedure. A friend was telling me about her travels in Ireland, and being romanced by a hopeful local man. She discovered he was married, with kids, so she asked if it was an open relationship. "Don't be disgusting, woman!" was the reply.

Eighteen Short Links

Eighteen
Photo by Laura Thorne

With my book launch, BigDataCamp and Strata, I’ve accumulated a backlog, so here’s five short links, plus 13!

Gluecon – Eric Norlin knows how to put on a great conference on emerging topics, and the world of integrating different web services, APIs and data sources is one that’s close to my heart. I’m looking forward to seeing the tribe that he gathers in Colorado, and if you’re part of it, you should think about taking up this opportunity to demo your application.

Big Data with Ken Krugler – Ken’s off-the-cuff talk on the pre-electronic US Census was one of the highlights of BigDataCamp for me. This covers a lot of the same ground, but in much more depth. O’Reilly folks, you need to pull this guy on board somehow!

Mapfluence Data Catalog – A well-chosen set of geo and demographic data sets from UrbanMapping. It’s all commercial, which I have no objection to at all, but the lack of obvious pricing means you’ll have to invest time in negotiation with them to decide whether it’s for you. An unlabeled graph doesn’t count.

pipe2py – An intriguing open-source project that takes data flows built in Yahoo Pipes, and converts them into pure Python code. There’s also a quick tutorial available describing how to run the results on Google’s AppEngine.

PeopleSearch – A simple but effective hack, using Google’s custom search APIs to find people’s profiles on major services.

$3m Heritage Health Prize – A fantastic idea, using a Netflix-style data competition at Kaggle to research better ways to predict healthcare needs. There’s some questions around how to best preserve anonymity, but this is such an important goal that it’s worth accepting some small risks on the privacy front.

The O’Reilly Stylesheet – I love reading through stylesheets from different publishers. There’s been a few rules in here I’ve struggled to follow, like referring to a company as ‘it’ rather than ‘they’.

GroundCrew – A simple but effective service for organizing volunteers using cell phones.

Walkshed – There’s a lot of promise in visualizing attributes like walkability and accessibility across cities. A lot of these attributes are really hard to understand unless you devote serious time to exploring the neighborhoods, which made it tough to choose a location when I had to move to San Francisco as an outsider.

Map of Scientific Collaboration – A beautiful view of the citation networks in research papers, presented geographically. The next step is to make these interactive and explorable.

Chequered Airwaves – How the high-brow Czech language radio stations ceded the battle for minds to the less scrupulous German broadcasters in the run-up to the Second World War. This struck me as relevant when we consider the right approach to ignorant populist diatribes, in the debate I keep having with myself about how sensational to go.

Ruby Geocoder – The most recent version of the original Perl Tiger/Line US geocoder, rewritten in Ruby and able to ingest the latest shapefiles.

Hacking Lottery Scratchcards – There’s a whole world of statistical data hacking out there, revealing information that publishers never believed they could possibly be exposing.

Small Business Innovation Research Grants – There’s a massive world of US government money available to startups. The main drawbacks are the almost overwhelming barriers to getting through the initial paperwork, the pernicious influence of managing to please federal managers instead of real customers, and in this case becoming part of the military-industrial complex.

Where the Ladies At? App – I may not like it, but this is probably the future of location-based services. After all, Facebook basically started as a way to stalk fellow students at Harvard.

How the O’Reilly Animals are Chosen – I still have no idea how I got a bull for my cover, but given my childhood in a farming village I can’t complain.

Strata Interview – I talk about the Data Source Handbook on camera. I wasn’t happy with this one; I should have talked about all the cool maps people are building with OpenHeatMap instead of going off into an abstract ramble.

Europe vs the US on Privacy – There’s a strong tradition in Europe of assigning a higher value than the US to privacy relative to freedom of expression and innovation. There’s going to be an increasing clash over this as more and more data sources merge and reveal increasing amounts of personal-but-public information.

Discover public data with the Data Source Handbook

I’m pleased to announce that the Data Source Handbook is now available from O’Reilly. It’s a compact ebook guide to the most useful APIs and bulk data sets I’ve found, packed with examples and advice. These are hand-picked services that I’ve actually spent time using during my own work, and I chose them because they add insights and information to data you’re already likely to be dealing with. You can check out the table of contents below, and I’ve also included a couple of excerpts.

It’s organized by the kind of data that you want to look up information on, from websites to locations, email addresses to ISBNs. There’s a whole new world of free or cheap public data out there, I’ve been having a blast exploring it myself, so I hope you’ll enjoy it as much as I have. A big thanks to everyone who helped me compile this too, from my editors Mike Loukides and Teresa Elsey to all the helpful people on Quora, along with the many friends who emailed me ideas. Keep the suggestions coming, I’ll be working on an updated edition soon.

Websites:

  •   WHOIS
  •   Blekko
  •   bit.ly
  •   Compete
  •   Delicious
  •   BackType
  •   PagePeeker

People by email:

  •   WebFinger
  •   Flickr
  •   Gravatar
  •   Amazon
  •   AIM
  •   Friendfeed
  •   Google Social Graph
  •   MySpace
  •   Github
  •   Rapleaf
  •   Jigsaw

People by name:

  •   WhitePages
  •   LinkedIn
  •   GenderFromName

People by account:

  •   Klout
  •   Qwerly

Search terms:

  •   BOSS
  •   Blekko
  •   Bing
  •   Google Custom Search
  •   Wikipedia
  •   Google Suggest
  •   Wolfram Alpha

Locations:

  •   SimpleGeo
  •   Yahoo
  •   Google Geocoding API
  •   CityGrid
  •   Geo-Coder-US
  •   Geodict
  •   GeoNames
  •   US Census
  •   Zillow Neighborhoods
  •   Natural Earth
  •   US National Weather Service
  •   OpenStreetMap
  •   MaxMind

Companies:

  •   CrunchBase
  •   ZoomInfo
  •   Hoovers
  •   Yahoo Finance

IP Addresses:

  •   MaxMind
  •   Infochimps

Books, films, music and products:

  •   Amazon
  •   Google Shopping
  •   Google Book Search
  •   Netflix
  •   Yahoo music
  •   Musicbrainz
  •   The Movie DB
  •   Freebase

WHOIS

The whois unix command is still a workhorse, and I’ve found this web service a decent alternative too. You can get the basic registration information for any website. In recent years, some owners have chosen ‘private’ registration which hides their details from view, but in many cases you’ll see a name, address, email and phone number for the person who registered the site. You can also enter numerical IP addresses here and get data on the organization or individual that owns that server.

Unfortunately the terms-of-service of most providers forbid automated gathering and processing of this information, but you can craft links to the Domain Tools site to make it easy for your users to access the information.

<a href="http://whois.domaintools.com/www.google.com">Info for www.google.com</a>
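If you're generating these links for a list of sites, a small helper keeps the markup consistent. This is only a sketch that follows the URL pattern in the example link above; the function names whoisLink and escapeHtml are my own inventions for illustration, not part of any Domain Tools API.

```javascript
// Escape the characters that matter when placing text inside HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build an anchor tag pointing at the Domain Tools WHOIS page for a hostname.
// The URL layout is copied from the example link; nothing else is assumed.
function whoisLink(hostname) {
  var url = 'http://whois.domaintools.com/' + encodeURIComponent(hostname);
  return '<a href="' + url + '">Info for ' + escapeHtml(hostname) + '</a>';
}
```

Calling whoisLink('www.google.com') gives back the same markup as the hand-written link above.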

There is a commercial API available through whoisxmlapi.com that offers a JSON interface and bulk downloads, which seems to contradict the terms mentioned in most WHOIS results. It costs $15 per thousand queries. Be careful though, it requires you to send your password as a non-secure URL parameter, so don’t use a valuable one.

curl "http://www.whoisxmlapi.com/whoisserver/WhoisService?\
domainName=oreilly.com&outputFormat=json&userName=<username>&password=<password>"
{"WhoisRecord": {
"createdDate": "26-May-97",
"updatedDate": "26-May-10",
"expiresDate": "25-May-11",
"registrant": {
"city": "Sebastopol",
"state": "California",
"postalCode": "95472",
"country": "United States",
"rawText": "O'Reilly Media, Inc.\u000a1005 Gravenstein Highway North
\u000aSebastopol, California 95472\u000aUnited States\u000a",
"unparsable": "O'Reilly Media, Inc.\u000a1005 Gravenstein Highway North"
},
"administrativeContact": {
"city": "Sebastopol",
...
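Once you have the JSON back, pulling out the useful parts is straightforward. Here's a minimal sketch that uses only the field names visible in the truncated response above; summarizeWhois is a hypothetical helper of mine, and the real response may well contain fields or variations that this doesn't handle.

```javascript
// A trimmed copy of the response shown above, limited to fields visible there.
var sampleResponse = JSON.stringify({
  WhoisRecord: {
    createdDate: '26-May-97',
    expiresDate: '25-May-11',
    registrant: {
      city: 'Sebastopol',
      state: 'California',
      postalCode: '95472',
      country: 'United States'
    }
  }
});

// Hypothetical helper: reduce a WHOIS response to a few summary fields,
// tolerating records where parts are missing (e.g. private registrations).
function summarizeWhois(jsonText) {
  var record = JSON.parse(jsonText).WhoisRecord || {};
  var registrant = record.registrant || {};
  return {
    created: record.createdDate || null,
    expires: record.expiresDate || null,
    location: [registrant.city, registrant.state, registrant.country]
      .filter(Boolean)
      .join(', ')
  };
}

var summary = summarizeWhois(sampleResponse);
```

The defensive defaults are there because privately registered domains may come back without a registrant block at all.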

Blekko

Blekko is the newest search engine in town, and one of its selling points is the richness of the data it offers. If you type in a domain name followed by /seo, you’ll receive a page of statistics on that URL.

blekko0.png

They are also very keen on developers accessing their data, so they offer an easy-to-use API through the /json slash tag, which returns a JSON object instead of HTML.

http://blekko.com/?q=cure+for+headaches+/json+/ps=100&auth=<APIKEY>&ft=&p=1
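To build that kind of URL in code, something like the sketch below should work. It simply reproduces the shape of the example URL; blekkoJsonUrl is a made-up helper name, and my reading of ps as the page size and p as the page number is an assumption based on the example rather than anything documented here.

```javascript
// Hypothetical helper that mirrors the example URL above.
// Assumption: /ps=N sets results per page and p=N selects the page.
function blekkoJsonUrl(query, apiKey, pageSize, page) {
  // Encode each word separately and join with '+', as in the example.
  var terms = query.trim().split(/\s+/).map(encodeURIComponent).join('+');
  return 'http://blekko.com/?q=' + terms +
    '+/json+/ps=' + pageSize +
    '&auth=' + encodeURIComponent(apiKey) +
    '&ft=&p=' + page;
}

var exampleUrl = blekkoJsonUrl('cure for headaches', 'APIKEY', 100, 1);
```

The slash tags are appended unencoded on purpose, since the example URL passes them as literal /json and /ps=100.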

To obtain an API key, email apiauth@blekko.com. Their terms of service are available at https://blekko.com/ws/+/terms, and while they’re somewhat restrictive, they are flexible in practice:

You should note that it prohibits practically all interesting uses of the blekko API. We are not currently issuing formal written authorization to do things prohibited in the agreement, but, if you are well behaved (e.g. not flooding us with queries), and we know your email address (from when you applied for an API auth key, see above), we will have the ability to attempt to contact you and discuss your usage patterns if needed.

Currently, the /seo results aren’t available through the JSON interface, so you have to scrape the HTML to obtain them. There’s a demonstration of that at https://github.com/petewarden/pagerankgraph.

 

A fundamental bug in HTML5’s Canvas?

Stitchingscreenshot

I still get requests for an HTML5 implementation of OpenHeatMap, so I guess I've done a terrible job of telling people about the Canvas-based renderer I've had in there since it launched. The confusion comes about because I default to Flash if your browser has it installed, since it's usually faster and there's still one problem with the Canvas implementation that I haven't been able to fix.

If you look at the screenshot above, you'll see pale white lines within the states. Those are boundaries between the internal polygons that they're made of, and in the Flash version they don't show up. The fundamental problem is that if you render two polygons that share an edge, Canvas will show a visible join along that edge, whereas Flash will seamlessly meld the two together, with no difference visible if they're the same color. I've put together a minimal page here to show the issue:

http://web.mailana.com/labs/stitchingbug/

The source, along with a Flash project doing the same thing and producing the expected results, is here:

http://github.com/petewarden/stitchingbug

The fundamental issue is that it's impossible to do any complex polygonal rendering if you can't stitch polygons together without seams. I don't know exactly what Flash's fill rules are, but they produce the correct results, as do the 3D renderers I've used in the past. It's cross-browser, which makes it seem deliberate, so any references to the rules used would also be appreciated. Here's the Canvas code:

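    // Assumes `canvas` already refers to a <canvas> element on the page.
    var ctx = canvas.getContext('2d');
    ctx.fillStyle = 'rgb(0,0,0)';

    // Left rectangle, ending at the shared edge x = 50.5.
    ctx.beginPath();
    ctx.moveTo(0, 0);
    ctx.lineTo(50.5, 0);
    ctx.lineTo(50.5, 100);
    ctx.lineTo(0, 100);
    ctx.closePath();
    ctx.fill();

    // Right rectangle, starting at the same shared edge.
    ctx.beginPath();
    ctx.moveTo(50.5, 0);
    ctx.lineTo(100, 0);
    ctx.lineTo(100, 100);
    ctx.lineTo(50.5, 100);
    ctx.closePath();
    ctx.fill();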
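One plausible explanation for the seam, offered as an assumption rather than something confirmed here, is that each fill is antialiased by pixel coverage and then composited independently. The shared edge at x = 50.5 splits a pixel column in half, so each rectangle would paint that column at roughly 50% alpha. The arithmetic below sketches what standard source-over blending would do with those two half-covered fills; sourceOver is just an illustrative helper, not a Canvas call.

```javascript
// Source-over compositing: the alpha that results from drawing a source
// with srcAlpha on top of a destination that already has destAlpha.
function sourceOver(destAlpha, srcAlpha) {
  return srcAlpha + destAlpha * (1 - srcAlpha);
}

// Assumption: each rectangle covers half of the pixel column containing
// x = 50.5, so an antialiasing rasterizer draws it there at about 50% alpha.
var afterFirst = sourceOver(0, 0.5);
var afterSecond = sourceOver(afterFirst, 0.5);

// afterSecond is 0.75 rather than 1, so a quarter of the background would
// still show through along the shared edge.
```

If that is what's happening, 25% of a white page showing through black fills would look a lot like the pale lines in the screenshot.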

Anybody have any insights on this? I'd love to deprecate the Flash version, but I need to understand what's going on here, and I'm at a dead end. I'd love to hear there's something obvious I'm doing wrong.

Update – Thanks for the suggestions. I purposely simplified the example to avoid alpha, but that's the issue that most of the 'stroke()' approaches I'd already tried hit. I've included code to demo that below. I'm really not trying to be a jerk about this, honestly I'd love to know that I'm an idiot, as long as I find a solution. The public humiliation will be worth it, I swear.

    // Same two rectangles, but semi-transparent and stroked as well as filled.
    ctx.fillStyle = 'rgba(0,0,0,0.5)';
    ctx.strokeStyle = 'rgba(0,0,0,0.5)';

    // Left rectangle.
    ctx.beginPath();
    ctx.moveTo(0, 0);
    ctx.lineTo(50.5, 0);
    ctx.lineTo(50.5, 100);
    ctx.lineTo(0, 100);
    ctx.closePath();
    ctx.fill();
    ctx.stroke();

    // Right rectangle, sharing the edge at x = 50.5.
    ctx.beginPath();
    ctx.moveTo(50.5, 0);
    ctx.lineTo(100, 0);
    ctx.lineTo(100, 100);
    ctx.lineTo(50.5, 100);
    ctx.closePath();
    ctx.fill();
    ctx.stroke();

 

Trunk.ly and Egypt

Egyptprotests
Photo by Al Jazeera

We have to be careful not to project our own obsessions onto the rest of the world. As one Egyptian commentator said: "it's not about you". Technology has clearly played a role in the protests, but I'm betting that cell phones and television are a lot more influential than social networks.

Still, it's been amazing to watch how Twitter has been used to spread the word, especially to the outside world. The only trouble is that there's so many links and comments, it's almost impossible to follow any one topic amongst the volume of messages. That's where the Trunk.ly curation service comes in.

As I was chatting to the founders Tim and Alex yesterday, they pointed me at one of their users, ExiledSurfer. Since the protests began he's been collecting a massive number of links and reports from inside the country and relaying them through his Twitter stream. When people get in touch asking for information, he points them at his trunk.ly stream that gives his links "with no noise". With the ability to search and categorize by tags, and a simple view showing just the links and snippets of information about them, it's a much better way of documenting the material than just Twitter.

With over 5,000 links in his archive, he's turned his messages from an ephemeral stream into something more like a library of reference material, all without having to do anything more than use Twitter the way he always has. This is a case where technology has enabled a single person to become vastly more effective as a news broker than we could have imagined just a few years ago.

I've been watching the progress of Trunk.ly since before it was a glimmer in Tim's eye, thanks to his wonderful weekly blog chronicling their startup progress as it happens. With a site that's headed into the top 20,000 worldwide according to Alexa, and over seven million links collected, it feels like there's a lot of people who like their curation model. I'm looking forward to seeing how it helps us all organize our knowledge, and maybe play a small part in spreading the word in situations like Egypt.

OpenHeatMap now supports Hong Kong and Canadian constituencies

Can_constituency

Thanks to some very helpful folks at the University of Hong Kong, I've been able to add support for parliamentary constituencies in both Hong Kong and Canada. Like all OpenHeatMap areas, all you need to do is list the names of the constituencies and the values you want to display in a CSV file, Excel spreadsheet, or a Google document. Upload it, choose your display options, and you can create your own interactive political maps.

For more information and example files to download, go to the documentation sections on Hong Kong or Canada. I look forward to seeing what you build with these, so let me know how you get on.

Hk_constituency

Five short links

Vbirds
Photo by Sam Teigen

Bixo Labs – A company building bespoke web data mining solutions, founded by Ken Krugler of the late, lamented Krugle code search engine. Interesting because it shows how web crawling is now moving out of the labs and big corporations, into the reach of ordinary companies. There’s a nice explanation of their crawling architecture too.

WireIt – Build your own Yahoo Pipes/Quartz Composer interface with this free Javascript library. I’m having a blast with it myself, thanks Russ for the tip.

St Louis Fed – A feast of free economics data, I can’t believe I didn’t know about this before. via Paul Kedrosky.

Deri Pipes – Talking of Yahoo Pipes, here’s an open-source clone. I see errors pop up when I try to use it unfortunately, but with Yahoo’s recent troubles I’m looking out for alternatives to their services.

Datamarket releases 13,000 data sets – Another lovely set of data appears on the web. I haven’t seen anyone make a pure data marketplace work as a business model, but I wish them luck and appreciate more sources of data. Via Paul Kedrosky.

Five short links

Letterv
Photo by Chris in Plymouth

Extraordinary Claims – An in-depth look at the methodology behind both Daryl Bem’s research claiming evidence of precognition, and the critical responses to it. I’m deeply sceptical that his claims are correct myself, but Peter clearly lays out how the critics are trying to change the rules to dismiss them, rather than having a fair fight. As the Climategate coverage shows, science isn’t just about getting the right answer. Like justice, it has to be seen to be done, so having a transparent and even-handed process for dealing with heretics is important.

25 Commandments for Journalists – I’ve been thinking a lot about ‘sensationalism’ in writing, engaging the reader and how to square that with truth, justice and the American Way (of journalism). It’s been one of the most controversial topics I’ve tackled, provoking some insightful push back from regulars like Emily Cunningham. This manifesto from Tim Radford articulates the British position far better than I’ve managed to so far, with key phrases like “Nobody has to read this crap” and “Words like ‘sensational’ and ‘trivial’ are not insults to a journalist”. The final commandment is the most important though, about the balancing act we all need to do.

25. Writers have a responsibility, not just in law. So aim for the truth. If that’s elusive, and it often is, at least aim for fairness, the awareness that there is always another side to the story. Beware of all claims to objectivity. This one is the dodgiest of all. You may report that the Royal Society says that genetic modification is a good thing, and that depleted uranium is mostly harmless. But you should remember that genetic modification was invented by people who were immediately elected to the Royal Society for their cleverness, by people already in there because they knew how to enrich uranium fuel rods and deplete the rest. So to paraphrase Miss Mandy Rice-Davies (1963) “They would say that, wouldn’t they?”

Cuatro años de ejecuciones en México (Four years of executions in Mexico) – This is exactly the sort of important story I hoped OpenHeatMap could help tell. A sobering read, even through a stuttering automatic translation.

Gremlin – A screencast introducing Gremlin, a Groovy-based graph processing language. Uses analysis of Grateful Dead playlists as the example, and makes dealing with graph traversal look easy. Thanks to Chris Diehl for the heads-up.

Wolfram Alpha’s API is free, but is it open? – Wolfram has assembled an awesome collection of knowledge and aims to make it ‘computable’, but their API only returns images and textual descriptions of their data. If we’re going to do more than just display supplemental search results to users, we’ll need a machine-readable version. Anybody know folks there that I can quiz about that? Email me if you do, thanks (or if you have any other thoughts too of course!).

Five short links

Eveportraits

Eve Online User-Generated Portraits – Just look at the quality of those pictures above – they’re all created by players using Eve Online’s character generation system. Back in 1997 I worked on a pool game with animated characters, and a generator we nicknamed the Barbie Fashion Show, but I never realized how far the technology had come since then. Just check out this video of the interface to see how easy it is to create amazing results. In a funny twist, one of my old colleagues from that pool game is now working for Eve out in Iceland as a senior designer.

Carpets for airports – A connoisseur’s guide to airport flooring, revealing their secret meanings. The Da Vinci Code for carpets, with a funky Flash interface.

Without adding context, a journalist with data can be dangerous – A fantastic example of something I’ve been struggling to get across to people. At the moment we’re incredibly susceptible to believing numbers people throw at us, in a way we wouldn’t with stories told in prose. As a society, we need to wise up and develop enough savvy to build an immune system to this sort of manipulation, and part of that has to be calling out distortions like the one David pounces on.

Lighting the dark continent – Africa’s lack of development can seem staggering, but as Jon points out, it’s also a massive opportunity.

The Quantified Self Conference – I didn’t know there was a grass-roots movement around this idea, I thought it would be something driven by the product folks, but it makes sense there’s people interested in instrumenting their lives. This looks like a great opportunity to get feedback if you are in the business of offering solutions around this area.

Spotlight your startup at Strata

Spotlight
Photo by Bryan Stevenson

Are you a data startup who'd love to be at Strata but can't afford the admission? You now have a chance to attend the conference and show off what you've been building, thanks to the Strata Startup Showcase. There's space for fifteen startups, and successful companies will be given two free passes and five minutes to show off their work in front of investors. It's a great opportunity, but the deadline for admissions is Friday, so you'll need to be quick.

Don't forget the free Big Data Camp Unconference on the Monday before the main event too, the price is specially tailored for starving entrepreneurs' wallets.