How to save an image to disk from PostGIS

PostGIS’s new raster support has been incredibly useful to me, but problems can be hard to debug. One of my favorite techniques is saving out intermediate images so I can see what’s going on, so I sat down and tried to figure out how to do that in Postgres. I found some handy functions like ST_AsPNG(), ST_AsJPEG() and ST_AsTIFF() to convert rasters to bitmap images in memory, but was surprised that none of the examples showed how to save the results as files from the psql client. As it turns out, that’s because it’s very awkward to do! I spent some time Googling and managed to piece together a solution from various sources, so here’s a trail of breadcrumbs for anyone else hitting the same problem.

The key problem is that the psql client has no way to write out plain old, untouched binary data. You might think that the BINARY modifier to COPY would do the trick, but that actually writes out a header and other fields as well as the data we want. It turns out that your best bet is to encode the data as hexadecimal text, and then use an external program to convert it back to the raw binary you want. Here’s the recipe:

First, in the psql client run this (with the table and file names changed as appropriate):

COPY (SELECT encode(ST_AsPNG(raster), 'hex') AS png FROM table_with_raster) TO '/tmp/myimage.hex';

If you cat the /tmp/myimage.hex file, you’ll see a stream of hex numbers in text, like this: “89504e470d0a1a0a0000…”. You’ll now need to open up a normal Unix terminal and run the xxd command to stitch the text back into a binary file:

xxd -p -r /tmp/myimage.hex > /tmp/myimage.png
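If you’d rather not keep the intermediate hex file around, the two steps can probably be collapsed into a single pipeline by calling psql from the shell instead. This is just a sketch, with the database name (mydb) and the LIMIT 1 assumed; the -t and -A flags strip psql’s headers and padding so only the hex text reaches xxd:

psql -t -A -d mydb -c "SELECT encode(ST_AsPNG(raster), 'hex') FROM table_with_raster LIMIT 1" | xxd -p -r > /tmp/myimage.png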

Voila! You now have a file representing your raster, so you can actually see what’s going on with your data. Any suggestions on how to improve this are very welcome in the comments section; it doesn’t seem like it should be this hard.

Five short links


Photo by Leo-setä

Are your recommendations any good? – Does the output of your machine learning pass the ‘So what?’ test? We’re in the enviable position of being able to predict a lot of things we couldn’t before, but we need to be careful that the predictions actually matter.

Meet Marvin Heiferman – “1.2 billion photographs are made per day. Half-a-trillion per year.” I sometimes feel like I’m taking crazy pills trying to explain how mind-blowingly world-changing humanity’s photo-taking habit is, so it was great to run across Marvin’s deep thinking on the subject.

Parsing C++ is literally undecidable – I spent over a decade working in C++ almost exclusively, and the arcana involved in the language design still amazes me. I’ve lost more neurons than I’d like to remembering Koenig’s lookup rules, and this example just reminds me how much more insanity can emerge from a language defined so complexly.

Evidence-based data analysis – If we were able to define best practices for the data analysis pipeline, then we’d be able to compare and trust the results of data science a lot more. We’d no longer be able to pick and choose our techniques to get artificially good results.

Transmitting live from the ocean below the Antarctic ice – Bring a little bit of the world’s strange vastness to your desk.

Five short links


Picture by Wackystuff

Colors of the Internet – The palettes of the most popular 1,000 websites worldwide.

Compression bombs – Create a tiny compressed file that expands into terabytes when decompressed, persuade a user to open it in their browser, and sit back as their drive fills up and their machine becomes unresponsive.

What is the best comment in source code you have ever encountered? – ‘stop(); // Hammertime!’ might be my favorite.

Unix at 25 – How AT&T’s house rules of ‘no support, no bug fixes, and no credit’ ended up fostering an incredible community ethos of collaboration. ‘The general attitude of AT&T toward Unix–“no advertising, no support, no bug fixes, payment in advance”–made it necessary for users to band together.’

‘User’ Considered Harmful – Break the habit of referring to the wonderful human beings who take the time to interact with your software as ‘users’.

Five short links


Photo by N-ino

Word2Vec – Given a large amount of training text, this project figures out which words show up together in sentences most often, and then constructs a small vector representation for each word that captures those relationships. It turns out that simple arithmetic works on these vectors in an intuitive way, so that vector(‘Paris’) – vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’). It’s not just elegantly neat; this looks like it will be very useful for clustering, and any other application where you need a sensible representation of a word as a number.

Charles Bukowski, William Burroughs, and the Computer – How two different writers handled onrushing technology, including Bukowski’s poem on the “16 bit Intel 8088 Chip”. This led me down a Wikipedia rabbit hole, since I’d always assumed the 8088 was fully 8-bit, but the truth proved a lot more interesting.

Sane data updates are harder than you think – Tales from the trenches in data management. Adrian Holovaty’s series on crawling and updating data is the first time I’ve seen a lot of techniques that are common amongst practitioners actually laid out clearly.

Randomness != Uniqueness – Creating good identifiers for data is hard, especially once you’re in a distributed environment. There are also tradeoffs in creating IDs that are decomposable: it makes internal debugging and management much easier, but it also reveals information to external folks, often more than you might expect.

Lessons from a year’s worth of hiring data – A useful antidote to the folklore prevalent in recruiting, this small study throws up a lot of intriguing possibilities. The correlation between spelling and grammar mistakes and candidates who didn’t make it through the interview process was especially interesting, considering how retailers like Zappos use Mechanical Turk to fix stylistic errors in user reviews to boost sales.

Five short links


Photo by Adolfo Chavez

TI Calculator Simulator – A geeky labor of love, simulating one of the seminal 1970s calculators. I’m awestruck that a machine with only 320 words of ROM and a CPU that’s minimal beyond belief could be so incredibly useful.

The Mighty Dictionary – A detailed explanation of how the Python dictionary works under the hood. The hash-attack freakouts are a good reminder of how useful understanding even the most black-box parts of your stack can be.

A Gauss machine gun – Sometimes I wish I were a physical engineer.

The Boolean Trap – “Code is written once, but read (and debugged and changed) many times” – words to live by.

Frak – Impressive abuse of the regular expression engine to match arbitrary sets of strings. It’s evil, but I’ve resorted to similar hacks myself to get good text-matching performance out of scripting languages.

Five short links


Photo by ahojohnebrause

Telepath logger – An open source keylogger, with a lot of fun extras like webcam snapshots, light-level logging, and music monitoring.

A failure of authorship and peer review – An inadvertent #overlyhonestmethods moment in a chemistry paper.

Datahacker – Some beautiful experiments with 3D maps in the browser.

Black holes, lifeloggers, and space brain – Adam Becker’s been doing some great stories for New Scientist, one of my favorite mags, and it’s well worth following his blog too.

Comments on the Mill CPU – I’m a hopeless romantic when it comes to off-beat CPU architectures (I still pine for the clockless ARM Amulet, despite knowing about the problems that have emerged over the decades), so I was excited to read the original announcement of the Mill approach. It’s good to see some in-depth analysis from an expert, explaining a bit more of the history and challenges the Mill architecture faces, especially from dog-slow memory.


TechRaking – mining the news


One of my favorite conferences is TechRaking, a gathering of journalists and technologists hosted by Google. It’s invite-only, unfortunately, but they seem to gather a good selection of reporters interested in using technology and engineers interested in creating tools for journalists. In the past I’ve met fascinating people like Lyra McKee for the first time, and caught up with long-time friends like Pete Skomoroch, Somini Sengupta and Burt Herman.

This time, I’m going to live blog the talks, though there’s also an excellent set of notes as a Google Doc here.

Michael Corey from CIR started with an introduction. Tech + muck-raking is what it’s all about, hence the name of the conference. We’re focused on “mining the news” this time, which means turning bits and bytes into stories. We’re going to try to raise the bar on what we can do in this area, by bringing people together.

Phil Bronstein, chair of the CIR. Held up a t-shirt – “I can’t tell you how long it’s been since journalists got to have swag!”. Why is the CIR important? What distinguishes us is the idea of innovation. Journalists suffer from “the higher calling disease”: we don’t want to interact with the messy public. We’ve been discovering that technology can enable us to reconnect with our audience.

Technologists want to create and build something that people use. Journalists have traditionally stopped at “here’s the story, see you later”. The CIR is looking for results from their stories. They do everything from live theater to poetry to get at the emotional truth of the stories, so it’s more than just technology. I’m hoping that this event will be even better than the first one, where we ended up with engineers in one room and reporters in another, just because we weren’t familiar with each other.

Richard Gingras, Google. “I have been around since the age of steam-powered modems”. “Why are we here? We’re all here because we care about the relationship between citizens and media, and particularly citizens and journalism.” Our colonial predecessors had a very limited interaction with media; most of their daily lives were dominated by personal interactions. Over the last hundred years, mass media became vastly more important, and now social media has exploded. It’s not just an increase in volume; the nature has changed too. We’re all able to be producers now, no longer the passive audience we were. We’re able to pick the voices we listen to, share what we like, and talk back to the authors. Media is no longer an adjunct; it’s a central part of our lives, and many of our most important relationships are exclusively online.

How can we integrate quantitative thinking into the daily workflow of journalism, and make it the rule, not the exception? We’ve made great strides in the last few years. The Internet Archive’s closed-caption analysis is exciting, useful, and compelling. ProPublica built a dialysis tracker that finds nearby centers for patients and estimates how good they are. These kinds of evergreen investigative projects are the way reporting should be going.

The Google Fellowship program is aiming to produce both quant-friendly journalists and journalist-friendly quants. Our society’s need for credible journalism knowledge is more urgent than ever. Can we find truth in the numbers? I suspect we can.

Now we shift to a panel discussion. The members are Richard Gingras (Google), Rodney Gibbs (Texas Tribune), Robyn Tomlin (Digital First Media), Vanessa Schneider (Google Maps), and Sree Balakrishnan (Google Fusion Tables).

“What is your fantasy data set?” Robyn: a database of gun deaths across the US. Vanessa: a magical database of shape files at all the admin levels; power outage data would be extremely useful too. Sree is very interested in local data, like a breakdown by ZIP code of exactly what happens to your property tax. Rodney wants property taxes too; there are 900 tax assessors in Texas, and while some do well, a lot require you to appear in person to get the data.

“Everyone knows the present is mobile, mobile, mobile. What does mobile do today for storytelling?” Vanessa is excited by how readers can contribute in real time from their perspective. Robyn loves being able to tune into the interests of people depending on where they are. She had someone build a database of pet licenses, and was able to serve up geo-located data on what the most popular pet names and types were nearby. Rodney also loves the ability to personalize and make the results location-based, so you can look at average salaries of public servants and school rankings nearby. It is a problem to set up the CMS to make it mobile-friendly though. Sree sees a lot of requests from people who want to use Fusion Tables from mobile devices, and that’s an important use case. Local is very compelling to a lot of people, and if we could geo-tag stories that would really help. Richard says over half of their traffic is coming from mobile, so a lot of new projects are mobile-first.

“How do smaller organizations deal with this new world of data?” Robyn deals with over 75 newsrooms, so she’s tried to create a community around data so that all of those teams can turn to a hub of expertise. They also work with non-profits like the CIR and ProPublica to get help from their specialists. Every newsroom should have somebody who’s dabbling and can reach out to others! Rodney has had a lot of success encouraging reporters to play with the new tools. One example is a youth prison that had a high rate of injuries in its data, and it turned out that the staff were running fight clubs. That persuaded a lot of people that the data was worth digging into. We’re training reporters in CSS and JavaScript too; 90% of the journalists are excited by it. Vanessa is really interested in how journalists want to use the tools.

“What is the future of Fusion Tables? What’s the next big thing?” – Sree is looking at bringing more of the structured data in the world into Fusion Tables.

“How do we revolt against the tyranny of government data?” Sree – We’re looking at getting governments to adopt standards, working with organizations like OKFN. The work journalists do in finding data, cleaning it up, and putting it back out there is incredibly useful. Robyn – news organizations have been fighting for a long time to get data. Commonly there are format problems, such as PDF versions of Excel files. Rodney says that Texas is actually very progressive about making data available. He’s been working with school districts, who have been helpful, but for example they were just given a spreadsheet with 100,000 rows and 105,000 columns! They’ve developed skills in dealing with messy data, to the point that the state comptroller wanted to pay them to deal with some of their internal data. Once governments see the potential, they become a lot more willing.

“What are the challenges standing in the way of data journalism?” Richard – it’s getting proofs of concept out there. Once we have examples of how powerful these stories can be, people will understand how sexy it is. Robyn says it’s also leadership; news organizations need to emphasize how important it is. Sree mentions a couple of examples – a map of gun ownership, and the Nottingham Post in the UK’s map of where the most parking tickets are issued.

“What about the blowback from that gun story?” The paper posted a map of all the gun owners in the area, and lots of states passed laws making the ownership data private. Robyn says that a lot of journalists were critical, because there wasn’t a good story behind it; they just rushed to get it up there. Sree says that Google has the conversation about this balance all the time. He says the only way forward is to focus on the benefits, and to guard against abuses. Rodney tells a story where his prisoner database had incorrectly marked a few hundred prisoners as child abusers, but it turned out the state had incorrectly entered a code in their data!

Five short links


Photo by Leo Reynolds

San Francisco aerial views from 1938 – A startlingly high-resolution zoomable map of the city from 75 years ago. The detail is amazing, as are the sand dunes across what’s now the Sunset neighborhood. It’s an incredible resource for anyone interested in our local history.

One weird kernel trick – Ad spam for ML researchers – “the secret to learning any problem with just ten training samples”!

Encoding messages in public SSH keys – Pimp your public RSA key with a custom message!

Why we need a compiler with a long-term support release – Dependencies are the devil, and compilers are at the root of most dependency trees. I can’t even imagine how much wrangling it takes to keep a big software platform compiling across dozens of architectures, but this heartfelt plea from the trenches gives me a flavor of the suffering involved. In a few decades we’ll be stuck searching for complete VMs to run old software, since most of the frameworks they depend on will have rotted off the internet.

Archaeology and GitHub – Liberating data from hard-to-OCR tables at the end of papers into a simple shared environment. I don’t know if GitHub is the right space for this, but as with wikis, programmers are the early adopters of collaboration tools, so it’s a good place to start.