Five short links


Picture by H. Michael Karshis

The spread of American slavery – A compelling use of animated maps to get across the fact that slavery was spreading and dominating the places where it existed, right up until the Civil War. A map that matters, because it punctures the idea that slavery would have withered away naturally without intervention from the North.

Snapchat and privacy and security consent orders – On the surface FTC consent orders look pretty toothless, so why do companies worry about them so much? This article does a good job of explaining what they mean in practice, and it looks like they operate as jury-rigged regulations tailored to individual corporations, giving the FTC wide powers of oversight and investigation. The goals are often noble, but the lack of consistency and transparency leaves me worried the system is ineffective. If these regulations only apply to companies that have been caught doing something shady, then it just encourages others to avoid publicity around similar practices so they stay exempt from the rules.

Maze Tree – I have no idea what the math behind this is, but boy is it pretty!

A suicide bomber’s guide to online privacy – The ever-provocative Peter Watts pushes back on David Brin’s idea of a transparent society by reaching into his biology training. He makes a convincing case that the very idea that someone is watching you is enough to provoke fear, in a way that’s buried deep in our animal nature. “Many critics claim that blanket surveillance amounts to treating everyone like a criminal, but I wonder if it goes deeper than that. I think maybe it makes us feel like prey.”

Data-driven dreams – An impassioned rant against the gate-keeping that surrounds corporate data in general, and the lack of access to Twitter data for most research scientists in particular. Like Craigslist, Twitter messages feel like they should be a common resource since they’re public and we created them, but that’s not how it works.

Everything is a sensor for everything else


Photo by Paretz Partensky

“Everything is a sensor for everything else” – David Weinberger

I love this quote from David Weinberger because it captures an important change that’s happening right now. Information about the real world used to be scarce and hard to gather, and you were lucky if you had one way to measure a fact. Increasingly we have a lot of different instruments we can use to look at the same aspect of reality, and that’s going to radically change our future.

As an example, consider the humble pothole. Before computerization, someone in a truck would drive around town every year or so, see which roads were in bad repair, and write down a list on a clipboard. If a citizen phoned up city hall to complain, that would be added to another list, and someone who knew the road network would then sort through all the lists and decide which places to send crews out to.

The first wave of computerization moved those clipboard lists into spreadsheets and GIS systems on an office desktop, but didn’t change the process very much. We’re in the middle of the second wave right now, where instead of phone calls, our cell phones can automatically report potholes just from accelerometer data.

Using sensors to passively spot holes takes humans out of the loop, and means we can gather tens or hundreds of times the number of reports that we would if a person had to take time out of their day to submit one manually. This is only the beginning of the tidal wave of data though.
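As a rough illustration of how that passive reporting can work, here's a minimal sketch of the kind of jolt detection a phone app might run over its accelerometer samples; the threshold and the sample format are assumptions for the sake of the example, not how any real deployment works.

```python
# Minimal sketch of accelerometer-based pothole reporting (hypothetical
# threshold and sample format; a real app would calibrate per device).
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float        # seconds
    vertical_accel: float   # m/s^2, with gravity already removed
    lat: float
    lon: float

def detect_jolts(samples, threshold=6.0):
    """Flag samples where vertical acceleration spikes past a threshold,
    a crude proxy for a wheel dropping into a hole."""
    return [{"time": s.timestamp, "lat": s.lat, "lon": s.lon}
            for s in samples if abs(s.vertical_accel) > threshold]

# A single hard jolt among smooth readings produces one automatic report.
readings = [Sample(0.0, 0.3, 37.77, -122.42),
            Sample(0.1, 7.8, 37.77, -122.42),   # suspicious spike
            Sample(0.2, 0.5, 37.77, -122.42)]
print(detect_jolts(readings))
```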

Think about all the different ways we’ll be able to detect potholes over the next few years. Police and other public workers are increasingly wearing cameras, patrol cars have had dashboard cameras for a while, and computer vision’s at the point where analyzing the video to estimate road repair needs isn’t outlandishly hard. We’re going to see a lot more satellites taking photos too, and as the imagery gets more frequent and detailed, it will be a great source for tracking road conditions over time.

Beyond imagery, connected cars are going to be transmitting a lot of data, and every suspension jolt can be used as a signal that the driver might have hit a hole, and even small swerves to avoid a hazard could be a sign of a potential problem. Cars are also increasingly gaining sensors like LIDAR, radar and sonar. Their job is to spot obstacles in the road, but as a by-product you could also use the data they’re gathering to spot potholes and even early cracks in the road surface.

There will be even more potential sources of data as networked sensors get cheap enough to throw into all sorts of objects. If bridges get load sensors to spot structural damage, the same data stream can be analyzed to see when vehicles are bouncing over holes. Drones will be packed with all sorts of instruments, some of which will end up scanning the road. As the costs of computing, sensing, and communicating fall, the world will be packed with networked sensors, some of which will be able to spot potholes even if their designers never planned for that.

With all of this information, you might have thousands or even millions of readings from a lot of different sources about a single hole in the road. That’s serious overkill for the original use case of just sending out maintenance crews to fix them! This abundance of data makes a lot of other applications possible though. Insurance companies will probably end up getting hold of connected-car data, even if it’s just in aggregate, and can use it to help improve their estimates of car damage likelihood by neighborhood. Data on potholes from public satellite imagery can be used by civic watchdogs to keep an eye on how well the authorities are doing on road repairs. Map software can pick cycling routes that will offer the smoothest ride, based on estimates of the state of the road surface.

These are all still applications focused on potholes though. Having this overwhelming amount of sensor information means that the same data set can be mined for apparently unrelated insights. How many potholes there are will be influenced by a lot of things: how much rain there was recently, how many vehicles drove on the road, how heavy they were, how fast they were going, and I’d bet there are other significant factors like earth movements and nearby construction. Once you have a reliable survey of potholes with broad coverage and frequent updates, you can begin to pull those correlations out. The sheer quantity of measurements from many independent sources means that the noise level shrinks and smaller effects can be spotted. Maybe you can spot an upswing in the chemical industry by seeing that there are a lot more potholes near their factories, because the haulage trucks are more heavily laden? How about getting an early warning of a landslide by seeing an increase in road cracks, thanks to initial shifts in the soil below?
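To make that concrete, here's a toy sketch of the kind of correlation you could pull out once the survey exists; the numbers below are invented purely to show the mechanics, not real measurements.

```python
# Toy sketch: correlating weekly pothole reports against rainfall.
# All of the numbers are made up; the point is just the mechanics of
# mining an abundant pothole survey for other signals.
import numpy as np

weekly_rainfall_mm  = np.array([ 2,  0, 15, 30,  5, 40, 25,  0, 10, 35])
weekly_new_potholes = np.array([ 3,  2,  9, 14,  4, 18, 12,  1,  6, 16])

r = np.corrcoef(weekly_rainfall_mm, weekly_new_potholes)[0, 1]
print(f"rain vs. new potholes correlation: {r:.2f}")

# With enough independent sensor sources the noise averages out, so weaker
# relationships (truck weights, nearby construction, soil movement) could
# be pulled out of the same survey in the same way.
```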

These are just examples I picked off the top of my head, but the key point is that as the sample sizes grow large enough, sensors can be used to measure apparently unrelated facts. There are only so many quantities we care about in the real world, but the number of sensor readings keeps growing incredibly rapidly, and it’s becoming possible to infer measurements that would once have needed their own dedicated instruments. The curse of ‘big data’ is spurious correlations, so it’s going to be a process of experimentation and innovation to discover which ones are practical and useful, but I’m certain we’re going to uncover some killer applications by substituting alternative sensor information in bulk for the readings you wish you had.

It also means that facts we want to hide, even private ones about ourselves, are going to be increasingly hard to keep secret as the chances to observe them through stray data exhaust grow, but that’s a discussion for a whole new post!

Why fixing the privacy problem needs politics, not engineering


Photo by Canales

I just returned from a panel at UC Berkeley’s DataEdge conference on “How surveillants think”. I was the unofficial spokesman for corporate surveillance, since not many startup people are willing to talk about how we’re using the flood of new data that people are broadcasting about themselves. I was happy to stand up there because the only way I can be comfortable working in the field is if I’m able to be open about what I’m doing. Blogging and speaking are my ways of getting a reality check from the rest of the world on the ethics of my work.

One of the most interesting parts was an argument between Vivek Wadhwa and Gilman Louie, former head of In-Q-Tel, the venture capital arm of the US intelligence services. Apologies in advance to both of them for butchering their positions, but I’ll try to do them justice. We all agreed that retaining privacy in the internet age was a massive problem. Vivek said that the amount of data available about us and the technology for extracting meaning from it were advancing so fast that social norms had no hope of catching up. The solution was a system where we all own our data. Gilman countered with a slew of examples from the public sector, talking about approaches like the “Do not call” registry that solved tough privacy problems.

A few years ago I would have agreed with Vivek. As a programmer there’s something intuitively appealing about data ownership. We build our security models around the concept of permissions, and it’s fun to imagine a database that stores the source of every value it contains, allowing all sorts of provenance-based access. At any point, you could force someone to run a “DELETE FROM corp_data WHERE source='Pete Warden';”, and your information would vanish. This is actually how a lot of existing data protection laws work, especially in the EU. The problem is that the approach completely falls over once you move beyond explicitly-entered personal information. Here are a few reasons why.
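Before getting to those reasons, here's a minimal sketch of the provenance-tagged store a programmer imagines, using sqlite3; the table and column names are hypothetical, and real systems are nowhere near this tidy.

```python
# Minimal sketch of a provenance-tagged store (hypothetical schema).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE corp_data (
                  fact   TEXT,
                  value  TEXT,
                  source TEXT)""")   # every row remembers where it came from
db.executemany("INSERT INTO corp_data VALUES (?, ?, ?)", [
    ("likes_dogs", "true",  "Pete Warden"),
    ("zip_code",   "94103", "Pete Warden"),
    ("likes_dogs", "false", "Someone Else"),
])

# The imagined right-to-be-forgotten request: wipe everything I contributed.
db.execute("DELETE FROM corp_data WHERE source = 'Pete Warden'")
print(db.execute("SELECT * FROM corp_data").fetchall())
# Only the unrelated row survives, but this only works if every value really
# does carry an honest source tag, which is exactly what breaks down.
```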

Data is invisible

The first problem is that there’s no way to tell what data’s been collected on you. Facebook used to have a rule that any information third parties pulled from their API had to be deleted after 24 hours. I don’t know how many developers obeyed that rule, and neither does anyone else. Another example is Twitter’s streaming API; if somebody deletes a tweet after it’s been broadcast, users of the API are supposed to delete the message from their archives too, but again it’s opaque how often that’s honored. Collections of private, sensitive information are impossible to detect unless they’re exposed publicly. They can even be used as the inputs to all sorts of algorithms, from ad targeting to loan approvals, and we’d still never know. You can’t enforce ownership if you don’t know someone else has your data.

Data is odorless

Do I know that you like dogs from a pet store purchase, or from a photo you posted privately on Facebook, from an online survey you filled out, from a blog post you wrote, from a charitable donation you made, or from a political campaign you gave money to? It’s the same fact, but if you don’t give permission to Facebook or the pet store to sell your information, and you discover another company has it, how do you tell what the chain of ownership was? You could require the provenance-tagging approach; I know intelligence agencies have systems like that to ensure every word of every sentence of a briefing can be traced back to its source, but it’s both a massive engineering effort and easy to fake. Just pretend that you have the world’s most awesome algorithm for predicting dog-lovers from other public data, and say that’s the source. With no practical way to tell where a fact came from, you can’t assert ownership of it.

All data is PII

Gilman talked about how government departments spend a lot of time figuring out how to safely handle personally-identifiable information. One approach to making a data ownership regime more practical is to have it focus on PII, since that feels like a more manageable amount of data. The problem is that deanonymization works on almost any data set that has enough dimensions. You can be identified by your gait, by noise in your camera’s sensor, by accelerometer inconsistencies, by your taste in movies. It turns out we’re all pretty unique! That means that almost any ‘data exhaust’ that might appear innocuous could be used to derive sensitive, personal information. The example I threw out was that Jetpac has the ability to spot unofficial gay bars in repressive places like Tehran, just from the content of public Instagram photos. We try hard to avoid exposing people to harm, and don’t release that sort of information, but anyone who wanted to could do a similar analysis. When the world is instrumented, with gargantuan amounts of sensor data sloshing around, figuring out what could be sensitive is almost impossible, so putting a subset of data under ownership won’t work.
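As a back-of-the-envelope illustration of why dimensionality defeats anonymization, here's a toy sketch with randomly generated people; the attributes and their ranges are arbitrary, but the way combinations quickly become unique fingerprints is the real phenomenon.

```python
# Toy illustration: a handful of innocuous attributes quickly becomes a
# unique fingerprint. The synthetic data and ranges are arbitrary.
import random

random.seed(0)
population = 100000
people = [(random.randint(0, 99),      # age
           random.choice("MF"),        # gender
           random.randint(0, 999),     # coarse location bucket
           random.randint(0, 20),      # favorite movie genre
           random.randint(0, 50))      # phone model
          for _ in range(population)]

for dims in range(1, 6):
    counts = {}
    for person in people:
        key = person[:dims]
        counts[key] = counts.get(key, 0) + 1
    unique = sum(1 for c in counts.values() if c == 1)
    print(f"{dims} attributes: {unique / population:.1%} of people are unique")
```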

Nobody wants to own their data

The most depressing thing I’ve discovered over the years is that it’s very hard to get people interested in what’s happening to their data behind closed doors. People have been filling out surveys in magazines for decades, building up databases at massive companies like Acxiom long before the internet came along. For a price, anyone can download detailed information on people, including their salary, kids, medical conditions, military service, political beliefs, and charitable donations. The person in the street just doesn’t care. As long as it’s not causing them problems, nobody’s bothered. It matters when it affects credit scores or other outcomes, but as long as it’s just changing the mix of junk mail they receive, there’s no desire to take any action. Physical and intellectual property laws work because they build on an existing intuitive feeling of ownership. If nobody cares about ownership of their data, we’ll never pass or enforce legislation around the concept.

Privacy needs politics

I’ve been picking on Vivek’s data ownership phrase as an example, and he didn’t have a chance to outline what he truly meant by that, but in my experience every solution that relies on constraining data inputs has similar problems. We’re instrumenting our lives, we’re making the information from our sensors public, and organizations are going to exploit that data. The only way forward I see is to focus on cases where the outcomes from that data analysis are offensive. It’s what people care about after all: the abuses, the actual harms that occur because of things like redlining. The good news is that we have a whole set of social systems set up to digest new problems, come up with rules, and ensure people follow them. Vivek made the point that social mores are lagging far behind the technology, which is true. Legislators, lawyers, and journalists, the people who drive those social systems, don’t understand the new world of data we’re building as technologists. I think where we differ is that I believe it’s possible to get those folks up to speed before it’s too late. It will be a messy, painful, and always-incomplete process, but I see signs of it already.

Before anything else can happen, we need journalists to explain what’s going on to the general public. One of the most promising developments I’ve seen is the idea of reporters covering algorithms as a beat, just like they cover crime or finance. As black boxes make an increasing number of decisions about our lives, we need watchdogs who can keep an eye on them. Despite their internal complexity, you can still apply traditional investigative skills to the results. I was pleased to see a similar idea pop up in the recent White House report on Big Data too: “The increasing use of algorithms to make eligibility decisions must be carefully monitored for potential discriminatory outcomes for disadvantaged groups, even absent discriminatory intent.” Once we’ve spotted things going wrong, then we need well-crafted legislation to stop the abuse, and like Gilman I’d point to “Do not call” as a great example of how that can work.

The engineering community is generally very reluctant to get involved in traditional politics, which is why technical solutions like data ownership are so appealing to us. The trouble is we’re now at the point where the mainstream world knows that the new world of data is a big threat to privacy, and they’re going to put laws in place whether we’re involved or not. If we’re not part of the process, and if we haven’t educated the participants to a reasonable level, those laws are going to be ineffective and even counter-productive. I’m trying to do what I can by writing and talking about the realities of our new world, and through volunteering with political campaigns. I don’t have all the answers, but I truly believe the best way for us to tackle this is through the boring footwork of civil society.

Five short links


Photo by Faraz

GrubHub’s Phasmid Websites – The latest evolution of websites that appear to be official, but are actually set up by a third party to benefit from the traffic. As the cost of hosting a site keeps dropping, there will be more and more of these competing for attention. Long-term this feels like just as much of a threat to the web model as mobile app stores, since we have to trust Google to win the arms race against fakers without down-ranking obscure-but-genuine sites.

Dizzying but invisible depth – In my lifetime we’ve gone from machines that I had a chance of understanding completely, given decades to study them, to ones that no one person could ever hope to create a complete mental model of. Maybe this is the point at which CS truly needs to become a science, since we’re building ever-blacker boxes that we can only hope to comprehend by experimenting on them?

Machine-learning on a board – Large neural networks are going to be showing up in your domain soon, I promise, so I’m keeping an eye out for interesting hardware approaches like this that may help accelerate them on otherwise-modest systems.

San Francisco Survival Guide – Short but packed with the essentials you need to know. A good reminder of some things it’s easy for me to get blasé about too, both good and bad – “The inequality will shock you and continue to shock you.”

Pointer magic for efficient dynamic value representations – Bit-twiddling fun.

Hiking “Round the Mountain”, Tongariro National Park

P1010688

A few weeks ago, I was lucky enough to head to New Zealand for KiwiFoo and a few other work meetings. I only knew I’d be going about a month ahead of time, but I wanted to fit in a few days backpacking after the conference. After some research, I settled on the Round the Mountain trail, because it was between my work destinations of Auckland and Wellington on the North Island, but promised a secluded experience in the wilderness. I ended up having some wonderful moments on the hike, but it didn’t all go to plan. Since I enjoy being a cautionary example to others, here’s the story of how it went!

Preparation

Planning the route was comparatively easy, thanks to the internet. The official website was surprisingly helpful, but I also found quite a few professional guide pages and some useful personal trip reports. Looking back, I got quite an accurate idea of what I’d be tackling from the research, especially the personal posts that covered the areas that had proved difficult. The ‘ratings’ on the professional sites weren’t helpful, ranging from easy to moderate to hard, and from a distance it was tough to estimate the time the hike would take without knowing the expected pace and experience level they were based on. I ended up being overly optimistic in hoping for a three-day trip, but left myself enough room in my schedule to let it stretch to five if I needed to. The one thing that I couldn’t find anywhere, even in Auckland book stores and outdoor shops, was a paper map of the area. I was set up with a GPS, but I didn’t feel ready until I had a traditional backup. From chatting to folks at the stores, the Topo50 physical maps are no longer being stocked, since anyone can download and print them for free. That doesn’t give you a large, water- and tear-resistant map though, and it also isn’t easy to manage while you’re traveling, so I was happy when I found a good map at a gas station closer to the trailhead.

I had knee surgery last year, so even though I’d been cleared to keep hiking I wanted to be cautious. I’d been biking a fair amount, and getting in occasional hikes, but it had been over a year since my last overnight trip, and several years since I’d done serious multi-day backpacking. I spent several weeks doing two short hikes up a steep hill every day with a backpack I’d weighted down as much as I could, in the hope of building up my stamina and fitness enough to keep up a strong pace, and stay safe if the footing got tricky. I went from woefully out of hiking condition, to reasonable-but-not-great. More variety in my training hikes and at least one mountainous overnighter would have left me in a better place, but I’m glad I at least was in decent shape when I faced the trail.

Monday

After driving down from Auckland and arriving late, I stayed at the Chateau Tongariro in Whakapapa Village. This was very close to the trailhead, but with breakfast, picking up some last-minute supplies, doing the final packing of my backpack, and checking in at the visitor center to pick up the hut passes I needed, along with any advice they had, I didn’t get out on the trail until noon. I knew there was bad weather due in a couple of days, but I was committed to tackling the hike as best I could, and figured I’d see how things looked as time went on. I felt well-prepared in terms of equipment and experience, and I wasn’t going to keep going if conditions left me feeling at all unsafe, but I also knew I was hiking solo, which I normally wouldn’t recommend, especially in an unfamiliar area.

P1010652

The trail surface was beautifully maintained on that section of the walk. Gray gravel, clearly-defined edges, and lots of markers left me feeling more confident. There were some steep sections, but even with my backpack on I managed the 8.5 mile walk to Waihohonu Hut in four hours, when the official estimate was five and a half. My hope had been to continue another 7.5 miles to Rangipo Hut, but with my late start that would have involved some trekking through the dark, so I decided to make camp. I had a bivouac tent, and set up in the campground below the main hut. I did pop in to say hi to the folks gathered there and check in with the warden, but after a long conference I wasn’t feeling too social. I sensed that wasn’t culturally appropriate and that I was expected to make more of an effort to bond with the group, but I was after some time alone in the wilderness!

Tuesday

I knew this would be a long day; after my relative lack of progress on Monday, I needed to make an early start and would have a late finish. My goal was Mangaehuehu Hut, 12.5 miles further on, past Rangipo Hut. After the pace I’d kept up for the first section, I was hopeful this was reasonable. It was southern-hemisphere fall, so I only had 11 hours of daylight from my departure at 7am, but that seemed like it should be plenty. I soon discovered the trail was a lot tougher than I expected though. I’d left the section that was shared with the Tongariro ‘Great Walk’, and both the route and the condition of the trail got much worse.

P1010666

There were still frequent marker posts, but often there was no worked trail surface, just stream beds and fields of pumice. The trail wound up and down across a seemingly-endless series of valleys and ridges, and by lunchtime it was clear I was moving more slowly than I’d hoped, even falling behind the official estimates. On top of all that, this was waiting for me:

P1010677

I had prepared for the practical problems of being a solo hiker, but I hadn’t thought too much about the psychological ones. I knew the lahar valley was coming up, and had been looking forward to some interesting terrain and a tiny slice of danger when I researched it from the safety of home. When I got there, it was unexpectedly hard. The river always seemed to make a loud roaring noise, the trail was hard to see and follow, four hundred metres felt like a long way, and the terrain was technically very challenging. I got very focused on getting through as fast as I could, and wasn’t thinking clearly. As I was climbing out of the other side of the valley along the side of a cliff, I found the rock ledge I was on narrowing. Rather than stopping, looking around, and thinking, I shucked off my backpack, left it on the shelf, and inched along the ledge to the next section of trail, leaning into the cliff face and trying to keep my balance. I heard a crash, and saw my backpack had fallen off the ledge. Thankfully it was only a twenty-foot drop, but it could easily have been me. Sobered, I finally took a good look at where I was, and realized that the current trail was down below me, near where my backpack had fallen, and I’d been following an old one that had been washed away. I carefully made my way down, retrieved my backpack (thankfully nothing was damaged), and continued uneventfully out of the lahar zone.

P1010676

I left the valley chastened. I like to think I’m conservative about safety, but by not paying attention to the trail, and then forging ahead instead of backtracking when it became dangerous, I’d taken a very dumb risk. I was lucky to get away unharmed. Looking back, I could see I was so focused on the distant lahar danger that I’d lost perspective and freaked out. I don’t think I’d have made the same mistake with other people around, just the process of talking about what we were doing would have grounded me a lot more. The experience made me realize how insidious panic can be. I didn’t realize how badly my judgment had been skewed while I was in it, and it left me with a lot more compassion for the folks in news stories who do stupid things during a crisis.

Finally at around 1pm I made it to Rangipo Hut. By that point I was an hour behind schedule, tired, recovering from my scare, and not looking forward to the next section. I filled up on water, chatted to the folks staying in the hut, and heard they were planning on sitting out the coming storm for the next few days. The weather was still fine, with nothing serious expected until the next day, so I decided to press on to Mangaehuehu.

P1010681

I soon hit Waihianoa Gorge. My photo doesn’t do it justice (check out this one by another hiker to get an idea of the scale), but it was wide and deep, the trail was mostly loose scree, and I had to go all the way down it and back up the other side. The descent was treacherous, but not too dangerous when I did slip; I took a couple of falls and just got a few more scrapes and bruises. Heading up was a slog, but the trail was actually much better defined. I then headed across some very technical terrain, apparently lava flows and hills of ash and pumice, where the trail was hard to spot and seemingly little-used. I started to hit patches of forest, which made a pleasant change from the moonscapes I had been hiking through and had a much better tread, but they also posed a challenge as sunset approached.

I put on my headlamp, and picked my way through the trees and streams for about an hour in the dark. I was so tired that I was just confused by the odd behavior of the moon. One minute it was full, lighting my way, and the next time I looked, it was just a sliver. Without thinking too much, I shrugged this off as more New Zealand quirkiness, much like their flightless birds and fondness for “Yeah, no.” Of course, it was actually a lunar eclipse! Thankfully I made Mangaehuehu Hut at around 7:30pm.

It was occupied by a group of three local deer hunters who’d been there for several days. They were just getting into bed, but there was a nice set of bunks free for me. I had been planning on tent-camping the whole time, but the lure of a roof over my head was too strong, and I didn’t want to spend time setting up and dismantling a shelter again. I had some decisions to make about the next day too. I was woken up several times during the night as the remnants of Cyclone Ita brought gale-force winds to rock the hut. I’d checked the upcoming route, and if I was going to do it in one day it would involve as much hiking as I’d done that day, with much of it along exposed ridgelines.

Wednesday

I woke up at 4am, and knew I had to abandon my hope of doing the full loop and instead hike out to the nearest town. There was a break in the weather, and the hunters were headed out too, so we all set off together before dawn.

P1010689

The hunters were friendly, but competitive and insanely fit. The pack in the photo was full to the brim with venison, at least seventy pounds’ worth, and its bearer was in his sixties, but I was still hard-pressed to keep up with him. We ended up doing the 5.5 mile trail out to the road in two and a half hours after they set a blistering pace. It was a good way to end the hike on a high note. They gave me a ride down to Ohakune, where there was a hope of transportation back to Whakapapa Village.

My interaction with the visitor center there was a funny one. I wandered in, said hello to the lady behind the information desk, and told her I was looking to get transport back to Whakapapa. “Well, how do you propose to do that?” was her reply! I told her I was hoping she could suggest a solution to my dilemma, and she consulted with colleagues, rummaged behind the desk, and finally appeared clutching the card of a local taxi firm. She wasn’t willing to hand it over to me though, so keeping one eye on me, she negotiated with the driver, sharing his apparent surprise that somebody would want to be driven from one place to another in return for payment, and finally informed me that it had been arranged. I thanked her gratefully, and had an uneventful ride back to my hire car. I unloaded my gear, crawled into the back seat, and watched sleepily as wild winds and monsoon rain lashed the parking lot.

Five short links


Photo by Koeb

Right-sizing precision – A proposal to add more flexibility to floats by allowing the exponent and mantissa to be variable-length. The precision can reflect the believed accuracy of the value, which is useful information to have around. I’ve been doing a lot of neural network optimization recently by tweaking the precision of large arrays, because memory access is far slower than the unpacking of non-standard formats, so the idea of throwing more logic gates in the core to minimize RAM traffic is appealing. There’s a rough sketch of what I mean by that after these links.

What’s wrong with GNU make? – I wish I could find a good alternative, cmake is deeply painful too. Is the process of compiling complex projects across multiple platforms fated to be a time-consuming nightmare, or are we just terrible at building tools for ourselves?

What’s your C migration plan? – I’ve been using C for over twenty years and I love it dearly, but after the latest round of security flaws writing new code in the language does feel close to professional malpractice. I’m reading up on Go for when I need compiled code without a JVM.

Wire-tapping the ruins of Pompeii – Using thousands of cheap sensors to keep an eye on decaying architecture.

World airports Voronoi diagram – When I’m flying long distance, I always have a background anxiety process running, figuring out where we could glide to if the engines fall off. This handy SVG animation takes away all the guesswork!
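Since the precision link above is fairly abstract, here's the rough shape of the array-quantization trick I was alluding to, squeezing 32-bit weights into 8 bits and unpacking them on the fly; this is a generic linear-quantization sketch, not the variable-length format the link proposes.

```python
# Rough sketch of 8-bit linear quantization for a large float array, the
# kind of trade that swaps a little unpacking work for a 4x cut in RAM
# traffic. Generic scheme, not the variable-length proposal linked above.
import numpy as np

def quantize(weights):
    lo, hi = float(weights.min()), float(weights.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    packed = np.round((weights - lo) / scale).astype(np.uint8)
    return packed, lo, scale

def dequantize(packed, lo, scale):
    return packed.astype(np.float32) * scale + lo

weights = np.random.randn(1_000_000).astype(np.float32)
packed, lo, scale = quantize(weights)
restored = dequantize(packed, lo, scale)

print("bytes before:", weights.nbytes, "after:", packed.nbytes)
print("max error:", float(np.abs(weights - restored).max()))
```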

The DeepBelief SDK now works with OpenCV


Photo by Richard Almond

One of the most-requested features for the DeepBelief object recognition SDK has been integration with OpenCV. We’re actually heavy users of the framework ourselves at Jetpac, and so I’m pleased to say that it’s now easy to use it with DeepBelief!

Another frequent request was desktop support, so there are now Linux x86-64 and OS X libraries, documentation, and examples available. I’m still a big fan of Caffe and Overfeat, but DeepBelief can be handy when you need simple setup with few dependencies, a very small footprint, and when you want to train models from file system images to use later on mobile devices.
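For anyone curious what the OpenCV side of an integration like this looks like, here's a minimal sketch that grabs and resizes camera frames; the classify_frame() function is a hypothetical stand-in rather than the real DeepBelief API, and the input size is an assumption, so check the SDK's own examples for the actual calls.

```python
# Sketch of feeding OpenCV camera frames to an image classifier. The
# classify_frame() below is a placeholder, NOT the DeepBelief API; see the
# SDK's documentation and examples for the real function names.
import cv2

def classify_frame(bgr_frame):
    """Placeholder for the SDK call; here it just returns a dummy label."""
    return [("placeholder_label", 0.0)]

cap = cv2.VideoCapture(0)                    # default camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    small = cv2.resize(frame, (256, 256))    # input size is an assumption
    for label, confidence in classify_frame(small):
        print(label, confidence)
    if cv2.waitKey(1) & 0xFF == ord("q"):    # press q to quit
        break
cap.release()
```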

It’s a very exciting time for computer vision; there are a lot of new applications we can all build with the deep learning approach, and I hope this helps a few more people dive into creating their own!