Making Maps @Mapbox

Sean Gillies

Fort Collins Python Web Development Meetup∙July 14, 2014

Hi! Good evening! Thank you very much for coming out to the Meetup. My name is Sean Gillies, I work at Mapbox. I've been in the GIS business since 1999 and I've been a Python programmer since 2001.

Mapbox is a platform for creating and using maps. That means that it’s a collection of tools and services that connect together in different ways: you can draw a map on the website, use it on an iPhone, and get raw data from an API. We quite frankly want to win over the users of the Google Maps platform. This evening I'm going to cover: - How Mapbox works - How Mapbox employees work - What the satellite team is up to - How we're using Python So, starting with a look at the products and the company, I'm going to bend back towards Python :)

How Mapbox works

Making maps
Map data
Using maps

Making maps: overlays

Fundamentally, maps are made up of overlays and baselayers. Overlays are additional data added on top of a map for some particular purpose. An overlay might be a marker with your business location, or something complex like earthquake visualizations. These sit on top of the baselayer - a combination of streets, countries, and terrain or satellite data that give people a basic sense of place.

Making maps: baselayers

The easiest way to make a map is with the Mapbox.com editor. You can customize parameters of the baselayer - choosing satellite or terrain, adjusting road or area colors, or changing the language of labels. It also provides a simple interface to draw, style, and import features like markers, lines, and areas. The Mapbox.com editor is great for getting started fast, and has some superpowers over any other way of making maps. With it, you can tweak the baselayer, and instantly, everywhere in the world acquires a new look, on the fly. But sometimes you want more control, or to add more data, and for that, there’s TileMill.

Making maps: tilemill

TileMill is a free desktop application that can import and style data to your heart’s content. It can help you create everything from heatmaps to 3D buildings and custom terrain. You can even export the features you drew on Mapbox.com and import them into TileMill for more styling power. And while the Mapbox.com editor tops out at 2,000 features and might get slow with complex geometries, TileMill stays super fast. Generating big maps in TileMill? You might want to check out our upgraded plans for more storage space. The TileMill approach is not without limitations - unlike the Mapbox.com editor, you can’t instantly style a map of the world. If you’re dealing with a large geographical area and high zoom levels, it can take a while to render and will result in big files. So TileMill is a desktop application - how does it connect to Mapbox? Well, we’ve made it so that it’s easy to upload straight from TileMill to Mapbox and integrate your TileMill maps with your Mapbox.com projects.

Mapbox map data

Streets: OpenStreetMap
Imagery: DigitalGlobe and other providers
Common to all customers

Let’s look at Mapbox’s data first. The baselayer that you customize in Mapbox.com comes from a variety of sources: the road network comes from OpenStreetMap, while satellite and terrain data are derived from DigitalGlobe and other image providers. We carefully choose and curate our sources to make sure that everything from country borders to the Sahara is accurate.

This week, the data team finished adding buildings in San Francisco to OpenStreetMap. Since January, we have added or updated over 152,022 building polygons based on San Francisco building footprint data and Bing imagery. The biggest difference between Google Maps and Mapbox isn't size. It's that Mapbox uses and gives back to OSM. We're big contributors to OSM. Edits made to OSM show up in Mapbox data in less than 10 minutes.

Added map data

Next, there’s data you add in Mapbox.com or stylize in TileMill. This can be any compatible source you choose - and there are several compatible formats for TileMill as well as for the editor. This should be data that you own or have a right to use. We make sure that you can always download it - either by exporting it from the web interface or keeping it on your local computer.

Local map data

Not uploaded to Mapbox
Combined with maps in realtime
Using API or SDK

There’s a third kind of data: local data. Local data can be combined with your maps in realtime, thanks to the power of our SDKs. This can be anything from dynamic database results to files that users upload. Unlike other kinds of data, this never has to touch Mapbox’s servers because it’s combined with the map in realtime. A good example of this kind of integration is Foursquare, a company that uses Mapbox’s custom maps and overlays their own custom data with custom software.

Using maps on the web

The simplest way of sharing a map is just sending a link: this directs someone to Mapbox.com where they can look at your map through a nice little interface. Additionally, you can use embeds, which let you add a map within your existing content, like you would add an image to a blog post.

JS, you say?

Mapbox.js is a great plugin for Leaflet.

And there are a bunch of related plugins for loading and mapping all kinds of files: CSV, KML, GPX. BTW, Leaflet's author works at Mapbox.

Using maps on mobile

Native devices call for their own code: so there’s the Mapbox iOS SDK on iOS, MBXMapKit for iOS 7+, and the Mapbox Android SDK for any flavor of Android, including AOSP. Just like Mapbox.js, we build all of these tools as free and open source projects: you can customize them, peek inside, and try them out. Regardless of how you access Mapbox, whether through an SDK, embeds, or using the core API, you’re building on the Mapbox cloud service, so your custom maps will be available and pixel-perfect.

I'm now going to explain how we work at Mapbox. It's like a lot of other offices: pizza, cola, and satellite launches. That's Landsat 8. From the Washington DC office. We also have an office in SF.

Distributed is how we work

30 in Washington, DC
15 in San Francisco
12 remote: US, Canada, Europe, South America

Including Fort Collins and, very recently, Boulder. Mapbox's funders, The Foundry Group, are in Boulder, too.

Distributed is really how we work

GitHub for everything
for code
for supplies
for travel planning
for business strategy

It feels a lot like working on a bunch of open source projects. Everybody has at least a couple job descriptions and we rotate around. Now, flat and open isn't a panacea. In fact, it can be a recipe for abuse if you're hiring people purely on their ambition and tech pedigrees. So we try not to be irrationally enthusiastic about lack of managers and mertitocracy. We're also trying to equalize the gender ratio at Mapbox. Being a woman in tech should be normal, we think.

We do lots of social stuff, too. We host meetups, hackathons, mapping parties, sponsor local women's orgs. We totally enjoy this.

Trolling the establishment is also a small part of what we do at Mapbox.

Python @Mapbox

So, I've been a part of the satellite team. We use a lot of Python.

Mapbox Cloudless Atlas

The Mapbox Satellite Team is making a Cloudless Landsat mosaic of the world. This is a base layer for maps created by Mapbox users. Its primary requirements are to be current and to look good. We'll probably be working on products that are better for science in the future. This is the project that drives a lot of Rasterio development. It has interesting technical problems.

From cloudy scenes

We're making the cloudless atlas from generally cloudy scenes. The first technical problem is getting the clouds out. Everywhere. With minimal operator intervention.

Declouding

Scene selection (season and cloudiness)
Atmospheric correction and reflectance normalization
Score every pixel of every source image
No cutlines
Pixel by pixel

Cutlines and other spatial interventions don't scale well. Cloudless Atlas needs to scale, because we want to run it almost continually as new data comes in. Pixel by pixel approaches scale trivially. They worked very well for MODIS imagery. They won't work for very high resolution imagery? Will they work for Landsat? At the outset we weren't sure.

High scoring pixels

This is our approach right now: score every pixel and sort the best ones to the top. The cloudless atlas tiles are formed from the best pixels. Shown here are the "best" by some measure (darkest in this case) pixels of all the Landsat scenes shown in the previous slide.

Low scoring pixels

These are the worst (brightest) pixels. Sorting is fairly cheap (no worse than NlogN - N being the number of input scenes), so this is going to scale reasonably well.

Earth is a cloudy planet

50-70% mean cloud cover over land
Landsat 8 doesn't yet have cloud-free imagery everywhere
So we also use Landsat 7

This is the lead in to our second technical problem. We have to use Landsat 7 because we don't yet have enough L8 coverage to make a global cloudless mosaic. And Landsat 7 has some problems.

Scan Line Correction

Landsat 7's scan line corrector, which accounts for the forward motion of the satellite as it scans, failed permanently on May 31, 2003. Imagery acquired after this time has zig-zaging lines which both overlap and miss portions of the landscape.

SLC-Off imagery has characteristic no-data strips, increasing in breadth toward the edges of the scene. Middles of the scenes can have full coverage.

Landsat SLC-Off Imagery

Since May 2003
Without it we don't have enough cloudless pixels
It complicates things
A lot

Many regions are cloudy enough that the best pixels are found only in SLC-Off scenes. Putting SLC-Off imagery into our declouding pipeline leads to our second interesting technical problem.

Bands

Instead of nodata bands we have bands of seasonal or interannual incongruity. It can be especially noticeable in agricultural regions: manifesting as alternating strips of vigorous and non-vigorous vegetation (or bare ground).

Debanding

Transform the image using a Fast Fourier Transform
Locate the artifact peaks
Remove the artifacts
Transform the image back using an inverse transform

This is our approach to the debanding problem. Locating the artifact peaks is quite difficult in practice when there are multiple SLC-off images going in to our product. Modeling the orientation of the SLC-off gaps helps a lot.

Solving the problems

We have no off-the-shelf solutions
Trial and error is involved

Trial and error isn't our only method, but it's an important one. It has a big impact on our requirements.

Cloudless Atlas Software Requirements

Rapid prototyping and iteration Proven algorithms and drivers Ability to deploy to 100s of servers We need to get new iterations of our algorithms into production quickly. We need robust numerical libraries. And we need to distribute the work across many servers. And also know the cost of doing it.

Software stack

1 and 3 argue for open source
1 argues for a high-level language with handy multi-dimensional array syntax
2 argues for LAPACK (&c) and GDAL
The fit: Linux, Python, Numpy, Scipy

GDAL Python bindings?

Very C++ style interface
Failure to keep SWIG and C issues in mind results in crashing programs
We can do better

The last thing the stack needs is software to read and write geospatial raster data files like GeoTIFFs. And reproject them. GDAL has Python bindings but they are not IMHO production grade. They have terrible gotchas. They also feel too much like C++. That's no fun. programming at a higher level and high productivity are our requirements, remember?

Scipy-style raster data library requirements

Read/write ndarrays from/to data files Python types, protocols and idioms instead of C/C++ ones Free programmers from having to think about C/C++ (memory management, pointers, &c)

Rasterio's Design

A Python package at the top
Extension modules (using Cython) in the middle
Fast loops, typed memoryviews, "nogil" blocks
GDAL shared library on the bottom

Cython takes care of a lot of C extension details for us. Rasterio uses Cython features like fast loops, typed memoryviews, and "nogil" blocks.

Reading data

import rasterio

with rasterio.open(path) as src:
    bands = [src.read_band(i) for i in src.indexes]

open() gives you a file-like dataset object
read_band() gives you a Numpy ndarray
Read windows of data with extended slice-like syntax

Rasterio is geared for band-interleaved data. New read() method coming in 0.9.

Writing data

kwargs = src.meta

with rasterio.open(path, 'w', **kwargs) as dst:
    for i, arr in zip(dst.indexes, bands):
        dst.write_band(i, arr)

Get keyword args needed to open a dataset for writing from another dataset
write_band() takes an ndarray
You can also write to windows of a dataset