Rasterio – Geospatial Raster Data Access for Programmers and Future Programmers

Rasterio – Geospatial Raster Data Access for Programmers and Future Programmers

0 1


Presentation for SciPy

On Github sgillies / scipy-2014-rasterio


Geospatial Raster Data Access for Programmers and Future Programmers

Sean Gillies @Mapbox

SciPy 2014 ∙ Austin ∙ July 9

Hi! Good morning! Thank you very much for coming to this sessiom. My name is Sean Gillies, I work at Mapbox. I've been a Python programmer since 2001 and a GIS analyst and programmer since 1999, with a séjour in the digital classics from 2006 to 2013. This is my first SciPy and I'm super happy to be here.

Programming with Raster data

  • What's the Mapbox Satellite team up to?
  • Why Linux, GDAL, and SciPy?
  • Why a new Python geospatial data library?
  • What is Rasterio and how do you use it?
My talk is organized around four questions. What's the problem? What's the approach? Why are we writing new geodata access libraries? How do you use it yourselves?

Mapbox Cloudless Atlas

The Mapbox Satellite Team is making a Cloudless Landsat mosaic of the world. This is a base layer for maps created by Mapbox users. Its primary requirements are to be current and to look good. We'll probably be working on products that are better for science in the future. This is the project that drives a lot of Rasterio development. It has interesting technical problems.

From cloudy scenes

We're making the cloudless atlas from generally cloudy scenes. The first technical problem is getting the clouds out. Everywhere. With minimal operator intervention.


  • Scene selection (season and cloudiness)
  • Atmospheric correction and reflectance normalization
  • Score every pixel of every source image
  • No cutlines
  • Pixel by pixel
Cutlines and other spatial interventions don't scale well. Cloudless Atlas needs to scale, because we want to run it almost continually as new data comes in. Pixel by pixel approaches scale trivially. They worked very well for MODIS imagery. They won't work for very high resolution imagery? Will they work for Landsat? At the outset we weren't sure.

High scoring pixels

This is our approach right now: score every pixel and sort the best ones to the top. The cloudless atlas tiles are formed from the best pixels. Shown here are the "best" by some measure (darkest in this case) pixels of all the Landsat scenes shown in the previous slide.

Low scoring pixels

These are the worst (brightest) pixels. Sorting is fairly cheap (no worse than NlogN - N being the number of input scenes), so this is going to scale reasonably well.

Earth is a cloudy planet

  • 50-70% mean cloud cover over land
  • Landsat 8 doesn't yet have cloud-free imagery everywhere
  • So we also use Landsat 7
This is the lead in to our second technical problem. We have to use Landsat 7 because we don't yet have enough L8 coverage to make a global cloudless mosaic. And Landsat 7 has some problems.

Scan Line Correction

Landsat 7's scan line corrector, which accounts for the forward motion of the satellite as it scans, failed permanently on May 31, 2003. Imagery acquired after this time has zig-zaging lines which both overlap and miss portions of the landscape.
SLC-Off imagery has characteristic no-data strips, increasing in breadth toward the edges of the scene. Middles of the scenes can have full coverage.

Landsat SLC-Off Imagery

  • Since May 2003
  • Without it we don't have enough cloudless pixels
  • It complicates things
  • A lot
Many regions are cloudy enough that the best pixels are found only in SLC-Off scenes. Putting SLC-Off imagery into our declouding pipeline leads to our second interesting technical problem.


Instead of nodata bands we have bands of seasonal or interannual incongruity. It can be especially noticeable in agricultural regions: manifesting as alternating strips of vigorous and non-vigorous vegetation (or bare ground).


  • Transform the image using a Fast Fourier Transform
  • Locate the artifact peaks
  • Remove the artifacts
  • Transform the image back using an inverse transform
This is our approach to the debanding problem. Locating the artifact peaks is quite difficult in practice when there are multiple SLC-off images going in to our product. Modeling the orientation of the SLC-off gaps helps a lot.

Solving the problems

  • We have no off-the-shelf solutions
  • Trial and error is involved
Trial and error isn't our only method, but it's an important one. It has a big impact on our requirements.

Cloudless Atlas Software Requirements

Rapid prototyping and iteration Proven algorithms and drivers Ability to deploy to 100s of servers We need to get new iterations of our algorithms into production quickly. We need robust numerical libraries. And we need to distribute the work across many servers. And also know the cost of doing it.

Software stack

  • 1 and 3 argue for open source
  • 1 argues for a high-level language with handy multi-dimensional array syntax
  • 2 argues for LAPACK (&c) and GDAL
  • The fit: Linux, Python, Numpy, Scipy

GDAL Python bindings?

  • Very C++ style interface
  • Failure to keep SWIG and C issues in mind results in crashing programs
  • We can do better
The last thing the stack needs is software to read and write geospatial raster data files like GeoTIFFs. And reproject them. GDAL has Python bindings but they are not IMHO production grade. They have terrible gotchas. They also feel too much like C++. That's no fun. programming at a higher level and high productivity are our requirements, remember?

Scipy-style raster data library requirements

Read/write ndarrays from/to data files Python types, protocols and idioms instead of C/C++ ones Free programmers from having to think about C/C++ (memory management, pointers, &c)

Rasterio's Design

  • A Python package at the top
  • Extension modules (using Cython) in the middle
  • Fast loops, typed memoryviews, "nogil" blocks
  • GDAL shared library on the bottom
Cython takes care of a lot of C extension details for us. Rasterio uses Cython features like fast loops, typed memoryviews, and "nogil" blocks.

Reading data

import rasterio

with rasterio.open(path) as src:
    bands = [src.read_band(i) for i in src.indexes]
  • open() gives you a file-like dataset object
  • read_band() gives you a Numpy ndarray
  • Read windows of data with extended slice-like syntax
Rasterio is geared for band-interleaved data. New read() method coming in 0.9.

Writing data

kwargs = src.meta

with rasterio.open(path, 'w', **kwargs) as dst:
    for i, arr in zip(dst.indexes, bands):
        dst.write_band(i, arr)
  • Get keyword args needed to open a dataset for writing from another dataset
  • write_band() takes an ndarray
  • You can also write to windows of a dataset
Rasterio is geared for band-interleaved data. New read() method coming in 0.9.


Rasterio follows the lead of pyproj
>>> import rasterio
>>> src = rasterio.open('rasterio/tests/data/RGB.byte.tif')
>>> src.crs
{'units': 'm', 'zone': 18, 'ellps': 'WGS84', 'proj': 'utm', 'no_defs': True}


  • rasterio.features.shapes() yields all the features of an array as GeoJSON-like objects
  • rasterio.features.rasterize() "burns" GeoJSON-like objects into an array
  • Dicts, iterators, tuples, ndarrays
  • No datasets or layers necessary
{'coordinates': [[(71.0, 6.0), ...]], 'type': 'Polygon'}, ...


  • rasterio.warp.reproject() maps elements of one array to another, using cartographic projections
  • No datasets or layers required
  • Data created in non-GIS programs can be reprojected for use with GIS programs


  • The Mapbox Satellite team has interesting problems
  • Linux, Python, Numpy, Scipy, GDAL, and Rasterio are big parts of the solutions
  • Rasterio aims to make GIS data more accessible to Python programmers
  • And help GIS analysts learn important Python protocols and idioms
In conclusion.



Debanding Blog Post

https://www.mapbox.com/blog/debanding-the-world/ by Amit Kapadia

Thank You!