Fiona and Rasterio – Data Access for Python Programmers and Future Python Programmers



Fiona and Rasterio – Data Access for Python Programmers and Future Python Programmers

1 7


foss4g-2014-fiona-rasterio

FOSS4G 2014 Presentation

On Github sgillies / foss4g-2014-fiona-rasterio

Fiona and Rasterio

Data Access for Python Programmers and Future Python Programmers

Sean Gillies @Mapbox

FOSS4G 2014 ∙ Portland ∙ September 10

Hi! Good morning! Thank you very much for coming to this sessiom. My name is Sean Gillies, I work at Mapbox. I've been a Python programmer since 2001 and a GIS analyst and programmer since 1999, with a séjour in the digital classics from 2006 to 2013. I've been coming to FOSS4G off and on since it was the MapServer User Meeting.

New Python packages

  • Fiona and Rasterio are Python interfaces to OGR and GDAL
  • Each are native GeoJSON speakers
  • Each embrace the good parts of Python
  • Each embrace the command line
My talk is about new Python geospatial software and a double challenge in developing it. Why are we writing new geodata access libraries? How do you use it yourselves?

A Double Challenge

  • Helping experienced Python programmers learn GIS concepts
  • Helping GIS experts learn to be better Python programmers
My talk is about new Python geospatial software and a double challenge in developing it. Why are we writing new geodata access libraries? How do you use it yourselves?

Mapbox Cloudless

The Mapbox Satellite Team is making a Cloudless Landsat mosaic of the world. This is a base layer for maps created by Mapbox users. Its primary requirements are to be current and to look good. We'll be working on products that are better for science in the future. This is the project that drives a lot of Rasterio development. It has interesting technical problems.

We have interesting problems

  • Off-the-shelf solutions are scarce
  • Trial and error is involved
  • We need new software
Trial and error isn't our only method, but it's an important one. It has a big impact on our requirements.

Software requirements

Allow rapid prototyping and iteration Use dependable algorithms Scale out to many computers We need to get new iterations of our algorithms into production quickly. We need robust numerical libraries. And we need to distribute the work across many servers. And also know the cost of doing it.

Fulfilling requirements

  • 1 and 3 argue for open source
  • 1 argues for a high-level language with handy multi-dimensional array syntax
  • 2 argues for LAPACK (&c) and GDAL
  • The fit: Linux, Python, Numpy, Scipy

GDAL Python bindings?

  • They've served us long and well
  • But they don't fit well with the good parts of Python
  • We can do better
We can do better while keeping the power of the GDAL and OGR libraries.

The good parts

What do I mean by the good parts of Python? I've made a Venn diagram of unique good features of three common languages. The "good parts" of each. It's not exhaustive, I'm leaving out some good parts. We can argue over whether I got these exactly right and I hope we will. The image is derived from http://commons.wikimedia.org/wiki/File:Venn_diagram_cmyk.svg. Anybody recognize the yellow language? I'll give you a hint: the other two are implemented in it. Right: C/C++. Anybody recognize the magenta language? Right, javascript. Anybody recognize the blue language? Right, Python. The good parts of Python are lacking in Javascript and C. Wrapper generators like SWIG give you the features in the overlap between Python and C, but have a hard time with the good parts of the dynamic language. Bottom line is that GDAL's Python bindings aren't a good match for the good parts of Python. I'm thinking about the design of D3 and Mike Bostock's explanation of what its user gain by D3's embrace of the DOM and browser standards. I'm convinced that spatial programmers can similarly benefit from software that embraces Python.

OGR Example

from osgeo import ogr

# Create ring
ring = ogr.Geometry(ogr.wkbLinearRing)
ring.AddPoint(1179091.1646903288, 712782.8838459781)
ring.AddPoint(1161053.0218226474, 667456.2684348812)
ring.AddPoint(1214704.933941905, 641092.8288590391)
ring.AddPoint(1228580.428455506, 682719.3123998424)
ring.AddPoint(1218405.0658121984, 721108.1805541387)
ring.AddPoint(1179091.1646903288, 712782.8838459781)

# Create polygon
poly = ogr.Geometry(ogr.wkbPolygon)
poly.AddGeometry(ring)
[source: http://pcjericks.github.io/py-gdalogr-cookbook/] This is the classic way to create an osgeo.ogr geometry object. It could be tightened up a bit.

Analogy

# C style
nums = []
for i in range(100):
    nums.append(i)
vs
# Python style
nums = list(range(100))
I've got an analogy to Python list construction. First, I'm showing the most imperative, C-like way of making a Python list of the first 100 numbers. Next, the most idiomaticly Python way of doing it.

Not just less code – faster

$ python -m timeit -s "nums=[]" "for i in xrange(1000):" "  nums.append(i)"
10000 loops, best of 3: 76.4 usec per loop
vs
$ python -m timeit -s "nums=list(xrange(1000))"
100000000 loops, best of 3: 0.0108 usec per loop
The idomaticly Python way is faster. ridiculously faster! Do this in your code. Provide APIs like this in your Python packages.

Literal syntax: {} for geometries

# Create polygon
poly = {
    'type': 'Polygon',
    'coordinates': [[
        (1179091.1646903288, 712782.8838459781),
        (1161053.0218226474, 667456.2684348812),
        (1214704.933941905, 641092.8288590391),
        (1228580.428455506, 682719.3123998424),
        (1218405.0658121984, 721108.1805541387),
        (1179091.1646903288, 712782.8838459781) ]]}
30X faster than OGR with AddPoint()

Pay only for what you eat

# This is "dumb" data, no methods.
data = {
    'type': 'Polygon',
    'coordinates': [[
        (1179091.1646903288, 712782.8838459781),
        ...
        (1179091.1646903288, 712782.8838459781) ]]}

# When you need spatial methods, bring in Shapely.
from shapely.geometry import shape
print shape(data).buffer(100.0).area

Winning the Double Challenge

  • Python programmers get GIS data access via familiar Python idioms (dicts, iterators, &c.)
  • Future Python programmers learn to do things in the fast and effective Python way

Design of Fiona and Rasterio

  • A Python package at the top
  • Extension modules (using Cython) in the middle
  • Fast loops, typed memoryviews, "nogil" blocks
  • GDAL shared library on the bottom
Cython takes care of a lot of C extension details for us. Rasterio uses Cython features like fast loops, typed memoryviews, and "nogil" blocks.

Reading raster data

import rasterio

with rasterio.open(path) as src:
    data = src.read()
  • open() gives you a file-like dataset object
  • read() gives you a Numpy ndarray
  • Read windows of data with extended slice-like syntax
Rasterio is geared for band-interleaved data. New read() method coming in 0.9.

Reading vector data

import fiona

with fiona.open(path) as src:
    first = next(src)
  • open() gives you a file-like iterator
  • next() gives you the next feature record from the iterator
  • Records are Python dicts

Writing raster data

kwargs = src.meta

with rasterio.open(path, 'w', **kwargs) as dst:
    dst.write(arr)
  • Get keyword args needed to open a dataset for writing from another dataset
  • write() takes an ndarray
  • You can also write to windows of a dataset

Writing vector data

kwargs = src.meta

with fiona.open(path, 'w', **kwargs) as dst:
    dst.write(record)
  • Get keyword args needed to open a dataset for writing from another dataset
  • write() takes a feature record dict

Georeferencing

Fiona and Rasterio follow the lead of pyproj
>>> import rasterio
>>> src = rasterio.open('tests/data/RGB.byte.tif')
>>> src.crs
{'units': 'm', 'zone': 18, 'ellps': 'WGS84', 'proj': 'utm', 'no_defs': True}
PROJ.4 items as a dict.

rasterio.features module

  • rasterio.features.shapes() yields all the features of an array as GeoJSON-like objects
  • rasterio.features.rasterize() "burns" GeoJSON-like objects into an array
  • Dicts, iterators, tuples, ndarrays
  • No datasets or layers necessary
{'coordinates': [[(71.0, 6.0), ...]], 'type': 'Polygon'}, ...

rasterio.warp module

  • rasterio.warp.reproject() maps elements of one array to another, using cartographic projections
  • No datasets or layers required
  • Data created in non-GIS programs can be reprojected for use with GIS programs

In the Beginning...

Was the Command Line

Grep is cool, but

$ gdalinfo tests/data/RGB.byte.tif | grep -c '^Band \d'
3
there's gotta be a better way

Command Line Fun

$ rio --help
Usage: rio [OPTIONS] COMMAND [ARGS]...

  Rasterio command line interface.

Options:
  -v, --verbose  Increase verbosity.
  -q, --quiet    Decrease verbosity.
  --help         Show this message and exit.

Commands:
  bounds     Write bounding boxes to stdout as GeoJSON.
  info       Print information about a data file.
  insp       Open a data file and start an interpreter.
  merge      Merge a stack of raster datasets.
  shapes     Write the shapes of features.
  transform  Transform coordinates.

rio info

$ rio info tests/data/RGB.byte.tif --indent 2 
{
  "count": 3,
  "crs": "EPSG:32618",
  "dtype": "uint8",
  "driver": "GTiff",
  "bounds": [
    101985.0,
    2611485.0,
    339315.0,
    2826915.0
  ],
  "height": 718,
  "width": 791,
  "nodata": 0.0
}

Single items

$ rio info tests/data/RGB.byte.tif --count
3
$ rio info tests/data/RGB.byte.tif --crs
EPSG:32618
$ rio info tests/data/RGB.byte.tif --bounds
101985.0 2611485.0 339315.0 2826915.0

rio.shapes | geojsonio

$ rio shapes /tests/data/shade.tif --precision 5 \
> | underscore extract features.112 \
> | geojsonio
Assists from underscore-cli and geojson-cli

New Releases

  • Fiona 1.2
  • Rasterio 0.13
  • Shapely 1.4

Thanks

  • Mapbox satellite team (Chris, Charlie, Amit, Bruno, Camilla)
  • Asger Petersen, Mike Toews, Brendan Ward, Kelsey Jordahl, René Buffat, Jacob Wasserman, Oliver Tonnhofer, Joshua Arnott, Phil Elson, Matt Perry
  • And especially Frank Warmerdam and Even Rouault

Links

THE END