Distributed Tile Processing – GeoTrellis and Spark – The Challenge



reveal.js slides for a talk entitled "Distributed tile processing with GeoTrellis and Spark"

On Github lossyrob / talk-distributed-tile-processing

Distributed Tile Processing

with

GeoTrellis and Spark

Rob Emanuele / @lossyrob

The Challenge

How do we work with very large raster data?

Specifically...

How do we work with the NASA NEX Downscaled Climate Projections (NEX-DCP30) open data set?

What is NEX Climate Projection data?

General Circulation Models

Models for predicting world temperature and precipitation.

IPCC Assessment Report

  • IPCC = Intergovernmental Panel on Climate Change
  • Assessment Report 5 (AR5) published in 2014.
  • More than 800 authors

3 Key Categories:

  • Model

    • 33 different models
    • Model Ensembling
  • Dataset

    • Temperature MAX
    • Temperature MIN
    • Precipitation
  • Scenario

    • Historical
    • Future RCPs

Representative Concentration Pathways

NEX Downscaled Data

  • Monthly data over conterminous US
    • Historical from 1950 to 2005
    • 4 RCP scenarios from 2006 to 2099
  • 8190 netCDF files on S3 - s3://nasanex/NEX-DCP30
  • 15.3 TB in compressed GeoTiff tiles.
  • Largest data type/model combination under RCP 8.5: 90.92 GB

Our workflow for processing NEX data

The Tools

GeoTrellis

  • Scala library for doing all things geospatial.
  • Framework for doing distributed raster processing on Akka and Spark.
  • Includes local, zonal, focal, and global operations on rasters.
  • Currently in incubation at LocationTech.

Spark

  • Fast and general engine for large-scale data processing.
  • Does things Hadoop doesn't, like caching intermediate results in memory.
  • Written in Scala!
  • Also has bindings for Python and Java.
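
The four kinds of raster operation named in the GeoTrellis bullets (local, focal, zonal, global) can be illustrated on a tiny grid. This is a minimal concept sketch in plain Python, not GeoTrellis code: the function names are hypothetical, and a raster is assumed to be a list of rows.

```python
from statistics import mean

def local_add(a, b):
    """Local operation: combine two rasters cell by cell."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def focal_mean(grid):
    """Focal operation: mean of each cell's 3x3 neighborhood, clipped at the edges."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [grid[rr][cc]
                      for rr in range(max(0, r - 1), min(rows, r + 2))
                      for cc in range(max(0, c - 1), min(cols, c + 2))]
            out_row.append(mean(window))
        out.append(out_row)
    return out

def zonal_sum(grid, zones):
    """Zonal operation: sum the cell values that fall in each zone id."""
    totals = {}
    for value_row, zone_row in zip(grid, zones):
        for value, zone in zip(value_row, zone_row):
            totals[zone] = totals.get(zone, 0) + value
    return totals

def global_max(grid):
    """Global operation: one value computed from the whole raster."""
    return max(max(row) for row in grid)

# Hypothetical 2x2 raster and a zone raster of the same shape.
temps = [[1, 2], [3, 4]]
zones = [[0, 0], [1, 1]]
```

A local operation only needs the cell itself, a focal operation needs neighbors (and therefore neighboring tiles at tile edges), and zonal and global operations need an aggregation step across tiles.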

Accumulo

  • Implementation of Google's Bigtable design
  • Has sorted indexing
  • Columnar database
  • Also used by GeoMesa, another Scala project at LocationTech

Strategies for working with Big Rasters

Tiles

Indexing tiles

RasterRDD[K]

K is the key type, determined by how the tiles are indexed.
  • SpatialKey
  • TemporalKey
  • SpaceTimeKey
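
To make the idea of indexing tiles by key concrete, here is a small sketch of tile keys and a one-dimensional sort index derived from them. The namedtuples below are illustrative stand-ins that borrow the slide's names; they are not the GeoTrellis classes. The Z-order (Morton) curve is an assumption: it is one common way to turn a (col, row) key into a sortable index, and the slides do not say which index is used.

```python
from collections import namedtuple

# Illustrative stand-ins for the key types named above (not the GeoTrellis API).
SpatialKey = namedtuple("SpatialKey", ["col", "row"])
SpaceTimeKey = namedtuple("SpaceTimeKey", ["col", "row", "time"])

def interleave_bits(col, row, bits=16):
    """Z-order (Morton) index: interleave the bits of col and row."""
    index = 0
    for i in range(bits):
        index |= ((col >> i) & 1) << (2 * i)       # col bits go to even positions
        index |= ((row >> i) & 1) << (2 * i + 1)   # row bits go to odd positions
    return index

def z_index(key):
    """Sortable index for any key that has col and row fields."""
    return interleave_bits(key.col, key.row)

# Sorting a 2x2 block of keys by index keeps spatial neighbors close together.
keys = [SpatialKey(c, r) for r in range(2) for c in range(2)]
ordered = sorted(keys, key=z_index)
```

A sorted index like this matters for a store with sorted keys, because tiles that are near each other in space tend to land near each other on disk.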

Data loading

Step 1:

Export the netCDF data into 512x512 GeoTiff tiles.
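
Cutting a large raster into 512x512 tiles comes down to computing pixel windows. The sketch below shows only that arithmetic, under the assumption that edge tiles are clipped rather than padded. The function name is hypothetical, and the actual export tooling is not shown in the slides.

```python
def tile_windows(width, height, tile_size=512):
    """Yield (col, row, x_off, y_off, w, h) for each tile covering the raster.

    Edge tiles are clipped to the raster bounds (an assumption; padding to a
    full tile is an equally valid choice).
    """
    for row, y_off in enumerate(range(0, height, tile_size)):
        for col, x_off in enumerate(range(0, width, tile_size)):
            w = min(tile_size, width - x_off)
            h = min(tile_size, height - y_off)
            yield (col, row, x_off, y_off, w, h)
```

Each window can then be read from the source array and written out as its own GeoTiff.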

Step 2:

Ingest the data into Accumulo using GeoTrellis-Spark.

  • Ingest the GeoTiffs to Accumulo in parallel across a cluster.
  • Ingest consists of
    • reprojection
    • mosaicing to tile scheme (TMS)
    • pyramiding up zoom levels
    • calculating index splits
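
Two of the ingest steps above reduce to simple key arithmetic, sketched here under stated assumptions. In a power-of-two tile pyramid such as TMS, the tile one zoom level up is found by halving the column and row. Picking evenly spaced split points from a sorted index list is one plausible way to balance partitions, not necessarily what GeoTrellis does. All function names are hypothetical, and the resampling of pixel data during pyramiding is not shown.

```python
def parent_key(col, row):
    """Key of the tile one zoom level up that contains tile (col, row)."""
    return (col // 2, row // 2)

def pyramid_up(keys):
    """Sorted, de-duplicated parent keys for a collection of (col, row) keys."""
    return sorted({parent_key(col, row) for col, row in keys})

def split_points(sorted_indices, partitions):
    """Pick partitions - 1 evenly spaced split points from a sorted index list."""
    n = len(sorted_indices)
    return [sorted_indices[(i * n) // partitions] for i in range(1, partitions)]
```

Repeating `pyramid_up` until a single key remains yields the key sets for every coarser zoom level.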

Analysis of NEX data

Live coding session...

Thanks!

Take it away, Johan...

The GeoTiff File Format

with

GeoTrellis and Scala

Johan Stenberg / @johanstenbergg

How do you read GeoTiffs on the JVM?

  • GDAL, Geospatial C lib, fast!

  • GeoTools, Geospatial Java lib, speed?

Why yet another GeoTiff Reader?

  • GeoTools large dependency
  • GDAL Java bindings hard to install
  • Go-To raster file format at GeoTrellis
  • GeoTrellis is all about speed, everything optimized and benchmarked

What is the GeoTiff file format?

  • Extension to the Tiff File Format
  • Used for images with Geospatial Metadata
  • Adds a bounding box and the CRS through tags
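
Because GeoTiff is plain TIFF plus extra tags, the first step of any reader is walking the TIFF header and the first image file directory (IFD). The sketch below handles classic TIFF only (not BigTIFF) and is not the GeoTrellis reader. The function names are hypothetical, and `sample_tiff` builds a synthetic header purely for illustration. The tag ids listed are the standard GeoTiff ones.

```python
import struct

# Standard GeoTiff tag ids carrying the georeferencing metadata.
GEO_TAGS = {
    33550: "ModelPixelScaleTag",
    33922: "ModelTiepointTag",
    34735: "GeoKeyDirectoryTag",
}

def read_ifd_tags(data):
    """Return the tag ids found in the first IFD of a classic TIFF."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])  # little- or big-endian
    if order is None:
        raise ValueError("not a TIFF: bad byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF: magic number is not 42")
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = []
    for i in range(count):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, _field_type, _n_values = struct.unpack(order + "HHI", data[start:start + 8])
        tags.append(tag)
    return tags

def sample_tiff():
    """Synthetic little-endian TIFF header with one IFD (illustration only)."""
    entries = [(256, 3, 1, 512), (257, 3, 1, 512), (33550, 12, 3, 0), (33922, 12, 6, 0)]
    body = struct.pack("<H", len(entries))
    for tag, field_type, n_values, value in entries:
        body += struct.pack("<HHII", tag, field_type, n_values, value)
    body += struct.pack("<I", 0)  # offset of next IFD: none
    return b"II" + struct.pack("<HI", 42, 8) + body

geo_tags_present = [GEO_TAGS[t] for t in read_ifd_tags(sample_tiff()) if t in GEO_TAGS]
```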

Geodata?

  • Bounding Box easy to read
  • Coordinate Reference System horrible to read
  • Turn it into a proj4 string and use the proj4j lib to read

Compressions

  • Huffman, CCITT3, CCITT4, PackBits
  • LZW
  • Zip
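
Of the schemes listed, PackBits is the simplest and gives a feel for what a decompressor has to do. It is a byte-oriented run-length encoding in which each header byte announces either a literal run or a repeated byte. The sketch below is a minimal decoder written from the TIFF definition of the scheme, not the GeoTrellis implementation. The function name is hypothetical.

```python
def packbits_decode(data):
    """Decode a PackBits-compressed byte string."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i]
        i += 1
        if n < 128:
            # Header 0..127: copy the next n + 1 bytes literally.
            out += data[i:i + n + 1]
            i += n + 1
        elif n > 128:
            # Header 129..255 (signed -127..-1): repeat the next byte 257 - n times.
            out += bytes([data[i]]) * (257 - n)
            i += 1
        # Header 128 (signed -128) is a no-op.
    return bytes(out)

# Three repeated 0xAA bytes followed by three literal bytes.
decoded = packbits_decode(bytes([0xFE, 0xAA, 0x02, 0x80, 0x00, 0x2A]))
```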

Benchmark Time!

Benchmark Disclaimer

  • Ran on my development computer
  • Conducted with Caliper
  • Microbenchmarks: look at relative speed, not absolute speed
  • GDAL is read through the Java bindings, into GeoTrellis rasters
  • GeoTools is also turned into GeoTrellis rasters

~same for CCITT3 and CCITT4

Sidenote about Speed

  • Scala is slow when using functional mappings
  • Use arrays, while loops, and bit operations instead
  • Skip Big-O time complexity analysis (O(n) - duh), use microbenchmarks

Future?

  • Tons of compressions, JPEG hard but needed
  • Keep up to date with custom tags
  • Add a shape file reader (GeoTools is fast!)

QUESTIONS?

Benchmarks found at https://github.com/geotrellis/benchmark

http://geotrellis.io