FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

48
DISTRIBUTED TILE PROCESSING WITH GEOTRELLIS AND SPARK / Rob Emanuele @lossyrob

Transcript of FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

Page 1: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

DISTRIBUTEDTILE

PROCESSINGWITH

GEOTRELLIS AND SPARK / Rob Emanuele @lossyrob

Page 2: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

THE

CHALLENGE

HOW DO WE WORK WITH VERY LARGERASTER DATA?

Page 3: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

SPECIFICALLY...HOW DO WE WORK WITH THE

NASA NEX DOWN-SAMPLED CLIMATE

PROJECTIONS (NEX-DCP30)OPEN DATA SET?

Page 4: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

WHAT IS NEX CLIMATEPROJECTION DATA?

Page 5: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

GLOBAL CIRCULATION

MODELSModels for predicting world temperature and precipitation.

Page 6: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

IPCC ASSESSMENT REPORT

IPCC = Intergovernmental Panel on Climate ChangeAssessment Report 5 (AR5) published in 2014.More than 800 authors

Page 7: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

3 KEY CATEGORIES:

MODEL

33 different models

Model Ensembling

DATASET

Temperature MAX

Temperature MIN

Precipitation

SCENARIO

Historical

Future RCPs

Page 8: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

REPRESENTATIVE CONCENTRATION PATHWAYS

Page 9: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

NEX DOWN-SAMPLED DATAMonthly data over conterminous US

Historical from 1950 - 2006

4 RCP scenarios from 2006 - 2099

8190 netCDF files on S3 -

15.3 TB in compressed GeoTiff tiles.

RCP 8.5, max for datatype/model combo: 90.92 GB

s3://nasanex/NEX-DCP30

Page 10: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

OUR WORKFLOW FOR

PROCESSING NEX DATA

Page 11: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

THE TOOLS

Page 12: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

Scala library for doing all things geospatial.framework for doing distributed raster processing onAkka and Spark.Includes local, zonal, focal, and global operations onrasters.Currently in incubation at

Page 13: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

Fast and general engine for large-scale data processing

Does things Hadoop doesn't, like cache intermediate

results in memory.

Written in Scala!

Also has bindings for Python and Java

Page 14: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

ACCUMULOBig table implementation

Has sorted indexing

Columnar database

Also used by GeoMesa, another Scala project at

LocationTech

Page 15: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

STRATEGIES FOR WORKINGWITH BIG RASTERS

Page 16: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

TILES

Page 17: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

TILES

Page 18: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

INDEXING TILES

Page 19: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

is key type, based on tile indexing.

RasterRDD[K]K

SpatialKeyTemporalKeySpaceTimeKey

Page 20: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

DATA LOADING

Page 21: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

STEP 1:EXPORT THE NETCDF DATA INTO

512X512 GEOTIFF TILES.Python code using GDAL and rasterio.AWS Auto scaling groups and SQS.Code: https://github.com/lossyrob/nex-chunker-worker

Page 22: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

STEP 2:INGEST THE DATA INTO ACCUMULO

USING GEOTRELLIS-SPARK.Ingest the GeoTiffs to Accumulo in parallel across acluster.Ingest consists of

reprojectionmosaicing to tile scheme (TMS)pyramiding up zoom levelsCalculate index splits.

Page 23: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

ANALYSIS OF

NEX DATALive coding session...

Page 24: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

THANKS!Take it away Johan...

Page 25: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

THE GEOTIFFFILE FORMAT

WITH

GEOTRELLIS AND SCALA / Johan Stenberg @johanstenbergg

Page 26: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

HOW DO YOUREAD GEOTIFFSON THE JVM?

Page 27: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

GDAL, GEOSPATIAL C LIB,

FAST!

GEOTOOLS, GEOSPATIAL

JAVA LIB, SPEED?

Page 28: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

WHY YET ANOTHER GEOTIFFREADER?

GeoTools large dependency

GDAL Java bindings hard to install

Go-To raster file format at GeoTrellis

GeoTrellis is all about speed, everything optimized andbenchmarked

Page 29: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

WHAT IS THE GEOTIFF FILEFORMAT?

Extension to the Tiff File Format

Used for images with Geospatial Metadata

Adds a bounding box and the CRS through tags

Page 30: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

GEODATA?

Bounding Box easy to read

Coordinate Reference System horrible to read

Turn it into a proj4 string and use the proj4j lib to read

Page 31: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

COMPRESSIONS

Huffman, CCITT3, CCITT4, Packbits

LZW

Zip

Page 32: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

BENCHMARK

TIME!

Page 33: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

BENCHMARK DISCLAIMER

Ran on my development computer

Conducted with Caliper

Microbenchmarks, look at relative speed, not speed

GDAL is read through the Java bindings, into GeoTrellisrasters

GeoTools is also turned into GeoTrellis rasters

Page 34: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
Page 35: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
Page 36: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
Page 37: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
Page 38: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

~SAME FOR CCITT3 AND CCITT4

Page 39: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark

~SAME FOR CCITT3 AND CCITT4

Page 40: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
Page 41: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
Page 42: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
Page 43: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
Page 44: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
Page 45: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
Page 46: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
Page 47: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
Page 48: FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark