Taming Rich GML with Stetl - FOSS4G 2013 Nottingham

Post on 11-May-2015


Presentation on Sept 21, 2013 at FOSS4G 2013 in Nottingham (UK). Stetl, Streaming ETL, is a lightweight geospatial ETL framework written in Python that integrates transformation tools like GDAL/OGR, XSLT and PostGIS. Stetl targets ETL cases that involve XML and GML data, like INSPIRE data harmonization, but other transformations, even non-geospatial ones, are also possible. Stetl applies declarative programming: a configuration file specifies an ETL chain of input/filter/output modules. Stetl uses native calls to C-level libraries like libxml2 (via lxml) for speed. See more at http://stetl.org. Watch the video recording of this presentation on FOSSLC: http://www.fosslc.org/drupal/content/taming-rich-gml-stetl-lightweight-python-framework-geospatial-etl

Transcript of Taming Rich GML with Stetl - FOSS4G 2013 Nottingham

Taming Rich GML with Stetl

A lightweight Python Framework for Geospatial ETL

Just van den Broecke

FOSS4G Nottingham 2013

Sept 21, 2013

www.justobjects.nl

1

About Me

Independent Open Source Geospatial Professional

Secretary OSGeo Dutch Local Chapter

Member of the Dutch OpenGeoGroep

Just van den Broecke

just@justobjects.nl

www.justobjects.nl

2

We have a Problem

3

The Rich GML Problem

4

Rich GML = Complex Mess

5

INSPIRE Dutch National Datasets

Germany: AFIS-ALKIS-ATKIS

UK: OS Mastermap

6

“Semi GML” e.g. Dutch Addresses & Buildings (BAG)

Arbitrary Nesting

7

The Street Name!

A Street Element in an INSPIRE Annex I Address..

8

Complex Model

Transformations

9

100+ MB GML Files

10

11

Millions of Objects

12

10s of Millions of <Elements>

13

Multiple Transformation Steps

14

Solution is Spatial ETL

15

But How ?

16

FOSS ETL - DIY ? Maybe

17

FOSS ETL - High Level

18

FOSS ETL - Lower Level

Each powerful individually but cannot do the entire ETL

ogr2ogr

19

FOSS ETL - How to Combine?

[Diagram: tool logos + ogr2ogr = ?]

20

Example - 2011 INSPIRE-FOSS

http://inspire.kademo.nl/doc/design-etl.html

Good ideas but hard to scale and reuse. Need Framework

21

FOSS ETL - Add Python to Equation

[Diagram: ( tool logos + ogr2ogr ) combined with Python = ?]

22

[Diagram: ( tool logos + ogr2ogr ) combined with Python = Stetl]

23

Stetl =

Simple Streaming

Spatial Speedy

ETL

24

GML1

GML2

Stetl

From Barrels of GML to Maps

25

26

Stetl Concepts

27

Process Chain

Input Filter Filter Output

Stetl concepts

Source Target
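The process chain idea on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not Stetl's actual implementation: the `Packet` class, the `invoke` method name and `MyFilter`-style components are borrowed from later slides, while `UpperCaseFilter`, `CollectOutput` and `run_chain` are hypothetical names made up for the sketch.

```python
class Packet:
    """Carries one unit of data from component to component."""

    def __init__(self, data=None):
        self.data = data


class UpperCaseFilter:
    """Hypothetical filter: transforms the packet data and passes it on."""

    def invoke(self, packet):
        packet.data = packet.data.upper()
        return packet


class CollectOutput:
    """Hypothetical output: stores whatever arrives at the end of the chain."""

    def __init__(self):
        self.items = []

    def invoke(self, packet):
        self.items.append(packet.data)
        return packet


def run_chain(source, components):
    """Push every item from the source through all components, in order."""
    for item in source:
        packet = Packet(item)
        for component in components:
            packet = component.invoke(packet)
            if packet.data is None:
                break  # nothing left to hand to the next component


output = CollectOutput()
run_chain(["gml1", "gml2"], [UpperCaseFilter(), output])
print(output.items)
```

Each component only needs to know the packet it receives, which is what makes components interchangeable within a chain.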

28

Process Chain

Input Filter Filter Output

gml

Stetl concepts

29

Example: GML to PostGIS

Reader ogr2ogr

gml

Stetl concepts

30

Example: INSPIRE Model Transform

ogr2ogr XSLT Writer

gml

Stetl concepts

Simple Features

Complex Features

31

Example: deegree Store

ogr2ogr XSLT deegreeWriter

Stetl concepts

Or via WFS-T

32

Process Chain - How?

Input Filters Output

Stetl concepts

33

Example: XML to Shape

XML Input

XSLT Filter

ogr2ogr Output

34

Example: XML to Shape

The Source

35

Example: XML to Shape

XML Input

36

Example: XML to Shape

XML Input

XSLT Filter

37

Example: XML to Shape

Prepare XSLT Script

38

Example: XML to Shape

XSLT GML Output

39

Example: XML to Shape

XML Input

XSLT Filter

ogr2ogr Output

40

Example: XML to Shape

The Stetl Config File

Process Chain

XML Input

XSLT Filter

ogr2ogr Output
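How a config file can describe such a chain may be sketched with Python's standard `configparser`. This is an illustration under assumptions, not Stetl's own parsing code: the `[etl]` section, the `chains` option with `|` separators, and `inputs.fileinput.XmlFileInput` come from config shown elsewhere in this deck, while the other section names, the `my.*` class paths, the `script` value and the `read_chains` helper are hypothetical placeholders.

```python
import configparser

# Hypothetical config text; only [etl]/chains and the XmlFileInput class
# path are taken from the slides, the rest are placeholders.
CONFIG_TEXT = """
[etl]
chains = input_xml_file|transformer_xslt|output_ogr

[input_xml_file]
class = inputs.fileinput.XmlFileInput
file_path = input/cities.xml

[transformer_xslt]
class = my.filters.MyXsltFilter
script = cities2gml.xsl

[output_ogr]
class = my.outputs.MyOgr2OgrOutput
"""


def read_chains(config_text):
    """Return a list of chains; each chain is a list of (section, options)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    chains = []
    # Multiple chains are comma-separated, components within a chain by '|'.
    for chain_spec in parser.get("etl", "chains").split(","):
        names = [name.strip() for name in chain_spec.split("|")]
        chains.append([(name, dict(parser.items(name))) for name in names])
    return chains


chains = read_chains(CONFIG_TEXT)
for section, options in chains[0]:
    print(section, "->", options["class"])
```

Each section name in `chains` doubles as the lookup key for that component's options, so the chain definition and the component settings stay in one file.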

41

Running Stetl

stetl -c etl.cfg

42

Result Shapefile viewed in QGIS

43

Installing Stetl

via PyPI

Deps
• GDAL + Python bindings
• lxml (XML processing)
• psycopg2 (Postgres)

sudo pip install stetl

44

Speed: Streaming

Input Filter Output

gml

Stetl concepts
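Streaming means that features are handed on one at a time instead of loading a complete GML file into memory. The sketch below shows the general technique with the standard library's `xml.etree.ElementTree.iterparse`; it is not Stetl's code (Stetl uses lxml according to the description above), and the sample document and the `stream_features` helper are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"

# Tiny hypothetical sample; real files may be 100+ MB.
SAMPLE = b"""<gml:FeatureCollection xmlns:gml="http://www.opengis.net/gml">
  <gml:featureMember><City><name>Amsterdam</name></City></gml:featureMember>
  <gml:featureMember><City><name>Nottingham</name></City></gml:featureMember>
</gml:FeatureCollection>"""


def stream_features(source, tag="{%s}featureMember" % GML_NS):
    """Yield each feature element as soon as its end tag has been parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Drop the element's children once the consumer is done with it.
            elem.clear()


names = [
    feature.find("City/name").text
    for feature in stream_features(io.BytesIO(SAMPLE))
]
print(names)
```

Because `stream_features` is a generator, a downstream filter or output can start working on the first feature before the rest of the file has been read.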

45

Speed: Going Native

Input Filter Output

gml

ogr2ogr Stetl Stetl

Calls

Native C Libs/Progs

Stetl concepts

46

Example Components

Stetl concepts

Input        Filters            Output
XMLFile      XSLT               GMLFile
ogr2ogr      XMLAssembler       ogr2ogr
LineStream   XMLValidator       WFS-T
deegree*     FeatureExtractor   deegree*
YourInput    YourFilter         YourOutput

47

Example: XsltFilter Python

    from util import Util, etree
    from filter import Filter
    from packet import FORMAT

    log = Util.get_log("xsltfilter")

    class XsltFilter(Filter):
        # Constructor
        def __init__(self, configdict, section):
            Filter.__init__(self, configdict, section, consumes=FORMAT.etree_doc, produces=FORMAT.etree_doc)

            self.xslt_file_path = self.cfg.get('script')
            self.xslt_file = open(self.xslt_file_path, 'r')
            # Parse XSLT file only once
            self.xslt_doc = etree.parse(self.xslt_file)
            self.xslt_obj = etree.XSLT(self.xslt_doc)
            self.xslt_file.close()

        def invoke(self, packet):
            if packet.data is None:
                return packet
            return self.transform(packet)

        def transform(self, packet):
            packet.data = self.xslt_obj(packet.data)
            log.info("XSLT Transform OK")
            return packet

48

Your Own Components

Stetl concepts

Step 1 - Define Class

    class MyFilter(Filter):
        # Constructor
        def __init__(self, configdict, section):
            Filter.__init__(self, configdict, section, consumes=FORMAT.etree_doc, produces=FORMAT.etree_doc)

        def invoke(self, packet):
            log.info("CALLING MyFilter OK!!!!")
            return packet

Step 2 - Config Class

    [etl]
    chains = input_xml_file|my_filter|output_std

    [input_xml_file]
    class = inputs.fileinput.XmlFileInput
    file_path = input/cities.xml

    # My custom component
    [my_filter]
    class = my.myfilter.MyFilter

    [output_std]
    class = outputs.standardoutput.StandardXmlOutput
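A config value such as `class = my.myfilter.MyFilter` has to be turned into a Python class at runtime. One common way to do that is sketched below with `importlib`; whether Stetl resolves class names exactly this way is an assumption, and `load_class` is a hypothetical helper name. The demonstration uses a standard-library class because the `my.myfilter` module from the slide is not available here.

```python
import importlib


def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object it names."""
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Stand-in for a line like: class = my.myfilter.MyFilter
ordered_dict_class = load_class("collections.OrderedDict")
instance = ordered_dict_class(a=1)
print(ordered_dict_class.__name__, list(instance.items()))
```

With this approach a custom component only has to be importable on the Python path; no registration step in the framework is needed.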

49

Data Structures

Stetl concepts

• Components exchange Packets
• Packet contains data and status
• Data formats, e.g.:
    xml_line_stream
    etree_doc
    etree_element (feature)
    etree_element_array
    string
    any
    ..
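Because every component declares what it consumes and produces, a chain can be checked for compatibility before any data flows. The sketch below illustrates that idea only: the format names are the ones listed on this slide, but `ComponentSpec`, `check_chain`, the treatment of `any` as a wildcard, and the formats assigned to the example components are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComponentSpec:
    """Hypothetical description of a component's packet formats."""
    name: str
    consumes: Optional[str]  # None for inputs
    produces: Optional[str]  # None for outputs


def check_chain(components):
    """Raise ValueError if a component cannot consume its predecessor's output."""
    for producer, consumer in zip(components, components[1:]):
        # Assumption: 'any' on either side matches every format.
        if "any" in (producer.produces, consumer.consumes):
            continue
        if producer.produces != consumer.consumes:
            raise ValueError(
                "%s produces %s but %s consumes %s"
                % (producer.name, producer.produces,
                   consumer.name, consumer.consumes)
            )
    return True


good_chain = [
    ComponentSpec("input_xml_file", None, "etree_doc"),
    ComponentSpec("my_filter", "etree_doc", "etree_doc"),
    ComponentSpec("output_std", "etree_doc", None),
]
bad_chain = [
    ComponentSpec("line_input", None, "xml_line_stream"),
    ComponentSpec("my_filter", "etree_doc", "etree_doc"),
]

print(check_chain(good_chain))
```

Calling `check_chain(bad_chain)` raises a `ValueError` that names the two components whose formats do not match.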

50

deegree Integration

Stetl concepts

• Input
    DeegreeBlobstoreInput
• Output
    DeegreeBlobstoreOutput
    DeegreeFSLoaderOutput
    WFSTOutput

51

Cases - The Netherlands

• INSPIRE Download Services
    publish to deegree store (WFS)
    generate GML files (for Atom Feed)

• National GML Datasets
    GML to PostGIS (Top10NL, BGT)

52

Top10NL Extract

    [etl]
    chains = input_sql_pre|schema_name_filter|output_postgres,
             input_big_gml_files|xml_assembler|transformer_xslt|output_ogr2ogr,
             input_sql_post|schema_name_filter|output_postgres

    # Pre SQL file inputs to be executed
    [input_sql_pre]
    class = inputs.fileinput.StringFileInput
    file_path = sql/drop-tables.sql,sql/create-schema.sql

    # Post SQL file inputs to be executed
    [input_sql_post]
    class = inputs.fileinput.StringFileInput
    file_path = sql/delete-duplicates.sql

    # Generic filter to substitute Python-format string values like {schema} in string
    [schema_name_filter]
    class = filters.stringfilter.StringSubstitutionFilter
    # format args {schema} is schema name
    format_args = schema:{schema}

    [output_postgres]
    class = outputs.dboutput.PostgresDbOutput
    database = {database}
    host = {host}
    port = {port}
    user = {user}
    password = {password}
    schema = {schema}

    # The source input file(s) from dir and produce gml:featureMember elements
    [input_big_gml_files]
    class = inputs.fileinput.XmlElementStreamerFileInput
    file_path = {gml_files}
    element_tags = featureMember

Parameter Substitution
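The `{schema}`, `{host}` and similar placeholders are described in the config as Python-format string values. A minimal sketch of that kind of substitution with `str.format` follows; the SQL text, the parameter values and the `substitute` helper are hypothetical, and how Stetl receives the actual values is not shown on the slide.

```python
def substitute(text, **params):
    """Replace {name} placeholders in text with the given parameter values."""
    return text.format(**params)


# Hypothetical SQL template; the real sql/create-schema.sql is not shown.
sql_template = "DROP SCHEMA IF EXISTS {schema} CASCADE; CREATE SCHEMA {schema};"
sql = substitute(sql_template, schema="top10nl")

# The same mechanism applied to a fragment of config text.
config_template = "host = {host}\nport = {port}\nschema = {schema}"
config = substitute(config_template, host="localhost", port=5432, schema="top10nl")

print(sql)
print(config)
```

One caveat of plain `str.format`: literal braces in the template have to be doubled (`{{` and `}}`), and a placeholder without a matching parameter raises `KeyError`.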

53

Top10NL+BAG (Dutch Topo + Buildings)

54

BGT - Dutch Large Scale Topo

55

Case: INSPIRE DL Services - Dutch Addresses

Source <GML>

NLExtract Stetl deegree

WFS

INSPIRE <GML>

Atom Feed

INSPIRE Addresses

Dutch Addresses + Buildings

deegree blobstore

Stetl

56

Project Status - Sept 21, 2013

• v1.0.4 installable via PyPI
• Documentation on www.stetl.org
• Real world transforms done
• Seeking feedback, support and contributors

57

Rich GML Problem Solved?

58