PCIC Data Portal 2.0

PCIC Data Portal 2.0, Staff Meeting. James Hiebert, February 18, 2014

description

Presentation to the staff of the Pacific Climate Impacts Consortium on 2014/02/18 about its Computational Support Group's work on version 2.0 of the PCIC Data Portal.

Transcript of PCIC Data Portal 2.0

Page 1: PCIC Data Portal 2.0

PCIC Data Portal 2.0 Staff Meeting

James Hiebert

February 18, 2014

Page 2: PCIC Data Portal 2.0

Outline

1 Demos

2 Architecture: Metadata Database, Python Backend, Pydap, ncWMS, Basemaps, Front-end

3 Bonus: Automated Testing

Page 3: PCIC Data Portal 2.0


1. Last week we deployed our fourth and hopefully final release candidate for version 2.0 of the PCIC Data Portal. It’s been a four-month beta period over which we have received and responded to feedback, both from inside PCIC and from some external beta testers. Many of you have seen these at the various theme meetings that we had throughout the fall, but I’d like to take this opportunity both to introduce the rest of you to the data portal and to elaborate on what is running behind the scenes and all of the work that has gone into producing it.

2. Typically in these presentations, I hold you captive with all of the technical details first and save the demo for the end. But in this case, I’ll start with the demo, and then if you don’t care about how we did it, you can just check out after that.

Page 4: PCIC Data Portal 2.0

Raster Portal(s)

Coming soon!

Page 5: PCIC Data Portal 2.0


1. The software that we have written is a variety of components that generally handle the organization and presentation of raster data; that is, gridded fields of spatiotemporal data. There are several sets of high-value data for which we have written a “raster portal” that can serve that data up.

Page 6: PCIC Data Portal 2.0

BCSD Downscale Canada

Page 7: PCIC Data Portal 2.0


1. You’ll see that the feature set is intentionally fairly sparse. The application’s purpose is to allow users to get the data they want, and only the data they want, and then to send them on their way. The main section of screen real estate is the map. The map displays the areas for which data exists and allows the user to select an area for which to download.

2. In the top right, there is a tree selection which controls the dataset that is displayed and that will be downloaded. And finally, there are a couple of options for selecting a time range and data format.

3. We only support formats which support multidimensional data, which isn’t very many right now. We’ll be adding Arc ASCII Grid by the end of the fiscal year, which isn’t technically multidimensional, but we’ll probably send a zip file of individual grids, one per timestep.
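As a rough illustration of the "zip file of individual grids" idea, here is a minimal sketch using only the Python standard library. The function names, file naming scheme and sample values are assumptions made up for this example; the portal's eventual implementation may look quite different.

```python
import io
import zipfile

def ascii_grid(rows, xllcorner, yllcorner, cellsize, nodata=-9999):
    """Format one 2D grid (a list of rows, north to south) as Arc ASCII Grid text."""
    header = [
        "ncols %d" % len(rows[0]),
        "nrows %d" % len(rows),
        "xllcorner %s" % xllcorner,
        "yllcorner %s" % yllcorner,
        "cellsize %s" % cellsize,
        "NODATA_value %s" % nodata,
    ]
    body = [" ".join(str(value) for value in row) for row in rows]
    return "\n".join(header + body) + "\n"

def zip_timesteps(grids, **grid_kwargs):
    """Pack one .asc file per timestep into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for label, rows in grids:
            archive.writestr("%s.asc" % label, ascii_grid(rows, **grid_kwargs))
    return buffer.getvalue()

# Two made-up 2x2 timesteps (hypothetical labels and values)
grids = [("t000", [[1, 2], [3, 4]]), ("t001", [[5, 6], [7, 8]])]
payload = zip_timesteps(grids, xllcorner=-140.0, yllcorner=48.0, cellsize=0.5)
```

Each timestep becomes its own two-dimensional file, which is how a format without a time dimension can still carry a time series.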

Page 8: PCIC Data Portal 2.0

BC PRISM

Page 9: PCIC Data Portal 2.0


1. The BC PRISM portal is very similar to the BCSD Downscaling portal, with a few minor differences. First of all, the map projection is specific to BC. We’ve used the BC Albers projection, which is a little more visually appealing (though it does present some challenges). Secondly, because the PRISM data consists only of monthly climatologies, the data volume in the temporal dimension is very small. For that reason, we eliminated the time subset controls and chose just to give the user the entire time range.

Page 10: PCIC Data Portal 2.0

VIC (Generation 1)

Page 11: PCIC Data Portal 2.0

Software Components

Page 12: PCIC Data Portal 2.0


1. One thing that you’ll notice from this diagram is that the data itself is at the foundation of this software stack. Without the data in place beforehand, essentially nothing else can exist. Even the metadata in the database comes from the NetCDF files. This is why we have been somewhat militant about wanting your data to be finalized before we begin to work on the portal to publish it.

2. The NetCDF box here is the only thing that is just data sitting on disk. These four boxes (PostgreSQL, ncWMS, pydap, pdp) are all different pieces of software running on the server which respond to incoming web requests. PostgreSQL organizes all of the metadata about the available data, ncWMS provides the climate visualization layers, pydap responds to requests for the actual data, and pdp responds to all of the requests that build up the user interface. [Do a page load showing the network tools]

Page 13: PCIC Data Portal 2.0

Metadata Database

Page 14: PCIC Data Portal 2.0


1. This might be a bit too much detail, but try to bear with me. This database stores the full relationship structure between all of the data files that we store and want to publish. It tracks all of the files on disk that we have, all of the different variables that they contain, and the full range of each variable so that we can quickly set color scales and such for the visualization layers. It stores all of the metadata about the files, such as the timesteps that they contain, what their grid parameters are, what models they are from and how they relate to other driving models (for example, in the case of an RCM forced by a GCM). All of these can be grouped into “ensembles”; an ensemble is a group of rasters that we are publishing together on a single portal page.

2. The data contained in the schema allows the web application to function quickly, because everything is quickly searchable without opening up a bunch of files and having to read terabytes of data just to determine a few key attributes.
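To make the "searchable without opening files" point concrete, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table names, column names and rows are all hypothetical and far simpler than the real schema; the point is only that variable ranges and ensemble membership come back from one query, with no NetCDF file being read.

```python
import sqlite3

# Hypothetical, heavily simplified schema; the real PCIC schema differs.
SCHEMA = """
CREATE TABLE data_files (id INTEGER PRIMARY KEY, filename TEXT, model TEXT);
CREATE TABLE variables (
    id INTEGER PRIMARY KEY,
    file_id INTEGER REFERENCES data_files(id),
    name TEXT,
    range_min REAL,
    range_max REAL
);
CREATE TABLE ensemble_members (
    ensemble TEXT,
    variable_id INTEGER REFERENCES variables(id)
);
"""

def build_demo_db():
    """Create an in-memory database holding a couple of made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO data_files VALUES (1, 'pr_example.nc', 'example_model')")
    conn.execute("INSERT INTO data_files VALUES (2, 'tmax_example.nc', 'example_model')")
    conn.execute("INSERT INTO variables VALUES (1, 1, 'pr', 0.0, 250.0)")
    conn.execute("INSERT INTO variables VALUES (2, 2, 'tmax', -40.0, 45.0)")
    conn.execute("INSERT INTO ensemble_members VALUES ('demo_ensemble', 1)")
    conn.execute("INSERT INTO ensemble_members VALUES ('demo_ensemble', 2)")
    return conn

def variable_ranges(conn, ensemble):
    """Look up color-scale ranges for an ensemble without opening any data file."""
    rows = conn.execute(
        """
        SELECT v.name, f.filename, v.range_min, v.range_max
        FROM ensemble_members AS e
        JOIN variables AS v ON v.id = e.variable_id
        JOIN data_files AS f ON f.id = v.file_id
        WHERE e.ensemble = ?
        ORDER BY v.name
        """,
        (ensemble,),
    )
    return rows.fetchall()

conn = build_demo_db()
ranges = variable_ranges(conn, "demo_ensemble")
```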

Page 15: PCIC Data Portal 2.0

Python Backend

Page 16: PCIC Data Portal 2.0


1. We have written a full web application backend in Python which does all of the file format translation and all of the database communication, and passes all of the metadata on to the web UI to be interpreted by the user. The application consists of about 2800 lines of Python code plus 1500 lines of testing code that we have written outright. There are about another 3000 lines of code that make up PyDAP, which we have heavily modified.

Page 17: PCIC Data Portal 2.0

Python Backend

    ensemble_name = 'bc_prism'

    # Per-portal settings: page title, ensemble to publish, JavaScript to load
    portal_config = {
        'title': 'High-Resolution Climatology',
        'ensemble_name': ensemble_name,
        'js_files': wrap_mini([
            'js/prism_demo_map.js',
            'js/prism_demo_controls.js',
            'js/prism_demo_app.js'],
            basename='bc_prism', debug=True)
    }

    portal_config = updateConfig(global_config, portal_config)
    map_app = wrap_auth(MapApp(**portal_config), required=False)

    dsn = dsn + '?application_name=pdp_prism'
    with session_scope(dsn) as sesh:
        conf = db_raster_configurator(sesh, "Download Data", 0.1, 0, ensemble_name,
            root_url=global_config['app_root'].rstrip('/') + '/' +
            ensemble_name + '/data/'
        )
        data_server = wrap_auth(RasterServer(dsn, conf))
        catalog_server = RasterCatalog(dsn, conf)  # No Auth

    menu = PrismEnsembleLister(dsn)

    # Route each URL pattern to the component that serves it
    portal = PathDispatcher([
        ('^/map/?.*$', map_app),
        ('^/catalog/.*$', catalog_server),
        ('^/data/.*$', data_server),
        ('^/menu.json.*$', menu)
    ])
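The portal definition above hands a list of URL patterns to a dispatcher. As a rough sketch of how that kind of routing can work, here is a tiny dispatcher written as a WSGI callable. This `PathDispatcher` is a hypothetical re-implementation for illustration only (it is not the portal's actual class), and the two toy apps and the `call` helper are made up.

```python
import re

class PathDispatcher:
    """Hypothetical sketch: send each request to the first app whose regex matches."""

    def __init__(self, routes):
        self.routes = [(re.compile(pattern), app) for pattern, app in routes]

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        for pattern, app in self.routes:
            if pattern.match(path):
                return app(environ, start_response)
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not Found"]

def make_text_app(text):
    """Build a toy WSGI app that always answers with the given text."""
    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [text.encode("utf-8")]
    return app

portal = PathDispatcher([
    (r"^/map/?.*$", make_text_app("map")),
    (r"^/data/.*$", make_text_app("data")),
])

def call(app, path):
    """Invoke a WSGI app directly and return (status, body)."""
    seen = {}
    def start_response(status, headers):
        seen["status"] = status
    body = b"".join(app({"PATH_INFO": path}, start_response))
    return seen["status"], body
```

Because every component is just a callable behind a pattern, authentication wrappers like the `wrap_auth` calls in the slide can be layered onto individual routes without touching the others.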

Page 18: PCIC Data Portal 2.0

OPeNDAP and PyDAP

Designed to be a:

“discipline-neutral means of requesting and providing data across the [web]”

Data Access Protocol (DAP)

Open source

Machine-to-machine transfer of scientific data

Mostly supported by US scientific agencies (NOAA, NASA, NSF)

Page 19: PCIC Data Portal 2.0


1. PyDAP is the component of the data portal that actually provides the data download services. It’s an implementation of the OPeNDAP protocol, which is designed to be a discipline-neutral means of transferring data across the web. This protocol is open source and is designed to be OS- and application-independent, so that you can get data into whatever software you want to use to do your data analysis. It’s supported mostly by US scientific agencies such as NOAA, NASA and the National Science Foundation.

Page 20: PCIC Data Portal 2.0

OPeNDAP and PyDAP

Page 21: PCIC Data Portal 2.0


1. There are a number of different OPeNDAP servers out there, but PyDAP is the one that we use to serve all of the data itself. Its architecture is quite a bit more flexible than that of some of the other OPeNDAP servers. This is a rough layout of the architecture. It has a number of “handlers”, which are written to interpret different data formats and translate them to the DAP structure. Then on the top end, there are numerous “responders” that translate the DAP structure into the output formats that the user wants.

2. [describe more specifically which parts are ours and which we use]
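The handler/responder split can be sketched as two lookup tables around one neutral in-memory structure. Everything below is invented for illustration: the function names, the plain dict standing in for the "DAP structure", and the two toy formats are assumptions, not PyDAP's actual API.

```python
import csv
import io
import json

def csv_handler(text):
    """Parse CSV text into a neutral structure: {column name: list of values}."""
    reader = csv.DictReader(io.StringIO(text))
    dataset = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            dataset[name].append(float(value))
    return dataset

def json_handler(text):
    """Parse JSON text that is already shaped like the neutral structure."""
    return json.loads(text)

def ascii_response(dataset):
    """Render the neutral structure as simple 'name: values' lines."""
    return "\n".join(
        "%s: %s" % (name, ", ".join(str(v) for v in values))
        for name, values in sorted(dataset.items())
    )

def json_response(dataset):
    """Render the neutral structure as JSON."""
    return json.dumps(dataset, sort_keys=True)

# Input formats on one side, output formats on the other
HANDLERS = {"csv": csv_handler, "json": json_handler}
RESPONSES = {"ascii": ascii_response, "json": json_response}

def serve(text, input_format, output_format):
    """Any input format can be paired with any output format."""
    dataset = HANDLERS[input_format](text)
    return RESPONSES[output_format](dataset)

as_json = serve("time,tmax\n0,1.5\n1,2.5\n", "csv", "json")
as_ascii = serve("time,tmax\n0,1.5\n1,2.5\n", "csv", "ascii")
```

The appeal of this layout is that supporting a new storage format or a new download format means adding one function to one table, rather than one converter per pair of formats.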

Page 22: PCIC Data Portal 2.0

How much of Pydap is our code? (Source: hg churn)

pydap.handlers.pcic: 100%

pydap.handlers.hdf5: 68.0%

pydap.responses.netcdf: 61.5%

pydap.handlers.sql: 12.3%

pydap.handlers.csv: 3.7%

pydap: 2.3%

pydap.responses.xls: 1.3%

pydap.responses.html: ?

Page 23: PCIC Data Portal 2.0


1. To give you a bit of an idea of the degree to which Pydap was “off-the-shelf”, I ran the command “hg churn” on all of the pydap repositories, which measures the changes in the repository by lines of code. The fractions shown are the churn of PCIC staff divided by the total churn of all committers. You can see that we wrote one handler by ourselves, the hdf and netcdf work is mostly ours, and for the rest of the modules we only had to make minimal changes.

Page 24: PCIC Data Portal 2.0

Big data, big RAM, BadRequest, Oh My!

Page 25: PCIC Data Portal 2.0


1. One of the technical problems that we ran up against was that all of the available OPeNDAP data servers load their responses entirely into RAM before sending them out. So if you want to serve up large data sets, the size of your response is limited by your available RAM divided by the number of concurrent responses that you are prepared to serve. If you try to make a request to, say, a THREDDS OPeNDAP server that’s larger than the JVM’s allocated memory, the user will just get back a BadRequest error.

2. For some applications this may be fine, or even desirable, but for the purposes of serving large data sets, the network pipe is usually the bottleneck. Rather than annoy and frustrate users by forcing them to carve up their data requests to be arbitrarily small, we wanted to allow as large a request as the users were prepared to accept.
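The RAM limit described here is simple division. A back-of-the-envelope sketch, with numbers that are purely illustrative and not measurements from any PCIC server:

```python
def max_response_bytes(available_ram_bytes, concurrent_responses):
    """Largest response a buffer-everything server can afford per request."""
    return available_ram_bytes // concurrent_responses

GIB = 1024 ** 3

# Illustrative numbers only: 16 GiB of RAM shared by 8 simultaneous responses
limit = max_response_bytes(16 * GIB, 8)
limit_in_gib = limit / GIB
```

Under those assumed numbers, no single download could exceed 2 GiB, however patient the user is and however fast the disks are.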

Page 26: PCIC Data Portal 2.0

Generators: 70’s tech that works today!

a function which yields execution rather than returning

yields values one at a time, on-demand

low memory footprint

faster; no calling overhead

elegant!

Page 27: PCIC Data Portal 2.0


1. Enter generators and coroutines. Generators are a programming construct in which a function, rather than returning, can yield execution and return values one at a time, on demand. This has the performance advantage of maintaining a low memory footprint: if you want to return something large, you don’t have to do so all at once. Generators also tend to be slightly faster, because you avoid a lot of the calling overhead of stack manipulation.

2. Generators have been around for a good thirty-five years, but have been experiencing a bit of a renaissance lately. If you program in Python, they are extremely easy to use, and with the advent of big data applications, they have a lot of utility.

Page 28: PCIC Data Portal 2.0

Generator Example

    from itertools import islice

    def fibonacci():
        a, b = 0, 1
        while True:
            yield a
            a, b = b, a + b

    # print the first 10 values of the fibonacci sequence
    for x in islice(fibonacci(), 10):
        print(x)

Page 29: PCIC Data Portal 2.0


1. For those who aren’t familiar, here’s a quick example to understand generators. Generating a Fibonacci sequence is kind of the quintessential toy example. The generator function, fibonacci(), is defined at the top. You’ll notice that it’s an infinite loop, because the sequence is, by definition, infinite. But rather than building up the values in memory, it just has a simple and elegant “yield” statement right inside the loop. The calling loop down below actually pulls items from the function, one at a time, and then does whatever it needs to do with them. It’s fast, efficient, and actually fairly elegant, readable code, too.

2. So you can see that, for something like a web application serving big datasets, this is perfect, because we can provide a very low-latency response and then stream the data to the user as our high-latency operations like disk reads take place.

3. None of the OPeNDAP servers out there supported streaming, so many of the modifications that we made to PyDAP were to make it use generators to stream the responses.
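To show how a generator turns into a streamed web response, here is a minimal sketch of a WSGI application whose body is produced chunk by chunk. It is a generic illustration under assumed names (`read_chunks`, `make_streaming_app`, the 64 KiB chunk size), not a description of the actual PyDAP modifications.

```python
import io

def read_chunks(fileobj, chunk_size=64 * 1024):
    """Yield a file's contents one chunk at a time instead of reading it all."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk

def make_streaming_app(open_data):
    """Build a WSGI app whose response body is a generator of chunks."""
    def app(environ, start_response):
        # Headers go out immediately; the body is pulled lazily by the server.
        start_response("200 OK", [("Content-Type", "application/octet-stream")])
        return read_chunks(open_data())
    return app

# Stand-in for a large file on disk: 200,000 bytes held in memory for the demo
data = b"x" * 200000
app = make_streaming_app(lambda: io.BytesIO(data))
chunks = list(app({}, lambda status, headers: None))
```

At no point does the application hold more than one chunk of the response, so the memory cost per request stays roughly constant regardless of how much data the user asked for.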

Page 30: PCIC Data Portal 2.0

ncWMS

Off-the-shelf

Visualization of NetCDF rasters

Full featured WMS server

Limitations

File-based layer configurations (tedious and error-prone!)

Loads layers serially on startup (slow!)

Scans layers for ranges (really slow!)

Page 31: PCIC Data Portal 2.0


1. We’re using a modified version of ncWMS to provide visualization of the climate rasters. It gives us a lot of stuff for free. It’s a full-featured Web Map Service server that converts NetCDF files into tiled images usable on the web. [demo]

2. Unfortunately, it has a few limitations that make it non-ideal for use with big data. To configure a layer, you have to go through the files one by one, add them to the list and configure 5-10 different attributes. Additionally, whenever you start or restart the server, it goes through every single file, in order, and scans it to determine its ranges so that it can assign a colorbar. This can take many minutes, possibly hours, and it only gets slower the more layers you add.

3. David Bronaugh has done some great work modifying ncWMS to run off of our metadata database, so that it gets its list of layers, all of the variable ranges and everything else from the database. This has made it possible to scale our deployment up from a handful of demo layers to the full scale of data that we want to publish at PCIC.

Page 32: PCIC Data Portal 2.0

Mapnik and Basemaps

Create our own basemaps from OpenStreetMap

Maximum flexibility in domain and projection

Page 33: PCIC Data Portal 2.0


1. A flat image of the climate rasters isn’t that useful, especially if you want to look at details in a particular locality. So thanks to some great work by Basil, we have our own web basemaps based on data from the OpenStreetMap project. We have the ability to generate our own basemaps in any projection that we want and for any domain. And we have control over the tile service, so we can tweak it for maximum performance.

Page 34: PCIC Data Portal 2.0

JavaScript Front-end

2600 lines of JavaScript

Responsible for tying everything together for the web user

Does little to no processing itself / just makes requests to various servers

Page 35: PCIC Data Portal 2.0


1. Finally, the last piece of the software stack is the JavaScript front-end that ties everything else together for the user. This is probably the most finicky and possibly the most complex piece of the code base, even though it doesn’t actually provide any functionality in and of itself. It has to be aware of all of the various services that are provided, and it has to asynchronously make the requests, process them and display things to the user. Often the results of one request affect other things on the page.

2. [Show dataset selection, and how it is a request. Show how dataset selection triggers a layer change and the loading of layer attributes]. If any of these things fails, badness ensues.

Page 36: PCIC Data Portal 2.0

Automated Testing

Page 37: PCIC Data Portal 2.0


1. In our two main repositories, we have about 1500 lines of code specifically for automated testing of the functionality of both the PCDS data portal and the raster portals. This test suite covers a large swath of the code base, but is also compact, so we can run the full test suite in less than 5 seconds. This is fast enough that it can be integrated directly into your development workflow, and you can ensure that any changes you make to the code have not negatively and unintentionally affected any previously programmed functionality.

Page 38: PCIC Data Portal 2.0

Automated Testing

Why?

There’s a lot of code and many code paths. Manual testing is insane, takes days, and isn’t complete.

Provides an “executable specification” for what the software should do

Provides a way to ensure that code changes don’t affect existing functionality (a.k.a. regression testing)

Page 39: PCIC Data Portal 2.0


1. So with a system that provides this much functionality, there are a lot of different code paths through it, any of which could be taken for different user requests. It’s important to test as many of these as possible, every time you make changes in the system. To manually go through all of these (and we did with the release of the PCDS portal a year ago) is painstaking, time-consuming and error-prone. Automating this process pays off very quickly, both in time and in code quality.

2. Additionally, the tests provide a sort of “executable specification”, declaring what the various pieces of the code are supposed to do. If a test fails, your code doesn’t meet the spec.

3. Finally, the test suite provides a baseline against which further development cannot regress. It ensures that future changes will not negatively impact the functionality that we have previously developed.

4. [demo of pytest]
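For anyone who hasn’t seen pytest-style tests, here is a small sketch of what an “executable specification” can look like. The function under test, `clip_time_range`, is a hypothetical helper invented for this example and is not taken from the portal’s code base; the tests use plain `assert` statements so the file runs with or without pytest installed.

```python
def clip_time_range(requested, available):
    """Clip a requested (start, end) index range to what a dataset offers."""
    start = max(requested[0], available[0])
    end = min(requested[1], available[1])
    if start > end:
        raise ValueError("requested range does not overlap available data")
    return (start, end)

# pytest collects functions named test_*; a failing assert fails the test.
def test_range_inside_is_unchanged():
    assert clip_time_range((10, 20), (0, 100)) == (10, 20)

def test_range_is_clipped_to_available():
    assert clip_time_range((-5, 150), (0, 100)) == (0, 100)

def test_disjoint_range_is_rejected():
    try:
        clip_time_range((200, 300), (0, 100))
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Each test name states one piece of the spec, so a failure report reads as a list of the promises the code no longer keeps.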

Page 40: PCIC Data Portal 2.0

Questions

and hopefully answers