eScience in the Netherlands - Dagstuhlmaterials.dagstuhl.de/files/16/16252/16252.Robvan... · ·...
Transcript of eScience in the Netherlands - Dagstuhlmaterials.dagstuhl.de/files/16/16252/16252.Robvan... · ·...
Career paths
ResearchTechnicalManagerial
Scale…
13
12
11
10 eScience research engineer
eScience research engineer
eScience
coordinator
eScience
specialist
eScience
researcher
eScience
manager
eScience top
researcher
eScience top
specialist
Lessons learnt• Demand-driven: start from the science
• Collaboration, not competition (connected projects, calls)
• Good is good enough
• Generalization (10% ring-fenced)
• Communication communication communication
– Hiring people
– Internal communication
– Project kickoffs
• IP, work place, generalization, co-authorships
– coordinators
– Web sites, demo’s
• Keep challenging the RSEs– Courses, hackathons, sprints, …
– Switch disciplines
eStepThe eScience technology platform
A coherent set of technologies to tackle the grand challenges in eScience
Cross-cutting basic skills
• Code quality and best practices
• Integration of software
• Scaling of software
• Analytics and statistics
• Visualization
NLeSC eScience competences applied in research
1. Optimized data handlingData integration, data base optimization, structured & unstructured data, real time data
2. Big data analyticsStatistics, machine learning, visualization, text mining
3. Efficient computingDistributed & acceleratedcomputing, efficient algorithms
Optimized data handling
Big Data analytics
Efficient computing
Distributed computing
eStep
Accelerated computing
Low power computing
Orchestrated computing
High-performance computing
Natural Language processing
Machine learning
Information visualization
Scientific visualization
Information retrieval
Computer vision
Handling sensor data
Linked data
Information integration
Databases
Data assimilation
• Key expertises
are used in many
projects
• Projects often
use quite a
number of
different
competences and
technologies
eStep Goals
• Prevent fragmentation and duplication
• Promote the exchange and re-use of best practices
• Represent NLeSC’s expertise and knowledge base
• Improve the eScience state of the art with a fundamental eScience research line
eStep
project-specific software
discipline-specific software
enhanced science
overarching software
e-infrastructure
generic libraries, tools, and algorithms
NLeS
C
pro
jects
• Main criteria for integrating
technology in eStep:
– State-of-the-art / best-of-breed?
– Generic and overarching?
– Match with our expertise areas?
– Includes externally developed
software
Open platform!
Our sustainability approach• Prevent duplication, fragmentation
• Build something that is worth sustaining!– Sufficiently generic
– Modular
– High quality
– Must be taken into account from the start
• Enforce software engineering guidelines and best practices
• Educate partners with software carpentry and data carpentry
• Open source / open access, open standards, unless…
• Community coding
• Standardization for software and data formats
• eStep is an open platform
GitHub
Travis CITest and Deploy with Confidence.
Easily sync your GitHub projects with Travis CI
and you’ll be testing your code in minutes!
We run a Jenkins CI instance locally.
Used for private repositories and
repositories requiring HPC middleware.
A Common Workflow @ NLeSC
Open platform for building, shipping
and running distributed applications.
deploy
software
eScience software
• technology.esciencecenter.nl
• Non-technical, targets
general audience
• estep.esciencecenter.nl
• All eScience software
and knowledge you
need, in one place
• Technical, targets
developers, PIs
Knowledge base
• knowledge.esciencecenter.nl
• training and education
• best practices
• tutorials
• white papers
• training resources
• Software development
Checklist available
More info on eStep
technology.esciencecenter.nl
estep.esciencecenter.nl
Cities poorly protected against heat
Three persons in elderly house died due to heat
End of 16 day heat wave
Heat protection plan abandoned
Elderly use heat info call desk massively
Summer in the city example: human thermal comfort
Courtesy Bert Holtslag
Summer in the city example
• Kilometer scale: elevation (AHN2) and land-use data (Kadaster), imagery
for assessing the green vegetation coverage and the soil moisture
content
• Street scale: sky view factor, the building height to street width ratio, the
reflectivity and thermal characteristics of buildings and streets, the
abundance of vegetation
• Network of weather stations and crowd sourcing: wunderground.com
• Special measuring campaigns
• Social media?
• Combine with fine-grained models
Novel hourly forecasting
system for human thermal
comfort in urban areas on
street level
Courtesy Bert Holtslag
Optimized Data Handling Technology
• Distributed sensor networks, multi-model and multi-scale
simulations, data assimilation, data integration, multi-scale
pattern recognition, geographic information systems,
databases, …
• Xenon, NetCDF, HDF5, ROOT, XNAT, OpenDA, Hadoop,
MapReduce, Oracle, MySQL, Postgres, MonetDB,
ElasticSearch, DataVault, JSON, Spark, …
Accelerometer and Behaviour
standingsitting floating
Sta t i c acce le ra t ion
X-flappingglidingflapping flight
Dynamic acce le ra t ion
Heave vertical
Surge forward
Sway sideward
Courtesy Willem Bouten
Embodied Emotion Project• Mapping bodily expression of emotions
– To be downhearted
– Clenching fists
– My heart fills with joy
– My blood is boiling
• Test case: Dutch theatre texts 1600-1830
– Shift in experienced emotions
– Shift in embodiment of emotions
• Approach
– Establishing corpus & standardizing text
– Establishing emotional and bodily vocabularies
– Emotion mining
– Visualizing resultsSource: Nummenmaa et al., 2013
eMetabolomics example
• Use reaction rules to identify compounds in
Mass-spectrometry datasets
• Online at http://www.emetabolomics.org/
• Public and private version, private allows
bigger/longer calculations
• Lars Ridder, Laboratory of Biochemistry,
Wageningen University
Big Data Analytics Technology
• Natural language processing, machine learning, information and
scientific visualization, information retrieval, computer vision
• Matlab, R, NumPy, SciPy, scikit learn, Pandas, Weka, Xtas,
Twiqs.nl, D3, ExtJS, Cesium, Leaflet, OpenLayers, GeoExt,
X3Dom, X3DomExt, Mapnik, CommonSense, …
Efficient Computing
A r i t h m e t i c I n t e n s i t y
O( N )O( log(N) )
O( 1 )
SpMV, BLAS1,2
Stencils (PDEs)
Lattice Methods
FFTs
Dense Linear Algebra
(BLAS3)
Particle Methods
• Smart algorithms can improve performance dramatically
• Power consumption is becoming the bottleneck
• Legacy codes are inefficient on modern architectures
– Need completely different optimizations, algorithms
Efficient Computing Example
• Radio Frequency Interference (RFI) is a huge
problem for many radio astronomy observations
• Caused by
– Lightning, Vehicles, airplanes, satellites, electrical
equipment, GSM, FM Radio, fences, reflection of
wind turbines, …
• Best removed offline
– Complete dataset available
– Good overview / statistics / model
– Can spend compute cycles
• Partner: Astron
Real-time RFI mitigation
• Some pipelines need to run in real time today
– Image-based transient detection (LOFAR/AARTFAAC)
– Pulsar searching (WSRT/Apertif)
• SKA will be entirely real-time
– Data rates simply too high to store
• Novel algorithms with linear computational complexity– Only very little loss in quality
RFI mitigation on accelerators
• Accelerator-based computing
– GPUs, Xeon Phi, …
– Astronomy, ocean modeling, digital
forensics, radar systems, high-
energy physics
• Auto-tuning & runtime compilation
– Generate many codes at run-time,
select most efficient
Pulsar B1919+21 in the Vulpecula nebula.
Pulse profile created with folding and the LOFAR software telescope.Background picture courtesy European Southern Observatory.
0
10
20
30
40
50
60
70
80
Xeon Phi NVIDIA GTXTitan GPU
AMD HD7990GPU
0
2
4
6
8
10
Xeon Phi NVIDIA GTXTitan GPU
AMD HD7990GPU
Performance compared to CPU Power usage compared to CPU
Performance profile of Ocean simulation
POP has a very large codebase
written in Fortran 90
Callgraph obtained using gprof
3 kernels GPU optimized, 20%
improvement
Henk Dijkstra, Ben van
Werkhoven, et al.
Efficient Computing: Distributed
• Water management and climate modeling
• Simulations in astrophysics (AMUSE)
• Digital forensics (NFI)
• Astronomy (LOFAR)
• Text mining (xtas)
• Computational chemistry (Noodles)
• High-energy physics (ROOT & pandas)
Novel domain-specific algorithm for work distribution
Towards 2 km
resolution.
Better than
space filling
curves for
distributed
runs (36%
improvement).
Courtesy
Jason
Maassen
Efficient Computing Technology (2)
• Distributed computing, accelerators (GPUs), low-
power, orchestrated computing, HPC
• Ibis, Aether, Xenon, Magnesium, Osmium,
SmartSockets, MPI, UDT, Galaxy, Knime, Sockets,
Cuda, OpenCL, OpenMP, Thrift, Protobuffers,
ZeroMQ, SSH, GridFTP, Globus, SLURM, PBS,
GridEngine, P2P, OpenFlow, bandwidth on demand,
Docker, Celery, …
Several NLeSC applications require access to
distributed compute and storage resources.
Xenon provides a simple API for this, allowing
rapid development of such applications
Middleware independent: portable & reusable
Xenon: current users• Osmium: webservice on top Xenon
– Magma: eMetabolomics mass spectrometry analysis
– SIMCity: decision support for urban social economic complexity
• eSalsa: distributed climate simulation deployment
• Amuse: multi-model distributed astrophysics
• Via Appia: Pointclouds in archeology
• Vbrowser: distributed file management
• Biomarker: medical data upload tool
• Noodles: workflow tool in Python (Computational chemistry)
Xtas: distributed text analysis • Natural language processing and text mining
– Named entity recognition, sentiment detection, document clustering, topic modeling
• Use as web service or Python library
• Integrates with Elasticsearch for document storage and retrieval
• Developed by Intelligent System Lab Amsterdam (ISLA, UvA) and NLeSC
• Tight integration of external and internal software
• Searching Public Discourse: text analytics pipeline for historical research
• Texcavator: supporting large-scale text mining in the field of digital humanities
• Beyond the book: how international is a work of fiction?
• Semanticizer: entity linking
More info on eStep
technology.esciencecenter.nl
estep.esciencecenter.nl
NLeSC TrainingEssential Skills in Data-Intensive Research: Enabling your Research
25-29 January 2016, SURF Academy, Utrecht (for Life Sciences)
NLeSC, SURFsara, SURFnet and DTL
5 day workshop, with both hands-on and taught components for 1st year PhD students without programming background
Days 1-2 = Dealing with Data (Data Carpentry)Days 3-4 = Software and Programming (Software Carpentry)Day 5 = Introducing the e-Infrastructure
Future courses requested with other domain focus• Humanities & Social Sciences• Environment & Sustainability• Life Sciences• Physics & Beyond
Based on SoftwareCarpenty.org model(NLeSC & SURFsara have trained course leaders)
Dealing with DataData Stewardship and FAIR best practiceFrom Excel to databasesSQL practicalR practical
Computation & AutomationUsing the ShellIntroduction to programming (Python)Github and version control.Unit testingDebuggingDocumentation
Introduction to the e-InfrastructureSURF to introduce the national e-infrastructureReal world examples of e-infrastructure enhanced research from NLeSC
• Make researchers more productive by teaching them basic lab skills for scientific computing
• All lessons are freely available
• Workshops, teacher trainings
• Example lessons
– Version Control and Unit Testing for Scientific Software
– Shell, Git, Scientific Python
– Testing and Continuous Integration with Python
– From Excel to a Database
– Data Management in the Ocean, Weather and Climate Sciences
– Visualizing Your Data on the Web Using D3
– Working With Data on the Web
– Intermediate/Advanced R Lessons
– Programming with GAP
• Develop and teach workshops on the fundamental data skills for research in all domains
• Covering the full lifecycle of data-driven research
• Introductory computational skills for data management and analysis
• Domain-specific lessons, from life and physical sciences to social sciences
• Build on existing knowledge, enabling quick application of new skills to own research
• Examples:
– Ecology
• Data Organization in Spreadsheets, Data Cleaning with OpenRefine, Data Management with SQL, Data
Analysis and Visualization in R, Data Analysis and Visualization in Python
– Genomics
• Introduction to cloud computing for genomics, Introduction to the command line, Data wrangling and
processing, Data analysis in R, Data visualization in R
– Social sciences
• Social sciences text mining
– Biology
– Geospatial data
Coding Style• Nicholas C. Zakas: Why Coding Style Matters• http://coding.smashingmagazine.com/2012/10/25/why-coding-style-matters
• Use is mandatory
• We provide editor configuration
• http://editorconfig.org/
EditorConfig
Conventions & Guidelines• Web development
– General frontend guidelines: https://github.com/bendc/frontend-guidelines
– AngularJS: https://github.com/johnpapa/angular-styleguide
– Airbnb JavaScript Style Guide: https://github.com/airbnb/javascript
• Python
– PEP8: https://www.python.org/dev/peps/pep-0008/
• Java
– Code Conventions for the JavaTM Programming Language (Oracle)
• Google Style Guides: https://github.com/google/styleguide
• Wikipedia: https://en.wikipedia.org/wiki/Coding_conventions
Quality Improvement Tools
• SonarQube: http://www.sonarqube.org
• Code climate: https://codeclimate.com
• Codacy: https://www.codacy.com
• Scrutinizer: https://scrutinizer-ci.com
• Landscape: https://landscape.io
• Coveralls: https://coveralls.io
• See also
– https://github.com/ripienaar/free-for-dev#code-quality
– http://shields.io/
Article about good development practices:
The Joel Test: 12 Steps to Better Code.
Unit & Integration Testing• Guide: Writing Testable Code
• 'Unit Testing Best Practices' and other presentations on http://artofunittesting.com/.
• Continuous integration testing with Travis-CI and Jenkins-CI
• We require at least 70% code coverage
• Java: junit
• Javascript
– Jasmine, a behavior-driven development framework for testing JavaScript code.
– Karma, Runs tests in web browser with code coverage.
– PhantomJS, headless web browser on CI-servers.
• Python
– Unittest, nose and pytest.
• R
– testthat
• Web development
– To interact with web-browsers use Selenium.
– Sauce Labs hosts a matrix of web-browsers and Operating Systems for testing.
– AngularJS applications can be tested with Protractor.
Documentation• Document at multiple levels
– Source code comments
– API documentation
– Installation and usage documentation
• Comments at each level should take into account the different target audiences
• Use Markdown, a readable lightweight markup language that can be converted
to many formats
Version Control• Git and GitHub
• A successful and simple Git branching model:
GitHub Flow• https://guides.github.com/introduction/flow/
• Commit messages are formatted and
formulated in a readable way
• http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html
• http://who-t.blogspot.nl/2009/12/on-commit-messages.html
Releases and packaging
• Tag versions, use github releases
• Semantic versioning
• Keep changelogs
• Packaging is important
– Use packaging that is well known and appropriate for user
community: pypi, npm, maven, docker
• Make your code and data citable: get a DOI (Zenodo)
Support levels• S0: generic software or hibernating software that is currently not used in the
NLeSC project portfolio– Fortran, Python, vBrowser, TwiNL, XNAT, …
– No support, not disseminated
• S1: software where NLeSC maintains expertise on, and that is used in projects,
as well as external software that NLeSC extends and improves– Potree, OpenDA, ElasticSearch
– Support for project partners only
– Contribute improvements back to community
• S2: software developed in-house, where NLeSC is the specialist– Xenon, Magnesium, Osmium, xtas, esiBayes, Aether, …
– Full support for project partners, limited support for Dutch scientific community, best effort for
international community
IP & Software Licenses• NLeSC does not develop IP portfolio
• Ownership of research results is shared property of partners,
including NLeSC
• IP protection is possible
• Software is open source / open access unless agreed otherwise
• Default software license: Apache 2.0
• Deviations possible, discuss with NLeSC MT