The Big Data Platform Initiative of the EC Joint Research Centre
European Commission, Joint Research Centre
Directorate I Competences, Unit I.3 Text and Data Mining EO&SS@BigData Project
Joint Research Centre (JRC)
Data analytics workshop for official statistics (daWos)
Amsterdam, 10/09/2018
URL: https://cidportal.jrc.ec.europa.eu Contact: [email protected]
Outline
• Project background
• JEODPP platform concept
• Data holdings
• Services
• Outreach
• Project evolution
Project background
• Explosion of digital data sources led to the big data paradigm (Volume, Velocity, and Variety of data streams).
• Earth Observation (EO) is entering the big data era thanks to the Copernicus Sentinel satellites (full, free, and open data).
• JRC task force recommended in late 2014 to start a big data pilot project on EO and Social Sensing.
• Initial state: fragmented approach hampering collaborative working and knowledge sharing.
• Project start: January 2015.
Policy context
• REGULATION (EU) No 377/2014 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL 3/4/14 establishing the Copernicus Programme and repealing Regulation (EU) No 911/2010. [JRC also mentioned in proposed new space programme regulation to enter into force by 1.1.2021]
• Communication of the Commission on Data, information and knowledge management at the European Commission (COM(2016)6626-final)
• Communication from the Commission on the European Cloud Initiative (COM(2016) 178 final): The Commission and participating Member States should develop and deploy a large-scale European HPC, data and network infrastructure, including the establishment of a European Big Data centre, e.g. hosted by JRC for multidisciplinary data but focused on INSPIRE/GEOSS/Copernicus spatial data [COM(2016) 178 final].
• Communication from the Commission on Artificial Intelligence for Europe (COM(2018) 237 final).
Project milestones
• 2015: survey of user needs and proposal of solutions addressing their needs; endorsement of the concept of JRC Earth Observation Data and Processing Platform (JEODPP)
• 2016: procurement of hardware and first batch processing service with massive runs
• 2017: release of interactive visualisation/analysis and deployment of remote desktop services
• 2018: multi-petabyte extension, development of machine learning capabilities, JIPlib release, continuously expanding user base
[Figure labels: Data (Volume, Velocity, Variety); Big data; Big geospatial data for policy; Policy relevant information; Indicators; Decisions; atmosphere, marine, land, climate, emergency, security]
Exploit data volume, velocity, and variety to generate policy-relevant information
• Using FAIR data principles (findable, accessible, interoperable, reusable)
• With data mining competence in a shared and collaborative environment
• Relying on reproducible workflows
directives, legislations, communications, …
Earth Observation, in situ, crowd sourcing, social sensing, text data, web scraping, …
JRC Big Data Platform: Conceptual representation
Infrastructure
Based on commodity hardware and open-source software stack:
• Storage
• CERN EOS distributed file system
• Currently 5 PiB net capacity
• 2 more PiB net for development/testing
• Processing servers (batch processing)
• 1,400 cores over 35 nodes
• 3 GPU servers
• extensions including further GPU servers in late 2018
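The figures above can be put in perspective with a little arithmetic. The sketch below converts the stated net capacities to bytes and derives an average core count per node; it assumes the cores are evenly distributed over the nodes, which the slide does not state.

```python
PIB = 2**50  # one pebibyte in bytes

net_capacity_pib = 5      # production storage, from the slide
dev_capacity_pib = 2      # development/testing storage, from the slide
cores, nodes = 1400, 35   # batch processing servers, from the slide

total_bytes = (net_capacity_pib + dev_capacity_pib) * PIB

# Assumption: cores are spread evenly over the nodes.
cores_per_node = cores // nodes

print(f"total net storage: {total_bytes:,} bytes")
print(f"average cores per node: {cores_per_node}")
```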
JEODPP in
As of September 2017
As of September 2018
Main software stack
Source: Soille et al., Future Generation Computer Systems, 2017, DOI: 10.1016/j.future.2017.11.007 (in press)
JEODPP access modes
• EOS CIFS mount from desktop client (read-only)
• Netapp CIFS mount (read/write) for data transfer
• Terminal service (remote desktop) https://cidportal.jrc.ec.europa.eu/apps/terminal/
• Document & data sharing based on NextCloud https://cidportal.jrc.ec.europa.eu/apps/cloud/ (planned federation with JRCBox)
• FTPS for file transfer to EOS
• JHub https://cidportal.jrc.ec.europa.eu/jhub/ for
• interactive visualisation and analysis
• tailored Docker containers for development
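As an illustration of the FTPS access mode listed above, here is a minimal sketch using Python's standard `ftplib`. The host name, credentials, and file names are hypothetical placeholders, not actual JEODPP endpoints, and the platform's real transfer procedure may differ.

```python
from ftplib import FTP_TLS


def open_session(host, user, password):
    """Open an FTPS session with an encrypted data channel."""
    ftps = FTP_TLS(host)
    ftps.login(user, password)
    ftps.prot_p()  # switch the data connection to TLS as well
    return ftps


def upload_file(ftps, local_path, remote_name):
    """Upload one local file in binary mode over an open FTPS session."""
    with open(local_path, "rb") as handle:
        ftps.storbinary(f"STOR {remote_name}", handle)


# Hypothetical usage (host, account, and file names are placeholders):
# session = open_session("ftps.example.org", "user", "secret")
# upload_file(session, "scene.tif", "scene.tif")
# session.quit()
```

Keeping the session setup separate from the upload makes it possible to reuse one connection for many files.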
JEODPP current space usage
Connecting storage and processing via cloud sharing services
Low-level batch processing
• Running large-scale data processing tasks in a cluster environment
• Docker containers for flexible management of processing environments
• Custom builds for different requirements
• Facilitates upgrades of processing environment (libraries, tools)
• Run through a workload manager
• HTCondor scheduler
• Extensive use for large scale processing/analysis
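To make the combination of Docker containers and the HTCondor scheduler more concrete, the following sketch generates a submit description for HTCondor's docker universe. The image name `jipl-dev:1.0` is taken from the image list in this presentation; the executable `process_tile.sh`, its argument, and the output file names are hypothetical, and the actual JEODPP submit files may look different.

```python
def docker_submit_description(image, executable, arguments, n_jobs=1):
    """Return an HTCondor submit description that runs jobs in a Docker image."""
    lines = [
        "universe = docker",
        f"docker_image = {image}",
        f"executable = {executable}",
        f"arguments = {arguments}",
        # $(Cluster) and $(Process) are expanded by HTCondor per job.
        "output = job_$(Cluster)_$(Process).out",
        "error = job_$(Cluster)_$(Process).err",
        "log = job_$(Cluster).log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: 10 jobs, each receiving its own job index as argument.
description = docker_submit_description(
    "jipl-dev:1.0", "process_tile.sh", "$(Process)", n_jobs=10
)
print(description)
```

The printed text can be saved to a file and passed to `condor_submit`; each queued job then runs inside its own container with a distinct `$(Process)` value.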
JEODPP Batch Processing System
Diverse user environments originating from different:
• libraries
• tools
• software
• versions
• distros: Debian/CentOS
Docker images are built based on user requirements
Container-based cluster management
REPOSITORY                    TAG    INFO                         SIZE
jipl_S1toolbox-dev            2.0    snap 4.0                     6.269 GB
jipl_S1toolbox-dev            1.0.1  snap 2.0.2                   6.282 GB
ghsl_se2cor-dev               1.0    snap 2.0                     4.742 GB
critech_ipython_deltares-dev  1.0    python 2.7                   6.939 GB
marsec_MCR                    1.0    MatLab run time 2015b        3.082 GB
jipl-dev                      1.0                                 3.666 GB
marsec_sumo-dev               1.0    java 1.8                     2.842 GB
canhemon_grass-dev            1.0    debian testing, python 3.0   3.397 GB
cloudmask-download            v0_2   74994254f754, 11 weeks ago   444.8 MB
…                             1.0.1                               3.421 GB
cloudmask-download            1.0.0                               3.421 GB
sentinel-download             1.0                                 3.121 GB
Examples of batch processing scientific workflows on JEODPP
JEODPP batch processing monitoring
JEODPP Terminal Service via Web https://cidportal.jrc.ec.europa.eu/apps/terminal/
• A pool of Docker containers running next to the data
• Linux desktop environment
• Standard software installed
• QGIS, GRASS
• IDL/ENVI, Matlab (personalised licenses)
• R (R, R Commander, RStudio)
• Python, Jupyter-lab, Jupyter-notebook
• Additions on request
• Relies on HTML5 and runs in Firefox, Internet Explorer, and Chrome
• For prototyping, ad hoc analysis/visualisation of products, and launching batch processing
JEODPP users
• 35 use-cases
• From 16 units
• Across 8 directorates
Interactive visualization and analysis with Jupyter
• Web interface to visualize and analyze any kind of data in a single document called a Jupyter notebook
• Jupyter notebooks integrate live code, equations, visualizations, and narrative text.
• Facilitate knowledge sharing, collaborative working, and reproducible workflows.
• Suitable for non-programmers thanks to GUIs based on widgets (buttons, sliders, etc.).
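The idea of a single document mixing narrative text and live code can be illustrated by building a notebook file by hand. The sketch below writes the JSON structure of notebook format version 4 using only the standard library; the cell contents and the helper name `make_notebook` are purely illustrative, and in practice notebooks are created through the Jupyter interface.

```python
import json


def make_notebook(cells):
    """Build a minimal nbformat 4 notebook from (cell_type, source) pairs."""
    nb_cells = []
    for cell_type, source in cells:
        cell = {"cell_type": cell_type, "metadata": {}, "source": source}
        if cell_type == "code":
            # Code cells additionally carry an execution counter and outputs.
            cell["execution_count"] = None
            cell["outputs"] = []
        nb_cells.append(cell)
    return {"cells": nb_cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


# Illustrative content: one narrative cell followed by one code cell.
notebook = make_notebook([
    ("markdown", "# Example analysis\nNarrative text explaining the step below."),
    ("code", "print(1 + 1)"),
])
print(json.dumps(notebook, indent=1))
```

Saving the printed JSON with an `.ipynb` extension yields a file that Jupyter can open, which is what makes notebooks easy to share and version.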
Jupyter ecosystem
http://jupyter.org/
JupyterLab ecosystem (evolution of Jupyter)
ipyleaflet
https://github.com/ellisonbg/ipyleaflet
ipywidgets and bqplot
https://github.com/jupyter-widgets/ipywidgets https://github.com/bloomberg/bqplot
From big data to interactive rendering and analysis
Source: FGCS, 2017, DOI: 10.1016/j.future.2017.11.007
+ in Situ data
Global Human Settlement Layer with Global Surface Water Occurrence on top of Global S1 mosaic
HTML export to facilitate outreach (example with ALOS DEM)
Execution of arbitrary Python code in interactive mode (e.g. for MSPA)
Takeaway messages
• Exponential growth of data and data sources.
• The big data paradigm is permeating all fields.
• FAIR data principles also apply to data analysis.
• Turning data into insights is a challenge; platforms that co-locate data with processing help to address it.
• Jupyter notebooks contribute to reproducible analysis as well as knowledge sharing and collaborative working.
• Importance of interactive analysis and visualisation.
• Open standards including open API are needed to avoid platform lock-in.
Project evolution: Big Data Analytics (2019-2020)
• Innovative approaches (AI/machine learning) for combining large amounts of data originating from different sources
• Enabled by the JRC Big Data Platform (JEODPP)
• Initial focus on geospatial data and their combination with other data sources
• Key enabler of data and knowledge sharing across JRC and towards partners
• Link with DIAS (support to DG GROW and possible partnership with WEkEO DIAS)
• Key role of openEO H2020 project (definition of common API)
Thank you for your attention!
EO&SS@BigData pilot project, Unit I.3 Text and Data Mining, Directorate I Competences
GEO-WEEK, Washington DC, Oct 2017
https://doi.org/10.1016/j.future.2017.11.007 Publication list: https://cidportal.jrc.ec.europa.eu/home/publications