Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data...

35
Big Data Europe : Empowering Communities with Data Technologies Dr Hajira Jabeen, University of Bonn Apache Big Data Europe - Seville

Transcript of Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data...

Page 1: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Big Data Europe : Empowering Communities with Data Technologies

Dr Hajira Jabeen, University of Bonn

Apache Big Data Europe - Seville

Page 2: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Structure◎Evolution of BDE Architecture◎BDE Stack◎Beyond the State of the Art

o Workflows (Support Layer)o User Interfaceo Data Lake (Ontario)o Semantics Analytics Stack

2

Page 3: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Architectural design 1

Page 4: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Architectural design 24

Page 5: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Architectural design 3 (released)

5

Page 6: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Architecture for SC 7 6

Stack

Page 7: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Supported ComponentsSearch/indexing Data processing

Apache Solr Apache Spark

Data acquisition Apache Flink

Apache Flume Semantic Components

Message passing Strabon

Apache Kafka Sextant

Data storage GeoTriples

Hue Silk

Apache Cassandra SEMAGROW

ScyllaDB LIMES

Apache Hive 4Store

Postgis OpenLink Virtuoso

7

Page 8: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

User profiles8

Page 9: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Platform installation◎Manual installation guide◎Using Docker Machine

o On local machine (VirtualBox)o In cloud (AWS, DigitalOcean, Azure)o Bare metal

◎Screencasts

9

Page 10: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Developing a component◎Base Docker images

o Serve as a template for a (Big Data) technology

o Easily extendable custom algorithm/data◎Published components

o Responsibilities divided b/w partnerso Image repositories on GitHubo Automated builds on DockerHubo Documentation on BDE Wiki

10

Page 11: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Deploying a Big Data Stack

◎Stack collection of communicating components to solve a specific problem◎Described in Docker Compose

o Component configurationo Application topology

◎Orchestrator required for initialization processo Components may depend on each othero Components may require manual

intervention

11

Page 12: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

User Interfaces◎Target: Facilitate use of the platform

o User Interface Adaption ◎Available interfaces

o Workflow UIs❖Workflow Builder❖Workflow Monitor

o Swarm UIo Integrator UI

12

Page 13: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

BDE Workflow Builder13

Page 14: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

BDE Workflow Monitor14

Page 15: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Swarm UI15

Page 16: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Integrator UI16

Page 17: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Beyond the State of the Art

17

Page 18: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Smart Big Data

Increase Big Data value by adding meaning to it!

18

Page 19: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Orchestration ◎ Goal: flexible composition of Big Data

pipelines◎ Microservice cooperation using shared

vocabulary ◎ Uses HTTP requests from a web based

frontend◎ Meta information stored in a triple store◎ Microservice uses triple store as service

backend

19

Page 20: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Semantic Data Lake (Ontario)

◎Repository of data in its raw formato Structured, semi-structured, unstructured

◎Schema-lesso No schema is defined on write, it is

defined only on read◎Open to any kind of processing

20

Page 21: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Data Lake 21

Source: Big Data @ Microsoft - Raghu Ramakrishnan

Page 22: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Semantic Data Lake (Ontario)

◎Add a Semantic layer on top of the source datasetso Semantic data is handled as-iso Non-Semantic data is semantically lifted

using existing ontology terms

22

Page 23: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Ontario: Architecture23

Translate and execute Query via Source-specific Access Method

Decompose to Source-specific Entities

Decompose SPARQL Query

Page 24: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

Semantic Analytics Stack (SANSA)

24

Page 25: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

SANSA: Motivation◎ Abundant machine readable structured

information is available (e.g. in RDF)o Across SCs, e.g. Life Science Data

(OpenPhacts)o General: DBpedia, Google knowledge

grapho Social graphs: Facebook, Twitter

◎ Need for scalable querying, inference and machine learningo Link predictiono Knowledge base completiono Predictive analytics

25

Page 26: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview
Page 27: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

SANSA Stack27

Page 28: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

SANSA: Read Write Layer◎ Ingest RDF and OWL data in different formats using Jena / OWL API style interfaces

◎Represent data in multiple formats (e.g. RDD, Data Frames, GraphX, Tensors)

◎Allow transformation among these formats

◎Compute dataset statistics and apply functions to URIs, literals, subjects, objects → Distributed LODStats

28

Page 29: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

SANSA: Query Layer◎To make generic queries efficient and fast using:o Intelligent indexing o Splitting strategieso Distributed Storage o Distributed/ Federated Querying

◎Early work in progress: query evaluation (SPARQL-to-SQL approaches, Virtual Views)

◎Provision of W3C SPARQL compliant endpoint

29

Page 30: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

SANSA: Inference Layer◎W3C Standards for Modelling: RDFS and OWL

◎Parallel in-memory inference via rule-based forward chaining

◎Beyond state of the art: dynamically build a rule dependency graph for a rule set

◎→ Adjustable performance levels

30

Page 31: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

SANSA: ML Layer◎Distributed Machine Learning (ML) algorithms that work on RDF data and make use of its structure / semantics

◎Work in Progress:o Tensor Factorization for e.g. KB completion

(testing stage)o Simple spatiotemporal analytics (idea stage)o Graph Clustering (testing stage)o Association rule mining (evaluation stage)o Semantic Decision trees (idea stage)

31

Page 33: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

33

Page 34: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

BDE vs Hadoop distributions

Hortonworks Cloudera MapR Bigtop BDE

File System HDFS HDFS NFS HDFS HDFSInstallation Native Native Native Native lightweight

virtualizationPlug & play components (no rigid schema)

no no no no yes

High Availability Single failure recovery (yarn)

Single failure recovery (yarn)

Self healing, mult. failure rec.

Single failure recovery (yarn)

Multiple Failure recovery

Cost Commercial Commercial Commercial Free Free

Scaling Freemium Freemium Freemium Free FreeAddition of custom components

Not easy No No No Yes

Integration testing yes yes yes yes --Operating systems Linux Linux Linux Linux AllManagement tool Ambari Cloudera

managerMapR Control system

- Docker swarm UI+ Custom

34

Page 35: Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

BDE vs Hadoop distributions

◎BDE is not built on top of existing distributions

◎Targets o Communitieso Research institutions

◎Bridges scientists and open data◎Multi Tier research efforts towards Smart Data

35