Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data...
-
Upload
bigdataeurope -
Category
Technology
-
view
340 -
download
0
Transcript of Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data...
Big Data Europe : Empowering Communities with Data Technologies
Dr Hajira Jabeen, University of Bonn
Apache Big Data Europe - Seville
Structure◎Evolution of BDE Architecture◎BDE Stack◎Beyond the State of the Art
o Workflows (Support Layer)o User Interfaceo Data Lake (Ontario)o Semantics Analytics Stack
2
Architectural design 1
Architectural design 24
Architectural design 3 (released)
5
Architecture for SC 7 6
Stack
Supported ComponentsSearch/indexing Data processing
Apache Solr Apache Spark
Data acquisition Apache Flink
Apache Flume Semantic Components
Message passing Strabon
Apache Kafka Sextant
Data storage GeoTriples
Hue Silk
Apache Cassandra SEMAGROW
ScyllaDB LIMES
Apache Hive 4Store
Postgis OpenLink Virtuoso
7
User profiles8
Platform installation◎Manual installation guide◎Using Docker Machine
o On local machine (VirtualBox)o In cloud (AWS, DigitalOcean, Azure)o Bare metal
◎Screencasts
9
Developing a component◎Base Docker images
o Serve as a template for a (Big Data) technology
o Easily extendable custom algorithm/data◎Published components
o Responsibilities divided b/w partnerso Image repositories on GitHubo Automated builds on DockerHubo Documentation on BDE Wiki
10
Deploying a Big Data Stack
◎Stack collection of communicating components to solve a specific problem◎Described in Docker Compose
o Component configurationo Application topology
◎Orchestrator required for initialization processo Components may depend on each othero Components may require manual
intervention
11
User Interfaces◎Target: Facilitate use of the platform
o User Interface Adaption ◎Available interfaces
o Workflow UIs❖Workflow Builder❖Workflow Monitor
o Swarm UIo Integrator UI
12
BDE Workflow Builder13
BDE Workflow Monitor14
Swarm UI15
Integrator UI16
Beyond the State of the Art
17
Smart Big Data
Increase Big Data value by adding meaning to it!
18
Orchestration ◎ Goal: flexible composition of Big Data
pipelines◎ Microservice cooperation using shared
vocabulary ◎ Uses HTTP requests from a web based
frontend◎ Meta information stored in a triple store◎ Microservice uses triple store as service
backend
19
Semantic Data Lake (Ontario)
◎Repository of data in its raw formato Structured, semi-structured, unstructured
◎Schema-lesso No schema is defined on write, it is
defined only on read◎Open to any kind of processing
20
Data Lake 21
Source: Big Data @ Microsoft - Raghu Ramakrishnan
Semantic Data Lake (Ontario)
◎Add a Semantic layer on top of the source datasetso Semantic data is handled as-iso Non-Semantic data is semantically lifted
using existing ontology terms
22
Ontario: Architecture23
Translate and execute Query via Source-specific Access Method
Decompose to Source-specific Entities
Decompose SPARQL Query
Semantic Analytics Stack (SANSA)
24
SANSA: Motivation◎ Abundant machine readable structured
information is available (e.g. in RDF)o Across SCs, e.g. Life Science Data
(OpenPhacts)o General: DBpedia, Google knowledge
grapho Social graphs: Facebook, Twitter
◎ Need for scalable querying, inference and machine learningo Link predictiono Knowledge base completiono Predictive analytics
25
SANSA Stack27
SANSA: Read Write Layer◎ Ingest RDF and OWL data in different formats using Jena / OWL API style interfaces
◎Represent data in multiple formats (e.g. RDD, Data Frames, GraphX, Tensors)
◎Allow transformation among these formats
◎Compute dataset statistics and apply functions to URIs, literals, subjects, objects → Distributed LODStats
28
SANSA: Query Layer◎To make generic queries efficient and fast using:o Intelligent indexing o Splitting strategieso Distributed Storage o Distributed/ Federated Querying
◎Early work in progress: query evaluation (SPARQL-to-SQL approaches, Virtual Views)
◎Provision of W3C SPARQL compliant endpoint
29
SANSA: Inference Layer◎W3C Standards for Modelling: RDFS and OWL
◎Parallel in-memory inference via rule-based forward chaining
◎Beyond state of the art: dynamically build a rule dependency graph for a rule set
◎→ Adjustable performance levels
30
SANSA: ML Layer◎Distributed Machine Learning (ML) algorithms that work on RDF data and make use of its structure / semantics
◎Work in Progress:o Tensor Factorization for e.g. KB completion
(testing stage)o Simple spatiotemporal analytics (idea stage)o Graph Clustering (testing stage)o Association rule mining (evaluation stage)o Semantic Decision trees (idea stage)
31
33
BDE vs Hadoop distributions
Hortonworks Cloudera MapR Bigtop BDE
File System HDFS HDFS NFS HDFS HDFSInstallation Native Native Native Native lightweight
virtualizationPlug & play components (no rigid schema)
no no no no yes
High Availability Single failure recovery (yarn)
Single failure recovery (yarn)
Self healing, mult. failure rec.
Single failure recovery (yarn)
Multiple Failure recovery
Cost Commercial Commercial Commercial Free Free
Scaling Freemium Freemium Freemium Free FreeAddition of custom components
Not easy No No No Yes
Integration testing yes yes yes yes --Operating systems Linux Linux Linux Linux AllManagement tool Ambari Cloudera
managerMapR Control system
- Docker swarm UI+ Custom
34
BDE vs Hadoop distributions
◎BDE is not built on top of existing distributions
◎Targets o Communitieso Research institutions
◎Bridges scientists and open data◎Multi Tier research efforts towards Smart Data
35