Multiplatform Spark solution for Graph datasources by Javier Dominguez
![Page 1: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/1.jpg)
![Page 2: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/2.jpg)
17 NOV 2016 @ BIG DATA SPAIN
@StratioBD
MULTIPLATFORM SOLUTION FOR GRAPH DATASOURCES
Multiplatform Spark solution for Graph datasources, Stratio
Javier Domínguez
![Page 3: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/3.jpg)
Javier Dominguez Montes
CTO SKILLS
PROFILE
JAVIER DOMÍNGUEZ
Studied computer engineering at the ULPGC. He is passionate about Scala, Python and all Big Data technologies,
and is currently part of the Data Science team at Stratio Big Data,
working with ML algorithms and profiling analysis based around Spark.
![Page 4: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/4.jpg)
LET'S HAVE FUN!
![Page 5: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/5.jpg)
INDEX
1 INTRODUCTION
2 MULTIPLATFORM SOLUTION FOR GRAPH DATASOURCES
• Graph use cases
• DataStores
• Machine learning
3 DEMO
• Dataset
• Main process explanation
• Notebooks show off
• Business example
4 THE END
• Results
• What's next?
![Page 6: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/6.jpg)
INTRODUCTION
![Page 7: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/7.jpg)
CUSTOMER DATA WILL GROW OVER 100X
[Chart: data volume over time; time axis 80’S, 2000, 2010, 2015, 2020; labels 20 GB - 100 GB, 500 GB - 2 TB, 4 TB - 8 TB, 100 TB, > 10 PB]
![Page 8: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/8.jpg)
VALUE IS THE DATA VALUE IS UNDERSTANDING THE DATA
![Page 9: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/9.jpg)
DO NOT STAY ON THE SURFACE OF KNOWLEDGE
![Page 10: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/10.jpg)
MULTIPLATFORM SOLUTION FOR GRAPH DATASOURCES
• Graph use cases
• DataStores
• Machine learning
![Page 11: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/11.jpg)
Example of how to exploit a massive database from different stages and through several graph technologies
MACHINE LEARNING LIFE CYCLE WITH BIG DATA
![Page 12: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/12.jpg)
Machine Learning life cycle
Show how a data scientist can take advantage of a Graph Database through different datasources and technologies thanks to our solution.
Use a massive dataset as an example.
Query the datasource from different technologies like:
• GraphX
• GraphFrames
• Neo4j
And finally apply Machine Learning over our information!
BIG DATA SPAIN USE CASE
![Page 13: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/13.jpg)
USE CASES
![Page 14: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/14.jpg)
USE CASES
Making use of a massive graph datasource implies running batch queries over it. We will need to run them with our distributed technologies... The easier the better.
Batch Queries
Motifs filter example
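import org.apache.spark.sql.DataFrame
import org.graphframes._

// GraphFrame is built from two DataFrames (assumed here: usersDf with an "id" column,
// relationshipsDf with "src" and "dst" columns)
val g: GraphFrame = GraphFrame(usersDf, relationshipsDf)

// Search for chains where one person is related to a second person who has an ability in a technology
val motifs: DataFrame = g.find("(person_1)-[relation]->(person_2); (person_2)-[abilities]->(technology)")
motifs.show()

// More complex queries can be expressed by applying filters.
motifs.filter("person_1.name = 'Javier' AND technology.name = 'Neo4j'")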
![Page 15: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/15.jpg)
Most of our clients or teammates will need fast and easy access to the information. We need a way to make easy queries and, of course, a graphical representation of our data!
We would also need microservices, such as REST operations, over our datastore.
Online queries
USE CASES
![Page 16: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/16.jpg)
DATASTORES
![Page 17: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/17.jpg)
Spark
Apache Spark is a fast and general engine for large-scale data processing.
GraphX
Spark API for the management and distributed computation of graphs. It comes with a wide variety of graph algorithms:
• Connected components
• PageRank
• Triangle count
• SVD++
GraphFrames
It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding and highly expressive graph queries.
DATASTORES
![Page 18: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/18.jpg)
Neo4j
Neo4j is a highly scalable native graph database that leverages data relationships as first-class entities.
Big data alone used to be enough, but enterprise leaders need more than just volumes of information to make bottom-line decisions. You need real-time insights into how data is related.
DATASTORES
![Page 19: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/19.jpg)
MACHINE LEARNING
![Page 20: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/20.jpg)
MACHINE LEARNING
It's possible to quickly and automatically produce models that can analyze bigger, more complex data and deliver faster, more accurate results – even on a very large scale. The result? High-value predictions that can guide better decisions and smart actions in real time without human intervention.
Machine learning
SVD
It relates all the existing objects in our dataset and infers possible new behaviors.
![Page 21: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/21.jpg)
DEMO
• Dataset
• Main process explanation
• Notebooks show off
![Page 22: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/22.jpg)
STRATIO INTELLIGENCE
Integration of different Open Source libraries of distributed machine learning algorithms.
Development environment adapted to each data scientist.
Real-time decisions driven by models built with machine learning algorithms
Integrated with all components of the Stratio Big Data Platform
Comprehensive knowledge lifecycle management
![Page 23: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/23.jpg)
DATASET
![Page 24: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/24.jpg)
Freebase aimed to create a global resource that allowed people (and machines) to access common information more effectively.
Its data model is based on converting statements about resources into subject-predicate-object expressions, which are called triples (triplets).
Subject: The resource, what we are describing.
Predicate: Could be a property or a relationship with the object value.
Object value: The property's value or the related subject.
<'Cristiano Ronaldo'> <'Scores in 2014/2015'> 61 .
<'Cristiano Ronaldo'> <'Born in'> 'Portugal' .
Freebase Google
Total triplets: 1.9 Billion
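The subject-predicate-object structure above can be illustrated with a small parser. This is only a minimal sketch for the simplified notation shown on the slide, not for the actual Freebase dump format; `TripleParser`, `Triple` and `parseTriple` are hypothetical names introduced for illustration.

```scala
// Minimal sketch: parse the simplified triple notation shown on the slide.
// TripleParser, Triple and parseTriple are illustrative names, not a library API.
object TripleParser {
  final case class Triple(subject: String, predicate: String, obj: String)

  // Expected shape: <subject> <predicate> object .
  private val TripleLine = """^\s*<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$""".r

  // Strip the single quotes that the slide wraps around names.
  private def unquote(s: String): String = s.trim.stripPrefix("'").stripSuffix("'")

  // Returns None when the line does not follow the expected shape.
  def parseTriple(line: String): Option[Triple] = line match {
    case TripleLine(s, p, o) => Some(Triple(unquote(s), unquote(p), unquote(o)))
    case _                   => None
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "<'Cristiano Ronaldo'> <'Scores in 2014/2015'> 61 .",
      "<'Cristiano Ronaldo'> <'Born in'> 'Portugal' ."
    )
    lines.flatMap(parseTriple).foreach(println)
  }
}
```

In a graph built from such triples, subjects and related subjects would become vertices and predicates would become edges or vertex properties.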
DATASET
![Page 25: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/25.jpg)
PROCESS EXPLANATION
![Page 26: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/26.jpg)
PROCESS EXPLANATION
[Diagram: RDF Dataset, Cast, Transforms, GraphX, GraphFrames (batch query), Extracts sample & transforms, Neo4j (online query)]
![Page 27: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/27.jpg)
[Diagram: Graph, K-core decomposition, Strongly connected graph, Apply algorithms, SVD, Behavior inference, Subject equality]
PROCESS EXPLANATION
![Page 28: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/28.jpg)
A k-core of a graph G is a maximal connected subgraph of G in which all vertices have degree at least k. Equivalently, it is one of the connected components of the subgraph of G formed by repeatedly deleting all vertices of degree less than k.
Objective
Remove all nodes with fewer than K connections. At the end, we want only the most representative and connected elements in our graph. In our use case we used K = 5.
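The pruning described above can be sketched in a few lines of plain Scala. This is an in-memory illustration under simplifying assumptions (undirected edges, integer vertex ids, k of at least 1), not the distributed implementation used in the talk; `KCore` and `kCore` are hypothetical names. It returns the vertices that survive the repeated deletion, that is, the union of all k-cores rather than each connected component separately.

```scala
// Minimal in-memory sketch of k-core pruning (names are illustrative only).
object KCore {
  // edges: undirected pairs of vertex ids; returns the vertices left after pruning
  def kCore(edges: Set[(Int, Int)], k: Int): Set[Int] = {
    @annotation.tailrec
    def prune(remaining: Set[(Int, Int)]): Set[(Int, Int)] = {
      // Degree of every vertex in the remaining subgraph
      val degree: Map[Int, Int] = remaining.toSeq
        .flatMap { case (a, b) => Seq(a, b) }
        .groupBy(identity)
        .map { case (v, occurrences) => v -> occurrences.size }
      // Keep only edges whose two endpoints still have degree >= k
      val kept = remaining.filter { case (a, b) => degree(a) >= k && degree(b) >= k }
      // Stop once a full pass removes nothing
      if (kept.size == remaining.size) remaining else prune(kept)
    }
    // Drop self-loops and store each undirected edge once as (min, max)
    val normalized = edges.collect {
      case (a, b) if a != b => (math.min(a, b), math.max(a, b))
    }
    prune(normalized).flatMap { case (a, b) => Set(a, b) }
  }

  def main(args: Array[String]): Unit = {
    // A triangle (1, 2, 3) with a tail 3 - 4 - 5
    val edges = Set((1, 2), (2, 3), (1, 3), (3, 4), (4, 5))
    println(kCore(edges, 2))
  }
}
```

With k = 2 the tail is peeled off in two passes (first vertex 5, then vertex 4), leaving only the triangle.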
K-Core process
PROCESS EXPLANATION
![Page 29: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/29.jpg)
NOTEBOOKS SHOW OFF
![Page 30: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/30.jpg)
BUSINESS EXAMPLE
![Page 31: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/31.jpg)
Jaccard Graph Clustering
Node clustering based on specific relations, optimized for Big Data environments.
We've developed a straightforward functionality that detects patterns and clusters data in a graph database thanks to daily machine learning processes.
Neo4j
Scala Graph functionalities
Jaccard Indexation
Connected Components
Java
HDFS / Parquet
Spark / GraphX
40B Jaccard distance calculations in the daily process
400K-node graph clustering
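As a rough illustration of the Jaccard step, the sketch below computes the Jaccard index and distance between the relation sets of two nodes and lists the node pairs that are close enough to be linked before clustering. It is a small in-memory example, not the Spark / GraphX job from the talk; `JaccardSketch`, `closePairs`, the sample data and the choice of treating two empty sets as identical are all assumptions made for illustration.

```scala
// Minimal sketch of Jaccard-based pairing (names and data are illustrative only).
object JaccardSketch {
  // Jaccard index: |A ∩ B| / |A ∪ B|; two empty sets are treated as identical (1.0)
  def jaccardIndex[T](a: Set[T], b: Set[T]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 1.0 else (a intersect b).size.toDouble / unionSize
  }

  // Jaccard distance is the complement of the index
  def jaccardDistance[T](a: Set[T], b: Set[T]): Double = 1.0 - jaccardIndex(a, b)

  // All unordered node pairs whose relation sets are at most maxDistance apart
  def closePairs(relations: Map[String, Set[String]], maxDistance: Double): Set[(String, String)] = {
    val nodes = relations.keys.toVector.sorted
    (for {
      i <- nodes.indices
      j <- (i + 1) until nodes.size
      if jaccardDistance(relations(nodes(i)), relations(nodes(j))) <= maxDistance
    } yield (nodes(i), nodes(j))).toSet
  }

  def main(args: Array[String]): Unit = {
    val relations = Map(
      "a" -> Set("x", "y"),
      "b" -> Set("x", "y"),
      "c" -> Set("z")
    )
    println(closePairs(relations, 0.5))
  }
}
```

The resulting pairs can be read as edges of a similarity graph; running connected components over that graph then yields the clusters.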
BANK USE CASE
![Page 32: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/32.jpg)
THE END
• Results
• What's next?
![Page 33: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/33.jpg)
WHAT'S NEXT?
Semantic search engine
Include ElasticSearch to run text searches, as a search engine would.
Apply more Machine Learning algorithms
• Connected components: As we've already done, try to cluster information thanks to their relationships.
• PageRank: Measure the importance of a subject.
• Triangle counting: Check possible triangle relationships inside our dataset to avoid redundancy.
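To make two of these algorithms concrete, here is a small in-memory sketch of connected components and triangle counting on an undirected graph. It only illustrates what the algorithms compute; in the setting of this talk they would be run with the distributed implementations that ship with GraphX. `GraphAlgorithmsSketch` and its function names are hypothetical.

```scala
// Minimal in-memory sketch (names are illustrative, not a library API).
object GraphAlgorithmsSketch {
  type Edge = (Int, Int)

  // Undirected adjacency map; self-loops are ignored
  private def adjacency(edges: Set[Edge]): Map[Int, Set[Int]] =
    edges.toSeq
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> (pairs.map(_._2).toSet - v) }

  // Connected components: groups of vertices reachable from one another
  def connectedComponents(edges: Set[Edge]): Set[Set[Int]] = {
    val adj = adjacency(edges)
    @annotation.tailrec
    def reach(frontier: Set[Int], seen: Set[Int]): Set[Int] =
      if (frontier.isEmpty) seen
      else {
        val next = frontier.flatMap(adj) -- seen
        reach(next, seen ++ next)
      }
    adj.keySet.foldLeft(Set.empty[Set[Int]]) { (components, v) =>
      if (components.exists(_.contains(v))) components
      else components + reach(Set(v), Set(v))
    }
  }

  // Triangle count: number of vertex triples that are pairwise connected
  def triangleCount(edges: Set[Edge]): Int = {
    val adj = adjacency(edges)
    val perVertex = adj.toSeq.map { case (_, neighbours) =>
      neighbours.toSeq.combinations(2).count(pair => adj(pair(0)).contains(pair(1)))
    }
    perVertex.sum / 3 // each triangle is counted once from each of its three corners
  }

  def main(args: Array[String]): Unit = {
    val edges = Set((1, 2), (2, 3), (1, 3), (3, 4), (5, 6))
    println(connectedComponents(edges))
    println(triangleCount(edges))
  }
}
```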
New Graph use cases
• Fraud detection
• Recommendation System
• Profiling
![Page 34: Multiplatform Spark solution for Graph datasources by Javier Dominguez](https://reader031.fdocuments.net/reader031/viewer/2022030317/586fa11c1a28abcc238b6a15/html5/thumbnails/34.jpg)
THANK YOU
UNITED STATES
Tel: (+1) 408 5998830
EUROPE
Tel: (+34) 91 828 64 73
www.stratio.com
@StratioBD