Using Spark for Timeseries Graph Analytics ved

Using Spark for Timeseries Graph Analytics, SIGMOID 12th Meetup, Ved Mulkalwar, 9564211606


Page 1: Using Spark for Timeseries Graph Analytics ved

Using Spark for Timeseries Graph Analytics

SIGMOID 12th Meetup

Ved Mulkalwar

Page 2: Using Spark for Timeseries Graph Analytics ved

Content

1) Introduction to the sample problem statement
2) Which graph database is used and why
3) Installing Titan
4) Titan with Cassandra
5) The Gremlin Cassandra script: a way to store data in Cassandra from Titan Gremlin
6) Accessing Titan with Spark

Page 3: Using Spark for Timeseries Graph Analytics ved

An introduction to the kinds of time-series problems that can be solved using graph analytics, and to the problem statement I have been working on.

Page 4: Using Spark for Timeseries Graph Analytics ved

The dynamics of a complex system are usually recorded in the form of time series. In recent years, the visibility graph algorithm and the horizontal visibility graph algorithm have been introduced as mappings between time series and complex networks. By transforming time series into graphs, these algorithms allow graph-theoretical tools to be applied to characterizing time series, opening the possibility of building fruitful connections between time series analysis, nonlinear dynamics, and graph theory.
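To make the mapping concrete, here is a minimal Python sketch of both algorithms using their standard definitions. It is a brute-force illustration, not code from the talk; the function names and the sample series are assumptions chosen for the example.

```python
from itertools import combinations


def natural_visibility_edges(series):
    """Return the edges (i, j) of the natural visibility graph.

    Points i < j are linked when every point k between them lies strictly
    below the straight line joining (i, series[i]) and (j, series[j]).
    """
    edges = []
    for i, j in combinations(range(len(series)), 2):
        yi, yj = series[i], series[j]
        visible = all(
            series[k] < yi + (yj - yi) * (k - i) / (j - i)
            for k in range(i + 1, j)
        )
        if visible:
            edges.append((i, j))
    return edges


def horizontal_visibility_edges(series):
    """Return the edges (i, j) of the horizontal visibility graph.

    Points i < j are linked when every point between them is strictly
    lower than both endpoints.
    """
    edges = []
    for i, j in combinations(range(len(series)), 2):
        lower_endpoint = min(series[i], series[j])
        if all(series[k] < lower_endpoint for k in range(i + 1, j)):
            edges.append((i, j))
    return edges


# Hypothetical sample series, one value per time step.
series = [1.0, 2.0, 4.0, 1.0, 3.0]
nvg = natural_visibility_edges(series)
hvg = horizontal_visibility_edges(series)
print("natural:   ", nvg)
print("horizontal:", hvg)
```

Each time step becomes a vertex, so the resulting edge list can be loaded into any graph store. The horizontal variant always yields a subset of the natural visibility edges.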

Page 5: Using Spark for Timeseries Graph Analytics ved

The problem statement I have been working on: our initial goal was to find anomalies in a given time series. We started by slicing the data into small, meaningful parts, so that the data becomes smaller and contiguous data points with the same property are grouped together.

Later, we built a graph from these parts and looked for parts with similar properties. Parts that have no similar counterparts stand out as anomalies, which was the goal.
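The idea can be sketched in a few lines of Python. This is an illustrative assumption, not the method used in the project: the "property" here is simply the value bucket a point falls into, two parts count as similar when they share a bucket and have nearly equal lengths, and all function names and thresholds are hypothetical.

```python
def slice_into_segments(series, bucket_width):
    """Group contiguous points that fall into the same value bucket.

    Returns a list of (start_index, length, bucket) tuples.
    """
    segments = []
    start = 0
    for i in range(1, len(series) + 1):
        ended = i == len(series) or (
            series[i] // bucket_width != series[start] // bucket_width
        )
        if ended:
            bucket = int(series[start] // bucket_width)
            segments.append((start, i - start, bucket))
            start = i
    return segments


def similarity_graph(segments, max_length_gap):
    """Link segments that share a bucket and have similar lengths."""
    edges = []
    for a in range(len(segments)):
        for b in range(a + 1, len(segments)):
            _, len_a, bucket_a = segments[a]
            _, len_b, bucket_b = segments[b]
            if bucket_a == bucket_b and abs(len_a - len_b) <= max_length_gap:
                edges.append((a, b))
    return edges


def isolated_segments(segments, edges):
    """Segments with no similar counterpart are the anomaly candidates."""
    linked = {vertex for edge in edges for vertex in edge}
    return [seg for idx, seg in enumerate(segments) if idx not in linked]


# Hypothetical series with one obvious spike at index 9.
series = [1, 2, 1, 11, 12, 11, 2, 1, 2, 55, 12, 11, 13]
segments = slice_into_segments(series, bucket_width=10)
edges = similarity_graph(segments, max_length_gap=1)
anomalies = isolated_segments(segments, edges)
print(segments)
print(anomalies)
```

With this input the spike forms a one-point segment in its own bucket, gets no edges in the similarity graph, and is the only segment reported.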

Page 6: Using Spark for Timeseries Graph Analytics ved

Which graph database I used and why: the differences between GraphX, Neo4j, and Titan, and why we chose Titan

Page 7: Using Spark for Timeseries Graph Analytics ved

Titan vs GraphX

The fundamental difference between Titan and GraphX lies in how they persist data and how they process data. Titan by default persists data (vertices, edges, and properties) to a distributed data store in the form of tables bound to a specific schema. This schema can be stored in Berkeley DB tables, Cassandra tables, or HBase tables.

In the case of Titan the graph is stored as vertices in a vertex table and edges in an edge table.

GraphX has no real persistence layer (yet?). It can persist to HDFS files, but it cannot persist to a distributed datastore in a common schema-like form as Titan does.

A graph only exists in GraphX once it has been loaded into memory from raw data and interpreted as graph RDDs, whereas Titan stores the graph permanently.
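The distinction can be pictured with a small Python sketch. It is a conceptual model only, not Titan's actual storage layout or the GraphX API: the row shapes, field names, and the load_graph helper are all hypothetical. The two lists stand for rows that stay in a persistent store, while the adjacency view exists only after it has been built in memory.

```python
from collections import defaultdict

# Hypothetical persisted rows: one row per vertex and one row per edge.
vertex_table = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
edge_table = [
    {"src": 1, "dst": 2, "label": "linked"},
    {"src": 1, "dst": 3, "label": "linked"},
    {"src": 3, "dst": 2, "label": "linked"},
]


def load_graph(vertex_rows, edge_rows):
    """Build an in-memory adjacency view from raw vertex and edge rows."""
    vertices = {row["id"]: row for row in vertex_rows}
    out_edges = defaultdict(list)
    for row in edge_rows:
        if row["src"] not in vertices or row["dst"] not in vertices:
            raise ValueError(f"edge references unknown vertex: {row}")
        out_edges[row["src"]].append((row["dst"], row["label"]))
    return vertices, dict(out_edges)


# The graph "exists" for processing only after this load step.
vertices, out_edges = load_graph(vertex_table, edge_table)
print(out_edges[1])
```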

The other major difference between the two solutions is how they process graph data.

By default, GraphX solves queries via distributed processing on many nodes in parallel where possible, whereas Titan processes pipelines on a single node. Titan can also take advantage of parallel processing via Faunus/HDFS if necessary.

Page 8: Using Spark for Timeseries Graph Analytics ved

Neo4j vs Titan

The primary difference between Titan and Neo4j is scalability: Titan can distribute the graph across multiple machines (using either Cassandra or HBase as the storage backend), which does three things:

1) It allows Titan to store really, really large graphs by scaling out.
2) It allows Titan to handle massive write loads.
3) It removes the single point of failure, giving very high availability. While Neo4j's HA setup gives you redundancy through replication, the death of the master in such a setup leads to a temporary service interruption while a new master is elected.

This allows Titan to scale to large deployments with thousands of concurrent users as demonstrated in the benchmark:

http://thinkaurelius.com/2012/08/06/titan-provides-real-time-big-graph-data/

Neo4j has been around much longer and is therefore more "mature" in some regards:

1) Integration with an external Lucene index gives it more sophisticated search capabilities.
2) It has more integration points into other programming languages and development frameworks (e.g. Spring).
3) It has been used by more people.

Titan is a native Blueprints implementation and benefits from all the features that come with the TinkerPop stack. Titan does not support Cypher, only the Gremlin standard.

Page 9: Using Spark for Timeseries Graph Analytics ved

Introduction to the Titan graph database and how to use it with Cassandra and Spark

Page 10: Using Spark for Timeseries Graph Analytics ved

Installing Titan

Download the latest prebuilt version (0.9.0-M2) of Titan from s3.thinkaurelius.com/downloads/titan/titan-0.9.0-M2-hadoop1.zip. Carry out the following steps to ensure that Titan is installed on each node in the cluster:

Page 11: Using Spark for Timeseries Graph Analytics ved

Now, use the Linux su (switch user) command to change to the root account, and move the install to the /usr/local/ location. Change the owner and group of the install to the hadoop user, and create a symbolic link called titan so that the current Titan release can be referred to by the simplified path /usr/local/titan:

Page 12: Using Spark for Timeseries Graph Analytics ved

Titan with Cassandra

In this section, the Cassandra NoSQL database will be used as a storage mechanism for Titan. Although it does not use Hadoop, it is a large-scale, cluster-based database in its own right, and can scale to very large cluster sizes. A graph will be created, and stored in Cassandra using the Titan Gremlin shell. It will then be checked using Gremlin, and the stored data will be checked in Cassandra. The raw Titan Cassandra graph-based data will then be accessed from Spark. The first step then will be to install Cassandra on each node in the cluster.

Page 13: Using Spark for Timeseries Graph Analytics ved

Install Cassandra on all the nodes

Set up the Cassandra configuration under /etc/cassandra/conf by altering the cassandra.yaml file:


Page 14: Using Spark for Timeseries Graph Analytics ved

Log files can be found under /var/log/cassandra, and the data is stored under /var/lib/cassandra. The nodetool command can be used on any Cassandra node to check the status of the Cassandra cluster:

The Cassandra CQL shell command called cqlsh can be used to access the cluster, and create objects. The shell is invoked next, and it shows that Cassandra version 2.0.13 is installed:

Page 15: Using Spark for Timeseries Graph Analytics ved

Next, a keyspace called keyspace1 is created and used via the CQL shell:

Page 16: Using Spark for Timeseries Graph Analytics ved

The Gremlin Cassandra script

The interactive Titan Gremlin shell can be found within the bin directory of the Titan install, as shown here. Once started, it offers a Gremlin prompt:

Page 17: Using Spark for Timeseries Graph Analytics ved

The following script will be entered using the Gremlin shell. The first section of the script defines the configuration in terms of the storage (Cassandra), the port number, and the keyspace name that is to be used:

Page 18: Using Spark for Timeseries Graph Analytics ved

Next, the script uses the management system to define the generic vertex properties name and age for the graph to be created. It then commits the management system changes:

Page 19: Using Spark for Timeseries Graph Analytics ved

Now, six vertices are added to the graph. Each one is given a numeric label to represent its identity. Each vertex is given an age and name value:

Page 20: Using Spark for Timeseries Graph Analytics ved

Finally, the graph edges are added to join the vertices together. Each edge has a relationship value. Once created, the changes are committed to store them to Titan, and therefore Cassandra:
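The shape of this workflow (stage vertices and edges, commit, then look vertices up by name) can be modelled in plain Python. This is a toy stand-in, not the Gremlin script or the Titan API: the TinyGraph class, its methods, the ages, the relationship values, and every name other than Mike and Flo are hypothetical placeholders.

```python
class TinyGraph:
    """Toy property graph whose changes reach the store only on commit."""

    def __init__(self):
        # Stand-in for the storage backend.
        self.store = {"vertices": {}, "edges": []}
        self._pending_vertices = {}
        self._pending_edges = []

    def add_vertex(self, label, name, age):
        self._pending_vertices[label] = {"name": name, "age": age}

    def add_edge(self, src, dst, relationship):
        self._pending_edges.append((src, dst, relationship))

    def commit(self):
        # Validate edges against stored plus staged vertices, then persist.
        known = {**self.store["vertices"], **self._pending_vertices}
        for src, dst, _ in self._pending_edges:
            if src not in known or dst not in known:
                raise KeyError(f"edge ({src}, {dst}) references unknown vertex")
        self.store["vertices"].update(self._pending_vertices)
        self.store["edges"].extend(self._pending_edges)
        self._pending_vertices = {}
        self._pending_edges = []

    def value_map(self, name):
        """Return the properties of committed vertices with this name."""
        return [
            dict(props)
            for props in self.store["vertices"].values()
            if props["name"] == name
        ]


graph = TinyGraph()
# Six vertices with a numeric label, a name, and an age (values are made up).
people = [
    (1, "Mike", 48), (2, "Flo", 52), (3, "Person3", 30),
    (4, "Person4", 41), (5, "Person5", 27), (6, "Person6", 35),
]
for label, name, age in people:
    graph.add_vertex(label, name, age)

# Each edge carries a relationship value (also made up).
for src, dst, relationship in [
    (2, 1, "sister"), (3, 1, "friend"), (4, 2, "friend"),
    (5, 3, "colleague"), (6, 4, "colleague"),
]:
    graph.add_edge(src, dst, relationship)

visible_before_commit = graph.value_map("Mike")
graph.commit()
print(visible_before_commit)
print(graph.value_map("Mike"), graph.value_map("Flo"))
```

Nothing is visible in the store until commit is called, which mirrors why the script has to commit before the data can be seen in Cassandra.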

Page 21: Using Spark for Timeseries Graph Analytics ved

This results in a simple person-based graph, shown in the following figure:

Page 22: Using Spark for Timeseries Graph Analytics ved

This graph can then be tested in Titan via the Gremlin shell using a similar script to the previous one. Just enter the following script at the gremlin> prompt, as was shown previously. It uses the same initial six lines to create the titanGraph configuration, but it then creates a graph traversal variable g.

The graph traversal variable can be used to check the graph contents. Using the valueMap step, it is possible to search for the graph nodes called Mike and Flo. They have been successfully found here:

Page 23: Using Spark for Timeseries Graph Analytics ved

Using the Cassandra CQL shell, and the Titan keyspace, it can be seen that a number of Titan tables have been created in Cassandra:

It can also be seen that the data exists in the edgestore table within Cassandra:

This assures us that a Titan graph has been created in the Gremlin shell, and is stored in Cassandra. Now, I will try to access the data from Spark.

Page 24: Using Spark for Timeseries Graph Analytics ved

Accessing Titan with Spark

So far, Titan 0.9.0-M2 has been installed, and graphs have been created successfully using Cassandra as the backend storage option. These graphs were created using Gremlin-based scripts. In this section, a properties file will be used via a Gremlin script to process a Titan-based graph using Apache Spark, again with Cassandra as Titan's backend storage.

Page 25: Using Spark for Timeseries Graph Analytics ved

The following figure shows the architecture used in this section.

Page 26: Using Spark for Timeseries Graph Analytics ved

Let us examine a properties file that can be used to connect to Cassandra as a storage backend for Titan. It contains sections for Cassandra, Apache Spark, and the Hadoop Gremlin configuration. My Cassandra properties file is called cassandra.properties, and it looks like this:

Page 27: Using Spark for Timeseries Graph Analytics ved

Into Titan code

The following TinkerPop and Aurelius classes will be used:

Page 28: Using Spark for Timeseries Graph Analytics ved

Scala code

Page 29: Using Spark for Timeseries Graph Analytics ved

Output:

Page 30: Using Spark for Timeseries Graph Analytics ved

Thank You