Data Structure Graph DMZ #DMZone

Page 1: Data Structure Graph DMZ #DMZone

Data Structure Graphs

An overview

Presentation by @dougneedham

Page 2: Data Structure Graph DMZ #DMZone

Introduction

@dougneedham

Data Guy - Started as a DBA in the Marine Corps, evolved to Architect, now Data Scientist.

Oracle, SQL Server, Cassandra, Hadoop, MySQL, Spark.

I have a strong relational/traditional background.

Perpetual Student

Learning new things challenges our assumptions. It forces us to take a new perspective on “old” problems. Eventually it may even show us that there is a better way to solve a problem.

Page 3: Data Structure Graph DMZ #DMZone

Introducing Data Structure Graphs

Data Structure Graph Level 1 (DSG-L1) – This is roughly like an Entity Relationship Diagram (ERD). Tables are Vertices, Foreign Keys are Edges.

A DSG-L1 can show you which of your tables are likely to have the most interesting query performance.

Data Structure Graph Level 2 (DSG-L2) – Each Vertex in this graph is an application. Each Edge is data transfer. Roughly equivalent to what we used to call Data Flow diagrams.

A DSG-L2 can show you where the most work is going on in your Enterprise.

Data Structure Graph Dependency (DSG-D) – Each vertex is a job, script, program, or process that is dependent on something happening in sequence before it can do its work.

A DSG-D can show you the sequence of events that need to take place in order for something to be completed.
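To make the DSG-L1 idea concrete, here is a minimal sketch in plain Python. The table names and foreign keys are hypothetical and are not taken from any real model. Tables become vertices, each foreign key becomes a directed edge from the child table to the table it references, and the degree of a vertex is the number of foreign keys touching that table.

```python
from collections import defaultdict

# Hypothetical foreign keys as (child_table, parent_table) pairs.
foreign_keys = [
    ("order_line", "orders"),
    ("order_line", "product"),
    ("orders", "customer"),
    ("payment", "orders"),
]

def build_dsg_l1(fks):
    """Return {table: set of tables it references}; every table is a vertex."""
    graph = defaultdict(set)
    for child, parent in fks:
        graph[child].add(parent)
        graph[parent]  # touch the key so tables with no outgoing FK still appear
    return dict(graph)

def degree(graph, table):
    """Foreign keys pointing out of the table plus foreign keys pointing at it."""
    out_degree = len(graph[table])
    in_degree = sum(table in targets for targets in graph.values())
    return in_degree + out_degree

dsg_l1 = build_dsg_l1(foreign_keys)
busiest = max(dsg_l1, key=lambda table: degree(dsg_l1, table))
print(busiest, degree(dsg_l1, busiest))
```

In this toy model the table with the highest degree is the one most other tables join through, which is the kind of table a DSG-L1 is meant to surface.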

Page 4: Data Structure Graph DMZ #DMZone

Definition

A Data Structure Graph is a group of atomic entities that are related to each other, stored in a repository, then moved from one persistence layer to another, rendered as a Graph.

A group of atomic entities.

Related to each other.

Stored in a repository.

Moved from one persistence layer to another.

Rendered as a Graph.

Page 5: Data Structure Graph DMZ #DMZone

In summary: Social Network analysis applied to data modeling.

Data modeling is a topic we are all familiar with here at Data Modeling Zone.

Social Network analysis is, perhaps, something new.

So here is a little background on the topic we may not be familiar with.

Page 6: Data Structure Graph DMZ #DMZone

What is Social Network Analysis?

“Social network analysis (SNA) is a strategy for investigating social structures through the use of network and graph theories.

It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties or edges (relationships or interactions) that connect them.

Examples of social structures commonly visualized through social network analysis include

social media networks,

friendship and acquaintance networks,

kinship,

disease transmission, and

sexual relationships.

These networks are often visualized through sociograms in which nodes are represented as points and ties are represented as lines.” – Wikipedia

https://en.wikipedia.org/wiki/Social_network_analysis

Page 7: Data Structure Graph DMZ #DMZone

Example From wiki:

"Kencf0618FacebookNetwork" by Kencf0618 -Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -https://commons.wikimedia.org/wiki/File:Kencf0618FacebookNetwork.jpg#/media/File:Kencf0618FacebookNetwork.jpg

Page 8: Data Structure Graph DMZ #DMZone

A little History

The 7 Bridges of Konigsberg

Every tome on Graph theory or Network analysis devotes a small portion of its time to the 7 Bridges of Konigsberg.

If I don’t cover this with you, the gods of mathematics will strike me down, and never allow me to do analysis again in the future.

Page 9: Data Structure Graph DMZ #DMZone

The Bridges

Page 10: Data Structure Graph DMZ #DMZone

The Problem

Folks enjoyed their Sunday afternoon strolls across the bridges, and people began to wonder whether there was a route that crossed every bridge exactly once.

Eventually Leonhard Euler was brought into the debate.

Euler used Vertices to represent the land masses and edges (or arcs, at the time) to represent bridges. He realized the odd number of edges per vertex made the problem unsolvable.

Sarada Herke provides one of the best explanations of the solution in her video “Solution to Konigsberg”.

Basically the solution is that every vertex must have an even number of edges in order to make it possible to start from one vertex and arrive back at the point of origin while crossing each edge exactly once. Essentially, the number of bridges touching each land mass must be even. (More details in the above video.)
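The even-degree rule can be checked directly. The sketch below encodes the four land masses as A to D (the letters are my own labels; A is the central island) and the seven bridges as edges, then counts how many bridges touch each land mass.

```python
from collections import Counter

# The seven bridges as edges between four land masses.
# A = central island, B and C = the two river banks, D = the eastern land mass.
bridges = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"),
    ("B", "D"),
    ("C", "D"),
]

# Degree of a vertex = number of bridge ends that touch it.
degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd_vertices = sorted(v for v, d in degree.items() if d % 2 == 1)

# In a connected graph, a round trip using every edge once needs zero odd
# vertices; a one-way walk using every edge once needs zero or exactly two.
round_trip_possible = len(odd_vertices) == 0
one_way_walk_possible = len(odd_vertices) in (0, 2)

print(dict(degree), odd_vertices, round_trip_possible, one_way_walk_possible)
```

All four land masses have an odd number of bridges, so neither kind of walk exists.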

And here is the cool thing about mathematicians: if we tell you something is impossible, we have to tell you why in a way you can understand. In explaining why, Euler also invented the branch of mathematics we now call Graph Theory.

http://en.wikipedia.org/wiki/Leonhard_Euler

Page 11: Data Structure Graph DMZ #DMZone

A few terms

Stand back, we are going to talk about math!

Basically we are talking about a bunch of dots joined together by lines

Vertex – Dot on a graph

Edge – Line connecting two vertices

Edge_Label – this is a term I coined originally related to Data Structure Graphs that helps trace a path. If you label your edges, and you have multiple edges with the same label in a Graph you can quite easily identify walks, paths, and cycles through your graph.

A lot of things are networks if you look at them the right way.

Mark Newman has done a number of really cool presentations about Network analysis, available on YouTube.

https://www.youtube.com/watch?v=lETt7IcDWLI

Page 12: Data Structure Graph DMZ #DMZone

More terms

What is a path?

Shortest path – How are two vertices connected?

Longest Path – Tracing the flow of an interesting item through a large collection of applications.

Directed Graphs – or Digraphs

If you rearrange things how does the layout affect understanding?

This is not just data visualization, it can also be used for prediction. https://www.youtube.com/watch?v=rwA-y-XwjuU
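As a small illustration of a shortest path in a directed graph, here is a breadth-first search sketch. The application names and data flows are hypothetical, in the spirit of a DSG-L2 where each edge is a data transfer.

```python
from collections import deque

# Hypothetical DSG-L2: each key sends data to the applications in its list.
flows = {
    "crm": ["staging"],
    "erp": ["staging"],
    "staging": ["warehouse"],
    "warehouse": ["sales_mart", "finance_mart"],
    "sales_mart": [],
    "finance_mart": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of vertices on a shortest path,
    or None if the goal cannot be reached by following edge directions."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(flows, "crm", "sales_mart"))
print(shortest_path(flows, "sales_mart", "crm"))
```

Because the graph is directed, data can reach the mart from the CRM but there is no path back the other way.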

Page 13: Data Structure Graph DMZ #DMZone

Final terms

Centrality – Hub and Authority

This is almost a whole topic by itself, since there are different types of Centrality:

Degree Centrality, Eigenvector Centrality, PageRank, etc…

Power law.

Transitivity

Homophily – how things are similar

Contagion – How do things “spread” through a network?

Order of a graph – number of vertices

Size of the graph – number of edges
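A few of these terms are easy to compute by hand. The sketch below uses a hypothetical edge list and shows the order, the size, and a normalized degree centrality (degree divided by the number of other vertices). The other centrality measures need more machinery and are not shown.

```python
# Hypothetical edge list: (source, target).
edges = [
    ("crm", "staging"),
    ("erp", "staging"),
    ("staging", "warehouse"),
    ("warehouse", "sales_mart"),
    ("warehouse", "finance_mart"),
]

vertices = sorted({v for edge in edges for v in edge})

order = len(vertices)  # order of the graph: number of vertices
size = len(edges)      # size of the graph: number of edges

# Degree: how many edges touch each vertex, ignoring direction.
degree = {v: sum(v in edge for edge in edges) for v in vertices}

# Degree centrality: degree as a fraction of the other vertices in the graph.
degree_centrality = {v: d / (order - 1) for v, d in degree.items()}

print(order, size)
print(degree_centrality)
```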

Page 14: Data Structure Graph DMZ #DMZone

The Math doesn’t change.

One thing I like about Graphs –

The Math does not change.

The math behind Graph theory can be a little intense, but it does not change regardless of the scale of the graph.

Once you understand how to “do the math” on a small graph, the same math applies to any Graph, whether it is a graph of the people in this room or a graph of the people on this planet.

Page 15: Data Structure Graph DMZ #DMZone

Before we get to the analysis we must collect data.

DBeaver can reverse engineer an ERD.

Point it at the source system, select a few options, then you have a diagram.

I wrote a small piece of Python code to translate the resulting XML to a file suitable for import into Gephi.

One small caveat: the Foreign keys have to be defined for DBeaver to work. If the foreign keys are not defined, the output file will need to be modified.

Also, some aggregate or summary tables may not help your visualization.

This is subjective, so it is at the discretion of the person reviewing the diagram.

If you remove tables from the graph, please provide documentation such that the visualization can be compared to the reality of your data model with no discrepancies.

The URL for DBeaver is here: https://dbeaver.jkiss.org/

(This section is a little hand-wavy, I know, but the tool or method for creating the file for import into Gephi is largely irrelevant.)
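The translation script itself is not shown in these slides, so the sketch below is only a stand-in for the last step: writing an edge list that Gephi can load as an “Edges Table”. It assumes the foreign keys have already been extracted into (child, parent) pairs by whatever tool you use. The table names are hypothetical.

```python
import csv
import io

# Hypothetical foreign keys, already extracted as (child_table, parent_table).
foreign_keys = [
    ("order_line", "orders"),
    ("order_line", "product"),
    ("orders", "customer"),
]

def write_edge_table(fks, handle):
    """Write one row per foreign key under the Source and Target columns."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(fks)

# An in-memory buffer keeps the example self-contained; in practice,
# pass a file opened with open("edges.csv", "w", newline="").
buffer = io.StringIO()
write_edge_table(foreign_keys, buffer)
edge_csv = buffer.getvalue()
print(edge_csv)
```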

Page 16: Data Structure Graph DMZ #DMZone

Gephi http://gephi.github.io/

From the website: “Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.”

We are going to use data generated from my book: Data Structure Graphs.

These are inspired by my experience consulting, but do not represent an actual data model or ETL process.

The following slides are for a DSG Level 2 (ETL process).

Page 17: Data Structure Graph DMZ #DMZone

Gephi Startup

Page 18: Data Structure Graph DMZ #DMZone

New Project, Data Table, Import data.

Page 19: Data Structure Graph DMZ #DMZone

Load as “Edges Table” Source, Target (required)

Page 20: Data Structure Graph DMZ #DMZone

Choose Create Missing Nodes

Page 21: Data Structure Graph DMZ #DMZone

After a few calculations and layout runs

Page 22: Data Structure Graph DMZ #DMZone

PageRank – Which application is most important?
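For readers who want to see what is behind the PageRank number, here is a minimal power-iteration sketch in plain Python. It is not Gephi's implementation. The damping factor of 0.85 is the commonly used default, the even spreading of rank from vertices with no outgoing edges is one common convention, and the application names are hypothetical.

```python
# Hypothetical DSG-L2: each key sends data to the applications in its list.
flows = {
    "crm": ["staging"],
    "erp": ["staging"],
    "staging": ["warehouse"],
    "warehouse": ["sales_mart", "finance_mart"],
    "sales_mart": [],
    "finance_mart": [],
}

def pagerank(graph, damping=0.85, iterations=100):
    """Power iteration; every edge target must also appear as a key."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = graph[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # No outgoing edges: spread this vertex's rank over everyone.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank

ranks = pagerank(flows)
most_important = max(ranks, key=ranks.get)
print(most_important)
```

In this toy graph the warehouse comes out on top, since everything upstream funnels into it.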

Page 23: Data Structure Graph DMZ #DMZone

A few more tweaks

Page 24: Data Structure Graph DMZ #DMZone

Where is that Node with the highest PageRank?

Page 25: Data Structure Graph DMZ #DMZone

Now things get interesting:

New metrics for our data model follow.

Remember all those metrics we defined earlier?

Here are many of them:

Page 26: Data Structure Graph DMZ #DMZone

Data Table

Page 27: Data Structure Graph DMZ #DMZone

Configure Labels

Page 28: Data Structure Graph DMZ #DMZone

Labeled by degree count

Page 29: Data Structure Graph DMZ #DMZone

Change some of the coloring

Page 30: Data Structure Graph DMZ #DMZone

Visualization

Page 31: Data Structure Graph DMZ #DMZone

Export to Excel

Page 32: Data Structure Graph DMZ #DMZone

Finally, here we are.

Within a Data Architecture there are lots of moving pieces. ETL, FTP, SFTP, Web-Services, External data feeds. Data moving into Data Marts, and Data Warehouses. Data Moving between applications.

Let’s imagine how to visualize this using the information we just gained.

Page 33: Data Structure Graph DMZ #DMZone

Data Structure Graphs

Today, there are a few tools, like ERWin and SQL Developer, that begin to organize visualizations in this manner.

Very few of them allow you to perform analysis on the visualization.

As you find new tools that do this, please let me know.

I would love to evaluate those tools and see what interesting metrics they can produce.

Page 34: Data Structure Graph DMZ #DMZone

Dijkstra's algorithm

Some of you may have heard of Dijkstra’s algorithm.

It is a method for finding the shortest path between two nodes on a Graph.

This is a great optimization technique, but what if you need to find the longest path?

What “Edge_Label” has the most influence on my organization?

Iterate through each Edge_Label, create a subgraph that consists of only the nodes and edges this Edge_Label touches, then calculate the diameter of that subgraph.

The Edge_Label whose subgraph has the largest diameter has the most “impact” on your organization.

This is mostly applied to Data Structure Graph Level 2.
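The Edge_Label procedure can be sketched as follows. This is one reading of it, not a definitive implementation: edges are treated as undirected when measuring distance, and the diameter is taken as the longest shortest path found inside each label's subgraph. The labels and applications are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical labeled edges: (source, target, edge_label).
labeled_edges = [
    ("crm", "staging", "customer"),
    ("staging", "warehouse", "customer"),
    ("warehouse", "sales_mart", "customer"),
    ("erp", "staging", "invoice"),
    ("staging", "warehouse", "invoice"),
]

def diameter(edges):
    """Longest shortest path in the subgraph, ignoring edge direction."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    longest = 0
    for start in adjacency:
        # Breadth-first search gives hop counts from this start vertex.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        longest = max(longest, max(distance.values()))
    return longest

# Group the edges by label, then measure each label's subgraph.
edges_by_label = defaultdict(list)
for source, target, label in labeled_edges:
    edges_by_label[label].append((source, target))

diameters = {label: diameter(edges) for label, edges in edges_by_label.items()}
most_impact = max(diameters, key=diameters.get)
print(diameters, most_impact)
```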

Page 35: Data Structure Graph DMZ #DMZone

Now let’s answer some questions.

Which tables are “most important” to import when building a data warehouse?

The tables with the higher centrality measures.

For an operational system these will also be the tables that have the most queries written against them.

These will be your bottlenecks for any system.

Is this data model optimized for reading or writing?

What is the density of the data model?

A higher-density model is optimized for writing; a lower-density model is optimized for reading.
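Density is the fraction of possible edges that actually exist. A minimal sketch, assuming a directed graph without self-loops unless stated otherwise (the table and foreign key counts are made up):

```python
def density(vertex_count, edge_count, directed=True):
    """Edges present divided by edges possible (no self-loops)."""
    if vertex_count < 2:
        return 0.0
    possible = vertex_count * (vertex_count - 1)
    if not directed:
        possible //= 2
    return edge_count / possible

# A hypothetical model with 5 tables and 4 foreign keys.
print(density(5, 4))
```

Where the line between “high” and “low” density falls for a given data model is a judgment call; the number is most useful when comparing models against each other.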

Page 36: Data Structure Graph DMZ #DMZone

Barabasi-Albert model and Scale free networks.

Preferential attachment.

There are a few different models available for analysis and prediction of networks.

A Barabasi-Albert model can be summarized as a “rich get richer” model. In other words, the more connected a node is, the more likely it is that newly added nodes will connect to it.

This sounds suspiciously similar to our data modeling concepts related to conformed dimensions.

My suspicion is there are many data models that fit this model.

Please send me some anonymized data models. I want to research this more.
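Preferential attachment is simple to simulate. The sketch below is a simplified version of the idea rather than a reference implementation: each new node attaches to m distinct existing nodes, chosen with probability proportional to their current degree. The parameter values are arbitrary.

```python
import random
from collections import Counter

def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node attaches to m existing nodes."""
    rng = random.Random(seed)
    # Start from a small complete graph on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in the pool once per edge end, so a uniform draw
    # from the pool picks a node in proportion to its degree.
    pool = [v for edge in edges for v in edge]
    for new_node in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(pool))
        for target in sorted(targets):
            edges.append((new_node, target))
            pool.extend((new_node, target))
    return edges

ba_edges = preferential_attachment(n=200, m=2)
ba_degree = Counter(v for edge in ba_edges for v in edge)
print(ba_degree.most_common(5))
```

The handful of nodes printed at the end are the hubs. In a dimensional model, the analogous vertices would be the conformed dimensions that every new fact table joins to.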

Page 37: Data Structure Graph DMZ #DMZone

Some theoretical thoughts.

Let’s assume we have an equation for the growth of every table we have collected from our little topological study above (more on this in a couple slides).

Let us further assume we have a graph of the same tables.

Can you do anything interesting with this?

The derivative of each equation shows us the growth rate of the table.

What happens if we plug that derivative in the entropy equation for the graph?

What would this represent?

Could this be considered a valuation method?

A way to put a dollar value on a data model?

If you try it, let me know what you find out.

Page 38: Data Structure Graph DMZ #DMZone

Apply the theory.

Using a few metrics from each table we can do some clustering.

Take the number of columns of a table, the centrality measure, and the growth rate, and you have a vector for each table.

Doing some simple cosine similarity on these vectors will tell you mathematically which tables are similar.

Is this finding consistent with expectations?

If not should the model be adjusted?

What does this result say to you?
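Cosine similarity on these per-table vectors can be sketched as below. The table names and numbers are made up. Note that the three metrics sit on very different scales, so in practice you would probably want to normalize each metric first; that step is left out here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical vectors: (number of columns, centrality, growth rate per day).
tables = {
    "orders": (12, 0.75, 140.0),
    "order_line": (8, 0.50, 410.0),
    "customer": (25, 0.25, 3.0),
}

names = sorted(tables)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = cosine_similarity(tables[first], tables[second])
        print(first, second, round(score, 3))
```

A score near 1 means two tables point in the same direction in this metric space; a score near 0 means they have little in common.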

Page 39: Data Structure Graph DMZ #DMZone

Deriving the growth rate of each table.

Little R demonstration to follow.

A design methodology like the data vault mandates that every table have date timestamps for when the data is loaded.

Collect how many records are loaded per day.

A calculation that represents the growth formula for each table can be derived with R.
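The talk does this step in R. As a stand-in, here is the same idea in plain Python under the simplest possible assumption: table size grows linearly, so a least-squares line through the cumulative row counts gives a growth formula whose derivative is the slope. The daily counts are invented.

```python
from itertools import accumulate

def fit_line(ys):
    """Least-squares fit of y = intercept + slope * day for day = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical rows loaded per day, counted from the load timestamps.
daily_loads = [120, 135, 128, 150, 142, 160, 155]

# Cumulative sum = size of the table at the end of each day.
table_size = list(accumulate(daily_loads))

growth_rate, start_size = fit_line(table_size)
print(round(growth_rate, 1), round(start_size, 1))
```

If a table's growth is clearly not linear, a different curve would be a better fit; the slope here is only the first approximation.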

Using the growth rate, centrality, and the width of a table (number of columns) you can do cosine similarity to determine the tables that are mathematically similar to each other.

Using this information you may be able to reallocate the infrastructure that the data warehouse sits on.

Is every table stored on the same disk storage media? Does it need to be?

How about caching? Using these metrics alone you can make a well informed decision about your storage platform.

The following image is a small topological representation of this process.

This is still slightly theoretical, and I welcome having a conversation with anyone that may want to know more.

Again, send me anonymized data. Hopefully along with the Data Structure Graph you generated from your data.

Page 40: Data Structure Graph DMZ #DMZone

This is what the topology may look like.

Page 41: Data Structure Graph DMZ #DMZone

Consider the following:

If you need assistance, contact me directly (I am easy to find @dougneedham)

Network/Graph Analysis is cool.

It can show you some interesting things about your data that you may not have considered.

Page 42: Data Structure Graph DMZ #DMZone

What did I leave out?

Graphs that change over time – What happens when you remove a single Edge or Vertex?

Comparing two networks – If you have the same number of edges and nodes, are two graphs the same?

Contagion – How will data spread through the network? (Since a DSG represents different types of Edges based on Edge_Label, Contagion should not affect the entire network.) This is also commonly known as data lineage. If you don’t have a tool that does it, with a bit of metadata management this can be derived from a Data Structure Graph Level 2.

Page 43: Data Structure Graph DMZ #DMZone

Other Analysis

What else can be done with Social Network Analysis?

How about risk exposure to banks?

http://www.federalreserve.gov/newsevents/speech/yellen20130104a.htm

Page 44: Data Structure Graph DMZ #DMZone

A little history

Page 45: Data Structure Graph DMZ #DMZone

One other cool bit of Math

How many reports can your dimensional data model support?

Do you have the situation where people want to create a project out of a report, rather than do a proper data model design up front?

Here is some help.

The upper bound of the total number of reports that a conformed dimension data model can support is calculated by:

Calculate the number of selectable columns in each dimension ($2^c - 1$)

Create the adjacency matrix for the dimensions to facts

A bit of multiplication.

More details here: http://bit.ly/MeasuringDimensionalModels
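The first step can be checked on a small case. This sketch assumes that c is the number of columns in a dimension and that the count is the number of non-empty column combinations a report could select, which is $2^c - 1$. The adjacency matrix and multiplication steps are described in the linked article and are not reproduced here. The column names are hypothetical.

```python
from itertools import combinations

def selectable_column_sets(column_count):
    """Number of non-empty subsets of a dimension's columns."""
    return 2 ** column_count - 1

def enumerate_column_sets(columns):
    """List every non-empty combination, to confirm the formula by brute force."""
    return [
        combo
        for r in range(1, len(columns) + 1)
        for combo in combinations(columns, r)
    ]

date_columns = ["year", "month", "day"]
print(selectable_column_sets(len(date_columns)))
print(enumerate_column_sets(date_columns))
```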

Page 46: Data Structure Graph DMZ #DMZone

Graphs are Cool!

Help me.

Please send me anonymized data.

To present more about how the mathematics of Graph theory and social network analysis can be applied to data modeling, I need more data.

This is a fascinating topic. If you want to reach out to me directly, I can be reached at: [email protected]

Here is my GitHub for the code and data from the book, and examples: http://bit.ly/DataStructureGraph_github

Page 47: Data Structure Graph DMZ #DMZone

https://dougneedham.shinyapps.io/DataStructureGraph

Hard to see, I know, but the top diagram is the “master graph” and the bottom image is a single Edge_Label. You can see how an individual data entity flows through an organization.

Page 48: Data Structure Graph DMZ #DMZone

My book

Goes through a number of examples of doing a Graph analysis of a fictional organization.

Page 49: Data Structure Graph DMZ #DMZone

Final Thoughts – Questions?