Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data

ROBERT HRYNIEWICZ – DATA SCIENTIST/EVANGELIST, HORTONWORKS
RACHEL POULSEN – DATA SCIENCE DIRECTOR, TIVO

Overview

• Overview of business problem
   • Introduction to TiVo and TiVo data
   • Challenges with using data to optimize UI navigation
• Solution
• Demo
• Next Steps
• Questions

TiVo: Background and context

• TiVo is a discovery platform for integrated entertainment
• Multiple ways to find content
• TiVo Roamio Demo Video
• TiVo collects ~2 million logs a day from their boxes
• User action events, TiVo action events, inventory events
• These events are “memory-less” and don’t know what happened prior to the event

Motivation

TiVo Business Initiatives
• Help users get to content they want faster
• Help users discover new content more easily
• Is feature X important to discovering and getting to content? (e.g. Is the guide still used to find content?)

Challenge
• Measuring a KPI for initiatives in a log stream that is “memory-less”
• Identifying events or a pattern of events that impacts the KPIs

Data Objective – “Path Analysis”
• Analysis that answers questions around the navigational paths users take to get to or from defined start and/or end points

Architecture Challenges

• Traditional data platform that was sample-based and SQL-based
• Relational databases

Technical Challenges

Relational Database Challenges
• Little flexibility in “click path” definition
• Decisions about defining “paths” are made during the processing step
• Many business assumptions have to be made with little insight

Solution

Solution

Graph Database (Neo4j)
• Relationships are first-class citizens
• Simple abstractions
• Enable sophisticated models
• “Path Analysis”

Prototype: One Day Graph Info

Edges: 57K relationships
Nodes:
• 135 UI or “screen” nodes
• 12 “watch content” nodes

Size reduction (2K times)

• 70 GB log data → 35 MB Neo4j DB (all nodes & edges)
• 1 day: Oct 1, 2015

[Figure: LOGS (70+ GB) reduced to UI nodes, watch nodes, and edges (35 MB)]
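The “2K times” figure follows directly from the two sizes on this slide. A quick check, assuming decimal units (the slide does not say whether GB and MB are decimal or binary):

```python
# Assumption: decimal units (1 GB = 10**9 bytes, 1 MB = 10**6 bytes).
log_bytes = 70 * 10**9     # one day of raw logs (70 GB)
graph_bytes = 35 * 10**6   # resulting Neo4j database (35 MB)

reduction = log_bytes / graph_bytes
print(f"{reduction:.0f}x smaller")
```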

Screens, Transitions, and Content

• Screen events
• Remote button press events
• Watch content events

Sample Graph

[Figure: sample graph with nodes Live Movie, My Shows, and TiVo Central (Home), connected by “Switched to” edges]

Architecture Overview

What’s captured in the graph?

Node (UI)
• Name
• Timestamp

Node (Watch)
• Type, e.g. Recorded
• Genre, e.g. TV Show
• Timestamp

Edge
• Average Time
• Total number of keys pressed
• Key sequence, e.g. Home Up/Down Select/Play
• Total number of times path taken
• Unique number of users taking this path

What’s captured in the graph?

Node: TiVo Central (1/1/2016)

Edge:
• 0.4s average time
• 3 keys pressed
• Home-Down-Select
• 50 times path used
• 27 unique users

Node: Live Movie (1/1/2016)
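The node and edge properties listed on these two slides could be modeled roughly as follows. This is only a sketch: the class and field names are hypothetical, and the Watch node values for "Live Movie" are guessed from its label. The edge values are the ones shown in the example above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class UINode:
    name: str        # screen name, e.g. "TiVo Central"
    timestamp: str   # day the node belongs to, e.g. "1/1/2016"


@dataclass(frozen=True)
class WatchNode:
    type: str        # e.g. "Recorded"
    genre: str       # e.g. "TV Show"
    timestamp: str


@dataclass(frozen=True)
class Edge:
    avg_time_s: float               # average transition time in seconds
    keys_pressed: int               # total number of keys pressed
    key_sequence: Tuple[str, ...]   # ordered remote keys
    times_taken: int                # total number of times the path was taken
    unique_users: int               # distinct users who took the path


# The example from the slide: TiVo Central -> Live Movie.
# "Live"/"Movie" as type/genre is an assumption based on the node label.
src = UINode(name="TiVo Central", timestamp="1/1/2016")
dst = WatchNode(type="Live", genre="Movie", timestamp="1/1/2016")
edge = Edge(
    avg_time_s=0.4,
    keys_pressed=3,
    key_sequence=("Home", "Down", "Select"),
    times_taken=50,
    unique_users=27,
)
```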

Raw Log File example

…
1444809715713072|Watch|live|WBINDT|MV|506|EP019641150097...
1444809715812909|Key|HOME
1444880816123454|UI|TivoCentralScreen
1444809716234553|Key|DOWN
1444809716354363|Key|SELECT
1444809716518701|Trick3|PLAY|116|1|100|-1
1444809719888072|Key|PLAY
1444809719889072|Trick3|PLAY|119|1|100|-1
1444809726966880|Watch|rec|WFXTDT|SH|508|...
…
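A minimal sketch of how such pipe-delimited lines might be parsed and filtered down to Watch, Key, and UI events. It is written in plain Python for readability rather than Spark, and it rests on assumptions: the first field is treated as an integer timestamp (it looks like epoch microseconds), the second as the event type, and everything after that is left uninterpreted. `LogEvent`, `parse_line`, and `filter_events` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple


class LogEvent(NamedTuple):
    timestamp: int      # leading integer field; assumed to be epoch microseconds
    kind: str           # e.g. "Watch", "Key", "UI", "Trick3"
    fields: List[str]   # remaining pipe-delimited fields, not interpreted here


def parse_line(line: str) -> LogEvent:
    parts = line.strip().split("|")
    return LogEvent(int(parts[0]), parts[1], parts[2:])


# Event types kept for the graph; everything else (e.g. Trick3) is dropped.
KEEP = {"Watch", "Key", "UI"}


def filter_events(lines: Iterable[str]) -> List[LogEvent]:
    events = (parse_line(line) for line in lines if line.strip())
    return [ev for ev in events if ev.kind in KEEP]


sample = [
    "1444809715812909|Key|HOME",
    "1444809716234553|Key|DOWN",
    "1444809716354363|Key|SELECT",
    "1444809716518701|Trick3|PLAY|116|1|100|-1",
    "1444809719888072|Key|PLAY",
]
kept = filter_events(sample)
print([ev.fields[0] for ev in kept])
```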

Filtered Log

...
Watch: LIVE MOVIE (Node)
Key: HOME (Edge)
UI: TIVO HOME (Node)
Key: DOWN (Edge)
Key: SELECT (Edge)
Key: PLAY (Edge)
Watch: REC SHOW (Node)
…

Same day

Algorithm Overview (1 of 2)

1. Filter for desired events
   • Remove non-Screen, non-Watch, non-Key events
2. Session-ize and order logs to reflect Screen/Watch/Edge events
3. Define display for Key Press events - two formats
   • Normal: SELECT & UP x 2 & GUIDE & SELECT
   • Compact: TIVO & 9 KEYS & SELECT
4. Generate an Edge if transition < max time set by stakeholders (e.g. 5 min)
   • For all logs find the following sequence:
     Node X - timestamp x (start time)
     key A
     key B
     key C
     Node Y - timestamp y (end time)
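Steps 3 and 4 can be sketched in plain Python as follows. This is an illustration under assumptions, not the Spark job from the talk. Events are assumed to be already filtered, ordered, and limited to a single session. Timestamps are assumed to be in microseconds. The compact format is assumed to mean "first key, number of keys in between, last key". All names (`Event`, `Transition`, `normal_format`, `compact_format`, `transitions`) are hypothetical.

```python
from itertools import groupby
from typing import Iterable, List, NamedTuple, Optional, Sequence, Tuple


class Event(NamedTuple):
    ts_us: int   # timestamp, assumed to be microseconds
    kind: str    # "UI" or "Watch" (nodes), or "Key"
    label: str   # screen name, watch description, or key name


class Transition(NamedTuple):
    src: str
    dst: str
    keys: Tuple[str, ...]
    seconds: float


def normal_format(keys: Sequence[str]) -> str:
    """Collapse consecutive repeats, e.g. UP, UP -> 'UP x 2'."""
    parts = []
    for key, run in groupby(keys):
        n = len(list(run))
        parts.append(key if n == 1 else f"{key} x {n}")
    return " & ".join(parts)


def compact_format(keys: Sequence[str]) -> str:
    """First key, count of keys in between, last key (assumed meaning)."""
    if len(keys) <= 2:
        return " & ".join(keys)
    return f"{keys[0]} & {len(keys) - 2} KEYS & {keys[-1]}"


def transitions(events: Iterable[Event], max_seconds: float = 300.0) -> List[Transition]:
    """Emit Node X -> keys -> Node Y when the elapsed time is under max_seconds."""
    out: List[Transition] = []
    prev: Optional[Event] = None
    keys: List[str] = []
    for ev in events:
        if ev.kind == "Key":
            keys.append(ev.label)
            continue
        # ev is a node (UI or Watch): close the pending transition, if any.
        if prev is not None:
            elapsed = (ev.ts_us - prev.ts_us) / 1e6
            if elapsed < max_seconds:
                out.append(Transition(prev.label, ev.label, tuple(keys), elapsed))
        prev, keys = ev, []
    return out


# Event labels follow the filtered log example; the timestamps are made up.
session = [
    Event(0, "Watch", "LIVE MOVIE"),
    Event(100_000, "Key", "HOME"),
    Event(400_000, "UI", "TIVO HOME"),
    Event(500_000, "Key", "DOWN"),
    Event(600_000, "Key", "SELECT"),
    Event(700_000, "Key", "PLAY"),
    Event(1_400_000, "Watch", "REC SHOW"),
]
edges = transitions(session)
```

Keeping the raw key tuple on each transition and formatting it only for display lets the same data drive both the normal and compact views.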

Algorithm Overview (2 of 2)

5. For each unique node-edge-node calculate:
   1. Average transition time
   2. Number of transitions
   3. Number of unique transitions
   4. Number of keys pressed
   5. Key sequence (normal or compact)
6. Export results to CSV files
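A rough sketch of steps 5 and 6, again in plain Python rather than Spark SQL. Assumptions: each input row carries a user identifier (not visible in the sample log), "number of unique transitions" is read as the number of distinct users on the path, and the CSV column names are invented for illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (user_id, source node, destination node, key sequence, seconds)
Row = Tuple[str, str, str, Tuple[str, ...], float]

COLUMNS = ["src", "dst", "key_sequence", "avg_time_s",
           "transitions", "unique_users", "keys_pressed"]


def aggregate(rows: Iterable[Row]) -> List[Dict[str, object]]:
    """Group by unique node-edge-node and compute the per-path statistics."""
    groups = defaultdict(list)
    for user, src, dst, keys, secs in rows:
        groups[(src, keys, dst)].append((user, secs))
    result = []
    for (src, keys, dst), hits in sorted(groups.items()):
        result.append({
            "src": src,
            "dst": dst,
            "key_sequence": " & ".join(keys),
            "avg_time_s": sum(s for _, s in hits) / len(hits),
            "transitions": len(hits),
            "unique_users": len({u for u, _ in hits}),
            "keys_pressed": len(keys),
        })
    return result


def to_csv(records: List[Dict[str, object]]) -> str:
    """Serialize the aggregated paths to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Made-up rows for illustration only; user ids and times are not TiVo data.
path = ("HOME", "DOWN", "SELECT")
rows = [
    ("u1", "TiVo Central", "Live Movie", path, 0.3),
    ("u2", "TiVo Central", "Live Movie", path, 0.5),
    ("u1", "TiVo Central", "Live Movie", path, 0.4),
]
records = aggregate(rows)
csv_text = to_csv(records)
```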

DEMO

• What is the most popular path people take to get to content? Live vs. recorded
• What percent of total paths are most popular?
• What path is most popular? Overall? Unique?
• What app is most popular?
• What percent of total paths involve the Guide screen?
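The demo itself ran against the Neo4j graph. As a stand-in, two of these questions can be sketched in plain Python over aggregated path records. The record layout and the counts below are made up for illustration and are not TiVo data. `most_popular` and `share_involving` are hypothetical helpers.

```python
from typing import Dict, List

PathRecord = Dict[str, object]


def most_popular(paths: List[PathRecord], by: str = "transitions") -> PathRecord:
    """Most popular path overall (by='transitions') or by unique users."""
    return max(paths, key=lambda p: p[by])


def share_involving(paths: List[PathRecord], screen: str) -> float:
    """Percent of all transitions that start or end at the given screen."""
    total = sum(p["transitions"] for p in paths)
    hits = sum(p["transitions"] for p in paths if screen in (p["src"], p["dst"]))
    return 100.0 * hits / total if total else 0.0


# Illustrative counts only.
paths = [
    {"src": "Guide", "dst": "Live", "transitions": 30, "unique_users": 10},
    {"src": "My Shows", "dst": "Recorded", "transitions": 50, "unique_users": 5},
    {"src": "TiVo Central", "dst": "Guide", "transitions": 20, "unique_users": 12},
]
overall = most_popular(paths)
by_users = most_popular(paths, by="unique_users")
guide_share = share_involving(paths, "Guide")
```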

Business Advantages

• Measure KPIs for time to content and content discovery
• Optimize KPIs (understanding user behavior that impacts the KPIs)
• Enhance A/B Testing by helping to answer “why?”
• Simplify user experience across products
• Increase engagement with new content
• Understand feature usage interactions not only as a mutually exclusive experience

Future Work

• Deploy to production -- multi-day queries
• Add relationships and nodes for feature usage
• Classify paths (“discovery” or “known destination”)
• Exploratory analysis

Thanks! @RobHryniewicz @Bayesbabe